Article: The art of writing Linux utilities
Feb 13, 2004 — by LinuxDevices Staff — from the LinuxDevices Archive
Linux is famous for coming with a large toolbox and good ways to integrate tools. Peter Seebach discusses how new tools are developed and how to make a one-off program into a utility you'll be using for years to come.
Developing small, useful command-line tools
Linux and other UNIX-like systems have always come with a broad variety of tools that perform functions ranging from the obvious to the arcane. The success of UNIX-like programming environments comes largely from the quality and selection of tools, and the ease with which they can be joined together.
As a developer, you may have found that existing utilities don't always solve your problem. While you can solve many problems easily by stringing together existing utilities, solving other problems requires at least some amount of real programming. These latter tasks are often candidates for creating a new utility that, when combined with existing utilities, will solve the problem with a minimum of effort. This article looks at the qualities that make for a good utility and the design process that goes into it.
There is a wonderful discussion of what makes a good utility in The UNIX Programming Environment, by Kernighan & Pike. A good utility is one that does its job as well as possible. It has to play well with others; it has to be amenable to being combined with other utilities. A program that doesn't combine with others isn't a utility; it's an application.
Utilities are supposed to let you build one-off applications cheaply and easily from the materials at hand. A lot of people think of them as being like tools in a toolbox. The goal is not to have a single widget that does everything, but to have a handful of tools, each of which does one thing as well as possible.
Some utilities are reasonably useful on their own, whereas others imply cooperation in pipelines of utilities. Examples of the former include sort and grep. On the other hand, xargs is rarely used except with other utilities, most often find.
Most of the UNIX system utilities are written in C, simply because they're heavily used enough to justify the development cost; the examples here are in Perl and sh. Use the right tool for the right job: if you use a utility heavily enough, the cost of writing it in a compiled language might be justified by the performance gain. On the other hand, for the fairly common case where a program's workload is light, a scripting language may offer faster development. If you aren't sure, use the language you know best; at least when you're prototyping a utility, or figuring out how useful it is, favor programmer efficiency over performance tuning. Perl and sh (or ksh) can be good languages for a quick prototype. Utilities that tie other programs together may be easier to write in a shell than in a more conventional programming language. On the other hand, any time you want to interact with raw bytes, C is probably looming on your horizon.
A good rule of thumb is to start thinking about the design of a utility the second time you have to solve a problem. Don't mourn the one-off hack you write the first time; think of it as a prototype. The second time, compare what you need to do with what you needed to do the first time. Around the third time, you should start thinking about taking the time to write a general utility. Even a merely repetitive task might merit the development of a utility; for instance, many generalized file-renaming programs have been written based on the frustration of trying to rename files in a generalized way.
Here are some design goals of utilities; each gets its own section, below.
- Do one thing well.
- Be a filter.
- Generalize.
- Be robust.
- Be new.
Do one thing well; don't do multiple things badly. The best example of doing one thing well is probably sort. No utilities other than sort have a sort feature. The idea is simple: if you only solve a problem once, you can take the time to do it well.
Imagine how frustrating it would be if most programs sorted data, but some supported only lexicographic sorts, while others supported only numeric sorts, and a few even supported selection of keys rather than sorting by whole lines. It would be annoying at best.
When you find a problem to solve, try to break the problem up into parts, and don't duplicate the parts for which utilities already exist. The more you can focus on a tool that lets you work with existing tools, the better the chances that your utility will stay useful.
You may need to write more than one program. The best way to solve a specialized task is often to write one or two utilities and a bit of glue to tie them together, rather than writing a single program to solve the whole thing. It's fine to use a 20-line shell script to tie your new utility together with existing tools. If you try to solve the whole problem at once, the first change that comes along might require you to rethink everything.
I have occasionally needed to produce two-column or three-column output from a database. It is generally more efficient to write a program to build the output in a single column and then glue it to a program that puts things in columns. The shell script that combines these two utilities is itself a throwaway; the separate utilities have outlived it.
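The glue in question can be trivial. Here is a sketch of such a throwaway script (mkreport is a hypothetical stand-in for whatever program emits the single-column data; rs is the BSD columnizing utility mentioned later in this article):

#!/bin/sh
# throwaway glue: generate one item per line, then fold into three columns
mkreport "$@" | rs 0 3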
Some utilities serve very specialized needs. If the output of ls in a crowded directory scrolls off the screen very quickly, it might be because there's a file with a very long name, forcing ls to use only a single column for output. Paging through it using more takes time. Why not just sort lines by length, and pipe the result through tail, as follows?
Listing 1. One of the smallest utilities anywhere, sl
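A minimal Perl script fitting this description (no options; it simply sorts its input lines from shortest to longest):

#!/usr/bin/perl -w
# sl: print input lines sorted by length; <> reads stdin or named files
print sort { length($a) <=> length($b) } <>;

For instance, assuming the script is installed as sl, ls | sl | tail -3 prints the three longest file names in the current directory.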
The script in Listing 1 does exactly one thing. It takes no options, because it needs no options; it only cares about the length of lines. Thanks to Perl's convenient <> idiom, this automatically works either on standard input or on files named on the command line.
Almost all utilities are best conceived of as filters, although a few very useful utilities don't fit this model. (For instance, a program that counts might be very useful, even though it doesn't work well as a filter. Programs that take only command-line arguments as input, and produce potentially complicated output, can be very useful.) Most utilities, though, should work as filters. By convention, filters work on lines of text. Most filters should have some support for running on multiple input files.
Remember that a utility needs to work on the command line and in scripts. Sometimes, the ideal behavior varies a little. For instance, most versions of ls automatically sort output into columns when writing to a terminal. The default behavior of grep is to print the file name in which a match was found only if multiple files were specified. Such differences should have to do with how users will want the utility to work, not with other agendas. For instance, old versions of GNU bc displayed an intrusive copyright notice when started. Please don't do that. Make your utility stick to doing its job.
Utilities like to live in pipelines. A pipeline lets a utility focus on doing its job, and nothing else. To live in a pipeline, a utility needs to read data from standard input and write data to standard output. If you want to deal with records, it's best if you can make each line be a “record.” Existing programs such as sort and join are already thinking that way. They'll thank you for it.
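In Perl, the conventional shape of such a filter takes only a few lines. This sketch passes records through unchanged; a real utility would transform $_ in the middle:

#!/usr/bin/perl -w
# skeleton of a line-oriented filter: one line in, one record out
while (<>) {
    chomp;          # strip the newline so the record is just the data
    # ... transform $_ here ...
    print "$_\n";
}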
One utility I occasionally use is a program that calls other programs iteratively over a tree of files. This makes very good use of the standard UNIX utility filter model, but it only works with utilities that read input and write output; you can't use it with utilities that operate in place, or take input and output file names.
Most programs that can run from standard input can also reasonably be run on a single file, or possibly on a group of files. Note that this arguably violates the rule against duplicating effort; obviously, this could be managed by feeding cat into the next program in the series. However, in practice, it seems to be justified.
Some programs may legitimately read records in one format but produce something entirely different. An example would be a utility to put material into columnar form. Such a utility might equate lines to records on input, but produce multiple records per line on output.
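A sketch of such a columnizer, reading one record per input line and emitting three records per output line (the column count is hard-wired only to keep the sketch short):

#!/usr/bin/perl -w
# fold single-column input into three tab-separated columns
my @records;
while (<>) {
    chomp;
    push @records, $_;
}
for (my $i = 0; $i < @records; $i += 3) {
    print join("\t", grep { defined } @records[$i .. $i + 2]), "\n";
}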
Not every utility fits entirely into this model. For instance, xargs takes not records but names of files as input, and all of the actual processing is done by some other program.
Try to think of tasks similar to the one you're actually performing; if you can find a general description of these tasks, it may be best to try to write a utility that fits that description. For instance, if you find yourself sorting text lexicographically one day and numerically another day, it might make sense to consider attempting a general sort utility.
Generalizing functionality sometimes leads to the discovery that what seemed like a single utility is really two utilities used in concert. That's fine. Two well-defined utilities can be easier to write than one ugly or complicated one.
Doing one thing well doesn't mean doing exactly one thing. It means handling a consistent but useful problem space. Lots of people use grep. However, a great deal of its utility comes from the ability to perform related tasks. The various options to grep do the work of a handful of small utilities that would have ended up sharing, or duplicating, a lot of code.
This rule, and the rule to do one thing, are both corollaries of an underlying principle: avoid duplication of code whenever possible. If you write a half-dozen programs, each of which sorts lines, you can end up having to fix similar bugs half a dozen times, instead of having one better-maintained sort program to work on.
This is the part of writing a utility that adds the most work to the process of getting it completed. You may not have time to generalize something fully at first, but it pays off when you get to keep using the utility.
Sometimes, it's very useful to add related functionality to a program, even when it's not quite the same task. For instance, a program to pretty-print raw binary data might be more useful if, when run on a terminal device, it threw the terminal into raw mode. This makes it a lot easier to test questions involving keymaps, new keyboards, and the like. Not sure why you're getting tildes when you hit the delete key? This is an easy way to find out what's really getting sent. It's not exactly the same task, but it's similar enough to be a likely addition.
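A sketch of that trick in Perl, shelling out to stty(1) rather than making raw termios calls; note that raw mode means Ctrl-C no longer sends a signal, so the loop checks for the 0x03 byte itself:

#!/usr/bin/perl -w
# show exactly which bytes each keypress sends
system("stty", "raw", "-echo");       # put the terminal into raw mode
while (sysread(STDIN, my $byte, 1)) {
    my $n = ord($byte);
    last if $n == 3;                  # 0x03 is Ctrl-C: time to quit
    # raw mode doesn't translate newlines, so emit \r\n explicitly
    printf "%02x (%s)\r\n", $n, ($n >= 32 && $n < 127) ? $byte : ".";
}
system("stty", "sane");               # restore the terminal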
The errno utility in Listing 2 below is a good example of generalizing, as it supports both numeric and symbolic names.
It's important that a utility be durable. A utility that crashes easily or can't handle real data is not a useful utility. Utilities should handle arbitrarily long lines, huge files, and so on. It is perhaps tolerable for a utility to fail on a data set larger than it can hold in memory, but some utilities manage better; for instance, sort, by using temporary files, can generally sort data sets much larger than it can hold in memory.
Try to make sure you've figured out what data your utility can possibly run on. Don't just ignore the possibility of data you can't handle. Check for it and diagnose it. The more specific your error messages, the more helpful you are being to your users. Try to give the user enough information to know what happened and how to fix it. When processing data files, try to identify exactly what the malformed data was. When trying to parse a number, don't just give up; tell the user what you got, and if possible, what line of the input stream the data was on.
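In Perl, the material for a good diagnostic is already at hand: $ARGV names the current input file and $. holds the current line number. A sketch for a filter that expects one integer per line:

#!/usr/bin/perl -w
# complain precisely about bad input instead of guessing or dying
while (<>) {
    chomp;
    unless (/^-?\d+$/) {
        warn "$0: $ARGV, line $.: expected an integer, got \"$_\"\n";
        next;
    }
    print $_ + 1, "\n";   # stand-in for the real work
} continue {
    close ARGV if eof;    # reset $. at the end of each named file
}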
As a good example, consider the difference between two implementations of dc. If you run dc /home, one of them says “Cannot use directory as input!” The other just returns silently; no error message, no unusual exit code. Which of these would you rather have in your path when you make a typo on a cd command? Similarly, the former will give verbose error messages if you feed it a stream of data from a directory, perhaps by doing dc < /home. On the other hand, it might be nice for it to give up early when getting invalid data.
Security holes are often rooted in a program that isn't robust in the face of unexpected data. Keep in mind that a good utility might find its way into a shell script run as root. A buffer overflow in a program such as find is likely to be a risk to a great number of systems.
The better a program deals with unexpected data, the more likely it is to adapt well to varied circumstances. Often, trying to make a program more robust leads to a better understanding of its role, and better generalizations of it.
One of the worst kinds of utility to write is the one you already have. I wrote a wonderful utility called count. It allowed me to perform just about any counting task. It's a great utility, but there's a standard BSD utility called jot that does the same thing. Likewise, my very clever program for turning data into columns duplicates an existing utility, rs, also found on BSD systems, except that rs is much more flexible and better designed. See Resources below for more information on jot and rs.
If you're about to start writing a utility, take a bit of time to browse around a few systems to see if there might be one already. Don't be afraid to steal Linux utilities for use on BSD, or BSD utilities for use on Linux; one of the joys of utility code is that almost all utilities are quite portable.
Don't forget to look at the possibility of combining existing applications to make a utility. It is possible, in theory, that you'll find stringing existing programs together is not fast enough, but it's very rare that writing a new utility is faster than waiting for a slightly slow pipeline.
As a case study, consider the errno utility mentioned above. In a sense, this program is a counterexample, in that it is never useful as a filter; it works very well as a command-line utility, however.
This program does one thing only. It prints out errno lines from /usr/include/sys/errno.h in a slightly pretty-printed format. For instance:
$ errno 22
EINVAL [22]: Invalid argument
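A minimal Perl sketch along these lines, assuming BSD-style header entries of the form #define EINVAL 22 /* Invalid argument */ (the path and layout differ on other systems):

#!/usr/bin/perl -w
# errno: look up error codes by number or by symbolic name
use strict;

my $header = "/usr/include/sys/errno.h";
open(my $fh, "<", $header) or die "$0: cannot open $header: $!\n";
my (%by_name, %by_number);
while (<$fh>) {
    # match lines like: #define EINVAL 22 /* Invalid argument */
    next unless m{^\s*#\s*define\s+(E\w+)\s+(\d+)\s*/\*\s*(.*?)\s*\*/};
    my $entry = "$1 [$2]: $3";
    $by_name{$1} = $by_number{$2} = $entry;
}
close($fh);

for my $arg (@ARGV) {
    my $entry = ($arg =~ /^\d+$/) ? $by_number{$arg} : $by_name{uc $arg};
    print "$entry\n" if defined $entry;   # silent when nothing matches
}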
Does it generalize? Yes, nicely. It supports both numeric and symbolic names. On the other hand, it doesn't know about other files, such as /usr/include/sys/signal.h, that are likely in the same format. It could easily be extended to do that, but for a convenience utility like this, it's easier to just make a copy called "signal" that reads signal.h, and uses "SIG*" as the pattern to match a name.
This is just a tad more convenient than using grep on system header files, but it's less error-prone: it doesn't produce garbled results from ill-considered arguments. On the other hand, it produces no diagnostic if a given name or number is not found in the header, and it doesn't bother to correct some invalid inputs. Still, as a command-line utility never intended to be used in an automated context, it's okay.
Another example might be a program to unsort input (see Resources for a link to this utility). This is simple enough; read in input files, store them in some way, then generate a random order in which to print out the lines. This is a utility of nearly infinite applications. It's also a lot easier to write than a sorting program; for instance, you don't need to specify which keys you're not sorting on, or whether you want things in a random order alphabetically, lexicographically, or numerically. The tricky part comes in reading in potentially very long lines. In fact, the provided version cheats; it assumes there will be no null bytes in the lines it reads. It's a lot harder to get that right, and I was lazy when I wrote it.
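The heart of such a program is tiny. A sketch (this version slurps everything into memory, so it shares the usual limitation of giving up on data sets larger than memory):

#!/usr/bin/perl -w
# unsort sketch: read all input lines, print them back in random order
use List::Util qw(shuffle);
print shuffle(<>);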
If you find yourself performing a task repeatedly, consider writing a program to do it. If the program turns out to be reasonable to generalize a bit, generalize it, and you will have written a utility.
Don't design the utility the first time you need it. Wait until you have some experience. Feel free to write a prototype or two; a good utility is sufficiently better than a bad utility to justify a bit of time and effort on researching it. Don't feel bad if what you thought would be a great utility ends up gathering dust after you wrote it. If you find yourself frustrated by your new program's shortcomings, you just had another prototyping phase. If it turns out to be useless, well, that happens sometimes.
The thing you're looking for is a program that finds general application outside your initial usage patterns. I wrote unsort because I wanted an easy way to get a random series of colors out of an old X11 "rgb.txt" file. Since then, I've used it for an incredible number of tasks, not the least of which was producing test data for debugging and benchmarking sort routines.
One good utility can pay back the time you spent on all the near misses. The next thing to do is make it available for others, so they can experiment. Make your failed attempts available, too; other people may have a use for a utility you didn't need. More importantly, your failed utility may be someone else's prototype, and lead to a wonderful utility program for everyone.
- The UNIX Programming Environment by Brian W. Kernighan and Rob Pike (Prentice Hall, Inc., 1984) is an essential part of any programmer's bookshelf. You can also download all the example code for this book on the Bell Labs' CSR page.
- Download core, file, shell, and text utilities from the GNU Web site software pages.
- You can download the source code for the "unsort" utility described in this article.
- More information about the BSD utilities jot and rs is available on freshmeat.
- Read more best practices and hints on getting started coding in "Developing a Linux command-line utility" (developerWorks, June 2002).
- Learn how to write secure applications, validate and check inputs, prevent buffer overflows, and more with David Wheeler's Secure programmer column on developerWorks.
- The "Bash by example" series on developerWorks will help you get started writing shell scripts.
- The tutorial "Building a cross-platform C library" shows how to convert an existing C program or module -- or utility -- into a shared library (developerWorks, June 2001).
- Peter Seebach has previously written a series on how to treat UNIX utilities as a component architecture on developerWorks.
- Get the developerWorks Subscription (formerly the Toolbox subscription) to get CDs and downloads of the latest software from IBM to build, test, evaluate, and demonstrate applications on the IBM Software Development Platform.
- For Linux tools on AIX, check out the AIX toolbox for Linux applications.
- To develop an application on Linux using trial versions of the latest IBM tools and products, visit the Speed-start your Linux app site.
- Find more resources for Linux developers in the developerWorks Linux section.
About the author: Peter Seebach has been writing utilities for a long time. He uses about one in ten of the utilities he writes more than once, which seems to be a pretty good track record. You can contact Peter at [email protected].
First published by IBM developerWorks. Reproduced by LinuxDevices.com with permission.