Programming Style Guidelines

This page started life as a copy of man mfcf-programming-style (as of 2008/09/05). Updates are being made to reflect current practice in CSCF, particularly in light of multiple programming languages and software development environments. In addition, we also have a Programmer Guide which provides general advice for programmers.

While many of the examples below are currently in C, an effort is being made to make this document language-independent where possible, and give examples in multiple languages where necessary. Please extrapolate to which ever languages you are using in your projects (if possible, also comment with suggestions in ST#66691).

Table of Contents

Introduction

While there isn't one obviously correct style that one should follow in programming, it should be clear that one should at least follow some consistent style. If everyone uses the same style it makes it much easier for one person to understand the code written by another.

To be fair, almost any self-consistent style should be acceptable; but this document isn't attempting to be fair. It describes one particular style, not necessarily any better than any other style, but one that you should follow for consistency.

This document includes non-style issues of correctness (such as the section on C pointers) without attempting to separate them from purely style issues.

You should keep style in mind right from the beginning when writing code. Otherwise you may find it all too easy to say that you just want to get the thing working and will fix up the style later. But later you may say that it works fine so why fix it, or, being persuaded why, you will find retrofitting the style as much work as writing the program in the first place.

If you follow a consistent style from beginning to end you will eventually find that this style becomes natural to use and that your programs are easier to write because of it.

Make it Readable with Minimal Cross-Referencing

The casual reader of a program shouldn't be required to read the entire program every time an attempt is made to determine whether a particular implementation (e.g. of a function or manifest) is correct, and/or what it can be safely changed to. An adequate characterization of an object should be available for the object's implementation so that the reader can determine that the implementation is correct. If it's not obvious to the reader of a characterization of an object how to successfully use the object, then the characterization is inadequate.

It should be obvious what form input data is to take, what form outputs will take, and (if needed) what is responsible for allocating and deallocating storage for the data.

Adequately characterizing objects is done with descriptive names, comments, and type definitions.

Comments (make them useful)

Each function should begin with a comment describing how it should be called, what it will return, and any side effects it may have. To anyone reading it, it should be obvious how and why one would want to call the function. It is usually not important to the caller how the function is implemented, so details of how the function works should not appear in this comment. Instead a second, separate comment should be added in cases where the implementation details are not obvious (e.g. /* Uses Slobodnik's Inverse Algorithm (CACM June 1923) */).

The following format is suggested for C:

/*
 *
 * DESCRIPTION:
 *    A brief description of why someone would want to call
 *    this function.
 *
 * type function(type1 param1, type2 param2, ..., typeN paramN);
 *    Description of value returned (unless type is void).
 *
 * PARAMETERS:
 *    param1:
 *        Description of parameter #1.
 *    param2:
 *        Description of parameter #2.
 *    paramN:
 *        Description of parameter #N.
 *
 * EFFECTS:
 *    List any externals that might be changed.
 *
 */

Brief descriptions justifying their existence should accompany all definitions of manifests, macros, typedefs, etc. i.e. what they do, or what one uses them for.

As much as possible, the names of the identifiers should document their meaning as well. Something such as

#define BUFSIZE 100 /* Size of buffer */
is not only a useless comment, it is a silly name for the manifest unless the program that has access to this manifest is very small. There may be many different kinds of buffers in many places. It isn't obvious what kind of buffer is intended here, and is everything that might ever be referred to as a buffer in the program expected to have this same size, and what units is that size measured in? Who knows; it certainly isn't obvious from either the comment or the name of the manifest. Giving it a more meaningful name, such as NUMBER_OF_BYTES_FOR_THE_TERMINAL_INPUT_BUFFER, is stretching the point, but it certainly makes the intended use of the manifest obvious.

It would be nice if each variable definition had a comment beside it describing the intended use of the identifier, but often it is even better if the scope of each variable is kept small enough that this isn't necessary.

Be sure to use descriptive identifier names, not x and i, especially if the scope of the variable is more than a few lines. If the name is descriptive enough, its usage should be obvious and will need fewer comments.

Individual pieces of executable code should have as few comments as possible. Too many comments, especially those that are redundant or just plain wrong, simply serve to obscure the code. This doesn't mean that there should be no comments, only that those that are there should be essential. For instance, ++x; /* increment x */ doesn't add information, while

for (i = 0; i < 17; ++i) /* for all elements of the vector */

isn't much better. On the other hand, something like

for (index = 0; index < (sizeof(vector)/sizeof(vector[0])); ++index)

says the same thing without even using a (possibly wrong) comment.

When it is necessary to use comments, they should describe why something is being done, not how it is done. The how should be obvious from the code itself. If it isn't, the code should be rewritten.

In the above for-loop example, /* find the largest member of the vector */ might be a useful comment if the loop is long enough that this isn't obvious at a glance, if that is the only thing the loop does, and if the comment really is true.

Reusability (imitation/theft is the sincerest form of flattery)

Software should be built out of pieces that are simple and useful enough that others will want to use them. The highest praise that can be given to a function is to have it installed in a generally used library. Next best is to have the software copied, and then used with little or no modification. If you find yourself writing something that ought to already exist, then make sure that you never have to do it again.

Similarly, if something ought to already exist, you should make sure it doesn't already exist- make use of public software libraries or even http://www.google.com/codesearch before you begin coding.

Design (simplify functionality)

Most programmers spend most of their time fixing and modifying existing code, not writing new code. Far more time is spent maintaining a program than was spent writing it in the first place. If you are fortunate enough to be writing new code, you should do all that you can to make it easy for those that will have to maintain it. Who knows, it might be yourself.

The components of a program should be written in such a way that even large changes or additions to the original functionality of the program can be accommodated without too much of the code having to be rewritten. This is often easier said than done, but a number of principles can be kept in mind when designing a program.

The use of global variables ("externals" in C) should be restricted as much as possible. Functions that pull magic values out of thin air are difficult to read, not to mention error-prone. If you later want to change globals, it can be difficult as you have to track down all the functions that use them and determine the effects the change will make.

Similarly, the scope of variables should also be as restricted as possible. Such things as defining placeholder variables such as 'i' or 'j' at the top of a long function, and then using them whenever temporary ints are needed is a common practice, but one that makes the code difficult to understand and can easily lead to errors. If a value is assigned to i and then used in the next statement, it isn't obvious whether this value is no longer required or it is supposed to be retained by the variable until it is needed much later in the code.

On the other hand, if you use something like

statement; {auto int i; i=func(); func2(i,i);} statement;
the purpose and the scope of the variable is obvious, and a good optimizer won't even bother to use the variable.

Individual functions, or small groups of functions, should be kept ignorant of what the rest of the program is doing and especially so of how the rest of the program is doing it. For instance if at some time in the future it is decreed that a program must use menus for its user-interface, most of the program should be unaffected by this change. If that isn't the case, you have to wonder why so much of the code needed to know this one small aspect of the program.

Another comment on comments: describing functions

For each function you should try to come up with a very simple description of what it does (but not how it does it). If you can't, either you aren't yet ready to write the function or it should perhaps be written as several separate functions. This description should concern itself strictly with what the function is supposed to do and have nothing to do with the functions that will be calling it. Such a description should in itself be enough for someone who knows nothing about the rest of the program to code the function. If it isn't enough, or the description is very complicated, you are probably attempting to do too much with the one function.

Testing (provide automatic self tests)

A program should be accompanied by a test suite (that can be run non-interactively). This allows changes/corrections to made with some confidence that the process hasn't otherwise damaged the software. It also allows the software to be tested automatically on multiple architectures, something that just isn't practical to do manually.

How to test

The test target of the xhier Imake templates invokes the run_tests command which is a convenient way of running a test suite.

In order to be able to create a test suite, it may be necessary to modify the software in question. Modifications are typically configuration options which change locations of data, network addresses, etc.

What to test

Suggested tests are inputs that are at and near boundary conditions in the program, and inputs that are erroneous. E.g. if the length of some input was limited, then lengths in the middle of the allowed range, one less than the maximum, the maximum, one more than the maximum, and something much larger than the maximum would be obvious candidates for test inputs.

Efficiency (make it work well first)

Many programmers are overly concerned with efficiency. First, make it work well. There is little point in spending hours making a piece of code extremely efficient if that code is only going to be executed a few times a day. The cost of the programming time will far outway any savings in cpu. Make your program efficient by basing it on an efficient algorithm. A bad implementation of an efficient algorithm often runs better than an incredibly efficient implementation of a bad algorithm. Even if it doesn't, the bad code can be fixed; but the bad algorithm usually can't without redesigning the entire program from scratch.

Efficient code can also be difficult to maintain. Often the efficiency is done at the expense of the legibility and understandability of the code. If the tricks used to create the efficiency require long comments to describe, or if anyone else looking at the code has difficulty understanding what is going on, it is going to be very difficult for anyone to maintain this code. Changes that might have taken minutes to make in a well written function might take hours to make because of the efficiency tricks. There are few applications that can justify efficiency at the expense of the clarity of the code.

Even worse, the attempt at efficiency can be self defeating, if they rely on tricks of how the compiler (or interpreter) runs. These are often future-incompatible, un-portable, and unnecessary.

Of course considerations of efficiency do have some importance, but only when done in the right way and the right places. The program itself should have an efficient overall design, independent of the particular compiler or operating system involved. Obviously expensive operations, such as getting a name from a file, should only be done once, not every time the name is needed.

You should write for clarity and ease of modification, not simply for efficiency. Once it works correctly, you can use tools such as profilers to determine which parts of the code take up most of the time to execute. Typically, a program will spend nearly all its time in only a few small sections of code. In that case, you can finally start to worry about improving the efficiency of the entire program by concentrating on those few sections. If that isn't the case, either the program is simply a straight run through and efficiency doesn't really matter, or some of the larger functions might be broken up into smaller functions to isolate the sections that actually determine the overall efficiency of the program.

Obviousness of Purpose (say what you mean)

There is usually more than one way of accomplishing the same thing, all of them perfectly correct. The choice of which to use is thus a matter of personal taste, but one should try to chose the method that most clearly reflects the purpose of the code.

Consider bit operations such as shifts and masks. How should you set the least significant bit to 0 in C code?

x & 0xfffe     /* turn off bit 0, leave bits 31 through 1 alone */
x & ((-1U)<<1) /* turn off bit 0, leave other 1 bits in -1 alone */
(x>>1) << 1    /* shift away bit 0, shift zero back into its place */
x & ~01        /* turn off bit 0 */

The first works fine if x is a 32-bit integer, but on a 36-bit machine the upper four bits will also be zeroed. The second works fine, regardless of the number of bits in a word, but it only works on 2s complement machines. The third will work correctly on any machine, but its two operations tend to obscure its purpose. The last is the best one though. It works on any machine, it is very simple to code, and its purpose is obvious. (Note that we used (-1U)<<1 instead of (-1)<<1 because on some machines the latter would cause an overflow fault.)

Now consider a function call such as func(++i);. You could have called the function with ++i, i+=1, i=i+1, or i+1, all generating the same argument value. The first three have exactly the same meaning and generate exactly the same machine code: the value of the variable i increases by 1, and the result passes into the function. The last one, however, adds 1 to the value of the variable but doesn't change the value of i itself. As far as the function is concerned, all four are equivalent. But are they equivalent in terms of the code that called the function? If so, then it didn't matter whether the incremented value was stored back into the variable or not, and in that case it would have been better to use the i+1 expression since it does not give the misleading implication that the value of i needs to be incremented before it is used again.

Similarly consider the statements {a+=7; b+=3; c+=1; d+=2;}. The third could just as correctly been written as ++c; or c++;, but by using the same operator as the surrounding statements this statement more naturally blends in and so appears to be just one of a group of four similar statements doing similar things.

Now what about isolated increment statements? Is it better to use i++; or ++i; ? In C, they both mean the same thing and probably both generate identical machine code. But the latter is simply short for (i=i+1), while the former is short for the expression ((i=i+1), (i-1)), or even ((temp=i), (i=i+1), temp). Clearly, in this context, i++; implies far more than is necessary, and so the conceptually simpler ++i; is to be preferred.

Exception Handling (never ignore an error status)

One of the biggest causes of mysterious, hard to diagnose failures is the failure of programs to check error statuses. Check them all, and produce a message describing the problem, with enough information that the reader stands a chance of correcting the problem. An error message is usually followed by graceful termination of the program with a non-zero exit status. A warning message does not result in the termination of the program.

Error/Warning Messages (use a consistent form)

All error messages should be prefixed with the name of the program that produced them, i.e. by being of the form:

  <ProgramName>: <MessageText>

The program name is usually the last component of the program pathname, i.e. the filename part of argv[0]. Warning messages are similar in form:

  <ProgramName> warning: <MessageText>

Multiline messages are permitted to include the program name in the just the first line, with the remaining lines being indented, e.g.

  <ProgramName>: <MessageText>
        <MessageContinuation>
        <MessageContinuation>
         ...

A usage message describes errors in command line syntax, and at most is intended to contain a reminder of correct syntax, rather than a complete description of the syntax. A usage message might look like:

  !MyProgramName usage: !MyProgramName [-v] [-e] (-x|-z) <InputFile>*

where text in square brackets is optional, angle brackets and the descriptive name they surround are to be replaced with an appropriate argument, items followed by a "*" may appear 0 or more times, and only one of the arguments that appear on either side of a vertical bar (all optionally bracketed with round brackets) is to be supplied.

Error, warning, and usage messages all appear on the error output of the program (i.e. stderr rather than stdout).

Command Syntax

Although it's not a pleasant syntax, the defacto standard for command line syntax uses single letter options. This is most easily implemented using the getopt library function. An alternative is the keyword syntax implemented via the op_option functions in the libmfcf library.

Source Organization (make things easy to find)

However the parts of a program are divided among files, it should be the case that the casual reader can easily find the definition of any object in the program.

An obvious approach to this is to use one file per object. This is easy to do, easy to use, but can sometimes result in more files than some like.

Another approach is to rely upon a tags file, and using a text editor that honours it. The only disadvantage of the tags approach is that one can't edit the object in question using a "start of file" editor command to mean "start of object". And if one is attempting to use multiple windows to edit multiple objects that happen to reside in the same file, there are text editors that make this an error prone endeavour.

Another approach is to rely upon a naming convention that relates the names of objects with the names of the files that contain them in some obvious easy to use way. E.g. if all object names are of the form X_Y, then the file named AB.c might contain all of the AB_ object definitions.

For large programs, combinations of conventions can be useful, e.g. combining the single file approach with a naming convention, resulting in finding X_Y in X/Y.c.

For trivial programs, the entire program can be in one file.

For anything but trivial programs, one function per file is always a good style to follow.

Indentation and Brace Style (consistency)

There are many different ideas of the correct way to use indentation and braces. None of these are obviously correct, but the most important thing to consider is that the result should be self consistent.

Indentations normally use one tab-stop per level. Never use spaces to do this indentation. Whether the tab stops should be every 8 or every 4 characters is something that everyone argues about, but whatever you decide to use, on UNIX you should put a .exrc file, containing the following line, into the source directory: set tabstop=4 shiftwidth=4, (or 8 if you so choose). That way, anyone else editing the file (with VI) will have things indented the same way as you had them indented.

Because of the differences in opinion about the optimal tab stop to use, it's best to avoid embedded tabs. Just use tabs to indent from the left margin. That way text containing tabs will be readable (although possibly rather wide) regardless of what tab stops are used to display it.

Even though we have printers and terminals capable of much longer lines, most displays are 80 characters wide. Thus it's best to try to restrict text to lines of at most 80 characters, after tabs have been expanded. It doesn't take very many levels of indentation to exceed this limit, and so rather than having to use very short identifiers, the preferred setting for tabs is every 4 rather than every 8.

In function definitions, you should use braces as in the earlier example of a function definition. Within the function however there are some variations, but the following is the officially accepted style:

if (condition) {
    statement;
}
else if (condition) {
    statement;
}
else {
    statement;
}

switch (value) {
    case 1:
        statement;
        break;
    case 2:
        statement;
        /*FALLTHROUGH*/
    default:
        statement;
}

while (condition) {
    statement;
}

do {
    statement;
} while (condition);

do { ; } while (condition);

for (initialize; test; iterate)
    continue;

for (initialize; test; iterate) {
    statement;
}

The most important one is probably the do-while statement. If you must use this statement, always put the while on the same line as the closing brace. Otherwise, if it is on a separate line, it isn't always obvious from the context whether it is the end of a do-while, or simply a separate while statement.

Leaving a blank line before case statements is a common practice.

Sometimes if-conditions can get very complicated. Even so, you should format them to make them as readable and understandable as possible.

For instance, if (c1 && (c2||c3) && (c4||c5)) uses spaces and parentheses to group the conditions. Never rely on the compiler's always doing the ors before the ands (or is it the other way around?). Typically however, the conditions won't be simple two-character identifiers, and the statement won't fit nicely onto one line. If you don't need the else, you can simply break the condition up into several if statements such as

if (condition1) {
    if (condition2 || condition3) {
        if (condition4 || condition5) {
            statement;
        }
    }
}

This will generate the same code as if it were all jammed into a single if statement, but will be much easier to understand and much less likely to contain mistakes. If you need an else, the conditions must remain as a single test, and so you can regroup it onto several lines:

if ( condition1
 && (condition2 || condition3)
 && (condition4 || condition5) ) {

And again the visual pattern reflects the logical grouping.

White Space

You should use white space (blanks or new-lines) to visually separate distinct sections of code.

This separation should enhance the natural grouping of the language and not make it more confusing however. For instance, x = 2 * a+b; is misleading since it visually groups the addition before the multiplication. Either x = 2 * (a+b); or x = 2*a + b; should be used instead, depending upon which meaning was intended.

Much variation on the use of blanks around operators and variables is allowed, but one's style should be consistent, especially within a single statement. Don't, for instance, leave many spaces at the beginning of a statement and then pack things together as you approach the right margin.

Put blanks after language keywords, such as while or if, but not after function names.

Put a blank line between the declarations and the executable statements in a block. Put a blank line between functionally distinct sections of code to suggest that some operation has been completed and a new one is to begin. If this is a major section, you might also add a comment indicating what is about to happen next or what state things are supposed to be in at this point.

Readability

Make your code as easy to read as possible. Packing several assignments and function calls all into one line seldom does any good. The resulting line requires extra time to read and understand and can easily contain buried mistakes.

Often such things are done in the mistaken belief that the code will be more efficient. In fact some compilers can actually generate worse code in some cases (BSD 4.2 did).

Breaking these complex statements up into several simple statements is usually the best thing to do. Even having to use temporary variables isn't that bad as most optimizers will eliminate their use anyway.

Writing complex non-obvious code confuses the reader and can often even confuse the optimizer.

Conditional Compilation

Sometimes one wants to be able to compile different versions of a program without having to make major changes to the source each time. For instance, the name of a directory might be different depending upon which version of the operating system the program is compiled for. Typically one would use something like:

#if !defined(C_OS_UNIX_MIPS)
#if !defined(C_OS_UNIX_SGI)
    /* usual case */
    extern char libdir[] = "/usr/lib";
#   define MAXGROUPS  32
#else  /* SGI */
    extern char libdir[] = "/blah/usr/lib";
#   define MAXGROUPS   1
#endif /* SGI */
#else  /* MIPS */
    extern char libdir[] = "/bsd43/usr/lib";
#   define MAXGROUPS  64
#endif /* MIPS */

Note that because some ancient versions of the preprocessor (e.g. BSD) don't support the #elif directive, one should nest the #if's and #else's as was done above. It also makes it easy to add new cases as they become necessary.

The tests are best done for not defined, rather than for defined. Consider this very commonly seen (bad) example:

#if defined(vax)
    blah blah blah
#else /* vax */
    blah blah blah
#endif /* vax */

Now if the conditional code is fairly long, one may see either of the statements labelled /*vax*/ and think that the enclosed code is for the VAX compilation. In fact that code is for the non-VAX compilation.

If the conditional code is more than a few lines, you should always label the #else and #endif statements, and always make sure the labels agree with what is happening.

C-Specific Guidelines

Portability is worth the effort: Using LINT

Due to the extremely forgiving nature of the BSD C compiler, much code that works fine on a VAX will be full of bugs on many other machines. Thus it is very important that you use LINT to check for such potential problems.

Making your programs portable requires some effort, but the resulting code will be just as efficient as before, and it will be much more understandable and easy to change.

Some programs obviously require some non-portable code, but you should isolate that code as much as possible (e.g. put it all into a few functions, clearly documented as system-specific), so that if anyone ever has to port the program they can easily find the code that they will have to replace.

It's unfortunate, but some libraries don't have corresponding LINT libraries, so it's not practical to have them LINT cleanly. And some architectures have a LINT that's not practical to use. An alternative is to use gcc -Wall.

If your program is to be as portable as possible, and you are using libmfcf, you should call LINT with the -N -O -e -lmfcf flags. If you only worry about UNIX systems, and use libmfcfunix, you should call LINT with the -N -O -lmfcfunix -lmfcf flags. In either case your code should only include header files from <libmfcf/*.h>.

If LINT gives you any warnings whatsoever, it is for the following three reasons:

  1. LINT has a bug in it, in which case tell someone and they will fix it,
  2. the MFCF library has a bug in it, in which case tell someone and they will fix it, and
  3. your own code has a bug in it, in which case you should fix it.

There is really no excuse for having code that doesn't LINT cleanly. One can almost guarantee that code that LINTs cleanly and only uses functions from the libmfcf library will port to any system (ignoring such things as compiled-in file names and the use of m_syscall()).

Source Organization (libraries should have one external function per file)

With most loaders, accessing an external will cause everything defined in the same source file to be loaded too, even though most of it is never needed. So, whenever possible, all functions and large externals should be maintained in their own individual source files. Of course statics must be in the same source file as the functions that use them.

The source files should be named ‘name.c’, where name is the name of the principal function or external defined in the file.

There should be a file ‘libname.h’ for inclusion by other source that use this libname library. In it should have external declarations of all functions and externals that the user is allowed to access. It should also contain any typedef, #define, structure, or enumeration definitions that the user might need. Even functions returning the default (int) type should have explicit declarations.

There should be a file, say ‘name.h’, for inclusion by all the source files for the library. This file should include the externally visible libname.h header file, and also contain external declarations and typedefs etc. for anything used and defined by the library but not intended for use by anyone calling the library.

There may be other header files, all with names ending in .h, that some, but not necessarily all, source files can include.

Header Files

Header files are all named ‘name.h’. They should be set up in such a way (especially if they are intended for inclusion by other users as in the case of libraries) that it is possible to include them more than once without any problem. This can normally be done by wrapping the file like this:

#if !defined(NAME_INCLUDED)
#define NAME_INCLUDED

<contents of file>

#endif /*NAME_INCLUDED*/

Header files should also be self contained. i.e. If anyone includes your header file, they shouldn't have to include any other header files, your header files should already include them. A quick, but not necessarily conclusive, test is to run LINT on the header file itself.

Header files should contain declarations of externals, but never definitions. But header files should not contain any extern declarations for anything not defined in the source. If something from another library (such as the standard C library) needs to be referenced, the appropriate header file should be included. Thus you should never use statements such as extern int isatty();, extern int getchar();, or extern char *sprintf();, since the library's header files may define these as macros on some systems. On some compilers (such as the ANSI X3.159 standard), failing to include the appropriate header file can even call the function with bad arguments since a function prototype will not be in effect at the time the call is compiled.

Some loaders let you get away with putting int x; into header files, but LINT and most loaders won't. You must resolve the ambiguity yourself by putting extern int x; into the header file, or int x = 0; into the source file.

Reserved Names (no underscores in the first three positions)

All identifier names, whether for variables, typedefs, #defines, or whatever, must obey certain rules. Of course these rules aren't always obeyed by all implementors, but if you follow them you'll avoid most difficulties.

Libmfcf defines a number of identifiers containing double underscores. These are intended for internal use only, and no private program should ever need to use them. You may of course define your own identifiers with double underscores so long as they still obey the following rules.

The implementors of the compiler and C library reserve a leading underscore character for their own use. The implementors of libmfcf reserve an underscore in the second or third characters for their own use. Thus you should never define an identifier with an underscore as any of the first three characters.

In addition, the C language itself reserves several keywords such as break, while, const, and int.

Identifier Names (should be self documenting)

Due to the limitations of some loaders, all external names must be unique in the first 6 monocase characters. The -e flag of BSD LINT can be used to find such violations. If you must, you can use #define long_identifier shrtid to apparently eliminate this restriction and to make your code easier to read, but you must still come up with the 6-character names, and remember that these are the names that the compiler, loader, debugger, LINT, and error messages will use, so perhaps it isn't so obvious that using this trick is a good idea.

Always use Dual-Case names for typedefs.

Use UPPER-CASE names for enumeration constants and manifests. All macros that do not behave like functions should also have their names entirely in upper-case as a reminder to the caller to avoid arguments with side-effects.

lower-case should be used for all other names, including macros that do behave like functions (i.e. they evaluate all arguments exactly once, don't make use of side effects, and can be used before an else ).

The names chosen for identifiers should be self-documenting. For instance, in code such as x = y*z; it isn't immediately obvious what is happening, and either the reader has to scan through the source to see what these three variables represent, or the author has to add a comment to explain what the statement does. On the other hand, cost = rate*items; is much more useful. And it generates exactly the same code, so there is really no excuse for using short identifier names.

Similarly, comments and names should be in as correct English as is possible. Why bother to use indx, rsult, or tho, when index, result, and though will generate identical code and be much easier to read.

C Pointers (beware of lazy code)

Pointers to various types can be of different sizes, can have completely different bit patterns in their internal values, and might have values that won't fit into an (int) or even a (long). If you find yourself casting pointers from one type to another or looking at their integral values, you are probably doing something non-portable and should try to find a better way of accomplishing the same thing unless what you are trying to do actually is intended to be very machine-specific.

Casting portably between pointers of different types is possible, but only if done in the correct way and for the correct reasons. The language allows a pointer of one type to be cast to a pointer of another type if the former type is no smaller than the latter type. For instance, if sizeof(double) is greater than sizeof(int), one may safely do the following:

auto double dbl;
auto double *dp;
auto int *ip;
   
ip = (int *) &dbl;
dp = (double *) ip;

but note that the only thing one can do with ip is to assign it to another pointer. Attempting to use it as a real pointer could easily produce a machine fault.

Libmfcf provides a c_Pointer type that you can use to hold generic pointers temporarily. You may cast any pointer to a c_Pointer and then back to the same type without any problems. You must not cast it to any other type, or attempt to use its value in any other way.

The constant literal 0 has special meaning in the C language in that you may assign it to any identifier and have it silently coerced to the appropriate type. But this does not necessarily mean that the actual bit pattern assigned to the variable will be all zero. For instance, char *cp = 0; int *ip = 0; defines cp and ip as null pointers, but their internal representations won't necessarily have the same values. Assigning cp to an int might produce the value -1 on one machine, 0 on another, and 9387 on a third. On the same machines, ip might have completely different values. Statments such as cp = 0; ip = (int *)cp; will leave garbage in ip on many machines. Only the literal 0 has the special coercion properties.

This property of 0 does not extend to function arguments on most compilers. For instance if you call a function, defined with a (char *) parameter, with a literal 0, the function will die or cause damage on many compilers. To make it work correctly you must call the function with (char *)0 as the argument, never with a literal 0.

Note that with the BSD VAX compiler all pointers have identical properties and all function arguments have identical alignment, so that no matter how badly you write your code it will still tend to work ok. This doesn't mean that it is ok to write sloppy code on a VAX; it means that if you are writing your code on such a machine you must be extra careful since bad code will still appear to work fine but will cause severe problems when compiled with a different compiler. LINT will point out most, but not all, such problems.

Casts (only when necessary)

Casts are used to convert one type of data into another type. These are needed for example when generating a literal null pointer to pass to a function as in func((char *)0). Casts can also tell LINT that you really do want to do what you said in operations that look questionable. For instance if you assign a (long) value to a (short) variable, LINT will warn you that it might not fit unless you explicitly cast the (long) value to a (short) type. Similarly you can cast the result of a function to (void) if you really don't care what the result is.

Far too many programs abuse casting though. Some people tend to cast virtually everything so that they won't have to worry about getting any warnings from LINT. But many of these warnings are probably valid and it is the code itself that should be corrected. In particular, with the MFCF version of LINT one should almost never have to cast any function result to void.

Which Types To Use (char, short, and long only as arrays)

Many people are confused about when to use the various integer types (int), (long), (short), and their unsigned versions.

In general the unsigned types should only be used when one is doing bit operations such as &, ~, ^, or >>. Using unsigned integers for numerical values and performing arithmetic on them can lead to problems. For instance one might expect that if (u > -1) would always be true, for any unsigned int u. In fact, it is always false.

As for chosing the correct integral type, you should first decide whether it will ever need to hold a value greater than 32000. If it does, you must choose (long), since there is no guarantee that an (int) can hold more than 16 bits of data. But if the required value does fit into this range you must now decide between (int) and (short). If the type is to be used for a very large array you might want to use (short) to save space. Otherwise (int) is a much better choice since that is the type that can be accessed most efficiently by the machine. Using type (short) for individual variables or small arrays can actually end up using a lot more space than (int) on many machines since they may have to generate several instructions each time they need to access the variable.

Similarly, you should not use type (char) except for very large arrays when you want to store small (<128) integers. For instance, on GCOS-8, every access of a (char) variable takes 16 bytes of instructions as opposed to the 4 bytes required for (int) variables. Thus trying to save 1 or 3 bytes by declaring something (char) instead of (int) actually costs 12 bytes every time that the variable is accessed.

To a lesser extent the same arguments hold for (float) versus (double) types. Just as all (char) and (short) types are converted to (int) before any operation is performed on them, all floating point variables must be converted to or from double precision. Thus an expression such as w = x + y + z;, where the variables are all of type (float), would require three conversions from float to double when reading the values of x, y, and z, and one from double to float when writing the new value of w. Besides wasting space, these extra instructions can often take a significant time to execute. Thus, except for large arrays, type (float) should never be used.

Order of Evaluation (avoid compiler-dependent features)

Often you may unintentionally write code that relies upon the behaviour of a particular compiler. Your program might work fine on your current machine, but moving it to another machine, or even to an updated version of the current compiler, can cause mysterious errors to occur.

Never assume that operations will occur in the order you specified them in a statement. For instance, i = 1; x = i++ + (10*i++); will behave differently on different compilers, or possibly even on different lines of code in the same program, assigning 11, 12, or 21 to x. The compiler is free to rearrange the arguments of commutative operators, such as +, and is free to wait until the end of the statement before actually doing the two increments. The solution is to refrain from writing such code. Never use a variable more than once in the same statement if it has side effects.

For the same reason, (getchar() + (getchar()<<8)) will behave differently under different circumstances since you don't know which function call will happen first.

In fact, this example is even worse than that, since the standard C library version of getchar is actually a macro, though you'd never know it from its all lower-case name. Suppose the next two characters in the input stream are a and b. Then with func(getchar(),getchar());, not only may you actually get func('a','b'); or func('b','a'); since the order of evaluation of the two arguments is up to the compiler, you might even get func('b','b'); or func('a','a'); because of the side effects of using the same macro more than once in the same statement.

If you don't understand why these examples might be problems, there is a good chance your own programs have similar code that will suddenly fail at some time in the future.

Order of Bytes

On top of all the other problems shown by the previous example it illustrates yet another fundamental problem, that of byte-order. The expression c1 + (c2<<8), where the two variables hold characters, is non-portable for at least two distinct reasons.

The first is obvious in that it assumes that a character is 8 bits wide. Some machines can have 9, 12, 64, or some other number of bits for each byte.

The second makes an assumption about the order in which bytes are stored in the machine. Some machines number their bytes left-to-right, and some right-to-left. By right we mean the least significant bit, and similarly by left we mean the most significant bit. For instance if we assume 8 bit bytes, 2 byte shorts, and 4 byte ints and longs, consider the word int w = 'abcd';. (Note the use of single, not double quotes.) The result of w >>= 24 will cause the word to contain the numerical value 'a'. Now consider if w was actually the first word of the literal string abcdefghi. Since ints are 4 bytes, it will still contain the same four letters, but what value will w >>= 24 have? On most machines it will still be the 'a'. But on some machines, such as a VAX, it will instead be the 'd'. This is because on such machines memory is numbered from right to left instead of from left to right.

union example {
    char Char[64] = "abcdefghi";
    short Short[32];
    long  Long[16];
};

string:  i   h  g  f  e   d  c  b  a    a  b  c  d   e  f  g  h   i
Char #:  8   7  6  5  4   3  2  1  0    0  1  2  3   4  5  6  7   8
Short#:  4   ___3  ___2   ___1  ___0    0___  1___   2___  3___   4
Long #:  2   _________1   _________0    0_________   1_________   2
Machine: ______________Little-Endian    Big-Endian_________________

In the above diagram, the index of each item is placed directly under the character number that would correspond to its byte address on the machine. For instance, (char *)&Long[1] would be 4 on either type of machine, but on little-endian machines (e.g. VAX) byte 4 will contain the least significant bits of the number, while on big-endian machines (e.g. SUN) byte 4 will contain the most significant bits of the number.

Note that text strings are easier to visualize on the big-endian machines (e.g. 'abcd'>>24 is 'a') since Roman alphabets are normally written left to right, but that integers are easier to visualize on the little-endian machines since Arabic numbers (like Arabic letters) are written right to left (e.g. in the above, if we set Long[1] = 0; Short[2] = 17;, then the value of Long[1] would then also be 17. Similarly, if one needs a short counter and wants to check for overflow, the value can be accumulated using Long[1] and any overflow from Short[2] would go into Short[3]. Neither of these would work on a big-endian machine.)

External Declarations (only one per external)

As mentioned before, external declarations should only appear in header files, never in source files.

In fact, for any external, its declaration should only ever appear in one place. e.g. the statements extern int x; and extern char * f(); should never appear more than once in the entire source for any program.

External Definitions (explicitly initialize them)

Even though the compiler is supposed to initialize all externals to zero you should always explicitly initialize them yourself. One reason is that some compilers are broken in this respect. Another reason is that putting int x = 0; instead of int x; (or even worse x; ), indicates to anyone looking at the code that this is really the definition of the identifier and not a lazy fortran-style common-block declaration supported by some (e.g. BSD 4.3) compilers.

Function Definitions (format)

Functions should be defined in the following format:

#include <system_headers.h>
#include "private_headers.h"

    type
name(parameter_1, parameter_2, ..., parameter_n)
    type_1 parameter_1;
    type_2 parameter_2;
    type_n parameter_n;
{
    static type_a var_a;
    auto type_b var_b;
    /* function body */
}

Notice that no extern statements appear here. They should only be in the header files corresponding to the externals.

Functions that don't return a value should have type void. You should always provide the types explicitly, even when they have the default (int) type, and each declaration should appear on a separate line. Declaring several identifiers in the same statement can sometimes be a problem when someone wants to make changes later on. For instance, in char *a, *b, *c;, changing the type (char *) to (DefType) requires more than a trivial change (e.g. consider a sed script to do the job). Similarly, if Type is a typedef in a header file, you might want to do something like #define Type Thing*, but this redefinition won't work correctly with statements such as auto Type var1, var2;.

Some people feel that the auto keyword shouldn't be used. Others know that its use makes finding the definitions of variables that much easier. In any piece of code you should either always use an explicit auto or never use it. Don't mix styles.

Macros

Macros can be very useful, but they are also difficult to debug and very easy to abuse. Never use macros to make C look like some other language or to correct deficiencies in C.

#define WHILE(c) while (c) {
#define SWITCH(c) switch (c) {
#define END_WHILE }
#define END_SWITCH }
#define END_CASE break
#define BREAK_WHILE break

WHILE(test())
    SWITCH(getchar())
        case EOF:
            BREAK_WHILE;
        case 'x':
            do_x();
            END_CASE;
        default:
            do_other();
    END_SWITCH
END_WHILE

Besides making the code difficult to read for others, and defeating the brace-matching capability of text editors, such style can introduce serious bugs into the logic of the program. In the above example for instance, although it may not be obvious, the BREAK_WHILE doesn't.

Sometimes macros must contain complex statements. Never write them as:

#define BLAH(a,b) { x=b; if (a) func1(b); else func2(b) }

since this makes their use awkward (in this case the call should not be followed by a semicolon). Instead use something like:

#define BLAH(a,b) do { \
        x=b; if(a) func1(b); else func2(b); break; \
    } while(0)

which enables the macro to be invoked as a simple statement, even within another if or else clause.

Macros are best used to hide information from the code that uses them. For instance

#define MAP_NAME(map) map->info->self->name

might be used by code that needs to know the name of a map, but doesn't need to know what the fields in a map are, or even that a map is a structure pointer. This makes it much easier if the implementation of a map changes at some time in the future (e.g. to a set of functions), since only this macro has to be changed rather than all the code that uses it. Similarly:

#define CREATE_MAP(map, call_me)                do {      \
        auto c_Size len = sizeof((map)->info->self->name) \
        map = (Map*)mm_allocate(sizeof(Map));             \
        *(map) = empty_map;                               \
        strncpy((map)->info->self->name, (call_me), len); \
    } while (0)

provides a simple mechanism for a user to create and initialize a map without having to know about its structure. (Note that in most cases this would probably have better been accomplished using a function, rather than a macro).

Constants (use #defines and enumerations instead)

You should almost never put constant values directly into the code. Exceptions might be made for 0 and 1, and for such obvious things as a function join(4,a,b,c,d);, where the 4 indicates the number of arguments passed to the function. But anywhere that it isn't obvious why the particular value was chosen, or what the effects of changing it might be, the literal constant should be avoided.

Consider a declaration such as auto int table[32], list[32];. If someone wants to increase the size of the table, it isn't obvious what it can be changed to. For instance does the new value, like 32, have to be a power of 2? And if it is changed, does the size of list also have to change? And if there is a while (++temp < 32) buried in the function, is that the same 32 or a different 32?

The correct solution to these problems is to use commented manifests for the constants. For instance, in

#define MAX_PEOPLE 32 /* Maximum number of people we can handle */
#define BUCKETS 32    /* Number of hash buckets.  Must be even. */
#define FREEZING_FAHRENHEIT 32      /* Fahrenheit for 0 Celsius */

auto int table[MAX_PEOPLE];
auto int list[BUCKETS];

while (++temp < FREEZING_FAHRENHEIT)

not only is it obvious that the three numbers have nothing to do with each other, it means that one need change the number in only one place, that the definitions are documented as to how they can be changed, and that using the values documents their meaning. The last means that the names are self-documenting and so one doesn't have to clutter up the code with extra comments.

You must be careful when defining constants though. For instance, in

#define  X  a + b
#define  Y  -c

Consider what the expressions z = 2 * X; and w = -Y; expand to. The first gives (2 * a) + b rather than 2 * (a + b), while the second gives w = --c;. Neither of these is what the programmer wanted, but the code where the bug occurs has nothing obviously wrong with it. The correct solution is to put parentheses around the expression, even when you know they won't be needed.

#define  X  (a + b)
#define  Y  (-c)

Supposing you have a variable that can have one of several values, each indicating a different thing. You could code things like if (flag == 3), or switch (flag) { case 2:, but if you #define manifests for the various values the code will be much easer to read or change, and much less subject to mistakes. Chosing a more meaningful (self-documenting) name than flag would improve things too. An even better solution would be to use enumerations instead of manifests.

typedef enum {
    ACTION_UNKNOWN,    /* Action has not yet been determined */
    ACTION_DELETE,     /* Remove the entry from the database */
    ACTION_APPEND,     /* Add this to the end of the existing entry */
    ACTION_REPLACE     /* Replace the existing entry with this one */
} Action;

    void
act(action)
    Action action;
{
    while (action == ACTION_UNKNOWN) {
        action = ask_for_action();
    }
    switch (action) {
        case ACTION_DELETE:

Note that the executable code is easy to understand even though there are no comments, just as it would be if manifests had been #defined. But enumerations have the advantage that you never need to be aware of the specific integral values assigned to the enumeration. They also allow LINT to check for errors. For instance, if you mistakenly call act(17); LINT would complain that you hadn't called the function with the correct type of argument; with manifests it wouldn't be able to tell. And with some debuggers (e.g. DBX on BSD) information will be displayed using the enumeration names instead of the integral values.

Similarly string literals, such as those containing file names, should only be used once. Using the same literal in more than one place makes it difficult to change and can lead to problems if one instance is spelled slightly differently than the others. Defining a manifest, e.g. #define TERMCAP_FILE "/etc/termcap", eliminates such risks. An even better solution is to define an external, e.g. char termcap_file[] = "/etc/termcap";. This has all the advantages of the manifest, but it also eliminates the multiple instances of the literal that many compilers will generate, it makes it easy to find the functions that use this literal simply by looking at the compiled library, and it allows debuggers to show the name of the variable you provided instead of the memory address (remember, most debuggers don't know anything about manifests, but they work wonders with external identifiers).

Typedefs (easier than using comments)

C and LINT don't do strict type checking for typedefs, but even so there are good reasons to use them.

First, if used properly they provide self-documentation for definitions and declarations. For instance, int infile, outfile; could be used for function parameters, auto variables, structure elements, etc. But if this type is actually a UNIX file descriptor, defining typedef int Unix_fd /* unix open() file descriptor */ and declaring them as Unix_fd infile, outfile; documents their intended use. When writing the code you will not be tempted to perform arithmetic on the variables, as you might with type (int), and you never even need to know what basic type these variables really are.

Secondly, if you ever decide to change the type of something, say changing int counter; to long counter;, because the range of possible values is greater than you originally expected, you only need change the one typedef typedef long BeanCounter; to give all the bean counters in your program the appropriate type. Without typedefs you would have to search through your entire source code to locate all the declarations of variables, structure elements, and parameters that use the values in this counter.

Thirdly, typedefs can greatly simplify complex types. For instance int (*(*f[4])())(); declares f as an array of 4 pointers to functions returning a pointer to functions returning int (got that?). If you have trouble with that, you might find typedefs a little easier to understand.

typedef int (*)() IntFuncPtr;    /* pointer to functions returning int */

You can then declare f as

IntFuncPtr (*f[4])();

or if you need a further breakdown

typedef IntFuncPtr (*)() IFPFuncPtr; /* pointer to func returning IFP */
IFPFuncPtr f[4];

Note that like externs, manifests, macros, and other such declarations, each typedef should only ever appear once in your code, usually in a shared header file.

Original Author

Ray Butterworth

Copyright

Copyright © 1987, 1990, MFCF, University of Waterloo.

This document may be freely copied and distributed in whole or part, provided appropriate copyright attributions are included, and that any additions are clearly indicated as not being part of the original.

Topic revision: r9 - 2015-09-23 - BillInce
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback