Guide to Advanced Programming in C


06 Jan 2014

C is the language of choice for system programming and embedded systems, and a viable option for many other applications. While it is unlikely to have a serious interest in computer programming and never be touched by C, it is quite challenging to understand all of its aspects and dark corners. This article attempts to provide dense material illuminating some of those areas, namely: integer promotions, memory allocation, array to pointer conversions, explicit inlining, interpositioning and vector extensions.

Integer Overflows and Promotions

Most of us C programmers tend to assume that basic operations with integers are safe and exercise prudence elsewhere. Actually, it is not that hard to run into trouble. Consider the following code:

#include <stdio.h>

int main(int argc, char** argv) {
    long i = -1;

    if (i < sizeof(i)) {
         printf("OK\n");
    }
    else {
         printf("error\n");
    }

    return 0;
}

What happens is that the variable i is converted to an unsigned integer. Thus its value is no longer -1, but the maximum value of size_t, which happens to be the result type of the sizeof operator. The reason is described in the "Usual arithmetic conversions" section of the C99/C11 standard:

"If the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type."

size_t is defined by the C standard as an unsigned integer type of at least 16 bits. Usually size_t corresponds to long of the given architecture. That makes the rank of size_t at least equal to the rank of long, and the rule above enforces the conversion to the unsigned type.

That brings us to portability issues with integer sizes. The C standard does not exactly define the sizes of short, int, long, long long and their unsigned counterparts; only minimum sizes are enforced. For the sake of example consider the x86_64 architecture: long on Linux is 64-bit, whereas on 64-bit Windows it is 32-bit. A common approach to making code more portable is to use length-specific types like uint16_t or int32_t defined in C99's stdint.h header file. Three kinds of integer types are defined there:

  • with exactly specified size: uint8_t, uint16_t, int32_t, etc.
  • smallest type with at least specified size: uint_least8_t, uint_least16_t, int_least32_t, etc.
  • most efficient type with at least specified size: uint_fast8_t, uint_fast16_t, int_fast32_t, etc.
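
As a quick illustration (a minimal sketch; the actual widths of the least and fast variants are implementation-defined), the following prints the sizes these typedefs get on the compiling platform:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* exact-width, smallest-width and fastest-width variants of a 16-bit unsigned type */
    printf("uint16_t:       %zu bytes\n", sizeof(uint16_t));
    printf("uint_least16_t: %zu bytes\n", sizeof(uint_least16_t));
    printf("uint_fast16_t:  %zu bytes\n", sizeof(uint_fast16_t));
    return 0;
}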

Unfortunately, using stdint.h will not protect us from all trouble. The integer promotion rule of the C standard says:

If an int can represent all values of the original type, the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.

Thus the following code will return 65536 on 32-bit platforms, but 0 on 16-bit platforms where int is 16 bits wide.

uint32_t sum()
{
    uint16_t a = 65535;
    uint16_t b = 1;
    return a+b;
}
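
If the intention is to get 65536 regardless of platform, one of the operands can be widened explicitly so the addition is not performed in the promoted (possibly 16-bit) int type. A minimal sketch, with the function name chosen only for illustration:

uint32_t sum_fixed()
{
    uint16_t a = 65535;
    uint16_t b = 1;
    /* force the addition to happen in uint32_t instead of the promoted type */
    return (uint32_t)a + b;
}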

The integer promotions preserve value including sign. Whether a "plain" char is treated as signed is implementation-defined.

How the char type is implemented usually depends on the hardware architecture and/or the OS and is typically specified by the ABI (Application Binary Interface) of the particular platform. If you care to find out on your own: in case char is promoted as signed char, the following code will print -128,-127 (as on the x86 architecture), otherwise 128,129. GCC has the -funsigned-char switch to force unsigned promotion on the x86 architecture.

char c = 128;
char d = 129;
printf("%d,%d\n",c,d);

Memory Allocation and Management

malloc, calloc, realloc, free

malloc allocates an uninitialized memory object whose size is specified in bytes. What should happen when the size is 0 is left to the implementation; in other words, neither the C nor the POSIX standard specifies the behavior:

If the size of the space requested is 0, the behavior is implementation-defined: the value returned shall be either a null pointer or a unique pointer.

malloc(0) usually returns a valid unique pointer. Either way, the return value can be passed to free without causing an error, since free takes no action when given a NULL pointer.

Therefore, if the size argument is the result of an expression, make sure to test for integer overflow.

size_t computed_size;

if (elem_size && num > SIZE_MAX / elem_size) {
    errno = ENOMEM;
    err(1, "overflow");
}

computed_size = elem_size*num;

For the common case of allocating a sequence of equally sized elements, consider using calloc instead of computing the size with an expression; additionally, it will initialize the allocated memory to zero. For releasing the allocated space use free as usual.

realloc changes the size of an already allocated memory object. The function returns a pointer to a possibly new memory location whose content is preserved up to the lesser of the new and old sizes. If the new size is larger, the additional space is left uninitialized. If the provided pointer to the old object is NULL and the size is non-zero, the behavior is equivalent to malloc. If the new size is zero and the provided memory object is non-NULL, the behavior of realloc is OS dependent.

Most implementations will attempt to free the memory of the object and return the value malloc(0) would return, or return NULL. For instance, Windows will release the memory and return NULL; OpenBSD will release it too, but return a pointer to a zero-sized object.

In case of failure realloc returns NULL and leaves the provided memory object intact. Thus it is important not only to check for integer overflow of the size argument, but also to correctly handle the object size if realloc fails.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <errno.h>

#define VECTOR_OK            0
#define VECTOR_NULL_ERROR    1
#define VECTOR_SIZE_ERROR    2
#define VECTOR_ALLOC_ERROR   3

struct vector {
    int *data;
    size_t size;
};

int create_vector(struct vector *vc, size_t num) {

    if (vc == NULL) {
        return VECTOR_NULL_ERROR;
    }

    vc->data = 0;
    vc->size = 0;

    /* check for integer and SIZE_MAX overflow */
    if (num == 0 || SIZE_MAX / num < sizeof(int)) {
        errno = ENOMEM;
        return VECTOR_SIZE_ERROR;
    }

    vc->data = calloc(num, sizeof(int));

    /* calloc failed */
    if (vc->data == NULL) {
        return VECTOR_ALLOC_ERROR;
    }

    vc->size = num * sizeof(int);
    return VECTOR_OK;
}

int grow_vector(struct vector *vc) {

    void *newptr = 0;
    size_t newsize;

    if (vc == NULL) {
        return VECTOR_NULL_ERROR;
    }


    /* check for integer and SIZE_MAX overflow */
    if (vc->size == 0 || SIZE_MAX / 2 < vc->size) {
        errno = ENOMEM;
        return VECTOR_SIZE_ERROR;
    }

    newsize = vc->size * 2;

    newptr = realloc(vc->data, newsize);

    /* realloc failed; vector stays intact, its size was not changed */
    if (newptr == NULL) {
        return VECTOR_ALLOC_ERROR;
    }

    /* upon success; update new address and size */
    vc->data = newptr;
    vc->size = newsize;
    return VECTOR_OK;
}

Avoiding Fatal Errors

The general approach to avoiding problems with dynamic memory allocation is to write code as humbly and defensively as circumstances allow. Here are the most common problems and a few approaches to avoid them.

1) Double free corruption

This can be caused by calling free with a pointer that was not obtained from a malloc-family function, or with a pointer on which free or realloc has already been called. To make code more resistant to such errors, consider the following points:

  • Initialize pointers to NULL upon declaration if you cannot assign a valid pointer immediately.
  • Both GCC and Clang have the -Wuninitialized switch to warn about uninitialized variables.
  • Do not use the same pointer variable for both statically and dynamically allocated memory.
  • After calling free, set the pointer back to NULL, so an accidental second free causes no error.
  • Use assert or one of its alternatives to detect a double free while testing and debugging.
char *ptr = NULL;

/* ... */

void nullfree(void **pptr) {
    void *ptr = *pptr;
    assert(ptr != NULL);
    free(ptr);
    *pptr = NULL;
}

2) Accessing memory through uninitialized or null pointer

Following the rules above, your code should only be dealing with NULL or valid pointers. Check for NULL at the beginning of functions or blocks that dereference pointers to dynamically allocated memory.

3) Accessing memory outside of allocated boundaries

Accessing a memory object outside its boundaries does not necessarily cause the program to crash. The program might continue to operate on corrupted data with possibly dangerous behavior, and it is even possible to take advantage of such operations to alter the behavior of the program, access otherwise restricted information or inject executable code. Pedantic manual checking of the boundaries of arrays and dynamically allocated memory objects is the primary approach to prevent these risks. The information about the boundaries of memory objects has to be tracked manually. The size of an array can be determined with the sizeof operator, but after an array is converted to a pointer, e.g. during a function call, sizeof will return the size of the pointer itself instead of the array.

The bounds-checking interface of Annex K of the C11 standard defines a new set of library functions that provide alternatives, easier to use securely, to common parts of the standard library (such as string and I/O manipulation). There are open-source implementations like the slibc library, but the interface is not widely adopted yet. BSD-based systems (including Mac OS X) provide the strlcpy and strlcat functions for safer string manipulation; they are available on other systems through the libbsd library.

Many operating systems provide an interface to control access to memory regions and protect memory against unintended read/write operations, such as the POSIX mprotect. These mechanisms usually apply to whole memory pages.
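
As a sketch of how such an interface is used (assuming a Linux/BSD system where MAP_ANONYMOUS is available; error handling kept minimal), a writable anonymous page can be switched to read-only so that any later write would trap:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void) {
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);

    /* map one anonymous read/write page */
    char *page = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;

    strcpy(page, "hello");

    /* revoke write access for the whole page */
    if (mprotect(page, pagesize, PROT_READ) != 0)
        return 1;

    printf("%s\n", page);   /* reading is still allowed */
    /* page[0] = 'H'; */    /* writing now would raise SIGSEGV */

    munmap(page, pagesize);
    return 0;
}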

Avoiding Memory Leaks

Memory leaks are caused by not releasing dynamic memory that is no longer used by the program. Thus it is essential to truly understand the required scope of an allocated memory object, most importantly the point (or condition) where free should be called. While this gets more difficult with the growing complexity of an application, it is important to think about memory management upfront, as part of the early design decisions.

Here is a list of general approaches to address these issues:

1) Allocate on startup

An example of keeping memory management simple is to allocate all required heap memory upfront at program startup; the burden of releasing it is left to the operating system when the program ends. There are many cases where this solution is satisfactory, in particular for programs that process their input in one batch and finish.
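
A minimal sketch of this pattern, with all working memory grabbed once at startup and never explicitly released (the buffer name and size are only illustrative):

#include <stdio.h>
#include <stdlib.h>

static double *samples;        /* all working memory of the program */
static size_t samples_len;

int main(void) {
    samples_len = 1u << 20;
    samples = malloc(samples_len * sizeof(*samples));
    if (samples == NULL) {
        fprintf(stderr, "out of memory\n");
        return 1;
    }

    /* ... process the whole batch using samples ... */

    return 0;   /* the operating system reclaims the allocation on exit */
}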

2) Variable Length Arrays

If you need temporary storage with a variable size and its required lifetime is within the scope of a function, consider using a VLA (Variable Length Array). But there is a limitation: the storage should not be bigger than a few hundred bytes per function. Variable Length Arrays, specified by C99 (and made optional in C11), have automatic storage, so they are bound to the same scope as other automatic variables. Even though the standard does not explicitly require it, the common way to implement VLAs is to place the memory on the stack. The maximum size of memory allocated with a VLA is SIZE_MAX bytes. Being aware of the stack size of the target platforms, we have to stay much more humble to make sure the program will not face a stack overflow and possible data corruption in the following segment of memory.
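
A minimal sketch of a VLA used as scratch space whose lifetime ends with the function (the helper name and the small size cap are only illustrative):

#include <string.h>

int reverse_copy(const int *src, int *dst, size_t n) {
    if (n == 0 || n > 256)     /* keep the VLA small; it lives on the stack */
        return -1;

    int tmp[n];                /* variable length array, automatic storage */
    size_t i;

    for (i = 0; i < n; i++)
        tmp[i] = src[n - 1 - i];

    memcpy(dst, tmp, n * sizeof(int));
    return 0;                  /* tmp is released here, when the function returns */
}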

3) Manual Reference Counting

The idea behind this technique is to count each assignment and each loss of a reference to a particular memory object. The count is incremented on every assignment and decremented on every loss of a reference. When the reference count reaches 0, the memory object is no longer in use and can be released. Since C offers neither automatic destructors (actually, both GCC and Clang support the cleanup language extension) nor a way to override the assignment operator, reference counting has to be done manually by calling retain/release functions. A good way of thinking about it is that various parts of a program take and release ownership of a memory object. Using this method, though, requires a lot of discipline not to forget a release call (which ends up as a memory leak) or to call it redundantly (which triggers free too early). If the required lifetime of a memory object is implied by external events, and the structure of the application implies handling ownership of memory objects anyway, this might still be worth the trouble. The following code block contains a very simplified reference-counting memory manager.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define MAX_REF_OBJ 100
#define RC_ERROR -1

struct mem_obj_t{
    void *ptr;
    uint16_t count;
};

static struct mem_obj_t references[MAX_REF_OBJ];
static int16_t reference_count = 0;

/* create memory object and return handle */
int16_t create(size_t size){

    if (reference_count >= MAX_REF_OBJ)
        return RC_ERROR;

    if (size){
        void *ptr = calloc(1, size);

        if (ptr != NULL){
            references[reference_count].ptr = ptr;
            references[reference_count].count = 0;
            return reference_count++;
        }
    }

    return RC_ERROR;
}

/* get memory object and increment reference counter */
void* retain(int16_t handle){

    if(handle < reference_count && handle >= 0){
        references[handle].count++;
        return references[handle].ptr;
    } else {
        return NULL;
    }
}

/* decrement reference counter */
void release(int16_t handle){
    printf("release\n");

    if(handle < reference_count && handle >= 0){
        struct mem_obj_t *object = &references[handle];

        if (object->count <= 1){
            printf("released\n");
            free(object->ptr);
            reference_count--;
        } else {
            printf("decremented\n");
            object->count--;
        }
    }
}

If you do not care about compatibility with various compilers, it is possible to use the cleanup attribute to mimic an automatic destructor in C.

void cleanup_release(void** pmem) {
    int i;
    for(i = 0; i < reference_count; i++) {
        if(references[i].ptr == *pmem)
           release(i);
    }
}

void usage() {
    int16_t ref = create(64);

    void *mem __attribute__((cleanup(cleanup_release))) = retain(ref);

    /* ... */
}

Another deficiency of the above solution is that cleanup_release is given the address of the object to be released instead of the reference number. Therefore cleanup_release has to do a costly lookup in the references array. One way to fix this is to change the interface of retain to return a pointer to struct mem_obj_t. Another way is to use the following set of macros, which create a variable to hold the reference number and attach the cleanup attribute to it.

/* helper macros */
#define __COMB(X,Y) X##Y
#define COMB(X,Y) __COMB(X,Y)
#define __CLEANUP_RELEASE __attribute__((cleanup(cleanup_release)))

#define retain_auto(REF) retain(REF); int16_t __CLEANUP_RELEASE COMB(__ref,__LINE__) = REF

void cleanup_release(int16_t* phd) {
    release(*phd);
}

void usage() {
    int16_t ref = create(64);

    void *mem = retain_auto(ref);
    /* ... */
}

4) Memory Pools

If a program goes through several stages during its execution, each stage might have a pool of memory allocated at the start of that stage. Whenever the program needs to allocate memory, a part of one of the memory pools is used. The pool is chosen according to the required lifetime of the allocated memory object and the stage of the program it belongs to. At the end of each stage the whole pool is released at once. This approach is particularly useful for long-running processes such as daemons, where it may help to reduce fragmentation of memory over time. Here is a very minimalistic demonstration of a memory pool manager:

#include <stdlib.h>
#include <stdint.h>

struct pool_t{
    void *ptr;
    size_t size;
    size_t used;
};

/* create memory pool*/
struct pool_t* create_pool(size_t size) {
    struct pool_t* pool = calloc(1, sizeof(struct pool_t));

    if(pool == NULL)
        return NULL;

    if (size) {
        void *mem = calloc(1, size);

        if (mem != NULL) {
            pool->ptr = mem;
            pool->size = size;
            pool->used = 0;
            return pool;
        }
    }
    return NULL;
}

/* allocate memory from memory pool */
void* pool_alloc(struct pool_t* pool, size_t size) {

    if(pool == NULL)
        return NULL;

    size_t avail_size = pool->size - pool->used;

    if (size && size <= avail_size){
        void *mem = (char *)pool->ptr + pool->used;
        pool->used += size;
        return mem;
    }

    return NULL;
}

/* release memory for whole pool */
void delete_pool(struct pool_t* pool) {
    if (pool != NULL) {
        free(pool->ptr);
        free(pool);
    }
}

Implementing a memory pool can turn into a very difficult task, so an existing pool allocator library may well be a better fit for your requirements.

5) Data Structures

Many memory management problems can be solved by storing data in the right data structure. While the choice of data structure is implied mostly by the needs of the algorithms accessing the data, keeping data in structures like linked lists, hash maps or trees has additional benefits, e.g. being able to traverse the structure and release all the data at once. Since the standard library offers no support for such data structures, a third-party library is usually needed.
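
For example, keeping allocations linked together makes it possible to release everything in one traversal (a minimal sketch; the node layout is only illustrative):

#include <stdlib.h>

struct node {
    struct node *next;
    void *payload;
};

/* release every node and its payload in a single pass */
void free_all(struct node *head) {
    while (head != NULL) {
        struct node *next = head->next;
        free(head->payload);
        free(head);
        head = next;
    }
}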

6) Mark and Sweep Garbage Collector

Another approach is to take advantage of an automatic garbage collector and be relieved of the need to release memory manually. In contrast to reference counting, where memory is released as soon as it is no longer needed, a garbage collector is invoked upon a specific event, e.g. a failed allocation or the amount of allocated memory reaching a certain watermark. The mark-and-sweep algorithm is one way to implement a garbage collector: it first traverses heap memory for references to allocated memory objects and marks those which are still reachable, then it sweeps (releases) those which were not marked.

Perhaps the best-known implementation of such a garbage collector for C is the Boehm-Demers-Weiser conservative garbage collector. Drawbacks of garbage collection might be the performance overhead and the non-deterministic stalls it introduces into a program. Another problem is library functions that use malloc internally: that memory will not be managed by the garbage collector and must still be managed manually.

While unpredictable stalls are unacceptable for real-time environments, there are many environments where the benefits outweigh the drawbacks; on the performance side there are even claims of performance increases. Projects using the Boehm GC include the Mono project, the GNU Objective-C runtime and the Irssi IRC client.
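
A minimal sketch of allocating from the Boehm GC instead of malloc (assuming the collector is installed, the header is available as gc.h — on some systems gc/gc.h — and the program is linked with -lgc):

#include <stdio.h>
#include <gc.h>

int main(void) {
    int i;
    GC_INIT();

    for (i = 0; i < 10000; i++) {
        /* allocated from the collected heap; free is never called */
        char *buf = GC_MALLOC(4096);
        buf[0] = (char)i;
    }

    printf("heap size: %lu bytes\n", (unsigned long)GC_get_heap_size());
    return 0;
}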

Pointers and Arrays

Although there are contexts where arrays and pointers are interchangeable, they are treated differently by the compiler and represented differently at runtime.

When we say that an object or expression has some type, we usually have in mind the type of its locator value, also called an lvalue. When an lvalue has a complete non-const type that is not an array type, we call it a modifiable lvalue: it is a value that gets modified when the expression is the left operand of the assignment operator. If an expression is the right operand of the assignment operator, its value does not have to be modifiable; it becomes the value by which the left operand is modified. If an expression has array type, the value of the expression is a pointer to the first element of the array.

That is how an array becomes a pointer in most contexts. The array's value is not converted in two cases: when it is the operand of the unary & (address-of) operator or of the sizeof operator. According to the C99/C11 standard, section 6.3.2.1:

Except when it is the operand of the sizeof operator or the unary & operator, or is a string literal used to initialize an array, an expression that has type "array of type" is converted to an expression with type "pointer to type" that points to the initial element of the array object and is not an lvalue.

Since an array is not a modifiable lvalue and the value of an expression of array type is in most cases a pointer, it is not possible to use the assignment operator to assign to an array. Here is a little demonstration:

short a[] = {1,2,3};
short *pa;
short (*px)[];

void init(){
    pa = a;
    px = &a;

    printf("a:%p; pa:%p; px:%p\n", a, pa, px);

    printf("a[1]:%i; pa[1]:%i (*px)[1]:%i\n", a[1], pa[1], (*px)[1]);
}

a is an array of short, pa is a pointer to short and px is a pointer to an array of short of unspecified size. Before a is assigned to pa, its value is converted to a pointer to short pointing at the beginning of the array. The value of the expression &a is not a pointer to short, but a pointer to an array of short, because the lvalue was not converted before application of the unary & operator.

Application of the subscript operator in the expression a[1] is equivalent to *(a+1) and obeys the rules of pointer arithmetic in the same way as the expression pa[1]. But there is one important distinction: with a, which is an array, the actual memory location of the variable itself is used to obtain the pointer to the first element, while with pa, which is a pointer, the value stored in pa is used, not its location. The compiler has to be well aware of the type difference between a and pa, so it is important to use the right type in declarations of exported variables.

int a[];
int *pa;

Using the following declarations in another compilation unit is incorrect and will break the code:

extern int *a;
extern int pa[];

Array as Argument of a Function

Another place where an array of some type becomes a pointer to that type is the declaration of function parameters. All three of the following function definitions are equivalent.

void sum(int data[10]) {}

void sum(int data[]) {}

void sum(int *data) {}

The compiler will report an error about redefinition of the function sum, because in all three cases it sees the parameter as int *.

Multi-dimensional arrays are a slightly trickier topic. First of all, C does not really support multi-dimensional arrays, even though literature and even the standard use that term; array of arrays would perhaps be a more accurate name.

typedef int vector[4];
vector m[2] = {{1,2,3,4}, {4,5,6,7}};
int n[2][4] = {{1,2,3,4}, {4,5,6,7}};

The variable m is an array of vector type with size 2, and vector is an array of int with size 4. The array n is identical to m apart from the fact that the two are stored at different places in memory. Speaking of memory, both arrays are laid out in a contiguous memory area, just as the nested bracket expressions show. Accessing such an array works exactly as stated above.

int *p = n[1];
int y = p[2];

By applying the subscript operator in n[1] we get an element with type array of int of size four. Because we are addressing the second element of the array, its location is four times the size of int from the beginning of the array. As we know, an array of int is, in an expression such as the one above, converted to a pointer to int, and that pointer is stored in p. Then p[2] accesses the third element of the array produced by the previous expression. An equivalent expression written with pointer arithmetic would be the following:

int z = *(*(n+1)+2);

Which has the same effect as the expression we would write in the first place:

int x = n[1][2];

When passing such an array as an argument, the first "dimension" is converted to a pointer to the first element, which is again an array. Thus it is not required to specify the first dimension, but the remaining dimensions must be stated exactly, otherwise subscripting the array would not work correctly. While we have the freedom to use any of the following forms to define a function receiving an array as an argument, we are always forced to explicitly state the dimensions of the inner array.

void sum(int data[2][4]) {}

void sum(int data[][4]) {}

void sum(int (*data)[4]) {}

To get around this limitation it is possible to treat the array as a flat sequence of int and calculate the offset of the required element manually.

void list(int *arr, int max_i, int max_j){
    int i,j;

    for(i=0; i<max_i; i++){

        for(j=0; j<max_j; j++){
            int x = arr[max_j*i+j];
            printf("%i, ", x);
        }

        printf("\n");
    }
}

Another approach is the one used by the main function to receive the list of program arguments: the function receives a pointer to pointer instead of an array of arrays. The drawback of this approach is that the data must be constructed differently, or converted into the pointer-to-pointer form. On the upside, it allows us to use the subscript operator in the same way as before, because now we have the address of the beginning of each sub-array.

void list(int **arr, int max_i, int max_j);

int main(int argc, char **argv){
    int arr1[4] = {1,2,3,4};
    int arr2[4] = {5,6,7,8};

    int *arr[] = {arr1, arr2};

    list(arr, 2, 4);
}

void list(int **arr, int max_i, int max_j){
    int i,j;

    for(i=0; i<max_i; i++){

        for(j=0; j<max_j; j++){
            int x = arr[i][j];
            printf("%i, ", x);
        }

        printf("\n");
    }
}

Initialization gets much simpler with strings, since it is allowed to initialize pointers with string constants directly.

const char *strings[] = {
    "one",
    "two",
    "three"
};

But there is a pitfall. Since the string constants were converted to pointers, the sizeof operator will give the size of a pointer and not the size of the whole string literal. Another important distinction is that if a string literal is modified through such a pointer, the behavior of the program is undefined.
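
A minimal sketch of the sizeof difference (the values in the comments assume a typical 64-bit platform):

#include <stdio.h>

const char *p = "three";    /* pointer to the string literal */
const char  a[] = "three";  /* array initialized with a copy of the literal */

int main() {
    printf("%zu\n", sizeof(p));  /* size of a pointer, e.g. 8 */
    printf("%zu\n", sizeof(a));  /* 6: five characters plus the terminating '\0' */
    return 0;
}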

Provided you can get away with using variable length arrays, there is a third way to pass a multidimensional array to a function. Using previously defined parameters to designate the dimensions of the inner array, the arr parameter becomes a pointer to a complete array-of-int type.

void list(int max_i, int max_j, int arr[][max_j]){
    /* ... */
    int x = arr[1][3];
}

The same approach works for higher-dimensional arrays as well; the first dimension is always converted to a pointer to an array. A similar conversion rule applies to function designators: if a function designator is not the operand of sizeof or of the unary & operator, its value is converted to a pointer to function. That is the reason why we do not have to use the & operator when passing a callback function.

#include <signal.h>

static void catch_int(int no) {
    /* ... */
};

int main(){
    signal(SIGINT, catch_int);

    /* ... */
}

Interpositioning

Interpositioning is the technique of replacing a function from a linked library with a custom implementation, without recompiling that library. It is even possible to interpose syscalls (more precisely, the library functions wrapping syscalls). Possible applications are sandboxing, debugging or performance-enhancing libraries. For demonstration, here is a simple library counting malloc calls for GNU/Linux.

/* _GNU_SOURCE is needed for RTLD_NEXT, GCC will not define it by default */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
#include <stdint.h>
#include <inttypes.h>

static uint32_t malloc_count = 0;
static uint64_t total = 0;

void summary(){
    fprintf(stderr, "malloc called: %u times\n", count);
    fprintf(stderr, "total allocated memory: %" PRIu64 " bytes\n", total);
}

void *malloc(size_t size){
    static void* (*real_malloc)(size_t) = NULL;
    void *ptr = 0;

    if(real_malloc == NULL){
        real_malloc = dlsym(RTLD_NEXT, "malloc");
        atexit(summary);
    }

    malloc_count++;
    total += size;

    return real_malloc(size);
}

The intention is to load this library during dynamic linking before libc.so, so that our implementation of malloc is the one linked when the binary runs. This can be achieved by setting the LD_PRELOAD environment variable to the full path of the library we want loaded first. It also ensures that calls made from other dynamically linked libraries end up calling our implementation of malloc. Since our objective is only to count calls, not to actually implement the allocation, we still need to call the "real" malloc. By passing the RTLD_NEXT pseudo-handle to dlsym we obtain a pointer to the next occurrence of malloc among the remaining dynamically linked libraries. The first time malloc is called, the libc implementation of malloc is obtained and the summary function is registered with atexit to be called on program termination. To see interpositioning in action on GNU/Linux (really, 184 times!):

$ gcc -shared -ldl -fPIC malloc_counter.c -o /tmp/libmcnt.so
$ export LD_PRELOAD="/tmp/libmcnt.so"
$ ps
  PID TTY          TIME CMD
 2758 pts/2    00:00:00 bash
 4371 pts/2    00:00:00 ps
malloc called: 184 times
total allocated memory: 302599 bytes

Symbol Visibility

Because all non-static functions are exported by default, interpositioning can be achieved unintentionally just by defining a function with the same signature as a function in another dynamically linked library or even another object file. An effective practice to prevent accidental interpositioning and polluting the exported function namespace is to define every function as static, provided it is not used beyond the boundary of its object file.

Another way to control symbol export from shared libraries is to use compiler extensions. Both GCC 4.x and Clang support the visibility attribute and the -fvisibility compiler argument for setting a global policy per object file. Here default means no modification of visibility, and hidden has an effect on visibility similar to static: the symbol will not be placed into the dynamic symbol table, so it will not be visible to other shared objects or executables.

#if __GNUC__ >= 4 || __clang__
  #define EXPORT_SYMBOL __attribute__ ((visibility ("default")))
  #define LOCAL_SYMBOL  __attribute__ ((visibility ("hidden")))
#else
  #define EXPORT_SYMBOL
  #define LOCAL_SYMBOL
#endif

The global visibility policy set by the compiler argument can be overridden locally with the visibility attribute. In practice the global policy is set to hidden, so all symbols are local by default and only those marked with __attribute__ ((visibility ("default"))) are exported.
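
Putting the macros above to use, a library compiled with -fvisibility=hidden can then mark just its public entry points (a minimal sketch; the function and file names are only illustrative and the EXPORT_SYMBOL macro from the previous listing is assumed to be in an included header):

/* lib.c -- gcc -shared -fPIC -fvisibility=hidden lib.c -o libdemo.so */

EXPORT_SYMBOL int lib_public_entry(int x) {
    return x * 2;          /* ends up in the dynamic symbol table */
}

int lib_internal_helper(int x) {
    return x + 1;          /* hidden by the global -fvisibility=hidden policy */
}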

Explicit Inlining

The code of a function can be integrated directly into the calling function instead of generating a stand-alone function object and a call to it. The compiler can be instructed to do so explicitly by using the inline specifier. According to section 6.7.4 of the C standard, the inline specifier only suggests to the compiler that calls to the function should "be as fast as possible" and "the extent to which such suggestions are effective is implementation-defined".

The simplest way to take advantage of inline functions is to define the function as static and place the definition into a header.

/* middle.h */
static inline int middle(int a, int b){
    return (b-a)/2;
}

A stand-alone object for the function still might be emitted, but it will not be visible outside of the translation unit. If such a header is included in multiple translation units, the compiler might emit a copy of the function for each unit. Thus it is possible that two variables carrying pointers to the same function name may not compare equal.

Another approach is to provide both an externally linkable and an inline version of the same function and let the compiler decide which one to use. That is actually how the inline specifier is defined:

If all of the file scope declarations for a function in a translation unit include the inline function specifier without extern, then the definition in that translation unit is an inline definition. An inline definition does not provide an external definition for the function, and does not forbid an external definition in another translation unit. An inline definition provides an alternative to an external definition, which a translator may use to implement any call to the function in the same translation unit. It is unspecified whether a call to the function uses the inline definition or the external definition.

To have both versions of a function we could place the following definition in a header:

/* middle.h */
inline int middle(int a, int b){
    return (b-a)/2;
}

Then in exactly one source file declare the function with the extern specifier to emit the externally linkable version in that translation unit:

#include "middle.h"
extern int middle(int a, int b);

The GCC implementation differs from this description. If a function is defined with the inline specifier alone, GCC always emits externally linkable object code, and only one such definition may exist in the program. If a function is defined with both the extern and inline specifiers, GCC will never emit externally linkable object code for it. Since GCC version 4.3 it is possible to use the -std=c99 option to enable the C99 rules for inline definitions; when the C99 rules are in effect, __GNUC_STDC_INLINE__ is defined. The previously described approach using static is not affected by GCC's interpretation of inline functions. If you need the approach with both the inline and the externally linkable version, consider the following solution:

/* global.h */
#ifndef INLINE
# if __GNUC__ && !__GNUC_STDC_INLINE__
#  define INLINE extern inline
# else
#  define INLINE inline
# endif
#endif

In the header with the function definition:

/* middle.h  */
#include "global.h"
INLINE int middle(int a, int b) {
  return (b-a)/2;
}

In exactly one source file:

#define INLINE
#include "middle.h

When function inlining has to be enforced, both the GCC and Clang compilers support the always_inline attribute for that purpose. In the following example a stand-alone function object is never emitted.

/* cdefs.h */
# define __always_inline   inline __attribute__((always_inline))

/* middle.h */
#include <cdefs.h>
static __always_inline int middle(int a, int b) {
  return (b-a)/2;
}

In case the compiler fails to inline the function, the compilation ends with an error. This approach is, for instance, used in the Linux kernel. The definition of __always_inline used above can be found in cdefs.h.

Vector Extensions

Many microprocessors (the x86 architecture in particular) provide Single-Instruction-Multiple-Data (SIMD) instruction sets enabling vector operations. To illustrate that, consider the following code:

#include <stdint.h>
#include <string.h>
#define SIZE 8
int16_t a[SIZE], b[SIZE];

void addtwo(){
    int16_t i = 0;

    while (i < SIZE) {
        a[i] = b[i] + 2;
        i++;
    }
}

int main(){
    addtwo();
    return a[0];
}

The loop in the function addtwo iterates 8 times, each time adding two to an element of the array b (16-bit signed integers) and storing the result in a. For the function addtwo the compiler will output the following (or similar) assembly code:

$ gcc -O2 auto.c -S -o auto_no.asm
addtwo:
.LFB22:
        .cfi_startproc
        movl    $0, %eax
.L2:
        movzwl  b(%rax), %edx
        addl    $2, %edx
        movw    %dx, a(%rax)
        addq    $2, %rax
        cmpq    $16, %rax
        jne     .L2
        rep
        ret
        .cfi_endproc

First a zero is written to the eax register (clearing rax). The label L2 marks the start of the loop. The first element of b is loaded into the lower 16 bits of the 32-bit edx register by the movzwl instruction; the rest of the edx register is filled with zeros. Then the addl instruction adds two to the value in the edx register. The result is copied from dx (the lower 16 bits of the edx register) to the corresponding element of a. Finally the rax register, which holds the offset for the array arithmetic, is incremented by 2 (representing 2 bytes, i.e. 16 bits) and compared with the total size of the array in bytes. If rax does not equal 16, execution jumps back to the L2 label, otherwise execution continues and the function returns.

The SSE2 instruction set provides the instruction paddw, which can add eight 16-bit integers at once. In fact most modern compilers are able to optimize code to use vector instructions such as paddw automatically. Clang has automatic vectorization enabled by default; GCC has the -ftree-vectorize switch, which is also enabled by -O3. The assembly code of the addtwo function optimized for vector instructions then looks very different:

$ gcc -O2 -msse -msse2 -ftree-vectorize -ftree-vectorizer-verbose=5 auto.c -S -o auto.asm
addtwo:
.LFB22:
        .cfi_startproc
        movdqa  .LC0(%rip), %xmm0
        paddw   b(%rip), %xmm0
        movdqa  %xmm0, a(%rip)
        ret
        .cfi_endproc

;...

.LC0:
        .value  2
        .value  2
        .value  2
        .value  2
        .value  2
        .value  2
        .value  2
        .value  2

The most notable difference is that the code handling the loop has vanished. First, eight 16-bit integers with value 2, labeled LC0, are loaded by movdqa into the xmm0 register. Then paddw adds each of the eight 16-bit elements of b to the appropriate element stored in xmm0. The result is written back to a and the function may return. The instruction movdqa can be used only on memory objects aligned to 16 bytes; its use indicates that the compiler was able to align the memory addresses of both arrays for better efficiency.

The size of the array does not have to be exactly 8 elements, but it has to be aligned (or padded if necessary) to 16 bytes so that the 128-bit vector can be used. It also might be a good idea to inline the function, especially when arrays are passed as arguments. Because arrays are converted into pointers, the addresses need to be aligned to 16 bytes; if the function is inlined, the compiler might be able to reduce the overhead of additional aligning.

#include <stdint.h>

void __always_inline addtwo(int16_t* a, int16_t *b, int16_t size){
    int16_t i;

    for (i = 0; i < size; i++) {
        a[i] = b[i] + 2;
    }
}

int main(){
    const int16_t size = 1024;
    int16_t a[size], b[size];

    addtwo(a, b, size);
    return a[0];
}

The loop iterates 1024 times, each time adding two to a 16-bit signed integer. Thus, applying vector operations, the number of loop iterations in this example can be reduced to 128. While this can also be done automatically, GCC makes it possible to define vector data types with the vector_size attribute and, by using them, instruct the compiler to use vector extensions explicitly. For illustration, here are the vector data types introduced with the SSE2 instruction set as defined in the emmintrin.h header.

/* SSE2 */
typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef long long __v2di __attribute__ ((__vector_size__ (16)));
typedef int __v4si __attribute__ ((__vector_size__ (16)));
typedef short __v8hi __attribute__ ((__vector_size__ (16)));
typedef char __v16qi __attribute__ ((__vector_size__ (16)));

And here is how the previous example can be rewritten to use vector instructions via the __v8hi type.

#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

static void __always_inline _addtwo(__v8hi *a, __v8hi *b, const int16_t sz){
    __v8hi c = {2,2,2,2,2,2,2,2};

    int16_t i;
    for (i = 0; i < sz; i++) {
        a[i] = b[i] + c;
    }
}

static void __always_inline addtwo(int16_t *a, int16_t *b, const int16_t sz){
    _addtwo((__v8hi *) a, (__v8hi *) b, sz/8);
}

int main(){
    const int16_t size = 1024;
    int16_t a[size], b[size];
    /* ... */

    addtwo(a, b, size);
    return a[0];
}

The trick is to convert the data to the appropriate type (in this case __v8hi) and adjust the rest of the code accordingly. The effect of the optimization may vary greatly depending on the type of operation and the size of the data to be processed. For the previous example the following times were measured when the addtwo function was called in a loop 100 million times (the test loop is shown after the table):

Compiler                        Time
gcc 4.5.4 O2                    1m 5.3s
gcc 4.5.4 O2 auto-vectorized    12.7s
gcc 4.5.4 O2 manual             8.9s
gcc 4.7.3 O2 auto-vectorized    25s
gcc 4.7.3 O2 manual             8.9s
clang 3.3 O3 auto-vectorized    8.1s
clang 3.3 O3 manual             9.5s

The faster time for the auto-vectorized code with Clang is probably caused by better optimization of the outer loop used for testing. Behind the slower time for GCC 4.7.3 is less efficient memory aligning (see later).

int32_t i;
for(i=0; i < 100000000; i++){
    addtwo(a, b, size);
}

Using Intrinsic Functions

Both the GCC and Clang compilers also provide built-in functions, also called intrinsic functions, to invoke assembly instructions explicitly. The actual intrinsics are very compiler specific. For the x86 platform both compilers provide headers with definitions made to match the Intel compiler intrinsics, accessible through x86intrin.h. Here is a list of header files defining intrinsics for particular instruction sets:

  • MMX: mmintrin.h
  • SSE: xmmintrin.h
  • SSE2: emmintrin.h
  • SSE3: pmmintrin.h
  • SSSE3: tmmintrin.h
  • 3DNow!: mm3dnow.h
  • AVX: immintrin.h

This is how the previous example could be modified using intrinsic functions:

#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

static void __always_inline addtwo(int16_t *a, int16_t *b, int16_t size){

    int16_t i;
    __m128i c = _mm_set1_epi16(2);

    for (i = 0; i < size; i+=8) {
        __m128i bb = _mm_loadu_si128((__m128i const *)(b+i));  // movdqu b+i -> xmm0
        __m128i r = _mm_add_epi16(bb, c);                       // paddw c + xmm0 -> xmm0
        _mm_storeu_si128((__m128i *)(a+i), r);                  // movdqu xmm0 -> a+i
    }
}

int main(){
    const int16_t size = 1024;
    int16_t a[size], b[size];
    /* ... */

    addtwo(a, b, size);
    return a[0];
}

This approach might be required when the compiler generates suboptimal code, or when it is not possible to express the required operations on vector types, e.g. due to branching of the code with if statements.

Memory Aligning

Also notice that the last example uses _mm_loadu_si128, which translates to movdqu, not movdqa. That is because there is no assurance that a or b are aligned to a 16-byte boundary. Using instructions that expect aligned memory objects on unaligned objects will almost surely lead to runtime errors or data corruption. To fix that, the aligned attribute can be used to instruct the compiler to align memory objects upon definition. In some cases it might be worth considering aligning critical data to 64 bytes, since that is also the size of x86 L1 cache lines, to increase cache utilization.

#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

static void __always_inline addtwo(int16_t *a, int16_t *b, int16_t size){

    int16_t i;
    __m128i c __attribute__((aligned(16))) = _mm_set1_epi16(2);

    for (i = 0; i < size; i+=8) {
        __m128i bb = _mm_load_si128((__m128i const *)(b+i));  // movdqa b+i -> xmm0
        __m128i r = _mm_add_epi16(bb, c);                      // paddw c + xmm0 -> xmm0
        _mm_store_si128((__m128i *)(a+i), r);                  // movdqa xmm0 -> a+i
    }
}

int main(){
    const int16_t size = 1024;
    int16_t a[size] __attribute__((aligned(16)));
    int16_t b[size] __attribute__((aligned(16)));
    /* ... */

    addtwo(a, b, size);
    return a[0];
}

In terms of program speed it is better to use automatic variables than static or global variables, and to avoid dynamic allocation where circumstances allow. For cases when dynamic allocation is inevitable, the POSIX standard provides posix_memalign and Windows has _aligned_malloc for aligned memory allocation.
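
A minimal sketch of aligned dynamic allocation with posix_memalign (the Windows counterpart _aligned_malloc is analogous; the buffer size here is only illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *buf = NULL;

    /* request 1024 bytes aligned to a 16-byte boundary */
    int err = posix_memalign(&buf, 16, 1024);
    if (err != 0) {
        fprintf(stderr, "posix_memalign failed: %d\n", err);
        return 1;
    }

    printf("aligned buffer at %p\n", buf);
    free(buf);   /* memory from posix_memalign is released with free */
    return 0;
}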

Effective use of vector extensions and code optimization requires deep knowledge of how the target architecture works and which assembly instructions make the code run faster. An invaluable source of information on this topic is Agner Fog's CPU blog and his set of optimization manuals.

Curiosities

The final section covers a few amusing parts of the C programming language, like this one:

array[i] == i[array];

Since the subscript operator is equivalent to *(array+i) and addition is commutative, the two expressions above are equivalent.

$ gcc -dM -E - < /dev/null | grep -e linux -e unix
#define unix 1
#define linux 1

GCC by default defines both linux and unix to 1, so code will break if one of them is used, e.g., as the name of a function.

int x = 'FOO!';
short y = 'BO';

Yes, a character constant can be extended to any integer size.

x = i+++k;
x = i++ +k;

The postfix increment operator is evaluated before the addition operator.

x = i+++++k; //error
x = i++ ++ +k; //error

y = i++ + ++k; //ok

The lexical processor looks for the longest sequence of non-whitespace characters that can form a token (section 6.4 of the C standard). The first line is therefore parsed as the second line shows, and both yield an error about a missing modifiable lvalue to which the second postfix ++ operator could be applied.

Acknowledgements

If you have something to add or correct, you are more than welcome to leave a comment. Here I would like to thank Stefan Bellus for reviewing the draft of this article and Greg Davis for his talks at the Embedded World 2013 Conference, which inspired me to write it.

Updates

  • Error in realloc example corrected.
  • Integer types corrected to unsigned to hold value 65535 in integral promotion demonstration.
  • Misleading statement comparing visibility hidden attribute with static declaration corrected.
  • Numerous errors in vector example corrected.
  • Redundant size < SIZE_MAX removed.
