heterogeneous lists in c

Programming

heterogeneous lists in c

Inspired by recent journey into lisp and scheme, I thought I’d have a crack at bringing some lisp-like heterogeneous lists into c. To compliment this, I’ll also implement some lisp primitives, namely cons, car and cdr, for list operations.

Up for discussion: are the lists in this post truly heterogeneous? In fact, each atom is the same type, g_cell, does this make our lists homogeneous instead? We rely on type switching to get anything done. And it is not trivial to extend the implementation for new types. Keep these limitations in mind…

First, what is a heterogeneous list? We’ll first describe its opposite, a homogeneous list. A homogeneous is composed of items (or cells) which each possess the same type. For some light reading – Introduction to Linked Lists, this Haskell discussion (particularly the section on Algebraic Datatypes. An example follows

In the below, we notate a list with parenthesis. We make use of integers, which appear as numbers, and of strings, which appear within quotation marks. Items within a list are separated by whitespace.

[1] Legal:
  (1 2 3 42)
[2] Legal:
  ("What is" "the meaning of life" ", the universe and everything")
[3] Illegal:
  ("What is the meaning of life, the universe, and everything?" 42)

[1] and [2] are legal homogeneous lists because each cell within the list is the same type (integers in the former and strings in the latter). [3] is not legal, the items contained within the list are not all of the same type.

We have two roadblocks to implementing a heterogeneous list in c. The first is related to storage. We must be able to store each ‘type’ of variable in a type independent manner. This can be done with void pointers, void*. The second challenge that arises is performing operations on members of a list. c does not allow polymorphic functions, to get around this we maintain the type of each list member. Our ‘generic’ functions operate by first testing the type of the element it is operating on, then selecting the suitable function.

Step one is to define our top-level structure, the g_cell. A cell is merely a building block out of which our more complicated structures are born. Cells are primitives in lisp, read more about them here and here. These references also mention some functions and concepts we’ll define later.

typedef enum _g_type {
    g_i32,
    g_str,
    g_list;
} g_type;

typedef struct _g_cell {
    int atom_flag; // to mark if g_cell is an atom
    g_type type;
    void *data;
    struct _g_cell *next;
} g_cell;

g_type lists the types that our heterogeneous list may contain. For now we limit this to i32’s, str’s and lists. A g_cell contains information on the cells type, a pointer to the data it contains and a pointer to the next cell. atom_flag is a convenience to indicate whether a g_cell is an atom. More on atoms shortly.

The type for data is a pointer to void. This is the c of implementing generics, but it has its drawbacks. When we want to use the data, we have to cast the void* back to its intended type before using it. An example of this, and an explanation of the rather arcane invocation to get at the int held by data, follows:

int *i = malloc(sizeof(int));
*i = 42;
g_cell *g = malloc(sizeof(g_cell));
g->data = i;
printf("g->data is: %d\n", *((int*)g->data) );

// g->data           : a void* to data
// (int*)g->data     : cast the void* to int*
// *((int*)g->data)  : de-reference the int* to get an int

everything is a cell

Another interesting quirk about our implementation is the simple definition that everything is a cell . we provide two additional, higher level, definitions for an atom and a list.

list – a list is a cell whose type is g_list, whose data points to a either a cell or NULL, and whose next points to either cell which is also a list or NULL.

atom – an atom is a cell which is not a list and whose next field is set to NULL.

The following diagram helps to explain.

Both lists and atoms are constructed from cells.

primitives

On to defining some primitive functions to operate on cells

g_cell *make_empty_list(){
	g_cell *l = malloc( sizeof(g_type) );
	l->type = g_list;
	l->flag = 0; //not an atom
	l->data = NULL;
	l->next = NULL;
	return l;
}

g_cell *make_empty_atom(){
	g_cell *a = malloc( sizeof(g_type) );
	a->flag = 1; //is an atom
	a->data = NULL;
	a->next = NULL;
	return a;
}

g_cell *make_atom( g_type t, void *data ){
	g_cell *a = make_empty_atom();
	a->type = t;
	a->data = data;
	return a;
}

The above functions are self explanatory, making empty lists and atoms. The final function, make_atom is the most interesting for now. It receives a void*, allowing generics. The g_type t is the promised type of the void*. There are no checks to ensure that the chosen g_type is appropriate for the incoming data .

Note that the above functions malloc. The allocated memory will later need to be free‘d. It follows that for each g_type that we define, we need a specialised function for free‘ing the data type. I’ve omitted these definitions for now, we’ll return to that later.

// Test if the g_cell is a list
bool is_list( g_cell *g ){
	//return !g->flag;
	return ( ctr(g) == g_list );
}

// Test if the g_cell is a list and empty
bool is_empty_list( g_cell *l ){
	return ( is_list(l) && car( l ) == NULL );
}

// Test if the g_cell is an atom
bool is_atom( g_cell *g ){
	return g->flag;
}

The above defines some predicates for testing the identity of lists and atoms. The functions ctr and car, as well as cdr, are defined below

// Return the type of the g_cell
g_type ctr( g_cell *g ){
	return g->type;
}

// Return the first element in list
g_cell *car( g_cell *l ){
	if( !is_list( l ) ){
		fprintf(stderr, "car: attempt to car on g_cell which is not a list\n");
		exit(EXIT_FAILURE);
	}
	return l->data;
}

// Returns the next node in the list
g_cell *cdr( g_cell *l ){
	if( !is_list( l ) ){
		fprintf(stderr, "cdr: attempt to cdr on g_cell which is not a list\n");
		exit(EXIT_FAILURE);
	}
	return l->next;
}

Finally, we define cons to construct a list

// Append atom to top of list
void cons( g_cell *atom, g_cell **list ){
	if( !is_list( *list ) ){
		fprintf(stderr, "cons: argument 2 must be a list\n");
		exit(EXIT_FAILURE);
	}
	if( is_empty_list( *list ) ){
		//put new data into the current list node
		(*list)->data = atom;
		return;
	} else {
		g_cell *new = make_empty_list();
		new->data = atom;
		new->next = *list;
		*list = new;
		return;
	}
}

Actually, the first argument to cons need not be an atom. We can happily pass in a second list.

That’s about all we need for a few test cases:

int main(int argc, char *argv[]){
	int *a = malloc(sizeof(int));
	*a = 42;
	char *b = "Hello, world!";

	g_cell *l1 = make_empty_list();
	cons( make_atom(g_i32, a), &l1 );
	cons( make_atom(g_str, b), &l1 );

	printf("car(l1) is: %s\n", (char *)(car(l1)->data));
	printf("car(cdr(l1)) is: %d\n", *((int *)car(cdr(l1))->data));
}

Our cons function receives the address of the list, notated by &l1. This is necessary as the head of the list must change when adding a new atom to the list.

Notice the contortions required to print atoms retrieved from the list? The following defines a series of functions to ease this:

// Return a freshly allocated cstring representation of a g_32
char *g_i32_to_str_alloc( g_cell *g ){
	if( !is_atom_i32(g) ){
		fprintf(stderr, "attempt to g_i32_to_str g_cell which is not an i32\n");
		exit(EXIT_FAILURE);
	}
	int i = *((int*)g->data);
	int length = snprintf(NULL, 0, "%d", i);
	char *s = malloc(length + 1);
	snprintf(s, length + 1, "%d", i);
	return s;
}

// Return a freshly allocated cstring copy of a g_str
char *g_str_to_str_alloc( g_cell *g ){
	char *sold = (char*)g->data;
	int length = strlen(sold);
	char *snew = malloc( length + 1);
	strncpy(snew, sold, length + 1);
	return snew;
}

// Return a freshly allocated cstring representation of an atom
char *atom_to_str_alloc( g_cell *a ){
	if( !is_atom( a ) ){
		fprintf(stderr, "atom_to_str_alloc: argument 1 must be an atom\n");
		exit(EXIT_FAILURE);
	}
	char *s;
	switch( ctr(a) ){
		case g_i32:
			s = g_i32_to_str_alloc( a );
			break;
		case g_str:
			s = g_str_to_str_alloc( a );
			break;
		default:
			fprintf(stderr, "atom_to_str_alloc: not implemented for type\n");
			exit(EXIT_FAILURE);
	}
	return s;
}

We define specialised functions to return a valid c string given each g_type. The atom_to_str_alloc function inspects the incoming atom type and executes a function accordingly. In all cases, we return a copy of the string (rather than merely the pointer in the case of g_str), this is so that the results can be free’d in a consistent manner.

Furthering this, we define a set functions to print a list. Notice how these functions are defined in terms of our primitives.

// Print all nodes in a list
void print_list( g_cell *l ){
	if( !is_list( l ) ){
	        fprintf(stderr, "print_list: argument 1 must be of g_type g_list\n");
		exit(EXIT_FAILURE);
	}
        printf("(");
	print_list_helper( l );
	printf(")");
}

// Recursively print nodes in a list
void print_list_helper( g_cell *l ){
	if( l == NULL )
		return;
	// g is the cell contained by l
	g_cell *g = l->data;
	if( g == NULL )
		;  // break early
	else if( ctr(g) == g_list )
		print_list(g);
	else
		print_g_cell(g);
	if( cdr(l) != NULL )
		printf(" ");
	print_list_helper( cdr(l) );
}

// Print the contents of a g_cell/atom
void print_g_cell( g_cell *g ){
	if( g == NULL )
		return;
	if( ctr(g) == g_str )
		printf("\"");
	char *tmp = atom_to_str_alloc(g);
	printf("%s", tmp);
	free(tmp);
	if( ctr(g) == g_str )
		printf("\"");
}

Notice that a g_str is printed surrounded in quotes. It would be wise to abstract this out so that future types can have special printed representations (such as a g_vec being surrounced by ‘[‘ and ‘]’). For now this is adequate.

The top level print_listfunction is exposed to the user. It merely checks if the g_cell is a list and prints the surrounding quotes before calling the recursive print_list_helper function. print_list_helper creates a c_cell pointer g, this points to the data element of the list node. If ctr(g) is a g_list, we call print_list to print its elements. Otherwise, we call print_g_cell.

print_g_cell creates a temporary c string; we rely on the our previously defined atom_to_str_alloc to allocate the memory, determine the atom’s type and return the correct string representation. Once printed, we free the memory.

Let’s try our main function again

int main(int argc, char *argv[]){
	int *a = malloc(sizeof(int));
	*a = 42;
	char *b = "Hello, world!";
	int *c = malloc(sizeof(int));
	*c = 7654321;
	char *d = "list";
	char *e = "list within";

	g_cell *l1 = make_empty_list();
        g_cell *l2 = make_empty_list();

	cons( make_atom(g_i32, a), &l1 );
	cons( make_atom(g_str, b), &l1 );

	cons( make_atom(g_str, d), &l2 );
	cons( make_atom(g_str, e), &l2 );
	cons( make_atom(g_i32, c), &l2 );

	printf("Printing list 1\n");
	print_list( l1 );
	printf("\n");

	printf("Printing list 2\n");
	print_list( l2 );
	printf("\n");

	printf("Printings cons(l2, &l1)\n");
	cons( l2, &l1 );
	print_list(l1);
	printf("\n");

	return 0;
}

Some outstanding pain points with our API

  • We need to allocate memory for our atoms. We should use g_type specific functions for allocating memory, and keep these pointers out of the users program.
  • We make our atom’s one at a time. This is tedious, we should implement a function that can make an arbitrary number of atoms.
  • Everything is a g_cell. We have invented methods to deal with this fact up until now, but it is not without its subtleties. In each list function, we must check if the g_cell we’ve been given is in fact a list. By our current definition, if we don’t perform these checks, then the same functions can be applied to atoms. This behavior is sometimes desired, but we need a more robust method of handling this (see warning 3 below for an example of unexpected behavior)
  • It’s minor, but the g_type is internally just an integer, making debugging difficult (see warning 3 below, we know we are misapplying types, but what type is the g_cell passed to that function?

We’ll attempt to answer these shortcomings in future posts.

warnings

Right now there’s nothing stopping us from doing some nasty things, beware of the following:

//// Warning, some of these will produce infinite loops
//// Keep ctrl-c handy if executing this code.
// Malformed atoms
g_cell *l1 = make_empty_list();
cons( make_atom(g_i32, &"not an int"), &l1 );
print_list(l1);
// Nothing stops us from misapplying types

// Cyclical list
g_cell *l1 = make_empty_list();
cons(l1, &l1);   // creates a cyclical list
print_list(l1);
// Not all cyclical lists are evil. We'll explore ways to take
//  advantage of them in a later post.

// Malformed list
g_cell *l1 = make_empty_list();
g_cell *a1 = make_atom(str, &"string");
l1->next = a1; // sets the next list node to something not a list
print_list(l1);
// "atom_to_str_alloc: not implemented for type"
// we misapplied a function

glist.h

The following declarations are needed to compile the examples. Include these at the top of your glist.c or create a glist.h

#include <string.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>

typedef enum _g_type {
	g_i32,
	g_str,
	g_list
} g_type;	

typedef struct _g_cell {
	int flag;
	g_type type;
	void *data;
	struct _g_cell *next;
} g_cell;

g_cell *make_empty_list();
g_cell *make_empty_atom();
g_cell *make_atom(g_type t, void *data);

g_type ctr( g_cell *g );
g_cell *car( g_cell *l );
g_cell *cdr( g_cell *l );
void cons( g_cell *a, g_cell **l );

bool is_list( g_cell *g );
bool is_empty_list( g_cell *g );
bool is_atom( g_cell *g );

char *g_i32_to_str_alloc( g_cell *g );
char *g_str_to_str_alloc( g_cell *g );
char *atom_to_str_alloc( g_cell *g );

void print_g_cell( g_cell *g );
void print_list( g_cell *l );
void print_list_helper( g_cell *l );

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.