Clean pointer serialization in C

January 21st, 2024

Let’s say you’re programming in C and need to serialize a data structure to a storage format on disk. The storage format is not a perfect 1:1 representation of your in-memory data; it could be, for example, a custom 3D model format for your game engine.

The problem

It’s common to organize a file format as headers, followed by the actual data. The headers contain relative file offsets to actual data arrays stored later in the file, which are converted to absolute pointers at deserialization time.
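To illustrate the load-time half of that scheme, here is a minimal sketch of turning stored offsets back into pointers. The `disk_header` layout and the `resolve_offset` and `load_pointers` helpers are hypothetical names made up for this example. It assumes offsets are relative to the start of the file and that the file uses the host's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical on-disk header: each offset is relative to the start of the file.
struct disk_header {
    uint32_t data1_offs;
    uint32_t data2_offs;
    uint32_t some_number;
};

// Turns a relative file offset into an absolute pointer into the loaded buffer.
// Returns NULL if the offset points outside the buffer.
static uint8_t *resolve_offset(uint8_t *file_buf, size_t file_size, uint32_t offs) {
    assert(file_buf != NULL);
    if (offs >= file_size) return NULL;
    return file_buf + offs;
}

// Reads the header from the start of the buffer and resolves both data pointers.
// Returns 0 on success, -1 if the buffer is too small or an offset is out of range.
static int load_pointers(uint8_t *buf, size_t size, uint8_t **data1, uint8_t **data2) {
    struct disk_header hdr;
    if (size < sizeof hdr) return -1;
    memcpy(&hdr, buf, sizeof hdr);  // assumes host byte order, for brevity
    *data1 = resolve_offset(buf, size, hdr.data1_offs);
    *data2 = resolve_offset(buf, size, hdr.data2_offs);
    return (*data1 && *data2) ? 0 : -1;
}
```

The bounds check matters because the offsets come from a file and a corrupt file should not produce a wild pointer.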

When writing the header, you might not yet know where the actual data will lie in the file. So you write a placeholder value in the header, then other data, and finally seek back to the header and replace that placeholder with the relative offset in the file once you know it.

You get repetitive code like this:

// The header layout
struct file_format {
    uint8_t *data1;
    uint8_t *data2;
    uint32_t some_number;
};

void write_file(struct file_format *obj, FILE *out) {
    // write the header
    uint32_t header_ptr_offs = ftell(out);
    w32(out, 0);   // the placeholder for *data1
    w32(out, 0);   // the placeholder for *data2

    // this could be much more stuff
    w32(out, obj->some_number);

    // finally we write data1
    uint32_t data_offs = ftell(out);
    write_data(obj->data1, out);

    // seek back to header, update, seek back where we were
    fseek(out, header_ptr_offs, SEEK_SET);
    w32(out, data_offs);
    fseek(out, 0, SEEK_END);

    // do the same for data2 and possibly other bulk data fields too
}

The w32 function writes a 32-bit unsigned integer to a stream. The problem we are trying to solve is to hide all the bookkeeping we did with the header_ptr_offs and data_offs variables that have nothing to do with the actual file format.
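For reference, a `w32` along these lines could look like the sketch below. The big-endian byte order is an assumption made for illustration; use whatever your file format specifies.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

// Writes v to the stream as four bytes, most significant byte first.
// Big-endian is an assumption here, not something the format above dictates.
static void w32(FILE *out, uint32_t v) {
    uint8_t bytes[4] = {
        (uint8_t)(v >> 24), (uint8_t)(v >> 16), (uint8_t)(v >> 8), (uint8_t)v
    };
    size_t written = fwrite(bytes, 1, sizeof bytes, out);
    assert(written == sizeof bytes);  // a real tool should report the error instead
    (void)written;                    // avoids an unused warning when NDEBUG is set
}
```

Writing the bytes one by one keeps the output independent of the byte order of the machine running the tool.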

The solution discussed here is an (object_id -> file_offset) mapping kept in a hash table.

A worked example with strings

Let’s switch to a 3D model writer from libdragon as an example. We have a header struct that has a pointer to “mesh_data” that has a list of meshes:

typedef struct {
    mesh_data_t *mesh_data; // An indirection added to make a point
    /* ... */
} model64_data_t;

typedef struct {
    uint32_t num_meshes; // Number of meshes
    mesh_t *meshes;      // Pointer to the first mesh
} mesh_data_t;

It doesn’t really matter what the mesh_t struct contains (a bunch of triangles) because we just want to see how it’s referenced.

Let’s first serialize the header:

void write_header(FILE *out) {
    w32_placeholderf(out, "meshes"); // Write a placeholder integer to file
    /* other fields */
}

The w32_placeholderf call stores the current writing offset (ftell(out)) under the key “meshes” in our global hash map, and writes the same zero placeholder value we wrote by hand earlier.

Then later in a write_meshes function we replace that placeholder with an actual offset in the file. Now you also see that the placeholder names are actually formatted strings! This way you can keep track of offsets of a varying number of items:

void write_meshes(FILE *out) {
    placeholder_set(out, "meshes"); // Seeks back to the header and writes
                                    // the offset of the "meshes" section,
                                    // the current stream position, that is.
    
    w32(out, model->num_meshes);    // Write the number of meshes

    for(uint32_t i=0; i<model->num_meshes; i++) {
        // Stores the offset of this mesh's pointer and writes a placeholder value to file
        w32_placeholderf(out, "mesh%d", i);
    }
}

Later when it’s time to actually go and write the mesh data, which may vary in size, we go and update the placeholder values set earlier:

for(uint32_t i=0; i<model->num_meshes; i++) {
    placeholder_set(out, "mesh%d", i);
    write_mesh_data(meshes[i]);
}

That’s how it works. Maybe it doesn’t seem like a lot but it does clean up the code when you have many nested objects which may vary in numbers. Note that the meshes are still referenced by integer indices elsewhere in the file.

Do you even need string keys?

No, you don’t. If you know the input data pointers stay constant while the file is being written, you can use them as keys instead. Thanks to commenter david_chisnall for pointing this out.
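A minimal sketch of the pointer-keyed variant follows. It uses a small fixed-size table with linear search in place of a real hash map, and the names `ptr_slot`, `ptr_placeholder`, and `ptr_placeholder_set` are hypothetical. The `w32` here is an assumed big-endian writer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_SLOTS 64  // arbitrary limit for this sketch

// Maps the address of an input object to the file offset of its placeholder.
struct ptr_slot { const void *key; long offset; };
static struct ptr_slot slots[MAX_SLOTS];
static int num_slots;

static void w32(FILE *out, uint32_t v) {  // big-endian is an assumption
    uint8_t b[4] = { (uint8_t)(v >> 24), (uint8_t)(v >> 16), (uint8_t)(v >> 8), (uint8_t)v };
    fwrite(b, 1, sizeof b, out);
}

// Remembers where the placeholder for obj lives and writes a zero there.
static void ptr_placeholder(FILE *out, const void *obj) {
    assert(num_slots < MAX_SLOTS);
    slots[num_slots].key = obj;
    slots[num_slots].offset = ftell(out);
    num_slots++;
    w32(out, 0);
}

// Patches the placeholder for obj with the current position, then seeks back
// so that writing continues where it left off.
static void ptr_placeholder_set(FILE *out, const void *obj) {
    long here = ftell(out);
    for (int i = 0; i < num_slots; i++) {
        if (slots[i].key != obj) continue;
        fseek(out, slots[i].offset, SEEK_SET);
        w32(out, (uint32_t)here);
        fseek(out, here, SEEK_SET);
        return;
    }
    assert(!"no placeholder registered for this pointer");
}
```

The key is only compared, never dereferenced, so any stable address works: `ptr_placeholder(out, obj->data1)` in the header and `ptr_placeholder_set(out, obj->data1)` right before the data is written.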

So do the string keys make sense at all?

In the above example the data structures come from a 3rd party library. Relying on the identity of pointers malloc’d by other people’s code makes me nervous, though it would probably still work just fine. And since what the code is doing is a conversion to a storage format, I think it’s more fitting to refer to objects in terms of the output file and not the input data structure.

To be clear, I didn’t write the original code in question :)

Implementation notes

If you were writing, say, JavaScript, it really wouldn’t be anything new to do

offsets[`skin${skin_idx}`] = current_offset;

but I find it novel in C. Behind the scenes this library uses vasprintf to apply formatting and uses it as a key to store the file offset in a stb_ds hash table.

Also, the offset keys are arbitrary strings so you can put in more complex formats as well. One example is "mesh%d_primitive%d_position" which stores the offset of a field of one “primitive” of a mesh.
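To make the mechanism concrete, here is a minimal sketch of the two calls. It is not the library's code: it swaps in `vsnprintf` with a fixed-size key buffer for `vasprintf`, and a small linear-search array for the stb_ds hash table, so that it depends only on the standard library. The table size, the key length limit, the `format_key` helper, and the big-endian `w32` are all assumptions.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_PLACEHOLDERS 256  // arbitrary limits for this sketch
#define MAX_KEY_LEN 64

// One entry per placeholder: the formatted key and where its 32-bit slot lives.
struct placeholder { char key[MAX_KEY_LEN]; long offset; };
static struct placeholder table[MAX_PLACEHOLDERS];
static int table_len;

static void w32(FILE *out, uint32_t v) {  // big-endian is an assumption
    uint8_t b[4] = { (uint8_t)(v >> 24), (uint8_t)(v >> 16), (uint8_t)(v >> 8), (uint8_t)v };
    fwrite(b, 1, sizeof b, out);
}

// Expands the printf-style format into key and asserts that it fits.
static void format_key(char *key, const char *fmt, va_list args) {
    int n = vsnprintf(key, MAX_KEY_LEN, fmt, args);
    assert(n >= 0 && n < MAX_KEY_LEN);
    (void)n;
}

// Records the current offset under the formatted key and writes a zero placeholder.
static void w32_placeholderf(FILE *out, const char *fmt, ...) {
    assert(table_len < MAX_PLACEHOLDERS);
    va_list args;
    va_start(args, fmt);
    format_key(table[table_len].key, fmt, args);
    va_end(args);
    table[table_len].offset = ftell(out);
    table_len++;
    w32(out, 0);
}

// Overwrites the placeholder stored under the formatted key with the current
// offset, then seeks back so that writing continues where it left off.
static void placeholder_set(FILE *out, const char *fmt, ...) {
    char key[MAX_KEY_LEN];
    va_list args;
    va_start(args, fmt);
    format_key(key, fmt, args);
    va_end(args);

    long here = ftell(out);
    for (int i = 0; i < table_len; i++) {
        if (strcmp(table[i].key, key) != 0) continue;
        fseek(out, table[i].offset, SEEK_SET);
        w32(out, (uint32_t)here);
        fseek(out, here, SEEK_SET);
        return;
    }
    assert(!"placeholder_set: unknown key");
}
```

With this in place, a composite key works like any other: `w32_placeholderf(out, "mesh%d_primitive%d_position", m, p)` reserves the slot, and a later `placeholder_set` call with the same format and arguments fills it in.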

Thanks to Rasky for introducing me to this technique. Thanks to mankeli, shaiggon, and david_chisnall for comments. I added the “problem” and “do you need string keys” sections and clarified integer indexing after publishing this based on above discussions.