Writing a toy allocator in 200 lines of C

2026-06-14T00:00:00Z

Every few years I talk myself into writing a memory allocator from scratch. It is the kind of project that looks trivial from a distance — bump a pointer, hand back the bytes — and reveals a surprising amount of texture the moment you ask it to do anything real. This post walks through a small one, roughly two hundred lines, and the handful of decisions that mattered most.

The goal here is not to beat malloc. The goal is to understand the shape of the problem: how free lists fragment, why alignment is load-bearing, and where the real cost hides once you start measuring.

Start with the arena

The simplest useful primitive is an arena: a single contiguous block you carve allocations out of by advancing an offset. There is no per-allocation bookkeeping and freeing is a no-op until you reset the whole thing. It is fast precisely because it refuses to answer the hard question.

void *arena_alloc(Arena *a, size_t n, size_t align) {
    uintptr_t p = (uintptr_t)a->cursor;
    uintptr_t aligned = (p + (align - 1)) & ~(align - 1);
    if (aligned + n > a->end) return NULL;   // out of room
    a->cursor = (char *)(aligned + n);
    return (void *)aligned;
}

Three lines of arithmetic and one branch. The alignment mask is the part people get wrong: round the address up to the next multiple of align, then check that the request still fits. Skip it and you will spend an afternoon chasing a bus error on the one platform that cares.

Where it gets interesting

An arena is wonderful right up until you need to free individual objects. The moment you do, you are back in the land of free lists, size classes, and coalescing — and every one of them is a trade between speed, fragmentation, and how much metadata you are willing to carry.

Free lists are O(1) to push and pop, but a naive one fragments badly under mixed sizes.
Size classes trade a little internal waste for far less external fragmentation.
Coalescing adjacent free blocks fights fragmentation but adds a footer to every allocation.

The allocator you actually want depends entirely on your allocation pattern — and you will not know that pattern until you measure it.

Which is the real lesson. Measure first.

Why your p99 latency lies to you

2026-05-02T00:00:00Z

Tail latency is the number everyone quotes and almost nobody computes correctly. We put p99 on dashboards, alert on it, and write it into SLOs — and then measure it in a way that quietly removes the worst cases from the sample.

Coordinated omission

The classic mistake is sending a request, waiting for the response, and only then sending the next one. When the system stalls, your load generator stalls with it, so the long pause never produces the flood of slow samples it should have. You measured the system's good behavior and called it the tail.

The fix is to decouple request scheduling from response timing: issue requests on a fixed schedule and record latency against the time the request should have started, not when you got around to sending it.

Measure the thing you promised