A new hash table for Lwan
2026-05-06
For a long time, Lwan used a heavily modified version of the hash table from the kmod project. I was lightly involved with that project during its inception, so it seemed like a natural choice. Over the years, that hash table proved inefficient for Lwan's use cases, so I began patching it to improve its performance; I succeeded in some cases, but the added complexity and years of technical debt eventually caught up with me, and things started failing.
I had wanted to try a different way of implementing a hash table, possibly using some kind of linear probing, for a good while, so that's what I did.
I was inspired by Rust's hashbrown, which has been the default hash table in its standard library for a good while. That hash table, in turn, was inspired by Abseil's SwissTable. Even the Go programming language adopted a variation of this design in 2025.
I wanted to make Lwan's new hash table as portable as possible; however, Swiss Tables require SIMD instructions to perform acceptably. As the following diagram shows, a Swiss Table is laid out like so:
It can have multiple groups, chained together;
Each group contains a Control Word and the slots containing the actual items;
The nth byte in the Control Word represents a Control Byte for the nth slot.
[Diagram: Swiss Table layout. Group 0 through Group N chained together; each group has a Control Word and its Slots, with control bytes marking each slot as Empty, In Use, or Deleted.]
The result of the hash function is split in two: H1, which chooses the first group where an item might be found or placed; and H2, which is stored in the Control Word. SIMD is used to determine, in parallel and with minimal instructions, which slots in a group match the required H2 (or are empty or deleted). Even though it performs a linear probe, it does so very efficiently: not only is it cache friendly, it also exploits hardware capabilities that are rarely used in data structures.
Note: the H2 hash has only 7 bits, with the remaining bit signaling whether an item is empty, in use, or deleted. The distinction between deleted and empty serves a few purposes, including reducing fragmentation (which would otherwise require constant rehashes) and avoiding straying too far from the original H1 % number_of_groups-th group.
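To make the H1/H2 split concrete, here is a small sketch of how a SwissTable-style scheme carves up a hash. The exact sentinel values and shift amounts are modeled on Abseil's implementation but should be treated as assumptions for illustration; the key property is that a control byte with the high bit set is a special state, so the 7-bit H2 of an in-use slot can never collide with Empty or Deleted:

```c
#include <assert.h>
#include <stdint.h>

/* Special control-byte states have the high bit set (values modeled
 * on Abseil's SwissTable; treat them as an assumption here). */
#define CTRL_EMPTY 0x80u   /* 0b10000000 */
#define CTRL_DELETED 0xfeu /* 0b11111110 */

/* H1 picks the starting group; H2 (7 bits) is stored in the Control Word. */
static inline uint64_t swiss_h1(uint64_t hash) { return hash >> 7; }
static inline uint8_t swiss_h2(uint64_t hash) { return hash & 0x7f; }
```

Because H2 always has a clear high bit, a SIMD byte-compare against a Control Word can never mistake an Empty or Deleted slot for a match.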
Using SIMD directly was ruled out as not being portable enough, even though SWAR could have been used where SIMD was unavailable (or on architectures I wasn't familiar with or didn't have the hardware to test on). In its place, the standard C function memchr() was used. This function is usually implemented with SIMD instructions anyway, although with higher fixed costs that wouldn't exist if SIMD were used directly, as will become clear in the code snippets below.
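The idea can be illustrated with a few lines: memchr() scans a run of control bytes for a candidate tophash, acting as a portable stand-in for a SIMD byte-compare. The helper name and the array contents here are made up for the example:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Find the first slot at or after `start` whose control byte matches
 * `tophash`, letting memchr() do the (likely vectorized) scanning.
 * Returns the slot index, or -1 if there is no match before `cap`. */
static int find_slot(const uint8_t *tophashes, size_t cap,
                     size_t start, uint8_t tophash)
{
    const uint8_t *p = memchr(tophashes + start, tophash, cap - start);

    return p ? (int)(p - tophashes) : -1;
}
```

A match is only a candidate: since many keys can share a tophash, the caller still has to compare the actual key stored in that slot.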
As a result, instead of having multiple groups like Swiss Tables, the table in Lwan has only one group:
[Diagram: the Lwan table layout. A single Tophashes array alongside the Buckets array, with each tophash byte marking its bucket as Empty or In Use.]
With this structure, the hash H is split into two:
uint8_t tophash = no_tombstone(H & 0xff), with no_tombstone() returning a value that's not used for the tombstone (\0);
uint32_t startpos = (H >> 8) & (capacity - 1), with capacity always being a power of two.
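Putting those two definitions together gives something like the sketch below. The remapping inside no_tombstone() is a guess at the real helper: any fixed non-zero byte works as the target, as long as a stored tophash can never equal the \0 tombstone:

```c
#include <assert.h>
#include <stdint.h>

/* Remap the tombstone so a stored tophash is never '\0'. The target
 * value (1) is an assumption; any fixed non-zero byte would do. */
static inline uint8_t no_tombstone(uint8_t h) { return h ? h : 1; }

struct hash_pos {
    uint8_t tophash;
    uint32_t startpos;
};

/* `capacity` must be a power of two, as in the table itself, so the
 * mask selects the low bits of the upper 24 bits of the hash. */
static inline struct hash_pos split_hash(uint32_t hash, uint32_t capacity)
{
    return (struct hash_pos){
        .tophash = no_tombstone(hash & 0xff),
        .startpos = (hash >> 8) & (capacity - 1),
    };
}
```

Note that tophash and startpos are taken from disjoint bits of the hash, so the byte being matched is independent of the slot being probed.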
With both tophash and startpos in hand, the table is effectively split in two:
[Diagram: the Tophashes and Buckets arrays split at startpos into a first half, from startpos to the end of the table, and a second half, from the start of the table up to startpos.]
This makes it possible to define the internal hash table probing functions like this:
```c
static bool hash_probe_half(const struct hash *ht,
                            const void *key,
                            uint32_t *out_slot,
                            uint32_t startpos,
                            uint32_t endpos,
                            uint8_t tophash)
{
    uint8_t *slotptr = memchr(ht->tophashes + startpos, tophash,
                              endpos - startpos);

    assert(tophash != '\0'); /* '\0' is the tombstone */

    while (slotptr) {
        uint32_t slot = (uint32_t)(slotptr - ht->tophashes);

        if (LIKELY(ht->key_equal(ht->buckets[slot].key, key))) {
            *out_slot = slot;
            return true;
        }

        if (UNLIKELY(endpos == slot))
            break;

        slotptr = memchr(slotptr + 1, tophash, endpos - slot - 1);
    }

    return false;
}

static bool hash_probe_startpos(const struct hash *ht,
                                const void *key,
                                uint32_t *out_slot,
                                uint32_t startpos,
                                uint8_t tophash)
{
    return hash_probe_half(ht, key, out_slot, startpos, ht->cap, tophash) ||
           hash_probe_half(ht, key, out_slot, 0, startpos, tophash);
}

static bool hash_probe(const struct hash *ht, const void *key,
                       uint32_t *out_slot)
{
    const uint32_t hash = ht->hash(key);
    const uint32_t startpos = (hash >> 8) & (ht->cap - 1);

    return hash_probe_startpos(ht, key, out_slot, startpos,
                               no_tombstone(hash & 0xff));
}
```
The user-facing API can be implemented in terms of these functions; for instance, here's how one looks up a value by its key:
```c
void *hash_find(const struct hash *ht, const void *key)
{
    uint32_t slot;

    return LIKELY(hash_probe(ht, key, &slot))
               ? (void *)ht->buckets[slot].value
               : NULL;
}
```
Likewise, insertion and deletion are implemented in terms of hash_probe_half(). You may have noticed that this table makes no distinction between empty and deleted items, which, on one hand, simplifies things quite a bit, but on the other hand, makes lookup theoretically bound by O(capacity).
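Since empty and deleted slots are both represented by the \0 tombstone, finding a place for a new item reduces to probing for a \0 byte with the same two-half, wrap-around scan used by hash_probe_startpos(). A simplified sketch of that search, assuming the tophash array layout from the snippets above:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Probe for a free slot (tombstone '\0'), first from startpos to the
 * end, then from the beginning up to startpos, mirroring the two
 * halves of the lookup probe. Returns -1 only when the table is full,
 * which a grow-on-insert policy should prevent from ever happening. */
static int find_empty_slot(const uint8_t *tophashes, uint32_t cap,
                           uint32_t startpos)
{
    const uint8_t *p = memchr(tophashes + startpos, '\0', cap - startpos);

    if (!p)
        p = memchr(tophashes, '\0', startpos);

    return p ? (int)(p - tophashes) : -1;
}
```

Insertion would then store the item's tophash and key/value pair at the returned slot; because there's no separate "deleted" state, the slot a previous deletion tombstoned is immediately reusable.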
During testing, though, I found out that lookups rarely check the second half, and that the key equality function is very likely to succeed on the first try. I suspect this happens because the chosen hash function (FNV-1a, or the Intel CRC32 instructions, depending on hardware capability) is not only a pretty good fit for this particular hash table implementation, but its result is put to good use, both as a hint of where to start looking and of what to look for.
Deleting items, however, may require the costly operation of rehashing the whole table. Here's how element deletion is currently implemented:
```c
int hash_del(struct hash *ht, const void *key)
{
    uint32_t slot;

    if (LIKELY(hash_probe(ht, key, &slot))) {
        /* Item found!  Let's remove it by tombstoning it. */
        ht->tophashes[slot] = '\0';

        ht->free_key((void *)ht->buckets[slot].key);
        ht->free_value((void *)ht->buckets[slot].value);

        ht->len--;

        if (ht->cap > INITIAL_CAP && ht->len < ht->cap / 2) {
            /* Failure to resize to reduce the table won't leave it
             * in an inconsistent state, so don't propagate the error. */
            hash_resize(ht, ht->cap / 2);
        }

        return 0;
    }

    return -ENOENT;
}
```
The hash_resize() function doesn't work in place, which isn't ideal. I've been experimenting with ideas to perform both in-place rehashing and lazy rehashing, but for now I'm going with the obvious but high-latency implementation until I'm happy with the experiments.
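For reference, an out-of-place resize boils down to allocating fresh arrays and re-inserting every live entry. This sketch mirrors the names used in the snippets above, but the struct layout and function signature are assumptions (and it assumes new_cap is a power of two large enough to hold all live entries):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct bucket {
    const void *key;
    const void *value;
};

/* Toy hash for the usage example: the key pointer's bits themselves. */
static uint32_t ptr_hash(const void *key) { return (uint32_t)(uintptr_t)key; }

/* Move every live entry (tophash != '\0') into freshly zeroed arrays
 * of `new_cap` slots, recomputing each item's startpos from its hash.
 * Returns 0 on success, -1 on allocation failure (old arrays intact). */
static int rehash(uint8_t **tophashes, struct bucket **buckets,
                  uint32_t old_cap, uint32_t new_cap,
                  uint32_t (*hash_fn)(const void *key))
{
    uint8_t *new_th = calloc(new_cap, 1);
    struct bucket *new_bk = calloc(new_cap, sizeof(*new_bk));

    if (!new_th || !new_bk) {
        free(new_th);
        free(new_bk);
        return -1;
    }

    for (uint32_t i = 0; i < old_cap; i++) {
        if ((*tophashes)[i] == '\0')
            continue;

        uint32_t hash = hash_fn((*buckets)[i].key);
        uint32_t slot = (hash >> 8) & (new_cap - 1);

        while (new_th[slot] != '\0') /* wrap-around linear probe */
            slot = (slot + 1) & (new_cap - 1);

        new_th[slot] = (*tophashes)[i];
        new_bk[slot] = (*buckets)[i];
    }

    free(*tophashes);
    free(*buckets);
    *tophashes = new_th;
    *buckets = new_bk;
    return 0;
}
```

The cost is one full pass over the old table plus a probe per live entry, all paid at once, which is exactly the latency spike an in-place or lazy scheme would try to spread out.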
A nice little addition to the hash table API is the HASH_FOREACH() macro, which lets one loop through all keys/values stored in the table in a straightforward and fool-proof manner:
```c
const void *k, *v;

HASH_FOREACH(table, &k, &v) {
    printf("%s: %s\n", (char *)k, (char *)v);
}
```
This is used both internally and in the mimegen executable built during the Lwan build process.
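For illustration, a hypothetical definition of such a macro could look like the sketch below: walk the tophash array, skip tombstones, and hand back the key/value of each live bucket. The real HASH_FOREACH() in Lwan may well differ in its details; the struct here is a stand-in for the table's internals:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct bucket {
    const void *key;
    const void *value;
};

/* Stand-in for the parts of struct hash the macro needs to see. */
struct hash_view {
    uint32_t cap;
    const uint8_t *tophashes;
    const struct bucket *buckets;
};

/* Hypothetical shape of the iteration macro: for each live slot
 * (tophash != '\0'), assign the key and value out-parameters and run
 * the loop body once. */
#define HASH_FOREACH_SKETCH(ht, keyp, valuep)                           \
    for (uint32_t slot_ = 0; slot_ < (ht)->cap; slot_++)                \
        if ((ht)->tophashes[slot_] != '\0' &&                           \
            (*(keyp) = (ht)->buckets[slot_].key,                        \
             *(valuep) = (ht)->buckets[slot_].value, 1))
```

Hiding the tombstone check inside the macro is what makes iteration fool-proof: callers can't accidentally touch an empty bucket.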
This implementation now carries unit tests that execute automatically every time the server boots when built in debug mode. They are declared using the SELF_TEST() macro I wrote about a while back. This has already caught a few bugs in experiments I've made, and made it possible for me to refactor the implementation without having to worry too much about subtly breaking things.
While I usually talk about performance when writing about Lwan, the purpose of this rewrite was not to improve performance, so I haven't measured the new table. I'm just happy it's a much simpler implementation, without multiple sub-tables and without the overhead of carrying the whole hash around with every key/value pair.
It's very likely that this new table performs better than the previous table, but I won't bother comparing with the previous implementation.
This table is mostly done, but I'll continue tweaking it here and there.
As I already mentioned, I'm experimenting with ways to make rehashing less expensive than it currently is. If I find a good way to implement this without requiring an extra level of indirection like the Go version of this hash table, I'll end up writing about it because it might be useful for other people.