TL;DR

The DiskANN paper explained the Vamana algorithm, but implementing it made the missing engineering details obvious. My first version used the right algorithm with the wrong data structures, so the ANN index was slower than brute force on small fixtures. The fix was not a new algorithm; it was making the code preserve the algorithm’s real invariants directly: bounded sorted candidates, cheap visited checks, duplicate prevention, and marker-style pruning. The lesson is that pseudocode names like L, V, and “remove” are not implementation details to fill in later. They are where the performance and correctness work begins.

Full Post

My first ANN implementation had the right idea and the wrong engineering shape. The embarrassing part was simple: my approximate nearest-neighbor index was slower than brute force.

That sounds absurd at first. ANN exists because brute force gets expensive. Instead of comparing a query against every vector, the index should walk a graph and inspect a smaller set of candidates. I wrote a more introductory explanation of that idea in my post on vector databases and semantic search.

So when I started building sembed-engine, a small C++ vector search engine, I expected the Vamana index to become faster almost by default. I had read the DiskANN paper. I understood the broad shape: vectors become graph nodes, edges point to useful neighbors, and search walks from an entry point toward vectors closer to the query. On paper, that felt clean. In code, my version was paying rent on every step.

The uncomfortable benchmark

The early benchmark was not flattering. On tiny deterministic fixtures, brute force won easily:

| dataset | brute force p50 | Vamana p50 |
| --- | --- | --- |
| gvec.bin | 0.27 ms | 22.98 ms |
| w2v.bin | 1.49 ms | 142.27 ms |

That was not a production benchmark. Those fixtures are small on purpose, because they are useful for recall checks and regression tracking. Still, the result stung. It forced the better question: not “Is ANN wrong?”, but “What does ANN need before it actually beats brute force?”

Brute force is not dumb. It is simple, and simple stays fast for longer than you expect. On a small dataset it gets to do a very clean thing:

walk through contiguous vectors, compute distances, keep the top results.
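To make that concrete, here is roughly what the brute-force loop looks like. This is a minimal illustrative sketch, not sembed-engine's actual code; the names and the squared-L2 choice are mine:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal brute-force top-k sketch (illustrative; not sembed-engine's code).
// Assumes `data` holds n vectors of `dim` floats stored contiguously.
std::vector<std::pair<float, std::size_t>> brute_force_topk(
    const std::vector<float>& data, std::size_t dim,
    const std::vector<float>& query, std::size_t k) {
    const std::size_t n = data.size() / dim;
    std::vector<std::pair<float, std::size_t>> results;
    results.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        float dist = 0.0f;  // squared L2 is enough for ranking, no sqrt
        for (std::size_t d = 0; d < dim; ++d) {
            const float diff = data[i * dim + d] - query[d];
            dist += diff * diff;
        }
        results.emplace_back(dist, i);
    }
    k = std::min(k, results.size());
    // Keep only the k closest; pairs compare by distance first.
    std::partial_sort(results.begin(), results.begin() + k, results.end());
    results.resize(k);
    return results;
}
```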

That loop is boring in the best possible way. The CPU understands it. A graph-based ANN index has to pay for traversal, candidate bookkeeping, visited tracking, pruning, backlinks, and index metadata before it gets any benefit from skipping work. On a tiny dataset, the overhead can be the whole story.

The benchmark did not tell me ANN was useless. It told me my implementation had not earned the theory yet.

What the paper says

The part of DiskANN I was implementing is the Vamana-style graph index. The paper gives three main pieces:

- GreedySearch
- RobustPrune
- the Vamana indexing loop

In plain English, GreedySearch keeps a candidate list and a visited set. It starts from a chosen point in the graph, expands the closest unvisited candidate, adds that candidate’s graph neighbors, keeps the best candidates, and eventually returns the top results.
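Here is that loop as a deliberately literal C++ sketch. neighbors() and dist() are assumed helpers, and none of these names come from the repo; the point is how much bookkeeping the pseudocode quietly implies:

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <vector>

using NodeId = std::uint32_t;
struct Candidate { float distance; NodeId node; };

// Deliberately literal GreedySearch, close to the paper's pseudocode.
// `neighbors(u)` returns u's adjacency list; `dist(u)` is the distance
// from node u to the query. Both are assumed helpers, not real APIs.
template <typename NeighborsFn, typename DistFn>
std::vector<Candidate> greedy_search(NodeId entry, std::size_t L,
                                     NeighborsFn neighbors, DistFn dist) {
    std::vector<Candidate> cand{{dist(entry), entry}};  // the list L
    std::unordered_set<NodeId> visited;                 // the set V
    for (;;) {
        // Closest candidate that has not been expanded yet.
        auto next = std::find_if(cand.begin(), cand.end(),
            [&](const Candidate& c) { return !visited.count(c.node); });
        if (next == cand.end()) break;
        visited.insert(next->node);
        // Add the expanded node's neighbors, skipping duplicates.
        for (NodeId v : neighbors(next->node)) {
            const bool present = std::any_of(cand.begin(), cand.end(),
                [&](const Candidate& c) { return c.node == v; });
            if (!present) cand.push_back({dist(v), v});
        }
        // Re-sort and truncate to the search-list bound every iteration.
        std::sort(cand.begin(), cand.end(),
                  [](const Candidate& a, const Candidate& b) {
                      return a.distance < b.distance;
                  });
        if (cand.size() > L) cand.resize(L);
    }
    return cand;  // top candidates; `visited` is the traversal record
}
```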

RobustPrune runs while building the graph. For a point p, it looks at possible neighbors and keeps a bounded set of useful outgoing edges. alpha controls how aggressively candidates are removed, and R is the maximum degree.
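RobustPrune, sketched the same way, with d as an assumed two-argument distance helper:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using NodeId = std::uint32_t;

// RobustPrune sketch: keep at most R outgoing edges for p; alpha controls
// how aggressively the prune rule removes covered candidates.
template <typename DistFn>
std::vector<NodeId> robust_prune(NodeId p, std::vector<NodeId> candidates,
                                 float alpha, std::size_t R, DistFn d) {
    // The source node must never become its own neighbor.
    candidates.erase(std::remove(candidates.begin(), candidates.end(), p),
                     candidates.end());
    std::vector<NodeId> out;
    while (!candidates.empty() && out.size() < R) {
        // Take the candidate closest to p.
        // (A real implementation caches d(p, .) instead of recomputing.)
        auto best = std::min_element(candidates.begin(), candidates.end(),
            [&](NodeId a, NodeId b) { return d(p, a) < d(p, b); });
        const NodeId pstar = *best;
        out.push_back(pstar);
        // Drop every candidate pstar now "covers": if pstar is much closer
        // to it than p is, the edge p -> candidate is redundant.
        candidates.erase(std::remove_if(candidates.begin(), candidates.end(),
            [&](NodeId c) { return alpha * d(pstar, c) <= d(p, c); }),
            candidates.end());
    }
    return out;
}
```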

Vamana construction combines those two ideas. For each point, it runs greedy search toward that point, uses the visited nodes as candidate neighbors, prunes them, adds backlinks, and repairs degree violations caused by those backlinks. The paper describes this clearly. The trap was that “I understand this paragraph” and “I know what this should look like in C++” are not the same thing.
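And the construction loop, as a sketch that leans on the greedy_search and robust_prune shapes above; search_visited stands in for running greedy search toward point p and returning its visited set:

```cpp
#include <cstdint>
#include <vector>

using NodeId = std::uint32_t;

// Vamana build-loop sketch, assuming the robust_prune sketch above.
// (Initialization with a random graph and the medoid entry point omitted.)
template <typename SearchFn, typename DistFn>
void build_vamana(std::vector<std::vector<NodeId>>& graph,
                  float alpha, std::size_t R,
                  SearchFn search_visited, DistFn d) {
    const NodeId n = static_cast<NodeId>(graph.size());
    for (NodeId p = 0; p < n; ++p) {
        // Visited nodes from the search become p's candidate neighbors.
        graph[p] = robust_prune(p, search_visited(p), alpha, R, d);
        for (NodeId q : graph[p]) {
            graph[q].push_back(p);       // backlink q -> p
            if (graph[q].size() > R)     // backlinks can break the bound
                graph[q] = robust_prune(q, graph[q], alpha, R, d);
        }
    }
}
```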

A set is not a data structure

The DiskANN pseudocode talks about sets like L and V. That is fine. Pseudocode should explain the algorithm, not tell me which C++ container to use. But code cannot store an abstract set.

When the paper says L, I have to ask:

- Is this a sorted vector?
- Is it a heap?
- Is it a bounded priority queue?
- How often do I insert into it?
- How do I find the closest unvisited element?
- How do I remove duplicates?
- Where is the search-list bound enforced?

When the paper says V, I have to ask:

- Is this an unordered_set?
- Is it a dense bitset?
- Is it a boolean array?
- Is the node id space dense enough for indexed membership checks?
- Does it reset every query?

When the paper says “remove candidates”, I have to ask what remove means. Do I erase from a vector? Do I mark entries as deleted and skip them later? How often does that happen inside the hot loop?

That was the shift. The paper gave me the nouns. The implementation had to choose the machinery.

The mistake in my first shape

My earlier implementation was too literal. It used plain vectors, std::unordered_set, repeated sorting, and repeated erasing. That matched the algorithm at a surface level, but it did not match the cost model.

A simplified version of the bad pattern looked like this:

```
append new candidates
rebuild or update a hash set for duplicates
sort the candidate list by distance
truncate it to the search-list size
scan again for the next unvisited candidate
```

None of those steps are wrong in isolation. The problem is that graph search repeats the same small operations again and again. If every expansion needs expensive bookkeeping, the traversal becomes heavy even if the algorithm looks correct.

This is where the usual “make it correct first, make it fast later” advice gets a little blurry. Correctness still comes first, obviously. But for this kind of algorithm, some representations make correctness expensive by default. If the core loop needs sorted candidates, duplicate prevention, bounded size, and visited state, those are not tiny cleanup details. They are part of the algorithm’s real shape.

What changed in the optimization

In PR #4, I changed the data structures around Vamana’s build and search path in the sembed-engine repo. The algorithm did not change into something else. The code started matching the invariants the algorithm already needed. This is a different lesson from my earlier write-up on making the vector search hot path faster, which was more about data layout, squared distances, and CPU work per node.

The first piece was a concrete Neighbour object:

```cpp
struct Neighbour {
    float  distance;  // distance from the query (or build target) to this node
    NodeId node;      // id of the candidate node in the graph
    bool   marked;    // logical-deletion flag for marker-style pruning
};
```

That struct is small, but it makes the candidate’s actual job visible. A candidate is not just a node id. It has a distance and state.

Then I added a SortedBoundedVector for candidate lists. It keeps candidates sorted as they are inserted, caps the list by the search-list size, rejects duplicates, and remembers where the next unexpanded candidate is. The candidate list now owns the invariant instead of asking every caller to remember it.

Before that, the code kept doing some version of: sort this vector, deduplicate it, trim it, now scan for the next node. The new representation says: this is a bounded sorted candidate list. That is a much better fit for GreedySearch.
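To show the idea without claiming this is the repo's exact class, here is a sketch of what the insert path of such a container can look like; every name is illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using NodeId = std::uint32_t;
struct Neighbour { float distance; NodeId node; bool marked; };

// Sketch of a bounded sorted candidate list. It owns the invariants:
// sorted by distance, capped at `bound`, no duplicate nodes. Cursor
// tracking for the next unexpanded entry is omitted here.
class SortedBoundedVector {
public:
    explicit SortedBoundedVector(std::size_t bound) : bound_(bound) {}

    bool insert(const Neighbour& n) {
        // Binary search for the insertion point by distance.
        auto pos = std::lower_bound(
            data_.begin(), data_.end(), n,
            [](const Neighbour& a, const Neighbour& b) {
                return a.distance < b.distance;
            });
        // Full list and n is worse than everything kept: reject.
        if (data_.size() == bound_ && pos == data_.end()) return false;
        // Duplicate check: same-node entries share a distance, so they
        // sit in the equal-distance run starting at pos.
        for (auto it = pos; it != data_.end() && it->distance == n.distance; ++it)
            if (it->node == n.node) return false;
        data_.insert(pos, n);
        if (data_.size() > bound_) data_.pop_back();  // enforce the bound
        return true;
    }

    const std::vector<Neighbour>& items() const { return data_; }

private:
    std::size_t bound_;
    std::vector<Neighbour> data_;
};
```

Insertion pays a small shift inside a bounded vector, and in exchange every later read gets sorted, duplicate-free candidates for free.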

For visited tracking, I moved toward boost::dynamic_bitset . Graph node ids are dense indices, so membership can be an indexed bit operation instead of a hash-table lookup. It also makes the code read closer to the idea: this node has been seen already.
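The shape of that, again as a sketch rather than the repo's code:

```cpp
#include <boost/dynamic_bitset.hpp>
#include <cstdint>

using NodeId = std::uint32_t;

// Dense visited tracking: node ids index directly into a bitset, so a
// membership check is one bit read instead of a hash-table probe.
struct VisitedSet {
    explicit VisitedSet(std::size_t node_count) : bits_(node_count) {}
    bool seen(NodeId id) const { return bits_.test(id); }
    void mark(NodeId id)       { bits_.set(id); }
    void clear()               { bits_.reset(); }  // reuse between queries
private:
    boost::dynamic_bitset<> bits_;
};
```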

For pruning, I moved away from repeated physical deletion and toward marker-style bookkeeping. Candidates can be marked as deleted and skipped, which fits the hot loop better than constantly reshaping the vector.
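A sketch of the pattern, with should_prune standing in for whatever rule applies at that point:

```cpp
#include <cstdint>
#include <vector>

using NodeId = std::uint32_t;
struct Neighbour { float distance; NodeId node; bool marked; };

// Logical deletion: flip the flag instead of erasing from the vector.
template <typename PruneRule>
void mark_pruned(std::vector<Neighbour>& candidates, PruneRule should_prune) {
    for (Neighbour& c : candidates)
        if (!c.marked && should_prune(c))
            c.marked = true;  // O(1), no shifting of later elements
}

// Consumers skip marked entries instead of paying for erase().
inline const Neighbour* first_live(const std::vector<Neighbour>& candidates) {
    for (const Neighbour& c : candidates)
        if (!c.marked) return &c;
    return nullptr;
}
```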

Pseudocode is allowed to say set. My compiler is not.

The hidden invariants

The more I worked on Vamana, the more the implementation started to look like a pile of invariants.

For greedy search:

- the candidate list should stay sorted by distance,
- it should stay bounded by the search-list size,
- it should not contain duplicate nodes,
- each expanded node should not be expanded again,
- the returned nearest candidates and the visited traversal set are related but not identical.

For robust prune:

- the output degree should stay within R,
- the source node should not accidentally become its own neighbor,
- candidates removed by the prune rule should stop participating,
- backlinks can violate degree bounds,
- degree violations need repair after insertion.
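Most of these are cheap to assert in tests. A sketch of what such debug checks could look like, with illustrative names:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using NodeId = std::uint32_t;
struct Neighbour { float distance; NodeId node; bool marked; };

// Candidate-list side: bounded, sorted, duplicate-free.
void check_candidates(const std::vector<Neighbour>& cand, std::size_t L) {
    assert(cand.size() <= L);                              // bounded
    for (std::size_t i = 1; i < cand.size(); ++i) {
        assert(cand[i - 1].distance <= cand[i].distance);  // sorted
        for (std::size_t j = 0; j < i; ++j)
            assert(cand[j].node != cand[i].node);          // no duplicates
    }
}

// Graph side: degree within R, no self-edges after backlink repair.
void check_graph(const std::vector<std::vector<NodeId>>& graph, std::size_t R) {
    for (std::size_t p = 0; p < graph.size(); ++p) {
        assert(graph[p].size() <= R);
        for (NodeId q : graph[p])
            assert(q != static_cast<NodeId>(p));
    }
}
```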

These sound obvious once written down. They were less obvious when I was reading the paper and mentally treating L, V, and Nout(p) as simple containers. That was the mistake: I had read the algorithm, but I had not read the costs hiding inside the algorithm.

The benchmark became a mirror

After this round of work, the fixture-scale numbers looked much healthier:

| dataset | algorithm | recall@10 | qps | p50 latency | p95 latency | build |
| --- | --- | --- | --- | --- | --- | --- |
| gvec.bin | brute force | 1.00 | 70284.10 | 0.01 ms | 0.02 ms | - |
| gvec.bin | Vamana | 1.00 | 43433.53 | 0.02 ms | 0.03 ms | 0.04 s |
| w2v.bin | brute force | 1.00 | 16870.26 | 0.06 ms | 0.06 ms | - |
| w2v.bin | Vamana | 1.00 | 14195.72 | 0.07 ms | 0.08 ms | 1.17 s |

The caveat still matters. On tiny fixtures, brute force is very strong. That is not embarrassing anymore. It is just the benchmark telling the truth.

The larger calibration run is where the expected ANN behavior starts to show up:

| dataset | algorithm | recall@10 | qps | p50 latency | p95 latency |
| --- | --- | --- | --- | --- | --- |
| gnews-6500.bin | brute force | 1.00 | 560.06 | 1.70 ms | 2.18 ms |
| gnews-6500.bin | Vamana | 1.00 | 2989.22 | 0.32 ms | 0.51 ms |

That is 5.34x the brute-force query throughput at the same recall@10. But I do not want the lesson to be “the optimization worked, ship it.” The useful part is that the first benchmark made me honest.

A benchmark is not a trophy. It is a mirror. It showed me that the algorithmic idea was not enough. The representation had to match the operations the algorithm performs over and over.

Reading algorithms like an engineer

The lesson is to slow down at the nouns in pseudocode. If it says L , ask what operations L needs to support. If it says V , ask how membership is checked. If it says “remove”, ask whether deletion is physical or logical. If it says “bounded”, ask where that bound is enforced. If it says “add backlinks”, ask what invariant can break afterward.

Reading algorithms like an engineer does not mean treating theory as secondary. Without the DiskANN paper, there is no useful map for what to build. The paper gives the map. Implementation is the terrain.

Every abstract set becomes memory. Every loop becomes cost. Every hidden invariant becomes a bug waiting for you. Algorithms are compressed explanations. Implementations are decompressed consequences. The job of an engineer is to do that decompression carefully.