TL;DR

I changed sembed-engine so the hot path uses flat arrays, lightweight views, squared distances, and cached candidate scores. The Vamana search still visits the same number of nodes and keeps the same recall, but each node visit is cheaper for the CPU.

The practical result was:

gvec query latency: 4.094ms -> 0.631ms

w2v query latency: 25.15ms -> 1.524ms

w2v build time: 17.91s -> 1.889s

Recall stayed at 1.0

The main lesson: performance is often hidden in data layout. A clean object model can still be slow if the CPU has to chase pointers, call virtual methods, run scalar loops, calculate square roots, and recompute the same scores while sorting.

Full post

A few months ago I wrote about vector databases and Approximate Nearest Neighbors.

You may read it here.

The short version is this:

We take text, images, or any other messy data and turn it into a list of numbers called a vector. Similar things end up close to each other in this vector space.

So if you search for “dog shelters around Gurgaon”, the search engine can return “animal shelters near Delhi” because the vectors for those ideas are close.

Implementing the algorithms verbatim is easy.

Making them fast is where the real work begins.

In my project, sembed-engine, I am building this kind of vector search engine in C++. It uses a graph-based index called Vamana. The graph helps avoid checking every vector in the dataset.

Recently I made an improvement to Vamana: PR #3.

The PR made both builds and queries much faster:

Workload          Before    After     Improvement
gvec build time   2.004s    0.369s    5.43x faster
gvec p50 query    4.094ms   0.631ms   6.49x faster
w2v build time    17.91s    1.889s    9.48x faster
w2v p50 query     25.15ms   1.524ms   16.5x faster

The interesting part is that the algorithm stayed the same.

Recall stayed at 1.0 -> 1.0.

The number of visited nodes stayed the same.

For example, on one workload the search visited an average of 64.625 nodes before and 64.625 nodes after.

So the improvement came from reducing the cost of each step.

This post is about that.

The problem with thinking only in algorithms

When learning algorithms, we usually think in terms of Big-O.

If the old code is O(n) and the new code is also O(n), then it is tempting to say both are roughly equivalent.

But Big-O is only part of the story.

Computers run instructions.

They load memory.

They follow pointers.

They call functions.

They wait for cache misses.

They do floating point operations.

They sometimes execute one clean instruction, and sometimes they take a long detour to do the same logical thing.

In vector search this matters a lot because the same small operation is repeated again and again.

A distance calculation between two vectors looks simple:

float sum = 0.0f;
for (size_t i = 0; i < dimensions; ++i) {
    float diff = a[i] - b[i];
    sum += diff * diff;
}

But this tiny loop runs everywhere.

It runs while building the graph.

It runs while searching the graph.

It runs while sorting candidate nodes.

It runs while pruning neighbors.

So if we make one distance calculation cheaper, the whole engine benefits.

What the old code was doing

Before the PR, the data layout was object-heavy.

Each record had a pointer to a vector object. In simplified form, it looked like this:

struct Record {
    int64_t id;
    std::shared_ptr<Vector> vector;
};

This is a very natural way to write C++.

A record has an id.

A record has a vector.

The vector knows how to calculate distance.

Everything is nicely separated.

But in the hot path, this design asks the CPU to do extra work.

To compare a query with one record, the CPU has to do something like this:

1. Load the record.
2. Read the shared_ptr inside the record.
3. Follow that pointer to find the actual vector object.
4. Call methods on that vector object.
5. Read each float through that abstraction.
6. Calculate distance.
7. Take a square root at the end.

None of these steps look outrageous alone.

But when they happen millions of times, they become expensive.

The biggest cost here comes from indirection.

Indirection means: before I can get the thing I want, I first need to go somewhere else to find out where that thing lives.

The CPU is extremely fast when data is laid out predictably. It can prefetch memory, use cache lines well, and run tight loops.

It is much less happy when every row says, “the real data is somewhere else, please follow this pointer.”
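To make the contrast concrete, here is a small sketch of the two access patterns, one with indirection and one over the flat layout the next section describes. The types and names here are simplified stand-ins of mine, not the engine's actual classes:

#include <cstddef>
#include <cstdint>
#include <memory>

// Simplified stand-ins for the old design.
struct Vector {
    virtual ~Vector() = default;
    virtual float at(std::size_t i) const = 0;  // virtual dispatch for every float
};

struct Record {
    int64_t id;
    std::shared_ptr<Vector> vector;  // a pointer to chase before any math
};

// Old pattern: every element read pays for a pointer chase and a virtual call.
float sumIndirect(const Record& r, std::size_t dims) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < dims; ++i)
        sum += r.vector->at(i);
    return sum;
}

// Flat pattern: the same floats, contiguous, no dispatch.
// The compiler can prefetch and vectorize this loop.
float sumFlat(const float* values, std::size_t dims) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < dims; ++i)
        sum += values[i];
    return sum;
}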

The new layout: keep the floats together

The PR changed the dataset layout.

Instead of storing every vector as a separate object, it stores all vector values in one big flat array.

Conceptually:

ids    = [id0, id1, id2, ...]
values = [v0_dim0, v0_dim1, v0_dim2, ..., v1_dim0, v1_dim1, v1_dim2, ...]

If each vector has D dimensions, then vector number i starts at:

i * D

So getting a vector is just pointer arithmetic.

FloatVectorView view{values.data() + i * dimensions, dimensions};

A FloatVectorView only borrows the data.

It is just:

struct FloatVectorView {
    const float* data;
    size_t dimensions;
};

In simple words, it says:

“Here is where the numbers start, and here is how many numbers there are.”

That is it.

The CPU likes this because the next vector is right after the previous vector in memory.
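Put together, the distance kernel becomes a tight loop over two views. This is my sketch of the pattern, not the PR's code verbatim:

#include <cmath>
#include <cstddef>

struct FloatVectorView {
    const float* data;
    std::size_t dimensions;
};

// The same loop as before, now reading straight through two views.
// Contiguous data lets the compiler auto-vectorize it at -O3.
float distance(FloatVectorView a, FloatVectorView b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.dimensions; ++i) {
        float diff = a.data[i] - b.data[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);  // a later section removes even this square root
}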

Why this matters at the assembly level

I made a small dummy C++ file to isolate the patterns from the PR and compiled it with -O3 to assembly.

This is useful because assembly shows what the CPU is actually being asked to do.
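For reference, the kind of command I mean looks like this with GCC (the file name is mine; the flag requests the Intel syntax used in the listings below):

g++ -O3 -S -masm=intel patterns.cpp -o patterns.s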

The old style, using a virtual vector interface, produced assembly like this:

mov    rax, QWORD PTR [r12]     ; load vtable
mov    rax, QWORD PTR 32[rax]   ; load virtual at() slot
jne    .L166                    ; fallback path
...
call   rax                      ; possible virtual call
...
sqrtss xmm2, xmm2               ; scalar square root

The assembly is readable enough for the important point.

call rax means the program is calling a function through an address stored in a register. This usually happens with virtual functions or function pointers.

The CPU sees a complicated path involving object metadata and possible function calls instead of a simple loop over floats.

The new FloatVectorView style produced assembly like this:

movups xmm1, XMMWORD PTR [rdi+rax]
movups xmm3, XMMWORD PTR [rcx+rax]
subps  xmm1, xmm3
mulps  xmm1, xmm1
addss  xmm0, xmm1

This is the shape we want.

The important instructions are:

movups: load multiple floats from memory.

subps: subtract multiple floats at once.

mulps: multiply multiple floats at once.

The ps suffix roughly means “packed single-precision floats”.

So instead of handling one float through an object abstraction, the compiler can work on multiple floats together.

This is one of the main reasons the new version is faster.

The math stayed the same.

It became easier for the compiler and CPU to understand.

Removing an unnecessary square root

The old distance function returned Euclidean distance.

For two vectors, Euclidean distance is:

$$\sqrt{(a_1-b_1)^2 + (a_2-b_2)^2 + \dots + (a_n-b_n)^2}$$

The square root is mathematically correct.

But for nearest-neighbor search, we usually only care which distance is smaller.

If distance A is smaller than distance B, then squared distance A is also smaller than squared distance B.

For example:

sqrt(25) < sqrt(100)
25 < 100

The ordering is the same.

So instead of calculating:

return std::sqrt(sum);

we can just return:

return sum;

This is called squared distance.
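Applied to the view-based kernel from earlier, the change is exactly one line. Again, this is a sketch of the pattern with my naming, not the PR verbatim:

// Same loop as distance(), minus the square root.
float squaredDistance(FloatVectorView a, FloatVectorView b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.dimensions; ++i) {
        float diff = a.data[i] - b.data[i];
        sum += diff * diff;
    }
    return sum;  // monotonic in the true distance, so ordering is unchanged
}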

In assembly, the old path had:

sqrtss xmm2, xmm2
jmp    sqrtf@PLT

The new path had no sqrtss, no sqrtps, and no sqrtf@PLT.

That matters because square root is more expensive than normal addition or multiplication.

There was one detail to be careful about.

Vamana pruning uses a parameter called alpha. The old comparison looked conceptually like this:

alpha * distance(a, b) <= distance(c, b)

If we use squared distances, we must square alpha too:

alpha * alpha * squaredDistance(a, b) <= squaredDistance(c, b)

That keeps the meaning the same while avoiding the square root.
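As a sketch, the squared-distance version of the pruning check could look like this (a hypothetical helper of mine; the engine's actual pruning loop may differ):

// Both sides of alpha * d(a,b) <= d(c,b) are non-negative,
// so squaring both sides preserves the inequality.
bool occludes(FloatVectorView a, FloatVectorView b, FloatVectorView c, float alpha) {
    return (alpha * alpha) * squaredDistance(a, b) <= squaredDistance(c, b);
}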

This is a useful kind of optimization: removing work after noticing that the final answer never needed it.

There was another expensive pattern in the old code.

During Vamana search and graph construction, the engine keeps candidate nodes and sorts them by distance from the query.

The old style computed distance inside the comparator.

Conceptually:

std::sort(nodes.begin(), nodes.end(), [&](Node a, Node b) {
    return distance(query, a) < distance(query, b);
});

This looks clean.

But sorting calls the comparator many times.

So the same distance can be recomputed again and again.

Imagine sorting people by age, but every comparison requires calling their school, reading their birth certificate, and calculating their age from scratch.

That is what this pattern does, except with vector distances.

The new code does this instead:

struct ScoredNode {
    Node node;
    float score;
};

std::vector<ScoredNode> scored;
for (const Node& node : nodes) {
    scored.push_back({node, squaredDistance(query, node)});
}

std::sort(scored.begin(), scored.end(), [](const ScoredNode& a, const ScoredNode& b) {
    return a.score < b.score;
});

Now the expensive part happens once per node.

Sorting only compares already-computed numbers.

The assembly makes the difference very clear.

Old comparator pattern:

call   new_view_squared   ; compute left score
...
call   new_view_squared   ; compute right score
comiss xmm0, xmm4         ; compare scores
...
call   new_view_squared   ; repeated again deeper in sort

New comparator pattern:

movss  xmm0, DWORD PTR 8[rbx]   ; load cached score
movss  xmm1, DWORD PTR 8[r12]   ; load cached score
comiss xmm1, xmm0               ; compare floats

Again, the high-level algorithm is the same.

But the cost of maintaining candidates is much lower.

This is especially important because graph search algorithms spend a lot of time asking the same question:

“Which candidate should I visit next?”

If answering that question repeatedly recomputes distances, performance suffers.
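One cheap way to answer it is to keep candidates in a min-heap keyed on the cached score, so picking the next node never touches the vectors again. A minimal sketch, not sembed-engine's actual structure:

#include <queue>
#include <vector>

// Illustrative types; real candidate bookkeeping is more involved.
struct ScoredNode {
    int node;     // node id, simplified to an int here
    float score;  // squared distance to the query, computed once
};

// Greater-than ordering turns std::priority_queue into a min-heap,
// so top() is always the closest unvisited candidate.
struct ByScore {
    bool operator()(const ScoredNode& a, const ScoredNode& b) const {
        return a.score > b.score;
    }
};

int main() {
    std::priority_queue<ScoredNode, std::vector<ScoredNode>, ByScore> frontier;
    frontier.push({7, 0.42f});         // score computed once, at push time
    frontier.push({3, 0.11f});
    ScoredNode next = frontier.top();  // node 3: the comparison only read cached floats
    frontier.pop();
    (void)next;
}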

The actual benchmark result

I compared main against the PR branch using Release builds.

Each side was run three times, and I used the median.

The important results were:

Workload   Metric                     Main      PR        Improvement
gvec       Vamana build time          2.004s    0.369s    5.43x faster
gvec       Vamana p50 query latency   4.094ms   0.631ms   6.49x faster
gvec       Vamana p95 query latency   4.817ms   0.657ms   7.33x faster
gvec       Vamana QPS                 237.1     1593      6.72x higher
w2v        Vamana build time          17.91s    1.889s    9.48x faster
w2v        Vamana p50 query latency   25.15ms   1.524ms   16.5x faster
w2v        Vamana p95 query latency   26.39ms   1.593ms   16.57x faster
w2v        Vamana QPS                 39.62     656.3     16.56x higher

The quality checks were important too:

Recall stayed at 1.0 -> 1.0.

Average visited nodes stayed identical.

Index bytes stayed identical for the Vamana runs.

This is why I am confident the speedup came from cheaper execution while preserving search quality.

Conclusion

The biggest learning from this is that data layout is part of the algorithm.

On paper, both versions do the same kind of graph search.

But the CPU experiences them very differently.

Before, one distance calculation involved object lookups, pointer chasing, virtual dispatch, scalar operations, and a square root.

After, one distance calculation is mostly a tight loop over contiguous floats.

Before, sorting candidates could repeatedly recompute distances.

After, sorting candidates compares cached scores.

That is enough to turn:

25.15ms -> 1.524ms

for the w2v query benchmark.

The search visited the same number of nodes.

It just stopped paying unnecessary tax at every node.