
Optimizes reduceLanes in diversityCalculation of PQVectors, for Euclidean function #623

Open
MarkWolters wants to merge 2 commits into main from reduce_lanes

Conversation

@MarkWolters
Contributor

This pull request introduces an improvement to the Euclidean similarity function in PQVectors.diversityFunctionFor. A flamegraph shows that a considerable amount of time is spent in jdk/incubator/vector/FloatVector.reduceLanesTemplate, mainly because FloatVector.reduceLanes() is expensive and is called inside a for loop (via VectorUtil.squareL2Distance). The modification in this pull request moves the call to reduceLanes() outside the for loop.
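The hoisting described above can be sketched with a scalar analogue. This is a minimal sketch: plain float arrays stand in for FloatVector lanes so the example runs without the incubator Vector API module, and all names are illustrative rather than taken from the actual patch.

```java
// Illustrative scalar analogue of the reduceLanes() hoisting described above.
// Plain float arrays stand in for FloatVector lanes; names are made up for the example.
class ReduceLanesHoist {
    static final int LANES = 8; // stand-in for FloatVector.SPECIES_PREFERRED.length()

    // Before: one horizontal reduction per subspace, inside the loop
    // (this is what calling VectorUtil.squareL2Distance per subspace amounts to).
    static float perIterationReduce(float[][] diffs) {
        float sum = 0f;
        for (float[] d : diffs) {
            float partial = 0f;                 // horizontal reduction in the loop body
            for (int l = 0; l < LANES; l++) partial += d[l] * d[l];
            sum += partial;
        }
        return sum;
    }

    // After: accumulate lane-wise across all subspaces, reduce to a scalar once.
    static float hoistedReduce(float[][] diffs) {
        float[] acc = new float[LANES];         // vector accumulator analogue
        for (float[] d : diffs) {
            for (int l = 0; l < LANES; l++) acc[l] += d[l] * d[l];
        }
        float sum = 0f;                         // single reduceLanes() equivalent
        for (int l = 0; l < LANES; l++) sum += acc[l];
        return sum;
    }
}
```

In the vectorized version, the inner accumulation becomes one vector multiply-add per subspace, and the final loop corresponds to a single reduceLanes() call after the subspace loop instead of one per subspace.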

The change proposed here was tested with the benchmark PQDistanceCalculationBenchmark.diversityCalculation.
With this benchmark, an ~18% reduction in time was observed when M=64 and ~22% when M=192.

Code modifications:

Added a new function pqDiversityEuclidean in VectorUtilSupport and its corresponding implementations
Removed the for loop in PQVectors.diversityFunctionFor and moved it into pqDiversityEuclidean
Moved FloatVector.reduceLanes() outside the for loop
Test setup:
Jvector version : main branch (as of 2025-08-28)
JDK version : openjdk version "24.0.2" 2025-07-15
Platform : INTEL(R) XEON(R) PLATINUM 8592+
Benchmark : PQDistanceCalculationBenchmark.diversityCalculation

New changes
I have modified the code to include the dot product and cosine functions and implemented similar changes for scoreFunctionFor.
With the changes applied to scoreFunctionFor, when M=64: dot product shows an ~30% reduction in latency and cosine shows an ~43% reduction.
I can add data points for other subspace counts, if required.

Code modifications:

Added new functions pqScoreEuclidean, pqScoreDotProduct, and pqScoreCosine in VectorUtilSupport, with corresponding implementations for diversityFunctionFor
Added overloaded versions of the above functions for scoreFunctionFor
Removed the for loops in PQVectors.diversityFunctionFor and PQVectors.scoreFunctionFor and moved them into the respective functions in PanamaVectorUtilSupport
Moved FloatVector.reduceLanes() outside the for loop
Added a new benchmark which uses MutablePQVectors to test this
Test setup:
Jvector version : main branch (as of 2025-10-22)
JDK version : openjdk version "25.0.1" 2025-10-21
Platform : Intel(R) Xeon(R) 6979P
Benchmark : PQDistanceCalculationMutableVectorBenchmark (Added this by replicating PQDistanceCalculationBenchmark to measure performance for MutablePQVectors)

@github-actions
Contributor

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

Contributor

@jshook left a comment


This looks reasonable to me. It would be nice to see the results laid out a little better for before and after, but it still looks like a solid improvement.

@shyla226

shyla226 commented Mar 2, 2026

@MarkWolters, thank you for creating this pull request. Please let me know if you have any concerns about the current implementation or need any additional data.
@jshook, I collected benchmark data using PQDistanceCalculationMutableVectorBenchmark. Would you like to see tabulated results from the runs, or do you have what you need already?


@r-devulap left a comment


First-pass review: there’s quite a bit of duplicated code, and I think we can consolidate it significantly.


/*-------------------- Score functions--------------------*/
//adding SPECIES_64 & SPECIES_128 for completeness, will it get there?
float pqScoreEuclidean_512(VectorFloat<?>[] codebooks, int[][] subvectorSizesAndOffsets, ByteSequence<?> node1Chunk, int node1Offset, ByteSequence<?> node2Chunk, int node2Offset, int subspaceCount) {

@r-devulap Mar 5, 2026


This name is misleading; probably better to rename this to pqScoreEuclidean_preferred?


I have renamed the functions; the code is available here.

I rearranged the code a little bit for better clarity.

int length1 = centroidIndex1 * centroidLength;
int length2 = centroidIndex2 * centroidLength;

if (centroidLength == FloatVector.SPECIES_PREFERRED.length()) {


Why do we need this special case? It introduces a lot of duplicated code across all the functions. Wouldn’t the loop below already handle this scenario correctly?


I did it this way so we don't process the tail when it is not needed, similar to the dotPreferred and dotProductPreferred functions in the file. Not processing the tail seemed to yield some performance benefit in my previous experiments.
Let me do a few more tests and get back to you on this one.

else if (subvectorSizesAndOffsets[0][0] >= FloatVector.SPECIES_128.length()) {
return pqScoreEuclidean_128( codebooks, subvectorSizesAndOffsets, node1Chunk, node1Offset, node2Chunk, node2Offset, subspaceCount);
}
return pqScoreEuclidean_64( codebooks, subvectorSizesAndOffsets, node1Chunk, node1Offset, node2Chunk, node2Offset, subspaceCount);


We currently have 4 separate implementations (pqScoreEuclidean_64/_128/_256/_512) that are structurally identical aside from the FloatVector.SPECIES_* used. This is a lot of duplicated hot-path code and increases the risk of fixes/optimizations diverging across variants.

Could we fold these into a single generic-ish implementation that takes the species as a parameter, e.g.:

  • pqScoreEuclideanImpl(VectorSpecies<Float> species, ...)
  • pqScoreEuclidean(...) selects species (SPECIES_PREFERRED, SPECIES_256, SPECIES_128, SPECIES_64) once and dispatches to the shared impl.

This keeps the vectorized logic in one place and avoids the confusing _512 naming given it actually uses SPECIES_PREFERRED on some paths. It should also make it much easier to keep Euclidean/Dot/Cosine implementations consistent.
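The consolidation proposed above could look roughly like the following. This is a sketch only: an int lane count stands in for VectorSpecies<Float> so the example compiles without the incubator module, the distance inputs are simplified to flat float arrays, and all method names and bodies are hypothetical.

```java
// Rough shape of the proposed consolidation: one shared implementation
// parameterized by lane count (standing in for VectorSpecies<Float>), plus a
// dispatcher that selects the width once. All names here are hypothetical.
class PqScoreDispatch {
    // Shared implementation: lane-wise accumulation, a single reduction at the
    // end (the reduceLanes() analogue), then a scalar tail.
    static float pqScoreEuclideanImpl(int lanes, float[] a, float[] b) {
        float[] acc = new float[lanes];
        int i = 0;
        for (; i + lanes <= a.length; i += lanes) {    // main "vectorized" loop
            for (int l = 0; l < lanes; l++) {
                float d = a[i + l] - b[i + l];
                acc[l] += d * d;
            }
        }
        float sum = 0f;
        for (float v : acc) sum += v;                  // one reduction after the loop
        for (; i < a.length; i++) {                    // scalar tail
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Dispatcher: pick the widest lane count the input length supports, once,
    // mirroring the SPECIES_PREFERRED / _256 / _128 / _64 selection suggested above.
    static float pqScoreEuclidean(float[] a, float[] b) {
        int lanes = a.length >= 16 ? 16 : a.length >= 8 ? 8 : a.length >= 4 ? 4 : 2;
        return pqScoreEuclideanImpl(lanes, a, b);
    }
}
```

Whether the JIT specializes the shared implementation as well as the hand-expanded per-species variants is exactly the open question discussed below; this only shows the dispatch structure.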

@shyla226

First-pass review: there’s quite a bit of duplicated code, and I think we can consolidate it significantly.

@r-devulap, thank you very much for your feedback. I agree there is quite a bit of code duplication, but I had already tried creating a generic function that takes VectorSpecies as an argument.
For the cosine function, something like this:
float pqScoreCosine_generic(final VectorSpecies<Float> SPEC, VectorFloat<?>[] codebooks, int[][] subvectorSizesAndOffsets, ByteSequence<?> encodedChunk, int encodedOffset, VectorFloat<?> centeredQuery, int subspaceCount)

What I found was that if the SPECIES is not known at compile time, we don't see the performance benefit.

I tried it again after seeing your feedback and, unfortunately, I see the same behavior as before.
I am open to other thoughts you have.

@r-devulap

@shyla226 Quick question regarding this patch.

It updates diversityFunctionFor in the abstract PQVectors class, but I'm not convinced the benchmark PQDistanceCalculationBenchmark.diversityCalculation would show any improvement from this change, as you claimed.

In the benchmark setup, pqVectors is populated via pq.encodeAll(ravv), which produces an ImmutablePQVectors. As a result, the call to diversityFunctionFor is dispatched to ImmutablePQVectors.diversityFunctionFor (the overridden implementation that uses precomputed codebook partial sums) rather than the base implementation in PQVectors that this patch modifies. You can confirm this in the attached flame graph: the method MutablePQVectors.diversityFunctionFor never appears in the benchmark.

[Flame graph attachment: flame-graph-benchmark]

That code path is only exercised when the instance is a MutablePQVectors, which does not override the method. The new microbenchmarks you added for MutablePQVectors may indeed show improvements, since they actually hit the modified code.

Would you be able to show whether this change yields improvements in a high-level index-construction benchmark rather than only the microbenchmarks? My understanding is that MutablePQVectors is used during incremental (larger-than-memory) index construction, but I don't believe we currently have benchmarks covering that. @jshook or @MarkWolters can likely confirm.

For the standard in-memory (on-heap) index build, the code path ImmutablePQVectors.diversityFunctionFor isn't on the hot path, and I'm not sure whether incremental index construction would behave differently.
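The dispatch point raised above can be illustrated with simplified stand-in classes. These names are hypothetical: BasePQVectors, ImmutableLike, and MutableLike loosely mirror PQVectors, ImmutablePQVectors, and MutablePQVectors, and the method bodies just return labels so the dispatch is visible.

```java
// Minimal illustration of the dynamic-dispatch point: when the benchmark holds
// a subclass that overrides the method (as ImmutablePQVectors does), the
// patched base-class implementation is never executed. Names are simplified
// stand-ins, not the real jvector classes.
class BasePQVectors {
    String diversityFunctionFor() { return "base implementation (the patched path)"; }
}

class ImmutableLike extends BasePQVectors {
    @Override
    String diversityFunctionFor() { return "override using precomputed partial sums"; }
}

class MutableLike extends BasePQVectors {
    // No override: only this subclass actually reaches the patched base method.
}
```

A benchmark populated with the ImmutableLike analogue therefore only measures the override; the base-class change is invisible to it.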

@shyla226

shyla226 commented Mar 12, 2026

@r-devulap, you have a very good point; if it is not on the critical path there is no point in adding complexity.
Are there any high-level benchmarks which use MutablePQVectors that you would like me to try?
Let me know if you have any other thoughts.
