improve dataset loading, add vectordata source #643

Open

jshook wants to merge 2 commits into main from datasets_update

Contributor

jshook commented Mar 10, 2026

This PR activates the vectordata loader as the third option after the HDF5 and MFD loaders.
If a dataset is not found by either of these, it is loaded from vectordata sources, provided the dataset is visible there.
Users will want to add a ~/.config/vectordata/catalogs.yaml file to get access to their own or group-shared datasets.
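The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SketchLoader`, `LoaderChain`, and the `fixed` helper are hypothetical names, and a loaded dataset is represented by a plain string label.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for a dataset loader; not the project's actual interface.
interface SketchLoader {
    // Returns a label for the loaded dataset, or empty if this loader does not have it.
    Optional<String> load(String datasetName);
}

final class LoaderChain {
    private final List<SketchLoader> loaders;

    LoaderChain(List<SketchLoader> loaders) {
        this.loaders = List.copyOf(loaders);
    }

    // Tries each loader in order and returns the first non-empty result.
    Optional<String> load(String datasetName) {
        for (SketchLoader loader : loaders) {
            Optional<String> result = loader.load(datasetName);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    // Illustrative loader that only "finds" names from a fixed set.
    static SketchLoader fixed(String label, Set<String> visible) {
        return name -> visible.contains(name)
                ? Optional.of(label + ":" + name)
                : Optional.empty();
    }
}
```

With loaders registered in the order HDF5, MFD, vectordata, a name visible to both the first and the third resolves through the first, and a name visible to none yields an empty result.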


improve dataset loading, add vectordata source

14e0763

update dependency

jshook requested review from MarkWolters and tlwillke as code owners

March 10, 2026 19:57

Contributor

github-actions bot commented Mar 10, 2026 •

edited by jshook


Before you submit for review:

Does your PR follow guidelines from CONTRIBUTIONS.md?
Did you summarize what this PR does clearly and concisely?
Did you include performance data for changes which may be performance impacting?
Did you include useful docs for any user-facing changes or features?
Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
Did you trigger and review regression testing results against the base branch via Run Bench Main?
Did you adhere to the code formatting guidelines (TBD)?
Did you group your changes for easy review, providing meaningful descriptions for each commit?
Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.


update vectordata dep to fixed multi-jar version

bea1a72

MarkWolters approved these changes


Contributor

MarkWolters left a comment

a couple nit-picky suggestions but looks good

...s/src/main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderHDF5.java

+                          return Optional.empty();
+                      }
+                      // If it exists locally, we're good

Contributor

MarkWolters Mar 12, 2026

I wonder if it would make sense to put this check before the KNOWN_DATASETS check as a way of allowing the user to add their own HDF5 datasets that are not part of the canonical set.

Collaborator

tlwillke Mar 13, 2026

Agreed. If the user adds a dataset locally, it should always take precedence over the other available sources.
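A rough sketch of the ordering suggested in this thread: check for a local file first, and consult the known-dataset set only when nothing is found locally. `LocalFirstResolver` and `Source` are hypothetical names, and the `.hdf5` suffix is an assumption standing in for `HDF5_EXTN`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.Set;

// Hypothetical resolver; the names and the ".hdf5" suffix are assumptions for illustration.
final class LocalFirstResolver {
    enum Source { LOCAL, REMOTE }

    private static final String EXTENSION = ".hdf5"; // assumed value of HDF5_EXTN

    private final Path localDir;
    private final Set<String> knownDatasets;

    LocalFirstResolver(Path localDir, Set<String> knownDatasets) {
        this.localDir = localDir;
        this.knownDatasets = Set.copyOf(knownDatasets);
    }

    // A local file always wins; the canonical set is only a fallback.
    Optional<Source> resolve(String datasetName) {
        Path candidate = localDir.resolve(datasetName + EXTENSION);
        if (Files.exists(candidate)) {
            return Optional.of(Source.LOCAL);
        }
        if (knownDatasets.contains(datasetName)) {
            return Optional.of(Source.REMOTE);
        }
        return Optional.empty();
    }
}
```

Under this ordering, a user-supplied file that is not in the canonical set is still picked up, and a locally present copy of a canonical dataset is never downloaded again.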

...s/src/main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderHDF5.java

                       // Download from https://ann-benchmarks.com/datasetName
                       var url = "https://ann-benchmarks.com/" + datasetName + HDF5_EXTN;
-                      System.out.println("Downloading: " + url);
+                      logger.info("Downloading: {}", url);

Contributor

MarkWolters Mar 12, 2026

Shouldn't this come after the file is found? Currently this prints for every dataset, even non-HDF5 ones, which is annoying. I realize you put in checks for dataset existence before we get here, but it could still print this, then get an HTTP_NOT_FOUND and return Optional.empty.

Collaborator

tlwillke Mar 13, 2026

Ditto comment on #637!
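One possible way to address this, sketched under assumptions: probe the URL first and log "Downloading" only once the remote file is confirmed to exist. `QuietDownloader` is a hypothetical name, the status probe is injected as a function (in real code it might issue a HEAD request), and the log is captured in a list instead of going through the logger so the sketch stays self-contained.

```java
import java.net.HttpURLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.ToIntFunction;

// Hypothetical helper; the probe function stands in for a real HTTP status check.
final class QuietDownloader {
    private final ToIntFunction<String> statusProbe;
    final List<String> log = new ArrayList<>(); // stand-in for logger.info(...)

    QuietDownloader(ToIntFunction<String> statusProbe) {
        this.statusProbe = statusProbe;
    }

    // Returns the URL to download, or empty when the remote file does not exist.
    Optional<String> fetch(String url) {
        int status = statusProbe.applyAsInt(url);
        if (status == HttpURLConnection.HTTP_NOT_FOUND) {
            return Optional.empty(); // nothing is logged for datasets this source lacks
        }
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IllegalStateException("Unexpected HTTP status " + status + " for " + url);
        }
        log.add("Downloading: " + url); // only reached once the file is known to exist
        return Optional.of(url);
    }
}
```

With this shape, a dataset that is not hosted remotely produces no "Downloading" line at all.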

tlwillke reviewed


...s/src/main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderHDF5.java

+                      return NAME;
+                  }
+                  private static final java.util.Set<String> KNOWN_DATASETS = java.util.Set.of(

Collaborator

tlwillke Mar 13, 2026

This is not a representative set of datasets. Let's align this better with datasets.yml. Also, we do not support the jaccard metric. These datasets should be removed.

tlwillke reviewed


...s/src/main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderHDF5.java

+                      return NAME;
+                  }
+                  private static final java.util.Set<String> KNOWN_DATASETS = java.util.Set.of(

Collaborator

tlwillke Mar 13, 2026

I was not able to load any of these using BenchYAML or AutoBenchYAML due to a missing catalogs.yaml. If this is supposed to be supplied by the user, please add it to our documentation. If not, please provide a reasonable default (probably in either case).

tlwillke reviewed


...s/src/main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderHDF5.java

                           similarityFunction = VectorSimilarityFunction.EUCLIDEAN;
                       }
                       else {
                           throw new IllegalArgumentException("Unknown similarity function -- expected angular or euclidean for " + filename);

Collaborator

tlwillke Mar 13, 2026

Let's use the terminology we use elsewhere: cosine, l2 (Euclidean is fine), and dot product. And dot product should map to dot product, not cosine.

Also, it's pretty precarious selecting the VSF based on the file name.
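A minimal sketch of what an explicit metric-name mapping could look like, assuming the metric is supplied as a string (for example from dataset metadata) instead of being inferred from the file name. `SketchSimilarity` is a local stand-in for `VectorSimilarityFunction`, `MetricNames` is a hypothetical helper, and treating "angular" as an alias for cosine is an assumption carried over from the existing file-name convention.

```java
import java.util.Locale;

// Local stand-in for the three similarity functions named in the review.
enum SketchSimilarity { COSINE, EUCLIDEAN, DOT_PRODUCT }

// Hypothetical helper; the accepted aliases are assumptions for illustration.
final class MetricNames {
    private MetricNames() {}

    static SketchSimilarity parse(String metric) {
        if (metric == null) {
            throw new IllegalArgumentException("metric must not be null");
        }
        switch (metric.trim().toLowerCase(Locale.ROOT)) {
            case "cosine":
            case "angular": // assumed alias, kept for existing dataset names
                return SketchSimilarity.COSINE;
            case "l2":
            case "euclidean":
                return SketchSimilarity.EUCLIDEAN;
            case "dot":
            case "dot_product":
            case "dot product":
                return SketchSimilarity.DOT_PRODUCT; // maps to dot product, never to cosine
            default:
                throw new IllegalArgumentException(
                        "Unknown similarity function '" + metric + "': expected cosine, l2, or dot product");
        }
    }
}
```

Unsupported metrics such as jaccard fall through to the exception instead of being silently mapped.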

tlwillke reviewed


...main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderVectordata.java

+                              ProfileSelector selector = entry.select();
+                              view = spec.profile().map(selector::profile).orElseGet(selector::profile);
+                          } else {
+                              // Fallback to local load

Collaborator

tlwillke Mar 13, 2026

It should be the other way around. Local takes priority because: 1) it's the easiest way to override when experimenting with new data, 2) data is already downloaded.

tlwillke reviewed


...main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderVectordata.java

+                              logger.info("Prebuffering dataset '{}'...", dataSetName);
+                              CompletableFuture<Void> f = view.prebuffer();
+                              if (f instanceof ProgressIndicatingFuture) {
+                                  System.out.println("blocking until prebuffer completes, with progress reporting...");

Collaborator

tlwillke Mar 13, 2026

Not sure the blocking comment is necessary. Already said it is prebuffering (and the prebuffer progress should be reported).

tlwillke reviewed


...main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderVectordata.java

+                  }
+                  private VectorSimilarityFunction mapDistanceFunction(DistanceFunction df) {
+                      if (df == null) return COSINE;

Collaborator

tlwillke Mar 13, 2026

Why wouldn't df == null be an error?

If there is a need for a default, it should be DOT_PRODUCT and is only safe to arbitrarily use in all cases if the vectors are normalized.

Also, shouldn't this return VectorSimilarityFunction.COSINE?
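For illustration, a version of the mapping that treats a missing distance function as an error could look roughly like this. `SketchDistanceFunction` and `SketchVsf` are local stand-ins for `DistanceFunction` and `VectorSimilarityFunction`, with assumed constant names, and `DistanceMapping` is hypothetical.

```java
import java.util.Objects;

// Local stand-ins; the real DistanceFunction and VectorSimilarityFunction types may differ.
enum SketchDistanceFunction { COSINE, EUCLIDEAN, DOT_PRODUCT }

enum SketchVsf { COSINE, EUCLIDEAN, DOT_PRODUCT }

// Hypothetical mapper that fails fast instead of defaulting.
final class DistanceMapping {
    private DistanceMapping() {}

    static SketchVsf map(SketchDistanceFunction df) {
        // No silent default: a dataset without a declared metric is rejected.
        Objects.requireNonNull(df, "dataset does not declare a distance function");
        switch (df) {
            case COSINE:
                return SketchVsf.COSINE;
            case EUCLIDEAN:
                return SketchVsf.EUCLIDEAN;
            case DOT_PRODUCT:
                return SketchVsf.DOT_PRODUCT;
            default:
                throw new IllegalArgumentException("Unsupported distance function: " + df);
        }
    }
}
```

If a default turns out to be needed after all, it would replace the `requireNonNull` call and should be documented along with the normalization caveat raised above.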

tlwillke requested changes


Collaborator

tlwillke left a comment

Please see my comments for requested changes. It's looking good so far, but I cannot evaluate it further until the catalogs.yaml comment is addressed.

