Skip to content

improve dataset loading, add vectordata source#643

Open
jshook wants to merge 2 commits intomainfrom
datasets_update
Open

improve dataset loading, add vectordata source#643
jshook wants to merge 2 commits intomainfrom
datasets_update

Conversation

@jshook
Copy link
Contributor

@jshook jshook commented Mar 10, 2026

This PR activates the vectodata loader as the third option after the HDF5 and MFD loaders.
If a dataset is not found in either of these, then it will be loaded from vectordata sources so long as the dataset is visible.
Users will want to add a ~/.config/vectordata/catalogs.yaml file to get access to their own or group shared datasets.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 10, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

Copy link
Contributor

@MarkWolters MarkWolters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple nit-picky suggestions but looks good

return Optional.empty();
}

// If it exists locally, we're good
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would make sense to pub this check before the KNOWN_DATASETS check as a way of allowing the user to add their own hdf5 datasets that are not part of the canonical set

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. If the user adds a dataset locally, it should always take precedence over the other available sources.

// Download from https://ann-benchmarks.com/datasetName
var url = "https://ann-benchmarks.com/" + datasetName + HDF5_EXTN;
System.out.println("Downloading: " + url);
logger.info("Downloading: {}", url);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not come after the file is found? Currently this prints for every dataset, even non-hdf5, which is annoying. I realize you put checks in for dataset existence before we get here but it could still print this, then get an HTTP_NOT_FOUND and return Optional.empty

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto comment on #637!

return NAME;
}

private static final java.util.Set<String> KNOWN_DATASETS = java.util.Set.of(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a representative set of datasets. Let's align this better with datasets.yml. Also, we do not support the jaccard metric. These datasets should be removed.

return NAME;
}

private static final java.util.Set<String> KNOWN_DATASETS = java.util.Set.of(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not able load any of these using BenchYAML or AutoBenchYAML due to a missing catalogs.yaml. If this is supposed to be supplied by the user, please add it to our documentation. If not, please provide a reasonable default (probably in either case).

similarityFunction = VectorSimilarityFunction.EUCLIDEAN;
}
else {
throw new IllegalArgumentException("Unknown similarity function -- expected angular or euclidean for " + filename);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use the terminology we use elsewhere: cosine, l2 (Euclidean is fine), and dot product. And dot product should map to dot product, not cosine.

Also, it's pretty precarious selecting the VSF based on the file name.

ProfileSelector selector = entry.select();
view = spec.profile().map(selector::profile).orElseGet(selector::profile);
} else {
// Fallback to local load
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be the other way around. Local takes priority because: 1) it's the easiest way to override when experimenting with new data, 2) data is already downloaded.

logger.info("Prebuffering dataset '{}'...", dataSetName);
CompletableFuture<Void> f = view.prebuffer();
if (f instanceof ProgressIndicatingFuture) {
System.out.println("blocking until prebuffer completes, with progress reporting...");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the blocking comment is necessary. Already said it is prebuffering (and the prebuffer progress should be reported).

}

private VectorSimilarityFunction mapDistanceFunction(DistanceFunction df) {
if (df == null) return COSINE;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't df == null be an error?

If there is a need for a default, it should be DOT_PRODUCT and is only safe to arbitrarily use in all cases if the vectors are normalized.

Also, shouldn't this return VectorSimilarityFunction.COSINE?

Copy link
Collaborator

@tlwillke tlwillke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comments for requested changes. It's looking good so far, but I cannot evaluate it further until the catalog.yaml comment is addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants