Conversation
update dependency
|
Before you submit for review:
If you did not complete any of these, then please explain below. |
MarkWolters
left a comment
There was a problem hiding this comment.
a couple nit-picky suggestions but looks good
| return Optional.empty(); | ||
| } | ||
|
|
||
| // If it exists locally, we're good |
There was a problem hiding this comment.
I wonder if it would make sense to pub this check before the KNOWN_DATASETS check as a way of allowing the user to add their own hdf5 datasets that are not part of the canonical set
There was a problem hiding this comment.
Agreed. If the user adds a dataset locally, it should always take precedence over the other available sources.
| // Download from https://ann-benchmarks.com/datasetName | ||
| var url = "https://ann-benchmarks.com/" + datasetName + HDF5_EXTN; | ||
| System.out.println("Downloading: " + url); | ||
| logger.info("Downloading: {}", url); |
There was a problem hiding this comment.
Should this not come after the file is found? Currently this prints for every dataset, even non-hdf5, which is annoying. I realize you put checks in for dataset existence before we get here but it could still print this, then get an HTTP_NOT_FOUND and return Optional.empty
| return NAME; | ||
| } | ||
|
|
||
| private static final java.util.Set<String> KNOWN_DATASETS = java.util.Set.of( |
There was a problem hiding this comment.
This is not a representative set of datasets. Let's align this better with datasets.yml. Also, we do not support the jaccard metric. These datasets should be removed.
| return NAME; | ||
| } | ||
|
|
||
| private static final java.util.Set<String> KNOWN_DATASETS = java.util.Set.of( |
There was a problem hiding this comment.
I was not able load any of these using BenchYAML or AutoBenchYAML due to a missing catalogs.yaml. If this is supposed to be supplied by the user, please add it to our documentation. If not, please provide a reasonable default (probably in either case).
| similarityFunction = VectorSimilarityFunction.EUCLIDEAN; | ||
| } | ||
| else { | ||
| throw new IllegalArgumentException("Unknown similarity function -- expected angular or euclidean for " + filename); |
There was a problem hiding this comment.
Let's use the terminology we use elsewhere: cosine, l2 (Euclidean is fine), and dot product. And dot product should map to dot product, not cosine.
Also, it's pretty precarious selecting the VSF based on the file name.
| ProfileSelector selector = entry.select(); | ||
| view = spec.profile().map(selector::profile).orElseGet(selector::profile); | ||
| } else { | ||
| // Fallback to local load |
There was a problem hiding this comment.
It should be the other way around. Local takes priority because: 1) it's the easiest way to override when experimenting with new data, 2) data is already downloaded.
| logger.info("Prebuffering dataset '{}'...", dataSetName); | ||
| CompletableFuture<Void> f = view.prebuffer(); | ||
| if (f instanceof ProgressIndicatingFuture) { | ||
| System.out.println("blocking until prebuffer completes, with progress reporting..."); |
There was a problem hiding this comment.
Not sure the blocking comment is necessary. Already said it is prebuffering (and the prebuffer progress should be reported).
| } | ||
|
|
||
| private VectorSimilarityFunction mapDistanceFunction(DistanceFunction df) { | ||
| if (df == null) return COSINE; |
There was a problem hiding this comment.
Why wouldn't df == null be an error?
If there is a need for a default, it should be DOT_PRODUCT and is only safe to arbitrarily use in all cases if the vectors are normalized.
Also, shouldn't this return VectorSimilarityFunction.COSINE?
tlwillke
left a comment
There was a problem hiding this comment.
Please see my comments for requested changes. It's looking good so far, but I cannot evaluate it further until the catalog.yaml comment is addressed.
This PR activates the vectodata loader as the third option after the HDF5 and MFD loaders.
If a dataset is not found in either of these, then it will be loaded from vectordata sources so long as the dataset is visible.
Users will want to add a ~/.config/vectordata/catalogs.yaml file to get access to their own or group shared datasets.