[DRAFT] FEAT: Dataset Loading Changes #1451

Draft
ValbuenaVC wants to merge 1 commit into Azure:main from ValbuenaVC:datasetloader

ValbuenaVC commented Mar 10, 2026

Description

Features:

  • Adds a filters argument to get_all_dataset_names, which excludes datasets that don't
    meet the filter criteria
  • A DatasetMetadata factory supports both static metadata (like loading rank) and dynamic metadata like size (which, at least for remote datasets, is only known after the dataset has been downloaded)
  • DatasetMetadata dataclass contains size: int, modalities: list[DatasetModalities], source: DatasetSourceType, and loading_rank: DatasetLoadingRank.
  • The fields of DatasetMetadata
  • Each dataset subclass implements the abstract method metadata_factory, which returns the metadata
    and is called in SeedDatasetProvider's subclass call during its init
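
A minimal sketch of how these pieces might fit together, not the code in this PR. The field names on DatasetMetadata, metadata_factory, and get_all_dataset_names come from the description above. The enum members, the registry built in `__init_subclass__`, the example providers, and the assumption that filters is a callable predicate are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum


class DatasetModalities(Enum):  # member names are assumed
    TEXT = "text"
    IMAGE = "image"


class DatasetSourceType(Enum):  # member names are assumed
    LOCAL = "local"
    REMOTE = "remote"


class DatasetLoadingRank(Enum):  # member names are assumed
    DEFAULT = "default"
    EXTENDED = "extended"


@dataclass(frozen=True)
class DatasetMetadata:
    size: int
    modalities: list[DatasetModalities]
    source: DatasetSourceType
    loading_rank: DatasetLoadingRank


class SeedDatasetProvider(ABC):
    # Hypothetical stand-in registry; the real class is not shown here.
    _registry: dict[str, type] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        SeedDatasetProvider._registry[cls.__name__] = cls

    @classmethod
    @abstractmethod
    def metadata_factory(cls) -> DatasetMetadata:
        """Build and return the metadata for this dataset."""

    @classmethod
    def get_all_dataset_names(cls, filters=None) -> list[str]:
        # Assumption: filters is a predicate over DatasetMetadata.
        names = []
        for name, provider in sorted(cls._registry.items()):
            if filters is None or filters(provider.metadata_factory()):
                names.append(name)
        return names


class LocalExample(SeedDatasetProvider):  # hypothetical dataset
    @classmethod
    def metadata_factory(cls) -> DatasetMetadata:
        return DatasetMetadata(
            size=3,
            modalities=[DatasetModalities.TEXT],
            source=DatasetSourceType.LOCAL,
            loading_rank=DatasetLoadingRank.DEFAULT,
        )


class RemoteExample(SeedDatasetProvider):  # hypothetical dataset
    @classmethod
    def metadata_factory(cls) -> DatasetMetadata:
        return DatasetMetadata(
            size=10,
            modalities=[DatasetModalities.TEXT, DatasetModalities.IMAGE],
            source=DatasetSourceType.REMOTE,
            loading_rank=DatasetLoadingRank.EXTENDED,
        )


local_only = SeedDatasetProvider.get_all_dataset_names(
    filters=lambda meta: meta.source is DatasetSourceType.LOCAL
)
```

In this sketch the factory runs every time names are listed, which is where the download-dependent size field becomes awkward for remote datasets.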

Problems:

  • Way too complicated for static attributes that could be class variables
  • Forces metadata generation to wait until dataset is downloaded for derived attributes
  • It would be nice to have SQL ability for all datasets; imagine doing a JOIN operation across different datasets using the same harm category
  • Not a lot of interaction with identifiers which seem like a natural overlap point for tracking dataset metadata
  • Use of a factory method is more explicit, but use of a private attribute is more intuitive. It's unclear which should take precedence
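
To make the SQL wish concrete, here is a rough illustration of a JOIN across two datasets on a shared harm category, using an in-memory SQLite database from the standard library. The table names, column names, and rows are invented for the example and do not reflect how datasets are actually stored.

```python
import sqlite3

# In-memory database with two hypothetical dataset tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset_a (prompt TEXT, harm_category TEXT);
    CREATE TABLE dataset_b (prompt TEXT, harm_category TEXT);
    """
)
conn.executemany(
    "INSERT INTO dataset_a VALUES (?, ?)",
    [("a1", "violence"), ("a2", "fraud")],
)
conn.executemany(
    "INSERT INTO dataset_b VALUES (?, ?)",
    [("b1", "fraud"), ("b2", "privacy")],
)

# Pair up prompts from both datasets that share a harm category.
rows = conn.execute(
    """
    SELECT a.prompt, b.prompt, a.harm_category
    FROM dataset_a AS a
    JOIN dataset_b AS b ON a.harm_category = b.harm_category
    ORDER BY a.prompt, b.prompt
    """
).fetchall()
conn.close()
```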

Possible Solutions:

  • Separate metadata into dynamic and static subtypes that have different paths
  • Use None values for dynamic attributes and populate them when a dataset actually downloads (if invoked in get_all_dataset_names, force downloads)
  • Save rich querying via SQL for a separate PR
  • Migrate seed_dataset to an identifier, which already makes the crucial distinction of static attributes and dynamic (what ComponentIdentifier calls behavioral) attributes
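
The first two options could look roughly like the sketch below: static attributes live on the class, while dynamic ones start as None and are filled in on download. The class names StaticMetadata, DynamicMetadata, and ExampleDataset, and the download method, are hypothetical and only illustrate the split.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StaticMetadata:
    # Known without touching the data; plain strings keep the sketch short.
    source: str
    loading_rank: str


@dataclass
class DynamicMetadata:
    # None means "not known yet"; populated once the dataset is downloaded.
    size: Optional[int] = None


class ExampleDataset:  # hypothetical dataset class
    static = StaticMetadata(source="remote", loading_rank="default")

    def __init__(self) -> None:
        self.dynamic = DynamicMetadata()

    def download(self) -> list[str]:
        rows = ["p1", "p2", "p3"]  # stand-in for a real fetch
        self.dynamic.size = len(rows)
        return rows


ds = ExampleDataset()
size_before = ds.dynamic.size
ds.download()
size_after = ds.dynamic.size
```

A filter on size would then have to either treat None as "unknown" or trigger the download first, which is the trade-off the second bullet describes.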

Tests and Documentation

  • For remote, just test that the file writes to test and connection is served
  • For local, test that one entry makes it into the patched DB
