feat: Add metadata-only replace API to Table for REPLACE snapshot operations #3131
Open
qzyu999 wants to merge 5 commits into apache:main
- Fixed positional argument type mismatch for `snapshot_properties` in [_RewriteFiles](iceberg-python/pyiceberg/table/update/snapshot.py)
- Added missing `Catalog` type annotations to pytest fixtures in [test_replace.py](iceberg-python/tests/table/test_replace.py)
- Added strict `is not None` assertions for `table.current_snapshot()` to satisfy mypy Optional checking
- Auto-formatted tests with ruff
…ass enum validation (Operation.REPLACE is valid so we can no longer use it in test_invalid_operation)
Closes #3130
Rationale for this change
In a current PR (#3124, part of #1092), the proposed `replace()` API accepts a PyArrow table (`pa.Table`), forcing the table engine to physically serialize data during a metadata transaction commit. This couples execution with the catalog, diverges from the behavior of Java Iceberg's native `RewriteFiles` builder, and fails to register under `Operation.REPLACE`.

This PR redesigns `table.replace()` and `transaction.replace()` to accept `Iterable[DataFile]` inputs. With physical data writing externalized (e.g., compaction via Ray), the new metadata-only `_RewriteFiles` SnapshotProducer swaps the file entries in the manifests without writing any data. `DELETED` entries inherit the sequence numbers of the entries they replace, which preserves time-travel equivalence.

Are these changes tested?
Yes.
Exhaustive test coverage has been added to `tests/table/test_replace.py`. The suite validates:
- … (`len(table.history())`).
- … `Operation.REPLACE` tags.
- … `manifest.fetch_manifest_entry(discard_deleted=False)` to natively assert that `status=DELETED` overrides are accurately preserving Avro sequence numbers.
- … `replace([], [])` successfully short-circuits the commit loop without mutating history.

Are there any user-facing changes?
Yes.
The method signature for `Table.replace()` and `Transaction.replace()` has been updated from the original PR #3124.

It no longer accepts a PyArrow table (`df: pa.Table`). Instead, it now takes two arguments, `files_to_delete: Iterable[DataFile]` and `files_to_add: Iterable[DataFile]`, following the convention seen in the Java implementation.

(Please add the `changelog` label)
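
To illustrate the intended semantics for reviewers, here is a minimal, self-contained toy model of a metadata-only replace. It is not the PyIceberg implementation and does not import `pyiceberg`: `ToyTable`, `Entry`, and `Status` are hypothetical names, files are plain path strings rather than `DataFile` objects, and added files simply take the new snapshot's sequence number (a simplification). It only sketches the three behaviors described above: deleted entries keep their original sequence number, the snapshot is recorded as a replace, and an empty replace is a no-op.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterable, List


class Status(Enum):
    EXISTING = "existing"
    ADDED = "added"
    DELETED = "deleted"


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a manifest entry."""
    path: str
    status: Status
    sequence_number: int


@dataclass
class ToyTable:
    """Toy model only: tracks the current snapshot's entries and an operation log."""
    entries: List[Entry] = field(default_factory=list)
    history: List[str] = field(default_factory=list)
    next_seq: int = 1

    def append(self, paths: Iterable[str]) -> None:
        seq = self.next_seq
        self.next_seq += 1
        # Carry live entries forward as EXISTING, keeping their sequence numbers.
        carried = [
            Entry(e.path, Status.EXISTING, e.sequence_number)
            for e in self.entries
            if e.status is not Status.DELETED
        ]
        carried.extend(Entry(p, Status.ADDED, seq) for p in paths)
        self.entries = carried
        self.history.append("append")

    def replace(self, files_to_delete: Iterable[str], files_to_add: Iterable[str]) -> None:
        to_delete = set(files_to_delete)
        to_add = list(files_to_add)
        if not to_delete and not to_add:
            return  # empty replace: no new snapshot, history untouched

        live = {e.path for e in self.entries if e.status is not Status.DELETED}
        missing = to_delete - live
        if missing:
            raise ValueError(f"not live in current snapshot: {sorted(missing)}")

        seq = self.next_seq
        self.next_seq += 1
        rewritten: List[Entry] = []
        for e in self.entries:
            if e.status is Status.DELETED:
                continue  # older tombstones are not carried forward
            status = Status.DELETED if e.path in to_delete else Status.EXISTING
            # DELETED and EXISTING entries both inherit the original sequence number.
            rewritten.append(Entry(e.path, status, e.sequence_number))
        rewritten.extend(Entry(p, Status.ADDED, seq) for p in to_add)
        self.entries = rewritten
        self.history.append("replace")
```

In this sketch, compacting `a.parquet` and `b.parquet` into `ab.parquet` leaves two `DELETED` entries that still carry the sequence number of the original append, while the compacted file is the only `ADDED` entry in the new snapshot. No data is read or written at any point, which is the property the PR aims for.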