feat: Add metadata-only replace API to Table for REPLACE snapshot operations #3131

Open
qzyu999 wants to merge 5 commits into apache:main from qzyu999:feature/core-rewrite-api

qzyu999 commented Mar 9, 2026

Closes #3130

Rationale for this change

In a current PR (#3124, part of #1092), the proposed replace() API accepts a PyArrow table (pa.Table), which forces the table engine to physically serialize data during a metadata transaction commit. This couples execution with the catalog, diverges from the behavior of Java Iceberg's native RewriteFiles builder, and fails to register the snapshot under Operation.REPLACE.

This PR redesigns table.replace() and transaction.replace() to accept Iterable[DataFile] inputs. With physical data writing externalized (e.g., compaction via Ray), the new explicit, metadata-only _RewriteFiles SnapshotProducer can swap snapshot pointers in the manifests directly. DELETED entries inherit their original sequence numbers so that time travel remains equivalent.
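To make the metadata-only swap concrete, here is a minimal toy model of the idea. It does not use PyIceberg and is not the PR's _RewriteFiles implementation: `Entry`, `Status`, and `rewrite_files` are hypothetical names, files are identified by plain path strings, and giving added files the new snapshot's sequence number is an assumption of this sketch.

```python
from dataclasses import dataclass, replace as dc_replace
from enum import Enum
from typing import Iterable, List


class Status(Enum):
    EXISTING = 0
    ADDED = 1
    DELETED = 2


@dataclass(frozen=True)
class Entry:
    """Toy manifest entry: a file path plus its status and sequence number."""

    path: str
    status: Status
    sequence_number: int


def rewrite_files(
    current: List[Entry],
    files_to_delete: Iterable[str],
    files_to_add: Iterable[str],
    new_sequence_number: int,
) -> List[Entry]:
    """Build the entries of a new snapshot without touching any data."""
    to_delete = set(files_to_delete)
    live = {e.path for e in current if e.status is not Status.DELETED}
    missing = to_delete - live
    if missing:
        raise ValueError(f"cannot delete files that are not live: {sorted(missing)}")

    result: List[Entry] = []
    for entry in current:
        if entry.status is Status.DELETED:
            continue  # entries deleted by an earlier snapshot are not carried forward
        if entry.path in to_delete:
            # Flip the status to DELETED but keep the original sequence number.
            result.append(dc_replace(entry, status=Status.DELETED))
        else:
            result.append(dc_replace(entry, status=Status.EXISTING))
    # Assumption of this sketch: new files take the new snapshot's sequence number.
    for path in files_to_add:
        result.append(Entry(path, Status.ADDED, new_sequence_number))
    return result
```

The point of the sketch is that the function only rearranges entries: the replacement files must already exist, written by whatever engine performed the compaction.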

Are these changes tested?

Yes.

Test coverage has been added to tests/table/test_replace.py. The suite validates:

  1. Context-manager execution grows the history as expected (len(table.history())).
  2. The snapshot summary carries the Operation.REPLACE tag.
  3. Delta metrics are accurate (added/deleted file and record counts).
  4. Low-level serialization: the tests call manifest.fetch_manifest_entry(discard_deleted=False) to bypass the high-level discard filters and assert that status=DELETED entries preserve their Avro sequence numbers.
  5. The idempotent edge case where replace([], []) short-circuits the commit loop without mutating history.
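Checks 2, 3, and 5 above can be illustrated with a small stand-alone sketch. It is a toy model rather than the actual test code: `FileStub`, `replace_summary`, and `commit_replace` are hypothetical names, and the history is just a list of summary dictionaries. The summary keys follow the Iceberg snapshot summary naming (for example `added-data-files`).

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class FileStub:
    """Hypothetical stand-in for a data file: a path and its row count."""

    path: str
    record_count: int


def replace_summary(deleted: Sequence[FileStub], added: Sequence[FileStub]) -> Dict[str, str]:
    """Compute the delta metrics for a replace; values are strings, as in snapshot summaries."""
    return {
        "operation": "replace",
        "added-data-files": str(len(added)),
        "deleted-data-files": str(len(deleted)),
        "added-records": str(sum(f.record_count for f in added)),
        "deleted-records": str(sum(f.record_count for f in deleted)),
    }


def commit_replace(
    history: List[Dict[str, str]],
    deleted: Sequence[FileStub],
    added: Sequence[FileStub],
) -> bool:
    """Append a summary to the toy history; return False if nothing was committed."""
    if not deleted and not added:
        return False  # empty replace: short-circuit, history stays untouched
    history.append(replace_summary(deleted, added))
    return True
```

A compaction that only rewrites data should leave the record totals balanced, so comparing `added-records` with `deleted-records` is a cheap sanity check in this model.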

Are there any user-facing changes?

Yes.

The method signature of Table.replace() and Transaction.replace() has changed from the original PR #3124.
It no longer accepts a PyArrow table (df: pa.Table). Instead, it takes two arguments:
files_to_delete: Iterable[DataFile] and files_to_add: Iterable[DataFile], following the convention of the Java implementation.
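The call shape described above might look like the following sketch. Only the method name `replace` and the argument names `files_to_delete` and `files_to_add` come from this PR. `DataFileStub`, `RecordingTable`, and `swap_compacted_files` are hypothetical stand-ins, used so the example runs without PyIceberg or a catalog.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class DataFileStub:
    """Hypothetical stand-in for a DataFile; only the path is modelled."""

    file_path: str


@dataclass
class RecordingTable:
    """Hypothetical table double that records replace() calls instead of committing."""

    calls: List[Tuple[List[DataFileStub], List[DataFileStub]]] = field(default_factory=list)

    def replace(
        self,
        files_to_delete: Iterable[DataFileStub],
        files_to_add: Iterable[DataFileStub],
    ) -> None:
        # Materialize the iterables once; generators can only be consumed a single time.
        self.calls.append((list(files_to_delete), list(files_to_add)))


def swap_compacted_files(
    table: RecordingTable,
    old_files: Iterable[DataFileStub],
    compacted_files: Iterable[DataFileStub],
) -> None:
    """The caller has already written compacted_files; this only swaps the metadata."""
    table.replace(files_to_delete=old_files, files_to_add=compacted_files)
```

Because both arguments are plain iterables, a caller could pass lists, tuples, or generators produced by an external compaction job.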

(Please add the changelog label)

qzyu999 added 4 commits March 9, 2026 15:40
- Fixed positional argument type mismatch for `snapshot_properties` in [_RewriteFiles](iceberg-python/pyiceberg/table/update/snapshot.py)
- Added missing `Catalog` type annotations to pytest fixtures in [test_replace.py](iceberg-python/tests/table/test_replace.py)
- Added strict `is not None` assertions for `table.current_snapshot()` to satisfy mypy Optional checking
- Auto-formatted tests with ruff
…ass enum validation (Operation.REPLACE is valid so we can no longer use it in test_invalid_operation)