feat: Add table.maintenance.compact() for full-table data file compaction #3124
qzyu999 wants to merge 7 commits into apache:main
Conversation
This introduces a simplified, whole-table compaction strategy via the MaintenanceTable API (`table.maintenance.compact()`).

Key implementation details:

- Reads the entire table state into memory via `.to_arrow()`.
- Uses `table.overwrite()` to rewrite data, leveraging PyIceberg's target file bin-packing (`write.target-file-size-bytes`) natively.
- Ensures atomicity by executing within a table transaction.
- Explicitly sets `snapshot-type: replace` and `replace-operation: compaction` to ensure correct metadata history for downstream engines.
- Includes a guard to safely ignore compaction requests on empty tables.

Includes full Pytest coverage in `tests/table/test_maintenance.py`. Closes apache#1092
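The bin-packing that `write.target-file-size-bytes` drives happens inside PyIceberg's write path when the Arrow table is written back. As a rough illustration of the idea only (this is not the actual PyIceberg implementation), a greedy pass that groups input file sizes toward a target looks like:

```python
def bin_pack(file_sizes: list[int], target_bytes: int) -> list[list[int]]:
    """Greedily group file sizes into bins of roughly target_bytes.

    Illustrative sketch only: PyIceberg's real bin-packing lives in its
    write path and is configured via write.target-file-size-bytes.
    """
    bins: list[list[int]] = []
    current: list[int] = []
    total = 0
    for size in file_sizes:
        # Start a new bin once adding this file would exceed the target
        if current and total + size > target_bytes:
            bins.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        bins.append(current)
    return bins

# Three small files merge toward the target; the large file gets its own bin
print(bin_pack([40, 40, 40, 100], 100))  # → [[40, 40], [40], [100]]
```

This is why compacting many small files tends to yield one near-target-size file per partition, as the tests below check.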
pyiceberg/table/maintenance.py
```python
# Overwrite the table atomically (REPLACE operation)
with self.tbl.transaction() as txn:
    txn.overwrite(arrow_table, snapshot_properties={"snapshot-type": "replace", "replace-operation": "compaction"})
```
I think we should have a replace operation instead:
https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/DataOperations.html#REPLACE
We might want to create `.replace()` first.
Hi @kevinjqliu, thanks for the insight, I agree with what you're saying in terms of building a replace rather than just reusing the overwrite. I've refactored the compaction run to properly use a .replace() API, following the design of the Java Iceberg implementation.
The approach is to create a new `_RewriteFiles` in `pyiceberg/table/update/snapshot.py`, which uses the new `Operation.REPLACE` added in the same module. `_RewriteFiles` effectively mimics the `_OverwriteFiles` operation, except that it records `Operation.REPLACE` instead of `Operation.OVERWRITE`. This allows `MaintenanceTable.compact()` to perform a proper `txn.replace()` rather than reusing `txn.overwrite()`.
I also think it's worth noting that by adding Operation.REPLACE, we make room for the needed rewrite manifests (#270) and delete orphan files functionality (#1200).
```python
after_files = list(table.scan().plan_files())
assert len(after_files) == 3  # Should be 1 optimized data file per partition
assert table.scan().to_arrow().num_rows == 120
```
since its a small result set, we should verify the data is the same too
Hi @kevinjqliu, made a change in 6420027 to check that the columns and the primary keys remain the same before/after compaction.
pyiceberg/table/maintenance.py
```python
    return ExpireSnapshots(transaction=Transaction(self.tbl, autocommit=True))

def compact(self) -> None:
    """Compact the table's data files by reading and overwriting the entire table.
```
This should be data and delete files, but generally it compacts the entire table.
Hi @kevinjqliu, made the update to the docstring here: 9fd51a8.
…ction in test_maintenance_compact()
Formats the [compact](iceberg-python/pyiceberg/table/maintenance.py) method docstring to ensure the summary line does not wrap and correctly ends with a period, satisfying pydocstyle D205 and D400 rules.
Replaces the use of .overwrite() in MaintenanceTable.compact() with a new .replace() API backed by a _RewriteFiles producer. This ensures compaction now generates an Operation.REPLACE snapshot instead of Operation.OVERWRITE, preserving logical table state for downstream consumers. Fixes apache#1092
```python
    for data_file in data_files:
        append_files.append_data_file(data_file)

def replace(
```
Let's add replace on its own since it's a pretty significant change, and follow up with table compaction.
I think there are a few more things we need to add to the replace operation. It would be a good idea to look into the Java side. For example, how can we ensure that the table's data remains the same? REPLACE means no data change. If we cannot guarantee that the data remains the same, maybe we should not expose a replace function that takes a df as a parameter.
Hi @kevinjqliu, I created an issue (#3130) and a corresponding PR (#3131) to address the need to create a separate PR for replace. When approved, we can use that to build and complete this current PR for compaction. We can move this discussion to there and come back when finished.
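One cheap guard along the lines the reviewer raises is to compare snapshot summary counters before and after the rewrite: Iceberg snapshot summaries carry fields such as `total-records`, which a true REPLACE must leave unchanged. The helper below is a hypothetical sketch, not part of this PR or the PyIceberg API:

```python
def validate_no_data_change(before_summary: dict, after_summary: dict) -> None:
    """Raise if record counts differ between two snapshot summaries.

    Hypothetical guard: this only catches count-level changes. A real
    no-data-change guarantee needs content comparison (or the kind of
    validation the Java RewriteFiles API performs).
    """
    for field in ("total-records",):
        if before_summary.get(field) != after_summary.get(field):
            raise ValueError(
                f"REPLACE must not change data: {field} went from "
                f"{before_summary.get(field)} to {after_summary.get(field)}"
            )


# Unchanged counts pass silently
validate_no_data_change({"total-records": "120"}, {"total-records": "120"})
```

A check like this would argue for exposing replace as an internal producer fed by rewritten files, rather than a public function that accepts an arbitrary dataframe.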
Closes #1092

Rationale for this change

This introduces a simplified, whole-table compaction strategy via the MaintenanceTable API (`table.maintenance.compact()`).

Key implementation details:

- Reads the entire table state into memory via `.to_arrow()`.
- Uses `table.overwrite()` to rewrite data, leveraging PyIceberg's target file bin-packing (`write.target-file-size-bytes`) natively.
- Ensures atomicity by executing within a table transaction.
- Explicitly sets `snapshot-type: replace` and `replace-operation: compaction` to ensure correct metadata history for downstream engines.

Are these changes tested?

Includes full Pytest coverage in `tests/table/test_maintenance.py`.

Are there any user-facing changes?

Yes. This PR adds a new `compact()` method to the TableMaintenance API, allowing users to perform file compaction on existing Iceberg tables.
Example usage:
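A plausible usage sketch follows; the catalog and table names are illustrative, and the call shape follows this PR's description of the API:

```python
from pyiceberg.catalog import load_catalog

# Illustrative names; any configured catalog and existing table work
catalog = load_catalog("default")
table = catalog.load_table("db.events")

# Rewrite the table's data files in a single atomic snapshot,
# bin-packed toward write.target-file-size-bytes
table.maintenance.compact()
```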
Edit: It looks like I'm not able to add the changelog label, hopefully someone with permissions can do so.