Detect meaningful differences between web pages -- with Wayback Machine artifact cleaning, visual comparison, and significance scoring.
Comparing web pages sounds simple until you deal with Wayback Machine injection artifacts, insignificant whitespace noise, and visual regressions invisible to the DOM. Website-Diff is a purpose-built CLI that solves all three:
- Wayback Machine cleaning -- automatically strips banners, analytics scripts, playback code, and URL rewrites so you compare actual content.
- Significance scoring -- every change is tagged High, Medium, or Low so you focus on what matters.
- Multi-browser visual comparison -- captures screenshots in Chrome, Firefox, Edge, and Opera, then generates pixel-diff images.
- CI/CD-ready exit codes -- integrate directly into pipelines (`0` = no changes, `1` = low/medium, `2` = high).
- Quick Start
- Installation
- Usage
- Visual Comparison
- Markdown Reports
- CI/CD Integration
- How It Works
- Output Formats
- Comparison with Similar Tools
- Contributing
- License
pip install -e .
# Compare two pages
website-diff https://example.com/old https://example.com/new
# Compare a Wayback snapshot with the live site
website-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/
# Full report: visual diff + markdown
website-diff https://old.example.com https://new.example.com --visual --markdown

git clone https://github.com/GeiserX/Website-Diff.git
cd Website-Diff
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .

For visual comparison support:
pip install -e ".[visual]"

docker build -t website-diff .
docker run --rm website-diff https://example.com/a https://example.com/b

website-diff https://example.com/page1 https://example.com/page2

The tool automatically detects Wayback Machine URLs and cleans injection artifacts before comparing:
# Archive vs. live site
website-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/
# Two archive snapshots
website-diff \
https://web.archive.org/web/20230101/https://example.com/ \
https://web.archive.org/web/20230601/https://example.com/

# Save to file
website-diff url1 url2 -o diff.txt
# JSON (for programmatic consumption)
website-diff url1 url2 --format json
# Unified diff
website-diff url1 url2 --format unified

# Crawl and compare across linked pages (depth-limited)
website-diff url1 url2 --traverse --depth 2

| Flag | Description |
|---|---|
| `--no-clean-wayback` | Disable Wayback Machine artifact removal |
| `--no-ignore-whitespace` | Treat whitespace changes as significant |
| `--timeout N` | Set HTTP timeout in seconds (default: 30) |
| `--verbose` | Enable detailed logging |
Take screenshots in one or more browsers and generate side-by-side difference images:
# Auto-detect all installed browsers
website-diff url1 url2 --visual
# Specific browsers
website-diff url1 url2 --visual --browsers chrome firefox edge opera
# Custom viewport
website-diff url1 url2 --visual --viewport-width 1280 --viewport-height 720
# Non-headless mode (for debugging)
website-diff url1 url2 --visual --no-headless
# Custom screenshot output
website-diff url1 url2 --visual --screenshot-dir ./my-screenshots

Visual comparison generates:
- Screenshots of both pages per browser
- Side-by-side comparison images
- Pixel-level difference highlighting (red overlay marks changes)
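The red-overlay highlighting listed above can be illustrated with a small, dependency-free sketch. This is not Website-Diff's actual implementation: the function name `highlight_diff`, the `tolerance` parameter, and the representation of images as nested lists of RGB tuples are all assumptions made for illustration. A real tool would typically operate on decoded screenshot files instead.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]

RED: Pixel = (255, 0, 0)


def highlight_diff(before: Image, after: Image, tolerance: int = 0) -> Tuple[Image, int]:
    """Return a copy of `after` with changed pixels painted red, plus the change count.

    Hypothetical helper: a pixel counts as changed when any RGB channel
    differs by more than `tolerance`.
    """
    if len(before) != len(after) or any(len(a) != len(b) for a, b in zip(before, after)):
        raise ValueError("images must have identical dimensions")

    overlay: Image = []
    changed = 0
    for row_before, row_after in zip(before, after):
        out_row: List[Pixel] = []
        for old_px, new_px in zip(row_before, row_after):
            # Largest per-channel difference decides whether the pixel changed.
            if max(abs(a - b) for a, b in zip(old_px, new_px)) > tolerance:
                out_row.append(RED)
                changed += 1
            else:
                out_row.append(new_px)
        overlay.append(out_row)
    return overlay, changed
```

A non-zero `tolerance` is one way to suppress anti-aliasing noise between browsers; whether Website-Diff exposes such a threshold is not stated here.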
Generate comprehensive Markdown reports that include everything in a single reviewable document:
website-diff url1 url2 --visual --markdown --report-dir ./reports

Each report contains:
- Executive summary with change statistics
- Visual comparison screenshots (when `--visual` is used)
- Changes grouped by significance (High / Medium / Low)
- Site-wide results (when `--traverse` is used)
- Actionable recommendations
Website-Diff returns meaningful exit codes designed for pipeline gates:
| Exit Code | Meaning |
|---|---|
| `0` | No differences detected |
| `1` | Low or medium significance changes |
| `2` | High significance changes detected |
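For pipelines driven from Python rather than shell, the exit codes in the table above could be mapped to gate decisions roughly as follows. This is a sketch under stated assumptions: the names `gate_decision` and `run_gate` are hypothetical, and treating any exit code other than 0, 1, or 2 as an error is an assumption, since the table does not define other codes.

```python
import subprocess
from typing import Sequence


def gate_decision(exit_code: int) -> str:
    """Map a website-diff exit code to a pipeline decision (hypothetical helper)."""
    if exit_code == 0:
        return "pass"   # no differences detected
    if exit_code == 1:
        return "warn"   # low or medium significance changes
    if exit_code == 2:
        return "block"  # high significance changes detected
    return "error"      # assumption: any other code means the run itself failed


def run_gate(old_url: str, new_url: str, command: Sequence[str] = ("website-diff",)) -> str:
    """Run the CLI on two URLs and return the gate decision."""
    # check=False: a non-zero exit code is an expected result here, not an exception.
    result = subprocess.run([*command, old_url, new_url], check=False)
    return gate_decision(result.returncode)
```

Keeping the mapping in a pure function makes the policy easy to unit-test without invoking the CLI.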
name: Visual Regression Check
on:
  pull_request:
jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install Website-Diff
        run: |
          pip install -r requirements.txt
          pip install -e ".[visual]"
      - name: Compare staging vs production
        run: |
          website-diff \
            https://staging.example.com \
            https://production.example.com \
            --visual --markdown --format json -o diff.json
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: diff-report
          path: reports/

website-diff "$OLD_URL" "$NEW_URL" --format json -o result.json
EXIT_CODE=$?
if [ $EXIT_CODE -eq 2 ]; then
  echo "BLOCKING: high-significance changes detected"
  exit 1
elif [ $EXIT_CODE -eq 1 ]; then
  echo "WARNING: minor changes detected"
fi

When a Wayback Machine URL is detected, the tool automatically:
- Removes header artifacts -- strips analytics scripts, playback scripts, and banner CSS injected by the Wayback Machine.
- Removes footer comments -- removes archival metadata and copyright notices.
- Restores URLs -- converts `web.archive.org/web/…/`-prefixed URLs back to their originals.
- Normalizes content -- handles whitespace and formatting differences introduced by archival.
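The URL-restoration step in the list above can be sketched with a single regular expression. This is a minimal illustration, not the tool's actual code: the function name `strip_wayback_prefixes` and the pattern are assumptions. The pattern expects a numeric snapshot timestamp, optionally followed by a two-letter resource modifier such as `im_`, and only strips the prefix when an absolute `http(s)://` URL follows.

```python
import re

# Assumed shape of a Wayback rewrite prefix:
#   https://web.archive.org/web/<timestamp>[<modifier>_]/<original URL>
_WAYBACK_PREFIX = re.compile(
    r"(?:https?:)?//web\.archive\.org/web/\d{1,14}(?:[a-z]{2}_)?/(?=https?://)"
)


def strip_wayback_prefixes(html: str) -> str:
    """Remove Wayback Machine URL prefixes, leaving the original URLs in place."""
    return _WAYBACK_PREFIX.sub("", html)
```

For example, `strip_wayback_prefixes("https://web.archive.org/web/20230101/https://example.com/")` yields `https://example.com/`. Relative rewrites (paths beginning with `/web/…`) are deliberately left out of this sketch.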
Every detected change is categorized:
| Level | Examples |
|---|---|
| High | Structural changes, content text, meta tags, scripts, stylesheets |
| Medium | Attribute changes, inline styling, div/span modifications |
| Low | Whitespace, comments, minor formatting |
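To make the three levels concrete, here is a deliberately simplified classifier. It is a sketch of how such tiers could be assigned and not the scoring logic Website-Diff actually uses; the function name `classify` and the regex-based heuristics are assumptions. It treats whitespace-only or comment-only differences as low, attribute-only differences as medium, and everything else as high.

```python
import re

_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_TAG_NAME = re.compile(r"</?\s*([a-zA-Z][\w-]*)")
_WHITESPACE = re.compile(r"\s+")


def _squash(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return _WHITESPACE.sub(" ", text).strip()


def _without_comments(html: str) -> str:
    return _COMMENT.sub("", html)


def classify(old: str, new: str) -> str:
    """Return 'low', 'medium', or 'high' for a pair of HTML fragments (toy heuristic)."""
    old_clean, new_clean = _without_comments(old), _without_comments(new)

    # Low: identical once comments are dropped and whitespace is collapsed.
    if _squash(old_clean) == _squash(new_clean):
        return "low"

    # Medium: same sequence of tag names and same visible text, so only
    # attributes (class, style, ...) can differ.
    same_structure = _TAG_NAME.findall(old_clean.lower()) == _TAG_NAME.findall(new_clean.lower())
    same_text = _squash(_TAG.sub("", old_clean)) == _squash(_TAG.sub("", new_clean))
    if same_structure and same_text:
        return "medium"

    # High: structure or visible content changed.
    return "high"
```

A production classifier would parse the DOM rather than rely on regular expressions; the sketch only shows the ordering of the checks.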
The diff engine:
- Focuses on meaningful content changes
- Ignores noise like timestamps and auto-generated IDs
- Provides context around each change
- Groups results by significance for fast review
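One common way to ignore noise such as timestamps and auto-generated IDs is to mask them before diffing, so that two snapshots differing only in those values compare equal. The sketch below illustrates that idea; it is not taken from Website-Diff. The function name `mask_noise`, the placeholder strings, and both patterns (ISO-8601-style timestamps, and `id` attributes ending in eight or more lowercase hex characters) are assumptions.

```python
import re

# ISO-8601-like timestamps, e.g. 2023-01-01T10:00:00Z or 2023-01-01 10:00:00+02:00
_TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"
)
# id="..." values that end in a long hex run, e.g. id="widget-3f9a1c2b7d"
_GENERATED_ID = re.compile(r'(\bid=")([^"]*?)[0-9a-f]{8,}(")')


def mask_noise(html: str) -> str:
    """Replace volatile timestamps and generated ID suffixes with fixed placeholders."""
    html = _TIMESTAMP.sub("<TIMESTAMP>", html)
    return _GENERATED_ID.sub(r"\1\2<ID>\3", html)
```

Masking keeps the surrounding markup intact, so any remaining difference after masking reflects a change in real content.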
Text output contains summary statistics, a significance breakdown, and detailed changes with context lines.

JSON output is structured for programmatic processing:
{
"summary": {
"total_changes": 15,
"added": 5,
"removed": 3,
"modified": 7,
"high_significance": 2,
"medium_significance": 8,
"low_significance": 5
},
"changes": [
{
"type": "modified",
"old_text": "...",
"new_text": "...",
"significance": "high"
}
]
}

Unified output uses the standard unified diff format, compatible with `patch` and code review tools.
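The JSON structure shown earlier in this section lends itself to simple post-processing. The sketch below assumes only the fields visible in that sample (`summary`, `changes`, `significance`, and the per-level counters); the function names are hypothetical, and the real output may contain additional fields.

```python
import json
from typing import Any, Dict, List


def high_significance_changes(report_json: str) -> List[Dict[str, Any]]:
    """Return only the entries of `changes` whose significance is 'high'."""
    report = json.loads(report_json)
    return [c for c in report.get("changes", []) if c.get("significance") == "high"]


def summary_line(report_json: str) -> str:
    """Format the `summary` object as a one-line status message."""
    summary = json.loads(report_json)["summary"]
    return (
        f"{summary['total_changes']} changes "
        f"(high: {summary['high_significance']}, "
        f"medium: {summary['medium_significance']}, "
        f"low: {summary['low_significance']})"
    )
```

A script could read the file written by `--format json -o result.json`, pass its contents to these helpers, and post the summary line as a pull request comment.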
| Feature | Website-Diff | htmldiff | diff2html | BackstopJS | Percy |
|---|---|---|---|---|---|
| HTML-aware semantic diff | Yes | Yes | No | No | No |
| Wayback Machine artifact cleaning | Yes | No | No | No | No |
| Significance scoring | Yes | No | No | No | No |
| Visual (screenshot) comparison | Yes | No | No | Yes | Yes |
| Multi-browser support | Yes | N/A | N/A | Yes | Yes |
| Site-wide crawl and compare | Yes | No | No | Yes | No |
| Markdown report generation | Yes | No | No | No | No |
| CI/CD exit codes | Yes | No | No | Yes | Yes |
| Self-hosted / no SaaS | Yes | Yes | Yes | Yes | No |
| Free and open source | GPL-3.0 | MIT | MIT | MIT | Freemium |
pip install -r requirements-dev.txt
# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=website_diff --cov-report=html

Contributions are welcome. To get started:
- Fork the repository
- Create a feature branch (`git checkout -b feature/my-feature`)
- Add tests for new functionality
- Ensure all tests pass: `pytest tests/ -v`
- Submit a Pull Request
This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the LICENSE file for details.
This software is not intended for commercial use.