GeiserX/Website-Diff

Detect meaningful differences between web pages -- with Wayback Machine artifact cleaning, visual comparison, and significance scoring.


Why Website-Diff?

Comparing web pages sounds simple until you deal with Wayback Machine injection artifacts, insignificant whitespace noise, and visual regressions that never show up in the HTML. Website-Diff is a purpose-built CLI that addresses all three and adds pipeline-friendly exit codes:

  • Wayback Machine cleaning -- automatically strips banners, analytics scripts, playback code, and URL rewrites so you compare actual content.
  • Significance scoring -- every change is tagged High, Medium, or Low so you focus on what matters.
  • Multi-browser visual comparison -- captures screenshots in Chrome, Firefox, Edge, and Opera, then generates pixel-diff images.
  • CI/CD-ready exit codes -- integrate directly into pipelines (0 = no changes, 1 = low/medium, 2 = high).


Quick Start

# Run from a clone of the repository (see Installation)
pip install -e .

# Compare two pages
website-diff https://example.com/old https://example.com/new

# Compare a Wayback snapshot with the live site
website-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/

# Full report: visual diff + markdown
website-diff https://old.example.com https://new.example.com --visual --markdown

Installation

From source

git clone https://github.com/GeiserX/Website-Diff.git
cd Website-Diff
python3 -m venv venv
source venv/bin/activate      # Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .

For visual comparison support:

pip install -e ".[visual]"

Docker

docker build -t website-diff .
docker run --rm website-diff https://example.com/a https://example.com/b

Usage

Basic comparison

website-diff https://example.com/page1 https://example.com/page2

Wayback Machine support

The tool automatically detects Wayback Machine URLs and cleans injection artifacts before comparing:

# Archive vs. live site
website-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/

# Two archive snapshots
website-diff \
  https://web.archive.org/web/20230101/https://example.com/ \
  https://web.archive.org/web/20230601/https://example.com/

Output formats

# Save to file
website-diff url1 url2 -o diff.txt

# JSON (for programmatic consumption)
website-diff url1 url2 --format json

# Unified diff
website-diff url1 url2 --format unified

Site-wide traversal

# Crawl and compare across linked pages (depth-limited)
website-diff url1 url2 --traverse --depth 2
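
To make the idea of depth-limited traversal concrete, here is a minimal Python sketch of a breadth-first crawl that stops after a fixed number of link hops. It is not the tool's implementation: `crawl`, `LinkParser`, and the `fetch` callable are hypothetical names, and the sketch assumes that depth counts link hops from the start page and that only same-host links are followed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_depth=2):
    """Return {url: depth} for same-host pages within max_depth link hops.

    fetch is any callable that takes a URL and returns its HTML as a string
    (hypothetical; plug in your own HTTP client).
    """
    host = urlparse(start_url).netloc
    seen = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # pages at the depth limit are recorded but not expanded
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            target = urljoin(url, href).split("#")[0]  # resolve and drop fragments
            if urlparse(target).netloc == host and target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen
```

Running the same crawl against both sites and pairing pages by path would give the list of page pairs to compare.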

Advanced options

| Flag | Description |
| --- | --- |
| --no-clean-wayback | Disable Wayback Machine artifact removal |
| --no-ignore-whitespace | Treat whitespace changes as significant |
| --timeout N | Set HTTP timeout in seconds (default: 30) |
| --verbose | Enable detailed logging |

Visual Comparison

Take screenshots in one or more browsers and generate side-by-side difference images:

# Auto-detect all installed browsers
website-diff url1 url2 --visual

# Specific browsers
website-diff url1 url2 --visual --browsers chrome firefox edge opera

# Custom viewport
website-diff url1 url2 --visual --viewport-width 1280 --viewport-height 720

# Non-headless mode (for debugging)
website-diff url1 url2 --visual --no-headless

# Custom screenshot output
website-diff url1 url2 --visual --screenshot-dir ./my-screenshots

Visual comparison generates:

  • Screenshots of both pages per browser
  • Side-by-side comparison images
  • Pixel-level difference highlighting (red overlay marks changes)

Markdown Reports

Generate a Markdown report that gathers the comparison results into a single reviewable document:

website-diff url1 url2 --visual --markdown --report-dir ./reports

Each report contains:

  • Executive summary with change statistics
  • Visual comparison screenshots (when --visual is used)
  • Changes grouped by significance (High / Medium / Low)
  • Site-wide results (when --traverse is used)
  • Actionable recommendations

CI/CD Integration

Website-Diff returns meaningful exit codes designed for pipeline gates:

| Exit Code | Meaning |
| --- | --- |
| 0 | No differences detected |
| 1 | Low or medium significance changes |
| 2 | High significance changes detected |

GitHub Actions example

name: Visual Regression Check
on:
  pull_request:

jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install Website-Diff
        run: |
          pip install -r requirements.txt
          pip install -e ".[visual]"

      - name: Compare staging vs production
        run: |
          website-diff \
            https://staging.example.com \
            https://production.example.com \
            --visual --markdown --report-dir ./reports --format json -o diff.json

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: diff-report
          path: reports/

Shell script gate

website-diff "$OLD_URL" "$NEW_URL" --format json -o result.json
EXIT_CODE=$?

if [ $EXIT_CODE -eq 2 ]; then
  echo "BLOCKING: high-significance changes detected"
  exit 1
elif [ $EXIT_CODE -eq 1 ]; then
  echo "WARNING: minor changes detected"
fi
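
The same gate can be written in Python when the pipeline already runs a Python step. The sketch below relies only on the documented exit codes and on flags shown in this README. `should_block` and `run_gate` are hypothetical helper names, and blocking on unexpected exit codes (for example a crash) is a design choice of this sketch rather than documented behaviour.

```python
import subprocess

# Exit codes as documented in the CI/CD Integration table.
MEANINGS = {
    0: "no differences detected",
    1: "low or medium significance changes",
    2: "high significance changes detected",
}


def should_block(exit_code: int) -> bool:
    """Block on high-significance changes (2) and on any unexpected code."""
    return exit_code not in (0, 1)


def run_gate(old_url: str, new_url: str) -> int:
    """Run website-diff and return the exit status the pipeline step should use."""
    result = subprocess.run(
        ["website-diff", old_url, new_url, "--format", "json", "-o", "result.json"]
    )
    code = result.returncode
    print(MEANINGS.get(code, f"unexpected exit code {code}"))
    return 1 if should_block(code) else 0
```

A wrapper script would typically end with `sys.exit(run_gate(old_url, new_url))` so that the step fails only when the gate decides to block.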

How It Works

Wayback Machine cleaning

When a Wayback Machine URL is detected, the tool automatically:

  1. Removes header artifacts -- strips analytics scripts, playback scripts, and banner CSS injected by the Wayback Machine.
  2. Removes footer comments -- strips archival metadata and copyright notices.
  3. Restores URLs -- converts web.archive.org/web/…/ prefixed URLs back to their originals.
  4. Normalizes content -- handles whitespace and formatting differences introduced by archival.
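
As a rough illustration of steps 1 and 3 above, the following Python sketch removes an injected toolbar block and strips the archive prefix from rewritten URLs. It is not the tool's actual code: `clean_wayback` is a hypothetical name, and both the toolbar marker comments and the URL pattern are assumptions about what archived pages typically contain.

```python
import re

# Assumption: the injected toolbar is wrapped in these marker comments.
TOOLBAR = re.compile(
    r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->",
    re.DOTALL,
)

# Assumption: rewritten URLs look like
#   [https://web.archive.org]/web/<4-14 digit timestamp>[<two-letter modifier>_]/<original URL>
WAYBACK_PREFIX = re.compile(
    r"(?:https?:)?(?://web\.archive\.org)?/web/\d{4,14}(?:[a-z]{2}_)?/(?=https?://)"
)


def clean_wayback(html: str) -> str:
    """Drop the toolbar block, then restore original URLs by removing the archive prefix."""
    html = TOOLBAR.sub("", html)
    return WAYBACK_PREFIX.sub("", html)
```

The lookahead keeps the original URL intact and only deletes the prefix in front of it. A production cleaner would more likely work on a parsed DOM than on raw text, since regular expressions can also match inside unrelated content.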

Significance scoring

Every detected change is categorized:

| Level | Examples |
| --- | --- |
| High | Structural changes, content text, meta tags, scripts, stylesheets |
| Medium | Attribute changes, inline styling, div/span modifications |
| Low | Whitespace, comments, minor formatting |
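
One way to picture how a change could be mapped onto these levels is the heuristic below. It is a sketch under stated assumptions, not the scoring logic the tool uses: `classify` and its regular expressions are hypothetical, and real HTML would normally be compared on a parsed tree rather than on raw strings.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")
TAG_NAME = re.compile(r"<\s*(/?[a-zA-Z][\w-]*)")
HIGH_TAGS = re.compile(r"<\s*/?\s*(script|link|meta|style)\b", re.IGNORECASE)


def _squash(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())


def classify(old: str, new: str) -> str:
    """Return 'high', 'medium', or 'low' for a pair of HTML fragments."""
    old_c, new_c = COMMENT.sub("", old), COMMENT.sub("", new)
    # Low: nothing differs once comments and whitespace are ignored.
    if _squash(old_c) == _squash(new_c):
        return "low"
    # High: scripts, stylesheets, or meta tags are involved.
    if HIGH_TAGS.search(old_c) or HIGH_TAGS.search(new_c):
        return "high"
    # High: the sequence of tags changed (structural change).
    if TAG_NAME.findall(old_c) != TAG_NAME.findall(new_c):
        return "high"
    # High: the visible text changed.
    if _squash(TAG.sub(" ", old_c)) != _squash(TAG.sub(" ", new_c)):
        return "high"
    # Medium: same tags and same text, so only attributes differ.
    return "medium"
```

For example, changing a `class` attribute on a `div` comes out as medium, while editing the paragraph text or touching a `meta` tag comes out as high.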

Intelligent comparison

The diff engine:

  • Focuses on meaningful content changes
  • Ignores noise like timestamps and auto-generated IDs
  • Provides context around each change
  • Groups results by significance for fast review
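
The general technique of ignoring whitespace noise while keeping context around each change can be sketched with Python's standard `difflib` module. This is only an illustration of the idea, not the engine itself: `normalized_diff` is a hypothetical helper, and it handles whitespace only, not timestamps or generated IDs.

```python
import difflib


def normalized_diff(old: str, new: str, context: int = 2) -> list[str]:
    """Unified diff of two texts after collapsing whitespace and dropping blank lines."""

    def norm(text: str) -> list[str]:
        return [" ".join(line.split()) for line in text.splitlines() if line.strip()]

    return list(
        difflib.unified_diff(norm(old), norm(new), "old", "new", n=context, lineterm="")
    )
```

Two inputs that differ only in indentation or blank lines produce an empty list, while a real change is returned with up to `context` unchanged lines on either side.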

Output Formats

Text (default)

Summary statistics, significance breakdown, and detailed changes with context lines.

JSON

Structured output for programmatic processing:

{
  "summary": {
    "total_changes": 15,
    "added": 5,
    "removed": 3,
    "modified": 7,
    "high_significance": 2,
    "medium_significance": 8,
    "low_significance": 5
  },
  "changes": [
    {
      "type": "modified",
      "old_text": "...",
      "new_text": "...",
      "significance": "high"
    }
  ]
}
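
A short Python sketch of consuming this output follows. It uses only the field names shown in the sample above; `high_significance_changes` is a hypothetical helper, and the assumption is that the JSON was saved to a file or captured as a string.

```python
import json


def high_significance_changes(report_text: str) -> tuple[int, list[dict]]:
    """Return the total change count and the list of high-significance changes."""
    report = json.loads(report_text)
    total = report["summary"]["total_changes"]
    high = [c for c in report["changes"] if c["significance"] == "high"]
    return total, high
```

With a report saved via `-o diff.json`, the function can be called as `high_significance_changes(open("diff.json").read())`.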

Unified diff

Standard unified diff format, compatible with patch and code review tools.


Comparison with Similar Tools

| Feature | Website-Diff | htmldiff | diff2html | BackstopJS | Percy |
| --- | --- | --- | --- | --- | --- |
| HTML-aware semantic diff | Yes | Yes | No | No | No |
| Wayback Machine artifact cleaning | Yes | No | No | No | No |
| Significance scoring | Yes | No | No | No | No |
| Visual (screenshot) comparison | Yes | No | No | Yes | Yes |
| Multi-browser support | Yes | N/A | N/A | Yes | Yes |
| Site-wide crawl and compare | Yes | No | No | Yes | No |
| Markdown report generation | Yes | No | No | No | No |
| CI/CD exit codes | Yes | No | No | Yes | Yes |
| Self-hosted / no SaaS | Yes | Yes | Yes | Yes | No |
| Free and open source | GPL-3.0 | MIT | MIT | MIT | Freemium |

Testing

pip install -r requirements-dev.txt

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=website_diff --cov-report=html

Contributing

Contributions are welcome. To get started:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Add tests for new functionality
  4. Ensure all tests pass: pytest tests/ -v
  5. Submit a Pull Request

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the LICENSE file for details.

This software is not intended for commercial use.