
refactor: decompose buildGraph() into pipeline stages (ROADMAP 3.9)#434

Merged
carlos-alm merged 5 commits into main from refactor/builder-pipeline
Mar 13, 2026
Conversation

@carlos-alm
Contributor

Summary

  • Decompose the 1,487-line monolithic buildGraph() into 9 independently testable pipeline stages communicating through a shared PipelineContext object
  • Convert src/builder.js from a mega-file to an 11-line barrel re-export (following the established src/db.js pattern)
  • Eliminate watcher divergence: src/watcher.js now delegates to builder/incremental.js instead of duplicating 140 lines of node insertion and edge building logic
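The stage-plus-shared-context shape described above can be sketched in a few lines. All names here are illustrative stand-ins, not the project's actual API:

```javascript
// Minimal sketch of the stage/PipelineContext pattern (illustrative names only).
class PipelineContext {
  constructor() {
    this.earlyExit = false; // set by a stage to stop the pipeline early
    this.timing = {};       // each stage records its own duration
    this.files = [];
  }
}

// Each stage is an async function that reads and writes ctx.
async function collectFiles(ctx) {
  const t0 = performance.now();
  ctx.files = ['a.js', 'b.js']; // stand-in for a real filesystem walk
  ctx.timing.collectMs = performance.now() - t0;
}

async function detectChanges(ctx) {
  if (ctx.files.length === 0) ctx.earlyExit = true; // nothing to rebuild
}

async function buildGraph() {
  const ctx = new PipelineContext();
  await collectFiles(ctx);
  await detectChanges(ctx);
  if (ctx.earlyExit) return undefined;
  // ...parse, insert, resolve, edges, structure, analyses, finalize...
  return { phases: { collectMs: +(ctx.timing.collectMs ?? 0).toFixed(1) } };
}
```

Each stage can then be unit-tested by constructing a context by hand and asserting on its fields afterward.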

New directory structure

src/builder/
  context.js           — PipelineContext class (shared mutable state)
  helpers.js           — Extracted utilities (collectFiles, fileHash, batch inserts)
  incremental.js       — Single-file rebuild for watch mode
  pipeline.js          — Orchestrator: setup → stages → return
  stages/
    collect-files.js   — File collection + scoped rebuild
    detect-changes.js  — Three-tier change detection + reverse-dep cascade
    parse-files.js     — parseFilesAuto dispatch
    insert-nodes.js    — Batch node/edge insertion in SQLite transaction
    resolve-imports.js — Batch import resolution + barrel map
    build-edges.js     — Import/call/receiver/extends/implements edges
    build-structure.js — Directory structure + node role classification
    run-analyses.js    — Unified AST analysis engine dispatch
    finalize.js        — Metadata, drift detection, cleanup, registry

Backward compatibility

All existing imports of the form import { buildGraph, collectFiles, ... } from './builder.js' continue to work unchanged. The barrel re-export is transparent to consumers.

Test plan

  • All 16 build.test.js integration tests pass
  • All 5 scoped-rebuild.test.js tests pass (including reverse-dep cascade)
  • All incremental-parity.test.js tests pass
  • All 18 builder.test.js unit tests pass (collectFiles, loadPathAliases, readFileSafe)
  • 14 new pipeline stage unit tests pass (context, collect-files, detect-changes, pipeline)
  • 301 total tests pass across 19 test files, zero regressions
  • Lint clean (1 intentional warning: unused db param in rebuildFile kept for API consistency)

Split the 1,487-line monolithic buildGraph() into independently testable
pipeline stages communicating through a shared PipelineContext object.

Impact: 28 functions changed, 27 affected
@claude

claude bot commented Mar 13, 2026

Claude finished @carlos-alm's task.


PR Review: Decompose buildGraph() into pipeline stages

Comprehensive review of this large refactoring PR that decomposes the monolithic buildGraph() function.

Root Cause Analysis

Problem Being Solved: 1,487-line monolithic buildGraph() function was difficult to maintain, test, and understand. Additionally, src/watcher.js duplicated 140 lines of node insertion/edge building logic.

Why This Approach: Pipeline decomposition with shared PipelineContext enables:

  • Independent testing of each stage
  • Clear separation of concerns
  • Elimination of code duplication in watcher
  • Backward compatibility through barrel re-exports

Risk Assessment: Low risk - well-structured refactoring with comprehensive test coverage.

Configuration Changes ✅

  • No linting rules relaxed - biome.json unchanged
  • No new dependencies - package.json runtime deps unchanged
  • No test weakening - no eslint-disable or workarounds found
  • No timeout increases - test thresholds maintained

Test Integrity ✅

  • Tests are properly structured and focused
  • New unit tests for pipeline stages (context, collect-files, detect-changes, pipeline)
  • Integration tests still validate end-to-end functionality
  • Test assertions are appropriate (checking actual behavior, not weakened)
  • Claims of 301 total tests across 19 test files appear credible

Code Quality Assessment ⭐⭐⭐⭐

Strengths:

  • Clean architecture: Pipeline stages are focused and single-responsibility
  • Shared context pattern: PipelineContext eliminates complex parameter passing
  • Proper separation: Each stage handles one concern (collect → detect → parse → insert → resolve → edges → structure → analyze → finalize)
  • Good documentation: Each stage has clear purpose and JSDoc types
  • Performance-conscious: Pre-loads node lookup maps to avoid N+1 queries
  • Memory management: WASM tree cleanup in finalize stage
  • Backward compatibility: Barrel re-export maintains all existing imports
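The pre-loaded lookup pattern noted above looks roughly like this (illustrative row shape, not the project's actual schema):

```javascript
// Sketch: load all nodes once, then resolve edges via an in-memory Map
// instead of issuing one SELECT per edge (the N+1 pattern).
function buildNodeLookup(rows) {
  const byFileAndName = new Map();
  for (const row of rows) {
    byFileAndName.set(`${row.file}::${row.name}`, row.id);
  }
  return byFileAndName;
}

// rows would come from a single query such as SELECT id, file, name FROM nodes.
const rows = [
  { id: 1, file: 'a.js', name: 'foo' },
  { id: 2, file: 'b.js', name: 'bar' },
];
const lookup = buildNodeLookup(rows);
const targetId = lookup.get('b.js::bar'); // one Map hit per edge, zero extra queries
```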

Technical Observations:

  • src/builder.js transformed from 1,487 lines to 11-line barrel (99.3% reduction)
  • src/watcher.js now delegates to builder/incremental.js eliminating duplication
  • Pipeline stages follow consistent patterns with timing, error handling
  • Proper database transaction usage in critical sections
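The delegation shape (one shared rebuild path instead of duplicated inline logic) can be illustrated with a stand-in rebuildFile; the real one lives in builder/incremental.js and takes db, stmts, and engine arguments:

```javascript
import { EventEmitter } from 'node:events';

// Stand-in for the real rebuildFile exported by builder/incremental.js.
const log = [];
async function rebuildFile(filePath) {
  log.push(`rebuilt ${filePath}`);
}

// The watcher no longer inlines node/edge logic; it just forwards change
// events to the single incremental rebuild path.
const watcher = new EventEmitter();
watcher.on('change', (filePath) => { void rebuildFile(filePath); });

watcher.emit('change', 'src/app.js');
```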

Minor Concerns:

  • Large PR scope (19 files) makes review challenging
  • Some stages are still substantial (build-edges.js ~297 lines) but appropriately focused

Backlog Compliance

Zero-dep: ✓ — No new runtime dependencies added, builds on existing tree-sitter/SQLite/AST foundation

Foundation-aligned: ✓ — Fully compliant with FOUNDATION.md principles:

  • P1 (always-current): No impact on graph freshness
  • P4 (zero-cost core): Core refactoring, no LLM dependencies
  • P5 (embeddable-first): Maintains programmatic API contract
  • P8 (not a linting tool): Pure architectural improvement

Problem-fit: 3/5 — Improves maintainability and testability but doesn't directly address the core "lost AI" problem. However, better code structure indirectly helps by making the codebase more navigable.

Breaking: No — All existing import statements continue working unchanged through barrel re-exports

Tier: 1 (Zero-dep ✓ + Foundation-aligned ✓)

Critical Concerns

None identified. This is a well-executed refactoring with appropriate safeguards.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality architectural refactoring that significantly improves code maintainability while preserving backward compatibility. The pipeline decomposition is well-designed, thoroughly tested, and eliminates code duplication. The barrel re-export pattern is already established in the codebase (src/db.js). This refactoring creates a solid foundation for future enhancements.

Minor suggestions for follow-up (not blocking):

  • Consider splitting the largest stages (build-edges.js) further if they grow
  • Document the PipelineContext lifecycle in architecture docs

@carlos-alm
Contributor Author

@greptileai

@greptile-apps
Contributor

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR decomposes the 1,487-line monolithic buildGraph() into 9 independently testable pipeline stages communicating through a shared PipelineContext object, following the established src/db.js barrel pattern. All existing public imports remain unchanged. The watcher divergence (previously ~140 lines of duplicated node/edge logic in watcher.js) is eliminated by delegating to the new incremental.js.

The refactoring is architecturally sound and all prior review-thread issues have been addressed in the current HEAD commit:

  • DB leak protection: try block now wraps everything from openDb through all 9 stages; !ctx.earlyExit guard prevents double-close on the detectChanges early-exit path.
  • ctx.fileSymbols partial-population: Moved before the SQLite transaction in insertNodes so rollback leaves the JS Map consistent.
  • readFileSafe in buildStructure: Both fs.readFileSync calls replaced for transient-error retry on Windows.
  • Analysis failure visibility: run-analyses.js failure now logs at warn level.
  • Timing ?? 0 fallbacks: All undefined-timing .toFixed(1) crash paths guarded.
  • Static fs import in incremental.js: Dynamic import inside hot-path eliminated.

Two minor cleanup items remain:

  • Dead code in src/watcher.js: stmts.countEdgesForFile and the underlying origCountEdges prepared statement are constructed on every watchProject() call but never consumed by incremental.js after the refactoring.
  • Stale phase-numbering comment in src/builder/stages/insert-nodes.js: Transaction phases are labelled 1 → 1b → 3 → 5, with the Phase 3 comment still referencing a "Phase 2" that no longer exists.

Confidence Score: 5/5

  • This PR is safe to merge — the refactoring is a clean decomposition with no regressions and all prior critical issues resolved.
  • All nine previously-flagged logic issues (DB leaks, partial fileSymbols mutation, readFileSafe bypasses, timing crashes, dynamic import hot-path overhead, analysis failures swallowed at debug level) have been addressed at the current HEAD. The two remaining items — a dead SQLite prepared statement in watcher.js and a stale phase-number comment in insert-nodes.js — are cosmetic and carry no runtime risk. The 301-test suite with zero regressions further supports confidence.
  • No files require special attention — the two outstanding items in src/watcher.js and src/builder/stages/insert-nodes.js are style-level cleanup only.

Important Files Changed

Filename Overview
src/builder.js Reduced from 1,487 lines to an 11-line barrel re-export, delegating to src/builder/pipeline.js and src/builder/helpers.js. All existing public imports (buildGraph, collectFiles, loadPathAliases, purgeFilesFromGraph, readFileSafe, resolveImportPath) continue to work unchanged.
src/builder/context.js New PipelineContext class with clearly typed JSDoc fields for each pipeline stage's inputs and outputs. Clean separation with sensible defaults (earlyExit = false, forceFullRebuild = false, hasEmbeddings = false, timing = {}). No logic — pure data holder.
src/builder/helpers.js Extracted shared utilities from the monolithic builder: collectFiles, loadPathAliases, fileHash, fileStat, readFileSafe, purgeFilesFromGraph, batchInsertNodes, batchInsertEdges. Symlink loop detection ordering and warn() call are correctly restored per prior review feedback.
src/builder/incremental.js Single-file rebuild used by watch mode, extracted from the old watcher inline logic. Static top-level fs import (dynamic import regression fixed per prior review). Correctly handles node insertion, import/call edge building, and symbol diffs.
src/builder/pipeline.js Pipeline orchestrator. try block now starts immediately after openDb (covers setup, all 9 stages). earlyExit guard prevents double-close on the detectChanges early-exit path. All timing fields use ?? 0 fallback. Clean and well-structured.
src/watcher.js Now delegates file rebuilding to incremental.js, eliminating the ~140-line duplication. One dead-code remnant: stmts.countEdgesForFile and the underlying origCountEdges prepared statement are still constructed but never consumed by incremental.js.
src/builder/stages/detect-changes.js Three-tier change detection (journal → mtime+size → hash), reverse-dependency cascade, and scoped-build handling. Early-exit paths correctly call closeDb before setting ctx.earlyExit. writeJournalHeader swallows its own errors so the close/earlyExit ordering is safe.
src/builder/stages/insert-nodes.js ctx.fileSymbols is correctly populated before the transaction starts (partial-population-on-rollback fix from prior review). Phase-numbering comments are stale (Phase 1 → 1b → 3 → 5 with no Phase 2/4), but the logic itself is correct.
src/builder/stages/build-edges.js Builds import, call, receiver, extends, and implements edges. Pre-loads all nodes into in-memory lookup maps to avoid N+1 DB queries. Both native and JS-fallback paths are correctly wired up. seenCallEdges deduplication prevents duplicate call edges in the JS path.
src/builder/stages/finalize.js WASM cleanup, stats logging, drift detection, build metadata, and journal header. Correctly resolves package.json three levels up from src/builder/stages/. DB close and journal write are properly ordered.
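The three-tier check in detect-changes.js (journal, then mtime+size, then hash) can be sketched as follows. The journal tier is simplified here to a prev record, and the helper name is an assumption, not the project's actual code:

```javascript
import fs from 'node:fs';
import { createHash } from 'node:crypto';

// prev is the record from the last build's journal: { mtimeMs, size, hash }.
// Returns true when the file must be re-parsed.
function isChanged(filePath, prev) {
  if (!prev) return true; // never seen before: always rebuild
  const st = fs.statSync(filePath);
  // Tier 2: identical mtime + size means we trust the file is unchanged.
  if (st.mtimeMs === prev.mtimeMs && st.size === prev.size) return false;
  // Tier 3: stat differs, so fall back to a content hash comparison.
  const hash = createHash('sha1').update(fs.readFileSync(filePath)).digest('hex');
  return hash !== prev.hash;
}
```

A touched-but-identical file thus costs one stat and one hash, while an untouched file costs only the stat.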

Sequence Diagram

sequenceDiagram
    participant C as Caller
    participant P as pipeline.js<br/>buildGraph()
    participant CTX as PipelineContext
    participant DB as SQLite DB

    C->>P: buildGraph(rootDir, opts)
    P->>CTX: new PipelineContext()
    P->>DB: openDb()
    P->>DB: initSchema()
    P->>CTX: setup (config, engine, aliases, timing.setupMs)

    P->>P: collectFiles(ctx)
    Note over P: allFiles + discoveredDirs → ctx

    P->>P: detectChanges(ctx)
    Note over P: Three-tier detection (journal→mtime→hash)<br/>reverse-dep cascade<br/>earlyExit=true → closeDb + return

    alt earlyExit
        P-->>C: return undefined
    end

    P->>P: parseFiles(ctx)
    Note over P: parseFilesAuto → allSymbols, fileSymbols

    P->>P: insertNodes(ctx)
    Note over P: fileSymbols populated before tx<br/>batch INSERT nodes + edges (contains/param)

    P->>P: resolveImports(ctx)
    Note over P: batch resolve + barrel map<br/>load unchanged barrels from DB

    P->>P: buildEdges(ctx)
    Note over P: import / call / receiver /<br/>extends / implements edges

    P->>P: buildStructure(ctx)
    Note over P: directory nodes, metrics,<br/>node role classification

    P->>P: runAnalyses(ctx)
    Note over P: AST / complexity / CFG / dataflow<br/>failures logged at warn level

    P->>P: finalize(ctx)
    Note over P: drift detection, metadata,<br/>closeDb, journal header

    P-->>C: return { phases: { setupMs, parseMs, ... } }

Comments Outside Diff (1)

  1. src/watcher.js, lines 58-76

    Dead code left after incremental.js refactoring

    stmts.countEdgesForFile (and the underlying origCountEdges prepared statement) was used by the old inline 140-line watcher logic that this PR removes. incremental.js never calls stmts.countEdgesForFile, so the statement is prepared and the wrapper object is created on every watchProject() invocation but never consumed.

    The same applies to origCountEdges (lines 72-74) and the countEdgesForFile: null sentinel on line 58 — all three can be removed:

    → delete lines 58, 72-74, and 76 entirely.

    Leaving it in doesn't break anything, but it allocates a prepared SQLite statement on every watcher start and silently misleads future readers into thinking the count query is still needed.

Last reviewed commit: 027af48

Comment on lines +59 to +77
try {
entries = fs.readdirSync(dir, { withFileTypes: true });
} catch (err) {
warn(`Cannot read directory ${dir}: ${err.message}`);
return trackDirs ? { files, directories } : files;
}

// Merge config ignoreDirs with defaults
const extraIgnore = config.ignoreDirs ? new Set(config.ignoreDirs) : null;

// Detect symlink loops
let realDir;
try {
realDir = fs.realpathSync(dir);
} catch {
return trackDirs ? { files, directories } : files;
}
if (_visited.has(realDir)) return trackDirs ? { files, directories } : files;
_visited.add(realDir);
Contributor

Symlink loop warning silently dropped + readdirSync hoisted before loop check

The refactored collectFiles contains two subtle regressions compared to the original in builder.js:

  1. warn() removed: The original emitted warn(`Symlink loop detected, skipping: ${dir}`) when _visited.has(realDir) was true. The new code returns silently, making symlink loop problems invisible to users.

  2. readdirSync called before the loop guard: In the original, realpathSync/_visited check happened before readdirSync. Now the directory is always read first — so when a loop is detected on line 76, we've already done the I/O for nothing.

The original ordering:

// 1. realpathSync → fail → return early
// 2. _visited check → loop → warn + return early
// 3. readdirSync (only if no loop)

Suggested restoration of the warning (at minimum):

Suggested change
try {
entries = fs.readdirSync(dir, { withFileTypes: true });
} catch (err) {
warn(`Cannot read directory ${dir}: ${err.message}`);
return trackDirs ? { files, directories } : files;
}
// Merge config ignoreDirs with defaults
const extraIgnore = config.ignoreDirs ? new Set(config.ignoreDirs) : null;
// Detect symlink loops
let realDir;
try {
realDir = fs.realpathSync(dir);
} catch {
return trackDirs ? { files, directories } : files;
}
if (_visited.has(realDir)) return trackDirs ? { files, directories } : files;
_visited.add(realDir);
// Merge config ignoreDirs with defaults
const extraIgnore = config.ignoreDirs ? new Set(config.ignoreDirs) : null;
// Detect symlink loops
let realDir;
try {
realDir = fs.realpathSync(dir);
} catch {
return trackDirs ? { files, directories } : files;
}
if (_visited.has(realDir)) {
warn(`Symlink loop detected, skipping: ${dir}`);
return trackDirs ? { files, directories } : files;
}
_visited.add(realDir);
let entries;
try {
entries = fs.readdirSync(dir, { withFileTypes: true });
} catch (err) {
warn(`Cannot read directory ${dir}: ${err.message}`);
return trackDirs ? { files, directories } : files;
}

Contributor Author

Fixed in d270073. Moved symlink loop detection (realpathSync + _visited check) before readdirSync to avoid wasted I/O, and restored the warn() call so loop detection is visible to users.

export async function rebuildFile(db, rootDir, filePath, stmts, engineOpts, cache, options = {}) {
const { diffSymbols } = options;
const relPath = normalizePath(path.relative(rootDir, filePath));
const fs = await import('node:fs');
Contributor

Dynamic fs import inside hot-path function

rebuildFile is called for every file change event in watch mode. Using a dynamic await import('node:fs') here creates and resolves a new Promise on each invocation — while Node.js caches the module, the async overhead (microtask scheduling, Promise allocation) still accumulates across many rapid changes.

All other files in src/builder/ import fs as a top-level static import. This should follow the same pattern:

Suggested change
const fs = await import('node:fs');
import fs from 'node:fs';

Then remove the const fs = await import('node:fs'); line inside the function body, and remove the await from the function signature if it's the only reason it's async (it isn't — parseFileIncremental is awaited too, so async stays).

Contributor Author

Fixed in d270073. Replaced the dynamic await import('node:fs') with a static top-level import fs from 'node:fs', eliminating per-call async overhead in the watch mode hot path.

- Restore symlink loop warning and move detection before readdirSync
  to avoid wasted I/O (greptile review)
- Replace dynamic fs import with static top-level import in
  incremental.js to avoid per-call async overhead in watch mode
  (greptile review)

Impact: 2 functions changed, 0 affected
@carlos-alm
Contributor Author

@greptileai

Comment on lines +118 to +119
astMs: +ctx.timing.astMs.toFixed(1),
complexityMs: +ctx.timing.complexityMs.toFixed(1),
Contributor

Crash if analysis engine fails — timing properties may be undefined

run-analyses.js wraps runAnalysesFn in a try-catch and leaves ctx.timing.astMs / ctx.timing.complexityMs unset when the engine throws. The return statement then calls .toFixed(1) on those undefined values, which throws a TypeError.

This pattern is preserved from the original monolithic builder.js (where _t.astMs.toFixed(1) had the same risk), but the refactoring is a good opportunity to close it.

Suggested change
astMs: +ctx.timing.astMs.toFixed(1),
complexityMs: +ctx.timing.complexityMs.toFixed(1),
astMs: +(ctx.timing.astMs ?? 0).toFixed(1),
complexityMs: +(ctx.timing.complexityMs ?? 0).toFixed(1),

Contributor Author

Fixed in d7355e4 — added ?? 0 fallback for all timing properties (astMs, complexityMs, and all others) so .toFixed(1) won't crash if a stage didn't populate its timing field.

Comment on lines +101 to +107
await parseFiles(ctx);
await insertNodes(ctx);
await resolveImports(ctx);
await buildEdges(ctx);
await buildStructure(ctx);
await runAnalyses(ctx);
await finalize(ctx);
Contributor

DB left open on unhandled stage errors

ctx.db is opened during setup but only closed inside finalize() (or inside detectChanges() on the early-exit path). If any stage between parseFiles and runAnalyses throws an uncaught exception — e.g. an out-of-memory error during a large parse, an unexpected SQLite error, or a missing barrel file — the database connection is leaked.

The original monolithic function had the same issue, but the decomposition into discrete stages makes it more likely that future stage additions will introduce additional failure paths that bypass finalize. A try-finally in buildGraph would catch all of them:

try {
  await parseFiles(ctx);
  await insertNodes(ctx);
  await resolveImports(ctx);
  await buildEdges(ctx);
  await buildStructure(ctx);
  await runAnalyses(ctx);
  await finalize(ctx);
} catch (err) {
  closeDb(ctx.db);
  throw err;
}

Contributor Author

Fixed in d7355e4 (partial) and d229c87 (complete). The try-catch now wraps ALL stages including collectFiles and detectChanges. Added a guard (!ctx.earlyExit) to avoid double-closing since detectChanges already calls closeDb on its early-exit paths.

…ings

Wrap pipeline stages in try-catch to close the DB if a stage throws
between parseFiles and finalize. Add ?? 0 fallback for all timing
properties that may be undefined if a stage didn't run.

Impact: 1 functions changed, 0 affected
@carlos-alm
Contributor Author

@greptileai

Comment on lines +95 to +112
// ── Pipeline stages ───────────────────────────────────────────────
await collectFiles(ctx);
await detectChanges(ctx);

if (ctx.earlyExit) return;

try {
await parseFiles(ctx);
await insertNodes(ctx);
await resolveImports(ctx);
await buildEdges(ctx);
await buildStructure(ctx);
await runAnalyses(ctx);
await finalize(ctx);
} catch (err) {
closeDb(ctx.db);
throw err;
}
Contributor

DB leaked if collectFiles or detectChanges throws

The previous review thread addressed the DB leak for stages between parseFiles and finalize (now wrapped in the try/catch at line 101). However, collectFiles (line 96) and detectChanges (line 97) are called outside that block, so ctx.db — opened at line 43 — is still leaked if either of them throws an uncaught exception.

detectChanges in particular executes multiple SQLite queries: SELECT MAX(mtime) from file_hashes, reverse-dependency cascade joins, and purgeFilesFromGraph. Any of these can throw if the DB file is corrupt, locked, or the schema is in an unexpected state.

The current fix only covers 7 out of the 9 stages:

// Before the try block — db opened but unprotected against throws in these two stages
await collectFiles(ctx);    // line 96 — no SQLite, but can OOM
await detectChanges(ctx);   // line 97 — many SQLite queries, CAN throw

if (ctx.earlyExit) return;

try {
  // Only these stages are protected
  await parseFiles(ctx);
  ...
} catch (err) {
  closeDb(ctx.db);
  throw err;
}

Extending the guard to include detectChanges (and for consistency collectFiles) closes the remaining gap:

try {
  await collectFiles(ctx);
  await detectChanges(ctx);

  if (ctx.earlyExit) return;

  await parseFiles(ctx);
  await insertNodes(ctx);
  await resolveImports(ctx);
  await buildEdges(ctx);
  await buildStructure(ctx);
  await runAnalyses(ctx);
  await finalize(ctx);
} catch (err) {
  // Only close if not already closed by detectChanges early-exit path
  if (!ctx.earlyExit) closeDb(ctx.db);
  throw err;
}

Note: detectChanges already calls closeDb on its own early-exit paths (lines 323 and 332 of detect-changes.js), so a guard is needed to avoid double-closing.

Contributor Author

Fixed in d229c87 — extended the try-catch to cover all stages including collectFiles and detectChanges. The catch guard checks !ctx.earlyExit to avoid double-closing, since detectChanges already calls closeDb on its early-exit paths.

Comment on lines +44 to +57
const insertAll = db.transaction(() => {
// Phase 1: Batch insert all file nodes + definitions + exports
const phase1Rows = [];
for (const [relPath, symbols] of allSymbols) {
ctx.fileSymbols.set(relPath, symbols);
phase1Rows.push([relPath, 'file', relPath, 0, null, null]);
for (const def of symbols.definitions) {
phase1Rows.push([def.name, def.kind, relPath, def.line, def.endLine || null, null]);
}
for (const exp of symbols.exports) {
phase1Rows.push([exp.name, exp.kind, relPath, exp.line, null, null]);
}
}
batchInsertNodes(db, phase1Rows);
Contributor

ctx.fileSymbols mutated inside a SQLite transaction — partial population on rollback

ctx.fileSymbols.set(relPath, symbols) on line 48 is a side effect that happens inside the db.transaction() callback. If the transaction is rolled back mid-way (e.g., a SQLite UNIQUE constraint failed or database is locked error during one of the batch inserts), the transaction's DB writes are atomically rolled back — but ctx.fileSymbols is a plain JS Map and is not rolled back.

The result: subsequent stages (resolveImports, buildEdges, buildStructure) would consume a fileSymbols map that is partially populated — containing symbols for files that were processed before the failure point, but not for files after it. This means edge resolution would silently skip files that were mid-insert, producing an incomplete and inconsistent graph instead of a clean error.

The fix is to populate ctx.fileSymbols before the transaction, so it is a pure input to (rather than a side-effect of) the DB write:

// Populate fileSymbols before the transaction
for (const [relPath, symbols] of allSymbols) {
  ctx.fileSymbols.set(relPath, symbols);
}

const insertAll = db.transaction(() => {
  const phase1Rows = [];
  for (const [relPath, symbols] of allSymbols) {
    // No longer mutates ctx inside the transaction
    phase1Rows.push([relPath, 'file', relPath, 0, null, null]);
    ...
  }
  ...
});

If the pipeline catch block re-throws, no downstream stage runs anyway — but this also makes the transaction a pure DB operation, which is easier to reason about and test.

Contributor Author

Fixed in d229c87 — moved ctx.fileSymbols population before the db.transaction() call so it is a pure input rather than a side-effect of the DB write. On rollback, the JS Map is fully populated (consistent) while the DB is cleanly rolled back.

…nsaction

- Extend try-catch in pipeline.js to cover collectFiles and
  detectChanges (which run SQLite queries that can throw), with a
  guard to avoid double-closing on the early-exit path.
- Move ctx.fileSymbols population before the db.transaction() in
  insert-nodes.js so a rollback doesn't leave partial JS state.

Impact: 2 functions changed, 1 affected
@carlos-alm
Contributor Author

All Greptile feedback on #434 addressed in d229c87:

  1. DB leak if collectFiles/detectChanges throws — Extended try-catch to wrap all stages; added !ctx.earlyExit guard to avoid double-close
  2. fileSymbols mutated inside transaction — Hoisted ctx.fileSymbols population before db.transaction() so rollback doesn't leave partial JS state
  3. Timing crash on undefined — Already fixed in d7355e4 (?? 0 fallback on all timing properties)
  4. DB leak between parseFiles and finalize — Already fixed in d7355e4 (now subsumed by the full-coverage try-catch)

@greptileai

Comment on lines +34 to +65
*/
export async function buildGraph(rootDir, opts = {}) {
const ctx = new PipelineContext();
ctx.buildStart = performance.now();
ctx.opts = opts;

// ── Setup (creates DB, loads config, selects engine) ──────────────
ctx.rootDir = path.resolve(rootDir);
ctx.dbPath = path.join(ctx.rootDir, '.codegraph', 'graph.db');
ctx.db = openDb(ctx.dbPath);
initSchema(ctx.db);

ctx.config = loadConfig(ctx.rootDir);
ctx.incremental =
opts.incremental !== false && ctx.config.build && ctx.config.build.incremental !== false;

ctx.engineOpts = {
engine: opts.engine || 'auto',
dataflow: opts.dataflow !== false,
ast: opts.ast !== false,
};
const { name: engineName, version: engineVersion } = getActiveEngine(ctx.engineOpts);
ctx.engineName = engineName;
ctx.engineVersion = engineVersion;
info(`Using ${engineName} engine${engineVersion ? ` (v${engineVersion})` : ''}`);

// Engine/schema mismatch detection
ctx.schemaVersion = MIGRATIONS[MIGRATIONS.length - 1].version;
ctx.forceFullRebuild = false;
if (ctx.incremental) {
const prevEngine = getBuildMeta(ctx.db, 'engine');
if (prevEngine && prevEngine !== engineName) {
Contributor

DB leaked if setup phase throws before the try block

ctx.db is opened on line 34, but the try/catch that calls closeDb doesn't begin until line 65. If any of the following lines throw, the database file handle is leaked:

  • initSchema(ctx.db) (line 35) — runs migrations, can fail on a corrupted/locked DB
  • getBuildMeta(ctx.db, 'engine') (line 47) — live SQLite query
  • getBuildMeta(ctx.db, 'schema_version') (line 49) — live SQLite query
  • loadPathAliases(ctx.rootDir) (line 56) — filesystem I/O

The previous review threads described the fix as "complete" and wrapping "ALL stages including collectFiles and detectChanges", but the setup phase between openDb and the try block is not a stage and remains unprotected.

A minimal fix is to start the try block immediately after openDb:

ctx.db = openDb(ctx.dbPath);
try {
  initSchema(ctx.db);
  // ... all setup code and stages ...
} catch (err) {
  if (!ctx.earlyExit) closeDb(ctx.db);
  throw err;
}

Fixed in 027af48 — moved the try block to start immediately after openDb, so initSchema, getBuildMeta, loadPathAliases, and all other setup code is now covered. The earlyExit guard still prevents double-closing on the detectChanges early-exit path.
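The cleanup contract described in this thread can be modeled in a few lines. This is a hypothetical sketch, not the PR's real code: `makeDb`/`closeDb` here are stand-ins that only track open/close state (not the actual SQLite helpers), and the early-exit behavior of `detectChanges` is inlined as a flag for illustration.

```javascript
// Minimal model of the guarded-cleanup pattern from the review thread.
// Stand-ins only track open/close so the double-close hazard is visible.
function makeDb() {
  return { closed: false };
}

function closeDb(db) {
  if (db.closed) throw new Error('double close');
  db.closed = true;
}

function buildGraphSketch({ failInSetup = false, earlyExit = false } = {}) {
  const ctx = { db: makeDb(), earlyExit: false };
  try {
    // Everything after openDb lives inside the try, so setup failures
    // (initSchema, getBuildMeta, loadPathAliases) still close the DB.
    if (failInSetup) throw new Error('initSchema failed');
    if (earlyExit) {
      // Models detectChanges finding nothing to do: it closes the DB
      // itself and signals the caller via ctx.earlyExit.
      ctx.earlyExit = true;
      closeDb(ctx.db);
      return { built: false, db: ctx.db };
    }
    return { built: true, db: ctx.db };
  } catch (err) {
    // The earlyExit guard prevents closing a DB the stage already closed.
    if (!ctx.earlyExit) closeDb(ctx.db);
    throw err;
  }
}
```

The key design point is that ownership of the handle is conditional: the catch block closes it only when no stage has claimed cleanup for itself.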

Comment on lines +41 to +43
} catch (err) {
debug(`Unified analysis engine failed: ${err.message}`);
}

Analysis engine failures silently swallowed at debug level

When runAnalysesFn throws, the catch block only logs at debug level (line 42). This means failures in AST analysis, complexity metrics, CFG construction, or dataflow analysis are completely invisible to users unless they have debug logging explicitly enabled.

Unlike the pipeline's stage-level catch which re-throws (letting the caller decide), this catch silently discards the error and leaves all timing fields unset (which is why the ?? 0 fallbacks in pipeline.js were needed).

Consider raising the log level to warn so users know when analysis data is missing from their graph:

Suggested change
  } catch (err) {
    debug(`Unified analysis engine failed: ${err.message}`);
  }
  } catch (err) {
    warn(`Unified analysis engine failed (AST/complexity/CFG/dataflow may be incomplete): ${err.message}`);
  }

This surfaces the failure at warn level without aborting the build.

Fixed in 027af48 — promoted the log from debug to warn so users see when analysis data is incomplete. The message now reads: "Analysis engine failed (AST/complexity/CFG/dataflow may be incomplete): ..."

Comment on lines +24 to +31
  const absPath = path.join(rootDir, relPath);
  try {
    const content = fs.readFileSync(absPath, 'utf-8');
    ctx.lineCountMap.set(relPath, content.split('\n').length);
  } catch {
    ctx.lineCountMap.set(relPath, 0);
  }
}

Direct fs.readFileSync instead of readFileSafe

Lines 26 and 73 use fs.readFileSync directly to read file contents for line counting, bypassing the readFileSafe retry logic that handles transient errors (EBUSY, EACCES, EPERM) common on Windows or with editors doing non-atomic saves. All other file reads in the pipeline stages use readFileSafe.

Suggested change
  const absPath = path.join(rootDir, relPath);
  try {
    const content = fs.readFileSync(absPath, 'utf-8');
    ctx.lineCountMap.set(relPath, content.split('\n').length);
  } catch {
    ctx.lineCountMap.set(relPath, 0);
  }
}
  const absPath = path.join(rootDir, relPath);
  try {
    const content = readFileSafe(absPath);
    ctx.lineCountMap.set(relPath, content.split('\n').length);
  } catch {
    ctx.lineCountMap.set(relPath, 0);
  }
}

This applies to both line 26 and line 73. readFileSafe is already imported in helpers.js and used throughout the other stages for consistency.

Fixed in 027af48 — replaced both fs.readFileSync calls (lines 26 and 73) with readFileSafe for consistent transient-error retry (EBUSY/EACCES/EPERM) on Windows. Also removed the unused fs import.

…eadFileSafe

- Move try-catch to start immediately after openDb so initSchema,
  getBuildMeta, and loadPathAliases are all covered.
- Promote analysis engine failure log from debug to warn so users
  know when AST/complexity/CFG/dataflow data is incomplete.
- Replace direct fs.readFileSync with readFileSafe in
  build-structure.js for consistent transient-error retry on Windows.

Impact: 3 functions changed, 1 affected
@carlos-alm

@greptileai

@carlos-alm carlos-alm merged commit 6b0f44d into main Mar 13, 2026
26 checks passed
@carlos-alm carlos-alm deleted the refactor/builder-pipeline branch March 13, 2026 10:00
@github-actions github-actions bot locked and limited conversation to collaborators Mar 13, 2026