refactor: decompose buildGraph() into pipeline stages (ROADMAP 3.9) #434

carlos-alm merged 5 commits into main from
Conversation
Split the 1,487-line monolithic buildGraph() into independently testable
pipeline stages communicating through a shared PipelineContext object.
New directory structure:

```
src/builder/
  context.js      — PipelineContext class (shared mutable state)
  helpers.js      — Extracted utilities (collectFiles, fileHash, batch inserts)
  incremental.js  — Single-file rebuild for watch mode
  pipeline.js     — Orchestrator: setup → stages → return
  stages/
    collect-files.js   — File collection + scoped rebuild
    detect-changes.js  — Three-tier change detection + reverse-dep cascade
    parse-files.js     — parseFilesAuto dispatch
    insert-nodes.js    — Batch node/edge insertion in SQLite transaction
    resolve-imports.js — Batch import resolution + barrel map
    build-edges.js     — Import/call/receiver/extends/implements edges
    build-structure.js — Directory structure + node role classification
    run-analyses.js    — Unified AST analysis engine dispatch
    finalize.js        — Metadata, drift detection, cleanup, registry
```
src/builder.js is now a barrel re-export (11 lines) following the
established src/db.js pattern. All existing imports continue working.
src/watcher.js now delegates to builder/incremental.js instead of
duplicating 140 lines of node insertion and edge building logic.
Impact: 28 functions changed, 27 affected
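The shared-context pattern described above can be sketched in a few lines. This is a hypothetical illustration, not the actual `src/builder/context.js` API; all field and function names below are invented for the example:

```javascript
// Hypothetical sketch of the pipeline-context pattern: each stage is an async
// function that reads what it needs from the context and writes its outputs
// back, so stages stay independently testable. Field names are illustrative.
class PipelineContext {
  constructor() {
    this.db = null;               // opened during setup, closed in finalize
    this.allFiles = [];           // populated by a collect-files stage
    this.fileSymbols = new Map(); // populated before node insertion
    this.timing = {};             // each stage records its own duration
    this.earlyExit = false;       // set when change detection finds no work
  }
}

// A stage is just an async function over the context:
async function collectFiles(ctx) {
  const t0 = performance.now();
  ctx.allFiles = ['src/a.js', 'src/b.js']; // stand-in for real traversal
  ctx.timing.collectMs = performance.now() - t0;
}

const ctx = new PipelineContext();
collectFiles(ctx).then(() => console.log(ctx.allFiles.length)); // 2
```

Because each stage only touches the context, a unit test can construct a `PipelineContext`, pre-seed the inputs a stage expects, run the stage, and assert on the outputs without running the rest of the pipeline.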
Claude finished @carlos-alm's task

PR Review: Decompose buildGraph() into pipeline stages

Comprehensive review of this large refactoring PR that decomposes the monolithic `buildGraph()` function.

Root Cause Analysis

Problem Being Solved: 1,487-line monolithic `buildGraph()`. Why This Approach: Pipeline decomposition with shared `PipelineContext`.
Risk Assessment: Low risk — well-structured refactoring with comprehensive test coverage.

Configuration Changes ✅

Test Integrity ✅

Code Quality Assessment ⭐⭐⭐⭐

Strengths:

Technical Observations:

Minor Concerns:

Backlog Compliance

Zero-dep: ✓ — No new runtime dependencies added; builds on the existing tree-sitter/SQLite/AST foundation.

Foundation-aligned: ✓ — Fully compliant with FOUNDATION.md principles:

Problem-fit: 3/5 — Improves maintainability and testability but doesn't directly address the core "lost AI" problem. However, better code structure indirectly helps by making the codebase more navigable.

Breaking: No — All existing imports continue working.

Tier: 1 (Zero-dep ✓ + Foundation-aligned ✓)

Critical Concerns

None identified. This is a well-executed refactoring with appropriate safeguards.

Final Recommendation
Minor suggestions for follow-up (not blocking):
Greptile Summary

This PR decomposes the 1,487-line monolithic `buildGraph()`. The refactoring is architecturally sound and all prior review-thread issues have been addressed in the current HEAD commit:
Two minor cleanup items remain:
Confidence Score: 5/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant C as Caller
    participant P as pipeline.js<br/>buildGraph()
    participant CTX as PipelineContext
    participant DB as SQLite DB
    C->>P: buildGraph(rootDir, opts)
    P->>CTX: new PipelineContext()
    P->>DB: openDb()
    P->>DB: initSchema()
    P->>CTX: setup (config, engine, aliases, timing.setupMs)
    P->>P: collectFiles(ctx)
    Note over P: allFiles + discoveredDirs → ctx
    P->>P: detectChanges(ctx)
    Note over P: Three-tier detection (journal→mtime→hash)<br/>reverse-dep cascade<br/>earlyExit=true → closeDb + return
    alt earlyExit
        P-->>C: return undefined
    end
    P->>P: parseFiles(ctx)
    Note over P: parseFilesAuto → allSymbols, fileSymbols
    P->>P: insertNodes(ctx)
    Note over P: fileSymbols populated before tx<br/>batch INSERT nodes + edges (contains/param)
    P->>P: resolveImports(ctx)
    Note over P: batch resolve + barrel map<br/>load unchanged barrels from DB
    P->>P: buildEdges(ctx)
    Note over P: import / call / receiver /<br/>extends / implements edges
    P->>P: buildStructure(ctx)
    Note over P: directory nodes, metrics,<br/>node role classification
    P->>P: runAnalyses(ctx)
    Note over P: AST / complexity / CFG / dataflow<br/>failures logged at warn level
    P->>P: finalize(ctx)
    Note over P: drift detection, metadata,<br/>closeDb, journal header
    P-->>C: return { phases: { setupMs, parseMs, ... } }
```
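The three-tier detection called out in the diagram (journal → mtime → hash) can be illustrated with a minimal standalone sketch. The function shape and field names here are assumptions for illustration, not the actual detect-changes.js logic:

```javascript
// Illustrative three-tier change check: consult the cheap signals first and
// only hash content when the mtime disagrees. All names here are hypothetical.
import crypto from 'node:crypto';

const sha1 = (s) => crypto.createHash('sha1').update(s).digest('hex');

function isChanged(file, journal, prev) {
  if (journal.has(file.path)) return true;      // tier 1: journal says touched
  const old = prev.get(file.path);
  if (!old) return true;                        // new file: always a change
  if (old.mtime === file.mtime) return false;   // tier 2: mtime unchanged
  return sha1(file.content) !== old.hash;       // tier 3: confirm via hash
}

const prev = new Map([['a.js', { mtime: 1, hash: sha1('x') }]]);
// mtime bumped by a no-op save, content identical → not a real change
console.log(isChanged({ path: 'a.js', mtime: 2, content: 'x' }, new Set(), prev)); // false
```

The ordering matters: the journal and mtime checks avoid reading file contents at all in the common unchanged case, and hashing is only paid when the cheap tiers disagree.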
src/builder/helpers.js
```js
try {
  entries = fs.readdirSync(dir, { withFileTypes: true });
} catch (err) {
  warn(`Cannot read directory ${dir}: ${err.message}`);
  return trackDirs ? { files, directories } : files;
}

// Merge config ignoreDirs with defaults
const extraIgnore = config.ignoreDirs ? new Set(config.ignoreDirs) : null;

// Detect symlink loops
let realDir;
try {
  realDir = fs.realpathSync(dir);
} catch {
  return trackDirs ? { files, directories } : files;
}
if (_visited.has(realDir)) return trackDirs ? { files, directories } : files;
_visited.add(realDir);
```
Symlink loop warning silently dropped + readdirSync hoisted before loop check
The refactored collectFiles contains two subtle regressions compared to the original in builder.js:
- ``warn()`` removed: The original emitted ``warn(`Symlink loop detected, skipping: ${dir}`)`` when `_visited.has(realDir)` was true. The new code returns silently, making symlink loop problems invisible to users.
- `readdirSync` called before the loop guard: In the original, the `realpathSync`/`_visited` check happened before `readdirSync`. Now the directory is always read first — so when a loop is detected on line 76, we've already done the I/O for nothing.
The original ordering:

```js
// 1. realpathSync → fail → return early
// 2. _visited check → loop → warn + return early
// 3. readdirSync (only if no loop)
```

Suggested restoration of the warning (at minimum):
```js
// Merge config ignoreDirs with defaults
const extraIgnore = config.ignoreDirs ? new Set(config.ignoreDirs) : null;

// Detect symlink loops
let realDir;
try {
  realDir = fs.realpathSync(dir);
} catch {
  return trackDirs ? { files, directories } : files;
}
if (_visited.has(realDir)) {
  warn(`Symlink loop detected, skipping: ${dir}`);
  return trackDirs ? { files, directories } : files;
}
_visited.add(realDir);

let entries;
try {
  entries = fs.readdirSync(dir, { withFileTypes: true });
} catch (err) {
  warn(`Cannot read directory ${dir}: ${err.message}`);
  return trackDirs ? { files, directories } : files;
}
```
Fixed in d270073. Moved symlink loop detection (realpathSync + _visited check) before readdirSync to avoid wasted I/O, and restored the warn() call so loop detection is visible to users.
src/builder/incremental.js
```js
export async function rebuildFile(db, rootDir, filePath, stmts, engineOpts, cache, options = {}) {
  const { diffSymbols } = options;
  const relPath = normalizePath(path.relative(rootDir, filePath));
  const fs = await import('node:fs');
```
Dynamic fs import inside hot-path function
rebuildFile is called for every file change event in watch mode. Using a dynamic await import('node:fs') here creates and resolves a new Promise on each invocation — while Node.js caches the module, the async overhead (microtask scheduling, Promise allocation) still accumulates across many rapid changes.
All other files in src/builder/ import fs as a top-level static import. This should follow the same pattern:
```diff
- const fs = await import('node:fs');
+ import fs from 'node:fs';
```
Then remove the `const fs = await import('node:fs');` line inside the function body, and drop the `async` from the function signature if the dynamic import was the only reason it's async (it isn't — `parseFileIncremental` is awaited too, so `async` stays).
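The per-call overhead being discussed is easy to see in isolation. This standalone demo (not project code) shows that each dynamic `import()` call returns a fresh Promise even though the module itself is cached:

```javascript
import fs from 'node:fs'; // static: one binding, resolved at module load

// Dynamic form: allocates and schedules a new Promise on every call,
// even though Node caches the underlying module.
function dynamicFs() {
  return import('node:fs');
}

const p1 = dynamicFs();
const p2 = dynamicFs();
console.log(p1 === p2); // false: a distinct Promise object per call

p1.then((mod) => {
  console.log(mod.default === fs); // true: same cached module underneath
});
```

In a watch-mode hot path this means one extra Promise allocation plus a microtask hop per file event, which the static import avoids entirely.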
Fixed in d270073. Replaced the dynamic `await import('node:fs')` with a static top-level `import fs from 'node:fs'`, eliminating per-call async overhead in the watch mode hot path.
- Restore symlink loop warning and move detection before readdirSync to avoid wasted I/O (greptile review)
- Replace dynamic fs import with static top-level import in incremental.js to avoid per-call async overhead in watch mode (greptile review)

Impact: 2 functions changed, 0 affected
src/builder/pipeline.js
```js
astMs: +ctx.timing.astMs.toFixed(1),
complexityMs: +ctx.timing.complexityMs.toFixed(1),
```
Crash if analysis engine fails — timing properties may be undefined
run-analyses.js wraps runAnalysesFn in a try-catch and leaves ctx.timing.astMs / ctx.timing.complexityMs unset when the engine throws. The return statement then calls .toFixed(1) on those undefined values, which throws a TypeError.
This pattern is preserved from the original monolithic builder.js (where _t.astMs.toFixed(1) had the same risk), but the refactoring is a good opportunity to close it.
```diff
- astMs: +ctx.timing.astMs.toFixed(1),
- complexityMs: +ctx.timing.complexityMs.toFixed(1),
+ astMs: +(ctx.timing.astMs ?? 0).toFixed(1),
+ complexityMs: +(ctx.timing.complexityMs ?? 0).toFixed(1),
```
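The crash and the suggested fix can be checked with a standalone snippet (not project code):

```javascript
// timing.astMs was never set because the analysis stage threw and was caught.
const timing = {};

let crashed = false;
try {
  +timing.astMs.toFixed(1); // TypeError: astMs is undefined
} catch (err) {
  crashed = err instanceof TypeError;
}
console.log(crashed); // true

// With the nullish-coalescing fallback the expression degrades to 0 instead:
console.log(+(timing.astMs ?? 0).toFixed(1)); // 0
```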
Fixed in d7355e4 — added a `?? 0` fallback for all timing properties (`astMs`, `complexityMs`, and all others) so the return statement won't crash if a stage didn't populate its timing field.
src/builder/pipeline.js
```js
await parseFiles(ctx);
await insertNodes(ctx);
await resolveImports(ctx);
await buildEdges(ctx);
await buildStructure(ctx);
await runAnalyses(ctx);
await finalize(ctx);
```
DB left open on unhandled stage errors
ctx.db is opened during setup but only closed inside finalize() (or inside detectChanges() on the early-exit path). If any stage between parseFiles and runAnalyses throws an uncaught exception — e.g. an out-of-memory error during a large parse, an unexpected SQLite error, or a missing barrel file — the database connection is leaked.
The original monolithic function had the same issue, but the decomposition into discrete stages makes it more likely that future stage additions will introduce additional failure paths that bypass finalize. A try/catch in buildGraph that closes the DB and re-throws would catch all of them:
```js
try {
  await parseFiles(ctx);
  await insertNodes(ctx);
  await resolveImports(ctx);
  await buildEdges(ctx);
  await buildStructure(ctx);
  await runAnalyses(ctx);
  await finalize(ctx);
} catch (err) {
  closeDb(ctx.db);
  throw err;
}
```

…ings

- Wrap pipeline stages in try-catch to close the DB if a stage throws between parseFiles and finalize.
- Add `?? 0` fallback for all timing properties that may be undefined if a stage didn't run.

Impact: 1 functions changed, 0 affected
src/builder/pipeline.js
```js
// ── Pipeline stages ───────────────────────────────────────────────
await collectFiles(ctx);
await detectChanges(ctx);

if (ctx.earlyExit) return;

try {
  await parseFiles(ctx);
  await insertNodes(ctx);
  await resolveImports(ctx);
  await buildEdges(ctx);
  await buildStructure(ctx);
  await runAnalyses(ctx);
  await finalize(ctx);
} catch (err) {
  closeDb(ctx.db);
  throw err;
}
```
DB leaked if collectFiles or detectChanges throws
The previous review thread addressed the DB leak for stages between parseFiles and finalize (now wrapped in the try/catch at line 101). However, collectFiles (line 96) and detectChanges (line 97) are called outside that block, so ctx.db — opened at line 43 — is still leaked if either of them throws an uncaught exception.
detectChanges in particular executes multiple SQLite queries: SELECT MAX(mtime) from file_hashes, reverse-dependency cascade joins, and purgeFilesFromGraph. Any of these can throw if the DB file is corrupt, locked, or the schema is in an unexpected state.
The current fix only covers 7 out of the 9 stages:
```js
// Before the try block — db opened but unprotected against throws in these two stages
await collectFiles(ctx);   // line 96 — no SQLite, but can OOM
await detectChanges(ctx);  // line 97 — many SQLite queries, CAN throw
if (ctx.earlyExit) return;
try {
  // Only these stages are protected
  await parseFiles(ctx);
  ...
} catch (err) {
  closeDb(ctx.db);
  throw err;
}
```

Extending the guard to include detectChanges (and, for consistency, collectFiles) closes the remaining gap:
```js
try {
  await collectFiles(ctx);
  await detectChanges(ctx);
  if (ctx.earlyExit) return;
  await parseFiles(ctx);
  await insertNodes(ctx);
  await resolveImports(ctx);
  await buildEdges(ctx);
  await buildStructure(ctx);
  await runAnalyses(ctx);
  await finalize(ctx);
} catch (err) {
  // Only close if not already closed by detectChanges early-exit path
  if (!ctx.earlyExit) closeDb(ctx.db);
  throw err;
}
```

Note: detectChanges already calls closeDb on its own early-exit paths (lines 323 and 332 of detect-changes.js), so a guard is needed to avoid double-closing.
Fixed in d229c87 — extended the try-catch to cover all stages including collectFiles and detectChanges. The catch guard checks !ctx.earlyExit to avoid double-closing, since detectChanges already calls closeDb on its early-exit paths.
```js
const insertAll = db.transaction(() => {
  // Phase 1: Batch insert all file nodes + definitions + exports
  const phase1Rows = [];
  for (const [relPath, symbols] of allSymbols) {
    ctx.fileSymbols.set(relPath, symbols);
    phase1Rows.push([relPath, 'file', relPath, 0, null, null]);
    for (const def of symbols.definitions) {
      phase1Rows.push([def.name, def.kind, relPath, def.line, def.endLine || null, null]);
    }
    for (const exp of symbols.exports) {
      phase1Rows.push([exp.name, exp.kind, relPath, exp.line, null, null]);
    }
  }
  batchInsertNodes(db, phase1Rows);
```
ctx.fileSymbols mutated inside a SQLite transaction — partial population on rollback
ctx.fileSymbols.set(relPath, symbols) on line 48 is a side effect that happens inside the db.transaction() callback. If the transaction is rolled back mid-way (e.g., a SQLite `UNIQUE constraint failed` or `database is locked` error during one of the batch inserts), the transaction's DB writes are atomically rolled back — but ctx.fileSymbols is a plain JS Map and is not rolled back.
The result: subsequent stages (resolveImports, buildEdges, buildStructure) would consume a fileSymbols map that is partially populated — containing symbols for files that were processed before the failure point, but not for files after it. This means edge resolution would silently skip files that were mid-insert, producing an incomplete and inconsistent graph instead of a clean error.
The fix is to populate ctx.fileSymbols before the transaction, so it is a pure input to (rather than a side-effect of) the DB write:
```js
// Populate fileSymbols before the transaction
for (const [relPath, symbols] of allSymbols) {
  ctx.fileSymbols.set(relPath, symbols);
}

const insertAll = db.transaction(() => {
  const phase1Rows = [];
  for (const [relPath, symbols] of allSymbols) {
    // No longer mutates ctx inside the transaction
    phase1Rows.push([relPath, 'file', relPath, 0, null, null]);
    ...
  }
  ...
});
```

If the pipeline catch block re-throws, no downstream stage runs anyway — but this also makes the transaction a pure DB operation, which is easier to reason about and test.
Fixed in d229c87 — moved ctx.fileSymbols population before the db.transaction() call so it is a pure input rather than a side-effect of the DB write. On rollback, the JS Map is fully populated (consistent) while the DB is cleanly rolled back.
…nsaction

- Extend try-catch in pipeline.js to cover collectFiles and detectChanges (which run SQLite queries that can throw), with a guard to avoid double-closing on the early-exit path.
- Move ctx.fileSymbols population before the db.transaction() in insert-nodes.js so a rollback doesn't leave partial JS state.

Impact: 2 functions changed, 1 affected
All Greptile feedback on #434 addressed in d229c87:
src/builder/pipeline.js
```js
 */
export async function buildGraph(rootDir, opts = {}) {
  const ctx = new PipelineContext();
  ctx.buildStart = performance.now();
  ctx.opts = opts;

  // ── Setup (creates DB, loads config, selects engine) ──────────────
  ctx.rootDir = path.resolve(rootDir);
  ctx.dbPath = path.join(ctx.rootDir, '.codegraph', 'graph.db');
  ctx.db = openDb(ctx.dbPath);
  initSchema(ctx.db);

  ctx.config = loadConfig(ctx.rootDir);
  ctx.incremental =
    opts.incremental !== false && ctx.config.build && ctx.config.build.incremental !== false;

  ctx.engineOpts = {
    engine: opts.engine || 'auto',
    dataflow: opts.dataflow !== false,
    ast: opts.ast !== false,
  };
  const { name: engineName, version: engineVersion } = getActiveEngine(ctx.engineOpts);
  ctx.engineName = engineName;
  ctx.engineVersion = engineVersion;
  info(`Using ${engineName} engine${engineVersion ? ` (v${engineVersion})` : ''}`);

  // Engine/schema mismatch detection
  ctx.schemaVersion = MIGRATIONS[MIGRATIONS.length - 1].version;
  ctx.forceFullRebuild = false;
  if (ctx.incremental) {
    const prevEngine = getBuildMeta(ctx.db, 'engine');
    if (prevEngine && prevEngine !== engineName) {
```
DB leaked if setup phase throws before the try block
ctx.db is opened on line 34, but the try/catch that calls closeDb doesn't begin until line 65. If any of the following lines throw, the database file handle is leaked:
- `initSchema(ctx.db)` (line 35) — runs migrations, can fail on a corrupted/locked DB
- `getBuildMeta(ctx.db, 'engine')` (line 47) — live SQLite query
- `getBuildMeta(ctx.db, 'schema_version')` (line 49) — live SQLite query
- `loadPathAliases(ctx.rootDir)` (line 56) — filesystem I/O
The previous review threads described the fix as "complete" and wrapping "ALL stages including collectFiles and detectChanges", but the setup phase between openDb and the try block is not a stage and remains unprotected.
A minimal fix is to start the try block immediately after openDb:
```js
ctx.db = openDb(ctx.dbPath);
try {
  initSchema(ctx.db);
  // ... all setup code and stages ...
} catch (err) {
  if (!ctx.earlyExit) closeDb(ctx.db);
  throw err;
}
```
Fixed in 027af48 — moved the try block to start immediately after openDb, so initSchema, getBuildMeta, loadPathAliases, and all other setup code is now covered. The earlyExit guard still prevents double-closing on the detectChanges early-exit path.
```js
} catch (err) {
  debug(`Unified analysis engine failed: ${err.message}`);
}
```
Analysis engine failures silently swallowed at debug level
When runAnalysesFn throws, the catch block only logs at debug level (line 42). This means failures in AST analysis, complexity metrics, CFG construction, or dataflow analysis are completely invisible to users unless they have debug logging explicitly enabled.
Unlike the pipeline's stage-level catch which re-throws (letting the caller decide), this catch silently discards the error and leaves all timing fields unset (which is why the ?? 0 fallbacks in pipeline.js were needed).
Consider raising the log level to warn so users know when analysis data is missing from their graph:
Should be warn so the failure is surfaced without aborting the build:

```js
} catch (err) {
  warn(`Unified analysis engine failed (AST/complexity/CFG/dataflow may be incomplete): ${err.message}`);
}
```
Fixed in 027af48 — promoted the log from debug to warn so users see when analysis data is incomplete. The message now reads: "Analysis engine failed (AST/complexity/CFG/dataflow may be incomplete): ..."
```js
const absPath = path.join(rootDir, relPath);
try {
  const content = fs.readFileSync(absPath, 'utf-8');
  ctx.lineCountMap.set(relPath, content.split('\n').length);
} catch {
  ctx.lineCountMap.set(relPath, 0);
}
```
Direct fs.readFileSync instead of readFileSafe
Lines 26 and 73 use fs.readFileSync directly to read file contents for line counting, bypassing the readFileSafe retry logic that handles transient errors (EBUSY, EACCES, EPERM) common on Windows or with editors doing non-atomic saves. All other file reads in the pipeline stages use readFileSafe.
```diff
-    const content = fs.readFileSync(absPath, 'utf-8');
+    const content = readFileSafe(absPath);
```
This applies to both line 26 and line 73. readFileSafe is already imported in helpers.js and used throughout the other stages for consistency.
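Since readFileSafe itself isn't shown in this thread, here is a rough sketch of what such a retry wrapper might look like; this is a hypothetical stand-in, and the real helper in src/builder/helpers.js may differ in signature and behavior:

```javascript
import fs from 'node:fs';

// Hypothetical retry wrapper: retries the transient error codes mentioned
// above (EBUSY/EACCES/EPERM) with a short synchronous backoff, and returns
// null when the read ultimately fails. Not the actual helpers.js code.
function readFileSafeSketch(absPath, retries = 3, backoffMs = 10) {
  for (let attempt = 0; ; attempt++) {
    try {
      return fs.readFileSync(absPath, 'utf-8');
    } catch (err) {
      const transient = ['EBUSY', 'EACCES', 'EPERM'].includes(err.code);
      if (!transient || attempt >= retries) return null; // give up quietly
      // Synchronous backoff keeps the helper usable from sync call sites.
      const until = Date.now() + backoffMs * (attempt + 1);
      while (Date.now() < until) { /* spin a few ms */ }
    }
  }
}

console.log(readFileSafeSketch('no-such-file.txt')); // null: ENOENT is not transient
```

Returning null (rather than throwing) matches the fallback pattern in the hunk above, where a failed read simply records a line count of 0.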
Fixed in 027af48 — replaced both fs.readFileSync calls (lines 26 and 73) with readFileSafe for consistent transient-error retry (EBUSY/EACCES/EPERM) on Windows. Also removed the unused fs import.
…eadFileSafe

- Move try-catch to start immediately after openDb so initSchema, getBuildMeta, and loadPathAliases are all covered.
- Promote analysis engine failure log from debug to warn so users know when AST/complexity/CFG/dataflow data is incomplete.
- Replace direct fs.readFileSync with readFileSafe in build-structure.js for consistent transient-error retry on Windows.

Impact: 3 functions changed, 1 affected
Summary
- `buildGraph()` decomposed into 9 independently testable pipeline stages communicating through a shared `PipelineContext` object
- `src/builder.js` reduced from a mega-file to an 11-line barrel re-export (following the established `src/db.js` pattern)
- `src/watcher.js` now delegates to `builder/incremental.js` instead of duplicating 140 lines of node insertion and edge building logic

New directory structure
Backward compatibility
All existing `import { buildGraph, collectFiles, ... } from './builder.js'` statements continue working unchanged. The barrel re-export is transparent to consumers.

Test plan
- `build.test.js` integration tests pass
- `scoped-rebuild.test.js` tests pass (including reverse-dep cascade)
- `incremental-parity.test.js` tests pass
- `builder.test.js` unit tests pass (collectFiles, loadPathAliases, readFileSafe)
- `db` param in `rebuildFile` kept for API consistency