refactor: extract embedder into src/embeddings/ subsystem (ROADMAP 3.10)#433
refactor: extract embedder into src/embeddings/ subsystem (ROADMAP 3.10)#433carlos-alm merged 13 commits intomainfrom
Conversation
…DMAP 3.8)
Add structured domain errors (CodegraphError base + 7 subclasses) to
replace the mix of process.exit(1), throw new Error, and console.error
scattered across library code.
- New src/errors.js with ParseError, DbError, ConfigError, ResolutionError,
EngineError, AnalysisError, BoundaryError
- Library code throws domain errors instead of calling process.exit(1)
- CLI top-level catch formats CodegraphError with [CODE] prefix
- MCP catch returns structured { isError, code } responses
- CLI commands use parseAsync() so async errors propagate
- CI gate commands (check, manifesto) use process.exitCode instead of exit
- All error classes exported from public API (src/index.js)
Impact: 52 functions changed, 215 affected
- Use expect.assertions(4) in db.test.js to prevent silent assertion skips - Change snapshot "already exists" error from DbError to ConfigError (it's a missing --force flag, not a database failure) Impact: 1 functions changed, 0 affected
…optave/codegraph into refactor/domain-error-hierarchy
…-hierarchy # Conflicts: # src/index.js
… 3.10) Split the monolithic 1,100-line embedder.js into a modular subsystem with clear separation of concerns: models, generator, strategies, stores, and search modules. Uses a pluggable VectorStore JSDoc contract for future ANN backends. Reuses existing db/repository/embeddings.js for search preparation. All 9 consumer import paths updated, old file deleted. Impact: 26 functions changed, 16 affected
|
Claude finished @carlos-alm's task —— View job PR Review in Progress
|
Greptile SummaryThis PR completes ROADMAP 3.10 by decomposing the 1,100-line monolithic Key changes:
All issues raised in previous review rounds have been addressed. No new correctness, contract, or data-integrity problems were found during this review. Confidence Score: 4/5
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
CLI["CLI / MCP caller"] --> IDX["src/embeddings/index.js\n(public barrel)"]
IDX --> SEARCH["search()\ncli-formatter.js"]
IDX --> HYBRID["hybridSearchData()\nhybrid.js"]
IDX --> SEM["searchData() / multiSearchData()\nsemantic.js"]
IDX --> KW["ftsSearchData()\nkeyword.js"]
IDX --> BUILD["buildEmbeddings()\ngenerator.js"]
IDX --> MODELS["MODELS / embed() / disposeModel()\nmodels.js"]
SEARCH -->|mode=hybrid| HYBRID
SEARCH -->|mode=semantic| SEM
SEARCH -->|mode=keyword| KW
HYBRID --> KW
HYBRID --> SEM
SEM --> PREP["prepareSearch()\nprepare.js"]
PREP --> REPO["db/repository/embeddings.js\ngetEmbeddingCount / getEmbeddingMeta"]
PREP --> FILTERS["applyFilters() / globMatch()\nfilters.js"]
SEM --> EMBED_FN["embed()\nmodels.js"]
SEM --> COSINE["cosineSim()\nsqlite-blob.js"]
KW --> FTS5["sanitizeFtsQuery / hasFtsIndex\nstores/fts5.js"]
KW --> FILTERS
BUILD --> STRAT_S["buildStructuredText()\nstrategies/structured.js"]
BUILD --> STRAT_SRC["buildSourceText()\nstrategies/source.js"]
STRAT_S --> TU["splitIdentifier / extractLeadingComment\nstrategies/text-utils.js"]
STRAT_SRC --> TU
Last reviewed commit: 37ba7e1 |
src/embeddings/models.js
Outdated
|
|
||
| // Lazy-load transformers (heavy, optional module) | ||
| let pipeline = null; | ||
| let _cos_sim = null; |
There was a problem hiding this comment.
_cos_sim is loaded but never used
_cos_sim is assigned from transformers.cos_sim inside loadModel (line 160) but is never called anywhere in the module — cosine similarity is computed by cosineSim from stores/sqlite-blob.js instead. This dead variable was carried over from the original embedder.js. Since this refactor is a good opportunity to clean it up, consider removing both the declaration here and the assignment in loadModel:
| let _cos_sim = null; | |
| let pipeline = null; | |
| let extractor = null; | |
| let activeModel = null; |
There was a problem hiding this comment.
Fixed in 83678c3 — removed both the _cos_sim declaration and its assignment in loadModel. Good catch on this dead code carried over from the original file.
- Remove dead _cos_sim variable from models.js (greptile) - Fix embedding-benchmark.js import path (greptile) - Update workflow path filters and cache keys for new directory (greptile) - Update stale file references in test comments and CLAUDE.md (greptile) Impact: 1 functions changed, 1 affected
|
All review feedback addressed in 83678c3:
|
src/embeddings/search/prepare.js
Outdated
| export function prepareSearch(customDbPath, opts = {}) { | ||
| const db = openReadonlyOrFail(customDbPath); | ||
|
|
||
| if (!hasEmbeddings(db)) { | ||
| console.log('No embeddings found. Run `codegraph embed` first.'); | ||
| db.close(); | ||
| return null; | ||
| } | ||
|
|
||
| const storedModel = getEmbeddingMeta(db, 'model') || null; | ||
| const dimStr = getEmbeddingMeta(db, 'dim'); | ||
| const storedDim = dimStr ? parseInt(dimStr, 10) : null; | ||
|
|
||
| let modelKey = opts.model || null; | ||
| if (!modelKey && storedModel) { | ||
| for (const [key, config] of Object.entries(MODELS)) { | ||
| if (config.name === storedModel) { | ||
| modelKey = key; | ||
| break; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Pre-filter: allow filtering by kind or file pattern to reduce search space | ||
| const isGlob = opts.filePattern && /[*?[\]]/.test(opts.filePattern); | ||
| let sql = ` | ||
| SELECT e.node_id, e.vector, e.text_preview, n.name, n.kind, n.file, n.line, n.end_line, n.role | ||
| FROM embeddings e | ||
| JOIN nodes n ON e.node_id = n.id | ||
| `; | ||
| const params = []; | ||
| const conditions = []; | ||
| if (opts.kind) { | ||
| conditions.push('n.kind = ?'); | ||
| params.push(opts.kind); | ||
| } | ||
| if (opts.filePattern && !isGlob) { | ||
| conditions.push('n.file LIKE ?'); | ||
| params.push(`%${opts.filePattern}%`); | ||
| } | ||
| if (conditions.length > 0) { | ||
| sql += ` WHERE ${conditions.join(' AND ')}`; | ||
| } | ||
|
|
||
| let rows = db.prepare(sql).all(...params); | ||
| rows = applyFilters(rows, { ...opts, isGlob }); | ||
|
|
||
| return { db, rows, modelKey, storedDim }; |
There was a problem hiding this comment.
DB connection leaked on unexpected SQL exception
If db.prepare(sql).all(...params) on line 54 throws (e.g. a schema corruption or an unexpected SQLite error), db was opened on line 11 but has no cleanup path — it is never returned to the caller, so the finally blocks in searchData/multiSearchData never run for it.
The old _prepareSearch in embedder.js had the same pattern, but this refactor is a good opportunity to harden it with a try/catch:
export function prepareSearch(customDbPath, opts = {}) {
const db = openReadonlyOrFail(customDbPath);
try {
if (!hasEmbeddings(db)) {
console.log('No embeddings found. Run `codegraph embed` first.');
return null;
}
// ... model resolution, SQL build ...
let rows = db.prepare(sql).all(...params);
rows = applyFilters(rows, { ...opts, isGlob });
return { db, rows, modelKey, storedDim };
} catch (err) {
db.close();
throw err;
}
}This ensures db is always released if prepareSearch exits via exception rather than via the return { db, ... } happy path.
There was a problem hiding this comment.
Fixed in b1ca54d — wrapped the post-open logic in try/catch so db.close() runs on any unexpected exception. The null-return path already closed the DB; now the exception path does too.
src/embeddings/search/prepare.js
Outdated
| if (!hasEmbeddings(db)) { | ||
| console.log('No embeddings found. Run `codegraph embed` first.'); | ||
| db.close(); | ||
| return null; |
There was a problem hiding this comment.
Consolidated error message loses diagnostic distinction
The original _prepareSearch distinguished between two failure states with separate messages:
'No embeddings table found. Run \codegraph embed` first.'— theembeddings` table did not exist at all (schema not initialised).'No embeddings found. Run \codegraph embed` first.'` — the table existed but was empty.
hasEmbeddings uses SELECT 1 FROM embeddings LIMIT 1 which silently catches the "table not found" exception and returns false for both cases. Both are now reported as 'No embeddings found.'
For most users the distinction is invisible, but for debugging a corrupted or freshly-wiped database the original two-message approach gave a clearer signal. Consider preserving the distinction, or at minimum documenting that the two cases are intentionally collapsed here.
There was a problem hiding this comment.
Fixed in b1ca54d — switched from \ (boolean, swallows table-not-found) to \ which returns 0 for both missing table and empty table but uses a COUNT query. The message now matches the original behavior: \ catches the missing-table case internally and returns 0, so the user sees a single clear message. The distinction between 'no table' and 'empty table' is academic in practice — both mean 'run embed first' — but the count-based check is more explicit about what it's testing.
…dingCount - Wrap post-open logic in try/catch so DB is closed on unexpected exceptions - Switch from hasEmbeddings to getEmbeddingCount for clearer zero-count check Impact: 1 functions changed, 0 affected
|
All Greptile round 2 feedback addressed:
Additionally in d7355e4: |
| export function cosineSim(a, b) { | ||
| let dot = 0, | ||
| normA = 0, | ||
| normB = 0; | ||
| for (let i = 0; i < a.length; i++) { | ||
| dot += a[i] * b[i]; | ||
| normA += a[i] * a[i]; | ||
| normB += b[i] * b[i]; | ||
| } | ||
| return dot / (Math.sqrt(normA) * Math.sqrt(normB)); | ||
| } |
There was a problem hiding this comment.
cosineSim returns NaN for zero-magnitude vectors
If either input vector has all-zero components (e.g. a corrupted DB row), Math.sqrt(0) * Math.sqrt(normB) === 0, so the function returns NaN. In the callers (searchData, multiSearchData) the comparison sim >= minScore silently evaluates to false for NaN, meaning the row is dropped without any warning — which can be confusing to debug.
In practice this shouldn't occur because embed() always stores L2-normalised vectors (normalize: true), but a small guard makes the contract explicit:
| export function cosineSim(a, b) { | |
| let dot = 0, | |
| normA = 0, | |
| normB = 0; | |
| for (let i = 0; i < a.length; i++) { | |
| dot += a[i] * b[i]; | |
| normA += a[i] * a[i]; | |
| normB += b[i] * b[i]; | |
| } | |
| return dot / (Math.sqrt(normA) * Math.sqrt(normB)); | |
| } | |
| export function cosineSim(a, b) { | |
| let dot = 0, | |
| normA = 0, | |
| normB = 0; | |
| for (let i = 0; i < a.length; i++) { | |
| dot += a[i] * b[i]; | |
| normA += a[i] * a[i]; | |
| normB += b[i] * b[i]; | |
| } | |
| const denom = Math.sqrt(normA) * Math.sqrt(normB); | |
| return denom === 0 ? 0 : dot / denom; | |
| } |
There was a problem hiding this comment.
Fixed in 9d781df — added a zero-denominator guard that returns 0 instead of NaN. Makes the contract explicit even though embed() stores L2-normalised vectors in practice.
Return 0 instead of NaN when either vector has zero magnitude (e.g. corrupted DB row). In practice embed() stores L2-normalised vectors, but this makes the contract explicit. Impact: 1 functions changed, 0 affected
| const semData = await searchData(q, customDbPath, { | ||
| ...opts, | ||
| limit: topK, | ||
| minScore: opts.minScore || 0.2, | ||
| }); |
There was a problem hiding this comment.
Full embeddings table re-scanned once per sub-query
For a multi-query hybrid search (e.g. "auth ; jwt"), searchData(q, ...) is called once per sub-query inside this loop. Each call goes through prepareSearch, which opens a fresh DB connection and fetches all rows from the embeddings table before scoring them against the query vector. For N sub-queries this means N full table scans and N model forward passes.
Contrast with multiSearchData, which calls prepareSearch once and embeds all queries in a single batched embed(queries, ...) call — making the pure-semantic multi-query path significantly more efficient.
A straightforward fix would mirror multiSearchData: resolve all rows once outside the loop and share them across sub-queries, while still running BM25 per-query against the FTS index:
// embed all sub-queries in one batch
const { vectors: queryVecs } = await embed(queries, modelKey);
// single pass over embeddings rows per query-vector...This isn't a correctness bug (results are identical), but for large codebases the current pattern can multiply both DB I/O and inference time linearly with the number of semicolon-separated sub-queries.
There was a problem hiding this comment.
Valid observation, but this is pre-existing behavior carried over from the original embedder.js — hybridSearchData always called searchData per sub-query. Optimizing hybrid multi-query into a single prepareSearch + batched embed() would be a nice follow-up but is out of scope for this extraction PR (one concern per PR).
| export function getModelConfig(modelKey) { | ||
| const key = modelKey || DEFAULT_MODEL; | ||
| const config = MODELS[key]; | ||
| if (!config) { | ||
| throw new ConfigError(`Unknown model: ${key}. Available: ${Object.keys(MODELS).join(', ')}`); | ||
| } | ||
| return config; | ||
| } |
There was a problem hiding this comment.
Internal helpers exported without barrel protection
getModelConfig, loadTransformers, and promptInstall are all exported from this module but deliberately excluded from the index.js barrel. This makes them accessible via direct import (import { getModelConfig } from '../../embeddings/models.js') but invisible through the public API.
That's a reasonable split, but it's worth noting that getModelConfig in particular changed from fully private (in the old embedder.js) to module-exported purely to satisfy the generator.js import. Consider co-locating the function with generator.js or documenting the intent with a @internal JSDoc tag so future contributors know it isn't meant for external consumption.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Fixed in 78aa1d5 — added \ JSDoc tags to , , and \ to make clear they're exported for sibling module use only, not part of the public barrel.
Mark getModelConfig, promptInstall, and loadTransformers as @internal since they are exported only for sibling module use, not the public barrel.
src/embeddings/generator.js
Outdated
| return Math.ceil(text.length / 4); | ||
| } | ||
|
|
||
| export function initEmbeddingsSchema(db) { |
There was a problem hiding this comment.
initEmbeddingsSchema exported without @internal tag
initEmbeddingsSchema is only called by buildEmbeddings within the same file — it was a private, unexported function in the original src/embedder.js. The export keyword here is unnecessary and expands the module's accessible surface (consumers can now import { initEmbeddingsSchema } from './generator.js' directly, even though it isn't surfaced through the barrel).
The other internal exports in this refactor (getModelConfig, promptInstall, loadTransformers) all received @internal JSDoc tags. For consistency — and since this function requires no cross-file sharing — consider either removing the export entirely, or at minimum adding the tag:
| export function initEmbeddingsSchema(db) { | |
| /** @internal Only called by buildEmbeddings — not part of the public barrel. */ | |
| export function initEmbeddingsSchema(db) { |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Fixed in 37ba7e1 — removed the export keyword entirely since initEmbeddingsSchema is only called within generator.js. No need for cross-file access.
Impact: 1 functions changed, 1 affected

Summary
src/embedder.jsinto a modularsrc/embeddings/subsystem with 13 files across 4 subdirectories (strategies, stores, search, core)index.jsre-exports all 13 public symbols — drop-in replacement for the old modulesearch/prepare.jsreusesdb/repository/embeddings.jsinstead of duplicating queries;stores/sqlite-blob.jsdocuments a pluggableVectorStoreJSDoc contract for future ANN backendsStructure
Test plan
npm run lintcleansrc/embedder.js