This file is the root entrypoint for AI agents working in this repository. Keep it small. Use it to route to the right workflow and local guide, not as the full operations manual.
- All work happens on `main` by default. If you use feature branches, keep them small, short-lived, and easy to fast-forward back into `main`.
- Every `harbor run` must be gated by interactive confirmation.
- Before commit/push, run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only changes).
- Prefer a remote execution environment (e.g., Daytona) for large benchmark runs; use local Docker only when a task's image or registry is incompatible with your cloud environment. See `docs/DAYTONA.md`.
- Set parallelism based on your own account and model limits. Avoid exceeding documented concurrency or rate caps for your environment or provider.
- Before launching any benchmark batch, check account readiness with `python3 scripts/check_infra.py` or `python3 scripts/account_health.py status`. Do not assume OAuth accounts are usable just because credentials exist.
- Keep the Beads CLI (`bd`, alias `beads`) up to date before running agent workflows that rely on task graphs.
- Install or update with the official installer: `curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/install.sh | bash`
- Verify install/version with `bd --version` (or `beads --version`).
- Do not use `bd edit`; use non-interactive `bd create/update/close --json` or stdin-based `--description=-`.
- Typical flow: `bd ready --json`, `bd create ... --json`, `bd update <id> --claim`, `bd close <id> --reason "Done"`.
- Default load order: this file + one relevant skill + one relevant doc.
- Do not open broad catalogs (`docs/TASK_CATALOG.md`, large script lists, full reports) unless required.
- Prefer directory-local `AGENTS.md` / `CLAUDE.md` when working under `scripts/`, `configs/`, `tasks/`, or `docs/`.
- Launch or rerun benchmarks: `docs/DAYTONA.md` (Daytona, preferred) or `docs/START_HERE_BY_TASK.md`
- Monitor / status: `docs/START_HERE_BY_TASK.md` -> "Monitor Active Runs"
- Triage failures: `docs/START_HERE_BY_TASK.md` -> "Triage Failed Tasks"
- Compare configs / MCP impact / IR: `docs/START_HERE_BY_TASK.md` -> "Analyze Results"
- Repo policy / health gate: `docs/REPO_HEALTH.md`, `docs/ops/WORKFLOWS.md`
- Script discovery: `docs/ops/SCRIPT_INDEX.md`
- `scripts/AGENTS.md` - script categories, safe usage, one-off handling
- `configs/AGENTS.md` - run launcher wrappers and confirmation gate policy
- `docs/AGENTS.md` - documentation IA and canonical vs archive guidance
- Compact after exploration, after launching a batch, and after triage/report passes.
- Use `/handoff` skill for session handoffs (inline prompt, not a markdown file unless asked).
- Use `docs/ops/HANDOFF_TEMPLATE.md` as checklist.
- Run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only).
- `git pull --rebase && git push && git status` -- work is not done until push succeeds.
- Track follow-ups in issues or beads. Update status.
- `docs/START_HERE_BY_TASK.md` - task-based read order
- `docs/ops/WORKFLOWS.md` - operational workflow summaries
- `docs/ops/TROUBLESHOOTING.md` - escalation and common failure routing
- `docs/ops/SCRIPT_INDEX.md` - generated script registry index
- `docs/reference/README.md` - stable specs and reference docs
- `docs/explanations/README.md` - rationale and context docs
- NEVER edit root `CLAUDE.md` or `AGENTS.md` directly. Edit canonical sources under `docs/ops/` and regenerate. Direct edits cause `agent_guides_drift` failures in `repo_health.py`.
- After removing directories from the repo, also clean references from `scripts/sync_agent_guides.py` (`LOCAL_SOURCES`) and `scripts/docs_consistency_check.py` (`LOCAL_AGENT_TARGET_DIRS`).
- Daytona builds from Dockerfiles at sandbox creation. Fixes on `main` take effect next run (exception: pre-built GHCR base images need separate rebuild).
- Harbor+Daytona (`harbor run --environment-type daytona`) is recommended. `scripts/daytona_runner.py` is for quick validation only.
- `BASELINE_MCP_TYPE` env var: `none`, `sourcegraph`, `deepsearch`.
- Use Daytona SDK (`daytona_sdk`) over CLI (CLI is interactive-only for SSH).
- GHCR packages default private for personal accounts; visibility change requires GitHub web UI.
- Snapshot names are positional: `daytona snapshot create ccb-name`, NOT `--name`.
- CLI/API version mismatch causes "Forbidden" errors. Keep CLI version in sync.
- Registry types enum: `internal`, `organization`, `transient`, `backup`. Use `organization` for GHCR/Docker Hub.
- `uv tool install` segfaults on ARM64/QEMU emulation. Use `pip install` instead, or switch to Daytona (native x86_64).
- Build-push-clean pattern when building Docker images with limited disk (~45GB): build one image, push, then clean locally before the next.
- Colons in agent names (e.g., `module:ClassName`) break Docker volume mounts. Sanitize paths: replace `:` with `__`.
- Add `|| git init` fallback to all `git clone` commands in Dockerfiles for network resilience. Applied to 269 Dockerfiles.
- Add `chown claude:claude /logs` and `adduser claude` to Dockerfiles for cross-harness (OH) permission compatibility.
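A minimal sketch of the colon sanitization described above. Docker's `-v` syntax uses `:` to separate host path, container path, and options, which is why a colon inside a path component is a problem. The function name `sanitize_agent_name` is hypothetical and not taken from this repository.

```python
def sanitize_agent_name(name: str) -> str:
    """Replace every colon with a double underscore.

    Hypothetical helper: the result is intended for use as a path
    component in a Docker volume mount.
    """
    return name.replace(":", "__")


print(sanitize_agent_name("module:ClassName"))  # module__ClassName
```

The mapping is not reversible if a name already contains `__`, so keep the original name elsewhere if you need to recover it.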
- `.mcp.json` at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`.
- Claude Code needs `--mcp-config` flag; it does not auto-detect. Inject MCP usage instructions into the task prompt.
- `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in containers.
- Sourcegraph: stdio transport (`npx @sourcegraph/cody --stdio`), NOT HTTP. HTTP 405 = wrong protocol.
- Sourcegraph skills show empty in headless mode. Embed prompt content in CLAUDE.md.
- Sourcegraph env vars: `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN` (NOT `_ENDPOINT` or `_TOKEN`).
- Timing fields (`started_at`, `finished_at`) at top level of `result.json`, not nested under `timing`.
- `trajectory.json` generated by Harbor's `_convert_events_to_trajectory()`, not by Claude Code CLI.
- SWE-bench `test.sh` redirects stdout to temp file; Harbor never sees `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers.
- Token usage in `trajectory.json`; transcript parsers don't see it. Contract: write `/logs/verifier/reward.txt`.
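A sketch of reading the top-level timing fields from a parsed `result.json`. It assumes the timestamps are ISO 8601 strings that `datetime.fromisoformat` accepts; the actual format is not stated here, so check a real result file first. The function name `trial_duration_seconds` is hypothetical.

```python
from datetime import datetime


def trial_duration_seconds(result: dict) -> float:
    """Return elapsed seconds between started_at and finished_at.

    Reads both fields from the top level of the result dict,
    not from a nested "timing" object.
    Assumption: timestamps are ISO 8601 strings.
    """
    started = datetime.fromisoformat(result["started_at"])
    finished = datetime.fromisoformat(result["finished_at"])
    return (finished - started).total_seconds()
```

Typical use is `trial_duration_seconds(json.load(open(path)))` on a result file.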
- Never pass credentials via Docker `-e` flags. They leak into trajectory HTML when an agent runs `env`. Use file-based injection: write to `/logs/agent/.credentials.json` with `chmod 600`.
- `scripts/sanitize_secrets.py` redacts real API keys (Anthropic, OpenAI, Sourcegraph, GitHub, Daytona) at result generation time. Maintains allowlist for known fake benchmark fixtures.
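A minimal sketch of file-based credential injection with owner-only permissions. The function name `write_credentials` and the JSON payload shape are assumptions for illustration; the schema the agent actually expects is not specified here.

```python
import json
import os


def write_credentials(path: str, credentials: dict) -> None:
    """Write credentials as JSON to a file readable only by its owner.

    Hypothetical helper. The file is created with mode 0600 so it is
    never world-readable, even briefly.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump(credentials, handle)
    # Re-apply the mode in case the file already existed with looser permissions.
    os.chmod(path, 0o600)
```

For example, `write_credentials("/logs/agent/.credentials.json", {...})` on the container side. Creating the file with the restrictive mode up front avoids the short window that a separate write-then-`chmod` sequence leaves open.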
- no_changes_guard must use `git diff origin/main HEAD` (not `git diff HEAD`) for agents that auto-commit (e.g., OpenHands). Otherwise the guard falsely penalizes normal OH behavior.
- Verifier path fallback chains: use `${TASK_WORKDIR:-/workspace}` for working directory and `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root. Enables same verifier across Harbor and OpenHands.
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo. The go.work file may require a newer Go version than the container provides.
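For verifiers written in Python rather than shell, the same fallback chain can be sketched as below. The shell `:-` operator treats an empty variable like an unset one, and `or` mirrors that. The function name `resolve_verifier_paths` is hypothetical.

```python
import os


def resolve_verifier_paths(env=None):
    """Return (workdir, repo_root) using the fallback chains above.

    Equivalent to ${TASK_WORKDIR:-/workspace} and
    ${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}: unset and empty
    values both fall through to the next candidate.
    """
    env = os.environ if env is None else env
    workdir = env.get("TASK_WORKDIR") or "/workspace"
    repo_root = env.get("TASK_REPO_ROOT") or env.get("VERIFY_REPO") or "/workspace"
    return workdir, repo_root
```

Passing a plain dict as `env` makes the resolution easy to test without touching the real environment.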
- `validators.py` duplicated across `ccb_build` tasks. Changes must hit all copies (verify with `sha256sum`).
- Install scripts printing "INSTALL_SUCCESS" regardless of outcome are common. Verify binary exists.
- Agent completing in <2s = never installed/ran. Trial dir names truncated with hash; real name in `config.json` at `task.path`.
- LoCoBench task IDs have multi-word fields. Use 3-digit task number as positional anchor.
- no_changes_guard: write `reward.txt` inside Python block, not in bash after it.
- `timeout 600` on all test runners. `--forceExit` for Jest. Jest+TS needs `memory_mb = 8192`.
- CSB dual-score: file edits + `answer.json` scored independently. Fallback: `promoted_verifier.py` -> `oracle_checks.py` -> heuristic.
- Rate-limited results (score=0, <30s): `scripts/quarantine_invalid_tasks.py --execute`.
- Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
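The `sha256sum` check on duplicated `validators.py` copies can also be sketched in Python. The helper names `group_by_digest` and `copies_in_sync` are hypothetical, and the caller supplies the file paths, since the task directory layout is not described here.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_by_digest(paths):
    """Map each SHA-256 hex digest to the list of files that have it."""
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        groups[digest].append(str(path))
    return dict(groups)


def copies_in_sync(paths) -> bool:
    """True when every file has identical content (at most one digest)."""
    return len(group_by_digest(paths)) <= 1
```

When `copies_in_sync` returns False, the output of `group_by_digest` shows which copies diverged.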
- `gh auth refresh` needs explicit `-s <scope>`: `gh auth refresh -h github.com -s write:packages`.
- Env vars must be exported for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
- Account readiness: `runs/state/account_health.json`. Launchers source `configs/_common.sh`.
- GitHub push protection blocks synthetic keys. Squash with `git reset --soft origin/main`.
- Shallow clones fail on push. Some repos use `master`; detect with `git symbolic-ref refs/remotes/origin/HEAD`.
- GitHub secret scanning: unblock via `/security/secret-scanning/unblock-secret/` URL.
- `dict.get(key, default)` does NOT protect against `None` values. Use `data.get("key") or default_value`.
- `with open(log) as f: subprocess.Popen(stdout=f)` closes the handle. Use `open()` without context manager for long-running subprocesses.
- macOS Bash 3.2 lacks `declare -A`. Use pipe-delimited strings with `IFS='|' read -r`.
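A short illustration of the `dict.get` pitfall: the default applies only when the key is missing, not when it is present with a `None` value. The helper `get_or_default` is a hypothetical name for the stricter variant.

```python
config = {"timeout": None}

# The key exists, so the default is ignored and None comes back.
assert config.get("timeout", 600) is None

# "or" substitutes the default for None (and for any other falsy value,
# such as 0 or "").
assert (config.get("timeout") or 600) == 600


def get_or_default(data: dict, key, default):
    """Return default only when the value is missing or None.

    Unlike the "or" idiom, legitimate falsy values such as 0 are kept.
    """
    value = data.get(key)
    return default if value is None else value
```

Use the `or` form when every falsy value should be replaced, and the explicit `is None` check when `0` or an empty string is meaningful.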
- Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
- Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
- Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
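A minimal sketch of the check ordering described above. The function name `categorize_tool` and the category labels are hypothetical; only the ordering (prefix check first, substring check second) comes from the note.

```python
def categorize_tool(tool_name: str) -> str:
    """Categorize a tool name, testing the MCP prefix before substrings."""
    # Prefix check first: "mcp__deep_search" must count as an MCP tool.
    if tool_name.startswith("mcp__"):
        return "mcp"
    # Substring checks only run for names without the MCP prefix.
    if "deep_search" in tool_name:
        return "deep_search"
    return "other"


print(categorize_tool("mcp__deep_search"))  # mcp
```

Swapping the two `if` blocks would label `mcp__deep_search` as `deep_search`, which is the miscategorization the note warns about.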
- `sandbox_plugins` is a list (not property). Strip ALL plugins (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout on large repos). TOML config has no effect in v1.4.0.
- `shlex.quote()` breaks on shell metacharacters (0% execution). Base64-encode instructions on host, decode inside container.
- Background daemons outlive the main process and hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')` (missing on minimal images).
- Alpine lacks `apt-get` (OH installer requirement). Use `bookworm` variants.
- OH MCP client has ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
- `chown -R /workspace` blocks port binding >120s on large repos. Edit installed `runtime_init.py` source -- monkey-patches don't propagate to action_execution_server subprocess.
- Set `PYTHONSAFEPATH=1` to prevent repo-local packages from shadowing installed deps.
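A sketch of the base64 round trip for passing instructions into a container. The standard base64 alphabet (`A-Z`, `a-z`, `0-9`, `+`, `/`, `=`) contains no quotes, backticks, or `$`, so the payload carries none of the original text's shell metacharacters. The function names are hypothetical, and how the payload reaches the container is left to the launcher.

```python
import base64


def encode_instruction(text: str) -> str:
    """Host side: encode instruction text as an ASCII base64 payload."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_instruction(payload: str) -> str:
    """Container side: recover the original instruction text."""
    return base64.b64decode(payload).decode("utf-8")


original = 'Fix `parse()`; echo "$HOME" && rm -rf \'tmp\''
assert decode_instruction(encode_instruction(original)) == original
```

Inside a container without Python, `base64 -d` performs the same decoding step.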
- Secret-detection hooks false-positive on code that detects secrets. Use `--no-verify` when flagged code is detection logic.
- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
- Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
- Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
- `docs/START_HERE_BY_TASK.md` is generated from `docs/ops/task_routes.json`.
- Regenerate after edits (single command): `python3 scripts/refresh_agent_navigation.py`