The Context Engine for Real Software Engineering Workloads
LLMs are great at local reasoning: naming a function, sketching a query, explaining a snippet. Real software work is different. It’s long-horizon: multi-hour debugging, refactors that touch five modules, investigations that weave through docs, issues, and experiments. Most “RAG + summarization” stacks help for a few turns, then drift.
Cortex avoids that drift with an internal subsystem we call the Cortex Context Engine. It isn’t a separate product; it’s the state manager that feeds the model only what matters—evidence with provenance, version-correct code awareness, and a stable prompt shape across 50–100+ turns.
Where It Lives (and Why)
User Goal ──► Planner ──► Tools (repo, search, runners, profilers)
                │                          ▲
                ▼                          │
      Cortex Context Engine ───────────────┘
                │
      Evidence Ledger (audit trail)
                │
         Model Runtime
- Planner breaks goals into steps and decides when to read, run, or refactor.
- Tools surface raw signals (docs, code slices, traces, outputs).
- Cortex Context Engine turns raw signals into a compact, reproducible working set the model can cite.
- Evidence Ledger records exactly what was injected (source + range + integrity hash).
The Context Engine is orchestrated by the Planner and informed by Tools. It’s not “just RAG”; it’s how the agent remembers the why, not only the what.
Design Goals
- Next-step helpfulness over raw similarity
- Deterministic prompt shape under explicit token budgets
- Provenance you can audit and reproduce
- Conversation compression that preserves decisions and causal links
- Version-correct retrieval in multi-branch repos
Under the Hood (safe-to-share details)
Hybrid retrieval that prefers helpfulness
We combine lexical and semantic candidates, then re-rank with a lightweight embedding-cosine signal that captures usefulness at decision time. Retrieval is two-stage: page recall → section recall. This gets you the right kind of evidence (references/definitions) rather than just “similar words.”
Candidate set
pages = BM25_top_k(q) ∪ Dense_top_k(embed(q))
sections = topSections(pages, perPage=m) // headings/anchors
Rerank features (illustrative, not exhaustive):
- doc_type (reference/definition/tutorial/example/impl)
- symbol_overlap (identifiers, API names)
- section_depth (heading proximity)
- code_ratio and readability_entropy (favor concise, reference-like text)
- recency (when available)
Selection sketch (pseudo):
cands = union(sections_bm25(q), sections_dense(embed(q)))
scored = cosine_rerank(cands) // MLX embedder; no cross-encoder
picked = mmr(scored, q_vec, k=K, lambda∈(0,1)) // diversify ideas/sources
Weights, normalizers, and tie-breakers are private.
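The MMR step in the selection sketch can be made concrete. The sketch below is a minimal, illustrative Python version of relevance-vs-redundancy selection; the function names, the candidate dict shape, and the default lambda are assumptions for illustration, not Cortex internals.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(candidates, q_vec, k, lam=0.7):
    """Maximal Marginal Relevance: trade query relevance against
    redundancy with already-picked snippets. Each candidate is a
    dict like {"id": ..., "vec": [...]} (shape assumed here)."""
    picked = []
    pool = list(candidates)
    while pool and len(picked) < k:
        def score(c):
            rel = cosine(c["vec"], q_vec)
            red = max((cosine(c["vec"], p["vec"]) for p in picked),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(pool, key=score)
        picked.append(best)
        pool.remove(best)
    return picked
```

With a low lambda, a near-duplicate of an already-picked snippet loses to a less similar but complementary one, which is the diversification effect described above.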
Snippet record (what the model can cite):
Snippet {
id: int
title: string
source_uri: uri
span: { start_line: int, end_line: int }
sha256: hex // integrity for audit
license: spdx
approxTokenCount: int
fused_score: float
}
Deterministic packing with budgets and diversification
Context must be small, predictable, and non-redundant. We treat budget as a hard constraint and use token-cost–aware MMR. Snippets carry approxTokenCount computed at ingest to avoid re-tokenizing during packing.
Policy (conceptual):
- Hard cap: budget_tokens
- Per-source soft cap to avoid single-doc domination
- Reserve lanes (e.g., pinned goal, recent raw turns, active code diff)
- Diversification via MMR so we keep complementary snippets
Pack sketch (pseudo):
budget = B
blocks = [
pin(Goal, max=T_goal),
pin(RecentRawTurns[N], max=T_recent),
inject(ActiveDiff, max=T_diff),
inject(mmr_select_budget_aware(picked_snippets), max=T_evidence)
]
context, citations = assemble(blocks, budget=B)
ledger.append(citations, tokensUsed=context.tokenCount)
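The hard-cap behavior in the pack sketch can be shown with a few lines of Python. This is a minimal greedy packer under stated assumptions: blocks arrive pre-ordered (reserve lanes first), each carries a precomputed token cost, and anything that would overflow the budget is skipped rather than truncated. The tuple shape is an assumption for illustration.

```python
def pack(blocks, budget):
    """Greedy packing under a hard token budget.
    blocks: list of (name, token_cost, text), reserve lanes first,
    so pinned material is never evicted by evidence."""
    used, kept, citations = 0, [], []
    for name, cost, text in blocks:
        if used + cost > budget:
            continue  # hard cap: skip anything that would overflow
        used += cost
        kept.append(text)
        citations.append(name)
    return "\n\n".join(kept), citations, used
```

Because costs are computed at ingest (approxTokenCount), the loop never re-tokenizes, which keeps packing cost linear in the candidate count.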
Why this matters: The model sees a stable prompt shape from turn to turn; latency stays bounded; and every claim can point back to a numbered snippet.
Graph-aware compression (keep the “why”)
We don’t compress by word count—we compress by structure so the assistant retains intent and rationale.
Graph schema (conceptual):
- Nodes: Problem, Hypothesis, Decision, Evidence, CodeChange, ToolOutput
- Edges: temporal (happened-after), causal (supports/refutes), reference (cites)
Implemented with a SemanticKnowledgeGraph and GraphBasedMessageAnalyzer that extract nodes/edges; an MLXCompressionEngine emits a short, citation-aware summary; CompressionStateStorage persists compact stats for pattern mining across runs.
Compression policy:
- Pin: original Goal and last N raw user turns (verbatim)
- Compact: middle segments into a Graph Summary that:
- Preserves decision bullets with back-references to Evidence IDs
- Merges redundant tool outputs; keeps deltas
- Retains failure chains (problem → attempt → result → fix)
Older, low-centrality nodes get summarized first; decisions and their supporting evidence remain explicit. Decay functions and thresholds are private.
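The "low-centrality first, decisions pinned" ordering can be sketched concretely. This is an illustrative approximation only: it uses plain degree as a stand-in for centrality, and the node/edge shapes are assumptions, not the SemanticKnowledgeGraph schema. The real decay functions and thresholds are private, per the note above.

```python
def compression_order(nodes, edges):
    """Order nodes for summarization: least-connected and oldest
    first; Decision and Evidence nodes are pinned (never returned).
    nodes: {id: {"kind": str, "turn": int}}; edges: [(src, dst)]."""
    degree = {nid: 0 for nid in nodes}
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    candidates = [nid for nid, n in nodes.items()
                  if n["kind"] not in ("Decision", "Evidence")]
    # Degree is a simple stand-in for centrality in this sketch.
    return sorted(candidates,
                  key=lambda nid: (degree[nid], nodes[nid]["turn"]))
```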
Version-correct overlays & access proofs
Multi-branch, multi-user environments demand the right branch, the right text.
- Overlay index: per-user/branch diffs sit atop a shared base (copy-on-write). Retrieval checks overlay first, then base.
- Proof-of-possession: materializing text may require {path, sha256} to match the index; otherwise we hand back metadata only.
- Effect: retrieval aligns to your working tree, accidental leaks are reduced, and we avoid duplicating entire corpora.
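Overlay precedence plus the hash proof can be sketched in a few lines. Assumptions for illustration: overlay and base are modeled as path-to-text dicts, and the proof is a SHA-256 of the whole file; the real overlay index and proof protocol are not specified here.

```python
import hashlib

def materialize(path, claimed_sha256, overlay, base):
    """Overlay-first lookup with a content-hash access proof.
    overlay/base: dicts mapping path -> file text (the copy-on-write
    overlay sits atop the shared base)."""
    text = overlay.get(path, base.get(path))
    if text is None:
        return None
    actual = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if actual != claimed_sha256:
        # Caller could not prove possession: metadata only, no text.
        return {"path": path, "sha256": actual, "text": None}
    return {"path": path, "sha256": actual, "text": text}
```

Note the overlay wins even when the base also has the path, which is what aligns retrieval to the caller’s working tree.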
The Evidence Ledger (what we record)
Each turn appends an immutable record:
TurnRecord {
turn_id: int
time: ts
goal_ref: hash
tokens_used: int
snippets: [
{ local_id, source_uri, span:{start,end}, sha256, license, reason_tag }
]
model_citations: [local_id,...] // validated; out-of-set IDs dropped
replay_key: hash
}
We validate citations server-side and drop any out-of-set IDs before logging. This is how we achieve explainability (“where did that claim come from?”) and reproducibility (replay the exact context that produced an answer).
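Server-side citation validation is simple to illustrate. A minimal sketch, assuming citations are the snippets’ local IDs; the function name and de-duplication behavior are choices made here for clarity, not the documented interface.

```python
def validate_citations(model_citations, injected_ids):
    """Drop any citation ID the model emitted that was not in the
    injected snippet set; preserve order and de-duplicate."""
    allowed = set(injected_ids)
    seen, valid = set(), []
    for cid in model_citations:
        if cid in allowed and cid not in seen:
            seen.add(cid)
            valid.append(cid)
    return valid
```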
Planner ↔ Context Engine contract
A small, explicit interface keeps responsibilities clean:
build_context(goal, query, tool_signals, state) -> {
prompt_blocks, // ordered: goal, recent, diff, evidence
citations, // snippet IDs and spans
replay_key, // for deterministic rebuilds
advisory // e.g., "insufficient evidence; need X"
}
feedback(model_citations, tool_results) -> {
ledger_update,
graph_update
}
The Planner decides strategy; the Cortex Context Engine guarantees compactness, provenance, and determinism.
Failure modes & mitigations
- Over-similar candidates → MMR + per-source caps reduce redundancy.
- Ambiguous symbol names → lexical grounding + symbol_overlap.
- Hallucinated citations → model instructed to cite only provided IDs; server validates and drops out-of-set IDs before logging.
- Summary drift → graph compression preserves decisions and evidence IDs.
- Version soup → overlay precedence + content-hash proofs.
- Evidence starvation → advisory signals prompt the Planner to fetch more or ask the user.
What the model actually sees (trimmed example)
System:
You are a citation-first assistant. Use only the numbered snippets below.
If an answer isn’t in evidence, say so and request what’s missing.
Cite like [1], [2-3].
Pinned Goal:
Build a custom Metal compute shader to ...
Recent:
[User/Agent raw turns, last N tokens] …
Evidence:
[1] "IVF Lists—Sizing"
Source: …/faiss/ivf.md#lists Lines: 12–41 Hash: 9f…
[2] "Storage Modes"
Source: …/apple/metal/storage Lines: 101–135 Hash: 83…
Assistant style:
Start with list counts so each list holds ~1–2k vectors [1]. For the intermediate HDR buffer, prefer a private storage mode; see alignment notes in [2].
The ledger ties those numbers to sources and spans.
Performance notes (without the trade secrets)
- Retrieval: an inverted-file (IVF) semantic index (e.g., IVFFlat) plus an in-memory lexical index keep latency low on typical code/doc corpora.
- Packing: bounded to a small candidate set; complexity dominated by rerank + diversification on K.
- Stability: explicit budgets and fixed block ordering produce predictable prompt shapes turn-to-turn.
- Scale: comfortably handles tens of thousands of chunks per workspace on a developer laptop; larger corpora shift to sharded indices.
Exact index parameters, probe schedules, feature weights, and thresholds are private—those are where much of the reliability comes from.
What you’ll notice in Cortex
- Inline, numbered citations you can inspect.
- Consistent answers late in the session (no context thrash).
- Reproducible runs via replay keys and the ledger.
- Branch-aware behavior when you jump between features (when overlays are enabled).
FAQ
Do you rewrite my first prompt?
No. The initial goal is pinned verbatim; we start injecting context after that.
Will compression hide details I need?
We compress structure, not significance. Decisions and their evidence stay explicit; raw text is recoverable by the Planner as needed.
How do you keep prompts small without losing recall?
We prefer references over redundant implementations, diversify sources, and treat the token budget as a hard constraint.
Security in shared environments?
Content-hash proofs and overlays limit what text materializes and align retrieval to the right branch.