The Context Engine for Real Software Engineering Workloads
LLMs are great at local reasoning: naming a function, sketching a query, explaining a snippet. Real software work is different. It’s long-horizon: multi-hour debugging, refactors that touch five modules, investigations that weave through docs, issues, and experiments. Most “RAG + summarization” stacks help for a few turns, then drift.
Cortex avoids that drift with an internal subsystem we call the Cortex Context Engine. It isn’t a separate product; it’s the state manager that feeds the model only what matters—evidence with provenance, version-correct code awareness, and a stable prompt shape across 50–100+ turns.
Where It Lives (and Why)
User Goal ──► Planner ──► Tools (repo, search, runners, profilers)
                │                          ▲
                ▼                          │
      Cortex Context Engine ───────────────┘
                │
      Evidence Ledger (audit trail)
                │
         Model Runtime
- Planner breaks goals into steps and decides when to read, run, or refactor.
- Tools surface raw signals (docs, code slices, traces, outputs).
- Cortex Context Engine turns raw signals into a compact, reproducible working set the model can cite.
- Evidence Ledger records exactly what was injected (source + range + integrity hash).
The Context Engine is orchestrated by the Planner and informed by Tools. It’s not “just RAG”; it’s how the agent remembers the why, not only the what.
Design Goals
- Next-step helpfulness over raw similarity
- Deterministic prompt shape under explicit token budgets
- Provenance you can audit and reproduce
- Conversation compression that preserves decisions and causal links
- Version-correct retrieval in multi-branch repos
Under the Hood (safe-to-share details)
Hybrid retrieval that prefers helpfulness
We combine lexical and semantic candidates, then re-rank with a lightweight embedding-cosine signal that captures usefulness at decision time. Retrieval is two-stage: page recall → section recall. This gets you the right kind of evidence (references/definitions) rather than just “similar words.”
Candidate set
pages = BM25_top_k(q) ∪ Dense_top_k(embed(q))
sections = topSections(pages, perPage=m) // headings/anchors
Rerank features (illustrative, not exhaustive):
- doc_type (reference/definition/tutorial/example/impl)
- symbol_overlap (identifiers, API names)
- section_depth (heading proximity)
- code_ratio and readability_entropy (favor concise, reference-like text)
- recency (when available)
Selection sketch (pseudo):
cands = union(sections_bm25(q), sections_dense(embed(q)))
scored = cosine_rerank(cands) // MLX embedder; no cross-encoder
picked = mmr(scored, q_vec, k=K, lambda∈(0,1)) // diversify ideas/sources
Weights, normalizers, and tie-breakers are private.
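The MMR step in the selection sketch can be made concrete. The sketch below is a minimal, illustrative Python version of relevance-vs-redundancy selection; the function names, the candidate dict shape, and the default lambda are assumptions for illustration, not Cortex internals.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(candidates, q_vec, k, lam=0.7):
    """Maximal Marginal Relevance: trade query relevance against
    redundancy with already-picked snippets. Each candidate is a
    dict like {"id": ..., "vec": [...]} (shape assumed here)."""
    picked = []
    pool = list(candidates)
    while pool and len(picked) < k:
        def score(c):
            rel = cosine(c["vec"], q_vec)
            red = max((cosine(c["vec"], p["vec"]) for p in picked),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(pool, key=score)
        picked.append(best)
        pool.remove(best)
    return picked
```

With a low lambda, a near-duplicate of an already-picked snippet loses to a less similar but complementary one, which is the diversification effect described above.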
Snippet record (what the model can cite):
Snippet {
id: int
title: string
source_uri: uri
span: { start_line: int, end_line: int }
sha256: hex // integrity for audit
license: spdx
approxTokenCount: int
fused_score: float
}
Deterministic packing with budgets and diversification
Context must be small, predictable, and non-redundant. We treat budget as a hard constraint and use token-cost–aware MMR. Snippets carry approxTokenCount computed at ingest to avoid re-tokenizing during packing.
Policy (conceptual):
- Hard cap: budget_tokens
- Per-source soft cap to avoid single-doc domination
- Reserve lanes (e.g., pinned goal, recent raw turns, active code diff)
- Diversification via MMR so we keep complementary snippets
Pack sketch (pseudo):
budget = B
blocks = [
pin(Goal, max=T_goal),
pin(RecentRawTurns[N], max=T_recent),
inject(ActiveDiff, max=T_diff),
inject(mmr_select_budget_aware(picked_snippets), max=T_evidence)
]
context, citations = assemble(blocks, budget=B)
ledger.append(citations, tokensUsed=context.tokenCount)
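The hard-cap behavior in the pack sketch can be shown with a few lines of Python. This is a minimal greedy packer under stated assumptions: blocks arrive pre-ordered (reserve lanes first), each carries a precomputed token cost, and anything that would overflow the budget is skipped rather than truncated. The tuple shape is an assumption for illustration.

```python
def pack(blocks, budget):
    """Greedy packing under a hard token budget.
    blocks: list of (name, token_cost, text), reserve lanes first,
    so pinned material is never evicted by evidence."""
    used, kept, citations = 0, [], []
    for name, cost, text in blocks:
        if used + cost > budget:
            continue  # hard cap: skip anything that would overflow
        used += cost
        kept.append(text)
        citations.append(name)
    return "\n\n".join(kept), citations, used
```

Because costs are computed at ingest (approxTokenCount), the loop never re-tokenizes, which keeps packing cost linear in the candidate count.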
Why this matters: The model sees a stable prompt shape from turn to turn; latency stays bounded; and every claim can point back to a numbered snippet.
Graph-aware compression (keep the “why”)
We don’t compress by word count—we compress by structure so the assistant retains intent and rationale.
Graph schema (conceptual):
- Nodes: Problem, Hypothesis, Decision, Evidence, CodeChange, ToolOutput
- Edges: temporal (happened-after), causal (supports/refutes), reference (cites)
Implemented with a SemanticKnowledgeGraph and GraphBasedMessageAnalyzer that extract nodes/edges; an MLXCompressionEngine emits a short, citation-aware summary; CompressionStateStorage persists compact stats for pattern mining across runs.
Compression policy:
- Pin: original Goal and last N raw user turns (verbatim)
- Compact: middle segments into a Graph Summary that:
- Preserves decision bullets with back-references to Evidence IDs
- Merges redundant tool outputs; keeps deltas
- Retains failure chains (problem → attempt → result → fix)
Older, low-centrality nodes get summarized first; decisions and their supporting evidence remain explicit. Decay functions and thresholds are private.
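The "low-centrality first, decisions pinned" ordering can be sketched concretely. This is an illustrative approximation only: it uses plain degree as a stand-in for centrality, and the node/edge shapes are assumptions, not the SemanticKnowledgeGraph schema. The real decay functions and thresholds are private, per the note above.

```python
def compression_order(nodes, edges):
    """Order nodes for summarization: least-connected and oldest
    first; Decision and Evidence nodes are pinned (never returned).
    nodes: {id: {"kind": str, "turn": int}}; edges: [(src, dst)]."""
    degree = {nid: 0 for nid in nodes}
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    candidates = [nid for nid, n in nodes.items()
                  if n["kind"] not in ("Decision", "Evidence")]
    # Degree is a simple stand-in for centrality in this sketch.
    return sorted(candidates,
                  key=lambda nid: (degree[nid], nodes[nid]["turn"]))
```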
Version-correct overlays & access proofs
Multi-branch, multi-user environments demand the right branch, the right text.
- Overlay index: per-user/branch diffs sit atop a shared base (copy-on-write). Retrieval checks overlay first, then base.
- Proof-of-possession: materializing text may require {path, sha256} to match the index; otherwise we hand back metadata only.
- Effect: retrieval aligns to your working tree, accidental leaks are reduced, and we avoid duplicating entire corpora.
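Overlay precedence plus the hash proof can be sketched in a few lines. Assumptions for illustration: overlay and base are modeled as path-to-text dicts, and the proof is a SHA-256 of the whole file; the real overlay index and proof protocol are not specified here.

```python
import hashlib

def materialize(path, claimed_sha256, overlay, base):
    """Overlay-first lookup with a content-hash access proof.
    overlay/base: dicts mapping path -> file text (the copy-on-write
    overlay sits atop the shared base)."""
    text = overlay.get(path, base.get(path))
    if text is None:
        return None
    actual = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if actual != claimed_sha256:
        # Caller could not prove possession: metadata only, no text.
        return {"path": path, "sha256": actual, "text": None}
    return {"path": path, "sha256": actual, "text": text}
```

Note the overlay wins even when the base also has the path, which is what aligns retrieval to the caller’s working tree.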
The Evidence Ledger (what we record)
Each turn appends an immutable record:
TurnRecord {
turn_id: int
time: ts
goal_ref: hash
tokens_used: int
snippets: [
{ local_id, source_uri, span:{start,end}, sha256, license, reason_tag }
]
model_citations: [local_id,...] // validated; out-of-set IDs dropped
replay_key: hash
}
We validate citations server-side and drop any out-of-set IDs before logging. This is how we achieve explainability (“where did that claim come from?”) and reproducibility (replay the exact context that produced an answer).
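Server-side citation validation is simple to illustrate. A minimal sketch, assuming citations are the snippets’ local IDs; the function name and de-duplication behavior are choices made here for clarity, not the documented interface.

```python
def validate_citations(model_citations, injected_ids):
    """Drop any citation ID the model emitted that was not in the
    injected snippet set; preserve order and de-duplicate."""
    allowed = set(injected_ids)
    seen, valid = set(), []
    for cid in model_citations:
        if cid in allowed and cid not in seen:
            seen.add(cid)
            valid.append(cid)
    return valid
```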
Planner ↔ Context Engine contract
A small, explicit interface keeps responsibilities clean:
build_context(goal, query, tool_signals, state) -> {
prompt_blocks, // ordered: goal, recent, diff, evidence
citations, // snippet IDs and spans
replay_key, // for deterministic rebuilds
advisory // e.g., "insufficient evidence; need X"
}
feedback(model_citations, tool_results) -> {
ledger_update,
graph_update
}
The Planner decides strategy; the Cortex Context Engine guarantees compactness, provenance, and determinism.
Failure modes & mitigations
- Over-similar candidates → MMR + per-source caps reduce redundancy.
- Ambiguous symbol names → lexical grounding + symbol_overlap.
- Hallucinated citations → model instructed to cite only provided IDs; server validates and drops out-of-set IDs before logging.
- Summary drift → graph compression preserves decisions and evidence IDs.
- Version soup → overlay precedence + content-hash proofs.
- Evidence starvation → advisory signals prompt the Planner to fetch more or ask the user.
What the model actually sees (trimmed example)
System:
You are a citation-first assistant. Use only the numbered snippets below.
If an answer isn’t in evidence, say so and request what’s missing.
Cite like [1], [2-3].
Pinned Goal:
Build a custom Metal compute shader to ...
Recent:
[User/Agent raw turns, last N tokens] …
Evidence:
[1] "IVF Lists—Sizing"
Source: …/faiss/ivf.md#lists Lines: 12–41 Hash: 9f…
[2] "Storage Modes"
Source: …/apple/metal/storage Lines: 101–135 Hash: 83…
Assistant style:
Start with list counts so each list holds ~1–2k vectors [1]. For the intermediate HDR buffer, prefer a private storage mode; see alignment notes in [2].
The ledger ties those numbers to sources and spans.
Performance notes (without the trade secrets)
- Retrieval: an inverted-file (IVF) semantic index (e.g., IVFFlat) plus an in-memory lexical index keep latency low on typical code/doc corpora.
- Packing: bounded to a small candidate set; complexity dominated by rerank + diversification on K.
- Stability: explicit budgets and fixed block ordering produce predictable prompt shapes turn-to-turn.
- Scale: comfortably handles tens of thousands of chunks per workspace on a developer laptop; larger corpora shift to sharded indices.
Exact index parameters, probe schedules, feature weights, and thresholds are private—those are where much of the reliability comes from.
What you’ll notice in Cortex
- Inline, numbered citations you can inspect.
- Consistent answers late in the session (no context thrash).
- Reproducible runs via replay keys and the ledger.
- Branch-aware behavior when you jump between features (when overlays are enabled).
FAQ
Do you rewrite my first prompt?
No. The initial goal is pinned verbatim; we start injecting context after that.
Will compression hide details I need?
We compress structure, not significance. Decisions and their evidence stay explicit; raw text is recoverable by the Planner as needed.
How do you keep prompts small without losing recall?
We prefer references over redundant implementations, diversify sources, and treat the token budget as a hard constraint.
Security in shared environments?
Content-hash proofs and overlays limit what text materializes and align retrieval to the right branch.