30B LLM, 260k context, at 50 tokens/s… On your Laptop

LLMs are astonishing. Ten years ago they would have fallen under Clarke’s third law:

Any sufficiently advanced technology is indistinguishable from magic.

But that magic comes with a price. A price that, until today, has made running LLMs locally a luxury reserved for the hardware-wealthy. Long context windows explode memory use, and per-token latency is often bounded by memory bandwidth rather than FLOPs.

Meanwhile, many acceleration tricks demand complex multi-model setups that are brittle in practice.

Today we’re sharing a secret that’s made the Cortex model feel snappy at full context (260k) on just 16GB of unified RAM (Apple Silicon):

  • Hybrid attention to crush KV-cache growth and keep memory flat
  • Medusa heads for multi-token progress without a second model
  • System-level decisions that let Metal/MLX really breathe

No retraining of the base experts was required.

Why LLM generation is slow (especially on your laptop)

From a systems perspective, decoding is memory-bound:

  1. Tokens are generated one at a time
  2. Each step is read-heavy, fetching large weight blocks
  3. Naive attention makes KV memory grow with context length T, strangling batch/parallelism and causing paging at long context
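A back-of-envelope roofline makes the point. If every decode step has to stream the resident weights from memory once, bandwidth alone caps the token rate (the figures below are illustrative, not measurements of any particular Mac):

```python
def decode_ceiling_tok_s(resident_weight_bytes, mem_bandwidth_bytes_s):
    """Roofline-style upper bound for single-stream decode: each token
    reads roughly every resident weight byte once, so the token rate is
    capped at bandwidth / bytes, regardless of available FLOPs."""
    return mem_bandwidth_bytes_s / resident_weight_bytes

# Illustrative: ~100 GB/s unified memory, ~4 GB of resident weights
print(decode_ceiling_tok_s(4e9, 100e9))  # → 25.0 tok/s ceiling
```

This is why shrinking what each step has to touch matters more than raw compute on a laptop.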

On Apple’s unified memory, you feel this as cache thrash, thermal throttling, and token rates that flatten, or even drop, as the prompt grows.

Speculative decoding is great, but complex

Speculative decoding accelerates generation with a separate draft model whose proposals are verified by the main model. It can be fast, but in practice you must:

  • host two models and their state
  • keep tokenizers in sync
  • maintain high acceptance despite domain drift
  • align carefully with MoE verifiers, where acceptance otherwise suffers

Our goal was speed without the operational overhead. Because, hey, we only have 16GB on most MacBooks.

Our Approach: One model, tiny cache

We replace most attention blocks with a dynamic, causal convolution over the values V whose kernel is generated per token. This makes attention linear-time with O(1) state per layer. We keep just 2–4 “sync” layers of full attention (first/last, and conditionally one or two mid-stack), each with a sliding window.
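A minimal NumPy sketch of the idea (names and shapes are illustrative; in the real layer, `kernels` is generated from the hidden state and the whole thing runs as a fused Metal kernel):

```python
import numpy as np

def dynamic_causal_conv(v, kernels):
    """Mix values with a per-token causal kernel of width K.

    v:       (T, D) value vectors
    kernels: (T, K) per-token mixing weights (produced from the
             hidden state in the real model)
    Output t depends only on v[t-K+1 .. t], so the layer carries
    O(K) state instead of a KV cache that grows with T.
    """
    T, _ = v.shape
    K = kernels.shape[1]
    out = np.zeros_like(v)
    for t in range(T):
        for j in range(K):
            if t - j >= 0:
                out[t] += kernels[t, j] * v[t - j]
    return out
```

Causality is easy to check: perturbing a future value must leave all earlier outputs untouched.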

This approach shrinks KV cache from gigabytes to tens of megabytes.
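The arithmetic is easy to sanity-check with illustrative dimensions (48 layers, 4 GQA KV heads of dim 128, FP16 cache; these are assumptions for the example, not the actual Cortex shapes):

```python
def kv_cache_bytes(tokens, layers, kv_heads=4, head_dim=128, dtype_bytes=2):
    # K and V tensors: tokens x layers x kv_heads x head_dim, times 2 for K+V
    return tokens * layers * kv_heads * head_dim * dtype_bytes * 2

full   = kv_cache_bytes(260_000, 48)   # every layer caches the full context
hybrid = kv_cache_bytes(8_192, 3)      # 3 sync layers with an 8k window
print(f"{full / 2**30:.0f} GiB vs {hybrid / 2**20:.0f} MiB")  # → 24 GiB vs 48 MiB
```

Only the windowed sync layers hold KV at all, so the cache stops scaling with the prompt.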

On Apple Silicon this buys you:

  • Long contexts (260k) with stable memory and steady latency
  • Headroom to run additional logic without paging

To complement this hybrid attention, we attach lightweight decoding heads to the final hidden state that predict multiple future tokens. At inference, the main model stays frozen; we assemble candidate continuations from the Medusa heads and verify them in one pass, accepting the longest plausible prefix.

  • No importance-sampling detours
  • No second tokenizer
  • No extra KV for a draft model

This works because heads reuse the main model’s representation with no distribution mismatch. Serving is operationally simple, and combined with the tiny cache, verification stays cheap even at long context.

What this looks like in use

When you submit a prompt it runs through our hybrid attention stack. At each step:

  1. The main head predicts the next token
  2. Medusa heads propose tokens at future offsets
  3. We build a small set of candidate continuations
  4. Verification evaluates the candidates efficiently in one pass

We then advance by 1–N tokens in a single step.
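The steps above can be put together as a toy sketch. Here `main_head`, `medusa_heads`, and `verify` are stand-ins for the real model components, and `threshold` is a placeholder for the typical-acceptance rule:

```python
def decode_step(prefix, main_head, medusa_heads, verify, threshold=0.1):
    """One decode step: propose, verify in a single pass, advance 1..N tokens."""
    t0 = main_head(prefix)                          # 1. main head's next token
    proposals = [h(prefix) for h in medusa_heads]   # 2. offset proposals
    candidate = [t0] + proposals                    # 3. candidate continuation
    probs = verify(prefix, candidate)               # 4. one verification pass
    accepted = [t0]                                 # greedy token always lands
    for tok, p in zip(proposals, probs[1:]):
        if p.get(tok, 0.0) < threshold:
            break                                   # keep longest good prefix
        accepted.append(tok)
    return prefix + accepted
```

Even in the worst case the step emits the main head's token, so throughput never falls below plain decoding.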

Why this fits in 16GB (and stays fast)

  • Attention layers keep O(1) state, with only 2–4 windowed full-attn layers holding short KV
  • We run Q4 weights and keep active MoE experts resident, while the rest are memory-mapped
  • Medusa adds a handful of small heads
  • Fused kernels (dequant, matmul/conv, activation), pre-allocation buffers, and expert-batched GEMMs keep bandwidth pressure low in Metal/MLX
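The resident-vs-memory-mapped expert split can be sketched with `numpy.memmap` (illustrative shapes, a hypothetical file layout, and a hard-coded hot set; the real runtime does this over Metal buffers and picks hot experts from routing statistics):

```python
import os
import tempfile
import numpy as np

# Hypothetical flat file of expert weights: (n_experts, d, d) in FP16.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
n_experts, d = 8, 256
np.ones((n_experts, d, d), dtype=np.float16).tofile(path)

mapped = np.memmap(path, dtype=np.float16, shape=(n_experts, d, d))
resident = {e: np.asarray(mapped[e]).copy() for e in (0, 1)}  # hot experts in RAM

def expert_weights(e):
    # Hot experts come from RAM; cold ones page in from disk on demand.
    return resident.get(e, mapped[e])
```

The OS pages cold experts in lazily, so resident memory tracks the routed working set rather than the full 30B parameters.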

Results (what you should expect)

All numbers below are single-stream guidance on M-series Macs with 16GB. Your exact device and kernel maturity affect the final values.

  • Base hybrid only (no Medusa):

    ~25–35 tok/s steady decode at long context.

  • Hybrid + Medusa (2–3 heads, typical acceptance, packed verify):

    ~35–55 tok/s, depending on prompt stability and head accuracy.

  • With retrieval (many citations, 260k rolling):

    Multi-scale kernels + one tiny cross-attn to doc summaries keep us in the low-to-mid 30s, rising to 40s when the conversation stabilizes.

We intentionally avoid “lab peak” numbers; these reflect realistic thermal and memory behavior on laptops.

What we’re not doing

  • No separate draft model to host and align.
  • No unbounded full attention.
  • No heavyweight retraining of base experts.
  • No fragile tokenizer games.

Engineering choices that mattered

  • 2–4 full-attn “sync” layers only; everything else custom attention.
  • Windowed full attention (e.g., 8k) to cap KV even in sync layers.
  • Fused Q4 ops: unpack int4 + scale inside the kernel; FP32 accumulate.
  • Batch-by-expert (MoE) so GEMMs are big and hot.
  • Pre-allocated private memory for ring buffers, logits, scratch; argument buffers/heaps to avoid per-step rebinding.
  • Typical acceptance over importance sampling—fast and stable at low temperature.
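The fused Q4 path is easiest to see in scalar form. Below is a NumPy sketch of the unpack-and-scale step the Metal kernel performs in-register (group size and packing layout are illustrative assumptions):

```python
import numpy as np

def dequant_q4(packed, scales, group=32):
    """Unpack two int4 values per byte and apply per-group scales.
    packed: (N//2,) uint8, low nibble first; scales: (N // group,) float32."""
    lo = packed & 0x0F
    hi = packed >> 4
    q = np.empty(packed.size * 2, dtype=np.int8)
    q[0::2] = lo
    q[1::2] = hi
    q -= 8                                 # recentre nibbles to [-8, 7]
    return q.astype(np.float32).reshape(-1, group) * scales[:, None]
```

In the fused kernel, the dequantized values feed the FP32-accumulating matmul directly instead of ever being written back to memory, which is what keeps bandwidth pressure low.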

Quality & safety

  • We keep first/last attention layers dense, and often one mid-stack layer, to preserve global mixing.
  • Medusa heads are trained with the base frozen; they suggest, not decide.
  • Typical acceptance ensures we always advance at least one token (greedy) and accept longer spans only when probabilities are high.
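Typical acceptance can be sketched as an entropy-aware threshold, in the spirit of the Medusa paper's rule (the constants `eps` and `delta` here are illustrative, not our tuned values):

```python
import math

def typical_threshold(probs, eps=0.09, delta=0.3):
    """Accept a candidate token x when p(x) >= min(eps, delta * exp(-H)).
    The threshold falls as verifier entropy H rises, so uncertain steps
    accept proposals more freely; the greedy token bypasses this check,
    which is why decoding always advances at least one token."""
    H = -sum(p * math.log(p) for p in probs if p > 0.0)
    return min(eps, delta * math.exp(-H))
```

A peaked verifier distribution therefore demands more probability mass from a proposal than a flat one does.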

Where this shines (and where it doesn’t)

Shines

  • Long-context coding and retrieval: repository-scale prompts, tool traces, multi-file refactors.
  • Low-temperature chat and completion where local determinism is high.
  • On-device privacy scenarios where cloud-scale hardware isn’t an option.

Doesn’t

  • High-temperature creative writing (acceptance drops; Medusa helps less).
  • Ultra-constrained memory (<8 GB) without expert slimming or offloading.

What’s next

  • A compact “landmarks over attention kernels” variant for even cheaper long-range mixing.
  • Optional speculative decoding on top of Medusa (small SSM drafter) for workloads where acceptance is extremely high.
  • Public Metal kernels for the fused attention op once we’ve hardened them across devices.