March 19, 2026

Parameter Golf

Train the best language model that fits in 16 megabytes. You get 10 minutes on 8 H100s.

What this is

OpenAI is running a public competition called Parameter Golf. The constraint is brutal and precise: your entire trained model (weights, tokenizer, quantization scheme, everything) must compress to under 16 megabytes. Then it gets evaluated on a held-out language modeling benchmark (bits-per-byte on the Pile). Lowest score wins. April 30 deadline.

Current leader is at 1.1539 BPB. The baseline GPT-2 small would be around 2.85. Every 0.01 improvement at this scale is the product of some clever trick nobody else has tried yet: a quantization scheme, an embedding hack, a training procedure borrowed from a different paper. The leaderboard is mostly empty right now. It's early.
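For concreteness, the eval metric converts cross-entropy loss into bits per byte: divide nats by ln(2) to get bits, then normalize by the raw byte count of the eval text. A minimal sketch (the harness's exact accounting isn't public here, so treat this as illustrative):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) over an eval set
    to bits-per-byte: divide by ln(2) to get bits, then by byte count."""
    return total_loss_nats / (math.log(2) * total_bytes)

# e.g. 1.0e6 nats of loss over 1.25e6 bytes of raw text
score = bits_per_byte(1.0e6, 1_250_000)
```

Note the normalization is over bytes, not tokens, which is what makes the metric tokenizer-agnostic: a fancier tokenizer can't game it.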

I spent today competing in it. This is what that looked like.

The actual afternoon

I started with a GPU question. I had Akash Network credits and wanted to use them. The Akash console interface is not intuitive. You're deploying via SDL config files to a decentralized network of compute providers. Got an 8xA100 instance running within an hour.

My previous runs (v2, v3) had been trying a "Rule Engine" architecture: a looped transformer where the same 4 layers execute 8 times, trying to learn compressed representations of language rules. The hypothesis was interesting. The results were not: 1.235 BPB, worse than the vanilla dense leaders. The research I'd done earlier showed recurrent architectures consistently underperform dense ones below 135M params. So I handed off Codex's LeaderboardBlend spec (dense 10x512 layers, overtone spectral initialization, skip connections) and queued it to run on the A100 after v4.

Somewhere in the middle of this, Azriel from Corgi Insurance texted. The call I'd been waiting on was happening in 10 minutes. I pulled up the prep doc, got the current company numbers ($40M ARR, 40K customers, carrier license, $630M valuation), ran through the pitch. Then went back to the SSH session.

LeaderboardBlend came back at 1.1946 BPB. The pivot worked.

I asked Claude to "spawn whatever researchers you need and bring back an idea that's insanely good." Five parallel research agents went out simultaneously: competition intelligence, quantization techniques, tiny LM architectural innovations, training efficiency, distillation/knowledge transfer. All five returned. While I was reading the synthesis, context hit 91% and the session reset. The handoff fired automatically. Fresh instance picked up mid-sentence.

Total runs: four architectures across two GPU providers, one context limit hit, one insurance call, zero wasted time.

Run history

Real numbers. The regression is included.

v3: Rule Engine (recurrent 4×768, looped): 1.235 BPB. Dead end; recurrence underperforms dense at this scale.

v4: Free Wins (recurrent 4×768, Muon WD + FP16 embed): 1.235 BPB. Same ceiling; architecture problem, not tuning.

LeaderboardBlend (dense 10×512, overtone init, skip connections): 1.1946 BPB. Pivot worked; dense crushed recurrent.

v5: KillStack (dense 10×512 + BigramHash + SmearGate + LAWA): 1.2251 BPB. REGRESSION. LAWA during warmdown dilutes the LR decay. Lesson learned.

v6: KillStack, fixed (dense 10×512, 3x MLP, BigramHash, SmearGate, INT6+zstd-22, no LAWA): 1.1843 BPB. Current best, #2 on the leaderboard. Dropped LAWA, kept everything else.

v7: OrthoInit (v6 + orthogonal weight init): no score; over the 16MB size limit. Orthogonal weights don't compress well.

v9: Late QAT (v6 + quantization-aware training at 75%): no improvement. QAT doesn't help when INT6 rounding is already tight.

Current leader: PR #135 at 1.1539 BPB. Gap to close: ~0.03.

How the team works

Most people use AI like a better search engine. Ask a question, get an answer, copy-paste the useful parts. That's fine. It's also leaving most of the value on the table.

What I've built is a team model. Chief coordinates the day, tracks priorities, and manages context. When something needs depth, Chief spawns a specialist: a Builder session with SSH credentials that logs into the live machine, diagnoses the actual state, and writes and patches code directly, or a Researcher session that runs five subagents in parallel across different angles and synthesizes the results. The specialists are full Claude instances. They investigate, form their own views, and push back when the brief is wrong.

Today's workflow: I gave the Builder live SSH access to an 8xA100 instance on Akash. It confirmed GPU state was idle, reviewed the v3 code, diagnosed three problems (weight decay missing from Muon optimizer, warmdown calibrated for H100 not A100, embedding learning rate too conservative), patched them, and launched v4. Then it reviewed Codex's LeaderboardBlend script, flagged concerns about artifact size and warmdown aggression, uploaded to the remote, and wrote a wrapper to auto-launch it after v4 finished. That all happened inside one conversation.

When context hit 91%, the session handed off automatically. Transcript summary fired, fresh instance spawned, picked up mid-sentence on the synthesis I was reading. The continuity system means I don't lose the thread when Claude runs out of working memory.

This is a fundamentally different thing from "using AI." It's closer to having a small team that's always available, always focused, and can run parallel workstreams without coordination overhead.

What the research found

The five agents came back with a kill stack. Some techniques are proven. Some are unexploited. All are stackable.

INT6 entropy reduction + zstd-22 (in PR #135)

Round INT8 weights to the nearest multiple of 4. The compressor sees 64 distinct values instead of 256. Combined with zstd level 22, the artifact drops 18% (from 15.5MB to 12.75MB). Frees budget for a bigger model.
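The rounding step is small enough to sketch in full. The snippet below uses zlib as a stand-in for zstd-22 (the real pipeline would use the zstandard package at level 22, which this sketch assumes rather than demonstrates); the point is that fewer distinct symbols means a smaller compressed artifact:

```python
import numpy as np
import zlib  # stand-in for zstd-22 so the sketch stays stdlib-only

def int6_round(w_int8: np.ndarray) -> np.ndarray:
    """Round INT8 weights to the nearest multiple of 4, so the
    compressor sees at most 64 distinct byte values instead of 256."""
    q = np.round(w_int8.astype(np.int16) / 4.0) * 4
    return np.clip(q, -128, 124).astype(np.int8)

rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=100_000, dtype=np.int8)
wq = int6_round(w)

# Incompressible 256-symbol noise vs the same noise rounded to 64 symbols:
raw = zlib.compress(w.tobytes(), 9)
quant = zlib.compress(wq.tobytes(), 9)
```

On uniform noise the rounded tensor compresses to roughly 6/8 of the raw size (6 bits of entropy per byte); on real trained weights, which are far from uniform, the post reports an 18% artifact reduction.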

BigramHash embedding (in PR #135)

4096-bucket hash table injecting token-pair statistics directly into the embedding layer. ~590K params, zero-initialized so it starts as a no-op and learns what it needs.
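A minimal sketch of the idea. The hash function and table width are assumptions (the post gives only the bucket count and the ~590K parameter total; 4096 × 144 = 589,824, so a 144-wide table is consistent with that figure):

```python
import numpy as np

N_BUCKETS = 4096   # bucket count from the post
D_BIGRAM = 144     # assumed width: 4096 * 144 = 589,824 ~ "~590K params"

# Zero-initialized table: starts as an exact no-op added on top of the
# normal token embedding, then learns pairwise statistics during training.
bigram_table = np.zeros((N_BUCKETS, D_BIGRAM), dtype=np.float32)

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """Cheap hash of the (previous, current) token pair into one of
    4096 buckets. Mixing constants are illustrative, not from PR #135."""
    h = (prev_tok * 1_000_003 + cur_tok * 31) & 0x7FFFFFFF
    return h % N_BUCKETS

def bigram_embed(tokens: np.ndarray) -> np.ndarray:
    """For each position t, look up the bucket for (tokens[t-1], tokens[t]).
    Position 0 has no predecessor and gets the zero vector."""
    out = np.zeros((len(tokens), D_BIGRAM), dtype=np.float32)
    for t in range(1, len(tokens)):
        out[t] = bigram_table[bigram_bucket(tokens[t - 1], tokens[t])]
    return out
```

Because the table starts at zero, adding its output to the regular embedding changes nothing at step 0; gradients then carve bigram-specific corrections into whichever buckets matter.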

SmearGate (in PR #135)

512 learnable scalars that blend each dimension's current embedding with the previous token's. Starts 50/50, learns its own ratio per dimension. Free local context.
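A sketch of the gate, assuming it operates on the embedding output and that the 50/50 start is achieved by initializing gate logits at zero (sigmoid(0) = 0.5); both are plausible readings of the description, not confirmed details:

```python
import numpy as np

D_MODEL = 512

# One learnable logit per dimension; sigmoid(0) = 0.5, so every
# dimension starts as an exact 50/50 blend and training moves the ratio.
gate_logits = np.zeros(D_MODEL, dtype=np.float32)

def smear_gate(x: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model) embeddings. Blend each position with the
    previous position's embedding, with a learned per-dimension ratio."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))              # per-dim mix in (0, 1)
    prev = np.vstack([np.zeros((1, x.shape[1]), x.dtype), x[:-1]])
    return g * x + (1.0 - g) * prev
```

The appeal is the cost: 512 extra scalars buy a crude one-token convolution over the sequence before attention ever runs.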

Multi-Token Prediction (untried)

Auxiliary loss over the next 3 tokens with a decaying weight schedule (1.0, 0.5, 0.25), collapsing to standard next-token CE by 80% of training. Zero extra params. Zero artifact cost. Nobody in Parameter Golf has tried it.
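The weight schedule can be sketched directly. The post gives the starting weights and the 80% collapse point; the linear decay shape in between is an assumption:

```python
def mtp_weights(progress: float):
    """Loss weights for predicting tokens at offsets +1, +2, +3.
    Start at (1.0, 0.5, 0.25); decay the auxiliary (+2, +3) weights
    linearly to zero by 80% of training, after which only the standard
    next-token cross-entropy term remains."""
    decay = max(0.0, 1.0 - progress / 0.8)   # 1 at start, 0 from 80% on
    return (1.0, 0.5 * decay, 0.25 * decay)
```

The total loss at each step would be the weighted sum of the three shifted cross-entropy terms; since the extra heads reuse the existing unembedding, nothing is added to the artifact.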

Partial Key Offset (untried)

In attention, shift a subset of key dimensions by one timestep. The model gets 1-step induction for free, with no extra parameters. A 5-minute change, untried on this base.
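The shift itself is a one-liner on the key tensor. A minimal sketch (single head, no batch dimension; zero-padding the first position is an assumption):

```python
import numpy as np

def offset_keys(K: np.ndarray, n_offset: int) -> np.ndarray:
    """K: (seq_len, d_head) attention keys. Shift the first n_offset
    key dimensions back by one timestep, so a query that matches those
    dimensions at position t is really matching position t-1's content:
    1-step induction with zero extra parameters."""
    out = K.copy()
    out[1:, :n_offset] = K[:-1, :n_offset]
    out[0, :n_offset] = 0.0   # first position has no predecessor
    return out
```

The remaining d_head minus n_offset dimensions keep normal same-position matching, so the head can learn to use either view.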

The regression

I ran v5 with the full kill stack plus LAWA (latest weight averaging), a technique that maintains a rolling average of the last 5 model checkpoints during training. Papers report it consistently improves perplexity at 125M scale.

6,290 steps in 10 minutes. It went from 1.1946 to 1.2251. Worse by 0.03 BPB.

The cause: LAWA was running through the warmdown phase, where the learning rate decays to near-zero. At that point you're averaging a well-converged late checkpoint with less-trained earlier snapshots. The averaging dilutes the very thing warmdown is trying to produce. LAWA and warmdown don't coexist well. You either use LAWA as the warmdown replacement, or you stop averaging before the decay starts.
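The fix that shipped in v6 was the second option: freeze the average before the decay starts. A minimal sketch of that logic, with plain lists of floats standing in for weight tensors (the checkpointing details here are illustrative, not the actual training code):

```python
from collections import deque

class LAWA:
    """Rolling average of the last k checkpoints. Call stop() when the
    LR warmdown begins: the average freezes, so late, well-converged
    weights aren't diluted by earlier, less-trained snapshots."""
    def __init__(self, k: int = 5):
        self.window = deque(maxlen=k)   # oldest snapshot falls off automatically
        self.frozen = None

    def update(self, weights):
        if self.frozen is None:
            self.window.append(list(weights))

    def stop(self):
        """Freeze the average; later updates are ignored."""
        self.frozen = self.average()

    def average(self):
        if self.frozen is not None:
            return self.frozen
        n = len(self.window)
        return [sum(ws) / n for ws in zip(*self.window)]
```

The v5 bug, in these terms: stop() was never called, so warmdown-phase checkpoints kept getting averaged with the rolling window instead of being allowed to converge cleanly.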

The silver lining: INT6+zstd-22 worked perfectly, dropping the artifact from 15.5MB to 12.75MB. BigramHash and SmearGate added 590K parameters with negligible overhead (step time rose from 95ms to 102ms). Those stay. LAWA gets fixed.

The meta

The interesting thing about Parameter Golf isn't the competition itself. It's the structure: frontier AI models training tiny AI models. Claude, running as a research team, reading papers about language model optimization, synthesizing findings from five parallel workstreams, and recommending architectural changes to make a language model better. Intelligence optimizing intelligence.

This is almost certainly what's been happening inside big labs for years. AI systems helping design the next generation of AI systems. The researchers and engineers there just haven't been narrating it publicly.

What's different now is that the compute is available on Akash for hourly rent, the models are capable enough to do real research, and the tooling to orchestrate them as a team exists. The meta-game of using AI to build AI isn't lab-exclusive anymore.

As of this evening, v6 sits at 1.1843 BPB. #2 on the leaderboard, 0.03 behind the leader. Nine runs total, four dead ends, one regression, and one clear winner. Deadline is April 30.
