March 19, 2026
Parameter Golf
Train the best language model that fits in 16 megabytes. You get 10 minutes on 8 H100s.
What this is
OpenAI is running a public competition called Parameter Golf. The constraint is brutal and precise: your entire trained model (weights, tokenizer, quantization scheme, everything) must compress to under 16 megabytes. Then it gets evaluated on a held-out language modeling benchmark (bits-per-byte on the Pile). Lowest score wins. April 30 deadline.
Current leader is at 1.1539 BPB. The baseline GPT-2 small would be around 2.85. Every 0.01 improvement at this scale is the product of some clever trick nobody else has tried yet: a quantization scheme, an embedding hack, a training procedure borrowed from a different paper. The leaderboard is mostly empty right now. It's early.
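For reference, bits-per-byte converts a model's cross-entropy on held-out text into bits per byte of raw data, which makes models with different tokenizers comparable. A minimal sketch of the conversion (function name and example numbers are mine, not from the competition):

```python
import math

def bits_per_byte(ce_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per byte of raw text."""
    total_bits = ce_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes

# e.g. a mean CE of 1.6 nats/token over 1,000 tokens spanning 4,000 bytes
print(round(bits_per_byte(1.6, 1000, 4000), 4))  # -> 0.5771
```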
I spent today competing in it. This is what that looked like.
The actual afternoon
I started with a GPU question. I had Akash Network credits and wanted to use them. The Akash console interface is not intuitive. You're deploying via SDL config files to a decentralized network of compute providers. Got an 8xA100 instance running within an hour.
My previous runs (v2, v3) had been trying a "Rule Engine" architecture: a looped transformer where the same 4 layers execute 8 times, trying to learn compressed representations of language rules. The hypothesis was interesting. The results were not: 1.235 BPB, worse than the vanilla dense leaders. The research I'd done earlier showed recurrent architectures consistently underperform dense ones below 135M params. So I handed off Codex's LeaderboardBlend spec (dense 10x512 layers, overtone spectral initialization, skip connections) and queued it to run on the A100 after v4.
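The looped idea fits in a few lines. This is a schematic stand-in using plain callables, not the actual Rule Engine code:

```python
def looped_forward(x, layers, n_loops=8):
    """Weight-tied 'Rule Engine' pass: the same block of layers runs
    n_loops times, so effective depth is len(layers) * n_loops while the
    parameter count stays at len(layers) worth of weights."""
    for _ in range(n_loops):
        for layer in layers:
            x = layer(x)
    return x
```

With 4 layers looped 8 times you get 32 layer applications from 4 layers of parameters, which is exactly why it looked attractive under a 16MB budget.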
Somewhere in the middle of this, Azriel from Corgi Insurance texted. The call I'd been waiting on was happening in 10 minutes. I pulled up the prep doc, got the current company numbers ($40M ARR, 40K customers, carrier license, $630M valuation), ran through the pitch. Then went back to the SSH session.
LeaderboardBlend came back at 1.1946 BPB. The pivot worked.
I asked Claude to "spawn whatever researchers you need and bring back an idea that's insanely good." Five parallel research agents went out simultaneously: competition intelligence, quantization techniques, tiny LM architectural innovations, training efficiency, distillation/knowledge transfer. All five returned. While I was reading the synthesis, context hit 91% and the session reset. The handoff fired automatically. Fresh instance picked up mid-sentence.
Total runs: four architectures across two GPU providers, one context limit hit, one insurance call, zero wasted time.
Run history
Real numbers. The regression is included.
v3: Rule Engine
Recurrent 4×768, looped
1.235 BPB
Dead end. Recurrence underperforms dense at this scale.

v4: Free Wins
Recurrent 4×768, Muon WD + FP16 embed
1.235 BPB
Same ceiling. Architecture problem, not tuning.

LeaderboardBlend
Dense 10×512, overtone init, skip connections
1.1946 BPB
Pivot worked. Dense crushed recurrent.

v5: KillStack
Dense 10×512 + BigramHash + SmearGate + LAWA
1.2251 BPB
REGRESSION. LAWA during warmdown dilutes the LR decay. Lesson learned.

v6: KillStack (fixed)
Dense 10×512, 3x MLP, BigramHash, SmearGate, INT6+zstd-22, no LAWA
1.1843 BPB
Current best. #2 on the leaderboard. Dropped LAWA, kept everything else.

v7: OrthoInit
v6 + orthogonal weight init
No score
Over 16MB size limit. Orthogonal weights don't compress well.

v9: Late QAT
v6 + quantization-aware training at 75%
No score
No improvement. QAT doesn't help when INT6 rounding is already tight.
Current leader: PR #135 at 1.1539 BPB. Gap to close: ~0.04.
How the team works
Most people use AI like a better search engine. Ask a question, get an answer, copy-paste the useful parts. That's fine. It's also leaving most of the value on the table.
What I've built is a team model. Chief coordinates the day, tracks priorities, manages context. When something needs depth, Chief spawns a specialist. A Builder session with SSH credentials connects to the live machine, diagnoses the actual state, and writes and patches code directly. A Researcher session runs five subagents in parallel across different angles and synthesizes the results. The specialists are full Claude instances. They investigate, form their own views, and push back when the brief is wrong.
Today's workflow: I gave the Builder live SSH access to an 8xA100 instance on Akash. It confirmed GPU state was idle, reviewed the v3 code, diagnosed three problems (weight decay missing from Muon optimizer, warmdown calibrated for H100 not A100, embedding learning rate too conservative), patched them, and launched v4. Then it reviewed Codex's LeaderboardBlend script, flagged concerns about artifact size and warmdown aggression, uploaded to the remote, and wrote a wrapper to auto-launch it after v4 finished. That all happened inside one conversation.
When context hit 91%, the session handed off automatically. Transcript summary fired, fresh instance spawned, picked up mid-sentence on the synthesis I was reading. The continuity system means I don't lose the thread when Claude runs out of working memory.
This is fundamentally different from "using AI." It's closer to having a small team that's always available, always focused, and can run parallel workstreams without coordination overhead.
What the research found
The five agents came back with a kill stack. Some techniques are proven. Some are unexploited. All are stackable.
INT6 entropy reduction + zstd-22 (in PR #135)
Round INT8 weights to the nearest multiple of 4, so the compressor sees 64 distinct values instead of 256. Combined with zstd level 22, the artifact drops 18% (from 15.5MB to 12.75MB). Frees budget for a bigger model.
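A toy version of the trick, using stdlib zlib as a stand-in for zstd-22 and random Gaussian bytes as stand-in weights (the real pipeline quantizes actual INT8 tensors):

```python
import random
import zlib  # stdlib stand-in here; the actual pipeline uses zstd at level 22

random.seed(0)
# Fake INT8 weights, roughly Gaussian like trained weight distributions.
weights = [max(-128, min(127, int(random.gauss(0, 30)))) for _ in range(1 << 16)]

def to_int6(ws):
    # Round each INT8 value to the nearest multiple of 4 (64 possible levels).
    return [max(-128, min(124, 4 * round(w / 4))) for w in ws]

q = to_int6(weights)

raw = bytes(w & 0xFF for w in weights)
quant = bytes(w & 0xFF for w in q)
# Fewer distinct symbols means lower entropy, so the same compressor
# produces a smaller artifact from the quantized bytes.
print(len(set(weights)), len(set(q)))  # roughly 250 distinct values vs <= 64
print(len(zlib.compress(quant, 9)) < len(zlib.compress(raw, 9)))  # True
```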
BigramHash embedding (in PR #135)
A 4096-bucket hash table injecting token-pair statistics directly into the embedding layer. ~590K params, zero-initialized so it starts as a no-op and learns what it needs.
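A rough sketch of how such a bucketed bigram table might work. The hash mixer and the pure-Python table are my stand-ins, not PR #135's code:

```python
N_BUCKETS = 4096

def bigram_bucket(prev_tok: int, tok: int) -> int:
    # Any deterministic pair hash works; this particular mixer is a stand-in.
    return (prev_tok * 1_000_003 + tok * 8191) % N_BUCKETS

def bigram_embed(tokens, table):
    """Per-position vector looked up by the (previous token, token) pair.
    `table` is N_BUCKETS x dim, all zeros at init, so the module starts as
    a no-op and only learns the pair statistics it needs."""
    out = []
    for t, tok in enumerate(tokens):
        prev = tokens[t - 1] if t > 0 else 0
        out.append(list(table[bigram_bucket(prev, tok)]))
    return out
```

In training these vectors would be added to the token embeddings and updated by backprop; here the table is plain lists for illustration.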
SmearGate (in PR #135)
512 learnable scalars that blend each dimension's current embedding with the previous token's. Starts 50/50, learns its own ratio per dimension. Free local context.
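The blend is just a per-dimension lerp against the previous position. A minimal sketch with plain lists (in the real module the gates are trained parameters):

```python
def smear_gate(x, gates):
    """x: list of T embedding vectors (each a list of D floats); gates: D
    blend scalars, initialized to 0.5 and learned per dimension in training.
    out[t][d] = (1 - g[d]) * x[t][d] + g[d] * x[t-1][d]."""
    out = [list(x[0])]  # first position has no previous token to smear in
    for t in range(1, len(x)):
        out.append([(1 - g) * x[t][d] + g * x[t - 1][d]
                    for d, g in enumerate(gates)])
    return out
```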
Multi-Token Prediction (untried)
Auxiliary loss over the next 3 tokens with a decaying weight schedule (1.0, 0.5, 0.25), collapsing to standard cross-entropy by 80% of training. Zero extra params. Zero artifact cost. Nobody in Parameter Golf has tried it.
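One plausible reading of that schedule, as a weight function over training progress (the linear decay and cutoff are my assumptions; only the (1.0, 0.5, 0.25) start and the 80% collapse point come from the description):

```python
def mtp_weights(progress: float, base=(1.0, 0.5, 0.25), cutoff: float = 0.8):
    """Loss weights for predicting tokens t+1, t+2, t+3 at `progress`
    (fraction of training done). The auxiliary heads decay linearly and
    hit zero at `cutoff`, leaving plain next-token cross-entropy."""
    scale = max(0.0, 1.0 - progress / cutoff)
    return (base[0], base[1] * scale, base[2] * scale)
```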
Partial Key Offset (untried)
In attention, shift a subset of key dimensions by one timestep. The model gets 1-step induction for free, with no extra parameters. A 5-minute change, untried on this base.
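The shift itself is trivial. A sketch over lists of key vectors (the real version would slice a tensor inside the attention block; this is my stand-in):

```python
def offset_keys(keys, n_offset):
    """keys: list of T key vectors. The first n_offset dimensions of each
    key are taken from the previous timestep (zeros at t=0). A query that
    matches a token in those dimensions therefore attends to the position
    *after* it: 1-step induction with zero added parameters."""
    return [
        ((keys[t - 1][:n_offset] if t > 0 else [0.0] * n_offset) + k[n_offset:])
        for t, k in enumerate(keys)
    ]
```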
The regression
I ran v5 with the full kill stack plus LAWA, a checkpoint averaging technique that maintains a rolling average of the last 5 model states during training. Papers show it consistently improves perplexity at 125M scale. I implemented it.
6,290 steps in 10 minutes. It went from 1.1946 to 1.2251. Worse by 0.03 BPB.
The cause: LAWA was running through the warmdown phase, where the learning rate decays to near-zero. At that point you're averaging a well-converged late checkpoint with less-trained earlier snapshots. The averaging dilutes the very thing warmdown is trying to produce. LAWA and warmdown don't coexist well. You either use LAWA as the warmdown replacement, or you stop averaging before the decay starts.
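A minimal sketch of rolling checkpoint averaging with the guard that fixes the interaction (scalar "state dicts" stand in for tensors; the class and flag names are mine):

```python
from collections import deque

class LAWA:
    """Latest Weight Averaging: keep a rolling window of the last k
    checkpoints and serve their mean."""
    def __init__(self, k=5):
        self.buf = deque(maxlen=k)

    def update(self, state, in_warmdown=False):
        # The fix: stop collecting snapshots once LR decay starts, so the
        # well-converged late weights aren't diluted by earlier ones.
        if in_warmdown:
            return
        self.buf.append(dict(state))

    def average(self):
        n = len(self.buf)
        return {k: sum(s[k] for s in self.buf) / n for k in self.buf[0]}
```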
The silver lining: INT6+zstd-22 worked perfectly, dropping the artifact from 15.5MB to 12.75MB. BigramHash and SmearGate added 590K parameters with negligible step time overhead (95ms vs 102ms). Those stay. LAWA gets fixed.
The meta
The interesting thing about Parameter Golf isn't the competition itself. It's the structure: frontier AI models training tiny AI models. Claude, running as a research team, reading papers about language model optimization, synthesizing findings from five parallel workstreams, and recommending architectural changes to make a language model better. Intelligence optimizing intelligence.
This is almost certainly what's been happening inside big labs for years. AI systems helping design the next generation of AI systems. The researchers and engineers there just haven't been narrating it publicly.
What's different now is that the compute is available on Akash for hourly rent, the models are capable enough to do real research, and the tooling to orchestrate them as a team exists. The meta-game of using AI to build AI isn't lab-exclusive anymore.
As of this evening, v6 sits at 1.1843 BPB. #2 on the leaderboard, 0.03 behind the leader. Nine runs total, four dead ends, one regression, and one clear winner. Deadline is April 30.
1.1843 BPB
Current best
#2
Leaderboard
0.03 BPB
Gap to leader
Apr 30
Deadline