HWE Bench · RISC-V CPU design benchmark for LLMs

An unbounded benchmark for LLM hardware engineering. Large language models design RISC-V CPUs from scratch. Every design must first pass a full battery of formal correctness proofs, so buggy CPUs are thrown out. The ones that survive are then scored by how fast they would actually run on a physical FPGA.

Thesis: SWE-bench tops out at 100%. HWE Bench doesn't have a top.
The fitness number reflects an actual microarchitecture, and microarchitecture has room to improve for as long as models keep finding better designs.

Model release date × peak HWE score

Each point is one model configuration's best completed HWE Bench rep; reasoning-effort variants share their underlying model family's public release date. The dashed fit is descriptive, not a forecast. Release dates come from the OpenAI API / Codex notes, Gemini API changelog, and Kimi K2.6 announcement.

Score × Area

Vertical axis: CoreMark fitness (how fast the CPU runs the benchmark). Horizontal axis: chip area (LUT4 count, i.e. how many 4-input lookup tables the design occupies on the FPGA). One point per model, taken from its best run. VexRiscv (3,957 LUT4 · fitness 370) is the human-engineered reference. Up and to the left is the goal: a faster chip that is also a smaller chip.
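The "up and to the left" goal can be read as Pareto dominance: one design dominates another if it is no larger and no slower, and strictly better on at least one axis. The sketch below is illustrative only. The `Design` type and both function names are hypothetical, the three model areas are the rounded leaderboard values, and the VexRiscv point uses the figures quoted in the caption above.

```python
from typing import NamedTuple


class Design(NamedTuple):
    name: str
    area_lut4: float   # smaller is better
    fitness: float     # CoreMark fitness, larger is better


def dominates(a: Design, b: Design) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.area_lut4 <= b.area_lut4 and a.fitness >= b.fitness
    strictly_better = a.area_lut4 < b.area_lut4 or a.fitness > b.fitness
    return no_worse and strictly_better


def pareto_front(designs: list[Design]) -> list[Design]:
    """Designs that no other design dominates, sorted by area."""
    front = [d for d in designs if not any(dominates(o, d) for o in designs)]
    return sorted(front, key=lambda d: d.area_lut4)


# A few points from the page; areas are rounded (e.g. "5.5k" -> 5500).
designs = [
    Design("gpt-5_5_xhigh", 5500, 525.04),
    Design("gpt-5_6-terra", 10500, 515.70),
    Design("gpt-5_5_medium", 7800, 431.58),
    Design("VexRiscv", 3957, 370.00),
]

front_names = [d.name for d in pareto_front(designs)]
print(front_names)
```

On these four points, only the smallest design and the fastest design remain on the front; the other two are both larger and slower than gpt-5_5_xhigh.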

Peak fitness per model

Best of N=3 reps per model · 31 reps total · VexRiscv human reference in red · baseline V0 in italic
#	Model	Reps	Best (iter/s)	Δ% vs V0	Mean ± std	Area (LUT4)	Fmax (MHz)
1	gpt-5_5_xhigh	3/3	525.04	+85.6%	468.3 ± 52.8	5.5k	220
2	gpt-5_6-terra	3/3	515.70	+82.3%	442.3 ± 55.8	10.5k	209
3	gpt-5_4_xhigh	3/3	513.84	+81.7%	485.8 ± 28.1	10.1k	203
4	gpt-5_6-luna	3/3	480.90	+70.0%	452.0 ± 29.0	10.1k	209
5	gpt-5_6-sol	1/3	470.80	+66.5%	470.8	10.2k	200
6	gpt-5_5_high	3/3	461.87	+63.3%	430.2 ± 23.0	9.8k	187
7	gpt-5_5_medium	3/3	431.58	+52.6%	423.5 ± 11.2	7.8k	201
8	kimi-k2_6	2/3	396.13	+40.1%	339.5 ± 8.3	9.9k	166
9	gpt-5_4-mini	3/3	395.53	+39.9%	362.3 ± 23.7	10.2k	187
10	VexRiscv (human ref)	n/a	370.00	+30.8%	n/a	3.4k	144
11	gemini-3_5-flash	0/1	359.04	+27.0%	n/a	13.8k	125
12	gemini-3_1-pro	3/3	354.73	+25.4%	339.4 ± 12.6	10.2k	150
13	baseline V0 (fixture)	n/a	282.82	n/a	n/a	9.6k	127

The VexRiscv row is the human-engineered reference: a well-known open-source RV32IM CPU, synthesized on the same FPGA used for the benchmark. Nine of the eleven LLM configurations produced a best design that beats it. See the methodology page for the full procedure.
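The Δ% column is consistent with each best score being expressed relative to the baseline V0 fitness of 282.82. A minimal sketch of that arithmetic, using the best scores from the table (the function and variable names are hypothetical):

```python
BASELINE_V0 = 282.82   # baseline V0 (fixture) fitness from the table
VEXRISCV = 370.00      # human reference fitness from the table

# Best fitness per LLM configuration, copied from the leaderboard.
best = {
    "gpt-5_5_xhigh": 525.04,
    "gpt-5_6-terra": 515.70,
    "gpt-5_4_xhigh": 513.84,
    "gpt-5_6-luna": 480.90,
    "gpt-5_6-sol": 470.80,
    "gpt-5_5_high": 461.87,
    "gpt-5_5_medium": 431.58,
    "kimi-k2_6": 396.13,
    "gpt-5_4-mini": 395.53,
    "gemini-3_5-flash": 359.04,
    "gemini-3_1-pro": 354.73,
}


def delta_percent(fitness: float, baseline: float = BASELINE_V0) -> float:
    """Percentage improvement of `fitness` over `baseline`."""
    return (fitness / baseline - 1.0) * 100.0


for model, fitness in best.items():
    print(f"{model:18s} {fitness:7.2f}  {delta_percent(fitness):+.1f}%")

# Configurations whose best design outscores the human reference.
beats_reference = [m for m, f in best.items() if f > VEXRISCV]
print(len(beats_reference), "configurations beat VexRiscv")
```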

SWE-bench saturates. HWE Bench doesn't.

Most LLM benchmarks have a fixed ceiling. SWE-bench tops out at 100% issue-resolution. Multiple-choice evals approach 99%. Once a model lands at the ceiling, every subsequent model gets the same score, and the benchmark stops being useful for tracking capability.

HWE Bench has no ceiling. Fitness is the CPU's actual speed running CoreMark on a real FPGA: operating frequency times instructions per cycle (Fmax × IPC for the technically inclined). There's no theoretical maximum, so a better microarchitecture can always score higher. As long as models keep finding new tricks (deeper pipelines, smarter branch predictors, restructured ALUs), the leaderboard keeps moving.
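To make the Fmax × IPC trade-off concrete, here is a small sketch with made-up numbers. The two cores and their figures are hypothetical and not taken from any benchmark run. The sketch only shows why a deeper pipeline can win even when it costs some IPC. It uses Fmax × IPC as a throughput proxy and does not model how the benchmark converts that into CoreMark iterations per second.

```python
def throughput_proxy(fmax_mhz: float, ipc: float) -> float:
    """Instructions retired per microsecond: clock rate (MHz) times instructions per cycle."""
    return fmax_mhz * ipc


def relative_speed(new: tuple[float, float], old: tuple[float, float]) -> float:
    """Speed of `new` relative to `old`, each given as (fmax_mhz, ipc)."""
    return throughput_proxy(*new) / throughput_proxy(*old)


# Hypothetical cores, for illustration only.
shallow_pipeline = (100.0, 0.50)   # lower clock, fewer stalls per instruction
deep_pipeline = (150.0, 0.40)      # higher clock, more stalls (lower IPC)

speedup = relative_speed(deep_pipeline, shallow_pipeline)
print(f"deep vs shallow: {speedup:.2f}x")
```

In this example the deeper pipeline loses 20% of its IPC but gains 50% in clock rate, so it comes out roughly 1.2x faster overall. The same product also penalizes designs that chase clock rate while stalling too often.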

Empirically, the current best is 525.04 iter/s: +85.6% over the V0 baseline core and well clear of the VexRiscv human reference at 370. Within current budgets, the curve has not saturated.

Fitness over rounds, best rep per model

Running max of CoreMark fitness across the 15 hypothesis rounds for each model's best-performing rep. Lines step up when a winning hypothesis lands and stay flat otherwise. VexRiscv's human-reference fitness is the red dashed line; the baseline V0 core is the gray dashed line.
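The stepped curve described above is a running maximum. A minimal sketch follows, assuming each round yields either a fitness value or nothing at all when its hypothesis does not produce a usable design. The `None` convention, the function name, and the five per-round values are hypothetical; only the baseline figure comes from the leaderboard.

```python
from typing import Optional


def running_max(per_round: list[Optional[float]], baseline: float) -> list[float]:
    """Best fitness seen so far after each round, starting from the baseline."""
    best = baseline
    curve = []
    for fitness in per_round:
        # A round only moves the curve if it produced a score above the current best.
        if fitness is not None and fitness > best:
            best = fitness
        curve.append(best)
    return curve


BASELINE_V0 = 282.82  # baseline V0 fitness from the leaderboard

# Hypothetical rounds (a real rep has 15): None marks a round with no usable result.
rounds = [None, 301.0, 295.5, None, 340.2]

curve = running_max(rounds, BASELINE_V0)
print(curve)  # [282.82, 301.0, 301.0, 301.0, 340.2]
```

The curve never decreases: a round that regresses or fails leaves the line flat, matching the "step up or stay flat" behavior of the plot.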