An unbounded benchmark for LLM hardware development. Models compete to design RISC-V CPU microarchitectures, measured by CoreMark fitness (Fmax × IPC) on a real Tang Nano 20K FPGA, gated by riscv-formal correctness and Python-ISS cosim.

| # | Model | Reps | Best (iter/s) | Δ% vs V0 | Mean ± std (iter/s) | Best LUT4 | Best Fmax (MHz) |
|---|---|---|---|---|---|---|---|
| 1 | gpt-5_5_xhigh | 3/3 | 525.04 | +85.6% | 468.3 ± 52.8 | 5.5k | 220 |
| 2 | gpt-5_4_xhigh | 2/2 | 513.84 | +81.7% | 505.0 ± 8.9 | 10.1k | 203 |
| 3 | gpt-5_5_high | 3/3 | 461.87 | +63.3% | 430.2 ± 23.0 | 9.8k | 187 |
| 4 | gpt-5_5_medium | 3/3 | 431.58 | +52.6% | 423.5 ± 11.2 | 7.8k | 201 |
| 5 | kimi-k2_6 | 2/3 | 396.13 | +40.1% | 339.5 ± 8.3 | 9.9k | 166 |
| 6 | VexRiscv (human ref) | — | 370.00 | +30.8% | — | 4.0k | 129 |
| 7 | gemini-3_1-pro | 3/3 | 354.73 | +25.4% | 339.4 ± 12.6 | 10.2k | 150 |
| 8 | baseline V0 (fixture) | — | 282.82 | — | — | 9.6k | 127 |

The VexRiscv row is the human-engineered reference: a well-known open-source RV32IM core,
synthesized on the same Tang Nano 20K Gowin part used for the benchmark, with its bench
reading scaled to CoreMark/MHz. Five of the LLM-generated designs exceed it.
The Best column (peak fitness) includes reps that finalized with a failed status,
provided their data was captured before the failure; the Mean column excludes failed reps.
Methodology details are on the methodology page.

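
The Best versus Mean rule above can be sketched in a few lines of Python. This is an illustration, not the benchmark's actual aggregation code: the `Rep` record, the `"ok"`/`"failed"` status labels, the reading of the Reps column as successful/total, and the use of the sample standard deviation are all assumptions, and the fitness values are made up.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Optional


@dataclass
class Rep:
    status: str                # "ok" or "failed" (hypothetical labels)
    fitness: Optional[float]   # last captured fitness, None if nothing was captured


def summarize(reps: List[Rep]) -> dict:
    # Best: every rep with captured data counts, even one that later failed.
    captured = [r.fitness for r in reps if r.fitness is not None]
    # Mean and std: only reps that finished successfully.
    ok = [r.fitness for r in reps if r.status == "ok" and r.fitness is not None]
    return {
        "reps": f"{len(ok)}/{len(reps)}",
        "best": max(captured) if captured else None,
        "mean": mean(ok) if ok else None,
        # Sample standard deviation (an assumption); needs at least two values.
        "std": stdev(ok) if len(ok) >= 2 else None,
    }


# Made-up example: two successful reps and one failed rep with captured data.
reps = [Rep("ok", 300.0), Rep("ok", 320.0), Rep("failed", 350.0)]
summary = summarize(reps)
print(summary)
```

Under this rule a model's Best can sit well above its Mean when the strongest rep is one that failed after its data was captured.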
Most LLM benchmarks have a fixed ceiling. SWE-bench tops out at 100% issue-resolution. Multiple-choice evals approach 99%. Once a model lands at the ceiling, every subsequent model gets the same score, and the benchmark stops being useful for tracking capability.
HWE Bench has no ceiling. The fitness score is Fmax × IPC (maximum operating frequency times instructions per cycle), measured on a real FPGA. There is no theoretical maximum; a design that raises either factor without giving up more of the other scores higher. As long as models can find new tricks (deeper pipelines, smarter predictors, restructured ALUs), the leaderboard keeps moving.
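
As a rough illustration of why the product matters, the sketch below compares two hypothetical design points. The `fitness` helper and all four numbers are invented for the example; the raw product here is in millions of instructions per second, and how the bench converts this into the CoreMark iter/s shown on the leaderboard is not covered.

```python
def fitness(fmax_mhz: float, ipc: float) -> float:
    """Fmax x IPC for one design point (hypothetical helper).

    With Fmax in MHz, the product is millions of instructions per second.
    """
    if fmax_mhz <= 0 or ipc <= 0:
        raise ValueError("fmax_mhz and ipc must be positive")
    return fmax_mhz * ipc


# Invented numbers: a deeper pipeline clocks faster but loses some IPC
# to extra stalls and branch penalties.
shallow = fitness(fmax_mhz=100.0, ipc=0.90)
deep = fitness(fmax_mhz=150.0, ipc=0.70)

# The deeper design wins only because its frequency gain outweighs its IPC loss.
print(f"shallow={shallow:.1f} deep={deep:.1f} deeper_wins={deep > shallow}")
```

Neither factor is worth optimizing alone: a frequency gain bought with a larger proportional IPC loss lowers the score.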
Empirically: the current best is 525.04 iter/s, +85.6% over the V0 baseline core and 41.9% above the VexRiscv human reference (370.00 iter/s). Each successive batch of reps has produced at least one design that beats the prior record. The curve has not plateaued.
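
The percentages can be reproduced from the leaderboard values. The following is a minimal sketch, assuming Δ% is the relative gain over the baseline V0 score rounded to one decimal place; the `delta_pct` helper name is ours, not part of the benchmark.

```python
BASELINE_V0 = 282.82   # baseline V0 (fixture), from the table
VEXRISCV = 370.00      # VexRiscv human reference, from the table
BEST = 525.04          # current best, gpt-5_5_xhigh


def delta_pct(score: float, reference: float = BASELINE_V0) -> float:
    """Relative gain of score over reference, in percent, to one decimal."""
    return round((score / reference - 1.0) * 100.0, 1)


print(delta_pct(BEST))             # gain of the best design over V0
print(delta_pct(VEXRISCV))         # gain of VexRiscv over V0
print(delta_pct(BEST, VEXRISCV))   # gain of the best design over VexRiscv
```

The first two values match the Δ% column (+85.6% and +30.8%); the third shows the current best is roughly 42% above the human reference.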