Run 1 · Friday, 15 May 2026
Real workloads. Real APIs. Raw outputs published.
Most LLM rankings measure pairwise human preference on short prompts. Production teams need different signals: does the model return valid JSON under a schema, does it hallucinate field values, does it shut up when it should. This page runs fixed task suites against every model and shows you what came back. See also llm-rate for the use-case picker.
20 news headlines, each scored against an expected JSON object with four fields (company, ticker, event_type, sentiment). Grading is mechanical — JSON parses, fields match exactly. No judge model. Anyone with the dataset can reproduce these scores.
| Model | Status | Parse rate | Field acc | Total cost | Median latency | |
|---|---|---|---|---|---|---|
| 01 | claude-haiku-4.5 | ran | 100% | 87.5% | $0.0061 | 811ms |
| 02 | claude-opus-4.7 | 20 blocked | 0% | — | — | 2607ms |
| 03 | claude-opus-4.6 | 20 blocked | 0% | — | — | 2639ms |
| 04 | claude-sonnet-4.6 | 20 blocked | 0% | — | — | 2592ms |
This run hit the Anthropic API directly using the fleet's existing OAuth token. The same token is used by the production agents that run this orchestrator, so requests for premium models (Opus / Sonnet) competed with live traffic and lost to rate limiting. Haiku had spare quota and ran clean. Future runs will use OpenRouter or a separate research key to bypass this.
Anthropic 429: {"type":"error","error":{"type":"rate_limit_error","message":"Error"},"request_id":"req_011Cb5DznYvA1UeCAbC877ST"}Anthropic 429: {"type":"error","error":{"type":"rate_limit_error","message":"Error"},"request_id":"req_011Cb5E4cuqmfgGanNAHijKY"}Anthropic 429: {"type":"error","error":{"type":"rate_limit_error","message":"Error"},"request_id":"req_011Cb5E8WdvuXqvppzmwkYr1"}Every model received the same system prompt and the same headline. Temperature was not specified (newer Claude models reject the parameter). Up to 200 output tokens. Each request retried up to 8× on transient errors. Mechanical grading: case-insensitive equality per field, after stripping markdown code fences.
Dataset and full per-item outputs: raw JSON. Source: scripts/bench/run.ts.
Top 5 items from the run that landed (haiku). The raw output is the model's exact reply; the expected JSON is the ground truth used for grading.
n1: company✓ · ticker✓ · event_type✓ · sentiment✓
Got: {"company": "Apple", "ticker": "AAPL", "event_type": "earnings", "sentiment": "positive"}
n2: company✓ · ticker✓ · event_type✓ · sentiment✗
Got: {"company": "Microsoft", "ticker": "MSFT", "event_type": "acquisition", "sentiment": "positive"}
n3: company✓ · ticker✓ · event_type✓ · sentiment✓
Got: {"company": "Nvidia", "ticker": "NVDA", "event_type": "lawsuit", "sentiment": "negative"}
n4: company✓ · ticker✓ · event_type✓ · sentiment✓
Got: {"company": "Tesla", "ticker": "TSLA", "event_type": "product", "sentiment": "positive"}
n5: company✓ · ticker✓ · event_type✓ · sentiment✗
Got: {"company": "OpenAI", "ticker": null, "event_type": "leadership", "sentiment": "positive"}
This is one run, one task, with shared-key rate limits cutting out three of four models. It's not "claude-haiku-4-5 is the best model for JSON extraction." It IS a working demonstration of the methodology: fixed task suite, mechanical grading, raw outputs published, costs measured. Future runs will cover RAG over a fixed corpus, function-call correctness, code refactoring with test verification, and structured extraction at higher complexity. Each will land here.