bench — real LLM workloads, public results

Real workloads. Real APIs. Raw outputs published.

Most LLM rankings measure pairwise human preference on short prompts. Production teams need different signals: does the model return valid JSON under a schema, does it hallucinate field values, does it shut up when it should. This page runs fixed task suites against every model and shows you what came back. See also llm-rate for the use-case picker.

Run 1 — News headline → JSON

20 news headlines, each scored against an expected JSON object with four fields (company, ticker, event_type, sentiment). Grading is mechanical — JSON parses, fields match exactly. No judge model. Anyone with the dataset can reproduce these scores.

	Model	Status	Parse rate	Field acc	Total cost	Median latency
01	claude-haiku-4.5	ran	100%	87.5%	$0.0061	811ms
02	claude-opus-4.7	20 blocked	0%	—	—	2607ms
03	claude-opus-4.6	20 blocked	0%	—	—	2639ms
04	claude-sonnet-4.6	20 blocked	0%	—	—	2592ms

Why 3 models didn't run

This run hit the Anthropic API directly using the fleet's existing OAuth token. The same token is used by the production agents that run this orchestrator, so requests for premium models (Opus / Sonnet) competed with live traffic and lost to rate limiting. Haiku had spare quota and ran clean. Future runs will use OpenRouter or a separate research key to bypass this.

claude-opus-4.7 — 20 items blocked. Example: Anthropic 429: {"type":"error","error":{"type":"rate_limit_error","message":"Error"},"request_id":"req_011Cb5DznYvA1UeCAbC877ST"}
claude-opus-4.6 — 20 items blocked. Example: Anthropic 429: {"type":"error","error":{"type":"rate_limit_error","message":"Error"},"request_id":"req_011Cb5E4cuqmfgGanNAHijKY"}
claude-sonnet-4.6 — 20 items blocked. Example: Anthropic 429: {"type":"error","error":{"type":"rate_limit_error","message":"Error"},"request_id":"req_011Cb5E8WdvuXqvppzmwkYr1"}

Method

Every model received the same system prompt and the same headline. Temperature was not specified (newer Claude models reject the parameter). Up to 200 output tokens. Each request retried up to 8× on transient errors. Mechanical grading: case-insensitive equality per field, after stripping markdown code fences.

You are a strict JSON extraction tool. Given a single news headline, output exactly one JSON object on a single line. The schema is: {"company": string | null, "ticker": string | null, "event_type": one of ["earnings", "acquisition", "lawsuit", "product", "leadership", "other"], "sentiment": one of ["positive", "negative", "neutral"]}. Return JSON ONLY. No prose, no markdown fences, no commentary. If a value is unknown, use null for company/ticker; pick "other" for event_type and "neutral" for sentiment.

Dataset and full per-item outputs: raw JSON. Source: scripts/bench/run.ts.

What the model actually said

Top 5 items from the run that landed (haiku). The raw output is the model's exact reply; the expected JSON is the ground truth used for grading.

claude-haiku-4.5 (first 5 of 20 items)

n1: company✓ · ticker✓ · event_type✓ · sentiment✓

Got: {"company": "Apple", "ticker": "AAPL", "event_type": "earnings", "sentiment": "positive"}

n2: company✓ · ticker✓ · event_type✓ · sentiment✗

Got: {"company": "Microsoft", "ticker": "MSFT", "event_type": "acquisition", "sentiment": "positive"}

n3: company✓ · ticker✓ · event_type✓ · sentiment✓

Got: {"company": "Nvidia", "ticker": "NVDA", "event_type": "lawsuit", "sentiment": "negative"}

n4: company✓ · ticker✓ · event_type✓ · sentiment✓

Got: {"company": "Tesla", "ticker": "TSLA", "event_type": "product", "sentiment": "positive"}

n5: company✓ · ticker✓ · event_type✓ · sentiment✗

Got: {"company": "OpenAI", "ticker": null, "event_type": "leadership", "sentiment": "positive"}

What this is, and isn't

This is one run, one task, with shared-key rate limits cutting out three of four models. It's not "claude-haiku-4-5 is the best model for JSON extraction." It IS a working demonstration of the methodology: fixed task suite, mechanical grading, raw outputs published, costs measured. Future runs will cover RAG over a fixed corpus, function-call correctness, code refactoring with test verification, and structured extraction at higher complexity. Each will land here.