Every frontier model, ranked live
Capability scores for 47 AI models, drawn from LMSYS Arena, MMLU, HumanEval, and other benchmarks and combined into our weighted composite. 3 releases this week.
UPDATED · APR 17 · 10:00 UTC
Scoring methodology →

| # | Model | Org | Score | 24h | 7d | Org value | Queries/day | Category |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude 3.7 Sonnet (NEW) | Anthropic | 94.2 | 2.1% | 5.8% | $8.5B | 142K | Frontier |
| 2 | GPT-4o | OpenAI | 91.8 | 0.3% | 1.2% | $157B | 891K | Frontier |
| 3 | Gemini 2.0 Ultra | Google DeepMind | 89.3 | 1.7% | 4.1% | $1.8T | 324K | Frontier |
| 4 | Grok 3 | xAI | 86.1 | 4.2% | 9.3% | $24B | 67K | Frontier |
| 5 | Llama 4 Scout (NEW) | Meta AI | 81.4 | 6.8% | 14.2% | $1.4T | 2.1M | Open Source |
| 6 | Mistral Large 3 | Mistral AI | 79.2 | 3.1% | 6.7% | $1.1B | 89K | Open Source |
| 7 | o4-mini (NEW) | OpenAI | 78.1 | 8.4% | 22.1% | $157B | 234K | Reasoning |
| 8 | Claude 3.7 Haiku | Anthropic | 76.4 | 1.1% | 3.2% | $8.5B | 412K | Frontier |
| 9 | Gemini 2.0 Flash | Google DeepMind | 74.8 | 0.4% | 0.9% | $1.8T | 1.1M | Frontier |
| 10 | DeepSeek V3 | DeepSeek | 73.1 | 0.6% | 2.8% | $8B | 98K | Open Source |
| 11 | Qwen 3 72B (NEW) | Alibaba | 71.9 | 4.3% | 11.2% | $210B | 124K | Open Source |
| 12 | o3-pro | OpenAI | 70.6 | 1.2% | 0.8% | $157B | 34K | Reasoning |
| 13 | Command R+ | Cohere | 68.4 | 0.8% | 2.1% | $5.5B | 18K | Enterprise |
| 14 | Phi-4 | Microsoft | 66.2 | 0.3% | 1.4% | $3.1T | 42K | Open Source |
| 15 | Claude 3.5 Sonnet | Anthropic | 64.8 | 2.1% | 5.4% | $8.5B | 67K | Frontier |

Methodology
How the capability score is computed
Weighted composite of LMSYS Arena Elo (40%), MMLU (20%), HumanEval (15%), GPQA (10%), ARC-AGI (10%), and community benchmark reports (5%). Raw scores are min-max normalized to 0–100 across frontier and open-source tiers. Updated every 6 hours via cron.
- 24h movement reflects the Elo delta since the previous day’s snapshot.
- Queries/day is estimated from public API telemetry and partner data.
- Org value uses the latest known private or public valuation.
- Phase 1 uses mock data; real feeds go live in Phase 2.
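
The weighted composite described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the site's actual pipeline: the weights come from the methodology text, but the benchmark keys (`arena_elo`, `community`, and so on), the function names, and the sample models are hypothetical. It also assumes that each benchmark is min-max normalized on its own across all models (both tiers pooled) before weighting, which the methodology does not spell out.

```python
from typing import Dict, List

# Weights as stated in the methodology; the key names are hypothetical.
WEIGHTS: Dict[str, float] = {
    "arena_elo": 0.40,
    "mmlu": 0.20,
    "humaneval": 0.15,
    "gpqa": 0.10,
    "arc_agi": 0.10,
    "community": 0.05,
}


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: when every model ties on a benchmark, it contributes 0.
        return [0.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def composite_scores(raw: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return a 0-100 composite per model from raw per-benchmark scores."""
    models = list(raw)
    totals = {m: 0.0 for m in models}
    for bench, weight in WEIGHTS.items():
        # Assumption: normalize each benchmark across all models, then weight.
        normalized = min_max_normalize([raw[m][bench] for m in models])
        for model, value in zip(models, normalized):
            totals[model] += weight * value
    return {m: round(total, 1) for m, total in totals.items()}


# Made-up inputs for three hypothetical models (not real benchmark results).
example = {
    "model_a": {"arena_elo": 1300, "mmlu": 90, "humaneval": 90,
                "gpqa": 60, "arc_agi": 40, "community": 80},
    "model_b": {"arena_elo": 1250, "mmlu": 80, "humaneval": 80,
                "gpqa": 50, "arc_agi": 30, "community": 70},
    "model_c": {"arena_elo": 1200, "mmlu": 70, "humaneval": 70,
                "gpqa": 40, "arc_agi": 20, "community": 60},
}

scores = composite_scores(example)
print(scores)
```

Under these assumptions a model reaches 100 only if it leads every benchmark, and a model that trails on every benchmark scores 0. Normalizing the composite after weighting instead would be an equally plausible reading of the methodology and would give different numbers.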