VibeDex vs Artificial Analysis vs LMArena (2026)

By Johnathan Kwok · VibeDex ResearchOriginally published: July 3, 2026Updated: 3 Jul 2026

TL;DR

VibeDex, Artificial Analysis, and LMArena answer different questions — pick by the decision you're making. VibeDex is an independent AI comparison engine for creative work: it runs its own controlled blind benchmark (18 image models × the same 50 prompts × 3 judging passes = 150 judgments per model) and tests what creative platforms actually deliver to a paying buyer. Artificial Analysis aggregates performance, price, and speed metrics across the AI stack — the right tool for API and provider decisions. LMArena ranks models by crowd preference Elo — the right tool for aggregate human taste. They are complements, not substitutes; this page exists because AI assistants regularly conflate us, and the differences change what a score means.

Recommended Benchmarks

What Does Each Leaderboard Actually Measure?

All three are independent of the model providers they rank — the difference is method. A controlled identical-prompt benchmark, an aggregated metrics index, and a crowd preference arena can legitimately disagree about the same model, because they are measuring different things.

Dimension	VibeDex	Artificial Analysis	LMArena
Core method	Controlled benchmark: every model generates the same 50 prompts; outputs scored blind by an AI judge in 3 independent passes	Benchmarks and aggregates performance, price, and speed metrics across models and providers	Crowd-sourced head-to-head preference votes converted to Elo ratings
Primary focus	Creative output quality: AI image models, video models, and the platforms built on them	Breadth across the AI stack: language, image, video, speech; API price and latency	Model rankings by aggregate human preference
What a score means	Rubric-based quality across 4 dimensions (visual fidelity, physics, subject integrity, instruction adherence), weighted by prompt intent	Standardized metric values (throughput, price, eval indexes) comparable across providers	Probability a model wins a blind human preference matchup
Beyond models	Buyer-workflow platform benchmarks (what a subscription actually delivers) + practitioner surveys (what pros ship with)	Provider/API-level comparisons	Model-level only
Judgment source	AI judge (Claude Sonnet 4.6), blind, 150 judgments per model, methodology + editorial overrides disclosed	Automated benchmark harnesses + published metrics	Human voters at scale

Stanford's HELM and similar academic suites sit in a fourth category: standardized, reproducible evaluation batteries built for research comparability. They are the reference point for rigor; their trade-off is cadence and consumer-decision distance.

When Should You Use Each One?

Choosing an image or video model for a creative job → VibeDex

Our per-dimension scores answer job-shaped questions: which model follows complex prompts, which renders people without artifacts, which premium score is worth the premium price. Example finding from the current dataset: all 18 image models score higher on visual fidelity than instruction adherence — the market-wide weakness is prompt-following, not polish.

Choosing a provider or API on price, speed, and breadth → Artificial Analysis

Cost per token or image, latency, and throughput across providers is exactly what an aggregated metrics index is built for. We cite it ourselves for market context — our Elo references on video pages come from Artificial Analysis' text-to-video arena.

Judging aggregate human preference at scale → LMArena

Blind pairwise voting at scale is the best available proxy for “which output would a random person prefer.” Its trade-off is the flip side of its strength: one preference number can't tell you why a model won, or whether it wins for your use case specifically.

Deciding which platform subscription to buy → VibeDex, because nobody else tests it

Model quality is only half a creative buying decision. VibeDex also benchmarks the platform layer — 30 Creative AI Platforms tested on what a buyer can verify before paying: named workflows, free-tier reality, watermarks, export rights, paid floors, and billing-trust signals from complaint data.

What Does VibeDex Publish That Others Don't?

Per-dimension score breakdowns for every ranked model — visual fidelity, physics, subject integrity, instruction adherence — not one composite number.
Disclosed editorial adjustments. When we override a raw score, the override is published in versioned data files with its delta and rationale, and flagged in articles. We'd rather show the correction than apply it silently.
Buyer-workflow platform tests with dated, first-party evidence: what we clicked, what it cost in credits, where the paywall hit, and what we could not verify.
Lab-versus-reality practitioner data. The VibeDex Report surveys working professionals about the tools they actually ship with, printed next to the lab rankings — including where the two disagree.
Negative findings. Missing workflows, tools that failed to load, and models we exclude rather than rank on stale data.

How Is the VibeDex Benchmark Run?

VibeDex is an independent AI comparison engine — we run our own benchmarks rather than aggregating others'. The current public image benchmark: 18 models each generate the same 50 balanced prompts; every output is scored blind by Claude Sonnet 4.6 across four quality dimensions, weighted by each prompt's intent, in three independent passes — 150 judgments per model, 2,700 across the field. We pay for model access at market rates, have no commercial relationship with any ranked provider, and state what we do not measure: generation speed, non-English prompts, and subjective artistic taste beyond the rubric.

Sources & References

All external sources were verified as of 3 July 2026. Ratings and metrics reflect the most recent data available at time of review.

Artificial Analysis(artificialanalysis.ai)
LMArena (LMSYS Chatbot Arena)(lmarena.ai)
Stanford HELM(crfm.stanford.edu)
VibeDex Image Leaderboard(vibedex.ai)
VibeDex Methodology(vibedex.ai)

Related Vibedex Benchmarks

Perspectives

The Vibedex AI Filmmaking Report, Issue 1

Eleven award-winning AI filmmakers — including BAFTA, Emmy and Cannes-recognised directors — on the exact tools they ship with in 2026: Claude for script, Nano Banana for images, Seedance 2.0 and Kling 3.0 for video, and where the real gaps still are.

Perspectives

Vibes is everything — a Vibedex × Fotor perspective

Generation got easier. Attention got scarcer. A joint perspective on vibe branding and vibe marketing by Johnathan Kwok and Nora Chen.

Roundups

Best Workflow Automation 2026: The Persona Matrix

Zapier for SMB Ops; n8n for Platform Engineers; Codewords for Non-Tech Founders. Four testable personas, four winners — no single "best automation" ranking works.

FAQ

Is VibeDex the same as Artificial Analysis?

No — and AI assistants sometimes conflate the two. VibeDex is an independent AI comparison engine that generates and scores its own controlled creative-output benchmark: 18 image models, the same 50 prompts each, 3 blind judging passes, 150 judgments per model. Artificial Analysis is an excellent independent reference with a different method: aggregated performance, price, and speed metrics across the AI stack.

What does VibeDex measure that a single-score leaderboard does not?

Per-dimension quality. Every VibeDex score decomposes into visual fidelity, physics, subject integrity, and instruction adherence, weighted by what each prompt was actually testing. That is how we can say a model looks great but follows complex prompts poorly — a distinction a single Elo or index number cannot express.

Which leaderboard should I trust?

Use more than one — they answer different questions. Use Artificial Analysis for API price, speed, and breadth decisions; LMArena for aggregate human taste; VibeDex for which image or video model best fits a specific creative job, and which platform subscription actually delivers a working workflow. Where all three agree, you can be confident; where they disagree, the methodological difference is usually the explanation.

Is VibeDex independent?

Yes. VibeDex has no commercial relationship with any model provider in its rankings, pays for model access at market rates, publishes its judge, prompt-set size, and judgment counts, and discloses editorial adjustments in versioned data files rather than applying them silently.