Our Methodology
How VibeDex benchmarks and scores AI image generators — transparently and independently.
Last updated: February 2026
Benchmark Scope
200+
Test Prompts
Photorealism, illustration, typography, product shots, concept art, and edge cases
20
Models Benchmarked
All major providers — same prompts, same conditions, no cherry-picking
3,500+
Evaluations
Every model-prompt pair scored across multiple quality dimensions
How We Evaluate
Every generated image is evaluated by Gemini 3 Pro, our primary vision-language model (VLM) judge. We tested multiple VLMs — including Gemini 2.5 Pro, Claude Opus, and Claude Sonnet — before selecting Gemini 3 Pro for its consistency, scoring calibration, and ability to assess fine-grained visual quality across diverse styles.
Our prompt suite is designed to isolate specific quality dimensions. Some prompts target photorealistic accuracy, others stress-test text rendering, physical plausibility, or complex multi-subject compositions. Every model runs the same prompts under the same conditions — we generate a single image per model-prompt pair with no cherry-picking or re-rolling.
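The one-image-per-pair evaluation loop described above can be sketched as follows. The `generate` and `judge_image` helpers are hypothetical stand-ins for the provider APIs and the VLM judge, and the model names and prompt are invented examples:

```python
# Sketch of the evaluation loop: every model runs every prompt exactly
# once, with no re-rolling. generate() and judge_image() are hypothetical
# placeholders, not real VibeDex or provider APIs.

def run_benchmark(models, prompts, generate, judge_image):
    """Score each model-prompt pair from a single generation."""
    results = {}
    for model in models:
        for prompt in prompts:
            image = generate(model, prompt)            # one generation per pair
            results[(model, prompt)] = judge_image(image, prompt)
    return results

# Example with stub functions in place of real APIs:
scores = run_benchmark(
    models=["model-a", "model-b"],
    prompts=["a red cube on a glass table"],
    generate=lambda m, p: f"{m}:{p}",                  # placeholder image handle
    judge_image=lambda img, p: {"visual_fidelity": 7.5},
)
```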
What We Measure
Unlike single-score rankings, VibeDex evaluates models across four quality dimensions, each with granular sub-metrics. This means we can match you with the right model for your specific task.
Visual Fidelity
Overall image quality and visual appeal:
- Aesthetics — artistic quality, color harmony, visual impact
- Image Quality — sharpness, low noise, artifact-free rendering
- Composition — framing, balance, visual hierarchy
Physics & Logic
Realistic lighting, materials, gravity, and physical plausibility:
- Static Physics — gravity, support, spatial relationships
- Material Physics — textures, reflections, transparency
- Biomechanics — natural poses, joint articulation, movement
Subject & Object Integrity
Accurate anatomy, object coherence, and scene consistency:
- Human Subjects — anatomy, faces, hands, proportions
- Object Integrity — structural coherence, correct details
- Scene Logic — spatial relationships, context consistency
Instruction Adherence
How faithfully the output matches the prompt:
- Semantic Accuracy — correct subjects, actions, attributes
- Spatial Framing — camera angle, layout, positioning
- Text Rendering — accuracy and legibility of in-image text
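The taxonomy above can be written out as a simple mapping. The dimension and sub-metric names come from this page; the dictionary structure itself is only illustrative:

```python
# The four quality dimensions and their sub-metrics, as listed above.
# The dict layout is an illustration, not VibeDex's internal schema.
DIMENSIONS = {
    "Visual Fidelity": ["Aesthetics", "Image Quality", "Composition"],
    "Physics & Logic": ["Static Physics", "Material Physics", "Biomechanics"],
    "Subject & Object Integrity": ["Human Subjects", "Object Integrity", "Scene Logic"],
    "Instruction Adherence": ["Semantic Accuracy", "Spatial Framing", "Text Rendering"],
}
```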
Scoring Approach
Not all dimensions matter equally for every prompt. A product photography prompt demands high visual fidelity and physics accuracy, while a fantasy illustration prioritizes composition and subject integrity.
Our scoring engine analyzes each prompt to determine which quality dimensions are most important. The primary dimension is scored in depth across its sub-metrics, while the remaining dimensions receive holistic scores. The final score is a weighted combination across all four dimensions, tuned to what your specific prompt demands.
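The weighting scheme described above can be sketched as follows: the primary dimension's sub-metric scores are averaged, the other three dimensions each contribute one holistic score, and a prompt-dependent weight vector combines all four. Every weight and score below is a made-up illustration, not VibeDex's actual values:

```python
# Illustrative weighted-score combination. Weights and scores are
# invented examples; the real engine derives weights from the prompt.

def final_score(sub_scores, holistic, weights):
    """Combine a deep-scored primary dimension with holistic scores.

    sub_scores: sub-metric scores for the primary dimension
    holistic:   {dimension: score} for the remaining three dimensions
    weights:    {dimension: weight} over all four, summing to 1.0
    """
    # The primary dimension is the one scored via sub-metrics,
    # i.e. the one without a holistic score.
    primary_dim = next(d for d in weights if d not in holistic)
    primary = sum(sub_scores.values()) / len(sub_scores)
    total = weights[primary_dim] * primary
    total += sum(weights[d] * s for d, s in holistic.items())
    return round(total, 2)

# Product-photography prompt: Visual Fidelity weighted most heavily.
score = final_score(
    sub_scores={"Aesthetics": 8.0, "Image Quality": 9.0, "Composition": 7.0},
    holistic={"Physics & Logic": 8.5, "Subject & Object Integrity": 9.0,
              "Instruction Adherence": 7.5},
    weights={"Visual Fidelity": 0.4, "Physics & Logic": 0.3,
             "Subject & Object Integrity": 0.2, "Instruction Adherence": 0.1},
)
```

Here the primary dimension averages to 8.0, so the combined score is 0.4·8.0 + 0.3·8.5 + 0.2·9.0 + 0.1·7.5 = 8.3.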
Limitations
No benchmark is perfect. We believe in being transparent about ours:
- Automated scoring only — our evaluations are AI-judged. Human review validates trends but does not produce individual scores.
- English-focused prompt set — all evaluation prompts are currently in English. Multi-language support is planned.
- Single generation per pair — we generate one image per model-prompt combination. No cherry-picking, but also no variance sampling.
- Models update frequently — providers ship updates regularly. Our scores reflect performance at the evaluation date and are re-run periodically.
- Artistic subjectivity — style preference is inherently personal. Our scores measure technical quality, not taste.
Models Benchmarked
We currently benchmark 20 image generation models across all major providers. Models are re-evaluated as new versions are released.
| Model | Tier | Cost/Image |
|---|---|---|
| Flux Schnell | Budget | $0.0010 |
| Flux Dev | Budget | $0.0030 |
| Qwen Image 2512 | Budget | $0.0030 |
| Seedream 3.0 | Standard | $0.0180 |
| Grok Imagine Image | Standard | $0.0200 |
| Reve Image | Standard | $0.0240 |
| Seedream 4.0 | Standard | $0.0300 |
| Ideogram 2a | Standard | $0.0320 |
| FLUX.2 Pro | Standard | $0.0350 |
| Nano Banana | Standard | $0.0390 |
| FLUX 1.1 Pro | Standard | $0.0400 |
| Ideogram 3.0 | Standard | $0.0400 |
| Seedream 4.5 | Standard | $0.0400 |
| Kling Image O1 | Standard | $0.0400 |
| Nano Banana 2 | Premium | $0.0670 |
| FLUX.2 Max | Premium | $0.0700 |
| Hunyuan Image 3.0 | Premium | $0.0800 |
| Runway Gen-4 Image | Premium | $0.0800 |
| GPT Image 1.5 | Premium | $0.1330 |
| Nano Banana Pro | Premium | $0.1380 |
Find the best model for your prompt
VibeDex analyzes your prompt and recommends the best AI image model based on what your specific image demands.