How many tokens are in an image?
The short answer
It depends entirely on the provider, and the answer is rarely what beginners expect.
- GPT-4o / GPT-5 vision: Each image is broken into 512×512 tiles. Each tile = 170 tokens, plus a fixed 85-token "base" token for low-detail mode, or scaled per resolution for high-detail. A 1024×1024 image = 765 tokens in high-detail mode (85 base + 4 tiles × 170 = 765).
- Claude (Opus/Sonnet/Haiku): Approximately
(width × height) / 750tokens. A 1024×1024 image = ~1,400 tokens. Claude scales smoothly with resolution rather than tiling. - Gemini 2.5 / 3.x vision: Roughly 258 tokens per image regardless of resolution, up to a cap. Larger images may be reduced to fit the per-call limit.
So the same 1024×1024 image costs you 765 tokens on GPT-4o, ~1,400 on Claude, and 258 on Gemini. Three providers, three completely different tokenizations of the same pixels.
OpenAI: the tile model
OpenAI's vision pricing has two modes:
Low detail. Image is downsampled to 512×512 and costs a flat 85 tokens regardless of input size. Use this when you only need to identify what's in an image, not read text or fine details.
High detail. Image is divided into 512×512 tiles after being scaled to fit within a 2048×2048 box, then each tile costs 170 tokens plus the 85-token base.
The math:
| Original size | Tiles in high-detail | High-detail tokens | Low-detail tokens |
|---|---|---|---|
| 512×512 | 1 | 85 + 170 = 255 | 85 |
| 1024×1024 | 4 | 85 + 680 = 765 | 85 |
| 1536×1536 | 9 | 85 + 1530 = 1,615 | 85 |
| 2048×2048 | 16 | 85 + 2720 = 2,805 | 85 |
| 4096×4096 (downscaled to 2048×2048) | 16 | 85 + 2720 = 2,805 | 85 |
Anything beyond 2048×2048 gets downscaled before tiling, so very large images cap at the 2,805-token mark. For batch processing where you don't need fine details, low-detail is 30× cheaper per image for full-resolution inputs.
Anthropic Claude: pixel-area scaling
Claude doesn't tile. Instead, image token count scales smoothly with pixel area: roughly (width × height) / 750, then rounded to the nearest token.
| Image size | Approximate tokens |
|---|---|
| 512×512 | ~350 |
| 768×768 | ~790 |
| 1024×1024 | ~1,400 |
| 1568×1568 (Claude's recommended max) | ~3,280 |
Claude documentation recommends a maximum of ~1.15 megapixels per image to keep token costs reasonable and processing fast. Larger images are accepted but get reduced to fit.
Multiple images compound, five 1024×1024 images = ~7,000 tokens, which on Opus 4.8 input pricing ($5/M) costs $0.035 in image tokens alone, before the text prompt or output.
Google Gemini: flat per-image
Gemini's vision tokenization is the most opaque of the three. Public documentation puts it at roughly 258 tokens per image for standard-size inputs (under ~3.4MP). Beyond that, the image is broken into tiles of 768×768, each consuming roughly the same token budget.
Gemini's approach makes per-image cost predictable and very low for typical web/photo resolutions, but harder to predict for high-resolution scientific imagery or scans.
Why this matters for cost
Take a real workflow: extracting structured data from 1,000 product photos at 1024×1024 resolution, with a ~500-token text prompt asking for JSON output.
| Provider | Image tokens (×1k) | Text input | Total input | Input cost |
|---|---|---|---|---|
| GPT-4o high-detail | 765,000 | 500,000 | 1,265,000 | $3.16 |
| GPT-4o low-detail | 85,000 | 500,000 | 585,000 | $1.46 |
| Claude Sonnet 4.5 | 1,400,000 | 500,000 | 1,900,000 | $5.70 |
| Gemini 2.5 Flash | 258,000 | 500,000 | 758,000 | $0.23 |
Same workload, ~25× difference between cheapest (Gemini Flash) and most expensive (Claude Sonnet on high-detail). The per-image cost difference compounds fast at scale.
How to count exactly
The token counter on this site supports image inputs for Claude, GPT-4o, and Gemini, drop your image into the upload field and you'll see the exact token count returned by each provider's actual API. That's the only way to be sure of the bill before you send it.
For programmatic counting:
- OpenAI does not expose an image token counter; you have to compute it from image dimensions using the tile formula above
- Anthropic counts image tokens via the same
/v1/messages/count_tokensendpoint as text, pass the image as a content block and you get back the count - Google counts via
countTokenson the Gemini API endpoint
The takeaway
If you're cost-sensitive on a vision workload, the order is usually:
1. Gemini Flash for high-volume document/photo classification (cheapest) 2. GPT-4o low-detail for "what's in this image?" tasks 3. GPT-4o high-detail when fine-grained reading matters 4. Claude when you specifically want Claude's reasoning over the image's content
Don't use Opus on a 10,000-image batch unless you've calculated the bill, it can run 30× more than a Gemini-Flash equivalent.
Try this on every model
- Claude Opus 4.8 $5.00/$25.00
- Claude Opus 4.8 (Fast Mode) $10.00/$50.00
- Claude Sonnet 4.6 $3.00/$15.00
- Claude Haiku 4.5 $1.00/$5.00
- GPT-5.5 $5.00/$30.00
- GPT-5.5 Pro $30.00/$180.00
- GPT-5.4 $2.50/$15.00
- GPT-5.4 Mini $0.75/$4.50
- GPT-5.4 Nano $0.20/$1.25
- GPT-5.4 Pro $30.00/$180.00
- GPT-5.3 $1.75/$14.00
- GPT-5.2 $1.75/$14.00
- GPT-5.2 Pro $21.00/$168.00
- GPT-5.1 $1.25/$10.00
- GPT-5 $1.25/$10.00
- GPT-5 Mini $0.25/$2.00
- GPT-5 Nano $0.05/$0.40
- GPT-5 Pro $15.00/$120.00
- GPT-4.1 $2.00/$8.00
- GPT-4.1 Mini $0.40/$1.60
- GPT-4.1 Nano $0.10/$0.40
- o3 $2.00/$8.00
- o3-mini $1.10/$4.40
- o3-pro $20.00/$80.00
- o4-mini $1.10/$4.40
- GPT-4o $2.50/$10.00
- GPT-4o mini $0.15/$0.60
- GPT-4 Turbo $10.00/$30.00
- Gemini 3.1 Pro $2.00/$12.00
- Gemini 3 Flash $0.50/$3.00
- Gemini 3.1 Flash-Lite $0.25/$1.50
- Gemini 2.5 Pro $1.25/$10.00
- Gemini 2.5 Flash $0.30/$2.50
- Gemini 2.5 Flash-Lite $0.10/$0.40
- Llama 3.3 70B $0.88/$0.88
- Llama 3.1 405B $3.50/$3.50
- Llama 3.1 70B $0.59/$0.79
- Llama 3.1 8B $0.18/$0.18
- Mistral Large $2.00/$6.00
- DeepSeek V3 $0.27/$1.10
- DeepSeek V3.1 $0.60/$1.70
- DeepSeek R1 $3.00/$7.00
- Qwen 2.5 72B $0.90/$0.90
- Qwen 2.5 Coder 32B $0.80/$0.80
- Qwen3 Coder 480B $2.00/$2.00
- GLM-5.1 $1.40/$4.40