How many tokens are in an image?

Q: How many tokens are in an image?

Image tokens are billed per tile by GPT-4o, per pixel-area by Claude, and per second-equivalent by Gemini. Here's the math for each provider in 2026.

Updated 2026-05-31 · By Clinton Patrick · Methodology

The short answer

It depends entirely on the provider, and the answer is rarely what beginners expect.

GPT-4o / GPT-5 vision: Each image is broken into 512×512 tiles. Each tile = 170 tokens, plus a fixed 85-token "base" token for low-detail mode, or scaled per resolution for high-detail. A 1024×1024 image = 765 tokens in high-detail mode (85 base + 4 tiles × 170 = 765).
Claude (Opus/Sonnet/Haiku): Approximately (width × height) / 750 tokens. A 1024×1024 image = ~1,400 tokens. Claude scales smoothly with resolution rather than tiling.
Gemini 2.5 / 3.x vision: Roughly 258 tokens per image regardless of resolution, up to a cap. Larger images may be reduced to fit the per-call limit.

So the same 1024×1024 image costs you 765 tokens on GPT-4o, ~1,400 on Claude, and 258 on Gemini. Three providers, three completely different tokenizations of the same pixels.

OpenAI: the tile model

OpenAI's vision pricing has two modes:

Low detail. Image is downsampled to 512×512 and costs a flat 85 tokens regardless of input size. Use this when you only need to identify what's in an image, not read text or fine details.

High detail. Image is divided into 512×512 tiles after being scaled to fit within a 2048×2048 box, then each tile costs 170 tokens plus the 85-token base.

The math:

Original size	Tiles in high-detail	High-detail tokens	Low-detail tokens
512×512	1	85 + 170 = 255	85
1024×1024	4	85 + 680 = 765	85
1536×1536	9	85 + 1530 = 1,615	85
2048×2048	16	85 + 2720 = 2,805	85
4096×4096 (downscaled to 2048×2048)	16	85 + 2720 = 2,805	85

Anything beyond 2048×2048 gets downscaled before tiling, so very large images cap at the 2,805-token mark. For batch processing where you don't need fine details, low-detail is 30× cheaper per image for full-resolution inputs.

Anthropic Claude: pixel-area scaling

Claude doesn't tile. Instead, image token count scales smoothly with pixel area: roughly (width × height) / 750, then rounded to the nearest token.

Image size	Approximate tokens
512×512	~350
768×768	~790
1024×1024	~1,400
1568×1568 (Claude's recommended max)	~3,280

Claude documentation recommends a maximum of ~1.15 megapixels per image to keep token costs reasonable and processing fast. Larger images are accepted but get reduced to fit.

Multiple images compound, five 1024×1024 images = ~7,000 tokens, which on Opus 4.8 input pricing ($5/M) costs $0.035 in image tokens alone, before the text prompt or output.

Google Gemini: flat per-image

Gemini's vision tokenization is the most opaque of the three. Public documentation puts it at roughly 258 tokens per image for standard-size inputs (under ~3.4MP). Beyond that, the image is broken into tiles of 768×768, each consuming roughly the same token budget.

Gemini's approach makes per-image cost predictable and very low for typical web/photo resolutions, but harder to predict for high-resolution scientific imagery or scans.

Why this matters for cost

Take a real workflow: extracting structured data from 1,000 product photos at 1024×1024 resolution, with a ~500-token text prompt asking for JSON output.

Provider	Image tokens (×1k)	Text input	Total input	Input cost
GPT-4o high-detail	765,000	500,000	1,265,000	$3.16
GPT-4o low-detail	85,000	500,000	585,000	$1.46
Claude Sonnet 4.5	1,400,000	500,000	1,900,000	$5.70
Gemini 2.5 Flash	258,000	500,000	758,000	$0.23

Same workload, ~25× difference between cheapest (Gemini Flash) and most expensive (Claude Sonnet on high-detail). The per-image cost difference compounds fast at scale.

How to count exactly

The token counter on this site supports image inputs for Claude, GPT-4o, and Gemini, drop your image into the upload field and you'll see the exact token count returned by each provider's actual API. That's the only way to be sure of the bill before you send it.

For programmatic counting:

OpenAI does not expose an image token counter; you have to compute it from image dimensions using the tile formula above
Anthropic counts image tokens via the same /v1/messages/count_tokens endpoint as text, pass the image as a content block and you get back the count
Google counts via countTokens on the Gemini API endpoint

The takeaway

If you're cost-sensitive on a vision workload, the order is usually:

1. Gemini Flash for high-volume document/photo classification (cheapest) 2. GPT-4o low-detail for "what's in this image?" tasks 3. GPT-4o high-detail when fine-grained reading matters 4. Claude when you specifically want Claude's reasoning over the image's content

Don't use Opus on a 10,000-image batch unless you've calculated the bill, it can run 30× more than a Gemini-Flash equivalent.

Try the multimodal counter →

Try this on every model

Claude Opus 4.8 $5.00/$25.00
Claude Opus 4.8 (Fast Mode) $10.00/$50.00
Claude Sonnet 4.6 $3.00/$15.00
Claude Haiku 4.5 $1.00/$5.00
GPT-5.5 $5.00/$30.00
GPT-5.5 Pro $30.00/$180.00
GPT-5.4 $2.50/$15.00
GPT-5.4 Mini $0.75/$4.50
GPT-5.4 Nano $0.20/$1.25
GPT-5.4 Pro $30.00/$180.00
GPT-5.3 $1.75/$14.00
GPT-5.2 $1.75/$14.00
GPT-5.2 Pro $21.00/$168.00
GPT-5.1 $1.25/$10.00
GPT-5 $1.25/$10.00
GPT-5 Mini $0.25/$2.00
GPT-5 Nano $0.05/$0.40
GPT-5 Pro $15.00/$120.00
GPT-4.1 $2.00/$8.00
GPT-4.1 Mini $0.40/$1.60
GPT-4.1 Nano $0.10/$0.40
o3 $2.00/$8.00
o3-mini $1.10/$4.40
o3-pro $20.00/$80.00
o4-mini $1.10/$4.40
GPT-4o $2.50/$10.00
GPT-4o mini $0.15/$0.60
GPT-4 Turbo $10.00/$30.00
Gemini 3.1 Pro $2.00/$12.00
Gemini 3 Flash $0.50/$3.00
Gemini 3.1 Flash-Lite $0.25/$1.50
Gemini 2.5 Pro $1.25/$10.00
Gemini 2.5 Flash $0.30/$2.50
Gemini 2.5 Flash-Lite $0.10/$0.40
Llama 3.3 70B $0.88/$0.88
Llama 3.1 405B $3.50/$3.50
Llama 3.1 70B $0.59/$0.79
Llama 3.1 8B $0.18/$0.18
Mistral Large $2.00/$6.00
DeepSeek V3 $0.27/$1.10
DeepSeek V3.1 $0.60/$1.70
DeepSeek R1 $3.00/$7.00
Qwen 2.5 72B $0.90/$0.90
Qwen 2.5 Coder 32B $0.80/$0.80
Qwen3 Coder 480B $2.00/$2.00
GLM-5.1 $1.40/$4.40

Try the live counter →