#tHow Many Tokens?

← Back to counter

How many tokens are in an image?

The short answer

It depends entirely on the provider, and the answer is rarely what beginners expect.

So the same 1024×1024 image costs you 765 tokens on GPT-4o, ~1,400 on Claude, and 258 on Gemini. Three providers, three completely different tokenizations of the same pixels.

OpenAI: the tile model

OpenAI's vision pricing has two modes:

Low detail. Image is downsampled to 512×512 and costs a flat 85 tokens regardless of input size. Use this when you only need to identify what's in an image, not read text or fine details.

High detail. Image is divided into 512×512 tiles after being scaled to fit within a 2048×2048 box, then each tile costs 170 tokens plus the 85-token base.

The math:

Original sizeTiles in high-detailHigh-detail tokensLow-detail tokens
512×512185 + 170 = 25585
1024×1024485 + 680 = 76585
1536×1536985 + 1530 = 1,61585
2048×20481685 + 2720 = 2,80585
4096×4096 (downscaled to 2048×2048)1685 + 2720 = 2,80585

Anything beyond 2048×2048 gets downscaled before tiling, so very large images cap at the 2,805-token mark. For batch processing where you don't need fine details, low-detail is 30× cheaper per image for full-resolution inputs.

Anthropic Claude: pixel-area scaling

Claude doesn't tile. Instead, image token count scales smoothly with pixel area: roughly (width × height) / 750, then rounded to the nearest token.

Image sizeApproximate tokens
512×512~350
768×768~790
1024×1024~1,400
1568×1568 (Claude's recommended max)~3,280

Claude documentation recommends a maximum of ~1.15 megapixels per image to keep token costs reasonable and processing fast. Larger images are accepted but get reduced to fit.

Multiple images compound, five 1024×1024 images = ~7,000 tokens, which on Opus 4.8 input pricing ($5/M) costs $0.035 in image tokens alone, before the text prompt or output.

Google Gemini: flat per-image

Gemini's vision tokenization is the most opaque of the three. Public documentation puts it at roughly 258 tokens per image for standard-size inputs (under ~3.4MP). Beyond that, the image is broken into tiles of 768×768, each consuming roughly the same token budget.

Gemini's approach makes per-image cost predictable and very low for typical web/photo resolutions, but harder to predict for high-resolution scientific imagery or scans.

Why this matters for cost

Take a real workflow: extracting structured data from 1,000 product photos at 1024×1024 resolution, with a ~500-token text prompt asking for JSON output.

ProviderImage tokens (×1k)Text inputTotal inputInput cost
GPT-4o high-detail765,000500,0001,265,000$3.16
GPT-4o low-detail85,000500,000585,000$1.46
Claude Sonnet 4.51,400,000500,0001,900,000$5.70
Gemini 2.5 Flash258,000500,000758,000$0.23

Same workload, ~25× difference between cheapest (Gemini Flash) and most expensive (Claude Sonnet on high-detail). The per-image cost difference compounds fast at scale.

How to count exactly

The token counter on this site supports image inputs for Claude, GPT-4o, and Gemini, drop your image into the upload field and you'll see the exact token count returned by each provider's actual API. That's the only way to be sure of the bill before you send it.

For programmatic counting:

The takeaway

If you're cost-sensitive on a vision workload, the order is usually:

1. Gemini Flash for high-volume document/photo classification (cheapest) 2. GPT-4o low-detail for "what's in this image?" tasks 3. GPT-4o high-detail when fine-grained reading matters 4. Claude when you specifically want Claude's reasoning over the image's content

Don't use Opus on a 10,000-image batch unless you've calculated the bill, it can run 30× more than a Gemini-Flash equivalent.

Try the multimodal counter →

Try this on every model

Try the live counter →