How are tokens counted for images, audio, and PDFs?

Images, audio, PDFs, and video are tokenized differently by each provider. A 1024×1024 image comes to roughly 765 tokens on GPT-4o (high detail), ~1,030 on Gemini, and ~1,400 on Claude. Full breakdown below.

Updated 2026-05-31 · By Clinton Patrick · Methodology

The short answer

Multimodal token counts vary dramatically by provider; there is no universal "an image is N tokens" rule. As rough April 2026 numbers for a single 1024×1024 image:

OpenAI GPT-4o family: 765 tokens in high-detail mode (low-detail mode: 85 tokens)
Anthropic Claude: ~1,400 tokens
Google Gemini: ~1,030 tokens (258 per 768×768 tile; images up to 384×384 are a flat 258)
Image generation output: counted separately ($60/M tokens on Gemini 3.1 Flash Image)

This counter handles text only today. Multimodal counting is on the roadmap. Below is what you need to know to estimate cost manually.

Image tokens by provider

OpenAI (GPT-4o, GPT-5 vision-enabled)

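Low detail: flat 85 tokens regardless of image size
High detail: 85 base tokens + 170 tokens per 512×512 tile, counted after the image is scaled to fit within a 2048×2048 box and then scaled down so its short side is at most 768px
Typical 1024×1024 high-detail image: 765 tokens (scaled to 768×768, which spans 4 tiles: 85 + 4 × 170)
2048×4096 high-detail image: 1,105 tokens (scaled to 768×1536, which spans 6 tiles: 85 + 6 × 170)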
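The resize-then-tile rule is easier to check in code than by hand. The following is a minimal sketch of that arithmetic, assuming images are never upscaled; the function name openai_image_tokens is made up for illustration and is not part of any OpenAI SDK.

```python
import math

def openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image tokens from the 85 + 170-per-tile rule (illustrative helper)."""
    if detail == "low":
        return 85  # flat cost regardless of image size
    # Step 1: shrink to fit inside a 2048x2048 box (assumption: never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the short side is at most 768px (assumption: never upscale).
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512x512 tiles needed to cover the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(openai_image_tokens(1024, 1024))         # 4 tiles
print(openai_image_tokens(2048, 4096))         # 6 tiles
print(openai_image_tokens(1024, 1024, "low"))  # flat rate
```

Treat the result as an estimate for budgeting; the usage figures returned by the API are the authoritative count.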

Anthropic (Claude vision)

Image tokens ≈ (width × height) / 750
1024×1024 image: ~1,400 tokens (1,048,576 / 750 ≈ 1,398)
1568×1568 (max): ~3,300 tokens
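As a quick sanity check on the area rule, here is a small sketch that applies it directly. The helper name claude_image_tokens is hypothetical, and the divisor 750 is the rule of thumb quoted above, so the output is an approximation rather than a billed figure.

```python
def claude_image_tokens(width: int, height: int) -> int:
    """Approximate image tokens as (width * height) / 750 (illustrative helper)."""
    return round(width * height / 750)

for size in (512, 1024, 1568):
    print(f"{size}x{size}: ~{claude_image_tokens(size, size)} tokens")
```

Because the estimate scales with pixel area, halving both dimensions cuts the token count to about a quarter.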

Google Gemini

Images ≤384×384: flat 258 tokens
Images >384×384: tiled at 768×768, 258 tokens per tile
1024×1024 image: ~1,032 tokens (a 2×2 grid of 768×768 tiles, 4 × 258)
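The Gemini rule can be sketched the same way. This version assumes tiles are counted as a plain grid of 768×768 cells covering the image (rounding up in each dimension); Google's actual cropping and scaling may produce a different tile count, so treat gemini_image_tokens as a hypothetical estimator. Under this assumed reading, a 1024×1024 image covers a 2×2 grid.

```python
import math

def gemini_image_tokens(width: int, height: int, tile: int = 768) -> int:
    """Estimate Gemini image tokens: flat 258 for small images, else 258 per tile."""
    if width <= 384 and height <= 384:
        return 258  # small images are a single flat charge
    # Assumption: a simple grid of tile x tile cells, rounded up per dimension.
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return 258 * tiles

print(gemini_image_tokens(300, 300))    # small image, flat rate
print(gemini_image_tokens(768, 768))    # exactly one tile
print(gemini_image_tokens(1024, 1024))  # 2 x 2 grid under this assumption
```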

The variation is real: under these rules the same 1024×1024 image uses nearly twice as many tokens on Claude (~1,400) as on GPT-4o in high-detail mode (765). For high-image-volume workloads, model choice meaningfully affects cost.

Audio tokens

OpenAI

Audio input on GPT-4o-audio and successors: roughly 80 tokens per second of audio for "standard" quality, higher for high-fidelity modes
1 minute of audio ≈ 4,800 audio tokens

Google Gemini

Audio input: 32 tokens per second flat (Gemini 2.5 family and newer)
1 minute of audio = 1,920 audio tokens
Live API models (gemini-3.1-flash-live-preview) price audio separately: $3/M input audio tokens

Anthropic

No native audio input on Claude as of April 2026 (use a transcription model upstream).
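The per-second rates above turn into a one-line estimate. This sketch hard-codes the two rates quoted in this section (80 tokens/s for OpenAI standard quality, 32 tokens/s for Gemini); the dictionary and function names are illustrative only.

```python
# Rates quoted in this section; adjust them if a provider changes its accounting.
AUDIO_TOKENS_PER_SECOND = {"openai": 80, "gemini": 32}

def audio_tokens(seconds: float, provider: str) -> int:
    """Estimate audio input tokens for a clip of the given length."""
    return round(seconds * AUDIO_TOKENS_PER_SECOND[provider])

print(audio_tokens(60, "openai"))  # one minute at 80 tokens/s
print(audio_tokens(60, "gemini"))  # one minute at 32 tokens/s
```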

PDFs

PDFs are typically processed as a stack of images (one per page) plus extracted text:

Page count × per-page image tokens + extracted text tokens
A 20-page PDF with high-detail OpenAI vision: ~22,000 tokens just for the images (20 pages × ~1,105 tokens, i.e. 6 tiles per page)
Use the PDF question page for a more detailed text-side estimate

For cost-sensitive PDF workloads, extract text upstream (pdfplumber, pypdf) and send text only. This can cut per-page cost by 10-50×, depending on page density.
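To compare the two approaches, the page formula above can be written as a small helper. The per-page figures in the example (1,105 image tokens, 500 text tokens) are placeholder assumptions, not measurements; substitute numbers from your own documents.

```python
def pdf_tokens(pages: int, image_tokens_per_page: int, text_tokens_per_page: int) -> int:
    """Page count x (per-page image tokens + per-page extracted text tokens)."""
    return pages * (image_tokens_per_page + text_tokens_per_page)

# Hypothetical 20-page document, sent as page images plus text vs. text only.
as_images = pdf_tokens(20, image_tokens_per_page=1105, text_tokens_per_page=500)
text_only = pdf_tokens(20, image_tokens_per_page=0, text_tokens_per_page=500)
print(as_images, text_only)
```

The gap between the two totals depends almost entirely on how much text each page carries, which is why sparse pages benefit most from upstream extraction.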

Video

Video is a frame-by-frame multimodal load. Gemini handles native video input by sampling frames at a low frame rate (~1 fps by default). Tokens ≈ seconds_of_video × frames_per_second × image_tokens_per_frame.

60 seconds at 1 fps on Gemini = 60 × 258 = ~15,500 image tokens
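The same arithmetic as a sketch, with the sampling rate and per-frame cost exposed as parameters. The defaults (1 fps, 258 tokens per frame) are the figures quoted above; the function name is illustrative, and any audio track would be counted separately.

```python
def video_tokens(seconds: float, fps: float = 1.0, tokens_per_frame: int = 258) -> int:
    """Estimate image tokens for sampled video frames (audio not included)."""
    frames = int(seconds * fps)  # whole frames sampled over the clip
    return frames * tokens_per_frame

print(video_tokens(60))           # one minute at the default 1 fps
print(video_tokens(60, fps=0.5))  # sampling every other second halves the count
```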

Practical cost shaping

If multimodal cost matters:

1. Consider Gemini for high-image-volume workloads: images up to 384×384 cost a flat 258 tokens, and its lower per-token input prices can make image-heavy prompts cheaper than on OpenAI or Claude.
2. Extract text from PDFs upstream instead of sending pages as images.
3. Use low-detail mode on OpenAI when image content is simple (charts, screenshots, single-subject photos).
4. Cache image-bearing prompts if the same image gets queried multiple ways; caching applies to image tokens too on Anthropic and Google.

When this counter will support it

Per-image and per-audio counting is on the v1.1 roadmap. Until then, the home counter handles text only; use the per-provider rules above to add the image/audio component manually.
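Adding the components manually amounts to summing token counts and multiplying by an input price. A minimal sketch, assuming prices are quoted in dollars per million input tokens; the $2.50 figure and the token counts in the example are illustrative placeholders, and the function name is hypothetical.

```python
def input_cost_usd(text_tokens: int, media_tokens: int, price_per_million: float) -> float:
    """Combine text and media tokens, then price them per million input tokens."""
    return (text_tokens + media_tokens) / 1_000_000 * price_per_million

# Example: 2,000 text tokens from the counter plus 10,000 estimated image tokens.
cost = input_cost_usd(text_tokens=2_000, media_tokens=10_000, price_per_million=2.50)
print(f"${cost:.4f}")
```

This covers input only; output tokens are billed at a separate rate, and some providers price audio or image tokens differently from text.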

Try this on every model

Claude Opus 4.8 $5.00/$25.00
Claude Opus 4.8 (Fast Mode) $10.00/$50.00
Claude Sonnet 4.6 $3.00/$15.00
Claude Haiku 4.5 $1.00/$5.00
GPT-5.5 $5.00/$30.00
GPT-5.5 Pro $30.00/$180.00
GPT-5.4 $2.50/$15.00
GPT-5.4 Mini $0.75/$4.50
GPT-5.4 Nano $0.20/$1.25
GPT-5.4 Pro $30.00/$180.00
GPT-5.3 $1.75/$14.00
GPT-5.2 $1.75/$14.00
GPT-5.2 Pro $21.00/$168.00
GPT-5.1 $1.25/$10.00
GPT-5 $1.25/$10.00
GPT-5 Mini $0.25/$2.00
GPT-5 Nano $0.05/$0.40
GPT-5 Pro $15.00/$120.00
GPT-4.1 $2.00/$8.00
GPT-4.1 Mini $0.40/$1.60
GPT-4.1 Nano $0.10/$0.40
o3 $2.00/$8.00
o3-mini $1.10/$4.40
o3-pro $20.00/$80.00
o4-mini $1.10/$4.40
GPT-4o $2.50/$10.00
GPT-4o mini $0.15/$0.60
GPT-4 Turbo $10.00/$30.00
Gemini 3.1 Pro $2.00/$12.00
Gemini 3 Flash $0.50/$3.00
Gemini 3.1 Flash-Lite $0.25/$1.50
Gemini 2.5 Pro $1.25/$10.00
Gemini 2.5 Flash $0.30/$2.50
Gemini 2.5 Flash-Lite $0.10/$0.40
Llama 3.3 70B $0.88/$0.88
Llama 3.1 405B $3.50/$3.50
Llama 3.1 70B $0.59/$0.79
Llama 3.1 8B $0.18/$0.18
Mistral Large $2.00/$6.00
DeepSeek V3 $0.27/$1.10
DeepSeek V3.1 $0.60/$1.70
DeepSeek R1 $3.00/$7.00
Qwen 2.5 72B $0.90/$0.90
Qwen 2.5 Coder 32B $0.80/$0.80
Qwen3 Coder 480B $2.00/$2.00
GLM-5.1 $1.40/$4.40
