How are tokens counted for images, audio, and PDFs?
The short answer
Multimodal token counts vary dramatically by provider; there is no universal "an image is N tokens" rule. As rough April 2026 numbers for a single 1024×1024 image:
- OpenAI GPT-4o family: ~765 tokens in high-detail mode (low-detail mode: 85 tokens)
- Anthropic Claude: ~1,400 tokens
- Google Gemini: ~774 tokens (tile-based; images of 384×384 or smaller cost a flat 258 tokens)
- Image generation output: counted separately ($60/M tokens on Gemini 3.1 Flash Image)
This counter handles text only today. Multimodal counting is on the roadmap. Below is what you need to know to estimate cost manually.
Image tokens by provider
OpenAI (GPT-4o, GPT-5 vision-enabled)
- Low detail: flat 85 tokens regardless of image size
- High detail: 85 base tokens + 170 tokens per 512×512 tile, counted after the image is scaled to fit within a 2048×2048 box and then scaled down so its short side is at most 768px
- Typical 1024×1024 high-detail image: 765 tokens (scaled to 768×768, 4 tiles: 85 + 4 × 170)
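The high-detail rule above can be sketched in a few lines of Python. This is a rough estimator, not OpenAI's billing code: the function name `openai_image_tokens` is hypothetical, and the sketch assumes images are only ever scaled down (never up) and that dimensions are rounded to whole pixels between steps.

```python
import math


def openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image using the tile rule described above."""
    if detail == "low":
        return 85  # flat cost, independent of image size

    # Step 1: scale down (never up) so the image fits inside a 2048x2048 box.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: scale down (never up) so the short side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: 170 tokens per 512x512 tile (partial tiles count), plus 85 base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

Under these assumptions, `openai_image_tokens(1024, 1024)` gives 765 (four tiles) and `openai_image_tokens(2048, 4096)` gives 1105 (six tiles).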
Anthropic (Claude vision)
- Image tokens ≈ (width × height) / 750
- 1024×1024 image: ~1,400 tokens
- 1568×1568 (max): ~3,300 tokens
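As a sketch of the area-based estimate, the formula can be wrapped in a small helper. The name `claude_image_tokens` and the `max_side` parameter are hypothetical, and the proportional downscale of oversized images to a 1568px long side is an assumption based on the "max" figure above, not a statement of Anthropic's exact resizing logic.

```python
def claude_image_tokens(width: int, height: int, max_side: int = 1568) -> int:
    """Approximate image tokens as (width * height) / 750."""
    # Assumption: images whose long side exceeds max_side are scaled down
    # proportionally before tokens are counted.
    scale = min(1.0, max_side / max(width, height))
    w, h = round(width * scale), round(height * scale)
    return round(w * h / 750)
```

For example, `claude_image_tokens(1024, 1024)` evaluates to 1398 and `claude_image_tokens(1568, 1568)` to 3278.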
Google Gemini
- Images ≤384×384: flat 258 tokens
- Images >384×384: tiled at 768×768, 258 tokens per tile
- 1024×1024 image: ~774 tokens (3 tiles)
The variation is real: by the figures above, the same 1024×1024 image costs nearly twice as many tokens on Claude as on Gemini. For high-image-volume workloads, model choice meaningfully affects cost.
Audio tokens
OpenAI
- Audio input on GPT-4o-audio and successors: roughly 80 tokens per second of audio for "standard" quality, higher for high-fidelity modes
- 1 minute of audio ≈ 4,800 audio tokens
Google Gemini
- Audio input: 32 tokens per second flat (Gemini 2.5 family and newer)
- 1 minute of audio = 1,920 audio tokens
- Live API models (gemini-3.1-flash-live-preview) price audio separately: $3/M input audio tokens
Anthropic
- No native audio input on Claude as of April 2026 (use a transcription model upstream).
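The per-second rates above translate directly into a duration-based estimate. The following is a minimal sketch: the `audio_tokens` helper and the provider keys are hypothetical, and the OpenAI rate is the rough "standard" quality figure quoted above, so treat the result as an approximation.

```python
import math

# Tokens per second of input audio, taken from the figures quoted above.
AUDIO_TOKENS_PER_SECOND = {
    "openai": 80,  # rough "standard" quality rate; high-fidelity modes cost more
    "gemini": 32,  # flat rate
}


def audio_tokens(seconds: float, provider: str) -> int:
    """Estimate input audio tokens for a clip of the given length."""
    if provider not in AUDIO_TOKENS_PER_SECOND:
        # e.g. Claude: transcribe upstream and count the transcript as text.
        raise ValueError(f"no audio token rate known for {provider!r}")
    return math.ceil(seconds * AUDIO_TOKENS_PER_SECOND[provider])
```

A one-minute clip gives `audio_tokens(60, "openai") == 4800` and `audio_tokens(60, "gemini") == 1920`, matching the per-minute figures above.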
PDFs
PDFs are typically processed as a stack of images (one per page) plus extracted text:
- Page count × per-page image tokens + extracted text tokens
- A 20-page PDF with high-detail OpenAI vision: roughly 15,000 to 22,000 tokens just for the images (765 to 1,105 tokens per page, i.e. 4 to 6 tiles)
- Use the PDF question page for a more detailed text-side estimate
For cost-sensitive PDF workloads, extract text upstream (pdfplumber, pypdf) and send text only; this cuts per-page cost by 10-50× depending on page density.
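To compare the two routes for a given document, the page-count arithmetic above can be written out as follows. Both helper names are hypothetical, and the per-page inputs in the usage note are illustrative values, not measurements.

```python
def pdf_image_route_tokens(
    pages: int, image_tokens_per_page: int, text_tokens_per_page: int
) -> int:
    """Pages sent as images plus extracted text: both components are counted."""
    return pages * (image_tokens_per_page + text_tokens_per_page)


def pdf_text_route_tokens(pages: int, text_tokens_per_page: int) -> int:
    """Text extracted upstream and sent alone: only text tokens are counted."""
    return pages * text_tokens_per_page
```

For a sparse 20-page deck with an assumed 1,105 image tokens and 100 text tokens per page, the image route comes to 24,100 tokens and the text-only route to 2,000, about a 12× difference. Denser pages narrow the gap, since text tokens grow while the per-page image cost stays fixed.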
Video
Video is a frame-by-frame multimodal load. Gemini handles native video input by sampling frames at a low frame rate (~1 fps by default). Cost = seconds_of_video × frames_per_second × image_tokens_per_frame.
- 60 seconds at 1 fps on Gemini = 60 × 258 = 15,480 (~15,500) image tokens
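The frame-sampling arithmetic can be sketched the same way. The `video_tokens` helper is hypothetical; its defaults reuse the ~1 fps sampling rate and the 258 tokens per frame quoted above, and it counts image tokens only.

```python
import math


def video_tokens(
    seconds: float, fps: float = 1.0, tokens_per_frame: int = 258
) -> int:
    """Estimate image tokens for a video sampled at a fixed frame rate."""
    frames = math.ceil(seconds * fps)  # a partial final frame still counts
    return frames * tokens_per_frame
```

With the defaults, `video_tokens(60)` evaluates to 15,480.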
Practical cost shaping
If multimodal cost matters:
1. Use Gemini for high-image-volume workloads; its tile-based pricing is dramatically cheaper than OpenAI or Claude at scale.
2. Extract text from PDFs upstream instead of sending pages as images.
3. Use low-detail mode on OpenAI when image content is simple (charts, screenshots, single-subject photos).
4. Cache image-bearing prompts if the same image gets queried multiple ways; caching applies to image tokens too on Anthropic and Google.
When this counter will support it
Per-image and per-audio counting is on the v1.1 roadmap. Until then, the home counter handles text only; use the per-provider rules above to add the image/audio component manually.
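Adding the manual component to a text count is simple arithmetic. The sketch below uses a hypothetical `input_cost_usd` helper and assumes that media tokens are billed at the same per-million input rate as text tokens, which does not hold for every model (audio is often priced separately).

```python
def input_cost_usd(
    text_tokens: int, media_tokens: int, price_per_million: float
) -> float:
    """Combine a text token count with a manual media estimate and price it."""
    # Assumption: one input rate applies to both text and media tokens.
    return (text_tokens + media_tokens) * price_per_million / 1_000_000
```

For example, 2,000 text tokens plus one 765-token image at an input price of $2.50 per million tokens comes to roughly $0.0069 per request.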
Try this on every model (input/output price in USD per 1M tokens)
- Claude Opus 4.8 $5.00/$25.00
- Claude Opus 4.8 (Fast Mode) $10.00/$50.00
- Claude Sonnet 4.6 $3.00/$15.00
- Claude Haiku 4.5 $1.00/$5.00
- GPT-5.5 $5.00/$30.00
- GPT-5.5 Pro $30.00/$180.00
- GPT-5.4 $2.50/$15.00
- GPT-5.4 Mini $0.75/$4.50
- GPT-5.4 Nano $0.20/$1.25
- GPT-5.4 Pro $30.00/$180.00
- GPT-5.3 $1.75/$14.00
- GPT-5.2 $1.75/$14.00
- GPT-5.2 Pro $21.00/$168.00
- GPT-5.1 $1.25/$10.00
- GPT-5 $1.25/$10.00
- GPT-5 Mini $0.25/$2.00
- GPT-5 Nano $0.05/$0.40
- GPT-5 Pro $15.00/$120.00
- GPT-4.1 $2.00/$8.00
- GPT-4.1 Mini $0.40/$1.60
- GPT-4.1 Nano $0.10/$0.40
- o3 $2.00/$8.00
- o3-mini $1.10/$4.40
- o3-pro $20.00/$80.00
- o4-mini $1.10/$4.40
- GPT-4o $2.50/$10.00
- GPT-4o mini $0.15/$0.60
- GPT-4 Turbo $10.00/$30.00
- Gemini 3.1 Pro $2.00/$12.00
- Gemini 3 Flash $0.50/$3.00
- Gemini 3.1 Flash-Lite $0.25/$1.50
- Gemini 2.5 Pro $1.25/$10.00
- Gemini 2.5 Flash $0.30/$2.50
- Gemini 2.5 Flash-Lite $0.10/$0.40
- Llama 3.3 70B $0.88/$0.88
- Llama 3.1 405B $3.50/$3.50
- Llama 3.1 70B $0.59/$0.79
- Llama 3.1 8B $0.18/$0.18
- Mistral Large $2.00/$6.00
- DeepSeek V3 $0.27/$1.10
- DeepSeek V3.1 $0.60/$1.70
- DeepSeek R1 $3.00/$7.00
- Qwen 2.5 72B $0.90/$0.90
- Qwen 2.5 Coder 32B $0.80/$0.80
- Qwen3 Coder 480B $2.00/$2.00
- GLM-5.1 $1.40/$4.40