How Many Tokens?

How are tokens counted for images, audio, and PDFs?

The short answer

Multimodal token counts vary dramatically by provider; there is no universal "an image is N tokens" rule. As rough April 2026 numbers for a single 1024×1024 image:

This counter handles text only today. Multimodal counting is on the roadmap. Below is what you need to know to estimate cost manually.

Image tokens by provider

OpenAI (GPT-4o, GPT-5 vision-enabled)

Anthropic (Claude vision)

Google Gemini

The variation is real: the same image costs 4× more on Claude than on Gemini. For high-image-volume workloads, model choice meaningfully affects cost.
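
To see how a per-image gap compounds at volume, here is a minimal sketch. The token counts and the price per million tokens are placeholder values chosen only to reproduce the 4× ratio mentioned above; they are not real provider figures, and `monthly_image_cost` is a hypothetical helper.

```python
def monthly_image_cost(images_per_month: int,
                       tokens_per_image: int,
                       usd_per_million_tokens: float) -> float:
    """Input-token cost of images alone, ignoring text and output tokens."""
    total_tokens = images_per_month * tokens_per_image
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder numbers (assumptions, not provider pricing): provider A uses
# 4x the tokens per image of provider B, at the same price per token.
cost_a = monthly_image_cost(100_000, tokens_per_image=1_000, usd_per_million_tokens=2.0)
cost_b = monthly_image_cost(100_000, tokens_per_image=250, usd_per_million_tokens=2.0)
print(f"A: ${cost_a:.2f}  B: ${cost_b:.2f}  ratio: {cost_a / cost_b:.1f}x")
```

Swap in the current per-image token count and input price for each model you are comparing; the ratio only equals the token ratio when the per-token prices match.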

Audio tokens

OpenAI

Google Gemini

Anthropic

PDFs

PDFs are typically processed as a stack of images (one per page) plus extracted text:

For cost-sensitive PDF workloads, extract text upstream (pdfplumber, pypdf) and send text only; this cuts per-page cost by 10-50× depending on page density.
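
As an illustration of the upstream-extraction approach, the sketch below pulls per-page text with pypdf and applies a very rough size estimate. The helper names are hypothetical, and the 4-characters-per-token ratio is an assumed rule of thumb, not a tokenizer; use a real counter for billing-grade numbers.

```python
def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the extracted text of each page using pypdf."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    # Scanned pages without a text layer can yield empty text.
    return [page.extract_text() or "" for page in reader.pages]


def rough_text_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; chars_per_token is an assumption."""
    return round(len(text) / chars_per_token)


# A page with about 2,000 characters of extracted text:
print(rough_text_tokens("a" * 2000))
```

Pages that come back empty are probably scans; those still need to go through as images or through OCR first.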

Video

Video is a frame-by-frame multimodal load. Gemini handles native video input by sampling frames at a low frame rate (~1 fps by default). At that rate, cost ≈ seconds_of_video × image_tokens_per_frame.
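
The formula above can be sketched as a small function. The sampling rate is a parameter so other frame rates can be tried; `tokens_per_frame=250` in the example call is a placeholder, not a published figure, and any audio-track tokens are ignored here.

```python
import math


def estimate_video_tokens(duration_seconds: float,
                          tokens_per_frame: int,
                          frames_per_second: float = 1.0) -> int:
    """Frames sampled (rounded up) times the per-frame image token cost."""
    frames = math.ceil(duration_seconds * frames_per_second)
    return frames * tokens_per_frame


# 90 seconds of video at the default 1 fps, with a placeholder per-frame cost.
print(estimate_video_tokens(90, tokens_per_frame=250))
```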

Practical cost shaping

If multimodal cost matters:

1. Use Gemini for high-image-volume workloads. Its tile-based pricing is dramatically cheaper than OpenAI or Claude at scale.
2. Extract text from PDFs upstream instead of sending pages as images.
3. Use low-detail mode on OpenAI when image content is simple (charts, screenshots, single-subject photos).
4. Cache image-bearing prompts if the same image gets queried multiple ways. Caching applies to image tokens too on Anthropic and Google.

When this counter will support it

Per-image and per-audio counting is on the v1.1 roadmap. Until then, the home counter handles text only; use the per-provider rules above to add the image and audio components manually.
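
One way to do that manual addition is sketched below. `RequestEstimate` is a hypothetical helper, the per-image and per-second rates in the example are placeholders to be replaced with the provider's current rules, and treating audio as a flat tokens-per-second rate is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RequestEstimate:
    text_tokens: int                      # from the text counter
    image_count: int = 0
    tokens_per_image: int = 0             # provider-specific, look it up
    audio_seconds: float = 0.0
    audio_tokens_per_second: float = 0.0  # provider-specific, look it up

    def total_tokens(self) -> int:
        image_tokens = self.image_count * self.tokens_per_image
        audio_tokens = round(self.audio_seconds * self.audio_tokens_per_second)
        return self.text_tokens + image_tokens + audio_tokens


# Placeholder rates, for illustration only.
estimate = RequestEstimate(text_tokens=1_200, image_count=3, tokens_per_image=250,
                           audio_seconds=60, audio_tokens_per_second=10)
print(estimate.total_tokens())
```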
