API pricing

Fair, flexible pricing.
Built to scale with you.

With the Soniox Speech-to-Text and Text-to-Speech API, you pay only for what you use, whether you transcribe in real time or in batch, across 60+ languages.

Pricing calculator

Stop overpaying for speech AI

Calculator settings: Soniox vs. a selected provider; monthly volume adjustable from 10 to 100,000 hours of audio (shown: 1,000 hours of audio / month).

Pricing assumptions

Based on public pay-as-you-go pricing. Enterprise discounts and committed-use contracts may differ. Some providers charge separately for certain features. The calculator uses the public price for the provider configuration that most closely matches Soniox.

Speech-to-Text API pricing

Token-based pricing

All API costs are calculated based on tokens.

Equivalent to about $0.10/hour for async (file) and $0.12/hour for real-time (streaming) transcription.

 
Input audio tokens
Duration of audio or streaming session
Async (file): $1.50 per 1M tokens
Real-time (streaming): $2.00 per 1M tokens

Input text tokens
Custom instructions or context you provide
Async (file): $3.50 per 1M tokens
Real-time (streaming): $4.00 per 1M tokens

Output text tokens
Transcription and optionally translation or other text returned by the model
Async (file): $3.50 per 1M tokens
Real-time (streaming): $4.00 per 1M tokens

Usage reference:
1 hour of audio is ~30,000 input audio tokens • 1 hour of speech is ~15,000 output text tokens • 1 character of output is ~0.3 tokens
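
To see how the per-token prices lead to the hourly figures, here is a minimal sketch in Python. The function and constant names are illustrative and not part of any Soniox SDK. It assumes the approximate conversion factors from the usage reference; real token counts depend on how much speech the audio contains and on whether translation or custom context is used.

```python
# Prices in USD per 1M tokens, copied from the Speech-to-Text pricing above.
STT_PRICES = {
    "async": {"input_audio": 1.50, "input_text": 3.50, "output_text": 3.50},
    "real_time": {"input_audio": 2.00, "input_text": 4.00, "output_text": 4.00},
}

# Approximate conversion factors from the usage reference.
AUDIO_TOKENS_PER_HOUR = 30_000
OUTPUT_TEXT_TOKENS_PER_HOUR = 15_000


def stt_cost(mode, audio_hours, context_tokens=0):
    """Estimate Speech-to-Text cost in USD (hypothetical helper).

    mode: "async" or "real_time"
    audio_hours: hours of audio or streaming session
    context_tokens: input text tokens for custom instructions or context
    """
    prices = STT_PRICES[mode]
    audio_tokens = audio_hours * AUDIO_TOKENS_PER_HOUR
    output_tokens = audio_hours * OUTPUT_TEXT_TOKENS_PER_HOUR
    total_micro_usd = (
        audio_tokens * prices["input_audio"]
        + context_tokens * prices["input_text"]
        + output_tokens * prices["output_text"]
    )
    return total_micro_usd / 1_000_000


if __name__ == "__main__":
    print(f"async, 1 hour:     ${stt_cost('async', 1):.4f}")
    print(f"real-time, 1 hour: ${stt_cost('real_time', 1):.4f}")
```

Under these assumptions, one hour works out to $0.045 of audio tokens plus $0.0525 of output text tokens for async, and $0.06 plus $0.06 for real-time, which rounds to the $0.10 and $0.12 hourly figures quoted above.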

Text-to-Speech API pricing

Token-based pricing

All API costs are calculated based on tokens.

Equivalent to about $0.70/hour of generated speech.

 
Input text tokens
Text input to generate
Real-time (streaming): $4.00 per 1M tokens

Output audio tokens
Duration of generated audio
Real-time (streaming): $21.50 per 1M tokens

Usage reference:
1 character ≈ 0.3 input text tokens • 15,000 input text tokens ≈ 1 hour of generated speech • 1 hour of generated speech ≈ 30,000 output audio tokens
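
The same arithmetic applies to Text-to-Speech. The sketch below estimates cost and audio duration from a character count, using only the approximate ratios in the usage reference. The helper name is hypothetical, and actual audio length will vary with the text and voice.

```python
# Prices in USD per 1M tokens, copied from the Text-to-Speech pricing above.
TTS_INPUT_TEXT_PRICE = 4.00
TTS_OUTPUT_AUDIO_PRICE = 21.50

# Approximate conversion factors from the usage reference.
TOKENS_PER_CHARACTER = 0.3
INPUT_TEXT_TOKENS_PER_HOUR = 15_000
OUTPUT_AUDIO_TOKENS_PER_HOUR = 30_000


def tts_estimate(num_characters):
    """Return (estimated cost in USD, estimated hours of generated speech).

    Hypothetical helper: derives token counts from the character count
    using the approximate ratios above.
    """
    input_tokens = num_characters * TOKENS_PER_CHARACTER
    hours = input_tokens / INPUT_TEXT_TOKENS_PER_HOUR
    output_tokens = hours * OUTPUT_AUDIO_TOKENS_PER_HOUR
    cost = (
        input_tokens * TTS_INPUT_TEXT_PRICE
        + output_tokens * TTS_OUTPUT_AUDIO_PRICE
    ) / 1_000_000
    return cost, hours


if __name__ == "__main__":
    cost, hours = tts_estimate(50_000)
    print(f"{hours:.2f} hours of speech, about ${cost:.3f}")
```

About 50,000 characters correspond to roughly one hour of generated speech: $0.06 for input text tokens plus $0.645 for output audio tokens, or about $0.70 in total.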

How the pricing works

Breakthrough innovation is why Soniox costs less

Soniox costs less because the technology is fundamentally more efficient. We built the full speech AI stack ourselves, from models to inference to real-time cloud infrastructure, and optimized every layer to process more audio with lower latency and less wasted compute.

That efficiency is what lets us offer production-grade speech AI at a fraction of the price of traditional providers.

Built for real-time speech AI

Soniox models are built from scratch for real-time speech understanding and generation, not adapted from general-purpose models that waste compute.

Custom inference engine

Our inference stack is built for low-latency audio streaming, batching, scheduling, and GPU utilization, so the same hardware processes more audio at lower cost.

Massive concurrency

The Soniox platform is engineered to run hundreds of thousands of concurrent streams efficiently, turning infrastructure scale into lower prices for every customer.

Frequently asked questions


How much does the Soniox API cost?
Speech-to-Text costs about $0.10/hour for async (file uploads) and about $0.12/hour for real-time (streaming). Advanced use cases with translation, custom context, or fine-grained control are billed by token usage. Text-to-Speech is token-based, about $0.70/hour of generated speech. Use the calculator above to estimate your spend.

How does Soniox compare to Google, Azure, and OpenAI on price?
Soniox real-time is $0.12/hour. Google Speech-to-Text V2 starts around $0.96/hour and Azure Speech around $1.00/hour for real-time, so Soniox costs roughly one-eighth as much. OpenAI has no native streaming and starts at $0.36/hour for batch transcription.
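
As a rough illustration of that comparison, the sketch below turns the hourly rates quoted in this answer into monthly costs and price ratios. The rates are the approximate public figures stated above, and the function names are illustrative only.

```python
# Approximate real-time hourly rates in USD, as quoted in the answer above.
HOURLY_RATES_USD = {
    "Soniox": 0.12,
    "Google Speech-to-Text V2": 0.96,
    "Azure Speech": 1.00,
}


def monthly_costs(hours_per_month):
    """Monthly cost in USD per provider at a given audio volume."""
    return {
        name: rate * hours_per_month
        for name, rate in HOURLY_RATES_USD.items()
    }


def ratio_vs_soniox(provider):
    """How many times the Soniox rate the given provider charges."""
    return HOURLY_RATES_USD[provider] / HOURLY_RATES_USD["Soniox"]


if __name__ == "__main__":
    for name, cost in monthly_costs(1_000).items():
        print(f"{name}: ${cost:,.2f} per month, "
              f"{ratio_vs_soniox(name):.1f}x the Soniox rate")
```

At 1,000 hours per month these rates give about $120 for Soniox, $960 for Google, and $1,000 for Azure, which is where the roughly 8x figure comes from.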

Is translation included, or billed separately?
Included. Soniox transcribes and translates across 60+ languages in the same real-time API call at no extra cost. OpenAI, Google, and Azure bill translation as a separate service (Azure’s add-on alone is about $2.50/hour).

Is Soniox cheaper than Deepgram?
Yes. Soniox is $0.10–0.12/hour, while Deepgram Nova-3 with comparable add-ons (keyterms, diarization) runs about $0.39–0.55/hour, roughly 4–5x more. See the full breakdown on our Soniox vs Deepgram page.

Do I pay extra for diarization, language detection, or formatting?
No. Speaker diarization, language identification, and smart formatting are bundled into the hourly rate. Most providers charge for these as add-ons: for example, Deepgram diarization adds about $0.12/hour and Azure real-time diarization about $0.30/hour.

What is the difference between real-time and async pricing?
Real-time streaming is $0.12/hour and async file transcription is $0.10/hour. Both run the same model with the same accuracy and features.

How much does Soniox Text-to-Speech cost?
Text-to-Speech is token-based: $4.00 per 1M input text tokens and $21.50 per 1M output audio tokens, about $0.70 per hour of generated speech.

Ready to get started?

Create an account instantly, or contact us to design a custom package for your business.
