Build voice apps for the real world.

Power transcription, translation, and speaker-aware understanding that keeps up with natural conversations and holds up in production. One API. 60+ languages. High accuracy, low latency.

One API built for how people actually speak

One API for the world

Transcribe and translate in 60+ languages with production-ready accuracy. No model switching. No per-language setup. No glue code.

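import os
import requests

# Submit an audio URL for asynchronous transcription.
# The API key is read from the SONIOX_API_KEY environment variable.
response = requests.post(
    "https://api.soniox.com/v1/transcriptions",
    json={
        "audio_url": "https://example.com/audio.mp3",
        "model": "stt-async-preview",
    },
    headers={"Authorization": f"Bearer {os.environ['SONIOX_API_KEY']}"},
)
response.raise_for_status()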

Real time that’s actually real time

Get token-level updates in milliseconds. No buffering, no batch lag, no awkward handoffs. Feels live because it is.

{text: "Wh", start_ms: 540, end_ms: 540,…}
{text: "at", start_ms: 540, end_ms: 600,…}
{text: " is", start_ms: 660, end_ms: 720,…}
{text: " y", start_ms: 780, end_ms: 840,…}
{text: "our", start_ms: 840, end_ms: 900,…}
{text: " best", start_ms: 900, end_ms: 960,…}
{text: " s", start_ms: 1200, end_ms: 1260,…}
{text: "eller", start_ms: 1320, end_ms: 1380,…}
{text: " h", start_ms: 1440, end_ms: 1500,…}
{text: "ere", start_ms: 1560, end_ms: 1620,…}
{text: "?", start_ms: 1620, end_ms: 1680,…}
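
The token stream above can be stitched back into readable text on the client. The sketch below is a minimal illustration, assuming each token carries its own leading whitespace (as the sample suggests) and that tokens arrive in order. The helper names `join_tokens` and `duration_ms` are hypothetical and not part of the Soniox API.

```python
# Tokens copied from the sample stream above; elided fields are left out.
tokens = [
    {"text": "Wh", "start_ms": 540, "end_ms": 540},
    {"text": "at", "start_ms": 540, "end_ms": 600},
    {"text": " is", "start_ms": 660, "end_ms": 720},
    {"text": " y", "start_ms": 780, "end_ms": 840},
    {"text": "our", "start_ms": 840, "end_ms": 900},
    {"text": " best", "start_ms": 900, "end_ms": 960},
    {"text": " s", "start_ms": 1200, "end_ms": 1260},
    {"text": "eller", "start_ms": 1320, "end_ms": 1380},
    {"text": " h", "start_ms": 1440, "end_ms": 1500},
    {"text": "ere", "start_ms": 1560, "end_ms": 1620},
    {"text": "?", "start_ms": 1620, "end_ms": 1680},
]


def join_tokens(tokens):
    """Concatenate token text in arrival order (hypothetical helper).

    Assumes tokens are sub-word pieces that include their own leading spaces.
    """
    return "".join(t["text"] for t in tokens)


def duration_ms(tokens):
    """Time covered by the tokens, from the first start to the last end."""
    return tokens[-1]["end_ms"] - tokens[0]["start_ms"]


print(join_tokens(tokens))   # What is your best seller here?
print(duration_ms(tokens))   # 1140
```

Because the pieces are sub-word units, joining with an empty string rather than a space keeps words such as "seller" intact.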

Built to handle real speech

Detects speakers. Tracks language shifts. Structures chaos. Works the way people talk, not the way models wish they did.

{
  "enable_speaker_diarization": true,
  "enable_endpoint_detection": true,
  "enable_language_identification": true
}

Everything in one call

No patching together transcription, speaker logic, and translation. Soniox handles the full stream — from raw audio to structured output.

{
  "tokens": [{
      "text": "Hola",
      "start_ms": 600,
      "end_ms": 760,
      "confidence": 0.97,
      "is_final": true,
      "speaker": "1",
      "language": "es",
      "translation_status": "translation",
      "source_language": "en"
  }]
}
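
Structured tokens like the one above lend themselves to simple post-processing. The following sketch separates translated tokens from spoken ones and groups each set into speaker turns. It relies only on the `text`, `speaker`, `language`, and `translation_status` fields shown in the sample. Treating any token whose `translation_status` is not `"translation"` as spoken audio is an assumption, and the function names and sample tokens are hypothetical.

```python
from itertools import groupby


def split_by_translation(tokens):
    """Return (spoken, translated) token lists.

    Assumption: only tokens marked "translation" are translated output;
    everything else is treated as the original spoken text.
    """
    spoken = [t for t in tokens if t.get("translation_status") != "translation"]
    translated = [t for t in tokens if t.get("translation_status") == "translation"]
    return spoken, translated


def speaker_turns(tokens):
    """Merge consecutive tokens that share a speaker and language into one turn."""
    turns = []
    key = lambda t: (t.get("speaker"), t.get("language"))
    for (speaker, language), group in groupby(tokens, key=key):
        text = "".join(t["text"] for t in group).strip()
        turns.append({"speaker": speaker, "language": language, "text": text})
    return turns


# Hypothetical sample data for illustration only.
sample = [
    {"text": "Hello", "speaker": "1", "language": "en"},
    {"text": " there", "speaker": "1", "language": "en"},
    {"text": "Hola", "speaker": "1", "language": "es",
     "translation_status": "translation", "source_language": "en"},
    {"text": " Hi", "speaker": "2", "language": "en"},
]

spoken, translated = split_by_translation(sample)
print(speaker_turns(spoken))
print(speaker_turns(translated))
```

Splitting before grouping keeps a translated token from interrupting a speaker's turn in the original language.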

Everything you need to build great voice apps

Multilingual transcription

Convert speech to text in over 60 languages. No per-language setup required.

Real-time translation

Get real-time translated output alongside the original audio stream.

Real-time streaming

Receive token-level transcription and translation in milliseconds.

Two-way translation

Stream multilingual conversations and receive real-time transcription and translation in both directions.

Language detection

Automatically identify the spoken language without preconfiguration.

Speaker diarization

Automatically detect and label individual speakers in any conversation.

Endpoint detection

Track when speakers start and stop talking in real time.

Latency control

Adjust how quickly tokens become final — trade off speed for accuracy to match your real-time needs.

Custom vocabulary

Improve accuracy for names, acronyms, or domain-specific terms.

Async file transcription

Submit audio files via URL or upload for offline processing.

Structured output

Get labeled, speaker-aware output ready for downstream use.
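
The latency control feature above depends on the difference between final and non-final tokens. One way a client might handle that difference is sketched below. It assumes that each update delivers a list of tokens, that tokens with `is_final` set to true are sent once and never change, and that non-final tokens are provisional and superseded by the next update. That delivery model is an assumption for illustration, and `TranscriptBuffer` is a hypothetical name.

```python
class TranscriptBuffer:
    """Keeps committed text separate from provisional text (hypothetical helper)."""

    def __init__(self):
        self.final_text = ""    # text that will not change again
        self.pending_text = ""  # provisional text from the latest update only

    def update(self, tokens):
        """Apply one update and return the text to display right now."""
        pending = []
        for token in tokens:
            if token.get("is_final"):
                # Final tokens are appended once and kept.
                self.final_text += token["text"]
            else:
                # Non-final tokens replace whatever was pending before.
                pending.append(token["text"])
        self.pending_text = "".join(pending)
        return self.final_text + self.pending_text


buffer = TranscriptBuffer()
print(buffer.update([{"text": "Wh", "is_final": False}]))
print(buffer.update([{"text": "What", "is_final": True},
                     {"text": " is", "is_final": False}]))
print(buffer.update([{"text": " is", "is_final": True}]))
```

A user interface could render `final_text` in a normal style and `pending_text` in a lighter one, so viewers see fast updates without mistaking them for settled output.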

Helping startups and enterprises ship real-world voice apps

Samsung
Deliver Health
Avodah
Mobius
Scribe
Agora

Power every speech experience, in any language

Transcribe and translate 60+ languages in real time

Get accurate speech-to-text and instant translation. No language config required.

Build fast, responsive voice agents and assistants

Stream audio over WebSocket and receive token-level output that stays in sync with users.

Secure medical transcription with custom vocab

Capture clinical conversations with speaker labels, term tuning, and HIPAA-compliant infrastructure, using our REST API.

Generate live captions and subtitle files

Output timestamped, speaker-aware text in formats like SRT or VTT, or display live captions.

Analyze calls with structured transcription

Use custom context to improve accuracy and output labeled, segmented transcripts for QA and insights.
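
For the captioning use case above, timestamped tokens can be turned into an SRT file on the client side. The sketch below is one possible approach, not a Soniox feature: it assumes tokens expose `text`, `start_ms`, and `end_ms` as in the earlier samples, and it starts a new caption whenever the pause between tokens exceeds a threshold. The function names and the 500 ms default are hypothetical choices.

```python
def srt_timestamp(ms):
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def tokens_to_cues(tokens, max_gap_ms=500):
    """Group tokens into caption cues, splitting on pauses longer than max_gap_ms."""
    cues = []
    for token in tokens:
        if cues and token["start_ms"] - cues[-1]["end_ms"] <= max_gap_ms:
            # Short pause: extend the current cue.
            cues[-1]["text"] += token["text"]
            cues[-1]["end_ms"] = token["end_ms"]
        else:
            # First token or long pause: start a new cue.
            cues.append({"text": token["text"],
                         "start_ms": token["start_ms"],
                         "end_ms": token["end_ms"]})
    return cues


def to_srt(cues):
    """Render cues as SRT text: index, time range, caption, blank line."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        time_range = f"{srt_timestamp(cue['start_ms'])} --> {srt_timestamp(cue['end_ms'])}"
        blocks.append(f"{index}\n{time_range}\n{cue['text'].strip()}\n")
    return "\n".join(blocks)


# Hypothetical tokens for illustration only.
demo = [
    {"text": "Hello", "start_ms": 0, "end_ms": 400},
    {"text": " world", "start_ms": 450, "end_ms": 900},
    {"text": " Bye", "start_ms": 2000, "end_ms": 2300},
]
print(to_srt(tokens_to_cues(demo)))
```

Splitting on pauses is the simplest rule; a production captioner would likely also cap cue length and line width.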

Get API key

Privacy and compliance, built right in

Never stored, never saved.

Audio stays in memory, and everything is processed in real time.

Built for privacy-critical use cases.

SOC 2 Type II–certified and HIPAA-ready from day one.

Trusted where privacy matters most.

Used in industries where speech is sensitive — from healthcare to enterprise.

Get started with the Soniox API

Start building

Create your account and generate an API key. Includes $200 in free credits.

Explore the docs

Find guides, API reference, and code samples to help you build fast.

Join our Discord

Ask questions, get feedback, and connect with other builders.