Real-time transcription with Web SDK
Create and manage real-time speech-to-text sessions with the Soniox Web SDK
Soniox Web SDK supports real-time transcription over WebSocket directly in the browser. This allows you to transcribe live audio with low latency — ideal for live captions, voice input, and interactive experiences.
You can capture audio from the user's microphone, consume results via events or buffers that group tokens into utterances, and manage sessions with built-in connection handling.
Create a real-time recording session
client.realtime.record() is the high-level API for capturing audio and streaming it to Soniox for real-time transcription.
It returns a Recording instance synchronously so you can attach event listeners before any async work
(microphone access, API key fetch, WebSocket connection) begins.
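A minimal sketch, assuming an event-emitter style `.on()` API; the import path, `apiKey` option, and model name below are assumptions, not confirmed SDK surface:

```ts
import { SonioxClient } from "@soniox/speech-to-text-web"; // import path assumed

const client = new SonioxClient({ apiKey: "<SONIOX_API_KEY>" }); // option name assumed

// record() returns a Recording synchronously, so listeners can be attached
// before microphone access, the API key fetch, or the WebSocket connect begins.
const recording = client.realtime.record({
  model: "stt-rt-preview", // model name assumed
});

recording.on("connected", () => {
  console.log("Connected and streaming");
});
```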
Listen for results
The result event fires every time the server returns a transcription update.
Each RealtimeResult contains an array of RealtimeToken objects — both
finalized and in-progress tokens.
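For example (a sketch: the `tokens`, `text`, and `is_final` field names are assumptions about the token shape):

```ts
let finalText = "";

recording.on("result", (result) => {
  let interimText = "";
  for (const token of result.tokens) {
    if (token.is_final) {
      finalText += token.text; // finalized tokens never change
    } else {
      interimText += token.text; // in-progress tokens may still be revised
    }
  }
  console.log(finalText + interimText);
});
```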
Handle session events
| Event | Payload | Description |
|---|---|---|
| `result` | `RealtimeResult` | Transcription result received from the server. |
| `error` | `Error` | An error occurred during recording. |
| `endpoint` | — | Endpoint detected (speaker finished talking). |
| `finalized` | — | Server completed finalization of current tokens. |
| `finished` | — | Server acknowledged the end of the stream. Fires before the `stopped` state. |
| `connected` | — | WebSocket connected and streaming. |
| `state_change` | `{ old_state, new_state }` | Recording state transition. |
| `source_muted` | — | Audio source was muted externally (e.g. OS-level or hardware mute). |
| `source_unmuted` | — | Audio source was unmuted after an external mute. |
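For instance, a few listeners side by side (same assumed `.on()` style as above):

```ts
recording.on("error", (error) => {
  console.error("Recording failed:", error.message);
});

recording.on("finished", () => {
  console.log("Server acknowledged the end of the stream");
});

recording.on("source_muted", () => {
  console.log("Audio source was muted externally");
});
```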
Session lifecycle
A Recording transitions through a set of states. The lifecycle is fully managed — audio buffering during connection, keepalive during pause, and cleanup on stop or error are all handled automatically.
States
| State | Description |
|---|---|
| `idle` | Initial state before any work begins. |
| `starting` | Audio source is starting and the API key is being fetched. Audio is buffered. |
| `connecting` | WebSocket connection is being established. |
| `recording` | Actively capturing and streaming audio. |
| `paused` | Audio capture and streaming are paused. Keepalive messages maintain the connection. You are still charged for the open session while it is paused. |
| `stopping` | `stop()` was called. Waiting for the server to finish processing remaining audio. |
| `stopped` | Gracefully stopped. All final results have been received. |
| `error` | An error occurred. Resources have been cleaned up. |
| `canceled` | Canceled via `cancel()` or an `AbortSignal`. |
Methods
stop(): Promise<void>
Gracefully stops the recording. Stops the audio source and waits for the server to process all remaining audio and return final results.
cancel(): void
Immediately cancels the recording without waiting for final results. Closes the WebSocket connection and releases all resources.
pause(): void
Pauses audio capture and streaming. The WebSocket connection stays open with automatic keepalive messages.
You are charged for the full stream duration even when the session is paused.
resume(): void
Resumes audio capture and streaming after a pause.
finalize(options?): void
Requests the server to finalize current non-final tokens. Useful for forcing finalization at a specific point (e.g. before displaying a completed sentence).
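For example (a sketch; `recording` is the instance returned by `client.realtime.record()`, and the calls below are alternatives, not a sequence to run as-is):

```ts
// Gracefully end the session and wait for all remaining final results.
await recording.stop();

// Or tear everything down immediately, e.g. when the user navigates away.
recording.cancel();

// Pause and resume without closing the WebSocket connection.
recording.pause();
recording.resume();

// Force finalization of the current in-progress tokens.
recording.finalize();
```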
Tracking state changes
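Every transition is reported through the `state_change` event, so you can drive UI state from a single listener (payload fields from the events table above):

```ts
recording.on("state_change", ({ old_state, new_state }) => {
  console.log(`Recording state: ${old_state} -> ${new_state}`);

  if (new_state === "recording") {
    // e.g. show a live "recording" indicator
  } else if (new_state === "stopped" || new_state === "error") {
    // e.g. re-enable the start button
  }
});
```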
Endpoint detection and manual finalization
Endpoint detection lets you know when a speaker has finished speaking. This is critical for real-time voice AI assistants, command-and-response systems, and conversational apps where you want to respond immediately without waiting for long silences.
Read more about Endpoint detection
Enable endpoint detection by setting `enable_endpoint_detection: true` in the session configuration, then listen for the `endpoint` event to know when a speaker has finished speaking:
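```ts
const recording = client.realtime.record({
  model: "stt-rt-preview",         // model name assumed
  enable_endpoint_detection: true, // exact config placement assumed
});

recording.on("endpoint", () => {
  // The speaker has finished talking: respond right away instead of
  // waiting for a long silence, e.g. hand the transcript to your agent.
});
```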
Manual finalization gives you precise control over when audio should be finalized — useful for push-to-talk systems and client-side voice activity detection (VAD).
Read more about Manual finalization
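For example, a push-to-talk sketch (the button element is a placeholder):

```ts
const talkButton = document.querySelector<HTMLButtonElement>("#talk")!; // placeholder element

// Finalize as soon as the user releases the talk button instead of
// waiting for the server to finalize on its own.
talkButton.addEventListener("pointerup", () => {
  recording.finalize();
});

recording.on("finalized", () => {
  // All previously in-progress tokens are now final.
});
```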
Pause, resume, and mute the audio source
The recording also reacts to system-level mute events: when the source is muted externally, it starts sending keepalive messages to keep the session alive.
You are billed for the full stream duration even when the session is paused.
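For example (a sketch; the button elements are placeholders):

```ts
const pauseButton = document.querySelector<HTMLButtonElement>("#pause")!;   // placeholder
const resumeButton = document.querySelector<HTMLButtonElement>("#resume")!; // placeholder

pauseButton.onclick = () => recording.pause();   // keepalives keep the session open
resumeButton.onclick = () => recording.resume();

recording.on("source_muted", () => {
  // OS-level or hardware mute: show an indicator; the session stays alive.
});

recording.on("source_unmuted", () => {
  // External mute lifted; capture continues.
});
```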
Handling translation
The SDK supports one-way and two-way real-time translation. Configure translation in the session config, then filter tokens by translation_status to separate original and translated text.
One-way translation
Translates all spoken audio into a single target language.
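A sketch (the `translation` config shape below is an assumption modeled on the Soniox real-time API):

```ts
const recording = client.realtime.record({
  model: "stt-rt-preview", // model name assumed
  translation: {
    type: "one_way",
    target_language: "es", // translate everything spoken into Spanish
  },
});
```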
Two-way translation
Translates between two languages — each speaker's speech is translated into the other language.
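A sketch under the same assumptions as the one-way example:

```ts
const recording = client.realtime.record({
  model: "stt-rt-preview", // model name assumed
  translation: {
    type: "two_way",
    language_a: "en", // English speech is translated into Spanish...
    language_b: "es", // ...and Spanish speech into English
  },
});
```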
Translation token fields
When translation is enabled, each RealtimeToken includes:
| Field | Type | Description |
|---|---|---|
| `translation_status` | `'none' \| 'original' \| 'translation'` | Whether this token is original speech or a translation. |
| `source_language` | `string` | The source language code for translated tokens. |
| `language` | `string` | The language of this token's text. |
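For example, splitting a result into original and translated text (the `tokens` and `text` field names are assumptions, as above):

```ts
recording.on("result", (result) => {
  let original = "";
  let translated = "";
  for (const token of result.tokens) {
    if (token.translation_status === "translation") {
      translated += token.text;
    } else {
      original += token.text; // "original" or "none"
    }
  }
  console.log({ original, translated });
});
```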
Learn more about Real-time translation
You can provide custom translation terms in the context to improve translation accuracy.
Handle permissions
The SDK provides a platform-agnostic permission system for checking and requesting microphone access before starting a recording. This is optional but recommended for a good user experience — you can show appropriate UI based on the permission state rather than waiting for the recording to fail.
Setup
Pass a BrowserPermissionResolver when creating the client:
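For example (the import path and both option names below are assumptions):

```ts
import { SonioxClient, BrowserPermissionResolver } from "@soniox/speech-to-text-web"; // path assumed

const client = new SonioxClient({
  apiKey: "<SONIOX_API_KEY>",                          // option name assumed
  permissionResolver: new BrowserPermissionResolver(), // option name assumed
});
```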
Check permission status
check() queries the current microphone permission without prompting the user:
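For example (a sketch; the possible status values are assumptions modeled on the browser PermissionState values):

```ts
const resolver = new BrowserPermissionResolver();

const status = await resolver.check(); // e.g. "granted" | "denied" | "prompt" (values assumed)
if (status === "denied") {
  // Show UI explaining how to re-enable microphone access in browser settings.
}
```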
Request permission
request() triggers the browser permission prompt. On platforms where
permission is already granted, this is a no-op.
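For example (same assumed status values as above; `resolver` and `client` come from the setup snippets):

```ts
const status = await resolver.request(); // prompts only if permission was not already granted
if (status === "granted") {
  const recording = client.realtime.record({ model: "stt-rt-preview" }); // model name assumed
}
```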
Only create a `BrowserPermissionResolver` in browser environments, since it relies on browser-only APIs that are not available during server-side rendering.
Use custom audio source
By default, client.realtime.record() uses the built-in MicrophoneSource which captures audio via getUserMedia and MediaRecorder.
You can replace it with any object that implements the AudioSource interface.
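As an illustration only, a prerecorded-audio source; the real AudioSource contract is defined by the SDK's type declarations, and every method name and option below is a placeholder:

```ts
// Placeholder sketch: these method names and signatures are NOT the SDK's
// actual AudioSource contract; consult the SDK types for the real shape.
class PrerecordedSource /* implements AudioSource */ {
  constructor(private chunks: Uint8Array[]) {}

  // Hypothetical: called by the SDK to begin producing audio chunks.
  async start(onChunk: (chunk: Uint8Array) => void): Promise<void> {
    for (const chunk of this.chunks) onChunk(chunk);
  }

  // Hypothetical: called by the SDK on stop or cancel.
  async stop(): Promise<void> {}
}

const chunks: Uint8Array[] = []; // fill with encoded audio data
const recording = client.realtime.record({
  source: new PrerecordedSource(chunks), // option name assumed
});
```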