
Real-time transcription with Web SDK

Create and manage real-time speech-to-text sessions with the Soniox Web SDK

Soniox Web SDK supports real-time transcription over WebSocket directly in the browser. This allows you to transcribe live audio with low latency — ideal for live captions, voice input, and interactive experiences.

You can capture audio from the user's microphone, consume results via events or buffers that group tokens into utterances, and manage sessions with built-in connection handling.

Create a real-time recording session

client.realtime.record() is the high-level API for capturing audio and streaming it to Soniox for real-time transcription. It returns a Recording instance synchronously so you can attach event listeners before any async work (microphone access, API key fetch, WebSocket connection) begins.

const recording = client.realtime.record({
  // speech-to-text model to use
  model: "stt-rt-v4",

  // Optional: hint expected languages
  language_hints: ["en", "es"],

  // Optional: enable speaker identification
  enable_speaker_diarization: true,

  // Optional: detect utterance boundaries (useful for voice agents)
  enable_endpoint_detection: true,

  // Optional: provide domain context to improve accuracy
  context: {
    terms: ["Soniox", "WebSocket"],
    general: [{ key: "domain", value: "technology" }],
  },

  // ... other options ...
});

Listen for results

The result event fires every time the server returns a transcription update.

Each RealtimeResult contains an array of RealtimeToken objects — both finalized and in-progress tokens.

recording.on("result", (result) => {
  const text = result.tokens.map((t) => t.text).join("");
  if (text) console.log(text);
});
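For live captions you typically want to separate committed text from the provisional tail. A minimal sketch, assuming each RealtimeToken exposes an is_final flag (as in the Soniox real-time API; check the RealtimeToken type for the exact field) and using a hypothetical render helper:

let finalText = "";

recording.on("result", (result) => {
  let interimText = "";
  for (const token of result.tokens) {
    if (token.is_final) {
      finalText += token.text; // committed; will not change
    } else {
      interimText += token.text; // provisional; may still be revised
    }
  }
  // render() is a hypothetical UI helper, e.g. show interim text in a lighter style
  render(finalText, interimText);
});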

Handle session events

| Event | Payload | Description |
| --- | --- | --- |
| result | RealtimeResult | Transcription result received from the server. |
| error | Error | An error occurred during recording. |
| endpoint | | Endpoint detected (speaker finished talking). |
| finalized | | Server completed finalization of current tokens. |
| finished | | Server acknowledged end of stream. Fires before the stopped state. |
| connected | | WebSocket connected and streaming. |
| state_change | { old_state, new_state } | Recording state transition. |
| source_muted | | Audio source was muted externally (e.g. OS-level or hardware mute). |
| source_unmuted | | Audio source was unmuted after an external mute. |
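Beyond result, you will usually want at least error and finished handlers:

recording.on("connected", () => {
  console.log("WebSocket connected; audio is streaming");
});

recording.on("error", (error) => {
  console.error("Recording failed:", error);
});

recording.on("finished", () => {
  console.log("Server acknowledged end of stream");
});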

Session lifecycle

A Recording transitions through a set of states. The lifecycle is fully managed — audio buffering during connection, keepalive during pause, and cleanup on stop or error are all handled automatically.

States

| State | Description |
| --- | --- |
| idle | Initial state before any work begins. |
| starting | Audio source is starting and the API key is being fetched. Audio is buffered. |
| connecting | WebSocket connection is being established. |
| recording | Actively capturing and streaming audio. |
| paused | Audio capture and streaming paused. Keepalive messages maintain the connection. You are still charged for the open session while it is paused. |
| stopping | stop() called. Waiting for the server to finish processing remaining audio. |
| stopped | Gracefully stopped. All final results have been received. |
| error | An error occurred. Resources have been cleaned up. |
| canceled | Canceled via cancel() or an AbortSignal. |

Methods

stop(): Promise<void>

Gracefully stops the recording. Stops the audio source and waits for the server to process all remaining audio and return final results.

await recording.stop();
// All final results have been received at this point

cancel(): void

Immediately cancels the recording without waiting for final results. Closes the WebSocket connection and releases all resources.

recording.cancel();

pause(): void

Pauses audio capture and streaming. The WebSocket connection stays open with automatic keepalive messages.

recording.pause();
console.log(recording.state); // 'paused'

You are charged for the full stream duration even when the session is paused.

resume(): void

Resumes audio capture and streaming after a pause.

recording.resume();
console.log(recording.state); // 'recording'

finalize(options?): void

Requests the server to finalize current non-final tokens. Useful for forcing finalization at a specific point (e.g. before displaying a completed sentence).

recording.finalize();

// With trailing silence trimming:
recording.finalize({ trailing_silence_ms: 500 });

Tracking state changes

recording.on("state_change", ({ old_state, new_state }) => {
  console.log(`${old_state} → ${new_state}`);
});

Endpoint detection and manual finalization

Endpoint detection lets you know when a speaker has finished speaking. This is critical for real-time voice AI assistants, command-and-response systems, and conversational apps where you want to respond immediately without waiting for long silences.

Read more about Endpoint detection

Enable endpoint detection by setting enable_endpoint_detection: true in the session configuration.

Listen for the endpoint event to know when a speaker has finished speaking.

recording.on("endpoint", () => {
  console.log("--- speaker finished ---");
});
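A common voice-agent pattern combines the two events: accumulate finalized text from result events, then hand the utterance off when an endpoint fires. A sketch, assuming tokens expose an is_final flag and using a hypothetical respondTo function:

let utterance = "";

recording.on("result", (result) => {
  for (const token of result.tokens) {
    if (token.is_final) utterance += token.text;
  }
});

recording.on("endpoint", () => {
  if (utterance.trim()) {
    respondTo(utterance); // hypothetical: send the completed utterance to your agent
    utterance = "";
  }
});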

Manual finalization gives you precise control over when audio should be finalized — useful for Push-to-talk systems and client-side voice activity detection (VAD).

Read more about Manual finalization

recording.finalize();
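In a push-to-talk UI, for example, you might finalize when the user releases the talk key (the key binding here is illustrative):

// Push-to-talk: finalize pending tokens when the talk key is released
document.addEventListener("keyup", (event) => {
  if (event.code === "Space") {
    recording.finalize({ trailing_silence_ms: 500 });
  }
});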

Pausing, resuming, and muting the audio source

recording.pause();   // keeps connection alive, drops audio while paused
recording.resume();  // resume sending audio

The recording also reacts to system-level mute events: while the audio source is muted, it sends keepalive messages to keep the session alive.
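To reflect mute state in your UI, listen for the source_muted and source_unmuted events (the indicator helpers are illustrative):

recording.on("source_muted", () => {
  showMutedIndicator(); // hypothetical UI helper
});

recording.on("source_unmuted", () => {
  hideMutedIndicator(); // hypothetical UI helper
});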

You are billed for the full stream duration even when the session is paused.

Handling translation

The SDK supports one-way and two-way real-time translation. Configure translation in the session config, then filter tokens by translation_status to separate original and translated text.

One-way translation

Translates all spoken audio into a single target language.

const recording = client.realtime.record({
  model: "stt-rt-v4",
  translation: {
    type: "one_way",
    target_language: "es", // Translate everything to Spanish
  },
});

recording.on("result", (result) => {
  for (const token of result.tokens) {
    if (token.translation_status === "original") {
      console.log("[Original]", token.text);
    } else if (token.translation_status === "translation") {
      console.log("[Translated]", token.text);
    }
  }
});

Two-way translation

Translates between two languages — each speaker's speech is translated into the other language.

const recording = client.realtime.record({
  model: "stt-rt-v4",
  translation: {
    type: "two_way",
    language_a: "en",
    language_b: "fr",
  },
});
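Consuming two-way results works the same way as one-way; you can additionally use the per-token language fields (described in the table below) to tell the two directions apart:

recording.on("result", (result) => {
  for (const token of result.tokens) {
    if (token.translation_status === "translation") {
      console.log(`[${token.source_language} → ${token.language}]`, token.text);
    } else if (token.translation_status === "original") {
      console.log(`[${token.language}]`, token.text);
    }
  }
});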

Translation token fields

When translation is enabled, each RealtimeToken includes:

| Field | Type | Description |
| --- | --- | --- |
| translation_status | 'none' \| 'original' \| 'translation' | Whether this token is original speech or a translation. |
| source_language | string | The source language code for translated tokens. |
| language | string | The language of this token's text. |

Learn more about Real-time translation

You can provide custom translation terms in the context to improve translation accuracy.

Handle permissions

The SDK provides a platform-agnostic permission system for checking and requesting microphone access before starting a recording. This is optional but recommended for a good user experience — you can show appropriate UI based on the permission state rather than waiting for the recording to fail.

Setup

Pass a BrowserPermissionResolver when creating the client:

import { SonioxClient, BrowserPermissionResolver } from "@soniox/client";

const client = new SonioxClient({
  api_key: fetchKey,
  permissions: new BrowserPermissionResolver(),
});

Check permission status

check() queries the current microphone permission without prompting the user:

const result = await client.permissions?.check("microphone");

switch (result?.status) {
  case "granted":
    // Microphone access already granted — safe to record
    break;
  case "prompt":
    // User hasn't been asked yet — show a "start recording" button
    break;
  case "denied":
    if (!result.can_request) {
      // Permanently denied — show "go to browser settings" instructions
    }
    break;
  case "unavailable":
    // No microphone or getUserMedia not supported
    break;
}

Request permission

request() triggers the browser permission prompt. On platforms where permission is already granted, this is a no-op.

const result = await client.permissions?.request("microphone");

if (result?.status === "granted") {
  startRecording();
} else if (result?.status === "denied") {
  showPermissionDeniedMessage();
}
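Putting check() and request() together, one possible gating flow before starting a recording:

async function ensureMicrophoneAccess() {
  const checked = await client.permissions?.check("microphone");
  if (checked?.status === "granted") return true;
  if (checked?.status === "unavailable") return false;
  if (checked?.status === "prompt" || checked?.can_request) {
    const requested = await client.permissions?.request("microphone");
    return requested?.status === "granted";
  }
  return false; // permanently denied; direct the user to browser settings instead
}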

Only create a BrowserPermissionResolver in browser environments, since it relies on browser-only APIs.

Use custom audio source

By default, client.realtime.record() uses the built-in MicrophoneSource which captures audio via getUserMedia and MediaRecorder. You can replace it with any object that implements the AudioSource interface.
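The exact AudioSource contract is defined by the SDK's type definitions, so treat the following as a purely illustrative sketch: a source that wraps an existing MediaStream, assuming a start/stop shape loosely modeled on the built-in MicrophoneSource.

// Purely illustrative: check the SDK's AudioSource type for the real
// method names and signatures before implementing your own source.
class MediaStreamSource {
  constructor(stream) {
    this.stream = stream;
  }

  start(onChunk) {
    // Reuse MediaRecorder, as the built-in MicrophoneSource does
    this.recorder = new MediaRecorder(this.stream);
    this.recorder.ondataavailable = (event) => onChunk(;
    this.recorder.start(120); // emit an audio chunk roughly every 120 ms
  }

  stop() {
    this.recorder?.stop();
    this.stream.getTracks().forEach((track) => track.stop());
  }
}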