Real-time transcription with Web SDK
Create and manage real-time speech-to-text sessions with the Soniox Web SDK
Soniox Web SDK supports real-time transcription over WebSocket directly in the browser. This allows you to transcribe live audio with low latency — ideal for live captions, voice input, and interactive experiences.
You can capture audio from the user's microphone, consume results via events or buffers that group tokens into utterances, and manage sessions with built-in connection handling.
Create a real-time recording session
client.realtime.record() is the high-level API for capturing audio and streaming it to Soniox for real-time transcription.
It returns a Recording instance synchronously so you can attach event listeners before any async work
(microphone access, API key fetch, WebSocket connection) begins.
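A minimal sketch, assuming an event-emitter style `.on()` API; the import path, `apiKey` option, and model name below are assumptions, not confirmed SDK surface:

```ts
import { SonioxClient } from "@soniox/speech-to-text-web"; // import path assumed

const client = new SonioxClient({ apiKey: "<SONIOX_API_KEY>" }); // option name assumed

// record() returns a Recording synchronously, so listeners can be attached
// before microphone access, the API key fetch, or the WebSocket connect begins.
const recording = client.realtime.record({
  model: "stt-rt-preview", // model name assumed
});

recording.on("connected", () => {
  console.log("Connected and streaming");
});
```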
Listen for results
The result event fires every time the server returns a transcription update.
Each RealtimeResult contains an array of RealtimeToken objects — both
finalized and in-progress tokens.
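For example (a sketch: the `tokens`, `text`, and `is_final` field names are assumptions about the token shape):

```ts
let finalText = "";

recording.on("result", (result) => {
  let interimText = "";
  for (const token of result.tokens) {
    if (token.is_final) {
      finalText += token.text; // finalized tokens never change
    } else {
      interimText += token.text; // in-progress tokens may still be revised
    }
  }
  console.log(finalText + interimText);
});
```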
Handle session events
| Event | Payload | Description |
|---|---|---|
| `result` | `RealtimeResult` | Transcription result received from the server. |
| `error` | `Error` | An error occurred during recording. |
| `endpoint` | — | Endpoint detected (speaker finished talking). |
| `finalized` | — | Server completed finalization of current tokens. |
| `finished` | — | Server acknowledged the end of the stream. Fires before the `stopped` state. |
| `connected` | — | WebSocket connected and streaming. |
| `state_change` | `{ old_state, new_state }` | Recording state transition. |
| `source_muted` | — | Audio source was muted externally (e.g. OS-level or hardware mute). |
| `source_unmuted` | — | Audio source was unmuted after an external mute. |
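For instance, a few listeners side by side (same assumed `.on()` style as above):

```ts
recording.on("error", (error) => {
  console.error("Recording failed:", error.message);
});

recording.on("finished", () => {
  console.log("Server acknowledged the end of the stream");
});

recording.on("source_muted", () => {
  console.log("Audio source was muted externally");
});
```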
Session lifecycle
A Recording transitions through a set of states. The lifecycle is fully managed — audio buffering during connection, keepalive during pause, and cleanup on stop or error are all handled automatically.
States
| State | Description |
|---|---|
| `idle` | Initial state before any work begins. |
| `starting` | Audio source is starting and the API key is being fetched. Audio is buffered. |
| `connecting` | WebSocket connection is being established. |
| `recording` | Actively capturing and streaming audio. |
| `paused` | Audio capture and streaming are paused. Keepalive messages maintain the connection. You are still charged for the open session while it is paused. |
| `stopping` | `stop()` was called. Waiting for the server to finish processing remaining audio. |
| `stopped` | Gracefully stopped. All final results have been received. |
| `error` | An error occurred. Resources have been cleaned up. |
| `canceled` | Canceled via `cancel()` or an `AbortSignal`. |
Methods
stop(): Promise<void>
Gracefully stops the recording. Stops the audio source and waits for the server to process all remaining audio and return final results.
cancel(): void
Immediately cancels the recording without waiting for final results. Closes the WebSocket connection and releases all resources.
pause(): void
Pauses audio capture and streaming. The WebSocket connection stays open with automatic keepalive messages.
You are charged for the full stream duration even when the session is paused.
resume(): void
Resumes audio capture and streaming after a pause.
finalize(options?): void
Requests the server to finalize current non-final tokens. Useful for forcing finalization at a specific point (e.g. before displaying a completed sentence).
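For example (a sketch; `recording` is the instance returned by `client.realtime.record()`, and the calls below are alternatives, not a sequence to run as-is):

```ts
// Gracefully end the session and wait for all remaining final results.
await recording.stop();

// Or tear everything down immediately, e.g. when the user navigates away.
recording.cancel();

// Pause and resume without closing the WebSocket connection.
recording.pause();
recording.resume();

// Force finalization of the current in-progress tokens.
recording.finalize();
```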
Tracking state changes
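Every transition is reported through the `state_change` event, so you can drive UI state from a single listener (payload fields from the events table above):

```ts
recording.on("state_change", ({ old_state, new_state }) => {
  console.log(`Recording state: ${old_state} -> ${new_state}`);

  if (new_state === "recording") {
    // e.g. show a live "recording" indicator
  } else if (new_state === "stopped" || new_state === "error") {
    // e.g. re-enable the start button
  }
});
```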
Endpoint detection and manual finalization
Endpoint detection lets you know when a speaker has finished speaking. This is critical for real-time voice AI assistants, command-and-response systems, and conversational apps where you want to respond immediately without waiting for long silences.
Read more about Endpoint detection
Enable endpoint detection by setting `enable_endpoint_detection: true` in the session configuration, then listen for the `endpoint` event to know when a speaker has finished speaking:
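```ts
const recording = client.realtime.record({
  model: "stt-rt-preview",         // model name assumed
  enable_endpoint_detection: true, // exact config placement assumed
});

recording.on("endpoint", () => {
  // The speaker has finished talking: respond right away instead of
  // waiting for a long silence, e.g. hand the transcript to your agent.
});
```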
Manual finalization gives you precise control over when audio should be finalized — useful for push-to-talk systems and client-side voice activity detection (VAD).
Read more about Manual finalization
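For example, a push-to-talk sketch (the button element is a placeholder):

```ts
const talkButton = document.querySelector<HTMLButtonElement>("#talk")!; // placeholder element

// Finalize as soon as the user releases the talk button instead of
// waiting for the server to finalize on its own.
talkButton.addEventListener("pointerup", () => {
  recording.finalize();
});

recording.on("finalized", () => {
  // All previously in-progress tokens are now final.
});
```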
Pause, resume, and mute the audio source
The recording also reacts to system-level mute events: when the source is muted externally, it starts sending keepalive messages to keep the session alive.
You are billed for the full stream duration even when the session is paused.
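For example (a sketch; the button elements are placeholders):

```ts
const pauseButton = document.querySelector<HTMLButtonElement>("#pause")!;   // placeholder
const resumeButton = document.querySelector<HTMLButtonElement>("#resume")!; // placeholder

pauseButton.onclick = () => recording.pause();   // keepalives keep the session open
resumeButton.onclick = () => recording.resume();

recording.on("source_muted", () => {
  // OS-level or hardware mute: show an indicator; the session stays alive.
});

recording.on("source_unmuted", () => {
  // External mute lifted; capture continues.
});
```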
Handling translation
The SDK supports one-way and two-way real-time translation. Configure translation in the session config, then filter tokens by translation_status to separate original and translated text.
One-way translation
Translates all spoken audio into a single target language.
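A sketch (the `translation` config shape below is an assumption modeled on the Soniox real-time API):

```ts
const recording = client.realtime.record({
  model: "stt-rt-preview", // model name assumed
  translation: {
    type: "one_way",
    target_language: "es", // translate everything spoken into Spanish
  },
});
```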
Two-way translation
Translates between two languages — each speaker's speech is translated into the other language.
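A sketch under the same assumptions as the one-way example:

```ts
const recording = client.realtime.record({
  model: "stt-rt-preview", // model name assumed
  translation: {
    type: "two_way",
    language_a: "en", // English speech is translated into Spanish...
    language_b: "es", // ...and Spanish speech into English
  },
});
```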
Translation token fields
When translation is enabled, each RealtimeToken includes:
| Field | Type | Description |
|---|---|---|
| `translation_status` | `'none' \| 'original' \| 'translation'` | Whether this token is original speech or a translation. |
| `source_language` | `string` | The source language code for translated tokens. |
| `language` | `string` | The language of this token's text. |
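For example, splitting a result into original and translated text (the `tokens` and `text` field names are assumptions, as above):

```ts
recording.on("result", (result) => {
  let original = "";
  let translated = "";
  for (const token of result.tokens) {
    if (token.translation_status === "translation") {
      translated += token.text;
    } else {
      original += token.text; // "original" or "none"
    }
  }
  console.log({ original, translated });
});
```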
Learn more about Real-time translation
You can provide custom translation terms in the context to improve translation accuracy.
Handle permissions
The SDK provides a platform-agnostic permission system for checking and requesting microphone access before starting a recording. This is optional but recommended for a good user experience — you can show appropriate UI based on the permission state rather than waiting for the recording to fail.
Setup
Pass a BrowserPermissionResolver when creating the client:
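For example (the import path and both option names below are assumptions):

```ts
import { SonioxClient, BrowserPermissionResolver } from "@soniox/speech-to-text-web"; // path assumed

const client = new SonioxClient({
  apiKey: "<SONIOX_API_KEY>",                          // option name assumed
  permissionResolver: new BrowserPermissionResolver(), // option name assumed
});
```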
Check permission status
check() queries the current microphone permission without prompting the user:
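For example (a sketch; the possible status values are assumptions modeled on the browser PermissionState values):

```ts
const resolver = new BrowserPermissionResolver();

const status = await resolver.check(); // e.g. "granted" | "denied" | "prompt" (values assumed)
if (status === "denied") {
  // Show UI explaining how to re-enable microphone access in browser settings.
}
```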
Request permission
request() triggers the browser permission prompt. On platforms where
permission is already granted, this is a no-op.
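For example (same assumed status values as above; `resolver` and `client` come from the setup snippets):

```ts
const status = await resolver.request(); // prompts only if permission was not already granted
if (status === "granted") {
  const recording = client.realtime.record({ model: "stt-rt-preview" }); // model name assumed
}
```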
Only create a `BrowserPermissionResolver` in browser environments, since it relies on browser-only APIs that are not available during server-side rendering.
Use custom audio source
By default, client.realtime.record() uses the built-in MicrophoneSource which captures audio via getUserMedia and MediaRecorder.
You can replace it with any object that implements the AudioSource interface.
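As an illustration only, a prerecorded-audio source; the real AudioSource contract is defined by the SDK's type declarations, and every method name and option below is a placeholder:

```ts
// Placeholder sketch: these method names and signatures are NOT the SDK's
// actual AudioSource contract; consult the SDK types for the real shape.
class PrerecordedSource /* implements AudioSource */ {
  constructor(private chunks: Uint8Array[]) {}

  // Hypothetical: called by the SDK to begin producing audio chunks.
  async start(onChunk: (chunk: Uint8Array) => void): Promise<void> {
    for (const chunk of this.chunks) onChunk(chunk);
  }

  // Hypothetical: called by the SDK on stop or cancel.
  async stop(): Promise<void> {}
}

const chunks: Uint8Array[] = []; // fill with encoded audio data
const recording = client.realtime.record({
  source: new PrerecordedSource(chunks), // option name assumed
});
```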