Real-time transcription with Node SDK
Create and manage real-time speech-to-text sessions with the Soniox Node SDK
The Soniox Node SDK supports real-time streaming transcription over WebSocket. This allows you to transcribe live audio with low latency — ideal for voice agents, live captions, and interactive experiences. You can consume results via events, async iteration, or buffers that group tokens into utterances. The SDK provides helper methods for working with both direct and proxy streaming.
Direct stream and temporary API keys
Read more about Direct stream
The Node SDK provides a helper method for issuing temporary API keys to use with Direct stream from the client's browser.
Soniox's Web Library handles everything client-side — capturing microphone input, managing the WebSocket connection, and authenticating using temporary API keys.
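A minimal server-side sketch of issuing a temporary API key for the browser. The client class, method, and option names below are illustrative assumptions; check the SDK reference for the exact helper.

```typescript
import { SonioxClient } from "@soniox/node"; // package name is an assumption

// Keep your main API key on the server only.
const client = new SonioxClient({ apiKey: process.env.SONIOX_API_KEY });

// Issue a short-lived key the browser can use to open a Direct stream.
// Method and option names here are illustrative.
const tempKey = await client.createTemporaryApiKey({
  usage_type: "transcribe_websocket",
  expires_in_seconds: 60,
});

// Return only the temporary key to the browser; never expose the main key.
```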
Proxy stream helpers
Read more about Proxy stream
Use the SDK's real-time session for low-latency transcription, live captions, and voice agent experiences.
Create a real-time session
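A sketch of creating a session. The client class, createRealtimeSession method, and the model name are assumptions; only the real-time session concept comes from this page.

```typescript
import { SonioxClient } from "@soniox/node"; // package name is an assumption

const client = new SonioxClient({ apiKey: process.env.SONIOX_API_KEY });

// Method and option names are illustrative; "stt-rt-preview" stands in for
// whichever real-time model you use.
const session = client.createRealtimeSession({
  model: "stt-rt-preview",
});
```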
Connect and stream
Use sendAudio to send audio chunks to the session.
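A sketch of pumping audio chunks into a session. sendAudio is named on this page; connect() and the loose session type are assumptions.

```typescript
// Connect, then forward each audio chunk as it becomes available,
// e.g. raw PCM bytes from a microphone capture loop.
async function streamAudio(session: any, audioSource: AsyncIterable<Buffer>) {
  await session.connect(); // connect() is an assumed method name
  for await (const chunk of audioSource) {
    session.sendAudio(chunk); // sendAudio is documented on this page
  }
}
```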
See the full example with a demo stream in the quickstart: Create your first real-time session
Handle session events
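A sketch of subscribing to session events, assuming an event-emitter style API. The event names and result shape below are assumptions; consult the SDK reference for the exact set.

```typescript
// Partial and final tokens arrive as the server transcribes.
session.on("result", (result: { tokens: { text: string }[] }) => {
  console.log(result.tokens.map((t) => t.text).join(""));
});

// Fired when endpoint detection decides the speaker finished an utterance.
session.on("endpoint", () => {
  console.log("speaker finished an utterance");
});

session.on("error", (err: Error) => console.error(err));
session.on("close", () => console.log("session closed"));
```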
Session lifecycle
Endpoint detection and manual finalization
Endpoint detection lets you know when a speaker has finished speaking. This is critical for real-time voice AI assistants, command-and-response systems, and conversational apps where you want to respond immediately without waiting for long silences.
Read more about Endpoint detection
Enable endpoint detection by setting enable_endpoint_detection: true in the session configuration.
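A sketch of enabling endpoint detection, assuming a client created as in the earlier sections. enable_endpoint_detection is documented on this page; the method, other option names, and the event name are assumptions.

```typescript
const session = client.createRealtimeSession({
  model: "stt-rt-preview",
  enable_endpoint_detection: true, // server signals when speech ends
});

session.on("endpoint", () => {
  // Respond immediately instead of waiting out a long silence.
});
```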
Manual finalization gives you precise control over when audio should be finalized — useful for Push-to-talk systems and client-side voice activity detection (VAD).
Read more about Manual finalization
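A push-to-talk style sketch: request finalization as soon as your own client-side VAD reports the end of speech. Both the vad object and the finalize() method name are assumptions.

```typescript
// When local VAD decides the user stopped talking, ask the server to
// finalize all buffered audio immediately.
vad.on("speechEnd", () => {
  session.finalize(); // assumed method name for the finalize request
});
```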
Pause and resume
You are billed for the full stream duration even while the session is paused.
In a typical voice agent loop, you pause the STT session while the agent is responding to avoid transcribing the agent's own audio or processing overlapping speech:
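A sketch of that loop. pause() appears on this page; resume() is assumed as its counterpart, and the agent object is a hypothetical stand-in for your response pipeline.

```typescript
session.on("endpoint", async () => {
  session.pause();        // stop transcribing while the agent speaks
  await agent.respond();  // hypothetical: synthesize and play the reply
  session.resume();       // assumed counterpart to pause(); listen again
});
```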
Keepalive
Read more about Connection keepalive
When a session is paused via session.pause(), the Node SDK automatically sends keepalive messages on your behalf.
You can also send keepalive messages manually:
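A sketch of manual keepalives while no audio is flowing. The sendKeepalive() method name is an assumption.

```typescript
// Ping the server every 15 seconds so the connection is not closed as idle.
const timer = setInterval(() => session.sendKeepalive(), 15_000);

// Once audio starts flowing again, stop the manual keepalives.
clearInterval(timer);
```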
Detecting utterances for voice agents
When building voice AI agents, you need to know when the user has finished speaking so you can process their input. The SDK provides RealtimeUtteranceBuffer to collect streaming tokens into complete utterances, driven by the server's endpoint detection.
How it works
- Set enable_endpoint_detection: true in the session config – the server detects when the user stops speaking and emits an endpoint event.
- Feed every result event into the buffer with addResult().
- When an endpoint fires, call markEndpoint() to flush the buffer and get the complete utterance.
Example
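A sketch wiring the buffer to a session. RealtimeUtteranceBuffer, addResult, and markEndpoint are named on this page; the import path, event names, and the downstream handler are assumptions.

```typescript
import { RealtimeUtteranceBuffer } from "@soniox/node"; // path assumed

const buffer = new RealtimeUtteranceBuffer();

// Feed every result event into the buffer as tokens stream in.
session.on("result", (result: unknown) => {
  buffer.addResult(result);
});

// On an endpoint, flush the buffer and hand the utterance to your agent.
session.on("endpoint", () => {
  const utterance = buffer.markEndpoint();
  handleUserInput(utterance); // hypothetical downstream handler
});
```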
Streaming audio from a file
Use sendStream() to pipe audio directly from a file (or any async source) into a real-time session. It accepts any AsyncIterable – Node.js file streams, Web ReadableStream, Bun file streams, fetch response bodies, or custom async generators.
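A sketch of streaming a file, assuming a session from the earlier sections. sendStream is named on this page; a Node file stream satisfies its AsyncIterable input directly. The file path is only an example.

```typescript
import { createReadStream } from "node:fs";

// A Node.js readable stream is an AsyncIterable of Buffer chunks, so it can
// be passed to sendStream as-is.
await session.sendStream(createReadStream("meeting.raw"));
```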
Simulating real-time pace
When streaming pre-recorded files, you can throttle sending with pace_ms to simulate how audio would arrive from a live source (e.g. a microphone). This isn't needed for live audio – it naturally arrives at real-time pace.
Use sendAudio if you need more control.
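For that finer control, one approach is to pace the chunks yourself and feed each one to sendAudio. The helper below is self-contained; only sendAudio comes from this page, and the 120 ms interval is just an example.

```typescript
// Yield chunks from any iterable at a fixed interval, simulating how audio
// would arrive from a live source such as a microphone.
async function* paced<T>(
  chunks: Iterable<T>,
  intervalMs: number,
): AsyncGenerator<T> {
  for (const chunk of chunks) {
    yield chunk;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage sketch with a session from the earlier sections:
// for await (const chunk of paced(chunks, 120)) session.sendAudio(chunk);
```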