Real-time transcription with Node SDK

Create and manage real-time speech-to-text sessions with the Soniox Node SDK

The Soniox Node SDK supports real-time streaming transcription over WebSocket. This lets you transcribe live audio with low latency, which is ideal for voice agents, live captions, and interactive experiences. You can consume results via events, async iteration, or buffers that group tokens into utterances. The SDK provides helper methods for both direct and proxy streaming.

Direct stream and temporary API keys

Read more about Direct stream

The Node SDK provides a helper method for issuing temporary API keys to use with Direct stream from the client's browser.

const { api_key, expires_at } = await client.auth.createTemporaryKey({
  usage_type: 'transcribe_websocket',
  expires_in_seconds: 3600,
  client_reference_id: 'support-call-123',
});

console.log(api_key, expires_at);

Soniox's Web Library handles everything client-side — capturing microphone input, managing the WebSocket connection, and authenticating using temporary API keys.
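
On the server side, you would typically expose createTemporaryKey behind a small endpoint that the browser calls before connecting. A minimal sketch using Node's built-in http module (the /temp-key route and port are illustrative, not part of the SDK):

import { createServer } from 'node:http';
import { SonioxNodeClient } from '@soniox/node';

const client = new SonioxNodeClient();

// Illustrative endpoint: the browser fetches a short-lived key from here,
// then connects to Soniox directly with it.
createServer(async (req, res) => {
  if (req.method === 'POST' && req.url === '/temp-key') {
    const { api_key, expires_at } = await client.auth.createTemporaryKey({
      usage_type: 'transcribe_websocket',
      expires_in_seconds: 300,
    });
    res.setHeader('Content-Type', 'application/json');
    res.end(JSON.stringify({ api_key, expires_at }));
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(3000);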

Proxy stream helpers

Read more about Proxy stream

Use the SDK's real-time session for low-latency transcription, live captions, and voice agent experiences.

Create a real-time session

const session = client.realtime.stt({
  model: 'stt-rt-v4',
  audio_format: 'pcm_s16le',
  sample_rate: 16000,
  num_channels: 1,
  enable_endpoint_detection: true,
  enable_speaker_diarization: true,
  language_hints: ['en'],
  context: {
    text: 'Support call about billing',
    terms: ['invoice', 'refund'],
  },
});

Connect and stream

Use sendAudio to send audio chunks to the session.

await session.connect();

session.on('result', (result) => {
  process.stdout.write(result.tokens.map(t => t.text).join(''));
});

for await (const chunk of audioStream) {
  session.sendAudio(chunk);
}

await session.finish();

See the full example with a demo stream in the quickstart: Create your first real-time session

Handle session events

session.on('connected', () => console.log('connected'));
session.on('disconnected', (reason) => console.log('disconnected:', reason));
session.on('error', (error) => console.error('error:', error));

session.on('result', (result) => console.log(result.tokens.map(t => t.text).join('')));
session.on('endpoint', () => console.log('endpoint'));
session.on('finalized', () => console.log('finalized'));
session.on('finished', () => console.log('finished'));

Session lifecycle

// Connect to the session
await session.connect(); // idle -> connected

// Send audio chunks to the session
for await (const chunk of audioStream) {
  session.sendAudio(chunk);
}

// Gracefully end the session: signal end of audio and wait for remaining results from the server
await session.finish();

// Or cancel immediately:
session.close(); // connected -> closed

Endpoint detection and manual finalization

Endpoint detection lets you know when a speaker has finished speaking. This is critical for real-time voice AI assistants, command-and-response systems, and conversational apps where you want to respond immediately without waiting for long silences.

Read more about Endpoint detection

Enable endpoint detection by setting enable_endpoint_detection: true in the session configuration.

const session = client.realtime.stt({
  model: 'stt-rt-v4',
  enable_endpoint_detection: true,
});

Manual finalization gives you precise control over when audio should be finalized. This is useful for push-to-talk systems and client-side voice activity detection (VAD).

Read more about Manual finalization

session.finalize();
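
For example, in a push-to-talk flow you might finalize as soon as the user releases the talk button instead of waiting for endpoint detection. A sketch (the button hook is hypothetical; session and the finalized event come from the examples above):

// Call this from your UI when the user releases the push-to-talk button.
function onTalkButtonReleased() {
  // Ask the server to finalize all audio sent so far.
  session.finalize();
}

session.on('finalized', () => {
  // All tokens for the utterance are now final and safe to process.
  console.log('utterance finalized');
});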

Pause and resume

session.pause();   // keeps connection alive, drops audio while paused
session.resume();  // resume sending audio

You are billed for the full stream duration even while the session is paused.

In a typical voice agent loop, you pause the STT session while the agent is responding to avoid transcribing the agent's own audio or processing overlapping speech:

session.on("endpoint", async () => {
  const utterance = utteranceBuffer.markEndpoint(); // Read more about utterance buffer below
  if (!utterance) return;

  // Pause STT while the agent processes and responds
  session.pause();

  const response = await myAgent.respond(utterance.text);
  // ... send response audio to the client ...

  // Resume listening for the next utterance
  session.resume();
});

Keepalive

Read more about Connection keepalive

The Node SDK automatically sends keepalive messages while the session is paused via session.pause().

You can also send keepalive messages manually:

session.sendKeepalive();
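
For example, if you stop sending audio without calling session.pause(), you could keep the connection alive on a timer. A sketch (the 5-second interval is illustrative; the required frequency depends on the server's timeout):

let keepaliveTimer: ReturnType<typeof setInterval> | undefined;

function onAudioIdle() {
  // Ping the server periodically while no audio is flowing.
  keepaliveTimer = setInterval(() => session.sendKeepalive(), 5000);
}

function onAudioResumed() {
  if (keepaliveTimer) clearInterval(keepaliveTimer);
  keepaliveTimer = undefined;
}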

Detecting utterances for voice agents

When building voice AI agents, you need to know when the user has finished speaking so you can process their input. The SDK provides RealtimeUtteranceBuffer to collect streaming tokens into complete utterances, driven by the server's endpoint detection.

How it works

  1. Set enable_endpoint_detection: true in the session config – the server detects when the user stops speaking and emits an endpoint event.
  2. Feed every result event into the buffer with addResult().
  3. When an endpoint fires, call markEndpoint() to flush the buffer and get the complete utterance.

Example

import { SonioxNodeClient, RealtimeUtteranceBuffer } from "@soniox/node";

const client = new SonioxNodeClient();

// Call this for each new user/connection - each session needs its own buffer
function createAgentSession(onUtterance: (text: string) => void) {
  const session = client.realtime.stt({
    model: "stt-rt-v4",
    enable_endpoint_detection: true,
  });

  // Each session gets its own buffer
  const utteranceBuffer = new RealtimeUtteranceBuffer({
    final_only: true,
  });

  session.on("result", (result) => {
    utteranceBuffer.addResult(result);
  });

  session.on("endpoint", () => {
    const utterance = utteranceBuffer.markEndpoint();
    if (utterance) {
      onUtterance(utterance.text);
    }
  });

  return session;
}

// Usage: create a session per user connection
const session = createAgentSession((text) => {
  console.log("User said:", text);
  // Pass to your LLM / agent pipeline
});

await session.connect();
session.sendAudio(audioChunk);

Streaming audio from a file

Use sendStream() to pipe audio directly from a file (or any async source) into a real-time session. It accepts any AsyncIterable – Node.js file streams, Web ReadableStream, Bun file streams, fetch response bodies, or custom async generators.
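
For example, streaming a raw PCM file from disk. A sketch: it assumes the file's encoding matches the session's audio_format and that sendStream resolves once the source is drained (check the SDK reference for the exact behavior):

import { createReadStream } from 'node:fs';

const session = client.realtime.stt({
  model: 'stt-rt-v4',
  audio_format: 'pcm_s16le',
  sample_rate: 16000,
  num_channels: 1,
});

session.on('result', (result) => {
  process.stdout.write(result.tokens.map(t => t.text).join(''));
});

await session.connect();

// fs.createReadStream is an AsyncIterable of Buffer chunks.
await session.sendStream(createReadStream('audio.raw'));

await session.finish();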

Simulating real-time pace

When streaming pre-recorded files, you can throttle sending with pace_ms to simulate how audio would arrive from a live source (e.g. a microphone). This isn't needed for live audio – it naturally arrives at real-time pace.
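
Continuing the file example above, a sketch that assumes pace_ms is passed as an option to sendStream (consult the SDK reference for the exact signature):

// Throttle chunks so the file plays out roughly at real-time pace.
// The option placement and value are illustrative.
await session.sendStream(createReadStream('audio.raw'), { pace_ms: 120 });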

Use sendAudio if you need more control.