Real-time transcription with Node SDK

Create and manage real-time speech-to-text sessions with the Soniox Node SDK

The Soniox Node SDK supports real-time streaming transcription over WebSocket. This lets you transcribe live audio with low latency, which is ideal for voice agents, live captions, and interactive experiences. You can consume results via events, async iteration, or buffers that group tokens into utterances. The SDK provides helper methods for both direct and proxy streaming.

Direct stream and temporary API keys

Read more about Direct stream

The Node SDK provides a helper method for issuing temporary API keys to use with Direct stream from the client's browser.

const { api_key, expires_at } = await client.auth.createTemporaryKey({
  usage_type: 'transcribe_websocket',
  expires_in_seconds: 3600,
  client_reference_id: 'support-call-123',
});

console.log(api_key, expires_at);

Soniox's Web Library handles everything client-side — capturing microphone input, managing the WebSocket connection, and authenticating using temporary API keys.
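
On the server side, you would typically expose createTemporaryKey behind a small endpoint that the browser calls before connecting. A minimal sketch using Node's built-in http module (the /temp-key route and port are illustrative, not part of the SDK):

import { createServer } from 'node:http';
import { SonioxNodeClient } from '@soniox/node';

const client = new SonioxNodeClient();

// Illustrative endpoint: the browser fetches a short-lived key from here,
// then connects to Soniox directly with it.
createServer(async (req, res) => {
  if (req.method === 'POST' && req.url === '/temp-key') {
    const { api_key, expires_at } = await client.auth.createTemporaryKey({
      usage_type: 'transcribe_websocket',
      expires_in_seconds: 300,
    });
    res.setHeader('Content-Type', 'application/json');
    res.end(JSON.stringify({ api_key, expires_at }));
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(3000);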

Proxy stream helpers

Read more about Proxy stream

Use the SDK's real-time session for low-latency transcription, live captions, and voice agent experiences.

Create a real-time session

const session = client.realtime.stt({
  model: 'stt-rt-v4',
  audio_format: 'pcm_s16le',
  sample_rate: 16000,
  num_channels: 1,
  enable_endpoint_detection: true,
  enable_speaker_diarization: true,
  language_hints: ['en'],
  context: {
    text: 'Support call about billing',
    terms: ['invoice', 'refund'],
  },
});

Connect and stream

Use sendAudio to send audio chunks to the session.

await session.connect();

session.on('result', (result) => {
  process.stdout.write(result.tokens.map(t => t.text).join(''));
});

for await (const chunk of audioStream) {
  session.sendAudio(chunk);
}

await session.finish();

See the full example with a demo stream in the quickstart: Create your first real-time session

Handle session events

session.on('connected', () => console.log('connected'));
session.on('disconnected', (reason) => console.log('disconnected:', reason));
session.on('error', (error) => console.error('error:', error));

session.on('result', (result) => console.log(result.tokens.map(t => t.text).join('')));
session.on('endpoint', () => console.log('endpoint'));
session.on('finalized', () => console.log('finalized'));
session.on('finished', () => console.log('finished'));

Session lifecycle

// Connect to the session
await session.connect(); // idle -> connected

// Send audio chunks to the session
for await (const chunk of audioStream) {
  session.sendAudio(chunk);
}

// Gracefully end the session: signal end of audio and wait for remaining results from the server
await session.finish();

// Or cancel immediately:
session.close(); // connected -> closed

Endpoint detection and manual finalization

Endpoint detection lets you know when a speaker has finished speaking. This is critical for real-time voice AI assistants, command-and-response systems, and conversational apps where you want to respond immediately without waiting for long silences.

Read more about Endpoint detection

Enable endpoint detection by setting enable_endpoint_detection: true in the session configuration.

const session = client.realtime.stt({
  model: 'stt-rt-v4',
  enable_endpoint_detection: true,
});

Manual finalization gives you precise control over when audio should be finalized. This is useful for push-to-talk systems and client-side voice activity detection (VAD).

Read more about Manual finalization

session.finalize();
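
For example, in a push-to-talk flow you might finalize as soon as the user releases the talk button instead of waiting for endpoint detection. A sketch (the button hook is hypothetical; session and the finalized event come from the examples above):

// Call this from your UI when the user releases the push-to-talk button.
function onTalkButtonReleased() {
  // Ask the server to finalize all audio sent so far.
  session.finalize();
}

session.on('finalized', () => {
  // All tokens for the utterance are now final and safe to process.
  console.log('utterance finalized');
});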

Pause and resume

session.pause();   // keeps connection alive, drops audio while paused
session.resume();  // resume sending audio

You are billed for the full stream duration even while the session is paused.

In a typical voice agent loop, you pause the STT session while the agent is responding to avoid transcribing the agent's own audio or processing overlapping speech:

session.on("endpoint", async () => {
  const utterance = utteranceBuffer.markEndpoint(); // Read more about utterance buffer below
  if (!utterance) return;

  // Pause STT while the agent processes and responds
  session.pause();

  const response = await myAgent.respond(utterance.text);
  // ... send response audio to the client ...

  // Resume listening for the next utterance
  session.resume();
});

Keepalive

Read more about Connection keepalive

The Node SDK automatically sends keepalive messages while the session is paused via session.pause().

You can also send keepalive messages manually:

session.sendKeepalive();
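
For example, if you stop sending audio without calling session.pause(), you could keep the connection alive on a timer. A sketch (the 5-second interval is illustrative; the required frequency depends on the server's timeout):

let keepaliveTimer: ReturnType<typeof setInterval> | undefined;

function onAudioIdle() {
  // Ping the server periodically while no audio is flowing.
  keepaliveTimer = setInterval(() => session.sendKeepalive(), 5000);
}

function onAudioResumed() {
  if (keepaliveTimer) clearInterval(keepaliveTimer);
  keepaliveTimer = undefined;
}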

Detecting utterances for voice agents

When building voice AI agents, you need to know when the user has finished speaking so you can process their input. The SDK provides RealtimeUtteranceBuffer to collect streaming tokens into complete utterances, driven by the server's endpoint detection.

How it works

  1. Set enable_endpoint_detection: true in the session config – the server detects when the user stops speaking and emits an endpoint event.
  2. Feed every result event into the buffer with addResult().
  3. When an endpoint fires, call markEndpoint() to flush the buffer and get the complete utterance.

Example

import { SonioxNodeClient, RealtimeUtteranceBuffer } from "@soniox/node";

const client = new SonioxNodeClient();

// Call this for each new user/connection - each session needs its own buffer
function createAgentSession(onUtterance: (text: string) => void) {
  const session = client.realtime.stt({
    model: "stt-rt-v4",
    enable_endpoint_detection: true,
  });

  // Each session gets its own buffer
  const utteranceBuffer = new RealtimeUtteranceBuffer({
    final_only: true,
  });

  session.on("result", (result) => {
    utteranceBuffer.addResult(result);
  });

  session.on("endpoint", () => {
    const utterance = utteranceBuffer.markEndpoint();
    if (utterance) {
      onUtterance(utterance.text);
    }
  });

  return session;
}

// Usage: create a session per user connection
const session = createAgentSession((text) => {
  console.log("User said:", text);
  // Pass to your LLM / agent pipeline
});

await session.connect();
session.sendAudio(audioChunk);

Streaming audio from a file

Use sendStream() to pipe audio directly from a file (or any async source) into a real-time session. It accepts any AsyncIterable – Node.js file streams, Web ReadableStream, Bun file streams, fetch response bodies, or custom async generators.
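
For example, streaming a raw PCM file from disk. A sketch: it assumes the file's encoding matches the session's audio_format and that sendStream resolves once the source is drained (check the SDK reference for the exact behavior):

import { createReadStream } from 'node:fs';

const session = client.realtime.stt({
  model: 'stt-rt-v4',
  audio_format: 'pcm_s16le',
  sample_rate: 16000,
  num_channels: 1,
});

session.on('result', (result) => {
  process.stdout.write(result.tokens.map(t => t.text).join(''));
});

await session.connect();

// fs.createReadStream is an AsyncIterable of Buffer chunks.
await session.sendStream(createReadStream('audio.raw'));

await session.finish();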

Simulating real-time pace

When streaming pre-recorded files, you can throttle sending with pace_ms to simulate how audio would arrive from a live source (e.g. a microphone). This isn't needed for live audio – it naturally arrives at real-time pace.
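
Continuing the file example above, a sketch that assumes pace_ms is passed as an option to sendStream (consult the SDK reference for the exact signature):

// Throttle chunks so the file plays out roughly at real-time pace.
// The option placement and value are illustrative.
await session.sendStream(createReadStream('audio.raw'), { pace_ms: 120 });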

Use sendAudio if you need more control.