Direct stream

Stream directly from microphone to Soniox Speech-to-Text WebSocket API to minimize latency.

Overview

This guide walks you through capturing and transcribing microphone audio in real time using the Soniox WebSocket API — optimized for the lowest possible latency.

The direct stream approach enables the browser to send audio directly to the Soniox WebSocket API over a WebSocket connection, eliminating the need for any intermediary server. This results in faster transcription and a simpler architecture.

Soniox's Web Library handles everything client-side — capturing microphone input, managing the WebSocket connection, and authenticating using temporary API keys.

Use this setup when you want real-time speech-to-text performance directly in the browser with minimal delay.

(Figure: Soniox Speech-to-Text direct stream flowchart)

Temporary API keys

Temporary API keys (obtained from the REST API) are required solely to establish the WebSocket connection. Once the connection is established, it is kept alive for as long as the session remains active; the key is not needed again. The expires_in_seconds configuration parameter should therefore be set to a short duration.

The following parameters are required to create a temporary API key:

{
  "usage_type": "transcribe_websocket",
  "expires_in_seconds": 60
}

API request limits apply when creating temporary API keys. See the Limits section in the Soniox Console.
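As a standalone sketch of minting a temporary key (assuming the requests package is installed and a SONIOX_API_KEY environment variable holds your main API key; the endpoint URL and field names match the server example below):

```python
import os


def build_temp_key_request() -> tuple[str, dict, dict]:
    """Build the URL, headers, and JSON body for minting a temporary API key."""
    url = "https://api.soniox.com/v1/auth/temporary-api-key"
    headers = {
        "Authorization": f"Bearer {os.getenv('SONIOX_API_KEY')}",
        "Content-Type": "application/json",
    }
    payload = {
        "usage_type": "transcribe_websocket",
        "expires_in_seconds": 60,  # keep the lifetime short
    }
    return url, headers, payload


def mint_temporary_api_key() -> str:
    """POST the request and return the temporary key from the response."""
    import requests  # third-party dependency, imported lazily

    url, headers, payload = build_temp_key_request()
    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()["api_key"]
```

The short expiry means a leaked key becomes useless within a minute, while an already-established WebSocket connection is unaffected.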


Example

This example shows browser-based transcription, but the same principle applies to any other type of client: you minimize latency by connecting the client directly to the WebSocket API using a temporary API key.

First we create a simple HTTP server that, on request:

  1. Renders the index.html template.
  2. Exposes an endpoint to serve the temporary API key (/temporary-api-key).

Python server using FastAPI:

import os
 
import requests
import uvicorn
from dotenv import load_dotenv
from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse, JSONResponse
from fastapi.templating import Jinja2Templates
 
load_dotenv()
 
templates = Jinja2Templates(directory="templates")
 
app = FastAPI()
 
 
@app.get("/", response_class=HTMLResponse)
async def get_index(request: Request):
    return templates.TemplateResponse(
        request=request,
        name="index.html",
    )
 
 
@app.get("/temporary-api-key", response_class=JSONResponse)
async def get_temporary_api_key():
    try:
        response = requests.post(
            "https://api.soniox.com/v1/auth/temporary-api-key",
            headers={
                "Authorization": f"Bearer {os.getenv('SONIOX_API_KEY')}",
                "Content-Type": "application/json",
            },
            json={
                "usage_type": "transcribe_websocket",
                "expires_in_seconds": 60,
            },
        )
 
        if not response.ok:
            raise Exception(f"Error: {response.json()}")
 
        temporary_api_key_data = response.json()
        return temporary_api_key_data
    except Exception as error:
        print(error)
        return JSONResponse(
            status_code=500,
            content={"error": f"Server failed to obtain temporary api key: {error}"},
        )
 
 
if __name__ == "__main__":
    port = int(os.getenv("PORT", 3001))
    uvicorn.run(app, host="0.0.0.0", port=port)
View example on GitHub

Our HTML client template contains a single "Start" button that, when clicked:

  1. Requests microphone permissions.
  2. Calls the /temporary-api-key endpoint to obtain a temporary API key.
  3. Creates a new RecordTranscribe class instance, passing the temporary API key as the apiKey parameter.
  4. Connects to the WebSocket API.
  5. Starts transcribing from microphone input and renders transcribed text into a div in real-time.
<!DOCTYPE html>
<html>
 
<body>
  <h1>Browser direct stream example</h1>
  <button id="trigger">Start</button>
  <hr />
  <div>
    <span id="final"></span>
    <span id="nonfinal" style="color: gray"></span>
  </div>
  <div id="error"></div>
  <script type="module">
    // import Soniox STT Web Library
    import { RecordTranscribe } from "https://unpkg.com/@soniox/speech-to-text-web?module";
 
    const finalEl = document.getElementById("final");
    const nonFinalEl = document.getElementById("nonfinal");
    const errorEl = document.getElementById("error");
    const trigger = document.getElementById("trigger");
 
    let recordTranscribe;
    let recordTranscribeState = "stopped"; // "stopped" | "starting" | "running" | "stopping"
 
    async function getTemporaryApiKey() {
      const response = await fetch('/temporary-api-key');
      return await response.json();
    }
 
    trigger.onclick = async () => {
      if (recordTranscribeState === "stopped") {
        finalEl.textContent = "";
        nonFinalEl.textContent = "";
        errorEl.textContent = "";
        trigger.textContent = "Starting...";
        recordTranscribeState = "starting";
 
        // obtain a temporary api key from our server
        const response = await getTemporaryApiKey();
        const temporaryApiKey = response.api_key;
 
        if (!temporaryApiKey) {
          errorEl.textContent += response.error || "Error fetching temp api key.";
          resetTrigger();
          return;
        }
 
        // create new instance of RecordTranscribe class and authenticate with temp API key
        recordTranscribe = new RecordTranscribe({
          apiKey: temporaryApiKey
        });
 
        let finalText = "";
 
        // start transcribing and bind callbacks
        recordTranscribe.start({
          model: "stt-rt-preview",
          languageHints: ["en"],
          onStarted: () => {
            // library connected to Soniox STT WebSocket API and is transcribing
            recordTranscribeState = "running";
            trigger.textContent = "Stop";
          },
          onPartialResult: (result) => {
            // render the transcript
            let nonFinalText = "";
 
            for (let token of result.tokens) {
              if (token.is_final) {
                finalText += token.text;
              } else {
                nonFinalText += token.text;
              }
            }
 
            finalEl.textContent = finalText;
            nonFinalEl.textContent = nonFinalText;
          },
          onFinished: () => {
            // transcription finished, we go back to initial state
            resetTrigger();
          },
          onError: (status, message) => {
            console.log("Error occurred", status, message);
            errorEl.textContent = message;
            resetTrigger();
          },
        });
      } else if (recordTranscribeState === "running") {
        // stop transcribing and wait for final result and connections to close
        trigger.textContent = "Stopping...";
        recordTranscribeState = "stopping";
        recordTranscribe.stop();
      }
    };
 
    function resetTrigger() {
      trigger.textContent = "Start";
      recordTranscribeState = "stopped";
    }
  </script>
</body>
 
</html>
View example on GitHub
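The onPartialResult logic above, which appends final tokens to a growing transcript while re-rendering the still-changing non-final tail on each callback, can be sketched in Python (the token shape is assumed from the example: dicts with text and is_final fields):

```python
def split_tokens(tokens: list[dict], final_text: str) -> tuple[str, str]:
    """Append final tokens to the accumulated transcript and collect
    the current non-final (still changing) tail separately."""
    non_final_text = ""
    for token in tokens:
        if token["is_final"]:
            final_text += token["text"]
        else:
            non_final_text += token["text"]
    return final_text, non_final_text


# Simulated partial results from two consecutive callbacks: the non-final
# "wor" from the first callback is discarded once the second callback
# delivers the finalized text.
final, tentative = split_tokens(
    [{"text": "Hello ", "is_final": True}, {"text": "wor", "is_final": False}],
    "",
)
final, tentative = split_tokens(
    [{"text": "world.", "is_final": True}],
    final,
)
# final == "Hello world.", tentative == ""
```

Only final_text is carried between callbacks; non-final text is rebuilt from scratch each time, which is why the client renders it in a separate, gray span.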