Manual finalization

Learn how manual finalization works.

Overview

In addition to automatic mechanisms such as endpoint detection and real-time latency tuning, Soniox also supports manual finalization. This gives you precise control over when a block of audio is finalized, which is useful for voice activity detection, push-to-talk systems, and segment-based transcription pipelines.

Manual finalization is triggered by sending a special message over the WebSocket connection:

{"type": "finalize"}
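
A minimal sketch of building and sending this message in Python. The `send` callable is a stand-in for whatever WebSocket client you use (for example, the text-frame send method of an open connection). It is an assumption of this example, not part of the Soniox API.

```python
import json
from typing import Callable

# The finalize control message, serialized once as JSON text.
FINALIZE_MESSAGE = json.dumps({"type": "finalize"})


def send_finalize(send: Callable[[str], object]) -> None:
    """Send the finalize message through a caller-supplied `send` function.

    `send` is hypothetical here: pass the text-frame send method of your
    WebSocket client.
    """
    send(FINALIZE_MESSAGE)
```

Passing the send function in, rather than importing a specific client library, keeps the helper independent of how the connection is managed.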

How it works

When you send a {"type": "finalize"} message:

  • Soniox finalizes all audio received up to that point.
  • All tokens associated with the finalized audio are returned with is_final: true.
  • After finalization is complete, the model returns a special token:
    {
      "text": "<fin>",
      "is_final": true
    }
    This marks the end of the finalize operation.
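
To detect the end of a finalize operation, a client can look for this token among the tokens it receives. The sketch below assumes each token is a dict with the `text` and `is_final` fields shown above. The function names are hypothetical, and how tokens are wrapped in a response message is not covered here.

```python
FIN_TEXT = "<fin>"


def is_fin_token(token: dict) -> bool:
    """Return True for the special token that ends a finalize operation."""
    return token.get("text") == FIN_TEXT and token.get("is_final") is True


def finalize_completed(tokens: list[dict]) -> bool:
    """Return True if any token in the batch is the <fin> marker."""
    return any(is_fin_token(t) for t in tokens)
```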

Key characteristics

  • You can send a finalize message multiple times in a session.
  • You may continue sending audio after a finalize message.
  • Together, these let you segment audio on your own terms instead of relying on automatic endpoint detection.
  • The <fin> token is always returned as a final token and can be used to trigger downstream processing (e.g., sending text to an LLM).
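
Because finalization can be requested repeatedly in one session, a client will often accumulate final tokens and cut a segment each time `<fin>` arrives. The following is a sketch under assumptions: `SegmentBuffer` is a hypothetical helper, tokens are dicts with `text` and `is_final` fields, and token texts are joined without adding separators.

```python
class SegmentBuffer:
    """Collect final token text and emit one segment per <fin> token."""

    def __init__(self) -> None:
        self._parts: list[str] = []

    def feed(self, tokens: list[dict]) -> list[str]:
        """Consume a batch of tokens; return segments completed in this batch."""
        segments = []
        for token in tokens:
            if not token.get("is_final"):
                continue  # this sketch keeps final tokens only
            if token.get("text") == "<fin>":
                # <fin> closes the current segment and starts a new one.
                segments.append("".join(self._parts))
                self._parts = []
            else:
                self._parts.append(token["text"])
        return segments
```

Each string returned by `feed` corresponds to one finalize operation, so it can be passed directly to downstream processing while the buffer keeps collecting the next segment.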

Why use manual finalization?

Manual finalization gives you full control over when transcribed content should be considered complete. It is especially useful for:

  • Push-to-talk interfaces
  • Client-side voice activity detection (VAD)
  • Manual segmentation of long sessions into distinct transcription blocks
  • Applications where automatic endpoint detection may not be ideal or reliable

Example use case

  1. Stream audio from the user in real time.
  2. After the user finishes a short utterance (based on your own VAD or timing logic), send:
    {"type": "finalize"}
  3. Receive finalized tokens from Soniox, ending with:
    {
      "text": "<fin>",
      "is_final": true
    }
  4. Trigger downstream processing (e.g., send to an LLM).
  5. Continue streaming the next utterance as needed.
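
The steps above can be sketched as a single coroutine. Everything about the connection is an assumption: `conn` is any object with async `send` and `recv` methods (as offered by common WebSocket clients), and each received message is assumed to be JSON containing a `tokens` list of the token objects shown above. `handle_utterance` and `on_segment` are hypothetical names.

```python
import asyncio
import json


async def handle_utterance(conn, audio_chunks, on_segment) -> str:
    """Stream one utterance, finalize it, and hand the text downstream."""
    # Step 1: stream the audio for this utterance.
    for chunk in audio_chunks:
        await conn.send(chunk)

    # Step 2: the utterance is over (per your own VAD or timing logic).
    await conn.send(json.dumps({"type": "finalize"}))

    # Step 3: collect final tokens until the <fin> marker arrives.
    parts = []
    fin_seen = False
    while not fin_seen:
        message = json.loads(await conn.recv())
        for token in message.get("tokens", []):  # "tokens" key is assumed
            if not token.get("is_final"):
                continue
            if token.get("text") == "<fin>":
                fin_seen = True
                break
            parts.append(token["text"])

    # Step 4: trigger downstream processing (e.g., send to an LLM).
    text = "".join(parts)
    on_segment(text)
    return text


# Step 5: call handle_utterance again on the same connection, e.g.
# asyncio.run(handle_utterance(conn, chunks, print))
```

For brevity this sketch reads responses only after the finalize message has been sent. A production client would more likely read responses concurrently with sending audio.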