Manual finalization

Overview

In addition to automatic mechanisms such as endpoint detection and real-time latency tuning, Soniox also supports manual finalization. This gives you precise control over when a block of audio is finalized, which is useful for client-side voice activity detection, push-to-talk systems, and segment-based transcription pipelines.

Manual finalization is triggered by sending a special message over the WebSocket connection:

{"type": "finalize"}

How it works

When you send a {"type": "finalize"} message:

Soniox finalizes all audio received up to that point.
All tokens associated with the finalized audio are returned with is_final: true.
After the finalization is complete, the model returns a special token:
{ "text": "<fin>", "is_final": true }
This marks the end of the finalize operation.
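
As a minimal sketch of the behavior described above, the helper below scans a batch of tokens, keeps the text of final tokens, and reports whether the <fin> marker was seen. Only the "text" and "is_final" fields and the <fin> value come from this page; the function name and the idea that tokens arrive as a list of already-parsed dictionaries are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

FIN_MARKER = "<fin>"  # special token text described on this page


def collect_final_text(tokens: Iterable[Dict]) -> Tuple[str, bool]:
    """Return (finalized_text, fin_seen) for one batch of tokens.

    Assumption: each token is a dict with "text" and "is_final" keys,
    already parsed from the JSON received over the WebSocket.
    """
    parts = []
    fin_seen = False
    for token in tokens:
        if not token.get("is_final"):
            continue  # skip tokens that are not final yet
        if token.get("text") == FIN_MARKER:
            fin_seen = True  # marks the end of the finalize operation
            continue
        parts.append(token.get("text", ""))
    return "".join(parts), fin_seen
```

A caller would typically keep accumulating text across batches until fin_seen is true, then hand the accumulated text to the next stage.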

Key characteristics

You can call finalize multiple times in a session.
You may continue sending audio after a finalize call.
This allows you to segment audio on your terms, instead of relying on automatic endpoint detection.
The <fin> token is always returned as a final token and can be used to trigger downstream processing (e.g., sending text to an LLM).

Why use manual finalization?

Manual finalization gives you full control over when transcribed content should be considered complete. It is especially useful for:

Push-to-talk interfaces
Voice activity detection on the client side
Manual segmentation of long sessions into distinct transcription blocks
Applications where automatic endpoint detection may not be ideal or reliable

Example use case

Stream audio from the user in real time.
After the user finishes a short utterance (based on your own VAD or timing logic), send:
{"type": "finalize"}
Receive finalized tokens from Soniox, ending with:
{ "text": "<fin>", "is_final": true }
Trigger downstream processing (e.g., send to an LLM).
Continue streaming the next utterance as needed.
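
The flow above could be wired up roughly as follows. This is a sketch under stated assumptions, not Soniox client code: UtteranceSegmenter, send_text, and on_segment are hypothetical names, send_text stands in for whatever function writes a text frame to your WebSocket connection, and tokens are assumed to arrive as parsed dictionaries. Only the {"type": "finalize"} message and the <fin> token come from this page.

```python
import json
from typing import Callable, Dict, Iterable, List

FINALIZE_MESSAGE = json.dumps({"type": "finalize"})
FIN_MARKER = "<fin>"


class UtteranceSegmenter:
    """Emits one text segment per completed finalize operation."""

    def __init__(
        self,
        send_text: Callable[[str], None],
        on_segment: Callable[[str], None],
    ) -> None:
        self._send_text = send_text    # hypothetical: writes a text frame to the WebSocket
        self._on_segment = on_segment  # hypothetical: downstream step, e.g. an LLM call
        self._parts: List[str] = []

    def end_utterance(self) -> None:
        """Call when your own VAD or timing logic decides the user has finished."""
        self._send_text(FINALIZE_MESSAGE)

    def handle_tokens(self, tokens: Iterable[Dict]) -> None:
        """Feed every batch of received tokens here."""
        for token in tokens:
            if not token.get("is_final"):
                continue  # ignore tokens that are not final yet
            if token.get("text") == FIN_MARKER:
                # Finalization is complete: hand off the segment and reset.
                self._on_segment("".join(self._parts))
                self._parts = []
            else:
                self._parts.append(token.get("text", ""))
```

Because the buffer is reset on each <fin>, the same object can be reused for every utterance in the session, which matches the ability to call finalize multiple times and keep streaming audio afterwards.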
