We rethought how AI models are built in order to deliver robust speaker recognition for conversations in real-world environments. Soniox is the leader in speaker recognition and speaker identification AI technology.

Improve conversational AI with speaker tags

Knowing who said what is fundamental when building conversational AI applications. For example, knowing "who asked the question" and "who answered the question" is essential to properly understanding a conversation.

Soniox Speaker Recognition AI supports recognition of up to 20 speakers in a given conversation from audio alone. It supports both speaker diarization (or separation) and speaker identification.

Speaker diarization

Speaker diarization distinguishes between different speakers, but it does not identify them. For example, it recognizes that there are two different speakers in the audio ("Speaker-1" and "Speaker-2"), but it does not know who these speakers are.

Speaker diarization does not require any additional input to recognize different speakers. The recognition is performed based on the audio input alone.

Speaker-1: Hi, how are you?

Speaker-2: I am great. What about you?

Speaker-1: Fantastic, thank you.

Speaker identification

Speaker identification associates each recognized speaker with a unique speaker identity. With speaker identification, the AI model knows that Speaker-1 is Mike and Speaker-2 is John, and it provides the speakers' names (identities).

Speaker identification requires speakers to register their voice ahead of time by providing a short audio sample of their speech.

Mike: Hi, how are you?

John: I am great. What about you?

Mike: Fantastic, thank you.
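
The difference between the two outputs above can be illustrated with a small sketch. It assumes diarized turns are available as (speaker tag, text) pairs and that a mapping from tags to registered names is already known. Both `label_turns` and the `registered` dictionary are hypothetical names introduced for illustration; in practice the matching is done by the AI model from the registered voice samples, not by a lookup table.

```python
from typing import Dict, List, Tuple


def label_turns(
    turns: List[Tuple[int, str]], names: Dict[int, str]
) -> List[Tuple[str, str]]:
    """Replace numeric speaker tags with names where an identity is known.

    Tags without a known identity fall back to the diarization-style
    label "Speaker-N". Hypothetical helper, for illustration only.
    """
    return [(names.get(tag, f"Speaker-{tag}"), text) for tag, text in turns]


# Diarized turns from the example conversation.
turns = [
    (1, "Hi, how are you?"),
    (2, "I am great. What about you?"),
    (1, "Fantastic, thank you."),
]

# Assumed result of speaker identification (illustrative values).
registered = {1: "Mike", 2: "John"}

for name, text in label_turns(turns, registered):
    print(f"{name}: {text}")
```

With an empty mapping, the same function yields the diarization-only labels, which mirrors the case where no speaker has registered a voice.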

Recognize speakers in live streams, not just files

Streaming speaker recognition

Soniox supports recognizing speakers in live streams, enabling you to instantly recognize speakers when they start speaking.

Global speaker recognition

Soniox also supports recognizing speakers from files, optimized for the highest accuracy, as the AI model can leverage context from the entire audio file.

Integrated into speech recognition API

Speaker recognition is seamlessly integrated into the speech recognition API: with one API call you get a transcript that contains the recognized words, and each recognized word comes with a speaker tag.

text: "YouTube";
start_ms: 1450;
duration_ms: 350;
is_final: true;
speaker: 1;
confidence: 0.98;
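
Because every word carries a speaker tag, a per-speaker transcript can be rebuilt by grouping consecutive words with the same tag. The following is a minimal sketch, assuming the words have already been parsed into objects with the fields shown above. The `Word` class and `group_by_speaker` function are illustrative names, not part of the Soniox API, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    # Field names mirror the example word above; the class itself is an assumption.
    text: str
    start_ms: int
    duration_ms: int
    is_final: bool
    speaker: int
    confidence: float


def group_by_speaker(words: List[Word]) -> List[Tuple[int, str]]:
    """Merge consecutive final words with the same speaker tag into turns."""
    turns: List[Tuple[int, str]] = []
    for word in words:
        if not word.is_final:
            continue  # skip words that are not marked final
        if turns and turns[-1][0] == word.speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1] = (word.speaker, turns[-1][1] + " " + word.text)
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append((word.speaker, word.text))
    return turns


# Illustrative sample data.
sample = [
    Word("Hi,", 0, 200, True, 1, 0.99),
    Word("how", 250, 150, True, 1, 0.98),
    Word("are", 420, 120, True, 1, 0.98),
    Word("you?", 560, 200, True, 1, 0.97),
    Word("I", 1200, 100, True, 2, 0.99),
    Word("am", 1320, 120, True, 2, 0.98),
    Word("great.", 1460, 300, True, 2, 0.97),
    Word("What", 1800, 150, False, 2, 0.60),
]

for speaker, text in group_by_speaker(sample):
    print(f"Speaker-{speaker}: {text}")
```

Skipping words that are not final is a design choice for this sketch; a live display might instead show them provisionally and replace them once they are finalized.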

