Storage and Search

Soniox Storage and Search lets you store, index, retrieve, and search your audio and transcript data. You can store audio, transcripts, or both, along with associated metadata. Stored data can be searched by metadata or by transcript content, and you can retrieve a whole audio file or a subsegment of it for playback.

Storage and Search functionality is available as soon as a transcription request completes: the audio and transcript are stored, indexed, and made available for retrieval and search immediately. This enables you to build near real-time applications on top of the Soniox service that require access to your audio and transcript data.

On the backend, Soniox takes care of all the storage and search functionality for you. The data is stored privately and securely in Soniox Cloud, in an isolated namespace for your Soniox Account.

Storage and Search is not used by default and must be enabled on a per-request basis. This means that audio and transcript are not stored unless Storage and Search is explicitly enabled for the specific transcription request.
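
As a rough sketch of what per-request opt-in can look like in client code, the helper below builds keyword arguments for a transcription call and includes storage only when asked. `build_transcribe_kwargs` is a hypothetical helper, not part of the Soniox library. The `model` and `storage_config` keyword names are taken from the example on this page, and the sketch assumes that omitting `storage_config` leaves storage disabled, as described above.

```python
from typing import Any, Dict, Optional


def build_transcribe_kwargs(
    model: str, storage_config: Optional[Any] = None
) -> Dict[str, Any]:
    """Build keyword arguments for one transcription request (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"model": model}
    # Assumption: storage is enabled only when a storage_config is passed.
    if storage_config is not None:
        kwargs["storage_config"] = storage_config
    return kwargs


# Not stored: no storage_config is supplied for this request.
print(build_transcribe_kwargs("en_v2"))

# Stored: the config is forwarded with this request. The dict here is only a
# stand-in for a real StorageConfig object.
print(build_transcribe_kwargs("en_v2", storage_config={"object_id": "my_id_for_audio"}))
```

Keeping the decision in one place makes it harder to store audio by accident, since every request that should be stored has to pass a config explicitly.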

Key Functionalities

Store audio
Store transcript
Store associated metadata with audio/transcript
Retrieve audio/subsegments in streaming mode
Retrieve transcript in structured format
Search audio/transcript by its metadata
Search audio/transcript by transcript content

Example

In this example, we transcribe an audio file and, at the same time, instruct Soniox to store and index the audio and transcript data. Once the transcription completes, both the audio and the transcript can be retrieved or searched via the Soniox API.

storage_and_search.py

from soniox.transcribe_file import transcribe_file_short
from soniox.speech_service import SpeechClient, StorageConfig
from soniox.storage import search_objects


# Do not forget to set your API key in the SONIOX_API_KEY environment variable.
def main():
    with SpeechClient() as client:
        storage_config = StorageConfig(
            object_id="my_id_for_audio",
            metadata={
                "company": "Nike",
                "agent": "12345",
            },
            title="Air Jordan shoes review",
        )

        transcribe_file_short(
            "../test_data/test_audio_storage.flac",
            client,
            model="en_v2",
            storage_config=storage_config,
        )

        # Search for objects with query "air jordan".
        search_response = search_objects(client, text_query="air jordan")

        # Print search results.
        print(f"Results: {search_response.num_found}")
        for result in search_response.results:
            print(f"Object ID: {result.object_id}")
            print(f"Preview: {result.preview}")


if __name__ == "__main__":
    main()

Run

python3 storage_and_search.py

Output

Results: 1
Object ID: my_id_for_audio
Preview: This, my friends, is the <em>Air</em> <em>Jordan</em> 6 in what's being labeled currently the Toro colorway.
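
The preview in the output above marks matched terms with `<em>` tags. If you want to display it in a plain-text context, or list which terms matched, a minimal sketch along these lines may help. `highlighted_terms` and `plain_preview` are hypothetical helper names, not part of the Soniox library, and the sketch assumes the highlighting always uses plain `<em>...</em>` pairs as in the sample output.

```python
import re
from typing import List

# Matches one highlighted term wrapped in <em>...</em>.
_EM_PATTERN = re.compile(r"<em>(.*?)</em>", re.DOTALL)


def highlighted_terms(preview: str) -> List[str]:
    """Return the terms wrapped in <em> tags, in order of appearance."""
    return _EM_PATTERN.findall(preview)


def plain_preview(preview: str) -> str:
    """Return the preview text with the <em> markup removed."""
    return re.sub(r"</?em>", "", preview)


# Preview string copied from the sample output above.
preview = (
    "This, my friends, is the <em>Air</em> <em>Jordan</em> 6 in what's "
    "being labeled currently the Toro colorway."
)

print(highlighted_terms(preview))  # ['Air', 'Jordan']
print(plain_preview(preview))
```

A regular expression is enough here because the markup is limited to a single tag; for previews that may contain other HTML, a real HTML parser would be the safer choice.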