Run real-time low-latency speech recognition on any device, including servers, laptops, and mobile devices.
Automatically scale speech recognition computation across multiple devices, depending on the current load.
Provide low-latency control and logging over computation and APIs at any layer of the system.
We designed and implemented multiple libraries to support all of these requirements. Each library has its own set of interfaces and has been thoroughly tested. The majority of the code is highly optimized C++; some minor components are written in Python. The libraries are linked into one unified framework which powers all of the Soniox speech recognition services. The following sections explain in more detail the engineering design and functionality of these core libraries.
We built a proprietary inference engine for streaming artificial neural networks. A streaming network consists of a sequence of modules, each of which can run in a streaming mode. The engine takes a stream of numbers as input and computes outputs for all the modules in the network, producing the final output of the network with the minimum possible latency. With proper neural network architectures, this enables real-time low-latency inference.
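To illustrate the idea (not Soniox's actual engine; all class and method names here are hypothetical), a streaming network can be sketched as a chain of modules that each consume an input chunk, emit an output chunk, and carry state between chunks, so output is produced as soon as each chunk arrives:

```python
import numpy as np

class StreamingModule:
    """One module in a streaming network: consumes an input chunk,
    emits an output chunk, and keeps internal state between calls."""
    def process_chunk(self, chunk: np.ndarray) -> np.ndarray:
        raise NotImplementedError

class Scale(StreamingModule):
    """Stateless module: multiplies every sample by a constant."""
    def __init__(self, factor):
        self.factor = factor
    def process_chunk(self, chunk):
        return chunk * self.factor

class RunningMean(StreamingModule):
    """Toy stateful module: cumulative mean over all samples seen so far."""
    def __init__(self):
        self.total = 0.0
        self.count = 0
    def process_chunk(self, chunk):
        self.total += float(chunk.sum())
        self.count += chunk.size
        return np.full_like(chunk, self.total / self.count)

class StreamingNetwork:
    """Chains modules; each incoming chunk flows through all of them,
    so partial output is available without waiting for the full stream."""
    def __init__(self, modules):
        self.modules = modules
    def process_chunk(self, chunk):
        for m in self.modules:
            chunk = m.process_chunk(chunk)
        return chunk

net = StreamingNetwork([Scale(2.0), RunningMean()])
out1 = net.process_chunk(np.array([1.0, 2.0]))  # mean of [2, 4] -> [3.0, 3.0]
out2 = net.process_chunk(np.array([3.0]))       # mean of [2, 4, 6] -> [4.0]
```

The key property is that state lives inside the modules, so latency is bounded by the chunk size rather than the stream length.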
The inference engine can run on any device as long as every module's computation can be implemented on that device. This is typically not an issue, since most devices provide efficient implementations of the required mathematical operators (modules). We implemented efficient CPU-based inference for any CPU machine running Linux. This enables us to run the inference engine on any private cluster or cloud environment with commodity CPU machines. We currently run the inference engine on Intel and AMD CPUs.
We built a proprietary decoder for speech recognition that enables us to efficiently explore multiple static and dynamic lattices at once. The decoder supports efficient pruning of the lattice with parameters configurable for each individual decoding session. Parameters include “look back” (history context) and “look ahead” (future context).
The decoder also supports different language models and other biasing techniques for decoding. This includes vocabulary adaptation and boosting of specified context words and phrases for each individual decoding session.
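As a toy illustration of pruned decoding with context biasing (this is not Soniox's decoder; the scores, words, and the simple additive boost are all made up for the example), a beam search can keep only the best few hypotheses at each step, and boosted words can receive a score bonus so that domain terms win ties against acoustically similar alternatives:

```python
import math

def decode(steps, beam_size=2, boost_words=None, boost=0.0):
    """Toy beam-search decoder.

    steps: list of dicts mapping candidate word -> log-probability.
    boost_words: words that get an additive score bonus (context biasing).
    """
    boost_words = boost_words or set()
    beam = [((), 0.0)]  # (word sequence, cumulative score)
    for candidates in steps:
        expanded = []
        for seq, score in beam:
            for word, logp in candidates.items():
                bonus = boost if word in boost_words else 0.0
                expanded.append((seq + (word,), score + logp + bonus))
        # Prune: keep only the best `beam_size` hypotheses.
        expanded.sort(key=lambda h: h[1], reverse=True)
        beam = expanded[:beam_size]
    return beam[0][0]

# Two decoding steps with illustrative acoustic/language-model scores.
steps = [
    {"call": math.log(0.7), "cal": math.log(0.3)},
    {"john": math.log(0.6), "joan": math.log(0.4)},
]

print(decode(steps))                                     # ('call', 'john')
print(decode(steps, boost_words={"joan"}, boost=1.0))    # ('call', 'joan')
```

Boosting "joan" flips the second word even though "john" has the higher raw score, which is the effect one wants when, say, the caller's contact list is known.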
We implemented a proprietary framework for concurrent and distributed processing of real-time streams with low latency responses. The framework consists of a load balancer and workers.
The load balancer supports automatic dynamic scaling of workers to match the current load of incoming audio streams. When the load increases, the load balancer automatically launches new workers to distribute the computation while still providing real-time low-latency responses. When the load decreases, it automatically shuts down surplus workers to save computational resources. The rate of dynamic scaling is configurable via parameters of the load balancer.
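A minimal autoscaling policy of this kind can be sketched as follows (illustrative only, not Soniox's actual scaling logic; the per-worker capacity and min/max bounds stand in for the configurable parameters mentioned above):

```python
def target_workers(active_streams, streams_per_worker,
                   min_workers=1, max_workers=16):
    """Workers needed for the current load, clamped to configured bounds."""
    needed = -(-active_streams // streams_per_worker)  # ceiling division
    return max(min_workers, min(needed, max_workers))

class LoadBalancer:
    """Tracks the current worker count and decides how many workers
    to launch or stop as the stream load changes."""
    def __init__(self, streams_per_worker, min_workers=1, max_workers=16):
        self.streams_per_worker = streams_per_worker
        self.min_workers = min_workers
        self.max_workers = max_workers
        self.workers = min_workers
    def rebalance(self, active_streams):
        target = target_workers(active_streams, self.streams_per_worker,
                                self.min_workers, self.max_workers)
        launched = max(0, target - self.workers)  # scale up
        stopped = max(0, self.workers - target)   # scale down
        self.workers = target
        return launched, stopped

lb = LoadBalancer(streams_per_worker=8)
print(lb.rebalance(20))  # load spike: (2, 0) -> launch 2 workers
print(lb.rebalance(4))   # load drop:  (0, 2) -> stop 2 workers
```

A production policy would also smooth the load signal and rate-limit scaling decisions, which is what the load balancer's configurable scaling rate controls.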
Workers are machines that run the Soniox inference engine in concurrent mode, i.e., each worker can process multiple audio streams at once. Workers are implemented as highly efficient asynchronous multi-threaded code.
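The concurrency pattern can be sketched with Python's asyncio (the actual workers are C++, per the earlier section; the uppercase transform here is just a stand-in for per-chunk inference):

```python
import asyncio

async def process_stream(stream_id, chunks, results):
    """Process one audio stream chunk by chunk, yielding between chunks
    so other streams on the same worker make progress."""
    transcript = []
    for chunk in chunks:
        await asyncio.sleep(0)            # yield control to other streams
        transcript.append(chunk.upper())  # stand-in for inference on a chunk
    results[stream_id] = " ".join(transcript)

async def run_worker(streams):
    """One worker handling several streams concurrently."""
    results = {}
    await asyncio.gather(*(process_stream(sid, chunks, results)
                           for sid, chunks in streams.items()))
    return results

results = asyncio.run(run_worker({
    "s1": ["hello", "world"],
    "s2": ["good", "morning"],
}))
print(results)  # {'s1': 'HELLO WORLD', 's2': 'GOOD MORNING'}
```

Interleaving streams on one worker keeps hardware utilization high while each stream still receives its results incrementally.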
To enable running the entire framework on any cluster or cloud infrastructure, we implemented an abstraction layer called resource service, which provides computational resources (e.g. workers) that are required for the framework to operate. To deploy the framework to a particular cluster or cloud environment, one only needs to implement the resource service for that environment.
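The shape of such an abstraction layer can be sketched as a small interface plus one per-environment implementation (the interface and class names below are hypothetical, chosen only to mirror the description above):

```python
from abc import ABC, abstractmethod

class ResourceService(ABC):
    """Abstraction the framework uses to acquire and release workers.
    Each cluster or cloud environment implements this once."""
    @abstractmethod
    def launch_worker(self) -> str:
        """Provision a worker; return its identifier."""
    @abstractmethod
    def stop_worker(self, worker_id: str) -> None:
        """Release a previously launched worker."""

class LocalResourceService(ResourceService):
    """In-process stub, e.g. for tests or a single-machine deployment.
    A cloud implementation would call the provider's instance APIs instead."""
    def __init__(self):
        self.next_id = 0
        self.active = set()
    def launch_worker(self):
        self.next_id += 1
        worker_id = f"worker-{self.next_id}"
        self.active.add(worker_id)
        return worker_id
    def stop_worker(self, worker_id):
        self.active.discard(worker_id)

svc = LocalResourceService()
w = svc.launch_worker()
print(w)  # worker-1
```

Because the framework only talks to the interface, porting it to a new environment reduces to writing one new `ResourceService` implementation.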
We implemented a proprietary API management system for stream processing. The system supports multiple authentication schemes, including JWT and API key authentication. It also supports low-latency data logging as streams are being processed, which is especially important for real-time speech recognition applications. The system also supports numerous access control and throttling mechanisms, including a maximum number of active streams and maximum throughput.
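Stream admission control of this kind can be sketched as follows (a simplified illustration, not the actual system: real API-key checks, JWT validation, and throughput limits are omitted):

```python
class ApiGateway:
    """Admits a new stream only if the API key is valid and the key
    is below its cap on concurrently active streams."""
    def __init__(self, api_keys, max_active_streams):
        self.api_keys = set(api_keys)
        self.max_active = max_active_streams
        self.active = {}  # api_key -> number of currently open streams
    def open_stream(self, api_key):
        if api_key not in self.api_keys:
            raise PermissionError("invalid API key")
        if self.active.get(api_key, 0) >= self.max_active:
            raise RuntimeError("too many active streams for this API key")
        self.active[api_key] = self.active.get(api_key, 0) + 1
    def close_stream(self, api_key):
        self.active[api_key] -= 1

gw = ApiGateway({"key-1"}, max_active_streams=2)
gw.open_stream("key-1")
gw.open_stream("key-1")
# A third open_stream("key-1") would now be rejected until one closes.
```

Per-key throughput throttling would follow the same pattern, with a rate counter in place of the concurrency counter.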