About the role
At Soniox, our mission is to make voice universally accessible and programmable in real time. Training state-of-the-art AI systems across languages and domains depends on vast, diverse, high-quality datasets. As a software engineer working on data acquisition, you will lead the development of scalable infrastructure for acquiring, indexing, and managing data at massive scale, powering the next generation of speech and language models.
In this role, you will:
- Design, build, and scale systems for web crawling, large-scale data ingestion, and content indexing.
- Work closely with data processing and model training teams to ensure smooth and efficient data pipelines.
- Own backend infrastructure for storage, indexing, and search across multi-petabyte datasets.
- Architect distributed systems that are robust, performant, and optimized for research and production workloads.
- Deploy and operate services in a Kubernetes environment using Infrastructure-as-Code.
- Analyze system performance and data coverage through experimentation and instrumentation.
You might thrive in this role if you:
- Have 6+ years of experience building large-scale software systems.
- Have deep knowledge of distributed systems, web crawling, and backend engineering.
- Are comfortable with key-value stores, data synchronization, and scalable storage systems.
- Are pragmatic and curious — unafraid to try new tools and rethink old assumptions.
- Communicate clearly and proactively, especially with cross-functional teams.
- Care about building infrastructure that directly powers real-world AI systems.
Why Soniox
You’ll help build one of the most technically advanced AI platforms in the world — and shape how it reaches and supports users globally.
You’ll work directly with a world-class team of engineers and researchers solving frontier problems in speech and language AI.
You’ll have a voice in how our company grows, how our customers succeed, and how AI transforms human communication.
Ready to join Soniox? Apply now