January 19, 2024

Soniox 7B

by Soniox Team

We are excited to release Soniox 7B, the most powerful large language model for its size to date.

Soniox 7B summary

  • Outperforms Mistral 7B on all benchmarks
  • Outperforms Mixtral 8X7B on almost all benchmarks
  • Matches GPT-4 on some benchmarks
  • Supports English and code with 8K context window
  • Built on top of Mistral 7B with additional pre-training and fine-tuning
  • Released under the Apache 2.0 license and can be used without restrictions

Usage
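
Below is a minimal sketch of how a model like Soniox 7B could be loaded and prompted with the Hugging Face transformers library. The repository id soniox/Soniox-7B-v1.0 and the example prompt are assumptions for illustration, not an official reference.

```python
# Minimal sketch, assuming the weights are published on the Hugging Face Hub.
# The repository id below is hypothetical; substitute the actual model id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "soniox/Soniox-7B-v1.0"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Summarize the following text in one sentence:\n<your text here>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```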

Motivation

Over the past year, we have tried to develop an advanced AI assistant capable of performing complex tasks, including internet searches, analysis of gathered data, data presentation, and making predictions. Despite experimenting with various LLMs, including GPT-4, we have consistently encountered limitations in their functionality. The primary limitations can be summarized as follows:

  • Fragility: Minor and seemingly insignificant variations in instructions can lead to significantly different output quality.
  • Inconsistency: Outputs sometimes fail to adhere to the provided instructions, a problem that recurs across different inputs even when the instructions are identical.
  • Slowness: Models like GPT-4 introduce considerable latency, significantly degrading user experience.
  • Cost: The expense of proprietary LLMs is prohibitive for AI applications that require large processing volumes.

We have heard similar experiences from other developers and companies, which motivated us to start developing our own LLM technology.

Our first release, Soniox 7B, has been developed specifically to excel at instruction following and API calling, and to perform NLP tasks at a level comparable to GPT-4, according to our benchmarks. Soniox 7B is easy to deploy, runs swiftly on many hardware configurations, and is released under the Apache 2.0 license, allowing unrestricted use.

Benchmarks

We compared Soniox 7B with Mistral-7B-Instruct-v0.2, Mixtral-8x7B-Instruct-v0.1, and GPT-4-1106-preview on 17 different benchmarks, using the same evaluation pipeline and parameters for all models. The results are presented in the table below.

Benchmark           Soniox-7B-v1.0   Mistral-7B-Instruct-v0.2   Mixtral-8x7B-Instruct-v0.1   GPT-4-1106-preview
MMLU                62.3%            56.1%                      69.5%                        78.9%
HellaSwag           91.6%            55.9%                      73.5%                        90.5%
Arc-c               84.6%            73.0%                      83.0%                        96.3%
Arc-e               92.8%            86.2%                      89.9%                        99.2%
WinoGrande          77.6%            45.9%                      67.1%                        85.2%
PIQA                87.2%            68.0%                      83.4%                        94.6%
SIQA                80.6%            58.9%                      72.4%                        78.8%
OpenBookQA          90.6%            81.4%                      88.4%                        98.4%
CommonsenseQA       79.9%            65.4%                      73.1%                        84.5%
BoolQ               86.8%            78.6%                      81.2%                        88.9%
Math                20.4%            8.9%                       20.9%                        60.5%
GSM8K               76.3%            42.4%                      59.0%                        94.2%
BBH                 44.8%            39.6%                      47.1%                        66.1%
AGIEval             41.0%            27.0%                      40.3%                        68.9%
HumanEval           44.7%            25.4%                      14.0%                        87.5%
MBPP                42.4%            2.0%                       28.6%                        77.0%
SonioxText          9.01             8.34                       8.43                         9.13

All results are accuracies, except for SonioxText, which is graded on a scale from 0 to 10.

Benchmarks can be summarized as follows:

  • Soniox 7B outperforms Mistral 7B on all benchmarks by a large margin.
  • Soniox 7B outperforms Mixtral 8x7B on 14 out of 17 benchmarks.
  • Soniox 7B and GPT-4 have similar performance on 4 benchmarks (HellaSwag, SIQA, BoolQ, and SonioxText), while GPT-4 leads on the remaining benchmarks, including the mathematics (Math, GSM8K) and coding (HumanEval, MBPP) benchmarks.
  • Soniox 7B nearly matches the performance of GPT-4 on the SonioxText benchmark, which comprises a diverse set of NLP tasks based on real-world text datasets.

SonioxText

The SonioxText dataset is our proprietary benchmark, designed to evaluate how well a model executes NLP tasks on given texts. The benchmark is intentionally constructed to include a wide variety of NLP tasks and a diverse range of real-world texts. The motivation behind it is to simulate a real-world scenario: processing any text, whether short, long, structured, messy, or on any topic, with an arbitrary NLP task.

Specifically, the dataset comprises samples from sources such as Wikipedia articles, general and financial news, Reddit posts, scientific and medical papers, legal articles, source code documentation, website content, and video captions. For each text sample, we tasked GPT-4 with generating up to three relevant NLP tasks.

For evaluation, we assess how well an LLM performs a given NLP task on a text. The LLM's response is then graded by GPT-4 on a scale from 0 to 10.
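
To make this grading step concrete, here is a hedged sketch of how a GPT-4 judge can be asked to return a grade from 0 to 10, assuming the OpenAI Python client. The grading prompt and parsing below are illustrative and not the actual SonioxText rubric.

```python
# Illustrative LLM-as-judge grading sketch; the prompt wording is an assumption,
# not the actual SonioxText grading rubric.
import re
from openai import OpenAI

client = OpenAI()

def grade_response(task: str, text: str, response: str) -> int:
    judge_prompt = (
        "You are grading how well an assistant performed an NLP task on a text.\n\n"
        f"Task: {task}\n\nText:\n{text}\n\nAssistant response:\n{response}\n\n"
        "Reply with a single integer from 0 (useless) to 10 (perfect)."
    )
    completion = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    # Extract the first integer in the judge's reply and clamp it to the 0-10 range.
    match = re.search(r"\d+", completion.choices[0].message.content)
    return min(int(match.group()), 10) if match else 0
```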

Evaluation pipeline

We have developed our own benchmarking pipeline to enable accurate and fair comparisons of different instruction-tuned LLMs. For each benchmark, we craft a specific prompt that clearly defines and describes the problem the LLM needs to solve. Additionally, we have implemented a robust response parsing system to account for the wide range of variations in how LLMs present their answers, ensuring maximal accuracy in parsing and evaluation. For benchmarks such as MMLU, Math, AGIEval, HumanEval, and MBPP, we utilized external reference libraries for proper response parsing.
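
As an illustration of what robust response parsing involves, the sketch below extracts a multiple-choice answer letter from free-form model output. The patterns are assumptions chosen for illustration, not our actual parser.

```python
# Illustrative multiple-choice answer extraction; the patterns are assumptions,
# not the actual parsing rules used in our pipeline.
import re

ANSWER_PATTERNS = [
    r"answer\s*(?:is|:)\s*\(?([A-D])\)?\b",  # "The answer is (B)", "Answer: B"
    r"^\(?([A-D])\)?[.):]\s",                # a response that starts with "B." or "(B)"
]

def parse_choice(response: str) -> str | None:
    """Return the predicted option letter, or None if no answer can be recovered."""
    for pattern in ANSWER_PATTERNS:
        match = re.search(pattern, response, flags=re.IGNORECASE | re.MULTILINE)
        if match:
            return match.group(1).upper()
    return None

print(parse_choice("After some reasoning, the answer is (C)."))           # C
print(parse_choice("B. Because the premise contradicts the hypothesis."))  # B
```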

Decontamination

We observed that several publicly available datasets, such as OpenOrca and ultrachat_200k, contain text that is contaminated with questions from public benchmarks. To maintain the integrity of our benchmark results, we meticulously removed all such questions and their variants from our pre-training and fine-tuning datasets.
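
One common way to implement this kind of filtering is n-gram overlap against the benchmark questions. The sketch below illustrates that idea under assumed parameters; it is not the exact procedure we used.

```python
# Hedged n-gram decontamination sketch; the normalization and n-gram size are
# assumptions for illustration, not the exact procedure used for Soniox 7B.
import re

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    # Lowercase and keep only alphanumeric tokens before forming n-grams.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample: str, benchmark_ngrams: set[tuple[str, ...]], n: int) -> bool:
    """Flag a training sample that shares any n-gram with a benchmark question."""
    return bool(ngrams(sample, n) & benchmark_ngrams)

# Usage: build the reference set once from all benchmark questions, then drop
# any pre-training or fine-tuning sample that is flagged.
N = 5  # assumed n-gram size
question = "Which of the following is the capital city of France in Western Europe?"
benchmark_ngrams = ngrams(question, N)

sample = "Q: Which of the following is the capital city of France in Western Europe? A: Paris."
print(is_contaminated(sample, benchmark_ngrams, N))  # True
```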