vLLM#
Soniox-7B can be deployed using the vLLM OpenAI-compatible API server and accessed via the Chat Completions API. The correct conversation template is applied automatically. The server can be deployed using a Docker image or directly from Python.
With Docker#
On a GPU-enabled host, you can run the Soniox-7B vLLM image with the following command:
docker run --gpus all \
-e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
public.ecr.aws/r6l7m9m8/soniox-7b-vllm:latest \
--host 0.0.0.0 \
--port 8000 \
--model soniox/Soniox-7B-v1.0 \
--max-model-len 8192 \
--enforce-eager \
--dtype float16
This will download the model from Hugging Face. Make sure to set HF_TOKEN to your Hugging Face user access token.
Parameters passed to the container are forwarded to the vLLM server. For an explanation of these parameters, see Run vLLM server.
Without Docker#
Alternatively, you can start the vLLM server directly on a GPU-enabled host.
Install vLLM#
First you need to install vLLM (or use conda install vllm if you are using Anaconda):
pip3 install -U vllm
Log in to Hugging Face#
You will also need to log in to the Hugging Face Hub using:
huggingface-cli login
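If you prefer to authenticate from Python instead of the CLI, the huggingface_hub package provides an equivalent login call. A minimal sketch (install huggingface_hub first if it is not already present):

from huggingface_hub import login

# Prompts for a Hugging Face user access token (or pass token="hf_...");
# equivalent to running huggingface-cli login.
login()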
Run vLLM server#
You can now use the following command to start the server:
python3 -u -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--model soniox/Soniox-7B-v1.0 \
--max-model-len 8192 \
--enforce-eager \
--dtype float16
Explanation:
- vllm.entrypoints.openai.api_server is the vLLM OpenAI-compatible API server module.
- --max-model-len prevents going beyond the context length that the model was trained with.
- --enforce-eager disables use of CUDA graphs to avoid a GPU memory leak.
- --dtype specifies the computation data type. We recommend float16.
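Once the server is up, a quick way to check that it is serving the model is to list the models it exposes. This is a minimal sketch using the openai Python package (installed in the client example below), assuming the server is reachable on localhost port 8000 as configured above:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

# The listing should include soniox/Soniox-7B-v1.0 (or the local model path,
# if --model was given a path instead of the Hugging Face model name).
for model in client.models.list():
    print(model.id)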
If you downloaded the model as a zip archive, then --model should be the path to the Soniox-7B-v1.0 directory extracted from the archive. Please note that in this case, clients need to specify the same string for model.
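For example, if the server was started with --model pointing at a local directory, a client must pass that exact path as the model. A minimal sketch, where /models/Soniox-7B-v1.0 is a purely illustrative path (use whatever path you passed to --model):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

# "/models/Soniox-7B-v1.0" is a hypothetical path; it must match the value
# passed to --model when the server was started.
choice = client.chat.completions.create(
    model="/models/Soniox-7B-v1.0",
    messages=[{"role": "user", "content": "Hello"}],
).choices[0]
print(choice.message.content)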
Client API#
Clients should use the Chat Completions API with the vLLM server.
- Base URL: http://hostname:8000/v1
- API key: none
- Model: soniox/Soniox-7B-v1.0
Here is an example of usage from Python.
pip3 install -U openai
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

choice = client.chat.completions.create(
    model="soniox/Soniox-7B-v1.0",
    messages=[{"role": "user", "content": "3*7?"}],
    temperature=0.5,
).choices[0]

# Any finish_reason other than "stop" (e.g. "length") means the response
# was cut off before the model finished.
if choice.finish_reason != "stop":
    raise Exception(f"finish_reason is not stop but {choice.finish_reason}")

response = choice.message.content
print(response)
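The same Chat Completions endpoint also supports streaming, so you can print the reply as it is generated. A minimal sketch using the same server and model as above:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

# With stream=True the server returns chunks as they are generated instead of
# a single completed response.
stream = client.chat.completions.create(
    model="soniox/Soniox-7B-v1.0",
    messages=[{"role": "user", "content": "3*7?"}],
    temperature=0.5,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()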