SkyPilot#
SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.
Below is an example of the SkyPilot config to deploy Soniox 7B.
SkyPilot Configuration#
After installing SkyPilot, create a configuration file (for example soniox-7b.yaml, referenced below) that tells SkyPilot how and where to deploy your inference server using our pre-built Docker container:
resources:
  cloud: ${CLOUD_PROVIDER}
  accelerators: A10G:1
  ports:
    - 8000

run: |
  docker run --gpus all -p 8000:8000 public.ecr.aws/r6l7m9m8/soniox-7b-vllm:latest \
    --host 0.0.0.0 \
    --port 8000 \
    --model soniox/Soniox-7B-v1.0 \
    --max-model-len 8192 \
    --enforce-eager \
    --dtype float16
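As a concrete example, with the cloud placeholder filled in for AWS (any other SkyPilot-supported provider such as gcp or azure works the same way), the resources section would look like this:

resources:
  cloud: aws        # example value; replace with your provider
  accelerators: A10G:1
  ports:
    - 8000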
Once these environment variables are set, or the placeholder replaced directly as shown above, you can use sky launch to start the inference server under the cluster name soniox-7b:
sky launch -c soniox-7b soniox-7b.yaml --region us-east-1
Caution
When deployed this way, the model is accessible to the whole world.
You must secure it, either by exposing it exclusively on your private network
(adjust the --host option or the port mapping in the docker run command for that), by adding a load balancer with
an authentication mechanism in front of it, or by configuring your instance networking properly.
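As a rough sketch of the first option (assuming Docker's port publishing is how you control exposure), you could publish the API port on the loopback interface only, then put your private networking or an authenticating proxy in front of it:

run: |
  # Example only: publish the port on 127.0.0.1 so the server is not reachable
  # from outside the instance; adapt the bind address to your private network.
  docker run --gpus all -p 127.0.0.1:8000:8000 public.ecr.aws/r6l7m9m8/soniox-7b-vllm:latest \
    --host 0.0.0.0 \
    --port 8000 \
    --model soniox/Soniox-7B-v1.0 \
    --max-model-len 8192 \
    --enforce-eager \
    --dtype float16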
Test it out#
To retrieve the IP address of the deployed soniox-7b cluster, you can use:
sky status --ip soniox-7b
You can then use curl to send a chat completion request:
IP=$(sky status --ip soniox-7b)
curl http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "soniox/Soniox-7B-v1.0",
    "messages": [{"role": "user", "content": "12 * 7?"}],
    "max_tokens": 128
  }'
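If you have jq installed, you can pipe the response through it to print only the assistant's reply, assuming the standard OpenAI-style response shape returned by the server:

IP=$(sky status --ip soniox-7b)
curl -s http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "soniox/Soniox-7B-v1.0",
    "messages": [{"role": "user", "content": "12 * 7?"}],
    "max_tokens": 128
  }' | jq -r '.choices[0].message.content'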