Run vLLM with Docker:
# Deploy with docker on Linux:
docker run --runtime nvidia --gpus all \
--name my_vllm_container \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B
Load and run the model (note: the vllm/vllm-openai image's entrypoint already serves the model passed via --model, so this exec is only needed to start a server manually inside a container that is not already serving):
docker exec -it my_vllm_container bash -c "vllm serve Qwen/Qwen3-0.6B"
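Before sending requests, it can help to confirm the server is up. A minimal sketch in Python using only the standard library, polling the OpenAI-compatible /v1/models endpoint that vLLM exposes (the function name server_is_ready is illustrative, not part of vLLM):

```python
import json
import urllib.error
import urllib.request

def server_is_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    # Query the OpenAI-compatible model listing endpoint served by vLLM.
    # Returns True only if the expected model appears in the response.
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            models = json.load(resp)
        return any(m.get("id") == "Qwen/Qwen3-0.6B" for m in models.get("data", []))
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: server not ready yet.
        return False

print(server_is_ready())
```

If the container is still loading model weights, this returns False rather than raising, so it can be called in a retry loop.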
Call the API with curl:
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
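The same request can be issued from Python with only the standard library; a minimal sketch that mirrors the curl payload above (the helper names build_chat_request and send_chat_request are illustrative):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # port published by docker run above

def build_chat_request(model: str, prompt: str) -> dict:
    # OpenAI-compatible chat payload, matching the curl call above.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def send_chat_request(payload: dict) -> dict:
    # POST the payload to the vLLM server; requires the container to be running.
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Qwen/Qwen3-0.6B", "What is the capital of France?")
print(json.dumps(payload, indent=2))
# With the server running, send it with: send_chat_request(payload)
```

The response follows the OpenAI chat completions format, so the generated text is under choices[0]["message"]["content"].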
