Get Started with vLLM
Introduction
vLLM is a high-performance serving engine designed for efficient and scalable deployment of large language models (LLMs). It delivers fast inference with reduced memory overhead by leveraging continuous batching, PagedAttention, and optimized GPU utilization.
This tutorial will guide you through setting up vLLM, serving a model, and making API calls to interact with it.
Prerequisites
Before proceeding, ensure you have:
- A compatible GPU with CUDA support (optional but recommended)
- Python 3.8+
- pip installed
- torch and transformers installed
Installing vLLM
To install vLLM, use:
pip install vllm
If you need GPU acceleration, ensure you have installed PyTorch with CUDA support:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
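To verify that the installation worked and that your GPU is visible, you can run a quick check like the one below (a minimal sketch that only assumes the torch and vllm packages installed above):
import torch
import vllm

# Confirm the installed vLLM version
print("vLLM version:", vllm.__version__)

# Check that PyTorch can see a CUDA-capable GPU
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())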
Serving a Model with vLLM
To serve an LLM using vLLM, use the python -m vllm.entrypoints.api_server command:
python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-chat-hf
This will download the Meta Llama 2 7B chat model from Hugging Face and serve it on port 8000 by default.
Key Arguments:
- --model: Specifies the Hugging Face model to use.
- --port: Sets the API server port (default: 8000).
- --gpu-memory-utilization: Controls the fraction of GPU memory vLLM is allowed to allocate.
- --dtype: Sets the precision (e.g., bfloat16, float16).
For example, to serve the model in float16 and allow vLLM to use up to 90% of GPU memory:
python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-chat-hf --dtype float16 --gpu-memory-utilization 0.9
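Before moving on, it helps to wait until the server has finished downloading and loading the model, since the API only responds once loading completes. The snippet below is a rough sketch that polls the /generate endpoint (covered in the next section) with a minimal request, assuming the default port 8000:
import time
import requests

url = "http://localhost:8000/generate"
payload = {"prompt": "ping", "max_tokens": 1}

# Retry for up to ~5 minutes while the model downloads and loads
for attempt in range(60):
    try:
        if requests.post(url, json=payload, timeout=5).ok:
            print("Server is ready.")
            break
    except requests.exceptions.ConnectionError:
        pass  # Server not accepting connections yet
    time.sleep(5)
else:
    print("Server did not become ready in time.")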
Making API Requests
Once the server is running, you can send requests using cURL or Python.
cURL Example:
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "What is vLLM?", "max_tokens": 100}'
Python Example:
import requests

# Send a generation request to the local vLLM server
url = "http://localhost:8000/generate"
data = {"prompt": "What is vLLM?", "max_tokens": 100}
response = requests.post(url, json=data)
print(response.json())
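Beyond prompt and max_tokens, the /generate endpoint passes additional sampling parameters such as temperature and top_p through to vLLM. The helper below is a small sketch; the exact set of accepted parameters depends on your vLLM version, so treat the names here as illustrative:
import requests

def generate(prompt, max_tokens=100, temperature=0.7, top_p=0.95,
             url="http://localhost:8000/generate"):
    """Send a generation request with custom sampling parameters."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,  # Higher values give more varied output
        "top_p": top_p,              # Nucleus sampling cutoff
    }
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()

print(generate("Explain continuous batching in one sentence."))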
Advanced Configurations
Running with Multiple GPUs
vLLM supports multi-GPU inference. Use the --tensor-parallel-size flag to enable it:
python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 2
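The value of --tensor-parallel-size should not exceed the number of GPUs available on the machine. A quick sanity check before launching (a minimal sketch using PyTorch) looks like this:
import torch

# List the GPUs visible to PyTorch before choosing --tensor-parallel-size
gpu_count = torch.cuda.device_count()
print(f"Visible GPUs: {gpu_count}")
for i in range(gpu_count):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")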
Using Custom Models
To use a local model, specify the path instead of a Hugging Face ID:
python -m vllm.entrypoints.api_server --model /path/to/model
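If the weights are not on disk yet, one option is to download them first with the huggingface_hub package (a separate install, not required by vLLM itself) and then pass the resulting directory to --model. Note that gated models such as Llama 2 require accepting the license on Hugging Face and authenticating with a token. A minimal sketch:
from huggingface_hub import snapshot_download

# Download the weights into the local Hugging Face cache and get their path
local_path = snapshot_download(repo_id="meta-llama/Llama-2-7b-chat-hf")

print("Serve the local copy with:")
print(f"python -m vllm.entrypoints.api_server --model {local_path}")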
Adjusting Batch Size and Throughput
You can tweak performance using these flags:
--max-num-batched-tokens 2048 # Controls the max tokens per batch
--gpu-memory-utilization 0.8 # Adjusts GPU memory usage
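To see how these settings affect throughput, you can time a batch of concurrent requests against the running server. The script below is a rough sketch rather than a rigorous benchmark, and it assumes the local /generate endpoint used earlier:
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/generate"
PROMPTS = [f"Write a haiku about topic {i}." for i in range(32)]

def send(prompt):
    # Each request is grouped with the others by vLLM's continuous batching
    return requests.post(URL, json={"prompt": prompt, "max_tokens": 64}).json()

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(send, PROMPTS))
elapsed = time.time() - start

print(f"Completed {len(results)} requests in {elapsed:.1f}s "
      f"({len(results) / elapsed:.2f} req/s)")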
Conclusion
vLLM provides an efficient and scalable solution for serving large language models. By optimizing GPU memory, enabling continuous batching, and supporting multi-GPU inference, vLLM significantly improves the performance of LLM deployments.
For further customization, refer to the official documentation.