[https://github.com/vllm-project/vllm vLLM] is a community-driven project that provides high-throughput and memory-efficient inference and serving for large language models (LLMs). It supports various decoding algorithms, quantization methods, parallelism strategies, and models from Hugging Face and other sources.

= Installation =

==Latest available wheels==
To see the latest version of vLLM that we have built:
{{Command|avail_wheels "vllm"}}
For more information, see [[Python#Available_wheels |Available wheels]].

==Installing our wheel==
The preferred option is to install it using the Python [https://pythonwheels.com/ wheel] as follows:

:1. Load the dependencies, namely the Python and OpenCV [[Utiliser_des_modules/en#Sub-command_load|modules]].
{{Command|module load opencv/4.11 python/3.12}}

:2. Create and start a temporary [[Python#Creating_and_using_a_virtual_environment|virtual environment]].
{{Commands
|virtualenv --no-download ~/vllm_env
|source ~/vllm_env/bin/activate
}}

:3. Install vLLM and its Python dependencies in the virtual environment.
{{Commands
|prompt=(vllm_env) [name@server ~]
|pip install --no-index --upgrade pip
|pip install --no-index vllm{{=}}{{=}}X.Y.Z
}}
where X.Y.Z is the exact desired version, for instance 0.8.4. If you omit the version, the latest one available from the wheelhouse is installed.

:4. Freeze the environment and its set of requirements.
{{Command
|prompt=(vllm_env) [name@server ~]
|pip freeze > ~/vllm-requirements.txt
}}

:5. Deactivate the environment.
{{Command
|prompt=(vllm_env) [name@server ~]
|deactivate
}}

:6. Clean up and remove the virtual environment.
{{Command
|rm -r ~/vllm_env
}}

= Job submission =

== Before submitting a job: Downloading models ==

Models loaded for inference with vLLM will typically come from the [https://huggingface.co/docs/hub/models-the-hub Hugging Face Hub]. The following is an example of how to use the Hugging Face command-line tool to download a model. Note that models must be downloaded on a login node, so that compute resources do not sit idle while the download completes. Also note that models are cached by default at $HOME/.cache/huggingface/hub. For more information on how to change the default cache location, as well as other means of downloading models, please see our article on the [[Huggingface|Hugging Face ecosystem]].

{{Commands
|module load python/3.12
|virtualenv --no-download temp_env && source temp_env/bin/activate
|pip install --no-index huggingface_hub
|huggingface-cli download facebook/opt-125m
|rm -r temp_env
}}

== Single Node ==

The following is an example of how to submit a job that performs inference on a model split across 2 GPUs. If your model '''fits entirely inside one GPU''', change the Python script below to call LLM() without extra arguments. This example '''assumes you have pre-downloaded''' the model facebook/opt-125m as described in the previous section.

{{File
|name=vllm-example.sh
|lang="bash"
|contents=
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=2
#SBATCH --cpus-per-task=2
#SBATCH --mem=32000M
#SBATCH --time=0-00:05
#SBATCH --output=%N-%j.out

module load python/3.12 gcc opencv/4.11

virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install -r vllm-requirements.txt --no-index

python vllm-example.py
}}

{{File
|name=vllm-example.py
|lang="python"
|contents=
from vllm import LLM

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Set "tensor_parallel_size" to the number of GPUs in your job.
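# If your model fits entirely inside one GPU, you can instead create the engine
# without extra arguments (single-GPU variant, shown here for illustration only):
# llm = LLM(model="facebook/opt-125m")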
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)

outputs = llm.generate(prompts)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
}}

== Multiple Nodes ==

The following example revisits the single-node example above, but splits the model across 4 GPUs over 2 separate nodes, i.e., 2 GPUs per node. Currently, vLLM relies on [[Ray]] to manage splitting models over multiple nodes. The code example below contains the necessary steps to start a [[Ray#Multiple_Nodes|multi-node Ray cluster]] and run vLLM on top of it:

{{File
|name=vllm-multinode-example.sh
|lang="bash"
|contents=
#!/bin/bash
#SBATCH --nodes 2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=2
#SBATCH --cpus-per-task=6
#SBATCH --mem=32000M
#SBATCH --time=0-00:10
#SBATCH --output=%N-%j.out

## Create a virtualenv and install Ray on all nodes ##
module load gcc python/3.12 arrow/19 opencv/4.11
srun -N $SLURM_NNODES -n $SLURM_NNODES config_env.sh

export HEAD_NODE=$(hostname --ip-address) # store the head node's address
export RAY_PORT=34567 # choose a port to start Ray on the head node

## Set Hugging Face libraries to OFFLINE mode ##
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

source $SLURM_TMPDIR/ENV/bin/activate

## Start the Ray cluster head node ##
ray start --head --node-ip-address=$HEAD_NODE --port=$RAY_PORT --num-cpus=$SLURM_CPUS_PER_TASK --num-gpus=2 --block &
sleep 10

## Launch Ray workers on all the other nodes allocated by the job ##
srun launch_ray.sh &
ray_cluster_pid=$!
sleep 10

VLLM_HOST_IP=$(hostname --ip-address) python vllm_example.py

## Shut down the Ray workers after the Python script exits ##
kill $ray_cluster_pid
}}

Where the script config_env.sh is:

{{File
|name=config_env.sh
|lang="bash"
|contents=
#!/bin/bash

module load python/3.12 gcc opencv/4.11 arrow/19

virtualenv --no-download $SLURM_TMPDIR/ENV
source $SLURM_TMPDIR/ENV/bin/activate

pip install --upgrade pip --no-index
pip install ray -r vllm-requirements.txt --no-index

deactivate
}}

The script launch_ray.sh is:

{{File
|name=launch_ray.sh
|lang="bash"
|contents=
#!/bin/bash

if [[ "$SLURM_PROCID" -eq "0" ]]; then
    echo "Ray head node already started..."
    sleep 10
else
    export VLLM_HOST_IP=$(hostname --ip-address)
    ray start --address "${HEAD_NODE}:${RAY_PORT}" --num-cpus="${SLURM_CPUS_PER_TASK}" --num-gpus=2 --block
    sleep 5
    echo "Ray worker started!"
fi
}}

And finally, the script vllm_example.py is:

{{File
|name=vllm_example.py
|lang="python"
|contents=
from vllm import LLM

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Set "tensor_parallel_size" to the TOTAL number of GPUs across all nodes.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=4)

outputs = llm.generate(prompts)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
}}
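
Both examples above are ordinary Slurm batch scripts. Because they reference vllm-requirements.txt and the helper scripts by relative paths, submit them from the directory that contains those files, for example:
{{Command|sbatch vllm-multinode-example.sh}}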
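
The Python examples above use vLLM's default generation settings. As a side note on the decoding options mentioned in the introduction, generate() also accepts a SamplingParams object to control, for instance, the temperature, nucleus sampling and output length. The snippet below is a minimal sketch rather than part of the job scripts above; the file name is only illustrative, and it assumes the pre-downloaded facebook/opt-125m model running on a single GPU.

{{File
|name=vllm-sampling-example.py
|lang="python"
|contents=
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# Control decoding: temperature and top_p enable sampling, max_tokens caps the output length.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Single-GPU engine; add tensor_parallel_size as in the examples above if needed.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
}}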