Running HuggingFace models¶
The LLM Mesh supports locally-running HuggingFace transformers models, such as Mistral, Llama3, Falcon, or smaller task-specific models.
Cautions¶
Running large-scale HuggingFace models locally requires a complex and very costly setup, and both quality and performance tend to be below those of proprietary LLM APIs. We strongly recommend that you run your first LLM experiments using hosted LLM APIs. The vast majority of LLM API providers give strong guarantees that they do not reuse your data to train their models.
Pre-requisites¶
In order to run local HuggingFace models, you will need:
A running Elastic AI Kubernetes cluster with NVIDIA GPUs
Smaller models, such as Falcon-7B or Llama2-7B, work with an A10 GPU with 24 GB of VRAM. Single-GPU nodes are recommended.
Larger models require multi-GPU nodes. For instance, Mixtral-8x7B or Llama2-70B work with 2 A100 GPUs with 80 GB of VRAM each.
The GPU compute capability must be >= 7.0 and the NVIDIA driver version must be >= 535
A fully set up Elastic AI computation capability, with AlmaLinux 8 base images (i.e. by adding --distrib almalinux8 when building images)
Python 3.9 setup
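The GPU requirements above can be checked on a node with a short script. The following is a minimal sketch, not an official Dataiku tool: it assumes that nvidia-smi is installed on the node and that the installed version supports the compute_cap query field. The function names are hypothetical, chosen for illustration.

```python
import subprocess

# Thresholds taken from the pre-requisites above.
MIN_COMPUTE_CAPABILITY = (7, 0)
MIN_DRIVER_MAJOR = 535


def parse_gpu_line(line):
    """Parse one 'name, compute_cap, driver_version' CSV line from nvidia-smi."""
    name, compute_cap, driver = [field.strip() for field in line.split(",")]
    major, minor = compute_cap.split(".")
    return {
        "name": name,
        "compute_capability": (int(major), int(minor)),
        "driver_major": int(driver.split(".")[0]),
    }


def meets_requirements(gpu):
    """Return True if the GPU satisfies both minimum versions."""
    return (
        gpu["compute_capability"] >= MIN_COMPUTE_CAPABILITY
        and gpu["driver_major"] >= MIN_DRIVER_MAJOR
    )


def query_gpus():
    """Query the local GPUs (assumption: nvidia-smi is on the PATH)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,compute_cap,driver_version",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]
```

On a GPU node, calling `[meets_requirements(g) for g in query_gpus()]` gives one boolean per GPU.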
Note
The base image and Python 3.9 requirements are automatically fulfilled when using Dataiku Cloud Stacks. You do not need any additional setup for container images, as long as you have not customized the base images.
If you require assistance with the cluster setup, please reach out to your Dataiku Technical Account Manager or Customer Success Manager.
Note
For air-gapped instances, you will need to import the HuggingFace model manually into DSS’s model cache and enable DSS’s model cache in the HuggingFace connection. Refer to the model cache documentation.
To run models such as Falcon 7B or Llama2 7B, you will need A10 GPUs with 24 GB of VRAM. We recommend single-GPU nodes.
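As a rough guide to GPU sizing, the memory needed for the model weights alone can be estimated from the parameter count. The sketch below assumes 16-bit weights (2 bytes per parameter) and ignores the KV cache, activations, and framework overhead, so real requirements are higher. The function names are hypothetical.

```python
def weights_gb(params_billions, bytes_per_param=2):
    """Approximate size of the model weights in GB.

    Assumption: 2 bytes per parameter (16-bit weights).
    """
    return params_billions * bytes_per_param


def weights_fit(params_billions, gpu_count, vram_gb_per_gpu, bytes_per_param=2):
    """Return True if the weights alone fit in the combined VRAM of the node."""
    total_vram_gb = gpu_count * vram_gb_per_gpu
    return weights_gb(params_billions, bytes_per_param) <= total_vram_gb


print(weights_gb(7))            # 7B model: about 14 GB of weights
print(weights_gb(70))           # 70B model: about 140 GB of weights
print(weights_fit(70, 2, 80))   # two 80 GB GPUs
print(weights_fit(70, 1, 80))   # a single 80 GB GPU
```

Under these assumptions, a 70B model does not fit on a single 80 GB GPU, which is consistent with the multi-GPU recommendation for larger models.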
Create a containerized execution config¶
Create a new containerized execution config.
In “custom limits”, add nvidia.com/gpu with value 1.
If you are using multi-GPU nodes, you can set a higher value. In any case, we strongly recommend using 1, 2, 4, or 8 GPUs in order to maximize compatibility with vLLM’s tensor parallelism constraints.
Enable “Increase shared memory size”, without setting a specific value.
Do not set memory or CPU requests or limits. Since the GPU is not shared, each node will only accommodate a single container anyway.
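The GPU count guidance above can be expressed as a small validation helper. This is an illustrative sketch only: the helper names are hypothetical, and the divisibility check reflects our understanding that tensor parallelism splits a model's attention heads evenly across GPUs, which is an assumption not stated in this page.

```python
# GPU counts recommended above for compatibility with tensor parallelism.
RECOMMENDED_GPU_COUNTS = (1, 2, 4, 8)


def gpu_custom_limit(gpu_count):
    """Build the "custom limits" entry for a given number of GPUs."""
    if gpu_count not in RECOMMENDED_GPU_COUNTS:
        raise ValueError(
            f"{gpu_count} GPUs is not one of the recommended counts "
            f"{RECOMMENDED_GPU_COUNTS}"
        )
    return {"nvidia.com/gpu": str(gpu_count)}


def heads_split_evenly(num_attention_heads, gpu_count):
    """Assumed constraint: attention heads must divide evenly across GPUs."""
    return num_attention_heads % gpu_count == 0


print(gpu_custom_limit(2))
print(heads_split_evenly(32, 4))  # a hypothetical model with 32 heads on 4 GPUs
print(heads_split_evenly(32, 3))  # the same model on 3 GPUs
```

Powers of two divide the head counts of most common architectures, which is one plausible reason for the recommended values.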
Create a code env¶
Create a new Python 3.9 code env
In “Packages to install”:
click “Add sets of packages”, select “Local HuggingFace models” and click “Add”
In “Containerized execution”:
Select “Build for”: “All container configurations”
In “Container runtime additions”, click “Add”, and in “Type”, select “GPU support for Torch 2”
Click “Save and update” (this will take at least 10-20 minutes)
Create a HuggingFace connection¶
In Connections, create a new Local HuggingFace connection
Enter connection name
In “Containerized execution”, enter the name of the containerized execution config you just created
In “code environment name”, enter the name of the code environment you just created
Create the connection
If you want to use Llama2, you must have a HuggingFace account in which Llama2 has been approved. Enter an access token.
We recommend disabling “Use DSS-managed model cache” if your containers have good outgoing Internet connectivity, as downloading the models directly from HuggingFace will be faster.
Note
The LLM Mesh automatically selects the LLM inference engine. It uses vLLM by default if the model and runtime environment are compatible; otherwise, it uses transformers.
You can manually override this default behavior in the HuggingFace connection settings (Advanced tuning > Custom Properties). To do so, add a new property engine.completion and set its value to TRANSFORMERS, VLLM or AUTO (default, recommended unless you experience issues with the automatic engine selection).
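The selection rule described in this note can be summarized in a few lines. The sketch below only illustrates the documented behavior and is not DSS's actual implementation. The function name and the vllm_compatible flag are hypothetical stand-ins for the compatibility check that the LLM Mesh performs internally.

```python
VALID_ENGINE_SETTINGS = ("AUTO", "VLLM", "TRANSFORMERS")


def resolve_engine(setting="AUTO", vllm_compatible=True):
    """Return the engine used for a given engine.completion value.

    vllm_compatible is a hypothetical flag standing in for the check of
    the model and runtime environment.
    """
    setting = setting.upper()
    if setting not in VALID_ENGINE_SETTINGS:
        raise ValueError(f"Unknown engine.completion value: {setting}")
    if setting == "AUTO":
        # Default behavior: prefer vLLM, fall back to transformers.
        return "VLLM" if vllm_compatible else "TRANSFORMERS"
    # An explicit value overrides the automatic selection.
    return setting


print(resolve_engine())                         # VLLM
print(resolve_engine(vllm_compatible=False))    # TRANSFORMERS
print(resolve_engine("TRANSFORMERS"))           # TRANSFORMERS
```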
On Dataiku Cloud¶
To access this feature, you need to activate the GPU on your instance. Contact us to learn more about our GPU offering.
To run local HuggingFace models, you will need to:
Activate your GPU extension in the Launchpad.
Create a Python 3.9 code env with the “Local HuggingFace models” set of packages and the option “GPU support for Torch 2” activated.
Create a HuggingFace connection in the Launchpad and link it to the code environment you created.
Test¶
In a project, create a new Prompt Studio (from the green menu). Create a new single prompt. In the LLM dropdown, choose for example Falcon 7B, and click Run.
The model is downloaded, and a container starts, which requires pulling the image to the GPU machine. The first run can take 10-15 minutes (subsequent runs will be faster).
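The same test can also be run from code through the LLM Mesh Python API instead of Prompt Studio. The following is a minimal sketch under stated assumptions: it is meant to run inside DSS (for example in a notebook), the run_prompt helper is hypothetical, and the LLM id is a placeholder that you need to replace with an id from your own instance.

```python
def run_prompt(llm, prompt):
    """Send a single-message prompt to an LLM Mesh handle and return the text."""
    completion = llm.new_completion()
    completion.with_message(prompt)
    response = completion.execute()
    if not response.success:
        raise RuntimeError("The completion request failed")
    return response.text


# Usage inside DSS (the id below is a placeholder, not a real value):
#
#   import dataiku
#   project = dataiku.api_client().get_default_project()
#   for item in project.list_llms():
#       print(item.id, item.description)   # find the id of your local model
#   llm = project.get_llm("REPLACE_WITH_YOUR_LLM_ID")
#   print(run_prompt(llm, "Say hello in one sentence."))
```

As with Prompt Studio, expect the first call to take several minutes while the model is downloaded and the container starts.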