Running Hugging Face models¶
The LLM Mesh supports locally-running Hugging Face transformers models, such as Mistral, Llama3, Falcon, or smaller task-specific models.
Cautions¶
Running large-scale Hugging Face models locally is a complex and very costly setup, and both quality and performance tend to be below those of proprietary LLM APIs. We strongly recommend that you run your first LLM experiments using Hosted LLM APIs. The vast majority of LLM API providers offer strong guarantees that they will not reuse your data to train their models.
Pre-requisites¶
In order to run local Hugging Face models, you will need:
A running Elastic AI Kubernetes cluster with NVIDIA GPUs
Smaller models, such as Falcon-7B or Llama2-7B, work with an A10 GPU with 24 GB of VRAM. Single-GPU nodes are recommended.
Larger models require multi-GPU nodes. For instance, Mixtral-8x7B or Llama2-70B work with 2 A100 GPUs with 80 GB of VRAM each.
The GPU compute capability must be >= 7.0 and the NVIDIA driver version must be >= 535.
A fully set up Elastic AI computation capability, with AlmaLinux 8 base images (i.e. by adding --distrib almalinux8 when building images)
A Python 3.9 setup
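As a rough sanity check on the sizing guidance above, the arithmetic can be sketched in Python. This is a minimal sketch, not a Dataiku tool: the function names are hypothetical, the estimate assumes 16-bit weights (2 bytes per parameter), and it ignores the KV cache, activations, and framework overhead, which all come on top. The driver version strings in the usage note are made-up examples.

```python
# Thresholds taken from the prerequisites above.
MIN_COMPUTE_CAPABILITY = (7, 0)
MIN_DRIVER_MAJOR = 535


def weights_vram_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the model weights alone, in GB (10^9 bytes).

    Assumption: 16-bit weights by default (2 bytes per parameter).
    """
    return n_params_billion * bytes_per_param


def weights_fit(n_params_billion: float, n_gpus: int, vram_per_gpu_gb: float) -> bool:
    """True if the 16-bit weights fit in the combined VRAM of one node."""
    return weights_vram_gb(n_params_billion) < n_gpus * vram_per_gpu_gb


def gpu_is_supported(compute_capability: str, driver_version: str) -> bool:
    """Compare a GPU's reported versions with the documented minimums.

    Both arguments are dotted version strings, e.g. "8.6" and "535.104.05".
    """
    major, minor = (int(part) for part in compute_capability.split("."))
    driver_major = int(driver_version.split(".")[0])
    return (major, minor) >= MIN_COMPUTE_CAPABILITY and driver_major >= MIN_DRIVER_MAJOR
```

For example, `weights_fit(7, 1, 24)` holds because a 7B model needs about 14 GB for its weights, and `weights_fit(70, 2, 80)` holds because 140 GB fits in 2 x 80 GB, whereas `weights_fit(70, 1, 80)` does not. `gpu_is_supported("8.6", "535.104.05")` passes, while a compute capability of "6.1" or a driver in the 525 series would not.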
Note
The base image and Python 3.9 requirements are automatically fulfilled when using Dataiku Cloud Stacks. You do not need any additional setup for container images, as long as you have not customized the base images.
If you require assistance with the cluster setup, please reach out to your Dataiku Technical Account Manager or Customer Success Manager.
Note
For air-gapped instances, you will need to import the Hugging Face model manually to DSS’s model cache and enable DSS’s model cache in the Hugging Face connection. Refer to the model cache documentation.
Create a containerized execution config¶
Create a new containerized execution config.
In “custom limits”, add nvidia.com/gpu with value 1.
If you are using multi-GPU nodes, you can set a higher value. In any case, it is strongly recommended to use 1, 2, 4, or 8 GPUs in order to maximize compatibility with vLLM’s tensor parallelism constraints.
Enable “Increase shared memory size”, without setting a specific value.
Do not set memory or CPU requests or limits. Each node will only accommodate a single container anyway, since the GPU is not shared.
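The GPU count recommendation above can be illustrated with a short Python sketch. Tensor parallelism in vLLM generally requires the number of attention heads to be divisible by the number of GPUs, which is why powers of two are the safe choice. The helper names below are hypothetical, and the returned dictionary only mirrors the key/value pair entered in “custom limits”; it is not a DSS API.

```python
RECOMMENDED_GPU_COUNTS = (1, 2, 4, 8)


def gpu_custom_limits(n_gpus: int = 1) -> dict:
    """Build the custom limit entry for a containerized execution config.

    Raises ValueError for GPU counts outside the recommended set.
    """
    if n_gpus not in RECOMMENDED_GPU_COUNTS:
        raise ValueError(
            f"{n_gpus} GPUs requested; use one of {RECOMMENDED_GPU_COUNTS}"
        )
    # "nvidia.com/gpu" is the resource name used in the step above.
    return {"nvidia.com/gpu": str(n_gpus)}


def heads_split_evenly(num_attention_heads: int, n_gpus: int) -> bool:
    """True if the attention heads can be divided evenly across the GPUs."""
    return num_attention_heads % n_gpus == 0
```

For a model with 32 attention heads, `heads_split_evenly(32, 4)` is true while `heads_split_evenly(32, 3)` is false, so a 3-GPU node would be a poor fit even if the weights fit in memory.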
Create a code env¶
In “Administration > Settings > Misc”, in the “Local Hugging Face models code environment” section, select a Python interpreter in the list and click “Create code environment”
Once your code env is created, go to the code env settings. In “Containerized execution”, select “Build for”: “All container configurations” (or select relevant, e.g. GPU-enabled, container configurations).
Click “Save and update” (this will take at least 10-20 minutes)
Create a Hugging Face connection¶
In “Administration > Connections”, create a new “Local Hugging Face” connection
Enter connection name
In “Containerized execution configuration”, enter the name of the containerized execution config you just created
In “Code environment”, select “Use internal code env”
Create the connection
If you want to use Llama or Mistral models, you must have a Hugging Face account for which access to these models has been approved. Enter an access token for that account.
We recommend disabling “Use DSS-managed model cache” if your containers have good outgoing Internet connectivity, as it will be faster to download the models directly from Hugging Face.
Note
The LLM Mesh automatically selects the LLM inference engine. It uses vLLM by default if the model and runtime environment are compatible; otherwise, it uses transformers.
You can manually override this default behavior in the Hugging Face connection settings (Advanced tuning > Custom Properties). To do so, add a new property engine.completion and set its value to TRANSFORMERS, VLLM or AUTO (default, recommended unless you experience issues with the automatic engine selection).
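The selection rule described in this note can be summarized with a small Python sketch. It is an illustration of the documented behavior only, not the LLM Mesh implementation: the `Engine` enum, the `resolve_engine` function, and the `vllm_compatible` flag are hypothetical names, and how compatibility is actually determined is not covered here.

```python
from enum import Enum


class Engine(str, Enum):
    """Accepted values of the engine.completion custom property."""

    TRANSFORMERS = "TRANSFORMERS"
    VLLM = "VLLM"
    AUTO = "AUTO"


def resolve_engine(setting: str = "AUTO", vllm_compatible: bool = True) -> Engine:
    """Return the engine that would serve completions for a given setting.

    An explicit TRANSFORMERS or VLLM setting is used as-is. AUTO picks vLLM
    when the model and runtime are compatible, and transformers otherwise.
    """
    requested = Engine(setting)  # raises ValueError for any other value
    if requested is not Engine.AUTO:
        return requested
    return Engine.VLLM if vllm_compatible else Engine.TRANSFORMERS
```

With the default setting, `resolve_engine(vllm_compatible=False)` falls back to transformers, while `resolve_engine("TRANSFORMERS")` forces that engine even when vLLM would have been compatible.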
On Dataiku Cloud¶
To access this feature you will need to activate the GPU on your instance. Contact us to know more about our GPU offering.
In order to run local Hugging Face models, you will need to:
Activate your GPU extension in the Launchpad,
Create a Python 3.9 code env with the “Local Hugging Face models” set of packages and the containerized execution option “GPU support for Torch 2” activated,
Create a Hugging Face connection in the Launchpad and link it to the code environment you created.
Test¶
In a project, create a new Prompt Studio (from the green menu). Create a new single prompt. In the LLM dropdown, choose for example Falcon 7B, and click Run.
The model is downloaded, and a container starts, which requires pulling the image to the GPU machine. The first run can take 10-15 minutes (subsequent runs will be faster).
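The same smoke test can also be run from code inside DSS, for example from a notebook. The following is a minimal sketch based on the Dataiku Python API for the LLM Mesh; the `run_prompt` and `main` helpers are hypothetical, and the LLM id is a placeholder that you must replace with the id of a model exposed by your own Hugging Face connection.

```python
def run_prompt(llm, prompt: str) -> str:
    """Send a single prompt to an LLM Mesh handle and return the generated text."""
    completion = llm.new_completion()
    completion.with_message(prompt)
    response = completion.execute()
    if not response.success:
        raise RuntimeError("The completion query failed")
    return response.text


def main() -> None:
    # Only works inside DSS, where the `dataiku` package is available.
    import dataiku

    # Placeholder: copy the real LLM id from your own instance.
    llm_id = "REPLACE_WITH_YOUR_LLM_ID"
    llm = dataiku.api_client().get_default_project().get_llm(llm_id)
    print(run_prompt(llm, "Explain in one sentence what a GPU is."))
```

Calling `main()` in a DSS notebook triggers the same model download and container startup as the Prompt Studio test, so the first call can take just as long.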