Running Hugging Face models

The LLM Mesh supports locally-running Hugging Face transformers models, such as Mistral, Llama3, Falcon, or smaller task-specific models.

Cautions

Running large-scale Hugging Face models locally is a complex and very costly setup, and both quality and performance tend to be below those of proprietary LLM APIs. We strongly recommend running your first LLM experiments with hosted LLM APIs. The vast majority of LLM API providers give strong guarantees that they will not reuse your data to train their models.

Pre-requisites

In order to run local Hugging Face models, you will need:

  • A running Elastic AI Kubernetes cluster with NVIDIA GPUs

    • Smaller models, such as Falcon-7B or Llama2-7B, work with an A10 GPU with 24 GB of VRAM. Single-GPU nodes are recommended.

    • Larger models require multi-GPU nodes. For instance, Mixtral-8x7B or Llama2-70B work with 2 A100 GPUs with 80 GB of VRAM each.

    • Large image generation models such as FLUX.1-schnell require GPUs with 40 GB of VRAM. Local image generation LLM Mesh models do not benefit from multi-GPU setups or quantization.

  • GPU compute capability must be >= 7.0, and the NVIDIA driver version must be >= 535

  • A fully set up Elastic AI computation capability, with AlmaLinux 8 base images (i.e. by adding --distrib almalinux8 when building images)

  • Python 3.9 setup
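
As a rough way to check the GPU prerequisites above on a node, the sketch below parses the output of nvidia-smi. It assumes nvidia-smi is on the PATH and supports the compute_cap query field. The helper names (parse_gpu_line, meets_requirements, query_gpus) are illustrative and not part of DSS.

```python
import subprocess

# Thresholds taken from the prerequisites listed above.
MIN_COMPUTE_CAPABILITY = (7, 0)
MIN_DRIVER_MAJOR = 535

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = "name,compute_cap,driver_version,memory.total"


def parse_gpu_line(line):
    """Parse one CSV line such as 'NVIDIA A10, 8.6, 535.104.05, 23028'."""
    name, cap, driver, mem_mib = [field.strip() for field in line.split(",")]
    major, minor = cap.split(".")
    return {
        "name": name,
        "compute_capability": (int(major), int(minor)),
        "driver_major": int(driver.split(".")[0]),
        "vram_gib": int(mem_mib) / 1024,  # memory.total is reported in MiB
    }


def meets_requirements(gpu):
    """True if compute capability and driver version satisfy the minimums."""
    return (
        gpu["compute_capability"] >= MIN_COMPUTE_CAPABILITY
        and gpu["driver_major"] >= MIN_DRIVER_MAJOR
    )


def query_gpus():
    """Run nvidia-smi on the current machine and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]
```

Calling query_gpus() on a GPU node and passing each entry to meets_requirements gives a quick pass/fail. The vram_gib value can then be compared against the model sizes listed above.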

Note

The base image and Python 3.9 requirements are automatically fulfilled when using Dataiku Cloud Stacks. You do not need any additional setup for container images, as long as you have not customized the base images.

If you require assistance with the cluster setup, please reach out to your Dataiku Technical Account Manager or Customer Success Manager.

Note

For air-gapped instances, you will need to import the Hugging Face model manually to DSS’s model cache and enable DSS’s model cache in the Hugging Face connection. Refer to the model cache documentation.

Create a containerized execution config

  • Create a new containerized execution config.

  • In “custom limits”, add nvidia.com/gpu with value 1.

    • If you are using multi-GPU nodes, you can set a higher value. In any case, it is strongly recommended to use 1, 2, 4, or 8 GPUs in order to maximize compatibility with vLLM’s tensor parallelism constraints.

  • Enable “Increase shared memory size”, without setting a specific value.

Do not set memory or CPU requests or limits. Each node will only accommodate a single container anyway, since the GPU is not shared.
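
One reason for preferring 1, 2, 4, or 8 GPUs is that tensor parallelism splits a model's attention heads evenly across GPUs, so the GPU count generally has to divide the number of attention heads. The sketch below illustrates that kind of check. It is an assumption about the constraint involved, not a reproduction of vLLM's actual validation, and usable_gpu_counts is a hypothetical helper.

```python
def usable_gpu_counts(num_attention_heads, candidates=(1, 2, 4, 8)):
    """Return the candidate GPU counts that evenly divide the attention heads.

    Hypothetical helper: it only models the "heads divisible by GPU count"
    rule and ignores memory and any other constraint.
    """
    if num_attention_heads <= 0:
        raise ValueError("num_attention_heads must be positive")
    return [n for n in candidates if num_attention_heads % n == 0]


# A model with 64 attention heads can be split across 1, 2, 4, or 8 GPUs.
print(usable_gpu_counts(64))

# A model with 28 attention heads cannot be split evenly across 8 GPUs.
print(usable_gpu_counts(28))
```

The value chosen for nvidia.com/gpu in “custom limits” would then be one of the counts returned for the model you plan to run.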

Create a code env

  • In “Administration > Code envs > Internal envs setup”, in the “Local Hugging Face models code environment” section, select a Python interpreter in the list and click “Create code environment”

  • Once your code env is created, go to the code env settings. In “Containerized execution”, select “Build for”: “All container configurations” (or select relevant, e.g. GPU-enabled, container configurations).

  • Click “Save and update” (this will take at least 10-20 minutes)

Create a Hugging Face connection

  • In “Administration > Connections”, create a new “Local Hugging Face” connection

  • Enter a connection name

  • In “Containerized execution configuration”, enter the name of the containerized execution config you just created

  • In “Code environment”, select “Use internal code env”

  • Create the connection

If you want to use Llama or Mistral models, you must have a Hugging Face account for which access to these models has been approved. Enter an access token for this account in the connection.

We recommend disabling “Use DSS-managed model cache” if your containers have good outgoing Internet connectivity, as it will be faster to download the models directly from Hugging Face.

Note

For text generation, the LLM Mesh automatically selects the LLM inference engine. It uses vLLM by default if the model and runtime environment are compatible; otherwise it uses transformers.

You can manually override this default behavior in the Hugging Face connection settings (Advanced tuning > Custom Properties). To do so, add a new property engine.completion and set its value to TRANSFORMERS, VLLM or AUTO (default, recommended unless you experience issues with the automatic engine selection).
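
The selection rule described in this note can be sketched as follows. This only illustrates the documented behavior and is not DSS's implementation: resolve_engine and the vllm_compatible flag are hypothetical, and how DSS determines compatibility is not modeled.

```python
# Values accepted by the engine.completion custom property, per the note above.
VALID_ENGINES = ("AUTO", "VLLM", "TRANSFORMERS")


def resolve_engine(setting="AUTO", vllm_compatible=True):
    """Return the engine that would be used for a given engine.completion value.

    vllm_compatible stands in for "the model and runtime environment are
    compatible with vLLM"; it is an assumption made for this sketch.
    """
    if setting not in VALID_ENGINES:
        raise ValueError(f"engine.completion must be one of {VALID_ENGINES}")
    if setting == "AUTO":
        # Default behavior: prefer vLLM, fall back to transformers.
        return "VLLM" if vllm_compatible else "TRANSFORMERS"
    # An explicit value overrides the automatic selection.
    return setting
```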

On Dataiku Cloud

To access this feature, you will need to activate the GPU on your instance. Contact us to learn more about our GPU offering.

To run local Hugging Face models on Dataiku Cloud, you will need to:

  • Activate your GPU extension in the Launchpad,

  • Create a Python 3.9 code env with the “Local Hugging Face models” set of packages and the containerized execution option “GPU support for Torch 2” activated,

  • Create a Hugging Face connection in the Launchpad and link the code environment you created.

Test

Text generation

In a project, create a new Prompt Studio (from the green menu). Create a new single prompt. In the LLM dropdown, choose for example Falcon 7B, and click Run.

The model is downloaded, and a container starts, which requires pulling the image to the GPU machine. The first run can take 10-15 minutes (subsequent runs will be faster).
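
The same test can also be run from a Python notebook or recipe inside DSS. The sketch below assumes the LLM Mesh methods of the dataiku package as described in the Developer Guide (list_llms, get_llm, new_completion, with_message, execute). The helper names and the LLM id placeholder are illustrative, so check the exact API against the Developer Guide for your DSS version.

```python
def default_project():
    """Return a handle on the current project (only works inside DSS)."""
    import dataiku  # provided by DSS in notebooks and recipes

    return dataiku.api_client().get_default_project()


def list_llm_ids(project):
    """List the ids of the LLMs that the project can use."""
    return [llm.id for llm in project.list_llms()]


def run_prompt(project, llm_id, prompt):
    """Send a single prompt to the given LLM and return the generated text."""
    llm = project.get_llm(llm_id)
    completion = llm.new_completion()
    completion.with_message(prompt)
    response = completion.execute()
    if not response.success:
        raise RuntimeError("The completion query did not succeed")
    return response.text


# Inside DSS, usage would look like this (the id is a placeholder to fill in
# with one of the values returned by list_llm_ids):
#   project = default_project()
#   print(list_llm_ids(project))
#   print(run_prompt(project, "<id of your local Hugging Face LLM>", "Say hello"))
```

As with Prompt Studio, the first call can take several minutes while the model is downloaded and the container starts.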

Image generation

The image generation capabilities are only available through the Dataiku DSS API. See the Developer Guide for tutorials using this feature.

Image-to-image mode is not available using local Hugging Face models.
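
A minimal text-to-image sketch through the Python API is shown below. It assumes the dataiku package exposes new_images_generation, with_prompt, execute, and first_image as described in the Developer Guide. The helper name generate_image and the LLM id placeholder are illustrative, so verify the calls against the Developer Guide before relying on them.

```python
def generate_image(project, llm_id, prompt, out_path):
    """Generate one image from a text prompt and write it to out_path.

    Returns the number of bytes written. `project` is a DSS project handle,
    for example dataiku.api_client().get_default_project() inside DSS.
    """
    llm = project.get_llm(llm_id)
    generation = llm.new_images_generation()
    generation.with_prompt(prompt)
    response = generation.execute()
    if not response.success:
        raise RuntimeError("The image generation query did not succeed")
    image_bytes = response.first_image()  # raw image bytes (assumed default)
    with open(out_path, "wb") as f:
        f.write(image_bytes)
    return len(image_bytes)


# Inside DSS, usage would look like this (the id is a placeholder):
#   import dataiku
#   project = dataiku.api_client().get_default_project()
#   generate_image(project, "<id of your image generation LLM>",
#                  "A watercolor lighthouse at dusk", "/tmp/lighthouse.png")
```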