Running HuggingFace models

The LLM Mesh supports running HuggingFace transformers models locally, such as Falcon, Llama2, Dolly 2, and smaller task-specific models.

Cautions

Running large HuggingFace models locally is a complex and very costly setup, and both quality and performance tend to be below those of proprietary LLM APIs. We strongly recommend that you make your first experiments with LLMs using Hosted LLM APIs. The vast majority of LLM API providers offer strong guarantees that your data will not be reused to train their models.

Prerequisites

In order to run local HuggingFace models, you will need:

  • A running Elastic AI Kubernetes cluster with NVIDIA GPUs (including proper installation of the NVIDIA driver)

  • A fully set up Elastic AI computation capability, with AlmaLinux 8 base images (i.e. by adding --distrib almalinux8 when building images)

  • A Python 3.9 setup

  • A setup with full outgoing Internet connectivity for downloading the models. Air-gapped setups are not supported.

Note

The base image and Python 3.9 requirements are automatically fulfilled when using Dataiku Cloud Stacks: you do not need any additional setup for container images, as long as you have not customized the base images.

If you require assistance with the cluster setup, please reach out to your Dataiku Technical Account Manager or Customer Success Manager.

For running models such as Falcon 7B or Llama2 7B, you will require GPUs such as the NVIDIA A10 (24 GB of VRAM) or A100 (40 GB of VRAM). We recommend single-GPU nodes.
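
As a rough rule of thumb, a 7-billion-parameter model loaded in 16-bit precision needs about 14 GB of VRAM for its weights alone, before activations and the KV cache. The back-of-the-envelope estimate below is purely illustrative, not a Dataiku sizing tool:

    # Rough VRAM estimate for holding 7B fp16 weights in GPU memory
    n_params = 7e9          # e.g. Falcon 7B or Llama2 7B
    bytes_per_param = 2     # 16-bit (fp16/bf16) weights
    weights_gb = n_params * bytes_per_param / 1e9
    print(f"Approximate VRAM for weights alone: {weights_gb:.1f} GB")
    # ~14 GB, which is why a 24 GB A10 is a comfortable minimum for 7B models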

Create a containerized execution config

Create a new containerized execution config. In “custom limits”, add nvidia.com/gpu with value 1.

Do not set memory or CPU requests or limits. In any case, each node will only accommodate a single container, since the GPU is not shared.

Create a code env

  • Create a new Python 3.9 code env

  • In “Packages to install”:

    • click “Add sets of packages”, select “Local HuggingFace models” and click “Add”

  • In “Containerized execution”:

    • Select “Build for”: “All container configurations”

    • In “Container runtime additions”, click “Add”, and in “Type”, select “GPU support for Torch 2”

  • Click “Save and update” (this will take at least 10-20 minutes)
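
Once the code env has finished building, you can optionally sanity-check GPU access by running a small containerized Python recipe or notebook that uses this code env and the containerized execution config created above. A minimal check, assuming only the Torch package installed by the package set:

    # Verify that the container actually sees the GPU
    import torch

    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device count:", torch.cuda.device_count())
        print("Device name :", torch.cuda.get_device_name(0))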

Create a HuggingFace connection

  • In Connections, create a new Local HuggingFace connection

  • Enter a connection name

  • In “Containerized execution”, enter the name of the containerized execution config you just created

  • In “code environment name”, enter the name of the code environment you just created

  • Create the connection

If you want to use Llama2, you must have a HuggingFace account for which access to Llama2 has been approved, and enter an access token for that account.

We recommend disabling “Use DSS-managed model cache” if your containers have good outgoing Internet connectivity, as it will be faster to download the models directly from HuggingFace.
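
Once the connection is created, the models it exposes also become visible to the Dataiku Python API. As a quick check, you can list the LLM ids available in a project; this is a sketch assuming a DSS version recent enough to include the LLM Mesh Python API (field names may vary slightly between versions):

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    # List every LLM the project can use, including those exposed by the
    # new Local HuggingFace connection
    for llm in project.list_llms():
        print(f"- {llm.description} (id: {llm.id})")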

On Dataiku Cloud

To access this feature, you will need to activate the GPU on your instance. Contact us to learn more about our GPU offering.

In order to run local HuggingFace models, you will need to:

  • Activate your GPU extension in the Launchpad,

  • Create a Python 3.9 code env with the “Local HuggingFace models” set of packages and the option “GPU support for Torch 2” activated,

  • Create a HuggingFace connection in the Launchpad and link it to the code environment you created.

Test

In a project, create a new Prompt Studio (from the green menu). Create a new single prompt. In the LLM dropdown, choose, for example, Falcon 7B, and click Run.

The model will be downloaded, and a container will be started, which will require pulling the image to the GPU machine. The first run will take 10-15 minutes (subsequent runs will be faster).
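
Beyond Prompt Studios, you can also query the model programmatically through the LLM Mesh completion API in Python. The sketch below assumes a recent DSS version; the LLM id is a hypothetical placeholder, so replace it with the real id of your model (for example as printed by project.list_llms() above):

    import dataiku

    # Hypothetical placeholder: replace with the actual LLM id of your
    # local HuggingFace model, as listed by project.list_llms()
    LLM_ID = "huggingface-local:my-hf-connection:falcon-7b"

    client = dataiku.api_client()
    project = client.get_default_project()
    llm = project.get_llm(LLM_ID)

    # Build and run a simple completion query
    completion = llm.new_completion()
    completion.with_message("Write a haiku about GPUs.")
    response = completion.execute()

    if response.success:
        print(response.text)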