Setup & prerequisites

Local Hugging Face models can run either on a Kubernetes cluster with NVIDIA GPUs, or on the DSS server itself if it has NVIDIA GPUs. The Kubernetes setup is the recommended one.

Common requirements

In order to run local Hugging Face models, you will need a Hugging Face access token for any gated models: some models on the Hugging Face Hub are gated, and accessing them may require first requesting access on the model repository page.

Note

The base images and Python 3.11 requirements are automatically fulfilled when using Dataiku Cloud Stacks, so no additional container image setup is needed as long as you have not customized the base images. See Elastic AI computation setup.

If you require assistance with the cluster setup, please reach out to your Dataiku Technical Account Manager or Customer Success Manager.

Note

For air-gapped instances, you will need to import Hugging Face models manually into DSS’s model cache and enable DSS’s model cache in the Hugging Face connection. See The model cache section.

Kubernetes setup

This setup is the recommended approach for running Hugging Face models locally in DSS.

Prerequisites

You need:

  • A fully set up Elastic AI computation capability

  • A running Elastic AI Kubernetes cluster with NVIDIA GPUs

    • Smaller models, such as Qwen3 4B or GPT-OSS 20B, work with one A10 GPU with 24 GB of VRAM

    • Larger models require multi-GPU nodes. For example, Llama 3.3 70B or Kimi Linear 48B A3B can run on 2 x A100 GPUs with 80 GB of VRAM each, depending on quantization and context length

    • Large image generation models such as FLUX.1-schnell require GPUs with at least 40 GB of VRAM. Local image generation models do not benefit from multi-GPU setups or quantization
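As a rough rule of thumb for the sizing guidance above (an illustrative sketch, not a DSS feature), the VRAM needed just for model weights can be estimated from the parameter count and the quantization level:

```python
def weights_vram_gb(num_params_billions: float, bits_per_param: int) -> float:
    """Rough VRAM needed for model weights alone, in GB.

    Ignores KV cache, activations, and framework overhead, which grow
    with context length and batch size.
    """
    return num_params_billions * bits_per_param / 8  # 1B params at 8 bits ~ 1 GB

# A 70B model in fp16 needs ~140 GB for weights alone, leaving little
# room for KV cache on 2 x 80 GB A100s; 8-bit or 4-bit quantization
# brings the weights down to ~70 GB or ~35 GB.
print(weights_vram_gb(70, 16))  # 140.0
print(weights_vram_gb(70, 4))   # 35.0
```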

Create a containerized execution configuration

  • In Administration > Settings > Compute & Scaling > Containerized Execution, create a new Kubernetes containerized execution config

  • In Custom limits, add nvidia.com/gpu with value 1

  • If you are using multi-GPU nodes, set a higher value. It is recommended to use 1, 2, 4, or 8 in order to maximize compatibility with vLLM tensor parallelism constraints

  • Enable Increase shared memory size, without setting a specific value

Do not set memory or CPU requests or limits.
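The 1, 2, 4, or 8 recommendation follows from vLLM's tensor parallelism constraint: the model's number of attention heads must be divisible by the tensor parallel size (i.e. the GPU count). A minimal illustrative check (the head counts below are examples, not values taken from DSS):

```python
def valid_tp_sizes(num_attention_heads: int, candidates=(1, 2, 4, 8)):
    """Return the GPU counts usable as vLLM tensor parallel sizes.

    vLLM requires the number of attention heads to be divisible by
    the tensor parallel size.
    """
    return [tp for tp in candidates if num_attention_heads % tp == 0]

print(valid_tp_sizes(32))  # [1, 2, 4, 8] -- typical power-of-two head count
print(valid_tp_sizes(28))  # [1, 2, 4]   -- 8 GPUs would be rejected
```

Powers of two divide the head counts of most common architectures, which is why they maximize compatibility.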

Create the code environment and a Local Hugging Face connection

  • In Administration > Code envs > Internal envs setup, in the Local Hugging Face models code environment section, select a Python interpreter and click Create code environment

  • Once the code environment is created, go to its settings. In Containerized execution, set Build for to All container configurations, or select the relevant GPU-enabled container configurations

  • Click Save and update (this can take 10-20 minutes or more)

  • Create a Local Hugging Face connection

  • Enter the connection name

  • In Container execution, choose Select a container configuration, then choose the containerized execution config name

  • In Code environment, select Use internal code env

  • In the relevant model section of the connection, click Add model from preset to import a preset from the catalog

  • Click Save

Note

Different models can have different hardware requirements, including GPU count and GPU type. After the model is created, it is recommended to configure Container execution in the model’s Deployment settings tab rather than relying only on the connection-level default.

Deployment settings

In the model’s Deployment settings tab, DSS can automatically scale the number of model instances up and down depending on the load:

  • Max. model instances limits how many instances can run at the same time

  • Target requests per model instance defines the target average load per running instance; DSS uses it to decide when to add or remove instances

  • Autoscaling time window controls how quickly DSS reacts to load changes

By default, models scale from zero, which adds latency on the first request while the model instance starts. To avoid this, configure Min. model instances so that the model remains always running. This setting only applies if Reserved capacity is enabled on the Local Hugging Face connection.
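The interaction between these settings can be sketched as a simple control loop (an illustrative model of the behavior described above, not DSS's actual implementation):

```python
import math

def desired_instances(avg_requests: float, target_per_instance: float,
                      min_instances: int, max_instances: int) -> int:
    """Instance count needed to keep average load near the target,
    clamped between the configured minimum and maximum."""
    needed = math.ceil(avg_requests / target_per_instance)
    return max(min_instances, min(needed, max_instances))

# 25 concurrent requests with a target of 10 per instance, scaling 0-4:
print(desired_instances(25, 10, 0, 4))  # 3
# With no load and Min. model instances = 0, the model scales to zero:
print(desired_instances(0, 10, 0, 4))   # 0
```

Setting the minimum to 1 or more in this sketch corresponds to keeping the model always running, at the cost of idle GPU capacity.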

Local DSS server setup

Running local Hugging Face models directly on the DSS server is supported when the DSS server has NVIDIA GPUs, but this is not the recommended setup.

Create the code environment and a Local Hugging Face connection

  • In Administration > Code envs > Internal envs setup, in the Local Hugging Face models code environment section, select a Python interpreter and click Create code environment

  • Once the code environment is created, click Save and update

  • Create a Local Hugging Face connection

  • Enter the connection name

  • In Container execution, select None - Use backend to execute

  • In Code environment, select Use internal code env

  • In the relevant model section of the connection, click Add model from preset to import a preset from the catalog

  • Click Save

Deployment settings

In local DSS server mode, GPU allocation is not automatic. Running multiple models on the same GPU will likely fail with out-of-memory errors.

By default, a model uses all visible GPUs. If the DSS server has several GPUs and you want to run several models, use Cuda visible devices to pin each model to a specific GPU.
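Pinning relies on the standard CUDA_VISIBLE_DEVICES environment variable. As an illustration of the mechanism (outside DSS):

```python
import os

# Make only physical GPU 1 visible to this process; CUDA frameworks
# such as PyTorch or vLLM will then see it as device 0. This must be
# set before the CUDA runtime is initialized in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

print(os.environ["CUDA_VISIBLE_DEVICES"])  # 1
```

Assigning a distinct value per model keeps each model's process on its own GPU.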

Set Max. model instances to 1. Otherwise, multiple instances are likely to run on the same GPU and fail with out-of-memory errors.

As in Kubernetes mode, models scale from zero by default. To keep a model always running, configure Min. model instances. This setting only applies if Reserved capacity is enabled on the Local Hugging Face connection.