Setup & prerequisites

Local Hugging Face models can run either on a Kubernetes cluster with NVIDIA GPUs, or on the DSS server itself if it has NVIDIA GPUs. The Kubernetes setup is the recommended one.

Common requirements

In order to run local Hugging Face models, you will need a Hugging Face access token for any gated models: some models on the Hugging Face Hub are gated, and accessing them may require first requesting access on the model repository page.

Note

The base images and Python 3.11 requirements are automatically fulfilled when using Dataiku Cloud Stacks, so no additional container image setup is needed as long as you have not customized the base images. See Elastic AI computation setup.

If you require assistance with the cluster setup, please reach out to your Dataiku Technical Account Manager or Customer Success Manager.

Note

For air-gapped instances, you will need to import Hugging Face models manually into DSS’s model cache and enable DSS’s model cache in the Hugging Face connection. See The model cache section.

Kubernetes setup

This setup is the recommended approach for running Hugging Face models locally in DSS.

Prerequisites

You need:

  • A fully set up Elastic AI computation capability

  • A running Elastic AI Kubernetes cluster with NVIDIA GPUs

    • Smaller models, such as Qwen3 4B or GPT-OSS 20B, work with one A10 GPU with 24 GB of VRAM

    • Larger models require multi-GPU nodes. For example, Llama 3.3 70B or Kimi Linear 48B A3B can run on 2 x A100 GPUs with 80 GB of VRAM each, depending on quantization and context length

    • Large image generation models such as FLUX.1-schnell require GPUs with at least 40 GB of VRAM. Local image generation models do not benefit from multi-GPU setups or quantization
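As a rough rule of thumb for the sizing guidance above (an illustrative sketch, not a DSS feature), the VRAM needed just for model weights can be estimated from the parameter count and the quantization level:

```python
def weights_vram_gb(num_params_billions: float, bits_per_param: int) -> float:
    """Rough VRAM needed for model weights alone, in GB.

    Ignores KV cache, activations, and framework overhead, which grow
    with context length and batch size.
    """
    return num_params_billions * bits_per_param / 8  # 1B params at 8 bits ~ 1 GB

# A 70B model in fp16 needs ~140 GB for weights alone, leaving little
# room for KV cache on 2 x 80 GB A100s; 8-bit or 4-bit quantization
# brings the weights down to ~70 GB or ~35 GB.
print(weights_vram_gb(70, 16))  # 140.0
print(weights_vram_gb(70, 4))   # 35.0
```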

Create a containerized execution configuration

  • In Administration > Settings > Compute & Scaling > Containerized Execution, create a new Kubernetes containerized execution config

  • In Custom limits, add nvidia.com/gpu with value 1

  • If you are using multi-GPU nodes, set a higher value. It is recommended to use 1, 2, 4, or 8 in order to maximize compatibility with vLLM tensor parallelism constraints

  • Enable Increase shared memory size, without setting a specific value

Do not set memory or CPU requests or limits.
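The 1, 2, 4, or 8 recommendation follows from vLLM's tensor parallelism constraint: the model's number of attention heads must be divisible by the tensor parallel size (i.e. the GPU count). A minimal illustrative check (the head counts below are examples, not values taken from DSS):

```python
def valid_tp_sizes(num_attention_heads: int, candidates=(1, 2, 4, 8)):
    """Return the GPU counts usable as vLLM tensor parallel sizes.

    vLLM requires the number of attention heads to be divisible by
    the tensor parallel size.
    """
    return [tp for tp in candidates if num_attention_heads % tp == 0]

print(valid_tp_sizes(32))  # [1, 2, 4, 8] -- typical power-of-two head count
print(valid_tp_sizes(28))  # [1, 2, 4]   -- 8 GPUs would be rejected
```

Powers of two divide the head counts of most common architectures, which is why they maximize compatibility.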

Create the code environment and a Local Hugging Face connection

  • In Administration > Code envs > Internal envs setup, in the Local Hugging Face models code environment section, select a Python interpreter and click Create code environment

  • Once the code environment is created, go to its settings. In Containerized execution, set Build for to All container configurations, or select the relevant GPU-enabled container configurations

  • Click Save and update (this can take 10-20 minutes or more)

  • Create a Local Hugging Face connection

  • Enter the connection name

  • In Container execution, choose Select a container configuration, then choose the containerized execution config name

  • In Code environment, select Use internal code env

  • In the relevant model section of the connection, click Add model from preset to import a preset from the catalog

  • Click Save

Note

Different models can have different hardware requirements, including GPU count and GPU type. After the model is created, it is recommended to configure Container execution in the model’s Deployment settings tab rather than relying only on the connection-level default.

Deployment settings

In the model’s Deployment settings tab, DSS can automatically scale the number of model instances up and down depending on the load:

  • Max. model instances limits how many instances can run at the same time

  • Target requests per model instance defines the target average load per running instance; DSS uses it to decide when to add or remove instances

  • Autoscaling time window controls how quickly DSS reacts to load changes

By default, models scale from zero, which adds latency on the first request while the model instance starts. To avoid this, configure Min. model instances so that the model remains always running. This setting only applies if Reserved capacity is enabled on the Local Hugging Face connection.
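The interaction between these settings can be sketched as a simple control loop (an illustrative model of the behavior described above, not DSS's actual implementation):

```python
import math

def desired_instances(avg_requests: float, target_per_instance: float,
                      min_instances: int, max_instances: int) -> int:
    """Instance count needed to keep average load near the target,
    clamped between the configured minimum and maximum."""
    needed = math.ceil(avg_requests / target_per_instance)
    return max(min_instances, min(needed, max_instances))

# 25 concurrent requests with a target of 10 per instance, scaling 0-4:
print(desired_instances(25, 10, 0, 4))  # 3
# With no load and Min. model instances = 0, the model scales to zero:
print(desired_instances(0, 10, 0, 4))   # 0
```

Setting the minimum to 1 or more in this sketch corresponds to keeping the model always running, at the cost of idle GPU capacity.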

Local DSS server setup

Running local Hugging Face models directly on the DSS server is supported when the DSS server has NVIDIA GPUs, but this is not the recommended setup.

Create the code environment and a Local Hugging Face connection

  • In Administration > Code envs > Internal envs setup, in the Local Hugging Face models code environment section, select a Python interpreter and click Create code environment

  • Once the code environment is created, click Save and update

  • Create a Local Hugging Face connection

  • Enter the connection name

  • In Container execution, select None - Use backend to execute

  • In Code environment, select Use internal code env

  • In the relevant model section of the connection, click Add model from preset to import a preset from the catalog

  • Click Save

Deployment settings

In local DSS server mode, GPU allocation is not automatic. Running multiple models on the same GPU will likely fail with out-of-memory errors.

By default, a model uses all visible GPUs. If the DSS server has several GPUs and you want to run several models, use Cuda visible devices to pin each model to a specific GPU.
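Pinning relies on the standard CUDA_VISIBLE_DEVICES environment variable. As an illustration of the mechanism (outside DSS):

```python
import os

# Make only physical GPU 1 visible to this process; CUDA frameworks
# such as PyTorch or vLLM will then see it as device 0. This must be
# set before the CUDA runtime is initialized in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

print(os.environ["CUDA_VISIBLE_DEVICES"])  # 1
```

Assigning a distinct value per model keeps each model's process on its own GPU.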

Set Max. model instances to 1. Otherwise, multiple instances are likely to run on the same GPU and fail with out-of-memory errors.

As in Kubernetes mode, models scale from zero by default. To keep a model always running, configure Min. model instances. This setting only applies if Reserved capacity is enabled on the Local Hugging Face connection.