Setup & prerequisites¶
Local Hugging Face models can run either on a Kubernetes cluster with NVIDIA GPUs, or on the DSS server itself if it has NVIDIA GPUs. The Kubernetes setup is the recommended one.
Common requirements¶
In order to run local Hugging Face models, you will need:
GPUs with compute capability >= 7.5
NVIDIA driver version compatible with CUDA 12.8
Python 3.11
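As a quick check of the compute capability requirement, here is a stdlib-only sketch that parses the output of `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` and verifies that every GPU meets the 7.5 threshold (the helper names are illustrative, not part of DSS):

```python
MIN_CAPABILITY = (7, 5)  # minimum compute capability for local Hugging Face models

def parse_compute_caps(csv_output: str) -> list[tuple[int, int]]:
    """Parse the output of `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`,
    e.g. "8.0\n7.5", into (major, minor) tuples."""
    caps = []
    for line in csv_output.strip().splitlines():
        major, minor = line.strip().split(".")
        caps.append((int(major), int(minor)))
    return caps

def all_gpus_supported(csv_output: str) -> bool:
    """True only if at least one GPU is present and all meet the threshold."""
    caps = parse_compute_caps(csv_output)
    return bool(caps) and all(cap >= MIN_CAPABILITY for cap in caps)
```

Tuple comparison handles the version ordering correctly (for example, capability 8.0 compares greater than 7.5).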
Some models are gated on Hugging Face Hub. Accessing them requires a Hugging Face access token, and gated models may require requesting access on the model repository page first.
Note
When using Dataiku Cloud Stacks, the base image and Python 3.11 requirements are fulfilled automatically, so no additional container image setup is needed, as long as you have not customized the base images. See Elastic AI computation setup.
If you require assistance with the cluster setup, please reach out to your Dataiku Technical Account Manager or Customer Success Manager.
Note
For air-gapped instances, you will need to import Hugging Face models manually into DSS’s model cache and enable DSS’s model cache in the Hugging Face connection. See The model cache section.
Kubernetes setup¶
This setup is the recommended approach for running Hugging Face models locally in DSS.
Prerequisites¶
You need:
A fully set up Elastic AI computation capability
A running Elastic AI Kubernetes cluster with NVIDIA GPUs
Smaller models, such as Qwen3 4B or GPT-OSS 20B, work with one A10 GPU with 24 GB of VRAM
Larger models require multi-GPU nodes. For example, models such as Llama 3.3 70B or Kimi Linear 48B A3B can run on 2 x A100 GPUs with 80 GB of VRAM each, depending on quantization and context length
Large image generation models such as FLUX.1-schnell require GPUs with at least 40 GB of VRAM. Local image generation models do not benefit from multi-GPU setups or quantization
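A rough way to reason about these sizings is weights-only memory: parameter count times bytes per parameter. The sketch below is a rule of thumb only; it ignores KV cache, activations, and runtime overhead, which all add to the real total:

```python
def weight_memory_gb(n_params_billions: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GB.
    Ignores KV cache, activations, and runtime overhead."""
    return n_params_billions * 1e9 * bits_per_param / 8 / 1e9

# A 70B model in 16-bit needs ~140 GB for weights alone: too big for a
# single 80 GB A100, but within reach of 2 x 80 GB with tensor parallelism.
# 4-bit quantization brings the weights down to ~35 GB.
```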
Create a containerized execution configuration¶
In Administration > Settings > Compute & Scaling > Containerized Execution, create a new Kubernetes containerized execution config
In Custom limits, add nvidia.com/gpu with value 1. If you are using multi-GPU nodes, set a higher value. It is recommended to use 1, 2, 4, or 8 in order to maximize compatibility with vLLM tensor parallelism constraints
Enable Increase shared memory size, without setting a specific value
Do not set memory or CPU requests or limits.
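The 1/2/4/8 recommendation comes from how tensor parallelism works: vLLM shards attention heads across GPUs, so the model's attention head count must be divisible by the tensor parallel size. A small illustration (the head counts used below are examples, not tied to any specific model):

```python
def valid_tp_sizes(num_attention_heads: int, candidates=(1, 2, 4, 8)) -> list[int]:
    """GPU counts usable for tensor parallelism with this model: vLLM
    requires the attention head count to be divisible by the TP size."""
    return [n for n in candidates if num_attention_heads % n == 0]

# Most models use power-of-two-friendly head counts, which is why
# requesting 1, 2, 4, or 8 GPUs maximizes compatibility.
```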
Create the code environment and a Local Hugging Face connection¶
In Administration > Code envs > Internal envs setup, in the Local Hugging Face models code environment section, select a Python interpreter and click Create code environment
Once the code environment is created, go to its settings. In Containerized execution, set Build for to All container configurations, or select the relevant GPU-enabled container configurations
Click Save and update (this can take 10-20 minutes or more)
Create a Local Hugging Face connection
Enter the connection name
In Container execution, choose Select a container configuration, then choose the containerized execution config name
In Code environment, select Use internal code env
In the relevant model section of the connection, click Add model from preset to import a preset from the catalog
Click Save
Note
Different models can have different hardware requirements, including GPU count and GPU type. After the model is created, it is recommended to configure Container execution in the model’s Deployment settings tab rather than relying only on the connection-level default.
Deployment settings¶
The model’s Deployment settings tab controls autoscaling: DSS can automatically scale the number of model instances up and down depending on the load. Max. model instances limits how many instances can run at the same time. Target requests per model instance defines the target average load per running instance and is used by DSS to decide when to add or remove instances. Autoscaling time window controls how quickly DSS reacts to load changes.
By default, models scale from zero, which adds latency on the first request while the model instance starts. To avoid this, configure Min. model instances so that the model remains always running. This setting only applies if Reserved capacity is enabled on the Local Hugging Face connection.
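As an illustration only (this is a simplified sketch, not DSS's actual autoscaling algorithm), the interaction between these settings can be pictured as:

```python
import math

def desired_instances(avg_requests: float, target_per_instance: float,
                      min_instances: int, max_instances: int) -> int:
    """Simplified scaling rule: run enough instances to keep the average
    load per instance near the target, clamped between the configured
    minimum (which may be 0, i.e. scale to zero) and maximum."""
    needed = math.ceil(avg_requests / target_per_instance) if avg_requests > 0 else 0
    return max(min_instances, min(max_instances, needed))
```

With a target of 10 requests per instance and a maximum of 4 instances, a sustained load of 25 requests would settle at 3 instances; with no traffic and a minimum of 0, the model scales back to zero.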
Local DSS server setup¶
Running local Hugging Face models directly on the DSS server is supported when the DSS server has NVIDIA GPUs, but this is not the recommended setup.
Create the code environment and a Local Hugging Face connection¶
In Administration > Code envs > Internal envs setup, in the Local Hugging Face models code environment section, select a Python interpreter and click Create code environment
Once the code environment is created, click Save and update
Create a Local Hugging Face connection
Enter the connection name
In Container execution, select None - Use backend to execute
In Code environment, select Use internal code env
In the relevant model section of the connection, click Add model from preset to import a preset from the catalog
Click Save
Deployment settings¶
In local DSS server mode, GPU allocation is not automatic. Running multiple models on the same GPU will likely fail with out-of-memory errors.
By default, a model uses all visible GPUs. If the DSS server has several GPUs and you want to run several models, use Cuda visible devices to pin each model to a specific GPU.
Set Max. model instances to 1. Otherwise, multiple instances are likely to run on the same GPU and fail with out-of-memory errors.
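The Cuda visible devices setting corresponds to the standard CUDA_VISIBLE_DEVICES environment variable, which restricts which GPUs a process can see. A minimal sketch of the mechanism (the device ids are examples; the variable must be set before any CUDA library initializes):

```python
import os

def pin_to_gpus(devices: str) -> None:
    """Restrict this process to specific GPUs, e.g. "0" or "2,3".
    Must be set before any CUDA library (torch, vLLM, ...) initializes."""
    os.environ["CUDA_VISIBLE_DEVICES"] = devices

# Model A's process pinned to GPU 0; model B's process would use "1",
# so the two never share a card and cannot OOM each other.
pin_to_gpus("0")
```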
As in Kubernetes mode, models scale from zero by default. To keep a model always running, configure Min. model instances. This setting only applies if Reserved capacity is enabled on the Local Hugging Face connection.