Rate Limiting

Rate limiting in the LLM Mesh helps manage the flow of LLM queries and ensure compliance with provider-side usage limits.

Concepts

Rate limits are enforced per LLM model and per provider. They apply to all queries executed through the LLM Mesh — across projects, LLM connections, and users.

Limits are expressed in requests per minute (RPM). If the rate is exceeded:

  • Requests are automatically throttled (i.e., delayed).

  • If the request cannot be served within a reasonable delay, it fails with a rate limiting error (illustrated in the sketch below).
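The following sketch illustrates this throttle-or-fail behavior with a simple sliding-window limiter. It is a minimal illustration of the concept only, not DSS’s internal implementation; the rpm_limit and max_delay_seconds names are hypothetical.

```python
import time
from collections import deque

class RateLimitError(Exception):
    """Raised when a request cannot be served within the allowed delay."""

class RpmLimiter:
    """Sliding-window limiter: at most `rpm_limit` requests per 60 seconds."""

    def __init__(self, rpm_limit, max_delay_seconds=30):
        self.rpm_limit = rpm_limit
        self.max_delay_seconds = max_delay_seconds
        self.timestamps = deque()  # send times of recent requests

    def acquire(self):
        now = time.monotonic()
        # Drop send times that have left the 60-second window
        while self.timestamps and now - self.timestamps[0] >= 60:
            self.timestamps.popleft()

        if len(self.timestamps) >= self.rpm_limit:
            # Throttle: wait until the oldest request leaves the window...
            wait = 60 - (now - self.timestamps[0])
            if wait > self.max_delay_seconds:
                # ...unless the delay is unreasonable, in which case fail
                raise RateLimitError(f"would need to wait {wait:.1f}s")
            time.sleep(wait)
            self.timestamps.popleft()

        self.timestamps.append(time.monotonic())

# Example: allow 60 requests per minute, fail if a request would wait > 30s
limiter = RpmLimiter(rpm_limit=60, max_delay_seconds=30)
limiter.acquire()  # returns immediately or after a delay, or raises RateLimitError
```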

DSS ships with sensible, production-ready default limits. These defaults are fully configurable, so you can override them to match your use case or your provider’s requirements.

Rule Configuration

Rate limiting rules are defined per provider, in either of two ways:

  • For specific models within that provider.

  • As a default fallback that applies to all other models from the provider.

Each provider’s default rule can target one of the following model categories (see the configuration sketch after this list):

  • Completion models

  • Embedding models

  • Image generation models
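As an illustration of how these rules combine, the sketch below resolves the limit that applies to a query: a rule defined for the specific model wins if one exists, otherwise the provider’s default rule for the model’s category applies. The rule structure and field names are hypothetical; in DSS, rules are configured through the settings UI rather than in code.

```python
# Hypothetical representation of per-provider rate limiting rules.
# This structure only illustrates how model-specific rules and
# per-category defaults relate; it is not a DSS configuration format.
rules = {
    "openai": {
        "models": {
            "gpt-4o": {"rpm": 60},           # rule for a specific model
        },
        "defaults": {                         # fallback per model category
            "completion": {"rpm": 120},
            "embedding": {"rpm": 300},
            "image_generation": {"rpm": 30},
        },
    },
}

def resolve_rpm_limit(provider, model, category):
    """Return the RPM limit that applies to a query, or None if no rule matches."""
    provider_rules = rules.get(provider, {})
    # 1. A rule defined for this specific model takes precedence
    model_rule = provider_rules.get("models", {}).get(model)
    if model_rule is not None:
        return model_rule["rpm"]
    # 2. Otherwise, fall back to the provider default for the model category
    default_rule = provider_rules.get("defaults", {}).get(category)
    return default_rule["rpm"] if default_rule else None

print(resolve_rpm_limit("openai", "gpt-4o", "completion"))       # 60 (model-specific rule)
print(resolve_rpm_limit("openai", "gpt-4o-mini", "completion"))  # 120 (category default)
```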

Limitations

Rate Limiting is not supported for the following LLM connections:

  • Amazon SageMaker LLM

  • Databricks Mosaic AI

  • Snowflake Cortex

  • Local Hugging Face

Setup

A dedicated Rate Limiting section lets you configure the rate limits. Find it in Administration > Settings > LLM Mesh.