Creating and managing Semantic Models
The Semantic Model editor is provided by the Semantic Models Lab plugin, which must be installed from the plugin store by an administrator.
Creating a Semantic Model
You can create Semantic Models in any Dataiku project, under Generative AI > Semantic Models.
It is a best practice to build semantic models on top of golden datasets.
Semantic models have a name and versions (one of which is active).
Entities
Entities are tables linked to business concepts and filters.
They are characterized by:
A dataset from a specific project
A name and a description
A primary key
Attributes
  These are mapped to columns from the dataset and come with a SQL expression (usually the column name), a type, and a description.
  Writing custom SQL expressions for attributes is useful when the raw data in your database isn’t in the exact format you need for your semantic model.
  Attributes can be resolvable and can have sample values.
  Resolvable attributes are used in a term resolution step, where terms from the user query are matched against the values of these attributes so that typos and schema errors can be corrected.
  For each attribute, you can add sample values, either by fetching them automatically or by adding them manually.

Metrics
  Business metrics are aggregations of attributes, which can come from several entities.
  They come with a name, a formal SQL expression, a description, and LLM instructions.

Filters
  Filters define conditions applied to columns.
  They come with a name, a formal SQL expression, a description, and LLM instructions.
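
To make the parts of an entity concrete, here is a minimal sketch written as a plain Python dictionary. The field names follow the specification at the end of this page; the dataset, attribute, metric, and filter names are hypothetical, and `check_entity` is an illustrative helper, not part of Dataiku.

```python
# Hypothetical entity mirroring the parts described above.
# Every name and value here is illustrative, not taken from a real project.
orders_entity = {
    "name": "orders",
    "description": "One row per customer order",
    "datasetRef": "orders_golden",
    "attributes": [
        {"name": "order_id", "column": "order_id", "dssType": "string",
         "description": "Unique order identifier"},
        {"name": "status", "column": "status", "dssType": "string",
         "description": "Order status",
         "manualValues": ["OPEN", "SHIPPED", "CANCELLED"]},  # sample values
        {"name": "amount", "column": "amount", "dssType": "double",
         "description": "Order amount"},
    ],
    "metrics": [
        {"name": "total_revenue",
         "description": "Sum of order amounts",
         "pseudoSQLExpression": "SUM(amount)",
         "llmInstructions": "Use for any question about revenue or sales."},
    ],
    "filters": [
        {"name": "shipped_orders",
         "description": "Orders that have been shipped",
         "pseudoSQLExpression": "status = 'SHIPPED'",
         "llmInstructions": "Apply when the user asks about shipped orders."},
    ],
    "primaryKey": {"attributes": ["order_id"]},
}


def check_entity(entity):
    """Return a list of human-readable problems found in an entity dict."""
    problems = []
    names = [attribute["name"] for attribute in entity["attributes"]]
    if len(names) != len(set(names)):
        problems.append("duplicate attribute names")
    # Every primary key attribute must be a declared attribute.
    for key in entity["primaryKey"]["attributes"]:
        if key not in names:
            problems.append(f"primary key attribute not found: {key}")
    # Metrics and filters are only useful with a non-empty SQL expression.
    for kind in ("metrics", "filters"):
        for item in entity.get(kind, []):
            if not item.get("pseudoSQLExpression", "").strip():
                problems.append(f"{kind}: '{item['name']}' has no SQL expression")
    return problems


print(check_entity(orders_entity))
```

A check like this only catches structural slips; it does not replace testing the entity against real questions.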

Entities can either be added manually or automatically generated using an LLM.
When using an LLM to automatically generate entities, you can specify instructions (e.g. to provide the number of entities to create from the dataset selected, how to group attributes, information on filters/metrics to create or relationships between these entities).
Automatically generating an entity takes you about 80% of the way, but you still need to verify and test the entity before exposing it to end-users.
Relationships
Relationships between entities define which keys to use to join two entities. Joins can be either simple or complex.
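
As a rough illustration of the difference between a simple and a complex join, the sketch below turns a relationship definition into a SQL `JOIN` fragment. The keys `firstEntity`, `secondEntity`, and `pseudoSQLExpression` come from the specification at the end of this page. The entity and column names are hypothetical, and `join_clause` is an illustrative helper; how Dataiku itself turns relationships into SQL is not described here.

```python
def join_clause(relationship):
    """Build a SQL JOIN fragment from a relationship definition."""
    return (
        f"{relationship['firstEntity']} "
        f"JOIN {relationship['secondEntity']} "
        f"ON {relationship['pseudoSQLExpression']}"
    )


# Simple join: a single key equality.
simple = {
    "firstEntity": "orders",
    "secondEntity": "customers",
    "pseudoSQLExpression": "orders.customer_id = customers.customer_id",
}

# Complex join: a key equality combined with an extra condition.
complex_join = {
    "firstEntity": "orders",
    "secondEntity": "customers",
    "pseudoSQLExpression": (
        "orders.customer_id = customers.customer_id "
        "AND orders.order_date >= customers.signup_date"
    ),
}

print(join_clause(simple))
print(join_clause(complex_join))
```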
Glossary
The glossary is a collection of business-specific terms. Each term comes with a name, a description, and synonyms.
These terms can be added manually or extracted from business documents (i.e. PDF, PPT, or DOCX files).
Once terms have been added to the glossary, use the Used tab to select which terms the semantic model uses and, among those, which should be linked to entities, attributes, metrics, or filters.
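
The sketch below illustrates how synonyms let several wordings point to the same glossary term, and how a binding links a term to an attribute. The field names follow the specification at the end of this page. The terms, IDs, and target names are hypothetical, and the whole-word matching is an assumption made for illustration, not a description of how Dataiku detects glossary terms.

```python
import re

# Hypothetical glossary terms and bindings.
glossary_terms = [
    {"id": "t1", "term": "churn",
     "description": "Customers who cancelled their subscription",
     "synonyms": ["attrition", "customer loss"]},
    {"id": "t2", "term": "ARR",
     "description": "Annual recurring revenue",
     "synonyms": ["annual recurring revenue"]},
]
glossary_bindings = [
    {"termId": "t1", "targetEntityClass": "customers",
     "targetName": "is_churned", "targetType": "ATTRIBUTE"},
]


def find_terms(question, terms):
    """Return the terms whose name or synonyms appear as whole words."""
    text = question.lower()
    found = []
    for term in terms:
        candidates = [term["term"], *term["synonyms"]]
        if any(re.search(rf"\b{re.escape(c.lower())}\b", text) for c in candidates):
            found.append(term["term"])
    return found


def bindings_for(term_id, bindings):
    """Return the bindings that reference a given glossary term ID."""
    return [b for b in bindings if b["termId"] == term_id]


print(find_terms("What was customer attrition last year?", glossary_terms))
print(bindings_for("t1", glossary_bindings))
```

Matching on word boundaries avoids false hits such as finding "ARR" inside "carrier".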
Golden Queries
Golden queries are pre-recorded user questions paired with the expected SQL. Providing golden queries improves the quality of the Semantic Model Query Tool’s output.
Each golden query has a name, a user question, and the expected SQL outcome.
These can be used for frequently asked questions or to provide guidance for complex queries.
These can be added from the Golden Queries tab or directly from the Playground.
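
A golden query can be pictured as a small record. The sketch below uses the field names from the specification at the end of this page (`name`, `question`, `generatedSql`), with hypothetical questions and SQL. Rendering the records as worked examples for an LLM prompt is one plausible way such queries could guide SQL generation; it is an assumption, since how the Semantic Model Query Tool uses them is not described here.

```python
# Hypothetical golden queries.
golden_queries = [
    {"name": "Monthly revenue",
     "question": "What is the revenue per month?",
     "generatedSql": (
         "SELECT DATE_TRUNC('month', order_date) AS month, "
         "SUM(amount) AS revenue FROM orders GROUP BY 1 ORDER BY 1"
     )},
    {"name": "Shipped order count",
     "question": "How many orders have been shipped?",
     "generatedSql": "SELECT COUNT(*) FROM orders WHERE status = 'SHIPPED'"},
]


def as_examples(queries):
    """Render golden queries as question/SQL pairs separated by blank lines."""
    blocks = [
        f"Question: {query['question']}\nSQL: {query['generatedSql']}"
        for query in queries
    ]
    return "\n\n".join(blocks)


print(as_examples(golden_queries))
```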
Instructions
Instructions can be added to help guide the LLM when generating SQL, for example to specify how to deal with date attributes.
Playground
While creating and refining your semantic model, you can test it out in the Playground. The Playground allows you to see how your Semantic Model responds to business questions.
The Playground uses the Semantic Model Query Tool to generate and execute the SQL.
The Playground shows the LLM answer, the SQL queries executed, and the records those queries returned. It also shows any value corrections that were performed and any glossary terms that were used.

The main value of the Playground is testing out questions you are expecting end-users to ask, and using the output of the tool to refine the semantic model.
Other settings
The Indexing tab is where you configure and run the process that allows Dataiku to perform term resolution and fetch sample values.
You need to select an Embedding LLM, which is the embedding model used for the semantic resolution of attributes.
Once your settings are configured, you must manually trigger the indexing process to update the semantic model’s knowledge of your data.
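
To give an intuition for what term resolution does with indexed values, the sketch below corrects a misspelled term against a small set of values. It uses plain string similarity from Python's standard library (`difflib`) as a stand-in; Dataiku uses the configured embedding model, so this is not its implementation. The attribute names, values, and the `cutoff` threshold are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical indexed values, grouped by resolvable attribute.
indexed_values = {
    "country": ["France", "Germany", "United Kingdom"],
    "status": ["OPEN", "SHIPPED", "CANCELLED"],
}


def resolve_term(term, values_by_attribute, cutoff=0.8):
    """Return (attribute, value) for the closest indexed value, or None."""
    # Map lower-cased values back to their attribute and original spelling.
    lookup = {}
    for attribute, values in values_by_attribute.items():
        for value in values:
            lookup.setdefault(value.lower(), (attribute, value))
    matches = get_close_matches(term.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


print(resolve_term("Germny", indexed_values))
print(resolve_term("shiped", indexed_values))
print(resolve_term("zzz", indexed_values))
```

String similarity only handles spelling variations. An embedding model can also match terms that are worded differently but mean the same thing, which is why the index has to be rebuilt when the data changes.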
Security and Permissions
Project Access: Users must have the Read project content permission on the project containing the Semantic Model to use it.
Data Security: Semantic Models respect underlying Dataiku dataset permissions. If a user does not have access to the underlying connection or dataset, the Semantic Model Query Tool will fail for that user.
YAML Specification
The structure of a semantic model specification is shown below, in JSON-style notation with explanatory comments:
{
  // Top-level semantic model metadata
  "id": "<string>", // Unique identifier for the semantic model
  "projectKey": "<string>", // Dataiku project key
  "name": "<string>", // Name of the semantic model
  "activeVersionId": "<string>", // The ID of the currently active version (e.g., "v1")
  // Array containing the versions of the semantic model
  "versions": [
    {
      "id": "<string>", // Version ID
      "description": "<string>",
      "created": {
        "on": "<timestamp string>",
        "by": "<string>"
      },
      "entities": [
        {
          "name": "<string>", // Logical name of entity
          "description": "<string>", // Description of the entity
          "tags": [ "<string>" ],
          "type": "<string>", // Usually "DATASET"
          "datasetRef": "<string>", // Reference to the Dataiku dataset
          // Dimension columns in the logical table
          "attributes": [
            {
              "name": "<string>", // Logical name of the column
              "description": "<string>", // Optional: Column description
              "dssType": "<string>", // Data type (e.g., "string", "double")
              "type": "<string>", // Usually "COLUMN"
              "column": "<string>", // The physical column name in the base dataset
              "distinctValuesHandlingMode": "<MANUAL | NONE | AUTO_INDEX>",
              "manualValues": [ "<string>" ], // List of values if handled manually
              "indexDistinctValues": "<boolean>",
              "resolveInUserRequests": "<boolean>",
              "sqlGenerationConfig": {
                "autoValuesLimit": "<integer>" // Optional: limit for auto-indexed values
              }
            }
            // ... more attributes
          ],
          // Metrics scoped to the logical table
          "metrics": [
            {
              "name": "<string>",
              "description": "<string>",
              "pseudoSQLExpression": "<SQL expression>", // e.g., "COUNT(customer_id)"
              "llmInstructions": "<string>", // Optional: Custom instructions for the LLM
              "created": {}
            }
            // ... more metrics
          ],
          // Commonly used filters over the logical table
          "filters": [
            {
              "name": "<string>",
              "description": "<string>",
              "pseudoSQLExpression": "<SQL expression>",
              "llmInstructions": "<string>", // Optional: Custom instructions for the LLM
              "created": {}
            }
            // ... more filters
          ],
          // Primary key definition for the entity
          "primaryKey": {
            "attributes": [ "<string>" ] // Array of attribute names that make up the PK
          },
          "foreignKeys": [ "<object>" ] // Array of foreign key definitions, if any
        }
        // ... more entities
      ],
      // View-level concepts: Relationships between logical tables
      "relationships": [
        {
          "firstEntity": "<string>", // Left table
          "secondEntity": "<string>", // Right table
          "pseudoSQLExpression": "<SQL join expression>" // e.g., "table_a.id = table_b.a_id"
        }
      ],
      // Additional context concepts: Verified queries and example questions
      "goldenQueries": [
        {
          "name": "<string>", // Descriptive name of the query
          "question": "<string>", // The natural language question
          "generatedSql": "<string>", // The verified SQL query answering the question
          "created": {
            "on": "<timestamp string>"
          }
        }
      ],
      // Business glossary for LLM context mapping
      "glossaryTerms": [
        {
          "id": "<string>", // UUID for the term
          "term": "<string>", // The business term itself
          "description": "<string>", // Definition or business logic
          "source": "<MANUAL | EXTRACTED>",
          "userModified": "<boolean>",
          "created": {
            "on": "<timestamp string>",
            "by": "<string>" // Optional
          },
          "modified": { // Optional
            "on": "<timestamp string>"
          },
          "synonyms": [ "<string>" ], // Alternative names for the term
          "privateEditorData": {}
        }
      ],
      // Links tying glossary terms to specific entities or attributes
      "glossaryBindings": [
        {
          "termId": "<string>", // References glossaryTerms.id
          "targetEntityClass": "<string>", // Target entity name
          "targetName": "<string>", // Target attribute name (Only if targetType is ATTRIBUTE)
          "targetType": "<ATTRIBUTE | ENTITY_CLASS>"
        }
      ],
      // Global indexing settings for the LLM search service
      "indexingSettings": {
        "maxDistinctValuesPerAttribute": "<integer>",
        "maxScannedRowsForSQLDatasets": "<integer>",
        "maxScannedRowsForNonSQLDatasets": "<integer>"
      },
      "privateEditorData": {
        "lastSavedAt": "<timestamp string>"
      },
      // Global instructions for LLM SQL generation
      "sqlGenerationConfig": {
        "instructions": "<string>", // General prompt instructions for the model
        "vocabularyTermIds": [ "<string>" ] // Relevant glossary term IDs to inject globally
      }
    }
  ],
  // Dataiku administrative metadata
  "versionTag": {
    "versionNumber": "<integer>",
    "lastModifiedBy": {
      "login": "<string>"
    },
    "lastModifiedOn": "<timestamp integer>"
  },
  "creationTag": {
    "versionNumber": "<integer>",
    "lastModifiedBy": {
      "login": "<string>"
    },
    "lastModifiedOn": "<timestamp integer>"
  },
  "tags": [ "<string>" ],
  "customFields": {},
  "checklists": {
    "checklists": [ "<object>" ]
  }
}
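
As a usage illustration, the sketch below reads a specification and summarizes its active version. It assumes the specification is available as plain JSON, without the explanatory `//` comments shown above, which a standard JSON parser would reject. The model content is hypothetical and reduced to the fields the sketch needs; `active_version` and `summarize` are illustrative helpers, not a Dataiku API.

```python
import json

# Tiny hypothetical specification with two versions.
SPEC = """
{
  "id": "sales_model",
  "projectKey": "SALES",
  "name": "Sales",
  "activeVersionId": "v2",
  "versions": [
    {"id": "v1", "entities": []},
    {"id": "v2", "entities": [
      {"name": "orders",
       "attributes": [{"name": "order_id"}, {"name": "amount"}],
       "metrics": [{"name": "total_revenue"}],
       "filters": []}
    ]}
  ]
}
"""


def active_version(model):
    """Return the version whose id matches activeVersionId."""
    for version in model["versions"]:
        if version["id"] == model["activeVersionId"]:
            return version
    raise ValueError(f"active version not found: {model['activeVersionId']}")


def summarize(model):
    """Count attributes, metrics, and filters per entity of the active version."""
    return {
        entity["name"]: {
            kind: len(entity.get(kind, []))
            for kind in ("attributes", "metrics", "filters")
        }
        for entity in active_version(model).get("entities", [])
    }


model = json.loads(SPEC)
print(summarize(model))
```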