Automated Prompt Optimization¶
Automated Prompt Optimization is a feature for optimizing prompts for question-answering (QA) tasks. It leverages the dspy library, a framework for programming with language models (LMs), to systematically improve prompt performance. The feature is provided by the Automated Prompt Optimization plugin, which must be installed first; see Installing plugins.
Given a dataset of questions and answers, the recipe explores different prompt variations and evaluates their performance, then outputs an optimized prompt for your selected language model.
This process helps in:
Increasing the accuracy of QA tasks.
Finding the most effective instructions for a given LLM.
Evaluating different LLMs for a specific task.
This recipe takes a dataset of questions and answers and produces a new dataset containing the original and optimized prompts, along with their performance scores.
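Under the hood, the recipe relies on dspy to express the QA task as a small program whose instructions an optimizer can rewrite. The snippet below is a minimal sketch of such a setup, assuming the current dspy API; the model name and signature fields are illustrative placeholders, not the plugin's actual internals.

```python
import dspy

# Point dspy at the target LLM (placeholder model name; in the recipe this is
# the LLM Mesh connection or agent you select as the Target LLM).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class AnswerQuestion(dspy.Signature):
    """Answer the question concisely and factually."""
    # The docstring above plays the role of the prompt instructions
    # that the optimizer will rewrite.
    question = dspy.InputField()
    answer = dspy.OutputField()

# A simple QA program built from the signature.
qa_program = dspy.Predict(AnswerQuestion)
```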
Inputs¶
Validation Dataset: A dataset containing the data for prompt optimization. It must contain at least two columns:
A column with questions.
A column with the corresponding ground-truth answers.
Optionally, a column with context for each question-answer pair to help validate the answer.
Output¶
Optimal Prompts: A dataset containing the results of the optimization. It will have the following columns:
run: The name of the run (initial for the initial prompt, optimized for the optimized prompt).
prompt: The text of the prompt.
score_train: The performance score on the training set.
score_test: The performance score on the test set (if a train/test split is used).
Parameters¶
The recipe’s behavior can be customized with the following parameters:
LLM configuration¶
Target LLM: The language model you want to optimize the prompt for. This LLM will be used to answer the questions during evaluation and to generate new prompt variations during optimization (for COPRO). It can be any LLM connection or Agent from the LLM Mesh.
LLM-based Validation: A boolean parameter.
If false (default), the answers from the LLM are evaluated against the ground-truth answers using an exact match (F1 score).
If true, a separate Validation LLM is used to perform a more semantic evaluation of the answer’s quality (using the context column if it exists).
Validation LLM: The language model to use for evaluation when LLM-based Validation is enabled.
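To illustrate the difference between the two validation modes, here is a hedged sketch of how such metrics can be written for dspy; the function names, the AssessAnswer signature, and the validation model are assumptions for illustration, not the plugin's actual metrics.

```python
import dspy

# Exact-match style validation: compare the prediction to the ground truth directly.
def exact_match_metric(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()

# LLM-based validation: ask a separate validation LLM to judge the answer,
# optionally taking a context field into account.
class AssessAnswer(dspy.Signature):
    """Judge whether the predicted answer matches the ground-truth answer."""
    question = dspy.InputField()
    ground_truth = dspy.InputField()
    predicted_answer = dspy.InputField()
    verdict = dspy.OutputField(desc="yes or no")

validation_lm = dspy.LM("openai/gpt-4o")  # placeholder Validation LLM

def llm_judge_metric(example, pred, trace=None):
    with dspy.settings.context(lm=validation_lm):
        result = dspy.Predict(AssessAnswer)(
            question=example.question,
            ground_truth=example.answer,
            predicted_answer=pred.answer,
        )
    return result.verdict.strip().lower().startswith("yes")
```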
Dataset mapping¶
Question column: The column in the input dataset that contains the questions.
Ground Truth Answer column: The column in the input dataset that contains the ground-truth answers.
Context column (optional): The column in the input dataset that contains the context for the questions.
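Conceptually, this mapping determines how each dataset row is turned into an example for optimization. The sketch below shows one plausible way to do this with dspy and pandas; the DataFrame and column names are placeholders standing in for whatever columns you map in the recipe.

```python
import dspy
import pandas as pd

# Placeholder input dataset with the three mapped columns.
df = pd.DataFrame({
    "question": ["What is the capital of France?"],
    "answer": ["Paris"],
    "context": ["France is a country in Western Europe."],
})

examples = [
    dspy.Example(
        question=row["question"],
        answer=row["answer"],
        context=row.get("context", ""),  # optional context column
    ).with_inputs("question")            # only the question is shown to the target LLM
    for _, row in df.iterrows()
]
```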
Optimization settings¶
Initial prompt (optional): The initial prompt to start the optimization from. If not provided, a default QA prompt is used.
Train/Test Split Ratio: The ratio to split the input dataset into a training set and a test set.
The training set is used by the optimizer to find the best prompt.
The test set is used to evaluate the performance of the initial and final prompts on unseen data.
If the ratio is 0, no test set is created, and no test scores are computed.
Optimizer: The algorithm to use for optimization. The recipe supports two optimizers from dspy:
COPRO: “Compilation-based aPproach to pRompt Optimization”. A tree-based optimizer that explores and refines prompts by iteratively generating new instruction variations.
BootstrapFewShot with Random Search: Optimizes by selecting the best few-shot demonstrations from your training data through random search.
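As a rough, non-authoritative sketch of how these settings come together in dspy, the snippet below splits the examples, runs one of the optimizers, and scores the initial and optimized programs on the held-out test set. It reuses the placeholder qa_program, examples, and exact_match_metric from the sketches above, and the 0.8 split ratio is illustrative.

```python
from dspy.evaluate import Evaluate
from dspy.teleprompt import COPRO

# Train/test split of the examples (ratio is illustrative).
split = int(0.8 * len(examples))
trainset, testset = examples[:split], examples[split:]

# Optimize the program's instructions on the training set with COPRO.
copro = COPRO(metric=exact_match_metric)
optimized_qa = copro.compile(qa_program, trainset=trainset, eval_kwargs={"num_threads": 4})

# Score the initial and optimized programs on the unseen test set.
evaluate = Evaluate(devset=testset, metric=exact_match_metric, num_threads=4)
score_test_initial = evaluate(qa_program)
score_test_optimized = evaluate(optimized_qa)
```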
Advanced parameters (Show Advanced Parameters checkbox)¶
For COPRO Optimizer:
COPRO: Depth: The number of optimization levels.
COPRO: Breadth: The number of new prompts to generate at each level.
COPRO: Initial Temperature: The temperature for prompt generation.
For BootstrapFewShot Optimizer:
Bootstrap: Max Bootstrapped Demos: Maximum number of bootstrapped demonstrations to generate (default: 4).
Bootstrap: Max Labeled Demos: Maximum number of labeled demonstrations to use (default: 16). These demonstrations are included in the prompt, so they will leak some training information into the evaluation if you don’t use a test set.
Bootstrap: Number of Candidate Programs: Number of candidate programs to evaluate during random search (default: 16).
Bootstrap: Number of Threads: Number of threads to use for parallel evaluation (default: 4).
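These advanced parameters correspond closely to the constructor arguments of the matching dspy optimizers. The sketch below shows that mapping; the COPRO values are dspy's library defaults used here for illustration, and exact_match_metric, qa_program, and trainset are the placeholders from the earlier sketches.

```python
from dspy.teleprompt import COPRO, BootstrapFewShotWithRandomSearch

# COPRO advanced parameters.
copro = COPRO(
    metric=exact_match_metric,
    depth=3,                # COPRO: Depth - number of optimization levels
    breadth=10,             # COPRO: Breadth - new prompts generated at each level
    init_temperature=1.4,   # COPRO: Initial Temperature for prompt generation
)

# BootstrapFewShot with Random Search advanced parameters.
bootstrap = BootstrapFewShotWithRandomSearch(
    metric=exact_match_metric,
    max_bootstrapped_demos=4,   # Bootstrap: Max Bootstrapped Demos
    max_labeled_demos=16,       # Bootstrap: Max Labeled Demos
    num_candidate_programs=16,  # Bootstrap: Number of Candidate Programs
    num_threads=4,              # Bootstrap: Number of Threads
)
optimized_qa = bootstrap.compile(qa_program, trainset=trainset)
```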