
MIPROv2 Optimizer

Overview

MIPROv2 (Multiprompt Instruction PRoposal Optimizer Version 2) is a prompt optimizer capable of optimizing both instructions and few-shot examples jointly. It does this by bootstrapping few-shot example candidates, proposing instructions grounded in different dynamics of the task, and finding an optimized combination of these options using Bayesian Optimization. It can be used to optimize few-shot examples and instructions jointly, or instructions alone for 0-shot optimization.

Example Usage

Setting up a Sample Pipeline

We'll build a basic answer generation pipeline over the GSM8K dataset. Let's start by configuring the LM, which will be the OpenAI LM client with gpt-3.5-turbo as the underlying model.

import dspy

turbo = dspy.OpenAI(model='gpt-3.5-turbo', max_tokens=250)
dspy.settings.configure(lm=turbo)

Now that the LM client is set up, it's time to load the train-dev split from the GSM8K class that DSPy provides:

from dspy.datasets.gsm8k import GSM8K, gsm8k_metric

gsm8k = GSM8K()

trainset, devset = gsm8k.train, gsm8k.dev

We'll now define a basic QA inline signature, i.e. question -> answer, and pass it to the ChainOfThought module, which augments the signature for CoT-style prompting.

class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.prog(question=question)

We also need a way to evaluate this pipeline. For that we'll use the Evaluate class that DSPy provides, with the gsm8k_metric we imported above as the metric.

from dspy.evaluate import Evaluate

evaluate = Evaluate(devset=devset[:], metric=gsm8k_metric, num_threads=8, display_progress=True, display_table=False)

To evaluate the CoT pipeline, we instantiate it and pass the instance to the evaluator call.

program = CoT()

evaluate(program, devset=devset[:])

Now that the baseline pipeline is ready, let's use the MIPROv2 optimizer to improve its performance!

Optimizing with MIPROv2

To get started with MIPROv2, we recommend using the auto flag, beginning with a light optimization run. This sets the relevant hyperparameters for you.

# Import the optimizer
from dspy.teleprompt import MIPROv2

# Initialize optimizer
teleprompter = MIPROv2(
    metric=gsm8k_metric,
    auto="light", # Can choose between light, medium, and heavy optimization runs
)

# Optimize program
print(f"Optimizing program with MIPRO...")
optimized_program = teleprompter.compile(
    program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    requires_permission_to_run=False,
)

# Save optimized program for future use
optimized_program.save("mipro_optimized")

# Evaluate optimized program
print("Evaluating optimized program...")
evaluate(optimized_program, devset=devset[:])
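
Because the optimized program was saved above, it can be reloaded later without re-running the optimization. A minimal sketch, assuming the same CoT class definition is available and that the program was saved to the same path:

# Reload the saved prompts into a fresh instance of the same program class
loaded_program = CoT()
loaded_program.load("mipro_optimized")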

Optimizing instructions only with MIPROv2 (0-Shot)

In some cases, we may want to optimize only the instructions, rather than including few-shot examples in the prompt. The code below demonstrates how this can be done using MIPROv2. Note that the key difference involves setting max_labeled_demos and max_bootstrapped_demos to zero.

# Import the optimizer
from dspy.teleprompt import MIPROv2

# Initialize optimizer
teleprompter = MIPROv2(
    metric=gsm8k_metric,
    auto="light", # Can choose between light, medium, and heavy optimization runs
)

# Optimize program
print(f"Optimizing zero-shot program with MIPRO...")
zeroshot_optimized_program = teleprompter.compile(
    program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=0, # ZERO FEW-SHOT EXAMPLES
    max_labeled_demos=0, # ZERO FEW-SHOT EXAMPLES
    requires_permission_to_run=False,
)

# Save optimized program for future use
zeroshot_optimized_program.save("mipro_zeroshot_optimized")

# Evaluate optimized program
print("Evaluating optimized zero-shot program...")
evaluate(zeroshot_optimized_program, devset=devset[:])

Optimizing with MIPROv2 (advanced)

Once you've gotten a feel for using MIPROv2 with auto settings, you may want to experiment with setting hyperparameters yourself to get the best results. The code below shows an example of how you can go about this. A full description of each parameter can be found in the section below.

# Import the optimizer
from dspy.teleprompt import MIPROv2

# Initialize optimizer
teleprompter = MIPROv2(
    metric=gsm8k_metric,
    num_candidates=7,
    init_temperature=0.5,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    verbose=False,
)

# Optimize program
print(f"Optimizing program with MIPRO...")
optimized_program = teleprompter.compile(
    program.deepcopy(),
    trainset=trainset,
    num_trials=15,
    minibatch_size=25,
    minibatch_full_eval_steps=10,
    minibatch=True, 
    requires_permission_to_run=False,
)

# Save optimized program for future use
optimized_program.save("mipro_optimized")

# Evaluate optimized program
print("Evaluating optimized program...")
evaluate(optimized_program, devset=devset[:])

Parameters

Initialization Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| metric | dspy.metric | Required | The evaluation metric used to optimize the task model. |
| prompt_model | dspy.LM | LM specified in dspy.settings | Model used for prompt generation. |
| task_model | dspy.LM | LM specified in dspy.settings | Model used for task execution. |
| auto | Optional[str] | None | If set to light, medium, or heavy, this will automatically configure the following hyperparameters: num_candidates, num_trials, minibatch, and will also cap the size of valset at 100, 300, and 1000 for light, medium, and heavy runs respectively. |
| num_candidates | int | 10 | Number of candidate instructions & few-shot example sets to generate and evaluate for each predictor. With num_candidates=10, a 2-module LM program is optimized over 10 candidates x 2 modules x 2 variables (few-shot examples and instructions for each module) = 40 candidate values in total. If you increase num_candidates, you will therefore likely want to increase num_trials as well (see Compile Parameters). |
| num_threads | int | 6 | Threads to use for evaluation. |
| max_errors | int | 10 | Maximum number of errors allowed during an evaluation run before an exception is thrown. |
| teacher_settings | dict | {} | Settings for the teacher model that bootstraps few-shot examples, e.g. {"lm": <dspy.LM object>}. If your LM program struggles to bootstrap any examples with your default model, it can be worth using a more powerful teacher model for bootstrapping. |
| max_bootstrapped_demos | int | 4 | Maximum number of bootstrapped demonstrations to generate and include in the prompt. |
| max_labeled_demos | int | 16 | Maximum number of labeled demonstrations to include in the prompt. Unlike bootstrapped examples, these are inputs & outputs sampled directly from the training set and do not include bootstrapped intermediate steps. |
| init_temperature | float | 1.0 | The initial temperature for prompt generation, influencing creativity. |
| verbose | bool | False | Enables printing intermediate steps and information. |
| track_stats | bool | True | Logs relevant information throughout the optimization process if set to True. |
| metric_threshold | float | None | If set, only bootstrapped few-shot examples whose score exceeds this threshold are kept. |
| seed | int | 9 | Seed for reproducibility. |
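
The prompt_model, task_model, and teacher_settings parameters make it possible to use different LMs for proposing prompts, running the task, and bootstrapping demonstrations. A brief, hedged sketch building on the earlier setup (the gpt-4 model name here is just an illustrative choice):

# Hypothetical setup: a stronger model for instruction proposal and bootstrapping,
# while the task itself still runs on gpt-3.5-turbo (the `turbo` client from above).
gpt4 = dspy.OpenAI(model='gpt-4', max_tokens=1000)

teleprompter = MIPROv2(
    metric=gsm8k_metric,
    prompt_model=gpt4,               # used to generate instruction candidates
    task_model=turbo,                # used to execute the program being optimized
    teacher_settings=dict(lm=gpt4),  # used when bootstrapping few-shot examples
    auto="light",
)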

Compile Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| student | dspy.Module | Required | The base program to optimize. |
| trainset | List[dspy.Example] | Required | Training dataset used to bootstrap few-shot examples and instructions. If a separate valset is not specified, 80% of this training set will also be used as a validation set for evaluating new candidate prompts. |
| valset | List[dspy.Example] | Defaults to 80% of trainset | Dataset used to evaluate candidate prompts. We recommend using somewhere between 50 and 500 examples for optimization. |
| num_trials | int | 30 | Number of optimization trials to run. When minibatch is True, this is the number of minibatch trials run on batches of size minibatch_size. When minibatch is False, each trial uses a full evaluation on the validation set. In both cases, we recommend setting num_trials to a minimum of 0.75 x # modules in program x # variables per module (2 if both few-shot examples & instructions are optimized, 1 in the 0-shot case). |
| minibatch | bool | True | Flag to evaluate each trial on a minibatch of data instead of the full validation set. |
| minibatch_size | int | 25 | Size of minibatches for evaluations. |
| minibatch_full_eval_steps | int | 10 | When minibatching is enabled, a full evaluation on the validation set is carried out every minibatch_full_eval_steps trials, on the set of prompts with the best average score across the minibatch trials so far. |
| max_bootstrapped_demos | Optional[int] | Defaults to init value | Maximum number of bootstrapped demonstrations to generate and include in the prompt. |
| max_labeled_demos | Optional[int] | Defaults to init value | Maximum number of labeled demonstrations to include in the prompt. Unlike bootstrapped examples, these are inputs & outputs sampled directly from the training set and do not include bootstrapped intermediate steps. |
| seed | Optional[int] | Defaults to init value | Seed for reproducibility. |
| program_aware_proposer | bool | True | Flag to include a generated summary of your LM program's code when proposing instructions. |
| data_aware_proposer | bool | True | Flag to include a generated summary of your training dataset when proposing instructions. |
| view_data_batch_size | int | 10 | Number of data examples to look at at a time when generating the dataset summary. |
| tip_aware_proposer | bool | True | Flag to include a randomly selected tip when proposing instructions. |
| fewshot_aware_proposer | bool | True | Flag to include generated few-shot examples when proposing instructions. |
| requires_permission_to_run | bool | True | Flag to require user confirmation before running the optimization. |
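
Since valset defaults to a split of trainset, you can instead pass a validation set explicitly (and fix the seed) when compiling. A brief sketch reusing the devset loaded earlier; the slice size of 300 is just an illustrative choice within the recommended 50-500 range:

# Use part of the GSM8K dev split directly as the validation set for optimization
optimized_program = teleprompter.compile(
    program.deepcopy(),
    trainset=trainset,
    valset=devset[:300],  # 50-500 examples recommended
    seed=9,
    requires_permission_to_run=False,
)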

How MIPROv2 works

At a high level, MIPROv2 works by creating both few-shot examples and new instructions for each predictor in your LM program, and then searching over these using Bayesian Optimization to find the best combination of these variables for your program.

These steps are broken down in more detail below:

1) Bootstrap Few-Shot Examples: The same bootstrapping technique used in BootstrapFewshotWithRandomSearch is used to create few-shot examples. This works by randomly sampling examples from your training set and running them through your LM program. If the program's output is correct for a given example, it is kept as a valid few-shot example candidate; otherwise, we try another example until the specified number of few-shot example candidates has been curated. This step creates num_candidates sets of max_bootstrapped_demos bootstrapped examples and max_labeled_demos basic examples sampled from the training set.

2) Propose Instruction Candidates: Next, we propose instruction candidates for each predictor in the program. This is done using another LM program as a proposer, which bootstraps & summarizes relevant information about the task to generate high-quality instructions. Specifically, the instruction proposer uses (1) a generated summary of properties of the training dataset, (2) a generated summary of your LM program's code and the specific predictor that an instruction is being generated for, (3) the previously bootstrapped few-shot examples, which show reference inputs / outputs for a given predictor, and (4) a randomly sampled tip for generation (e.g. "be creative", "be concise") to help explore the space of potential instructions.

3) Find an Optimized Combination of Few-Shot Examples & Instructions: Finally, now that we've created these few-shot examples and instructions, we use Bayesian Optimization to choose which of them work best for each predictor in our program. This works by running a series of num_trials trials, where a new set of prompts is evaluated on our validation set at each trial. This helps the Bayesian Optimizer learn over time which combination of prompts works best. If minibatch is set to True (the default), each new set of prompts is only evaluated on a minibatch of size minibatch_size at each trial, which generally allows for more efficient exploration / exploitation. The best-averaging set of prompts is then evaluated on the full validation set every minibatch_full_eval_steps trials to get a less noisy performance benchmark. At the end of the optimization process, the LM program with the set of prompts that performed best on the full validation set is returned.
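
To make the control flow of step 3 concrete, here is a toy, self-contained sketch of the trial loop. It is not MIPROv2's implementation: plain random sampling stands in for the Bayesian Optimizer, and the metric scores are simulated, but it illustrates how minibatch trials and periodic full evaluations interact.

# Toy sketch of the step-3 trial loop (not MIPROv2's actual code).
# Random sampling stands in for the Bayesian surrogate; scores are simulated.
import random

random.seed(0)

instructions = [f"instruction_{i}" for i in range(7)]  # hypothetical candidates
demo_sets = [f"demos_{i}" for i in range(7)]           # hypothetical candidates

# Simulated "true" quality of each instruction / demo-set combination.
true_quality = {(i, d): random.random() for i in instructions for d in demo_sets}

num_trials, minibatch_size, valset_size, minibatch_full_eval_steps = 15, 25, 300, 5

def evaluate_prompts(combo, batch_size):
    """Simulated metric: noisier on small minibatches, nearly exact on the full valset."""
    noise = random.uniform(-0.1, 0.1) * (minibatch_size / batch_size)
    return true_quality[combo] + noise

history, best = [], None
for trial in range(1, num_trials + 1):
    # Propose a combination (MIPROv2 queries its Bayesian Optimizer here instead).
    combo = (random.choice(instructions), random.choice(demo_sets))
    history.append((combo, evaluate_prompts(combo, batch_size=minibatch_size)))

    # Every few trials, fully evaluate the best-scoring combination seen so far.
    if trial % minibatch_full_eval_steps == 0:
        top_combo = max(history, key=lambda entry: entry[1])[0]
        full_score = evaluate_prompts(top_combo, batch_size=valset_size)
        if best is None or full_score > best[1]:
            best = (top_combo, full_score)

print("Best prompts on the full validation set:", best)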

For those interested in more details, more information on MIPROv2 along with a study on MIPROv2 compared with other DSPy optimizers can be found in this paper.