dspy.GEPA: Reflective Prompt Optimizer¶
GEPA (Genetic-Pareto) is a reflective optimizer proposed in "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning" (Agrawal et al., 2025, arxiv:2507.19457) that adaptively evolves textual components (such as prompts) of arbitrary systems. In addition to the scalar scores returned by metrics, users can provide GEPA with textual feedback to guide the optimization process. Such feedback gives GEPA visibility into why the system received the score it did, so GEPA can introspect on how to improve it. This allows GEPA to propose high-performing prompts in very few rollouts.
dspy.GEPA(metric: GEPAFeedbackMetric, *, auto: Literal['light', 'medium', 'heavy'] | None = None, max_full_evals: int | None = None, max_metric_calls: int | None = None, reflection_minibatch_size: int = 3, candidate_selection_strategy: Literal['pareto', 'current_best'] = 'pareto', reflection_lm: LM | None = None, skip_perfect_score: bool = True, add_format_failure_as_feedback: bool = False, instruction_proposer: ProposalFn | None = None, component_selector: ReflectionComponentSelector | str = 'round_robin', use_merge: bool = True, max_merge_invocations: int | None = 5, num_threads: int | None = None, failure_score: float = 0.0, perfect_score: float = 1.0, log_dir: str | None = None, track_stats: bool = False, use_wandb: bool = False, wandb_api_key: str | None = None, wandb_init_kwargs: dict[str, Any] | None = None, track_best_outputs: bool = False, warn_on_score_mismatch: bool = True, enable_tool_optimization: bool = False, use_mlflow: bool = False, seed: int | None = 0, gepa_kwargs: dict | None = None)
Bases: Teleprompter
GEPA is an evolutionary optimizer, which uses reflection to evolve text components
of complex systems. GEPA is proposed in the paper GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.
The GEPA optimization engine is provided by the gepa package, available from https://github.com/gepa-ai/gepa.
GEPA captures full traces of the DSPy module's execution, identifies the parts of the trace corresponding to a specific predictor, and reflects on the behaviour of the predictor to propose a new instruction for the predictor. GEPA allows users to provide textual feedback to the optimizer, which is used to guide the evolution of the predictor. The textual feedback can be provided at the granularity of individual predictors, or at the level of the entire system's execution.
To provide feedback to the GEPA optimizer, implement a metric with the following signature:

```python
def metric(
    gold: Example,
    pred: Prediction,
    trace: Optional[DSPyTrace] = None,
    pred_name: Optional[str] = None,
    pred_trace: Optional[DSPyTrace] = None,
) -> float | ScoreWithFeedback:
    """
    This function is called with the following arguments:
    - gold: The gold example.
    - pred: The predicted output.
    - trace: Optional. The trace of the program's execution.
    - pred_name: Optional. The name of the target predictor currently being optimized by GEPA,
      for which the feedback is being requested.
    - pred_trace: Optional. The trace of the target predictor's execution GEPA is seeking feedback for.

    Note the `pred_name` and `pred_trace` arguments. During optimization, GEPA will call the metric to obtain
    feedback for individual predictors being optimized. GEPA provides the name of the predictor in `pred_name`
    and the sub-trace (of the trace) corresponding to the predictor in `pred_trace`.

    If available at the predictor level, the metric should return a ScoreWithFeedback with fields
    {'score': float, 'feedback': str} (e.g., dspy.Prediction(score=..., feedback=...)) corresponding
    to the predictor.
    If not available at the predictor level, the metric can also return text feedback at the program level
    (using just the gold, pred and trace).

    If no feedback is returned, GEPA will use a simple text feedback consisting of just the score:
    f"This trajectory got a score of {score}."
    """
    ...
```
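For illustration, here is a minimal sketch of such a feedback metric for a simple question-answering task. It assumes the examples and predictions expose an `answer` field; the field names and scoring rule are placeholders, not part of the GEPA API.

```python
import dspy

def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Hypothetical QA metric: assumes `gold.answer` and `pred.answer` exist.
    correct = gold.answer.strip().lower() == pred.answer.strip().lower()
    score = 1.0 if correct else 0.0
    if correct:
        feedback = f"Correct. The answer '{pred.answer}' matches the gold answer."
    else:
        feedback = (
            f"Incorrect. The program answered '{pred.answer}', but the gold answer is "
            f"'{gold.answer}'. Identify which reasoning step went wrong and how to fix it."
        )
    # Returning score + feedback lets GEPA reflect on *why* this score was obtained.
    return dspy.Prediction(score=score, feedback=feedback)
```

The same function works for both program-level and predictor-level calls, since it ignores `pred_name` and `pred_trace`; a more targeted metric could branch on `pred_name` to give per-predictor feedback.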
GEPA can also be used as a batch inference-time search strategy: pass the batch as both trainset and valset, set track_stats=True and track_best_outputs=True, and use the detailed_results attribute of the optimized program (returned by compile) to read off the Pareto frontier of the batch. optimized_program.detailed_results.best_outputs_valset will contain the best outputs for each task in the batch.

Example:

```python
gepa = GEPA(metric=metric, track_stats=True, track_best_outputs=True)
batch_of_tasks = [dspy.Example(...) for task in tasks]
new_prog = gepa.compile(student, trainset=batch_of_tasks, valset=batch_of_tasks)
highest_scores = new_prog.detailed_results.highest_score_achieved_per_val_task
# highest_scores is a list of the best scores found, one for each task in the batch.
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metric | GEPAFeedbackMetric | The metric function to use for feedback and evaluation. | required |
| auto | Literal['light', 'medium', 'heavy'] \| None | The auto budget to use for the run. Options: "light", "medium", "heavy". | None |
| max_full_evals | int \| None | The maximum number of full evaluations to perform. | None |
| max_metric_calls | int \| None | The maximum number of metric calls to perform. | None |
| reflection_minibatch_size | int | The number of examples to use for reflection in a single GEPA step. Default is 3. | 3 |
| candidate_selection_strategy | Literal['pareto', 'current_best'] | The strategy to use for candidate selection. Default is "pareto", which stochastically selects candidates from the Pareto frontier of all validation scores. Options: "pareto", "current_best". | 'pareto' |
| reflection_lm | LM \| None | The language model to use for reflection. Required parameter. GEPA benefits from a strong reflection model; consider using a large model such as dspy.LM(model='gpt-5', temperature=1.0, max_tokens=32000). | None |
| skip_perfect_score | bool | Whether to skip examples with perfect scores during reflection. Default is True. | True |
| instruction_proposer | ProposalFn \| None | Optional custom instruction proposer implementing GEPA's ProposalFn protocol. The default (None, recommended for most users) uses GEPA's proven instruction proposer from the GEPA library; see the documentation on custom instruction proposers. This is an advanced feature, only needed for specialized scenarios: multi-modal handling (processing dspy.Image inputs alongside textual information), nuanced control over constraints (fine-grained control over instruction length, format, and structural requirements beyond standard feedback mechanisms), domain-specific knowledge injection (specialized terminology or context that cannot be provided through feedback_func alone), provider-specific prompting (optimizations for specific LLM providers, e.g. OpenAI or Anthropic, with unique formatting preferences), coupled component updates (coordinated updates of multiple components together rather than independent optimization), and external knowledge integration (runtime access to databases, APIs, or knowledge bases). The default proposer handles the vast majority of use cases effectively. Use MultiModalInstructionProposer() from dspy.teleprompt.gepa.instruction_proposal for visual content, or implement a custom ProposalFn for highly specialized requirements. Note: when both instruction_proposer and reflection_lm are set, the instruction_proposer is called in the reflection_lm context. However, reflection_lm is optional when using a custom instruction_proposer; custom instruction proposers can invoke their own LLMs if needed. | None |
| component_selector | ReflectionComponentSelector \| str | Custom component selector implementing the ReflectionComponentSelector protocol, or a string specifying a built-in selector strategy. Controls which components (predictors) are selected for optimization at each iteration. Defaults to 'round_robin', which cycles through components one at a time. Available string options: 'round_robin' (cycles through components sequentially) and 'all' (selects all components for simultaneous optimization). Custom selectors can implement strategies such as LLM-driven selection based on optimization state and trajectories. See the gepa component selectors for available built-in selectors and the ReflectionComponentSelector protocol for implementing custom selectors. | 'round_robin' |
| add_format_failure_as_feedback | bool | Whether to add format failures as feedback. Default is False. | False |
| use_merge | bool | Whether to use merge-based optimization. Default is True. | True |
| max_merge_invocations | int \| None | The maximum number of merge invocations to perform. Default is 5. | 5 |
| num_threads | int \| None | The number of threads to use for evaluation. | None |
| failure_score | float | The score to assign to failed examples. Default is 0.0. | 0.0 |
| perfect_score | float | The maximum score achievable by the metric. Default is 1.0. Used by GEPA to determine whether all examples in a minibatch are perfect. | 1.0 |
| log_dir | str \| None | The directory to save the logs. GEPA saves elaborate logs, along with all candidate programs, in this directory; running GEPA with the same log_dir allows the run to resume from the last checkpoint. | None |
| track_stats | bool | Whether to return detailed results and all proposed programs in the detailed_results attribute of the optimized program. Default is False. | False |
| use_wandb | bool | Whether to use wandb for logging. Default is False. | False |
| wandb_api_key | str \| None | The API key to use for wandb. If not provided, wandb will use the API key from the environment variable WANDB_API_KEY. | None |
| wandb_init_kwargs | dict[str, Any] \| None | Additional keyword arguments to pass to wandb.init. | None |
| track_best_outputs | bool | Whether to track the best outputs on the validation set. track_stats must be True if track_best_outputs is True. The optimized program's detailed_results.best_outputs_valset will then contain the best outputs for each validation task. | False |
| warn_on_score_mismatch | bool | GEPA (currently) expects the metric to return the same module-level score whether or not pred_name is passed. This flag (defaults to True) determines whether a warning is raised if a mismatch between the module-level and predictor-level score is detected. | True |
| enable_tool_optimization | bool | Whether to enable joint optimization of dspy.ReAct modules. When enabled, GEPA jointly optimizes predictor instructions and tool descriptions together for dspy.ReAct modules. See the Tool Optimization guide for details on when to use this feature and how it works. Default is False. | False |
| seed | int \| None | The random seed to use for reproducibility. Default is 0. | 0 |
| gepa_kwargs | dict \| None | (Optional) Additional keyword arguments passed directly to gepa.optimize. Useful for accessing advanced GEPA features not directly exposed through DSPy's GEPA interface. Available parameters: batch_sampler (strategy for selecting training examples; a BatchSampler instance or the string 'epoch_shuffled'; defaults to 'epoch_shuffled'; only valid when reflection_minibatch_size is None) and merge_val_overlap_floor (minimum number of shared validation ids required between parents before attempting a merge subsample; only relevant when merge-based optimization is enabled). Note: parameters already handled by DSPy's GEPA class are overridden by the direct parameters and should not be passed through gepa_kwargs. | None |
Note
Budget Configuration: Exactly one of auto, max_full_evals, or max_metric_calls must be provided.
The auto parameter provides preset configurations: "light" for quick experimentation, "medium" for
balanced optimization, and "heavy" for thorough optimization.
Reflection Configuration: The reflection_lm parameter is required and should be a strong language model.
GEPA performs best with models like dspy.LM(model='gpt-5', temperature=1.0, max_tokens=32000).
The reflection process analyzes failed examples to generate feedback for program improvement.
Merge Configuration: GEPA can merge successful program variants using use_merge=True.
The max_merge_invocations parameter controls how many merge attempts are made during optimization.
Evaluation Configuration: Use num_threads to parallelize evaluation. The failure_score and
perfect_score parameters help GEPA understand your metric's range and optimize accordingly.
Logging Configuration: Set log_dir to save detailed logs and enable checkpoint resuming.
Use track_stats=True to access detailed optimization results via the detailed_results attribute.
Enable use_wandb=True for experiment tracking and visualization.
Reproducibility: Set seed to ensure consistent results across runs with the same configuration.
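Putting the configuration notes above together, a typical setup might look like the following sketch (the model name, thread count, and log directory are illustrative choices, not requirements):

```python
import dspy

# A strong reflection model, as recommended above.
reflection_lm = dspy.LM(model="gpt-5", temperature=1.0, max_tokens=32000)

gepa = dspy.GEPA(
    metric=metric,            # feedback metric as described earlier
    auto="light",             # exactly one of auto / max_full_evals / max_metric_calls
    reflection_lm=reflection_lm,
    num_threads=8,            # parallelize evaluation (illustrative value)
    log_dir="./gepa_logs",    # save detailed logs and enable checkpoint resuming
    track_stats=True,         # expose detailed_results on the optimized program
    seed=0,                   # reproducibility
)
```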
Functions¶
auto_budget(num_preds, num_candidates, valset_size: int, minibatch_size: int = 35, full_eval_steps: int = 5) -> int
compile(student: Module, *, trainset: list[Example], teacher: Module | None = None, valset: list[Example] | None = None) -> Module
GEPA uses the trainset to perform reflective updates to the prompt, but uses the valset for tracking Pareto scores. If no valset is provided, GEPA will use the trainset for both.

Parameters:
- student: The student module to optimize.
- trainset: The training set to use for reflective updates.
- valset: The validation set to use for tracking Pareto scores. If not provided, GEPA will use the trainset for both.
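As a sketch of typical usage (assuming `examples` is a list of dspy.Example objects, `student` is the DSPy module to optimize, and `gepa` is configured as in the earlier example):

```python
# Hypothetical split: reflective updates use the trainset, Pareto tracking uses the valset.
trainset, valset = examples[:100], examples[100:150]

optimized = gepa.compile(student, trainset=trainset, valset=valset)
# If valset is omitted, GEPA uses the trainset for Pareto tracking as well.
```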
get_params() -> dict[str, Any]
Get the parameters of the teleprompter.

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | The parameters of the teleprompter. |
One of the key insights behind GEPA is its ability to leverage domain-specific textual feedback. Users should provide a feedback function as the GEPA metric, which has the following call signature:
dspy.teleprompt.gepa.gepa.GEPAFeedbackMetric
Bases: Protocol
Functions¶
__call__(gold: Example, pred: Prediction, trace: Optional[DSPyTrace], pred_name: str | None, pred_trace: Optional[DSPyTrace]) -> Union[float, ScoreWithFeedback]
This function is called with the following arguments:
- gold: The gold example.
- pred: The predicted output.
- trace: Optional. The trace of the program's execution.
- pred_name: Optional. The name of the target predictor currently being optimized by GEPA, for which the feedback is being requested.
- pred_trace: Optional. The trace of the target predictor's execution GEPA is seeking feedback for.
Note the pred_name and pred_trace arguments. During optimization, GEPA will call the metric to obtain
feedback for individual predictors being optimized. GEPA provides the name of the predictor in pred_name
and the sub-trace (of the trace) corresponding to the predictor in pred_trace.
If available at the predictor level, the metric should return dspy.Prediction(score: float, feedback: str)
corresponding to the predictor.
If not available at the predictor level, the metric can also return a text feedback at the program level
(using just the gold, pred and trace).
If no feedback is returned, GEPA will use a simple text feedback consisting of just the score:
f"This trajectory got a score of {score}."
When track_stats=True, GEPA returns detailed results about all of the proposed candidates, along with metadata about the optimization run. These results are available in the detailed_results attribute of the optimized program returned by GEPA, and have the following type:
dspy.teleprompt.gepa.gepa.DspyGEPAResult(candidates: list[Module], parents: list[list[int | None]], val_aggregate_scores: list[float], val_subscores: list[list[float]], per_val_instance_best_candidates: list[set[int]], discovery_eval_counts: list[int], best_outputs_valset: list[list[tuple[int, list[Prediction]]]] | None = None, total_metric_calls: int | None = None, num_full_val_evals: int | None = None, log_dir: str | None = None, seed: int | None = None)
dataclass

Additional data related to the GEPA run.

Fields:
- candidates: list of proposed candidates (component_name -> component_text)
- parents: lineage info; for each candidate i, parents[i] is a list of parent indices or None
- val_aggregate_scores: per-candidate aggregate score on the validation set (higher is better)
- val_subscores: per-candidate per-instance scores on the validation set (len == num_val_instances)
- per_val_instance_best_candidates: for each val instance t, a set of candidate indices achieving the best score on t
- discovery_eval_counts: budget (number of metric calls / rollouts) consumed up to the discovery of each candidate
- total_metric_calls: total number of metric calls made across the run
- num_full_val_evals: number of full validation evaluations performed
- log_dir: where artifacts were written (if any)
- seed: RNG seed for reproducibility (if known)
- best_idx: candidate index with the highest val_aggregate_scores
- best_candidate: the program text mapping for best_idx
Attributes¶
- candidates: list[Module]
- parents: list[list[int | None]]
- val_aggregate_scores: list[float]
- val_subscores: list[list[float]]
- per_val_instance_best_candidates: list[set[int]]
- discovery_eval_counts: list[int]
- best_outputs_valset: list[list[tuple[int, list[Prediction]]]] | None = None
- total_metric_calls: int | None = None
- num_full_val_evals: int | None = None
- log_dir: str | None = None
- seed: int | None = None
- best_idx: int (property)
- best_candidate: dict[str, str] (property)
- highest_score_achieved_per_val_task: list[float] (property)

Functions¶
- to_dict() -> dict[str, Any]
- from_gepa_result(gepa_result: GEPAResult, adapter: DspyAdapter) -> DspyGEPAResult (staticmethod)
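For example, assuming `optimized` is a program returned by compile with track_stats=True, the run can be inspected roughly as follows:

```python
results = optimized.detailed_results  # DspyGEPAResult

print(len(results.candidates))                         # number of proposed candidate programs
print(results.best_idx)                                # index of the best candidate
print(results.val_aggregate_scores[results.best_idx])  # its aggregate validation score
print(results.best_candidate)                          # component_name -> component_text mapping
print(results.highest_score_achieved_per_val_task)     # best score found for each val instance
```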
Usage Examples¶
See GEPA usage tutorials in GEPA Tutorials.
Inference-Time Search¶
GEPA can act as a test-time/inference search mechanism. By setting your valset to your evaluation batch and using track_best_outputs=True, GEPA produces for each batch element the highest-scoring outputs found during the evolutionary search.
```python
gepa = dspy.GEPA(metric=metric, track_stats=True, track_best_outputs=True, ...)
new_prog = gepa.compile(student, trainset=my_tasks, valset=my_tasks)

highest_score_achieved_per_task = new_prog.detailed_results.highest_score_achieved_per_val_task
best_outputs = new_prog.detailed_results.best_outputs_valset
```
How Does GEPA Work?¶
1. Reflective Prompt Mutation¶
GEPA uses LLMs to reflect on structured execution traces (inputs, outputs, failures, feedback), targeting a chosen module and proposing a new instruction/program text tailored to real observed failures and rich textual/environmental feedback.
2. Rich Textual Feedback as Optimization Signal¶
GEPA can leverage any textual feedback available—not just scalar rewards. This includes evaluation logs, code traces, failed parses, constraint violations, error message strings, or even isolated submodule-specific feedback. This allows actionable, domain-aware optimization.
3. Pareto-based Candidate Selection¶
Rather than evolving just the best global candidate (which leads to local optima or stagnation), GEPA maintains a Pareto frontier: the set of candidates which achieve the highest score on at least one evaluation instance. In each iteration, the next candidate to mutate is sampled (with probability proportional to coverage) from this frontier, guaranteeing both exploration and robust retention of complementary strategies.
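As a small illustration of the idea (toy numbers, not GEPA's implementation), the frontier can be computed from per-candidate, per-instance validation scores:

```python
# Rows are candidates, columns are validation instances (toy scores).
val_subscores = [
    [0.9, 0.2, 0.5],  # candidate 0
    [0.4, 0.8, 0.5],  # candidate 1
    [0.6, 0.3, 0.7],  # candidate 2
]

# A candidate is on the Pareto frontier if it achieves the best score
# on at least one validation instance.
num_instances = len(val_subscores[0])
frontier = {
    max(range(len(val_subscores)), key=lambda c: val_subscores[c][i])
    for i in range(num_instances)
}
print(frontier)  # {0, 1, 2}: each candidate is best on a different instance
```

GEPA then samples the next candidate to mutate from this frontier, weighting candidates by how many instances they win, so that complementary strategies all get a chance to evolve.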
Algorithm Summary¶
- Initialize the candidate pool with the unoptimized program.
- Iterate:
- Sample a candidate (from Pareto frontier).
- Sample a minibatch from the train set.
- Collect execution traces + feedbacks for module rollout on minibatch.
- Select a module of the candidate for targeted improvement.
- LLM Reflection: Propose a new instruction/prompt for the targeted module using reflective meta-prompting and the gathered feedback.
- Roll out the new candidate on the minibatch; if improved, evaluate on Pareto validation set.
- Update the candidate pool/Pareto frontier.
- [Optionally] System-aware merge/crossover: Combine best-performing modules from distinct lineages.
- Continue until rollout or metric budget is exhausted.
- Return candidate with best aggregate performance on validation.
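The loop above can be summarized with the following pseudocode-style sketch; all helper functions here are hypothetical stand-ins for the steps listed, not the actual gepa package API:

```python
def gepa_loop(program, trainset, valset, metric, budget):
    pool = [program]                                   # candidate pool starts with the unoptimized program
    while budget.remaining() > 0:
        parent = sample_from_pareto_frontier(pool)     # hypothetical helper: Pareto-based selection
        minibatch = sample_minibatch(trainset)         # hypothetical helper
        traces, feedback = run_with_feedback(parent, minibatch, metric)    # collect traces + textual feedback
        module = select_module(parent)                 # pick one module, e.g. round-robin
        child = propose_new_instruction(parent, module, traces, feedback)  # LLM reflection on the traces
        if score(child, minibatch, metric) > score(parent, minibatch, metric):
            evaluate_on_valset(child, valset, metric)  # full evaluation for Pareto bookkeeping
            pool.append(child)
    return best_by_aggregate_val_score(pool)           # best aggregate validation performance
```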
Implementing Feedback Metrics¶
A well-designed metric is central to GEPA's sample efficiency and the richness of its learning signal. GEPA expects the metric to return a dspy.Prediction(score=..., feedback=...). GEPA leverages natural-language traces from LLM-based workflows for optimization, preserving intermediate trajectories and errors in plain text rather than reducing them to numerical rewards. This mirrors human diagnostic processes, enabling clearer identification of system behaviors and bottlenecks.
Practical Recipe for GEPA-Friendly Feedback:
- Leverage Existing Artifacts: Use logs, unit tests, evaluation scripts, and profiler outputs; surfacing these often suffices.
- Decompose Outcomes: Break scores into per-objective components (e.g., correctness, latency, cost, safety) and attribute errors to steps.
- Expose Trajectories: Label pipeline stages, reporting pass/fail with salient errors (e.g., in code generation pipelines).
- Ground in Checks: Employ automatic validators (unit tests, schemas, simulators) or LLM-as-a-judge for non-verifiable tasks (as in PUPA).
- Prioritize Clarity: Focus on error coverage and decision points over technical complexity.
Examples¶
- Document Retrieval (e.g., HotpotQA): List correctly retrieved, incorrect, or missed documents, beyond mere Recall/F1 scores.
- Multi-Objective Tasks (e.g., PUPA): Decompose aggregate scores to reveal contributions from each objective, highlighting tradeoffs (e.g., quality vs. privacy).
- Stacked Pipelines (e.g., code generation: parse → compile → run → profile → evaluate): Expose stage-specific failures; natural-language traces often suffice for LLM self-correction.
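As a concrete sketch of the document-retrieval case above, a feedback metric might report exactly which documents were found, missed, or spurious; the `gold_docs` and `retrieved_docs` field names are placeholders, not part of any DSPy dataset:

```python
import dspy

def retrieval_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Hypothetical fields: gold.gold_docs and pred.retrieved_docs are lists of document titles.
    gold_docs, retrieved = set(gold.gold_docs), set(pred.retrieved_docs)
    recall = len(gold_docs & retrieved) / len(gold_docs) if gold_docs else 0.0
    feedback = (
        f"Correctly retrieved: {sorted(gold_docs & retrieved)}. "
        f"Missed: {sorted(gold_docs - retrieved)}. "
        f"Irrelevant: {sorted(retrieved - gold_docs)}."
    )
    # Naming the documents tells GEPA *which* hops failed, not just the recall number.
    return dspy.Prediction(score=recall, feedback=feedback)
```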
Tool Optimization with GEPA¶
When enable_tool_optimization=True, GEPA jointly optimizes dspy.ReAct modules together with their tools: it updates predictor instructions and tool descriptions/argument descriptions together, based on execution traces and feedback, instead of keeping tool behavior fixed.
For details, examples, and the underlying design (tool discovery, naming requirements, and interaction with custom instruction proposers), see Tool Optimization.
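A minimal sketch of enabling this for a dspy.ReAct agent might look as follows (the tool, signature, and data are placeholders; see the Tool Optimization guide for the full workflow):

```python
import dspy

def search_web(query: str) -> str:
    """Search the web and return a snippet of results."""  # tool description GEPA may rewrite
    ...

agent = dspy.ReAct("question -> answer", tools=[search_web])

gepa = dspy.GEPA(
    metric=metric,                  # feedback metric as described earlier
    auto="light",
    reflection_lm=reflection_lm,    # assumed to be defined as in the earlier sketch
    enable_tool_optimization=True,  # jointly evolve instructions and tool descriptions
)
optimized_agent = gepa.compile(agent, trainset=trainset)
```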
Custom Instruction Proposal¶
For advanced customization of GEPA's instruction proposal mechanism, including custom instruction proposers and component selectors, see Advanced Features.
Further Reading¶
- GEPA Paper: arxiv:2507.19457
- GEPA GitHub (https://github.com/gepa-ai/gepa): the repository providing the core GEPA evolution pipeline used by the dspy.GEPA optimizer.
- DSPy Tutorials