dspy.GEPA: Reflective Prompt Optimizer
GEPA (Genetic-Pareto) is a reflective optimizer proposed in "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning" (Agrawal et al., 2025, arxiv:2507.19457), which adaptively evolves textual components (such as prompts) of arbitrary systems. In addition to scalar scores returned by metrics, users can also provide GEPA with textual feedback to guide the optimization process. Such textual feedback gives GEPA more visibility into why the system got the score that it did, and GEPA can then introspect on it to identify how to improve the score. This allows GEPA to propose high-performing prompts in very few rollouts.
dspy.GEPA(metric: GEPAFeedbackMetric, *, auto: Literal['light', 'medium', 'heavy'] | None = None, max_full_evals: int | None = None, max_metric_calls: int | None = None, reflection_minibatch_size: int = 3, candidate_selection_strategy: Literal['pareto', 'current_best'] = 'pareto', reflection_lm: LM | None = None, skip_perfect_score: bool = True, add_format_failure_as_feedback: bool = False, use_merge: bool = True, max_merge_invocations: int | None = 5, num_threads: int | None = None, failure_score: float = 0.0, perfect_score: float = 1.0, log_dir: str = None, track_stats: bool = False, use_wandb: bool = False, wandb_api_key: str | None = None, wandb_init_kwargs: dict[str, Any] | None = None, track_best_outputs: bool = False, seed: int | None = 0)
Bases: Teleprompter
GEPA is an evolutionary optimizer that uses reflection to evolve the text components of complex systems. GEPA is proposed in the paper GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. The GEPA optimization engine is provided by the gepa package, available at https://github.com/gepa-ai/gepa.
GEPA captures full traces of the DSPy module's execution, identifies the parts of the trace corresponding to a specific predictor, and reflects on the behaviour of the predictor to propose a new instruction for the predictor. GEPA allows users to provide textual feedback to the optimizer, which is used to guide the evolution of the predictor. The textual feedback can be provided at the granularity of individual predictors, or at the level of the entire system's execution.
To provide feedback to the GEPA optimizer, implement a metric as follows:
def metric(
    gold: Example,
    pred: Prediction,
    trace: Optional[DSPyTrace] = None,
    pred_name: Optional[str] = None,
    pred_trace: Optional[DSPyTrace] = None,
) -> float | ScoreWithFeedback:
    """
    This function is called with the following arguments:
    - gold: The gold example.
    - pred: The predicted output.
    - trace: Optional. The trace of the program's execution.
    - pred_name: Optional. The name of the target predictor currently being optimized by GEPA,
      for which the feedback is being requested.
    - pred_trace: Optional. The trace of the target predictor's execution GEPA is seeking feedback for.

    Note the pred_name and pred_trace arguments. During optimization, GEPA will call the metric
    to obtain feedback for individual predictors being optimized. GEPA provides the name of the
    predictor in pred_name and the sub-trace (of the trace) corresponding to the predictor in
    pred_trace.

    If available at the predictor level, the metric should return a ScoreWithFeedback, i.e.
    dspy.Prediction(score: float, feedback: str), corresponding to the predictor.
    If not available at the predictor level, the metric can also return text feedback at the
    program level (using just the gold, pred and trace).
    If no feedback is returned, GEPA will use a simple text feedback consisting of just the score:
    f"This trajectory got a score of {score}."
    """
    ...
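For example, a minimal sketch of such a feedback metric for a question-answering task might look like the following; the answer field and the exact-match rule are illustrative assumptions, not requirements of the GEPA API:

import dspy

def qa_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Illustrative scoring rule: case-insensitive exact match on an assumed `answer` field.
    correct = gold.answer.strip().lower() == pred.answer.strip().lower()
    score = 1.0 if correct else 0.0

    if correct:
        feedback = f"Correct. The answer '{pred.answer}' matches the reference."
    else:
        feedback = (
            f"Incorrect. The program answered '{pred.answer}' but the reference answer is "
            f"'{gold.answer}'. Identify which retrieved evidence supports the reference answer."
        )

    # Either a plain float or a Prediction carrying score and feedback may be returned.
    return dspy.Prediction(score=score, feedback=feedback)

GEPA uses the same metric for evaluation and for textual feedback.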
GEPA can also be used as a batch inference-time search strategy by passing valset=trainset, track_stats=True, track_best_outputs=True, and using the detailed_results attribute of the optimized program (returned by compile) to get the Pareto frontier of the batch. optimized_program.detailed_results.best_outputs_valset will contain the best outputs for each task in the batch.
Example:
gepa = GEPA(metric=metric, track_stats=True)
batch_of_tasks = [dspy.Example(...) for task in tasks]
new_prog = gepa.compile(student, trainset=trainset, valset=batch_of_tasks)
pareto_frontier = new_prog.detailed_results.highest_score_achieved_per_val_task
# pareto_frontier is a list of the best scores achieved, one for each task in the batch.
Parameters:

Name | Description | Default
---|---|---
metric | The metric function to use for feedback and evaluation. | required

Budget configuration (exactly one of the following must be provided):

Name | Description | Default
---|---|---
auto | The auto budget to use for the run ('light', 'medium', or 'heavy'). | None
max_full_evals | The maximum number of full evaluations to perform. | None
max_metric_calls | The maximum number of metric calls to perform. | None

Reflection-based configuration:

Name | Description | Default
---|---|---
reflection_minibatch_size | The number of examples to use for reflection in a single GEPA step. | 3
candidate_selection_strategy | The strategy to use for candidate selection. "pareto" (the default) stochastically selects candidates from the Pareto frontier of all validation scores. | 'pareto'
reflection_lm | [Required] The language model to use for reflection. GEPA benefits from a strong reflection model. | None

Merge-based configuration:

Name | Description | Default
---|---|---
use_merge | Whether to use merge-based optimization. | True
max_merge_invocations | The maximum number of merge invocations to perform. | 5

Evaluation configuration:

Name | Description | Default
---|---|---
num_threads | The number of threads to use for evaluation. | None
failure_score | The score to assign to failed examples. | 0.0
perfect_score | The maximum score achievable by the metric. Used by GEPA to determine if all examples in a minibatch are perfect. | 1.0

Logging configuration:

Name | Description | Default
---|---|---
log_dir | The directory to save the logs. GEPA saves elaborate logs, along with all the candidate programs, in this directory. Running GEPA with the same log_dir resumes the run from the last checkpoint. | None
track_stats | Whether to return detailed results and all proposed programs in the detailed_results attribute of the optimized program. | False
use_wandb | Whether to use wandb for logging. | False
wandb_api_key | The API key to use for wandb. If not provided, wandb will use the API key from the WANDB_API_KEY environment variable. | None
wandb_init_kwargs | Additional keyword arguments to pass to wandb.init. | None
track_best_outputs | Whether to track the best outputs on the validation set. track_stats must be True if track_best_outputs is True. | False

Reproducibility:

Name | Description | Default
---|---|---
seed | The random seed to use for reproducibility. | 0
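For orientation, a minimal configuration sketch is shown below; the model name, budget, and other values are illustrative assumptions rather than recommendations from the paper or the library:

import dspy

reflection_lm = dspy.LM(model="openai/gpt-4o", temperature=1.0)  # illustrative reflection model

gepa = dspy.GEPA(
    metric=metric,                    # a feedback metric as described above
    max_metric_calls=600,             # exactly one of auto / max_full_evals / max_metric_calls
    reflection_lm=reflection_lm,
    candidate_selection_strategy="pareto",
    num_threads=8,
    log_dir="./gepa_logs",
    track_stats=True,
    seed=0,
)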
Source code in dspy/teleprompt/gepa/gepa.py
Functions
compile(student: Module, *, trainset: list[Example], teacher: Module | None = None, valset: list[Example] | None = None) -> Module
GEPA uses the trainset to perform reflective updates to the prompt, but uses the valset for tracking Pareto scores. If no valset is provided, GEPA will use the trainset for both.
Parameters:
- student: The student module to optimize.
- trainset: The training set to use for reflective updates.
- valset: The validation set to use for tracking Pareto scores. If not provided, GEPA will use the trainset for both.
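A minimal sketch of the corresponding compile call (student, trainset, and valset are placeholders for your own module and data):

optimized_program = gepa.compile(
    student,               # any dspy.Module
    trainset=trainset,     # used for reflective prompt updates
    valset=valset,         # used for Pareto score tracking; defaults to trainset if omitted
)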
Source code in dspy/teleprompt/gepa/gepa.py
One of the key insights behind GEPA is its ability to leverage domain-specific textual feedback. Users should provide a feedback function as the GEPA metric, which has the following call signature:
dspy.teleprompt.gepa.gepa.GEPAFeedbackMetric
Bases: Protocol
Functions
__call__(gold: Example, pred: Prediction, trace: Optional[DSPyTrace], pred_name: str | None, pred_trace: Optional[DSPyTrace]) -> Union[float, ScoreWithFeedback]
This function is called with the following arguments:
- gold: The gold example.
- pred: The predicted output.
- trace: Optional. The trace of the program's execution.
- pred_name: Optional. The name of the target predictor currently being optimized by GEPA, for which the feedback is being requested.
- pred_trace: Optional. The trace of the target predictor's execution GEPA is seeking feedback for.

Note the pred_name and pred_trace arguments. During optimization, GEPA will call the metric to obtain feedback for individual predictors being optimized. GEPA provides the name of the predictor in pred_name and the sub-trace (of the trace) corresponding to the predictor in pred_trace.

If available at the predictor level, the metric should return dspy.Prediction(score: float, feedback: str) corresponding to the predictor. If not available at the predictor level, the metric can also return text feedback at the program level (using just the gold, pred and trace).

If no feedback is returned, GEPA will use a simple text feedback consisting of just the score: f"This trajectory got a score of {score}."
Source code in dspy/teleprompt/gepa/gepa.py
When track_stats=True, GEPA returns detailed results about all of the proposed candidates, as well as metadata about the optimization run. The results are available in the detailed_results attribute of the optimized program returned by GEPA, which has the following type:
dspy.teleprompt.gepa.gepa.DspyGEPAResult(candidates: list[Module], parents: list[list[int | None]], val_aggregate_scores: list[float], val_subscores: list[list[float]], per_val_instance_best_candidates: list[set[int]], discovery_eval_counts: list[int], best_outputs_valset: list[list[tuple[int, list[Prediction]]]] | None = None, total_metric_calls: int | None = None, num_full_val_evals: int | None = None, log_dir: str | None = None, seed: int | None = None)
dataclass
Additional data related to the GEPA run.
Fields:
- candidates: list of proposed candidates (component_name -> component_text)
- parents: lineage info; for each candidate i, parents[i] is a list of parent indices or None
- val_aggregate_scores: per-candidate aggregate score on the validation set (higher is better)
- val_subscores: per-candidate per-instance scores on the validation set (len == num_val_instances)
- per_val_instance_best_candidates: for each val instance t, a set of candidate indices achieving the best score on t
- discovery_eval_counts: budget (number of metric calls / rollouts) consumed up to the discovery of each candidate
- total_metric_calls: total number of metric calls made across the run
- num_full_val_evals: number of full validation evaluations performed
- log_dir: where artifacts were written (if any)
- seed: RNG seed for reproducibility (if known)
- best_idx: candidate index with the highest val_aggregate_scores
- best_candidate: the program text mapping for best_idx
Attributes
- candidates: list[Module]
- parents: list[list[int | None]]
- val_aggregate_scores: list[float]
- val_subscores: list[list[float]]
- per_val_instance_best_candidates: list[set[int]]
- discovery_eval_counts: list[int]
- best_outputs_valset: list[list[tuple[int, list[Prediction]]]] | None = None
- total_metric_calls: int | None = None
- num_full_val_evals: int | None = None
- log_dir: str | None = None
- seed: int | None = None

Properties
- best_idx: int
- best_candidate: dict[str, str]
- highest_score_achieved_per_val_task: list[float]
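For example, once compile has returned with track_stats=True, these fields can be read directly from the optimized program (optimized_program is the value returned by compile):

results = optimized_program.detailed_results

best = results.best_idx                          # index of the highest-scoring candidate
best_prompts = results.best_candidate            # component_name -> component_text for that candidate
best_score = results.val_aggregate_scores[best]  # its aggregate validation score

per_task_best = results.highest_score_achieved_per_val_task  # per-task Pareto frontier
print(f"Best aggregate score: {best_score}; total metric calls: {results.total_metric_calls}")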
Functions
to_dict() -> dict[str, Any]
Source code in dspy/teleprompt/gepa/gepa.py
from_gepa_result(gepa_result: GEPAResult, adapter: DspyAdapter) -> DspyGEPAResult (staticmethod)
Source code in dspy/teleprompt/gepa/gepa.py
Usage Examples
See GEPA usage tutorials in GEPA Tutorials.
Inference-Time Search
GEPA can act as a test-time/inference search mechanism. By setting your valset to your evaluation batch and using track_best_outputs=True, GEPA produces, for each batch element, the highest-scoring outputs found during the evolutionary search.
gepa = dspy.GEPA(metric=metric, track_stats=True, ...)
new_prog = gepa.compile(student, trainset=my_tasks, valset=my_tasks)
highest_score_achieved_per_task = new_prog.detailed_results.highest_score_achieved_per_val_task
best_outputs = new_prog.detailed_results.best_outputs_valset
How Does GEPA Work?
1. Reflective Prompt Mutation
GEPA uses LLMs to reflect on structured execution traces (inputs, outputs, failures, feedback), targeting a chosen module and proposing a new instruction/program text tailored to real observed failures and rich textual/environmental feedback.
2. Rich Textual Feedback as Optimization Signal
GEPA can leverage any textual feedback available—not just scalar rewards. This includes evaluation logs, code traces, failed parses, constraint violations, error message strings, or even isolated submodule-specific feedback. This allows actionable, domain-aware optimization.
3. Pareto-based Candidate Selection
Rather than evolving just the best global candidate (which leads to local optima or stagnation), GEPA maintains a Pareto frontier: the set of candidates which achieve the highest score on at least one evaluation instance. In each iteration, the next candidate to mutate is sampled (with probability proportional to coverage) from this frontier, guaranteeing both exploration and robust retention of complementary strategies.
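To make the selection rule concrete, here is a small self-contained sketch (a simplification, not the gepa package's actual implementation) that computes the Pareto frontier from per-instance scores and samples the next candidate with probability proportional to how many instances it wins:

import random

# scores[c][t] = score of candidate c on validation instance t (toy data).
scores = [
    [0.9, 0.2, 0.5],
    [0.4, 0.8, 0.5],
    [0.9, 0.8, 0.1],
]

num_tasks = len(scores[0])
# For each instance, the set of candidates achieving the best score on it.
best_per_task = [
    {c for c in range(len(scores)) if scores[c][t] == max(s[t] for s in scores)}
    for t in range(num_tasks)
]

# Pareto frontier: candidates that are best on at least one instance.
frontier = set().union(*best_per_task)

# Sample the next candidate to mutate, weighted by how many instances it wins.
coverage = {c: sum(c in winners for winners in best_per_task) for c in frontier}
candidates, weights = zip(*coverage.items())
next_candidate = random.choices(candidates, weights=weights, k=1)[0]
print(frontier, coverage, next_candidate)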
Algorithm Summary
- Initialize the candidate pool with the unoptimized program.
- Iterate:
- Sample a candidate (from Pareto frontier).
- Sample a minibatch from the train set.
- Collect execution traces + feedbacks for module rollout on minibatch.
- Select a module of the candidate for targeted improvement.
- LLM Reflection: Propose a new instruction/prompt for the targeted module using reflective meta-prompting and the gathered feedback.
- Roll out the new candidate on the minibatch; if improved, evaluate it on the validation set used for Pareto tracking.
- Update the candidate pool/Pareto frontier.
- [Optionally] System-aware merge/crossover: Combine best-performing modules from distinct lineages.
- Continue until rollout or metric budget is exhausted.
- Return candidate with best aggregate performance on validation.
Implementing Feedback Metrics
A well-designed metric is central to GEPA's sample efficiency and the richness of its learning signal. GEPA expects the metric to return a dspy.Prediction(score=..., feedback=...). GEPA leverages natural language traces from LLM-based workflows for optimization, preserving intermediate trajectories and errors in plain text rather than reducing them to numerical rewards. This mirrors human diagnostic processes, enabling clearer identification of system behaviors and bottlenecks.
Practical Recipe for GEPA-Friendly Feedback:
- Leverage Existing Artifacts: Use logs, unit tests, evaluation scripts, and profiler outputs; surfacing these often suffices.
- Decompose Outcomes: Break scores into per-objective components (e.g., correctness, latency, cost, safety) and attribute errors to steps.
- Expose Trajectories: Label pipeline stages, reporting pass/fail with salient errors (e.g., in code generation pipelines).
- Ground in Checks: Employ automatic validators (unit tests, schemas, simulators) or LLM-as-a-judge for non-verifiable tasks (as in PUPA).
- Prioritize Clarity: Focus on error coverage and decision points over technical complexity.
Examples
- Document Retrieval (e.g., HotpotQA): List correctly retrieved, incorrect, or missed documents, beyond mere Recall/F1 scores.
- Multi-Objective Tasks (e.g., PUPA): Decompose aggregate scores to reveal contributions from each objective, highlighting tradeoffs (e.g., quality vs. privacy).
- Stacked Pipelines (e.g., code generation: parse → compile → run → profile → evaluate): Expose stage-specific failures; natural-language traces often suffice for LLM self-correction.
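As one concrete illustration of the document-retrieval case above, a feedback metric might compare retrieved titles against the gold supporting documents and spell out what was missed; the field names gold.gold_titles and pred.retrieved_titles are assumptions for this sketch, not fields GEPA requires:

import dspy

def retrieval_feedback_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    gold_titles = set(gold.gold_titles)       # assumed field on the example
    retrieved = set(pred.retrieved_titles)    # assumed field on the prediction

    hits = gold_titles & retrieved
    missed = gold_titles - retrieved
    spurious = retrieved - gold_titles
    score = len(hits) / len(gold_titles) if gold_titles else 0.0

    feedback = (
        f"Recall {score:.2f}. Correctly retrieved: {sorted(hits)}. "
        f"Missed gold documents: {sorted(missed)}. "
        f"Irrelevant documents retrieved: {sorted(spurious)}. "
        "Revise the search queries so they mention the entities needed to find the missed documents."
    )
    return dspy.Prediction(score=score, feedback=feedback)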
Further Reading
- GEPA Paper: arxiv:2507.19457
- GEPA GitHub: This repository provides the core GEPA evolution pipeline used by the dspy.GEPA optimizer.
- DSPy Tutorials