Tutorial: Retrieval-Augmented Generation (RAG)
Let's walk through a quick example of basic question answering with and without retrieval-augmented generation (RAG) in DSPy. Specifically, let's build a system for answering Tech questions, e.g. about Linux or iPhone apps.
Install the latest DSPy via pip install -U dspy and follow along. If you're looking instead for a conceptual overview of DSPy, this recent lecture is a good place to start.
Configuring the DSPy environment.
Let's tell DSPy that we will use OpenAI's gpt-4o-mini in our modules. To authenticate, DSPy will read your OPENAI_API_KEY environment variable. You can easily swap this out for other providers or local models.
import dspy
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)
Exploring some basic DSPy Modules.
You can always prompt the LM directly via lm(prompt="prompt") or lm(messages=[...]). However, DSPy gives you Modules as a better way to define your LM functions.
The simplest module is dspy.Predict. It takes a DSPy Signature, i.e. a structured input/output schema, and gives you back a callable function for the behavior you specified. Let's use the "in-line" notation for signatures to declare a module that takes a question (of type str) as input and produces a response as an output.
qa = dspy.Predict('question: str -> response: str')
response = qa(question="what are high memory and low memory on linux?")
print(response.response)
In Linux, "high memory" and "low memory" refer to different regions of the system's memory address space, particularly in the context of 32-bit architectures. - **Low Memory**: This typically refers to the first 896 MB of memory in a 32-bit system. The kernel can directly access this memory without any special handling. It is used for kernel data structures and for user processes. The low memory region is crucial for the kernel's operation, as it allows for efficient memory management and access. - **High Memory**: This refers to memory above the 896 MB threshold in a 32-bit system. The kernel cannot directly access this memory; instead, it must use special mechanisms to map it into the kernel's address space when needed. High memory is often used for user processes and can be allocated dynamically, but it requires additional overhead for the kernel to manage. In 64-bit systems, the distinction between high and low memory is less relevant, as the addressable memory space is significantly larger, and the kernel can access most of the memory directly.
Notice how the variable names we specified in the signature defined our input and output argument names and their role.
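To make the in-line notation concrete, here is a minimal sketch of how such a string could be split into named input and output fields. The helpers `split_top_level` and `parse_signature` are hypothetical and written only for illustration; this is not DSPy's parser, and it assumes that a field without a type annotation defaults to `str`.

```python
def split_top_level(text: str) -> list[str]:
    """Split on commas that are not nested inside square brackets."""
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    parts.append(current)
    return [p.strip() for p in parts if p.strip()]


def parse_signature(signature: str) -> tuple[dict[str, str], dict[str, str]]:
    """Turn 'a, b: int -> c' into ({input: type}, {output: type})."""
    if "->" not in signature:
        raise ValueError("expected a signature of the form 'inputs -> outputs'")
    left, right = signature.split("->", 1)

    def parse_fields(part: str) -> dict[str, str]:
        fields = {}
        for field in split_top_level(part):
            name, _, type_name = field.partition(":")
            # Assumption of this sketch: untyped fields are strings.
            fields[name.strip()] = type_name.strip() or "str"
        return fields

    return parse_fields(left), parse_fields(right)


inputs, outputs = parse_signature("question: str -> response: str")
print(inputs, outputs)
```

The names on the left of the arrow become keyword arguments of the module call, and the names on the right become attributes of the returned prediction.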
Now, what did DSPy do to build this qa module? Nothing fancy in this example, yet. The module passed your signature, LM, and inputs to an Adapter, which is a layer that handles structuring the inputs and parsing structured outputs to fit your signature.
Let's see it directly. You can easily inspect the last n prompts sent by DSPy.
dspy.inspect_history(n=1)
[2024-11-10T12:39:19.458514] System message: Your input fields are: 1. `question` (str) Your output fields are: 1. `response` (str) All interactions will be structured in the following way, with the appropriate values filled in. [[ ## question ## ]] {question} [[ ## response ## ]] {response} [[ ## completed ## ]] In adhering to this structure, your objective is: Given the fields `question`, produce the fields `response`. User message: [[ ## question ## ]] what are high memory and low memory on linux? Respond with the corresponding output fields, starting with the field `[[ ## response ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`. Response: [[ ## response ## ]] In Linux, "high memory" and "low memory" refer to different regions of the system's memory address space, particularly in the context of 32-bit architectures. - **Low Memory**: This typically refers to the first 896 MB of memory in a 32-bit system. The kernel can directly access this memory without any special handling. It is used for kernel data structures and for user processes. The low memory region is crucial for the kernel's operation, as it allows for efficient memory management and access. - **High Memory**: This refers to memory above the 896 MB threshold in a 32-bit system. The kernel cannot directly access this memory; instead, it must use special mechanisms to map it into the kernel's address space when needed. High memory is often used for user processes and can be allocated dynamically, but it requires additional overhead for the kernel to manage. In 64-bit systems, the distinction between high and low memory is less relevant, as the addressable memory space is significantly larger, and the kernel can access most of the memory directly. [[ ## completed ## ]]
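The trace above shows the adapter delimiting each field with a `[[ ## name ## ]]` header. As a rough illustration of the "parsing structured outputs" half of that job, the sketch below recovers a dict of fields from a completion in this layout. `FIELD_HEADER` and `parse_completion` are hypothetical names, and the real adapter does more than this (for example, type conversion and error handling).

```python
import re

# Matches headers such as "[[ ## response ## ]]" and captures the field name.
FIELD_HEADER = re.compile(r"\[\[ ## (\w+) ## \]\]")


def parse_completion(completion: str) -> dict[str, str]:
    """Map each field name to the text that follows its header."""
    # With one capturing group, re.split returns
    # [text_before_first_header, name1, body1, name2, body2, ...].
    pieces = FIELD_HEADER.split(completion)
    fields = {}
    for name, body in zip(pieces[1::2], pieces[2::2]):
        if name != "completed":  # the end marker carries no content
            fields[name] = body.strip()
    return fields


raw = (
    "[[ ## reasoning ## ]]\nBrace style is a convention.\n\n"
    "[[ ## response ## ]]\nIt depends on your style guide.\n\n"
    "[[ ## completed ## ]]"
)
print(parse_completion(raw))
```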
DSPy has various built-in modules, e.g. dspy.ChainOfThought, dspy.ProgramOfThought, and dspy.ReAct. These are interchangeable with basic dspy.Predict: they take your signature, which is specific to your task, and they apply general-purpose prompting techniques and inference-time strategies to it.
For example, dspy.ChainOfThought is an easy way to elicit reasoning out of your LM before it commits to the outputs requested in your signature.
In the example below, we'll omit str types (as the default type is string). You should feel free to experiment with other fields and types, e.g. try topics: list[str] or is_realistic: bool.
cot = dspy.ChainOfThought('question -> response')
cot(question="should curly braces appear on their own line?")
Prediction( reasoning="The placement of curly braces on their own line is largely a matter of coding style and conventions. In some programming languages and style guides, such as those used in C, C++, and Java, it is common to place opening curly braces on the same line as the control statement (like `if`, `for`, etc.) and closing braces on a new line. However, other styles, such as the Allman style, advocate for placing both opening and closing braces on their own lines. Ultimately, the decision should be based on the team's coding standards or personal preference, as long as it maintains readability and consistency.", response="Curly braces can either appear on their own line or not, depending on the coding style you choose to follow. It's important to adhere to a consistent style throughout your codebase." )
Interestingly, asking for reasoning made the output response shorter in this case. Is this a good thing or a bad thing? It depends on what you need: there's no free lunch, but DSPy gives you the tools to experiment with different strategies extremely quickly.
By the way, dspy.ChainOfThought is implemented in DSPy, using dspy.Predict. This is a good place to call dspy.inspect_history if you're curious.
Using DSPy well involves evaluation and iterative development.
You already know a lot about DSPy at this point. If all you want is quick scripting, this much of DSPy already enables a lot. Sprinkling DSPy signatures and modules into your Python control flow is a pretty ergonomic way to just get stuff done with LMs.
That said, you're likely here because you want to build a high-quality system and improve it over time. The way to do that in DSPy is to iterate fast by evaluating the quality of your system and using DSPy's powerful tools, e.g. Optimizers. You can learn about the appropriate development cycle in DSPy here.
Manipulating Examples in DSPy.
To measure the quality of your DSPy system, you need (1) a bunch of input values, like questions for example, and (2) a metric that can score the quality of an output from your system. Metrics vary widely. Some metrics need ground-truth labels of ideal outputs, e.g. for classification or question answering. Other metrics are self-supervised, e.g. checking faithfulness or lack of hallucination, perhaps using a DSPy program as a judge of these qualities.
Let's load a dataset of questions and their (pretty long) gold answers. Since we started this notebook with the goal of building a system for answering Tech questions, we obtained a bunch of StackExchange-based questions and their correct answers from the RAG-QA Arena dataset. (Learn more about the development cycle if you don't have data for your task.)
import os
import ujson
import requests

def download(url):
    filename = os.path.basename(url)
    remote_size = int(requests.head(url, allow_redirects=True).headers.get('Content-Length', 0))
    local_size = os.path.getsize(filename) if os.path.exists(filename) else 0

    # Skip the download if a local copy of the same size already exists.
    if local_size != remote_size:
        print(f"Downloading '{filename}'...")
        with requests.get(url, stream=True) as r, open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192): f.write(chunk)

# Download 500 question--answer pairs from the RAG-QA Arena "Tech" dataset.
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_500.json")

with open('ragqa_arena_tech_500.json') as f:
    data = ujson.load(f)
# Inspect one datapoint.
data[0]
{'question': 'how to transfer whatsapp voice message to computer?', 'response': 'To transfer voice notes from WhatsApp on your device to your computer, you have the option to select the "Share" feature within the app and send the files via Email, Gmail, Bluetooth, or other available services. \nYou can also move the files onto your phone\'s SD card, connect your phone to your computer via a USB cable, then find and transfer the files via File Explorer on your PC. \nAlternatively, you can choose to attach all the desired voice notes to an email and, from your phone, send them to your own email address. \nUpon receiving the email on your computer, you can then download the voice note attachments.'}
Given a simple dict like this, let's create a list of dspy.Example objects, which is the datatype that carries training (or test) datapoints in DSPy.
When you build a dspy.Example, you should generally specify .with_inputs("field1", "field2", ...) to indicate which fields are inputs. The other fields are treated as labels or metadata.
data = [dspy.Example(**d).with_inputs('question') for d in data]
# Let's pick an `example` here from the data.
example = data[2]
example
Example({'question': 'what are high memory and low memory on linux?', 'response': '"High Memory" refers to the application or user space, the memory that user programs can use and which isn\'t permanently mapped in the kernel\'s space, while "Low Memory" is the kernel\'s space, which the kernel can address directly and is permanently mapped. \nThe user cannot access the Low Memory as it is set aside for the required kernel programs.'}) (input_keys={'question'})
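To see why marking inputs matters, consider a tiny stand-in for the example datatype. `ToyExample` below is a hypothetical class written for this illustration and is not how `dspy.Example` is implemented; it only mimics the idea that fields named in `with_inputs` are passed to the program while the rest are held back as labels for the metric.

```python
class ToyExample:
    """A minimal record that separates input fields from label fields."""

    def __init__(self, **fields):
        self._fields = dict(fields)
        self._input_keys = set()

    def with_inputs(self, *keys):
        missing = [k for k in keys if k not in self._fields]
        if missing:
            raise KeyError(f"unknown input fields: {missing}")
        copy = ToyExample(**self._fields)
        copy._input_keys = set(keys)
        return copy

    def inputs(self):
        """Fields that are fed to the program."""
        return {k: v for k, v in self._fields.items() if k in self._input_keys}

    def labels(self):
        """Fields kept aside for scoring (labels or metadata)."""
        return {k: v for k, v in self._fields.items() if k not in self._input_keys}


record = {"question": "what is swap space?", "response": "Disk-backed memory."}
toy = ToyExample(**record).with_inputs("question")
print(toy.inputs())
print(toy.labels())
```

This separation is what lets a call like `cot(**example.inputs())` receive only the question, never the gold response.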
Now, let's divide the data into:
Training and Validation sets:
- These are the splits you typically give to DSPy optimizers.
- Optimizers typically learn directly from the training examples and check their progress using the validation examples.
- It's good to have 30--300 examples for training and validation each.
- For prompt optimizers in particular, it's often better to pass more validation than training.
Development and Test sets: The rest, typically on the order of 30--1000, can be used for:
- development (i.e., you can inspect them as you iterate on your system) and
- testing (final held-out evaluation).
trainset, valset, devset, testset = data[:50], data[50:150], data[150:300], data[300:500]
len(trainset), len(valset), len(devset), len(testset)
(50, 100, 150, 200)
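The slicing above takes the splits in file order. If your own data is sorted in some way (by topic or date, say), you may prefer to shuffle with a fixed seed before slicing. The sketch below shows one way to do that; `split_dataset` is a hypothetical helper, and shuffling is an addition that the tutorial itself does not perform.

```python
import random


def split_dataset(items, sizes, seed=0):
    """Shuffle deterministically, then cut into consecutive, disjoint splits."""
    if sum(sizes) > len(items):
        raise ValueError("requested split sizes exceed the dataset size")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps splits reproducible
    splits, start = [], 0
    for size in sizes:
        splits.append(shuffled[start:start + size])
        start += size
    return splits


toy_data = list(range(500))  # stand-in for the 500 loaded examples
train, val, dev, test = split_dataset(toy_data, [50, 100, 150, 200])
print(len(train), len(val), len(dev), len(test))
```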
Evaluation in DSPy.
What kind of metric can suit our question-answering task? There are many choices, but since the answers are long, we may ask: How well does the system response cover all key facts in the gold response? And the other way around, how well is the system response not saying things that aren't in the gold response?
That metric is essentially a semantic F1, so let's load a SemanticF1 metric from DSPy. This metric is actually implemented as a very simple DSPy module using whatever LM we're working with.
from dspy.evaluate import SemanticF1
# Instantiate the metric.
metric = SemanticF1()
# Produce a prediction from our `cot` module, using the `example` above as input.
pred = cot(**example.inputs())
# Compute the metric score for the prediction.
score = metric(example, pred)
print(f"Question: \t {example.question}\n")
print(f"Gold Response: \t {example.response}\n")
print(f"Predicted Response: \t {pred.response}\n")
print(f"Semantic F1 Score: {score:.2f}")
Question: what are high memory and low memory on linux? Gold Response: "High Memory" refers to the application or user space, the memory that user programs can use and which isn't permanently mapped in the kernel's space, while "Low Memory" is the kernel's space, which the kernel can address directly and is permanently mapped. The user cannot access the Low Memory as it is set aside for the required kernel programs. Predicted Response: In Linux, "low memory" refers to the first 896 MB of RAM, which is directly accessible by the kernel and used for kernel operations and user processes. "High memory" refers to memory above this limit, which is not directly accessible by the kernel in 32-bit systems and is used for user processes, requiring special handling to access. This distinction is crucial for effective memory management in Linux. Semantic F1 Score: 0.87
The final DSPy module call above actually happens inside metric. You might be curious how it measured the semantic F1 for this example.
dspy.inspect_history(n=1)
[2024-11-10T12:39:19.701005] System message: Your input fields are: 1. `question` (str) 2. `ground_truth` (str) 3. `system_response` (str) Your output fields are: 1. `reasoning` (str) 2. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response 3. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth All interactions will be structured in the following way, with the appropriate values filled in. [[ ## question ## ]] {question} [[ ## ground_truth ## ]] {ground_truth} [[ ## system_response ## ]] {system_response} [[ ## reasoning ## ]] {reasoning} [[ ## recall ## ]] {recall} # note: the value you produce must be a single float value [[ ## precision ## ]] {precision} # note: the value you produce must be a single float value [[ ## completed ## ]] In adhering to this structure, your objective is: Compare a system's response to the ground truth to compute its recall and precision. If asked to reason, enumerate key ideas in each response, and whether they are present in the other response. User message: [[ ## question ## ]] what are high memory and low memory on linux? [[ ## ground_truth ## ]] "High Memory" refers to the application or user space, the memory that user programs can use and which isn't permanently mapped in the kernel's space, while "Low Memory" is the kernel's space, which the kernel can address directly and is permanently mapped. The user cannot access the Low Memory as it is set aside for the required kernel programs. [[ ## system_response ## ]] In Linux, "low memory" refers to the first 896 MB of RAM, which is directly accessible by the kernel and used for kernel operations and user processes. "High memory" refers to memory above this limit, which is not directly accessible by the kernel in 32-bit systems and is used for user processes, requiring special handling to access. This distinction is crucial for effective memory management in Linux. 
Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## recall ## ]]` (must be formatted as a valid Python float), then `[[ ## precision ## ]]` (must be formatted as a valid Python float), and then ending with the marker for `[[ ## completed ## ]]`. Response: [[ ## reasoning ## ]] The system response correctly identifies "Low Memory" as the portion of RAM directly accessible by the kernel and used for kernel operations, which aligns with the ground truth. It also mentions that "High Memory" refers to memory above a certain limit that is not directly accessible by the kernel, which is consistent with the ground truth's description of user space. However, the ground truth emphasizes that Low Memory is set aside for kernel programs and that users cannot access it, which is not explicitly stated in the system response. Overall, the key ideas are present, but the system response lacks the explicit mention of user access limitations for Low Memory. [[ ## recall ## ]] 0.85 [[ ## precision ## ]] 0.90 [[ ## completed ## ]]
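The judge returned a recall of 0.85 and a precision of 0.90. Assuming the metric combines them with the standard F1 formula (the harmonic mean), which is consistent with the 0.87 reported earlier, the arithmetic looks like this. The function `f1_score` here is written for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


score = f1_score(precision=0.90, recall=0.85)
print(f"{score:.2f}")  # prints 0.87
```

Because the harmonic mean is pulled toward the smaller of the two values, a response cannot score well by being exhaustive but imprecise, or precise but incomplete.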
For evaluation, you could use the metric above in a simple loop and just average the score. But for nice parallelism and utilities, we can rely on dspy.Evaluate.
# Define an evaluator that we can re-use.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=24,
                         display_progress=True, display_table=2)
# Evaluate the Chain-of-Thought program.
evaluate(cot)
Average Metric: 55.380830691218016 / 150 (36.9): 100%|██████████| 150/150 [00:00<00:00, 513.51it/s] 2024/11/10 12:39:20 INFO dspy.evaluate.evaluate: Average Metric: 55.380830691218016 / 150 (36.9%)
question | example_response | reasoning | pred_response | SemanticF1 | |
---|---|---|---|---|---|
0 | why is mercurial considered to be easier than git? | Mercurial's syntax is considered more familiar, especially for tho... | Mercurial is often considered easier than Git for several reasons.... | Mercurial is considered easier than Git primarily due to its simpl... | ✔️ [0.545] |
1 | open finder window from current terminal location? | If you type 'open .' in Terminal, it will open the current directo... | To open a Finder window from the current terminal location on a Ma... | You can open a Finder window from your current terminal location b... | ✔️ [0.667] |
36.92
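For intuition, the "simple loop" alternative mentioned earlier could look like the sketch below. It is not how `dspy.Evaluate` is implemented; `evaluate_simple`, `toy_program`, `toy_metric`, and `toy_devset` are hypothetical stand-ins so that the snippet runs without an LM. The final percentage is the sum of per-example scores divided by the number of examples, which is how 55.38 / 150 becomes 36.9% in the log above.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_simple(program, examples, metric, num_threads=4):
    """Score every example in parallel and return the average as a percentage."""
    if not examples:
        raise ValueError("cannot evaluate on an empty set of examples")

    def score_one(example):
        prediction = program(example["question"])
        return metric(example, prediction)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        scores = list(pool.map(score_one, examples))
    return 100.0 * sum(scores) / len(scores)


# Toy stand-ins: the "program" upper-cases its input, the metric is exact match.
def toy_program(question):
    return question.upper()


def toy_metric(example, prediction):
    return float(prediction == example["answer"])


toy_devset = [
    {"question": "a", "answer": "A"},
    {"question": "b", "answer": "x"},
]
print(evaluate_simple(toy_program, toy_devset, toy_metric))  # prints 50.0
```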
So far, we built a very simple chain-of-thought module for question answering and evaluated it on a small dataset.
Can we do better? In the rest of this guide, we will build a retrieval-augmented generation (RAG) program in DSPy for the same task. We'll see how this can boost the score substantially, then we'll use one of the DSPy Optimizers to compile our RAG program to higher-quality prompts, raising our scores even more.
Basic Retrieval-Augmented Generation (RAG).
First, let's download the corpus data that we will use for RAG search. The next cell will download about 4 GB, so it may take a few minutes. A future version of this notebook will come with a cache that allows you to skip downloads and the PyTorch installation.
download('https://huggingface.co/datasets/colbertv2/lotte_passages/resolve/main/technology/test_collection.jsonl')
download('https://huggingface.co/dspy/cache/resolve/main/index.pt')
Set up your system's retriever.
As far as DSPy is concerned, you can plug in any Python code for calling tools or retrievers. Hence, for our RAG system, we can plug in any tool for the search step. Here, we'll just use OpenAI Embeddings and PyTorch for top-K search, but this is not a special choice, just a convenient one.
import torch
import functools
from litellm import embedding as Embed

with open("test_collection.jsonl") as f:
    corpus = [ujson.loads(line) for line in f]

index = torch.load('index.pt', weights_only=True)
max_characters = 4000 # >98th percentile of document lengths

@functools.lru_cache(maxsize=None)
def search(query, k=5):
    # Embed the query, score it against every document, and keep the top k.
    query_embedding = torch.tensor(Embed(input=query, model="text-embedding-3-small").data[0]['embedding'])
    topk_scores, topk_indices = torch.matmul(index, query_embedding).topk(k)
    topK = [dict(score=score.item(), **corpus[idx]) for idx, score in zip(topk_indices, topk_scores)]
    return [doc['text'][:max_characters] for doc in topK]
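The core of `search` is a matrix-vector product followed by a top-k selection. The dependency-free sketch below spells out the same idea on a toy index, so you can see what the `matmul` and `topk` calls compute. The names `dot`, `top_k`, and `toy_index` are hypothetical; if the embeddings are unit-length, the dot product here equals cosine similarity.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(index, query_embedding, k=5):
    """Return (row_number, score) pairs for the k highest-scoring rows."""
    scores = [dot(row, query_embedding) for row in index]
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(i, scores[i]) for i in best]


# Three toy "document embeddings" in two dimensions.
toy_index = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k(toy_index, [1.0, 0.0], k=2))  # prints [(0, 1.0), (2, 0.7)]
```

A brute-force scan like this is linear in the corpus size, which is fine for a tutorial-sized collection; larger corpora usually call for an approximate nearest-neighbor index.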
Build your first RAG Module.
In the previous guide, we looked at individual DSPy modules in isolation, e.g. dspy.Predict("question -> answer").
What if we want to build a DSPy program that has multiple steps? The syntax below with dspy.Module allows you to connect a few pieces together, in this case, our retriever and a generation module, so the whole system can be optimized.
Concretely, in the __init__ method, you declare any sub-module you'll need, which in this case is just a dspy.ChainOfThought('context, question -> response') module that takes retrieved context and a question and produces a response. In the forward method, you simply express any Python control flow you like, possibly using your modules. In this case, we first invoke the search function defined earlier and then invoke the self.respond ChainOfThought module.
class RAG(dspy.Module):
    def __init__(self, num_docs=5):
        self.num_docs = num_docs
        self.respond = dspy.ChainOfThought('context, question -> response')

    def forward(self, question):
        context = search(question, k=self.num_docs)
        return self.respond(context=context, question=question)
Let's use the RAG module.
rag = RAG()
rag(question="what are high memory and low memory on linux?")
Prediction( reasoning="High memory and low memory in Linux refer to the way the operating system organizes and manages memory for user-space applications and the kernel. Low memory is the portion of memory that is directly accessible by the kernel, while high memory is the part that is not directly mapped by the kernel's page tables. In a typical 32-bit architecture, low memory usually consists of the lower 3 GB of virtual memory, which is accessible to user-space applications, while high memory comprises the upper 1 GB, which is reserved for the kernel. The distinction is important for memory management, especially in systems with large amounts of RAM, as it affects how the kernel accesses and manages memory resources.", response="In Linux, high memory refers to the portion of memory that is not directly mapped by the kernel's page tables, meaning the kernel cannot access it directly without mapping it into its address space first. Low memory, on the other hand, is the segment of memory that the kernel can access directly. In a typical 32-bit system, low memory consists of the lower 3 GB of virtual memory, while high memory comprises the upper 1 GB. This organization helps manage memory more efficiently, especially in systems with large physical memory." )
dspy.inspect_history()
[2024-11-10T12:39:22.802994] System message: Your input fields are: 1. `context` (str) 2. `question` (str) Your output fields are: 1. `reasoning` (str) 2. `response` (str) All interactions will be structured in the following way, with the appropriate values filled in. [[ ## context ## ]] {context} [[ ## question ## ]] {question} [[ ## reasoning ## ]] {reasoning} [[ ## response ## ]] {response} [[ ## completed ## ]] In adhering to this structure, your objective is: Given the fields `context`, `question`, produce the fields `response`. User message: [[ ## context ## ]] [1] «As far as I remember, High Memory is used for application space and Low Memory for the kernel. Advantage is that (user-space) applications cant access kernel-space memory.» [2] «For the people looking for an explanation in the context of Linux kernel memory space, beware that there are two conflicting definitions of the high/low memory split (unfortunately there is no standard, one has to interpret that in context): High memory defined as the totality of kernel space in VIRTUAL memory. This is a region that only the kernel can access and comprises all virtual addresses greater or equal than PAGE_OFFSET. Low memory refers therefore to the region of the remaining addresses, which correspond to the user-space memory accessible from each user process. For example: on 32-bit x86 with a default PAGE_OFFSET, this means that high memory is any address ADDR with ADDR ≥ 0xC0000000 = PAGE_OFFSET (i.e. higher 1 GB). This is the reason why in Linux 32-bit processes are typically limited to 3 GB. Note that PAGE_OFFSET cannot be configured directly, it depends on the configurable VMSPLIT_x options (source). To summarize: in 32-bit archs, virtual memory is by default split into lower 3 GB (user space) and higher 1 GB (kernel space). For 64 bit, PAGE_OFFSET is not configurable and depends on architectural details that are sometimes detected at runtime during kernel load. 
On x86_64, PAGE_OFFSET is 0xffff888000000000 for 4-level paging (typical) and 0xff11000000000000 for 5-level paging (source). For ARM64 this is usually 0x8000000000000000. Note though, if KASLR is enabled, this value is intentionally unpredictable. High memory defined as the portion of PHYSICAL memory that cannot be mapped contiguously with the rest of the kernel virtual memory. A portion of the kernel virtual address space can be mapped as a single contiguous chunk into the so-called physical low memory. To fully understand what this means, a deeper knowledge of the Linux virtual memory space is required. I would recommend going through these slides. From the slides: This kind of high/low memory split is only applicable to 32-bit architectures where the installed physical RAM size is relatively high (more than ~1 GB). Otherwise, i.e. when the physical address space is small (<1 GB) or when the virtual memory space is large (64 bits), the whole physical space can be accessed from the kernel virtual memory space. In that case, all physical memory is considered low memory. It is preferable that high memory does not exist at all because the whole physical space can be accessed directly from the kernel, which makes memory management a lot simpler and efficient. This is especially important when dealing with DMAs (which typically require physically contiguous memory). See also the answer by @gilles» [3] «Low and High do not refer to whether there is a lot of usage or not. They represent the way it is organized by the system. According to Wikipedia: High Memory is the part of physical memory in a computer which is not directly mapped by the page tables of its operating system kernel. There is no duration for the free command which simply computes a snapshot of the information available. 
Most people, including programmers, do not need to understand it more clearly as it is managed in a much simpler form through system calls and compiler/interpreter operations.» [4] «This is relevant to the Linux kernel; Im not sure how any Unix kernel handles this. The High Memory is the segment of memory that user-space programs can address. It cannot touch Low Memory. Low Memory is the segment of memory that the Linux kernel can address directly. If the kernel must access High Memory, it has to map it into its own address space first. There was a patch introduced recently that lets you control where the segment is. The tradeoff is that you can take addressable memory away from user space so that the kernel can have more memory that it does not have to map before using. Additional resources: http://tldp.org/HOWTO/KernelAnalysis-HOWTO-7.html http://linux-mm.org/HighMemory» [5] «HIGHMEM is a range of kernels memory space, but it is NOT memory you access but its a place where you put what you want to access. A typical 32bit Linux virtual memory map is like: 0x00000000-0xbfffffff: user process (3GB) 0xc0000000-0xffffffff: kernel space (1GB) (CPU-specific vector and whatsoever are ignored here). Linux splits the 1GB kernel space into 2 pieces, LOWMEM and HIGHMEM. The split varies from installation to installation. If an installation chooses, say, 512MB-512MB for LOW and HIGH mems, the 512MB LOWMEM (0xc0000000-0xdfffffff) is statically mapped at the kernel boot time; usually the first so many bytes of the physical memory is used for this so that virtual and physical addresses in this range have a constant offset of, say, 0xc0000000. On the other hand, the latter 512MB (HIGHMEM) has no static mapping (although you could leave pages semi-permanently mapped there, but you must do so explicitly in your driver code). Instead, pages are temporarily mapped and unmapped here so that virtual and physical addresses in this range have no consistent mapping. 
Typical uses of HIGHMEM include single-time data buffers.» [[ ## question ## ]] what are high memory and low memory on linux? Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## response ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`. Response: [[ ## reasoning ## ]] High memory and low memory in Linux refer to the way the operating system organizes and manages memory for user-space applications and the kernel. Low memory is the portion of memory that is directly accessible by the kernel, while high memory is the part that is not directly mapped by the kernel's page tables. In a typical 32-bit architecture, low memory usually consists of the lower 3 GB of virtual memory, which is accessible to user-space applications, while high memory comprises the upper 1 GB, which is reserved for the kernel. The distinction is important for memory management, especially in systems with large amounts of RAM, as it affects how the kernel accesses and manages memory resources. [[ ## response ## ]] In Linux, high memory refers to the portion of memory that is not directly mapped by the kernel's page tables, meaning the kernel cannot access it directly without mapping it into its address space first. Low memory, on the other hand, is the segment of memory that the kernel can access directly. In a typical 32-bit system, low memory consists of the lower 3 GB of virtual memory, while high memory comprises the upper 1 GB. This organization helps manage memory more efficiently, especially in systems with large physical memory. [[ ## completed ## ]]
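In the trace, the list of retrieved passages appears in the prompt as numbered items wrapped in guillemets. A hypothetical helper that reproduces that layout is sketched below. The name `format_passages` and the newline separator are assumptions made for illustration; the flattened trace does not show the exact whitespace DSPy uses.

```python
def format_passages(passages: list[str]) -> str:
    """Render passages as '[1] «...»', '[2] «...»', one per line."""
    return "\n".join(
        f"[{number}] «{text}»" for number, text in enumerate(passages, start=1)
    )


print(format_passages([
    "High Memory is used for application space.",
    "Low Memory is the segment the kernel can address directly.",
]))
```

Numbering the passages gives the LM a stable way to refer to individual sources in its reasoning.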
Earlier with a CoT module, we got nearly 37% in terms of semantic F1 on our devset. Would this RAG module score better?
evaluate(RAG())
Average Metric: 74.61311832900337 / 150 (49.7): 100%|██████████| 150/150 [00:05<00:00, 27.92it/s] 2024/11/10 12:39:28 INFO dspy.evaluate.evaluate: Average Metric: 74.61311832900337 / 150 (49.7%)
question | example_response | reasoning | pred_response | SemanticF1 | |
---|---|---|---|---|---|
0 | why is mercurial considered to be easier than git? | Mercurial's syntax is considered more familiar, especially for tho... | Mercurial is considered easier than Git for several reasons. First... | Mercurial is considered easier than Git because it has a more fami... | ✔️ [0.797] |
1 | open finder window from current terminal location? | If you type 'open .' in Terminal, it will open the current directo... | To open a Finder window from the current terminal location, you ca... | You can open a Finder window from your current terminal location b... | ✔️ [0.667] |
49.74
Using a DSPy Optimizer to improve your RAG prompt.
Off the shelf, our RAG module scores nearly 50%. What are our options to make it stronger? One of the various choices DSPy offers is optimizing the prompts in our pipeline.
If there are many sub-modules in your program, all of them will be optimized together. In this case, there's only one: self.respond = dspy.ChainOfThought('context, question -> response')
Let's set up and use DSPy's MIPRO (v2) optimizer. The run below costs around $1.50 (for the medium auto setting) and may take some 20-30 minutes depending on your number of threads.
tp = dspy.MIPROv2(metric=metric, auto="medium", num_threads=24) # use fewer threads if your rate limit is small
optimized_rag = tp.compile(RAG(), trainset=trainset, valset=valset,
                           max_bootstrapped_demos=2, max_labeled_demos=2,
                           requires_permission_to_run=False)
The prompt optimization process here is pretty systematic; you can learn about it, for example, in this paper. Importantly, it's not a magic button. It can easily overfit your training set and fail to generalize to a held-out set, which makes it essential to validate your programs iteratively.
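One simple way to watch for this kind of overfitting is to compare the score on the data the optimizer saw with the score on held-out data. The sketch below is only an illustration of that check: `generalization_gap` and `looks_overfit` are hypothetical helpers, the scores passed in are made-up numbers rather than results from this tutorial, and the 5-point tolerance is an arbitrary assumption you would tune for your task.

```python
def generalization_gap(seen_score: float, heldout_score: float) -> float:
    """Difference, in percentage points, between seen and held-out scores."""
    return seen_score - heldout_score


def looks_overfit(seen_score: float, heldout_score: float, tolerance: float = 5.0) -> bool:
    """Flag a program whose held-out score trails by more than `tolerance` points."""
    return generalization_gap(seen_score, heldout_score) > tolerance


# Made-up scores, for illustration only.
print(looks_overfit(seen_score=60.0, heldout_score=50.0))  # prints True
print(looks_overfit(seen_score=52.0, heldout_score=50.0))  # prints False
```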
Let's check on an example here, asking the same question to the baseline rag = RAG() program, which was not optimized, and to the optimized_rag = MIPROv2(..)(..) program, after prompt optimization.
baseline = rag(question="cmd+tab does not work on hidden or minimized windows")
print(baseline.response)
You are correct; cmd+Tab does not work on hidden or minimized windows in macOS. It is designed to switch between applications and will only show non-minimized windows of the active application. To access minimized windows, you need to click on them directly or use other shortcuts.
pred = optimized_rag(question="cmd+tab does not work on hidden or minimized windows")
print(pred.response)
In macOS, the Command+Tab shortcut is specifically designed to switch between applications, not individual windows. This means that if an application is minimized or hidden, it will not be activated using Command+Tab. Here are some alternative methods to manage minimized or hidden windows: 1. **Click on the Minimized Window:** - You can directly click on the minimized window in the Dock to restore it. 2. **Use Command+M:** - If you want to minimize the current window, you can use Command+M. To restore it, you will need to click on it in the Dock. 3. **Use Mission Control:** - You can activate Mission Control (F3 or Control+Up Arrow) to see all open windows and select the one you want to bring to the front. 4. **Third-Party Applications:** - Consider using third-party applications like HyperSwitch or Witch, which can provide enhanced window management features, including switching between windows of the same application. 5. **Keyboard Shortcuts for Specific Applications:** - Some applications may have their own shortcuts for managing windows. Check the preferences or documentation for the specific application you are using. By using these methods, you can effectively manage and restore minimized or hidden windows in macOS.
You can use dspy.inspect_history(n=2) to view the RAG prompt before optimization and after optimization.
Concretely, in one run of this notebook, the optimized prompt:
- Constructs the following instruction,
Using the provided `context` and `question`, analyze the information step by step to generate a comprehensive and informative `response`. Ensure that the response clearly explains the concepts involved, highlights key distinctions, and addresses any complexities noted in the context.
- And includes two fully worked out RAG examples with synthetic reasoning and answers, e.g. how to transfer whatsapp voice message to computer?.
Let's now evaluate on the overall devset.
evaluate(optimized_rag)
Average Metric: 89.78303512426604 / 150 (59.9): 100%|██████████| 150/150 [00:00<00:00, 424.18it/s] 2024/11/10 12:39:36 INFO dspy.evaluate.evaluate: Average Metric: 89.78303512426604 / 150 (59.9%)
question | example_response | reasoning | pred_response | SemanticF1 | |
---|---|---|---|---|---|
0 | why is mercurial considered to be easier than git? | Mercurial's syntax is considered more familiar, especially for tho... | Mercurial is often considered easier than Git for several reasons,... | Mercurial is considered easier than Git for several key reasons: 1... | ✔️ [0.874] |
1 | open finder window from current terminal location? | If you type 'open .' in Terminal, it will open the current directo... | To open a Finder window from the current terminal location in macO... | To open a Finder window from your current terminal location in mac... | ✔️ [0.600] |
59.86
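The reported percentage is the summed metric divided by the number of devset examples. As a quick sanity check, the arithmetic can be reproduced from the numbers in the log above and the baseline score reported earlier. The variable names here are illustrative only.

```python
# Numbers copied from the evaluation log and the earlier baseline run.
total_metric = 89.78303512426604   # sum of SemanticF1 over the devset
num_examples = 150                 # devset size shown in the log
baseline_pct = 49.74               # score of the unoptimized RAG program

optimized_pct = round(100 * total_metric / num_examples, 2)
gain = round(optimized_pct - baseline_pct, 2)

print(f"optimized: {optimized_pct}%  gain over baseline: {gain} points")
```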
Keeping an eye on cost.¶
DSPy lets you track the cost of your programs, so you can monitor how much your LM calls are spending. Here's how to total the cost recorded in the LM's call history.
cost = sum([x['cost'] for x in lm.history if x['cost'] is not None]) # in USD, as calculated by LiteLLM for certain providers
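To see why the `is not None` filter matters, here is a self-contained sketch that runs the same kind of sum over a hand-written list shaped like `lm.history`. The helper `total_cost` and the entries are made up for illustration, and real history entries carry many more fields.

```python
def total_cost(history):
    """Sum the 'cost' field over history entries, skipping missing or None costs."""
    return sum(entry["cost"] for entry in history if entry.get("cost") is not None)

# Made-up entries; None stands for a call whose cost was not reported.
fake_history = [
    {"cost": 0.002},
    {"cost": None},
    {"cost": 0.001},
    {},  # an entry with no 'cost' key at all
]

print(f"total cost: ${total_cost(fake_history):.4f}")
```

Without the filter, adding `None` to a number would raise a `TypeError`, so one unreported cost would break the whole sum.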
Saving and loading.¶
The optimized program has a pretty simple structure on the inside. Feel free to explore it.
Here, we'll save optimized_rag so we can load it again later without having to optimize from scratch.
optimized_rag.save("optimized_rag.json")
loaded_rag = RAG()
loaded_rag.load("optimized_rag.json")
loaded_rag(question="cmd+tab does not work on hidden or minimized windows")
Prediction( reasoning='The behavior of the Command+Tab shortcut in macOS is designed to switch between applications rather than individual windows. This means that if an application is minimized or hidden, it will not be brought to the forefront using Command+Tab. Instead, the shortcut will only cycle through applications that are currently open and not minimized. To manage minimized windows, users may need to use different shortcuts or methods to restore them.', response='In macOS, the Command+Tab shortcut is specifically designed to switch between applications, not individual windows. This means that if an application is minimized or hidden, it will not be activated using Command+Tab. Here are some alternative methods to manage minimized or hidden windows:\n\n1. **Click on the Minimized Window:**\n - You can directly click on the minimized window in the Dock to restore it.\n\n2. **Use Command+M:**\n - If you want to minimize the current window, you can use Command+M. To restore it, you will need to click on it in the Dock.\n\n3. **Use Mission Control:**\n - You can activate Mission Control (F3 or Control+Up Arrow) to see all open windows and select the one you want to bring to the front.\n\n4. **Third-Party Applications:**\n - Consider using third-party applications like HyperSwitch or Witch, which can provide enhanced window management features, including switching between windows of the same application.\n\n5. **Keyboard Shortcuts for Specific Applications:**\n - Some applications may have their own shortcuts for managing windows. Check the preferences or documentation for the specific application you are using.\n\nBy using these methods, you can effectively manage and restore minimized or hidden windows in macOS.' )
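If you'd like to explore the saved file itself, a minimal sketch is to load it with the standard `json` module and list its top-level keys. The exact layout of the file depends on your DSPy version, so the hypothetical helper `summarize_saved_program` below assumes only that the file holds a JSON object.

```python
import json
import os

def summarize_saved_program(path):
    """Return {top-level key: type name} for a JSON file holding an object."""
    with open(path) as f:
        state = json.load(f)
    return {key: type(value).__name__ for key, value in state.items()}

# Only runs if the file saved above is present in the working directory.
if os.path.exists("optimized_rag.json"):
    for key, kind in summarize_saved_program("optimized_rag.json").items():
        print(f"{key}: {kind}")
```

From there, you can open individual keys to look at whatever instructions and demonstrations your version stores.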
What's next?¶
Improving from around 37% to approximately 60% on this task, in terms of SemanticF1, was pretty easy.
But DSPy gives you paths to continue iterating on the quality of your system and we have barely scratched the surface.
In general, you have the following tools:
- Explore better system architectures for your program, e.g. what if we ask the LM to generate search queries for the retriever? See this notebook or the STORM pipeline built in DSPy.
- Explore different prompt optimizers or weight optimizers. See the Optimizers Docs.
- Scale inference time compute using DSPy Optimizers, e.g. this notebook.
- Cut cost by distilling to a smaller LM, via prompt or weight optimization, e.g. this notebook or this notebook.
How do you decide which ones to proceed with first?
The first step is to look at your system outputs, which will allow you to identify the sources of lower performance if any. While doing all of this, make sure you continue to refine your metric, e.g. by optimizing against your judgments, and to collect more (or more realistic) data, e.g. from related domains or from putting a demo of your system in front of users.
Learn more about the development cycle in DSPy.