RAG is simpler and more powerful than you think

It is no secret that LLMs occasionally get things wrong. The common term for this is hallucination. Sometimes the mistakes are minor, like getting a small fact such as a date or a name slightly off. Other times…well, with enough persistence, I got an LLM to tell me how many giraffes fit into an accessible restroom stall and to try to prove P == NP. So hallucinations can also be massive.

One of the more common techniques for preventing hallucinations is RAG, or Retrieval Augmented Generation. It sounds complex, and it can be. But the basic idea is extremely simple. It is also extremely powerful, and while there are common architectures for RAG solutions, the underlying technique can be used in many different situations and ways.

The Basics

The basics of retrieval augmented generation are simple. To help the LLM avoid hallucinating, or simply to produce better answers, you retrieve some relevant information and add it to the prompt along with your question or generation task. The most common example I see is customer service, where you retrieve relevant pages from a user manual and ask the LLM to base its response on that information. For example:

Help me restart my cable modem. Here are some pages from the manual for my particular cable modem. Base your answer on the information in this manual, and do not add anything extra.

{Insert text from the relevant pages of the user manual here}
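In code, the augmentation step can be as simple as string formatting. Here is a rough sketch; the manual pages are a placeholder, and the finished prompt goes to whatever LLM client you use.

```python
def build_prompt(question: str, manual_pages: list[str]) -> str:
    """Combine the user's question with retrieved manual pages into one prompt."""
    return (
        f"{question}\n\n"
        "Here are some pages from the manual for my particular cable modem. "
        "Base your answer on the information in this manual, "
        "and do not add anything extra.\n\n"
        + "\n\n".join(manual_pages)
    )

# The pages come from a retrieval step (more on that below); the finished
# prompt is what you send to the LLM.
prompt = build_prompt(
    "Help me restart my cable modem",
    ["<text of the relevant manual pages goes here>"],
)
```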

How do you figure out which pages from the user manual are relevant? Usually, that’s done using a technique called vector search. Vector search deserves its own blog post, but to simplify, we can say we turn each page of the user manual into a list of numbers. Then, when a request comes in like “Help me restart my cable modem,” we convert that sentence into numbers too and find the pages with the most similar numbers. Or, as I said to a colleague once, “It’s all just linear algebra.”
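To see the idea without any vector database at all, here is a toy sketch. The `embed` function below just counts letters so the example runs end to end; a real system would call an embedding model there instead, but the similarity-and-ranking part looks much the same.

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for an embedding model: count letters a-z.
    # A real system would call an embedding model to get this vector.
    text = text.lower()
    return [float(text.count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # "It's all just linear algebra": compare two vectors by the angle between them.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
    norm_b = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (norm_a * norm_b)

# Embed every page of the manual once, ahead of time.
manual_pages = ["To restart the modem, unplug it...", "To change your Wi-Fi password..."]
page_vectors = [embed(page) for page in manual_pages]

# Embed the incoming request and rank pages by similarity.
query_vector = embed("Help me restart my cable modem")
ranked = sorted(
    zip(manual_pages, page_vectors),
    key=lambda pair: cosine_similarity(query_vector, pair[1]),
    reverse=True,
)
top_pages = [page for page, _ in ranked[:3]]
```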

This example is perfectly adequate, and many cool RAG systems have been built using this basic technique with documents and vector search, but that’s not the limit of what RAG can do. I’ve found that thinking about RAG only in terms of documents and vector search can keep folks from seeing other ways retrieval can ground an LLM’s responses.

RAG without documents

Many RAG systems break a corpus of large documents into chunks and then retrieve some of those chunks for grounding. But what if your data isn’t large documents? Can you still use RAG for smaller pieces of text like employee bios, product descriptions, or book summaries?

It turns out you can. It sometimes requires minor changes to your similarity algorithm, but you can vectorize smaller pieces of text and do a similarity search on them the same way you do on larger documents. With that, you can build systems that match new employees with good mentors, help users find products that complement their previous purchases, or surface the perfect beach read for their next vacation.
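The mentor-matching version might look like the sketch below, reusing the toy `embed` and `cosine_similarity` helpers from above; in practice you would swap those for a real embedding model and a vector store. The bios here are made up for illustration.

```python
mentor_bios = {
    "Priya": "Staff engineer, ten years of backend and database work, enjoys coaching.",
    "Marcus": "Designer focused on accessibility and mobile apps.",
}

new_hire_bio = "Junior engineer interested in databases and distributed systems."

# Compare the new hire's bio against every mentor bio and take the best match.
new_hire_vector = embed(new_hire_bio)
best_mentor = max(
    mentor_bios,
    key=lambda name: cosine_similarity(new_hire_vector, embed(mentor_bios[name])),
)
```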

You can also skip documents altogether and do the retrieval step from databases or web services. For example, perhaps you retrieve the week’s weather forecast to help an LLM answer, “Which is the best day this weekend for a picnic?” Or maybe you add recent order information to the prompt when a customer has questions about when their order will arrive. Depending on the exact architecture, some folks may call these non-vector sources “tools.” But if you find relevant information (retrieval) and add it to the prompt (augmentation) to improve the response (generation), it is another form of RAG.
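For the weather example, the retrieval step might just be an API call or a database query. Here is a sketch with `get_weekend_forecast` as a hypothetical stand-in, hard-coded so it runs:

```python
def get_weekend_forecast() -> list[dict]:
    # Hypothetical retrieval step: a real version would call a weather API
    # or query a database. Hard-coded here so the sketch runs.
    return [
        {"day": "Saturday", "summary": "sunny", "high_f": 75, "rain_chance": 0.10},
        {"day": "Sunday", "summary": "thunderstorms", "high_f": 68, "rain_chance": 0.80},
    ]

question = "Which is the best day this weekend for a picnic?"
forecast_lines = [
    f"{d['day']}: {d['summary']}, high of {d['high_f']}F, "
    f"{int(d['rain_chance'] * 100)}% chance of rain"
    for d in get_weekend_forecast()
]

# Retrieval done; now augment the prompt and let the LLM generate.
prompt = (
    f"{question}\n\n"
    "Base your answer only on this forecast:\n" + "\n".join(forecast_lines)
)
```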

More fun with RAG

Another use of RAG I’ve found is multi-shot prompting. Multi-shot prompting is where you add to the prompt a few examples of what you want from the LLM. For example:

Give me examples of things I can cook with chicken thighs. Here are some things I like:

Chicken tacos, Chicken in dill sauce, Chicken sandwiches

And here are some things I don’t like:

Chicken salad, Chicken wings

If you work on an app that logs prompts, responses, and potentially user ratings of those responses, you could do something similar: retrieve past prompts that look like the new one, keep the highly rated ones, and add those prompt-and-response pairs to the new prompt as examples.
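A sketch of that idea is below. It scores similarity with crude word overlap so it stays self-contained; a real system would more likely reuse the same vector search described earlier, and the log entries here are just illustrative.

```python
def word_overlap(a: str, b: str) -> int:
    # Crude similarity: how many words two prompts share.
    return len(set(a.lower().split()) & set(b.lower().split()))

# Assumed shape of your app's log of past interactions and user ratings.
interaction_log = [
    {"prompt": "What can I cook with ground beef?", "response": "Tacos, meatballs, chili...", "rating": 5},
    {"prompt": "What can I cook with tofu?", "response": "Stir fry, soup...", "rating": 2},
]

def pick_examples(new_prompt: str, log: list[dict], k: int = 3) -> list[dict]:
    # Keep only highly rated interactions, then rank them by similarity to the new prompt.
    highly_rated = [entry for entry in log if entry["rating"] >= 4]
    highly_rated.sort(key=lambda e: word_overlap(new_prompt, e["prompt"]), reverse=True)
    return highly_rated[:k]

new_prompt = "Give me examples of things I can cook with chicken thighs."
examples = pick_examples(new_prompt, interaction_log)
example_text = "\n\n".join(
    f"Example prompt: {e['prompt']}\nExample of a good response: {e['response']}"
    for e in examples
)
prompt = f"{example_text}\n\n{new_prompt}"
```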

The other use of RAG I like is when you have an LLM help you with the prompt before you do the retrieval. User prompts can often be incomplete sentences, contain misspellings, use synonyms, or otherwise be problematic for an LLM. You can ask an LLM to improve the user’s prompt before you look for related content in a vector database to increase the chances of finding relevant data. The demo Aaron Wanjala gave at Next 2024 used this technique. Instead of taking a text prompt from the user, it took a photo, asked an LLM to describe the image, and then used vector search to find products with similar descriptions in the product catalog.
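For the plain text version of this, the cleanup step might look something like the sketch below. `call_llm` stands in for whatever client you use to reach a model; everything else is plain string handling.

```python
REWRITE_INSTRUCTIONS = (
    "Rewrite the user's request below as a clear, complete sentence. "
    "Fix spelling, expand abbreviations, and keep the original intent. "
    "Return only the rewritten request."
)

def rewrite_query(user_prompt: str, call_llm) -> str:
    # `call_llm` is assumed to take a prompt string and return the model's text.
    return call_llm(f"{REWRITE_INSTRUCTIONS}\n\nUser request: {user_prompt}")

# Example usage (call_llm would be your own model client):
# cleaned = rewrite_query("restrt cable modm plz", call_llm)
# `cleaned` is then embedded and used for vector search, exactly as before.
```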

RAG is simple and powerful

RAG is everywhere, and it can be intimidating. Many of the tutorials I’ve seen combine concepts with implementation and can be hard to follow. At the end of the day, RAG is a simple technique where you fetch relevant information to help the LLM craft a better response, add that to the prompt you send the LLM, and then let the LLM generate a response. That’s all it is. And that basic three-step process can power a huge variety of applications and agents.