2. The RAG Pattern

This workshop teaches you to build, evaluate, and deploy a retail copilot code-first on Azure AI, using the Retrieval Augmented Generation (RAG) design pattern to ensure that copilot responses are grounded in the (private) data maintained by the enterprise for this application.

Figure: The RAG design pattern in the Contoso Chat application.

Let's learn how this design pattern works in the context of our Contoso Chat application. Walk through the steps below in order to understand the sequence of events shown in the figure above.


Step 1: The user query arrives at our copilot implementation via the endpoint (API)

Our deployed Contoso Chat application is exposed as a hosted API endpoint using Azure Container Apps. The incoming "user query" has three components: the user question (text input), the user's customer ID (text input), and an optional chat history (object array).

The API server extracts these parameters from the incoming request and invokes the Contoso Chat application, starting the workflow that implements this RAG design pattern.
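
To make the request shape concrete, here is a minimal sketch of such an endpoint using FastAPI. The route name, model fields, and the run_rag_flow helper are illustrative assumptions, not the actual Contoso Chat code.

```python
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    question: str                      # user question (text input)
    customer_id: str                   # user's customer ID (text input)
    chat_history: Optional[list] = []  # optional chat history (object array)

def run_rag_flow(question: str, customer_id: str, chat_history: list) -> str:
    """Placeholder for the RAG workflow described in the remaining steps."""
    raise NotImplementedError

@app.post("/api/create_response")  # hypothetical route name
def create_response(request: ChatRequest):
    # Extract the three request parameters and invoke the copilot workflow.
    answer = run_rag_flow(request.question, request.customer_id, request.chat_history)
    return {"answer": answer}
```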

Step 2: The copilot sends the text query to a retrieval service after first vectorizing it

The Contoso Chat application converts the text question into a vectorized query using a Large Language "embedding" model (e.g., Azure OpenAI text-embedding-ada-002). The vectorized query is then sent to the information retrieval service (e.g., Azure AI Search) in the next step.
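
A minimal sketch of this vectorization step, assuming the openai Python SDK and environment variables for the Azure OpenAI endpoint and key (the deployment name may differ in your setup):

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def vectorize(question: str) -> list[float]:
    # The embedding model converts the text question into a dense vector.
    response = client.embeddings.create(
        model="text-embedding-ada-002",  # assumed deployment name
        input=question,
    )
    return response.data[0].embedding
```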

Step 3: The retrieval service uses the vectorized query to return matching results by similarity

The information retrieval service maintains a search index of relevant information (here, our product catalog). It uses the vectorized query from the previous step to find and return matching product results based on vector similarity. The retrieval service can also use features like semantic ranking to order the returned results.
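
Here is a sketch of the retrieval step with the azure-search-documents SDK (v11.4+). The index name (contoso-products) and field names (id, title, content, contentVector) are assumptions for illustration:

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="contoso-products",  # assumed index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

def retrieve_products(query_vector: list[float], top: int = 3) -> list[dict]:
    # Match against the vector field by similarity and return the top results.
    results = search_client.search(
        search_text=None,  # pure vector search; pass text here for hybrid search
        vector_queries=[
            VectorizedQuery(
                vector=query_vector,
                k_nearest_neighbors=top,
                fields="contentVector",  # assumed vector field name
            )
        ],
    )
    return [{"id": r["id"], "title": r["title"], "content": r["content"]} for r in results]
```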

Step 4: The copilot augments the user prompt with retrieved knowledge in the request to the model

The Contoso Chat application combines the user's original question with the "documents" returned by the information retrieval service to create an enhanced model prompt. This is made easier using prompt template technologies (e.g., Prompty) with placeholders for chat history, retrieved documents, and customer profile information that are filled in at this step.
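
The sketch below is a simplified stand-in for what a Prompty template does, not the Prompty API itself: each placeholder is filled with data gathered in the earlier steps. The template text and the shape of the history entries (role/content dicts) are assumptions.

```python
PROMPT_TEMPLATE = """You are a helpful assistant for Contoso Outdoors.
Answer the question using ONLY the product context below.

# Customer
{customer}

# Product context
{documentation}

# Chat history
{history}

Question: {question}
"""

def build_prompt(question: str, customer: dict, documents: list[dict], history: list) -> str:
    # Fill each placeholder with the corresponding data from earlier steps.
    return PROMPT_TEMPLATE.format(
        customer=customer,
        documentation="\n".join(d["content"] for d in documents),
        history="\n".join(f'{t["role"]}: {t["content"]}' for t in history),
        question=question,
    )
```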

Step 5: The chat model uses the prompt to generate a grounded response to the user question

This enhanced prompt is now sent to the Large Language "chat" model (e.g., Azure OpenAI gpt-35-turbo or gpt-4o), which uses the enhanced prompt (retrieved documents, customer profile data, chat history) as grounding context for generating the final response, improving the quality (e.g., relevance, groundedness) of the results returned by Contoso Chat.
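
A sketch of this generation step, again using the openai Python SDK; "gpt-4o" stands in for whatever chat-model deployment name your project uses:

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def generate_answer(prompt: str) -> str:
    # The enhanced prompt carries the grounding context; the chat model
    # generates a response constrained by it.
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed deployment name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # lower temperature helps keep responses grounded
    )
    return response.choices[0].message.content
```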