RAG Optimization: Unlocking Success through Data Enhancement

Published on May 23, 2024

While off-the-shelf Large Language Models (LLMs) can be incredibly powerful, enterprises are finding that customizing LLMs with their proprietary data can unlock greater potential. Retrieval Augmented Generation (RAG) has emerged as one of the leading approaches for this customization. RAG systems combine an LLM's powerful language understanding capabilities with a retrieval component that gathers relevant information from external data sources. This enables the model to "read" and draw upon enterprise data when generating outputs, producing answers that are more accurate, contextually relevant, and up to date.

Many tools exist to help enterprises build RAG architectures; however, a high-performing RAG system requires optimization at each step of the pipeline. This post focuses on the data preparation processes and considerations for building effective RAG architectures at enterprise scale.

AI models are only as good as their data. Implementing RAG requires meticulous preparation of the sources the model will retrieve context from. Cleaning, structuring, and optimizing large knowledge bases for ingestion into a vector database can be challenging because enterprise data is diverse, spanning both structured and unstructured sources.

Ingestion Process

  • Data Sources: The data sources used to build the knowledge base for a RAG architecture are foundational. They need to be comprehensive, high-quality sources that accurately cover the domains and topics the system will be queried on. This typically involves selecting a relevant subset of an enterprise's structured and unstructured data repositories that meets the use case requirements, with input from subject matter experts (SMEs).
  • Data Cleaning: Raw data is often noisy, containing irrelevant content, outdated information, and duplicates. If that noise reaches the knowledge base, the model retrieves irrelevant or inaccurate context, which degrades generation. For example, enterprise knowledge in Jira or Confluence often contains user comments and version histories that are not worth storing in the knowledge base. Effective data cleaning techniques, such as filtering and deduplication, are crucial before feeding data into the vector store (see the first sketch after this list).
  • Privacy/PII: Enterprise datasets often contain sensitive and private information. As part of data preparation, enterprises need to define how this data will be treated based on the use case and the end user. For an internal use case, it may be acceptable for the LLM to surface information about individuals, for example when querying, "Who is the sales rep for the Walmart account?" For external use cases, however, exposing information about individuals could result in privacy violations. Even with guardrails in place, adversarial attacks can cause unexpected leaks from the underlying data. Detecting, filtering, redacting, and, where appropriate, substituting PII with synthetic data protects privacy while maintaining data utility and guarding against compliance issues (a simple redaction sketch follows this list).
  • Text Extraction: Enterprise data comes in various formats, including PDFs, PowerPoint presentations, and images. Extracting clean, usable text from these unstructured and semi-structured sources is crucial for building comprehensive knowledge bases. The right approach varies with a document's structure, modalities, and complexity: simple cases can be handled with standard text extraction tools (as sketched below), while more complex documents may require a combination of automated tools and human annotation.
  • Text Normalization: Data from multiple sources often lacks consistency in spelling, abbreviations, numeric formats, and referencing styles, which can cause the same concept to be treated as distinct entities and matched poorly by the model. Applying normalization rules to standardize spelling, grammar, measurements, and general nomenclature is essential to get the most utility out of your text data (see the normalization sketch after this list).
  • Chunking Strategy: After the steps above, documents need to be split into shorter "chunks" or passages that the retrieval component can match to queries and pass to the language model. The objective is to break documents into retrievable units that preserve complete, relevant context around key information. Common methods include fixed-size chunking, document-based chunking, and semantic chunking (a fixed-size example appears after this list). Human assessment of whether a piece of data belongs in an existing chunk or a new one is still considered the gold standard, and an emerging, more advanced method known as "agent chunking" attempts to mimic this human behavior. The ideal chunk size balances sufficient context against efficiency, and methods like summarization or hierarchical chunking can also be useful for long documents.
  • Entity Recognition & Tagging: While the chunks derived from your knowledge bases form the core of your vector store, enriching them with metadata such as source details, topics, and key entities can significantly improve a RAG model's accuracy. Named Entity Recognition (NER) for people, organizations, products, and concepts, together with entity linking, helps the model connect passages and enhances retrieval relevance. This can be done systematically on a data annotation platform that combines automated techniques with human-in-the-loop validation, including domain experts where required, to ensure annotation accuracy and consistency (a minimal NER sketch follows this list).
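
Below are a few minimal Python sketches of the steps above. First, data cleaning: the function drops very short fragments and exact duplicates. The record format and the length threshold are illustrative assumptions, not a prescribed pipeline.

```python
import hashlib

def clean(records: list[str]) -> list[str]:
    """Drop near-empty fragments and exact duplicates from raw records."""
    seen: set[str] = set()
    cleaned = []
    for text in records:
        text = text.strip()
        # Filter: skip very short fragments that carry little context.
        if len(text) < 40:
            continue
        # Deduplicate: hash normalized text and keep the first occurrence.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            cleaned.append(text)
    return cleaned
```

Real pipelines typically add fuzzy deduplication and source-specific filters, for example stripping Confluence comment threads before ingestion.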
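
For PII handling, a deliberately simple regex-based redaction sketch covering two common patterns; production systems generally rely on dedicated PII detection models with much broader entity coverage.

```python
import re

# Illustrative patterns only: emails and US-style phone numbers.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched PII spans with placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach Jane at jane.doe@example.com or 555-123-4567."))
# Reach Jane at [EMAIL] or [PHONE].
```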
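
For text extraction, a minimal sketch using the open-source pypdf library (one option among many); scanned documents or complex layouts would call for OCR or human review instead.

```python
from pypdf import PdfReader

def extract_pdf_text(path: str) -> str:
    """Concatenate the text layer of each page in a PDF."""
    reader = PdfReader(path)
    # extract_text() yields little or nothing for pages without a text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```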
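
For normalization, a sketch combining Unicode normalization with a small abbreviation map; the map entries are hypothetical, and real rule sets are domain-specific.

```python
import unicodedata

# Hypothetical expansions; a real map comes from SMEs and corpus analysis.
ABBREVIATIONS = {"approx.": "approximately", "w/": "with"}

def normalize(text: str) -> str:
    """Standardize characters, abbreviations, and whitespace."""
    # NFKC folds compatibility variants (e.g., full-width characters).
    text = unicodedata.normalize("NFKC", text)
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return " ".join(text.split())  # collapse inconsistent whitespace
```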
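
For chunking, the simplest of the strategies named above is fixed-size chunking with overlap; sizes here are in characters for clarity, though token-based splitting is more common in practice.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlapping boundaries."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    # Overlap keeps context that straddles a boundary retrievable
    # from either of the neighboring chunks.
    return [text[i:i + size] for i in range(0, len(text), step)]
```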
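
And for entity tagging, a sketch using spaCy's small English model as the recognizer (one common open-source choice), attaching the recognized entities to the chunk as metadata.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def tag_entities(chunk: str) -> dict:
    """Attach named entities found in a chunk as retrieval metadata."""
    doc = nlp(chunk)
    return {
        "text": chunk,
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    }
```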

Query Process

  • Passage Ranking: After the retrieval component surfaces candidate passages matching a query, ranking and filtering them by relevance is critical before passing them to the language model; this avoids generating responses from marginally relevant passages. Ranking can leverage similarity scores, contextual reasoning, metadata attributes, and query-passage alignment (a minimal re-ranking sketch follows this list).
  • Prompt Engineering & Design: The efficacy of a RAG model relies significantly on augmenting the user input with the relevant retrieved data in context (query + context). These prompts must be carefully crafted to effectively leverage the retrieved context while aligning with the desired style and tone of the output response (see the template sketch below).
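
A minimal re-ranking sketch based on cosine similarity between query and passage embeddings; the embeddings themselves, the top-k cutoff, and the relevance threshold are all placeholders for whatever model and tuning your system uses.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(query_vec: np.ndarray,
           passages: list[tuple[str, np.ndarray]],
           k: int = 3,
           min_score: float = 0.5) -> list[str]:
    """Keep the k most similar passages above a relevance threshold."""
    scored = [(cosine(query_vec, vec), text) for text, vec in passages]
    # Filter out marginally relevant passages before ranking.
    relevant = [(s, t) for s, t in scored if s >= min_score]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in relevant[:k]]
```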
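
And a simple query-plus-context template; the wording is illustrative, and real templates are tuned per model and use case.

```python
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say so.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble the augmented prompt from ranked passages."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return PROMPT_TEMPLATE.format(context=context, question=question)
```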

Ongoing Evaluation & Optimization

The data considerations above all play a vital role in the success of your RAG system. However, with so many moving parts, it can be difficult to understand the effectiveness and impact of each one as you build and iterate.

Ongoing testing, evaluation, and optimization are essential for identifying and monitoring performance gaps. Component-wise evaluation is valuable for isolating specific problems, for example, assessing whether retrieval surfaces the best source within the vector store (a hit-rate sketch appears below). End-to-end evaluation assesses the quality of the entire system against its targeted use case, with the ultimate goal of generating responses that are valuable to human end users.
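
As one example of component-wise evaluation, the sketch below computes retrieval hit rate over a labeled evaluation set: how often a known-relevant chunk appears in the top-k results. Both the retrieve function and the shape of the evaluation set are assumptions for illustration.

```python
def hit_rate(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of queries whose known-relevant chunk appears in the top k."""
    hits = 0
    for example in eval_set:
        top_ids = retrieve(example["query"], k=k)  # returns chunk ids
        if example["relevant_chunk_id"] in top_ids:
            hits += 1
    return hits / len(eval_set)
```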

Leverage Appen's Expertise

Preparing data for RAG is complex and remains a challenge for enterprises looking to deploy LLMs. Appen's AI data annotation platform allows you to seamlessly enhance and integrate your proprietary data, helping improve the success of RAG implementations with data at the core.

Contact Appen today to learn how our expertise and advanced platforms can help accelerate your RAG journey.
