How I created a basic RAG pipeline

By Luv Singh, AI Engineer at Peopleplus.ai

One of the first projects I was given when I joined my current org, Peopleplus.ai, was with Haqdarshak. If you haven’t heard of them, they help marginalised communities in India access government schemes and benefits. They wanted to solve one of the key problems facing them: fast resolution of people’s queries at their centres.

They approached us asking how we could integrate AI into their current workflow. After a few iterations, we decided to create a RAG-based chatbot that their staff can use to resolve queries quickly. Here’s the story of how it got built and how you can build it too:

Fig 1: The full code is open source :-P (see the GitHub repo linked in the code snippets below).

How RAG works

Imagine you're at a library with a brilliant librarian who has read every book but can only remember general knowledge up to a certain date. Now, what if you could enhance this librarian's capabilities by giving them instant access to your specific documents, research papers, or company data? This is essentially what Retrieval Augmented Generation (RAG) does for Large Language Models (LLMs).

Let's break down the magic behind RAG into three simple steps:

1. Document Processing (The Library Organization)

- Your documents are split into manageable chunks; chunk sizes of a few hundred tokens (roughly 100 to 500) are a common starting point

- Each chunk is converted into an embedding (a numerical representation of its meaning)

- These embeddings are stored in a vector database (many options here, such as Qdrant, Pinecone, or Milvus), like books organised on shelves. A minimal sketch of this step is shown below.
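
To make step 1 concrete, here is a minimal sketch of chunking and embedding in plain Python, using OpenAI's "text-embedding-3-small" (the same model we use later in the pipeline). The chunk_text helper, the word-based chunk size, and the reuse of the data/HD_data.txt path are illustrative only; a real pipeline would use a token-aware splitter like the one LlamaIndex provides.

# a minimal sketch of step 1: split a document into chunks and embed each chunk
# assumes OPENAI_API_KEY is set in the environment; chunk_text is a naive, illustrative helper
from openai import OpenAI

def chunk_text(text, chunk_size=300):
    # naive word-based chunking; real pipelines use token-aware splitters with overlap
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

client = OpenAI()
chunks = chunk_text(open("data/HD_data.txt").read())
response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
embeddings = [item.embedding for item in response.data]  # one vector per chunk
# these vectors are what you would write to a vector database (Qdrant, Pinecone, Milvus, ...)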

2. Retrieval (The Smart Search)

- When a question is asked, it's also converted into an embedding. You can build a pre-processing layer to filter out irrelevant input, as is standard industry practice; the same layer can also act as your guardrails layer.

- The system searches for the most relevant chunks by comparing embeddings (using cosine similarity or other distance measures). Retrieval can also be hybrid, combining keyword-based search with embedding-based search.

- Think of it as instantly finding the most relevant pages across all your documents. A sketch of this step follows below.
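
Here is a sketch of the retrieval step, reusing client, chunks, and embeddings from the previous sketch. The question is a made-up example; NumPy just does the cosine-similarity math that a vector database would normally handle for you.

# a minimal sketch of step 2: embed the question and rank chunks by cosine similarity
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question = "What documents are needed to apply for this scheme?"  # illustrative query
q_emb = client.embeddings.create(model="text-embedding-3-small", input=[question]).data[0].embedding

scores = [cosine_similarity(q_emb, emb) for emb in embeddings]
top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
relevant_chunks = [chunks[i] for i in top_k]  # the context we will pass to the LLM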

3. Generation (The Knowledgeable Response)

- Generally, the retrieved information is passed as context alongside the original question

- The LLM uses both its general knowledge and the retrieved context, but you can restrict it to the source documents only (in which case the LLM won't fall back on its pre-training data)

- This produces an answer that's both intelligent and specifically informed by your data. A sketch of this step is shown below.
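
Finally, a sketch of the generation step: the retrieved chunks go into the prompt as context. The model name and the system prompt restricting answers to the context are illustrative; in the actual project the model name comes from an environment variable.

# a minimal sketch of step 3: answer the question using only the retrieved context
context = "\n\n".join(relevant_chunks)
completion = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; the real model name comes from MODEL_NAME in .env
    temperature=0.1,
    messages=[
        {"role": "system", "content": "Answer using only the provided context. If the answer is not there, say so."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(completion.choices[0].message.content)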

What makes RAG particularly powerful is its hybrid approach:

User Query → [Embedding] → Vector Search → Relevant Chunks
                                              ↓
LLM Knowledge + Retrieved Context → Generated Response

Our RAG implementation consists of three main components:

  1. Document Processing Pipeline

  2. Vector Storage and Retrieval System

  3. Generation Pipeline with LLM Integration

Let's dive into each component and understand how they work together:

Document Processing Pipeline

# import necessary libraries
# find full source code at this repo - https://github.com/PeoplePlusAI/Haq/blob/expand/core/ai.py
from llama_index.core import SimpleDirectoryReader, Document

documents = SimpleDirectoryReader(input_files=['data/HD_data.txt']).load_data()
document = Document(text="\n\n".join([doc.text for doc in documents]))

The document processing pipeline starts with loading your data. We use SimpleDirectoryReader to load documents and combine them into a single Document object. This approach:

  • Maintains document coherence

  • Simplifies the processing pipeline

  • Enables efficient chunking and embedding generation
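
If you want to see the chunks explicitly instead of letting the index handle splitting behind the scenes, LlamaIndex's SentenceSplitter can do it. This is an optional inspection step; the chunk_overlap value here is just an illustrative choice.

# optional: split the combined document into nodes yourself to inspect the chunks
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)  # overlap value is illustrative
nodes = splitter.get_nodes_from_documents([document])
print(len(nodes), "chunks; first chunk starts with:", nodes[0].text[:200])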

Embedding Generation

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.embed_model = embed_model
Settings.chunk_size = 512  # tunable; anywhere from ~100 to 500 can work depending on your data
# See Anthropic's contextual retrieval / contextual embeddings too (might cover in a future post)

For embedding generation, we use OpenAI's "text-embedding-3-small" model. Key configurations include:

  • Chunk size of 512 tokens for optimal context windows

  • Integration with OpenAI's efficient embedding model

  • Customizable settings through LlamaIndex's Settings class
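
A quick sanity check that's easy to run at this point: embed a sample string directly and look at the vector. The sample question is made up; the dimensionality (1536 for text-embedding-3-small) is what you should see.

# sanity check: embed a sample string and inspect the vector
sample_vector = embed_model.get_text_embedding("Which schemes support women entrepreneurs?")  # illustrative text
print(len(sample_vector))  # text-embedding-3-small produces 1536-dimensional vectors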

Vector Storage & Retrieval

Persistent Storage Implementation

# to store the index; you can swap in a vector DB (Qdrant, Pinecone, Milvus), SQLite3, or any other store
import os

from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage

PERSIST_DIR = "./storage"

if not os.path.exists(PERSIST_DIR):
    # first run: build the index from the document and persist the embeddings to disk
    index = VectorStoreIndex.from_documents([document])
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # later runs: load the persisted index instead of recomputing embeddings
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

The vector storage system implements:

  • Persistent storage to avoid recomputing embeddings

  • Efficient loading of existing indices

  • Automatic index creation for new documents
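
Before wiring up the full query engine, it can help to check what the index retrieves on its own. A sketch with an illustrative question:

# inspect raw retrieval results from the index (the query text is illustrative)
retriever = index.as_retriever(similarity_top_k=2)
retrieved_nodes = retriever.retrieve("What documents are required to apply for a ration card?")
for node_with_score in retrieved_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:150])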

Query Engine Configuration

query_engine = index.as_query_engine(similarity_top_k=2)

# an alternative, really cool approach; note it may be deprecated in the latest LlamaIndex versions
# FLARE (Forward-Looking Active REtrieval) interleaves retrieval with generation: the engine
# generates lookahead queries and pulls in extra chunks from the document as it writes the answer
from llama_index.core.query_engine import FLAREInstructQueryEngine

flare_query_engine = FLAREInstructQueryEngine(
    query_engine=query_engine,
    max_iterations=1,
    verbose=False,
)

Our retrieval system features:

  • Top-k similarity search (k=2 in this implementation)

  • FLARE query engine for improved retrieval accuracy

  • Configurable iteration parameters for retrieval refinement
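
Using either engine is then a one-liner; the question below is illustrative. The response object also exposes the source chunks, which is handy when you're debugging retrieval quality.

# ask a question through the query engine (the query text is illustrative)
response = query_engine.query("What is the process to apply for a caste certificate?")
print(response)
for source in response.source_nodes:  # the retrieved chunks that backed this answer
    print(source.score, source.node.get_content()[:100])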

LLM Integration and Response Generation

OpenAI Integration with Portkey

from llama_index.llms.openai import OpenAI
from portkey_ai import PORTKEY_GATEWAY_URL

# port_api_key and model are read from environment variables (see the .env section below)
headers = {
    "x-portkey-api-key": port_api_key,
    "x-portkey-provider": "openai",
    "Content-Type": "application/json"
}

try:
    # route requests through the Portkey gateway for monitoring and observability
    llm = OpenAI(model=model,
                 temperature=0.1,
                 api_base=PORTKEY_GATEWAY_URL,
                 default_headers=headers)
except Exception as e:
    # fall back to calling OpenAI directly if the gateway setup fails
    llm = OpenAI(model=model, temperature=0.1)

The LLM integration includes:

  • Primary integration through Portkey for enhanced monitoring

  • Fallback mechanism to direct OpenAI access

  • Configurable temperature settings for response control

  • Error handling for robust production deployment
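
One detail worth making explicit: the llm object has to be registered with LlamaIndex (or passed to the query engine) before queries run, otherwise the default model is used. A minimal sketch of that wiring:

# make sure LlamaIndex actually uses the configured LLM for generation
from llama_index.core import Settings

Settings.llm = llm  # set this before building the query engine
query_engine = index.as_query_engine(similarity_top_k=2)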

Production Deployment Tips:

  1. Monitoring and Logging

    • Implement comprehensive logging

    • Monitor API usage and costs

    • Track retrieval performance

  2. Scaling Considerations

    • Use persistent storage for large datasets

    • Implement caching strategies (a simple sketch follows this list)

    • Consider batch processing for large document sets

  3. Cost Optimization

    • Cache embeddings to reduce API calls

    • Optimize chunk size for your use case

    • Monitor and adjust top-k parameters
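
As an example of the caching point above, here is a deliberately simple exact-match cache around the query engine. Production systems would more likely use Redis or a semantic cache, so treat this as a sketch only.

# a minimal exact-match response cache (sketch only; consider Redis or semantic caching in production)
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_answer(question: str) -> str:
    return str(query_engine.query(question))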

Remember to configure your environment variables in a .env file:

OPENAI_API_KEY=your_openai_key
PORTKEY_API_KEY=your_portkey_key
MODEL_NAME=your_model_name
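
These are loaded at startup; assuming the python-dotenv package is installed, the variables used in the snippets above (port_api_key, model) come from here:

# load the .env file and read the keys used earlier (assumes python-dotenv is installed)
import os
from dotenv import load_dotenv

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")  # picked up automatically by the OpenAI client
port_api_key = os.getenv("PORTKEY_API_KEY")
model = os.getenv("MODEL_NAME")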

RAG isn't just another tool in our AI toolkit – it's a bridge between human knowledge and machine intelligence. As we stand at this intersection, the possibilities are boundless. The organizations that thrive won't be those with the most data, but those that can most effectively turn that data into actionable intelligence.

As you implement your own RAG system, remember:

  • Start small but think big

  • Focus on user experience and trust

  • Measure impacts beyond technical metrics

  • Stay agile and adapt to emerging capabilities

Remember: The future of AI isn't about replacing human intelligence; it's about augmenting it. RAG is your gateway to this augmented future.


Ready to take your RAG implementation to the next level? Check out our GitHub repository for updates, contribute to the community, and stay tuned for more advanced implementations and optimizations. The future of intelligent information retrieval is here, and it's more exciting than ever.