RAG System
Tutorial - Build a Retrieval-Augmented Generation system with vector embeddings and MCP
Step-by-step tutorial for building a Retrieval-Augmented Generation system with NeurosLink AI and Model Context Protocol (MCP)
What You'll Build
A production-ready RAG (Retrieval-Augmented Generation) system featuring:
📚 Document ingestion from multiple formats (PDF, MD, TXT)
🔍 Semantic search with vector embeddings
🤖 AI-powered Q&A with source citations
🔧 MCP integration for file system access
💾 Vector storage with Pinecone/in-memory
🎯 Context-aware responses
📊 Relevance scoring and ranking
Tech Stack:
Next.js 14+
TypeScript
NeurosLink AI with MCP
OpenAI Embeddings
Pinecone (or in-memory vector store)
PDF parsing libraries
Time to Complete: 60-90 minutes
Prerequisites
Node.js 18+
OpenAI API key (for embeddings)
Anthropic API key (for generation)
Pinecone account (optional, free tier)
Sample documents to index
Understanding RAG
RAG combines retrieval and generation: the system first retrieves the most relevant chunks of your documents, then passes them to the model as context so the generated answer stays grounded in your data.
Why RAG?
✅ Access to custom/private data
✅ Up-to-date information
✅ Reduced hallucinations
✅ Source attribution
✅ Cost-effective (smaller context windows)
Step 1: Project Setup
Initialize Project
Options:
TypeScript: Yes
Tailwind CSS: Yes
App Router: Yes
Install Dependencies
Environment Setup
Create .env.local:
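A minimal sketch of the environment file. OPENAI_API_KEY and ANTHROPIC_API_KEY are the standard variable names read by those SDKs; the Pinecone variables are only needed if you use the Pinecone store in Step 6, and the index name shown here is an assumption for this tutorial.

```
# .env.local (never commit this file)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Optional: only needed for the Pinecone store in Step 6
PINECONE_API_KEY=...
PINECONE_INDEX=rag-tutorial
```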
Step 2: Document Processing
Create Document Parser
Create src/lib/document-parser.ts:
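A minimal parser sketch: it assumes the pdf-parse package for PDF extraction (any PDF parsing library works) and reads Markdown and plain text directly. The ParsedDocument shape is a convention reused by the later sketches.

```typescript
// src/lib/document-parser.ts -- sketch; assumes the `pdf-parse` package for PDFs
import { readFile } from "node:fs/promises";
import path from "node:path";
import pdf from "pdf-parse";

export interface ParsedDocument {
  source: string;   // file path, kept for citations later
  content: string;  // raw extracted text
}

export async function parseDocument(filePath: string): Promise<ParsedDocument> {
  const ext = path.extname(filePath).toLowerCase();

  if (ext === ".pdf") {
    // pdf-parse extracts plain text from the PDF buffer
    const data = await pdf(await readFile(filePath));
    return { source: filePath, content: data.text };
  }

  if (ext === ".md" || ext === ".txt") {
    // Markdown and plain text can be read directly
    return { source: filePath, content: await readFile(filePath, "utf-8") };
  }

  throw new Error(`Unsupported file type: ${ext}`);
}
```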
Step 3: Text Chunking
Create src/lib/text-chunker.ts:
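A simple character-window chunker sketch, using the 1000-character chunks with 200-character overlap described in Step 7. The DocumentChunk shape is an assumption carried through the rest of the sketches.

```typescript
// src/lib/text-chunker.ts -- simple character-window chunker (sketch)
export interface DocumentChunk {
  id: string;
  text: string;
  source: string; // originating file, kept for citations
}

export function chunkText(
  text: string,
  source: string,
  chunkSize = 1000, // characters per chunk
  overlap = 200     // characters shared with the previous chunk
): DocumentChunk[] {
  const chunks: DocumentChunk[] = [];
  let start = 0;
  let index = 0;

  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({
      id: `${source}#${index++}`,
      text: text.slice(start, end),
      source,
    });
    if (end === text.length) break;
    // Step back by `overlap` so context carries across chunk boundaries
    start = end - overlap;
  }

  return chunks;
}
```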
Step 4: Embedding Service
Create src/lib/embeddings.ts:
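A sketch of the embedding service built on the official openai package and the text-embedding-3-small model referenced below. Requests are batched so each call stays within the API's per-request input limits.

```typescript
// src/lib/embeddings.ts -- wraps OpenAI's embeddings endpoint (sketch)
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const MODEL = "text-embedding-3-small"; // 1536-dimensional vectors
const BATCH_SIZE = 100;                 // texts per API call

export async function embedTexts(texts: string[]): Promise<number[][]> {
  const vectors: number[][] = [];

  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const batch = texts.slice(i, i + BATCH_SIZE);
    const response = await openai.embeddings.create({
      model: MODEL,
      input: batch,
    });
    // response.data is ordered to match the input batch
    vectors.push(...response.data.map((d) => d.embedding));
  }

  return vectors;
}

export async function embedText(text: string): Promise<number[]> {
  const [vector] = await embedTexts([text]);
  return vector;
}
```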
Step 5: Vector Store (In-Memory)
Create src/lib/vector-store.ts:
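A minimal in-memory implementation, reusing the embedding helpers and DocumentChunk type sketched above; the notes below walk through what each piece does.

```typescript
// src/lib/vector-store.ts -- in-memory store with cosine-similarity search (sketch)
import { embedTexts, embedText } from "./embeddings";
import type { DocumentChunk } from "./text-chunker";

interface VectorEntry {
  vector: number[];
  chunk: DocumentChunk;
}

export interface SearchResult {
  chunk: DocumentChunk;
  score: number; // cosine similarity, -1..1 (higher = more similar)
}

const entries: VectorEntry[] = []; // all vectors live in RAM

export async function addChunks(chunks: DocumentChunk[]): Promise<void> {
  // Batch-embed every chunk, then store vector + chunk together
  const vectors = await embedTexts(chunks.map((c) => c.text));
  chunks.forEach((chunk, i) => entries.push({ vector: vectors[i], chunk }));
}

export async function search(query: string, topK = 5): Promise<SearchResult[]> {
  // Embed the query into the same vector space as the document chunks
  const queryVector = await embedText(query);

  return entries
    .map((e) => ({ chunk: e.chunk, score: cosineSimilarity(queryVector, e.vector) }))
    .sort((a, b) => b.score - a.score) // most relevant first
    .slice(0, topK);
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```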
Vector entry structure: Each entry stores the chunk's embedding vector, metadata, and a reference to the original chunk.
In-memory storage: All vectors are stored in RAM. For production with large datasets (>10K docs), use Pinecone or another vector database.
Batch embedding: Process all chunks together for efficiency. OpenAI allows up to 100 texts per API call.
Convert text to vectors: Each chunk is converted to a 1536-dimensional embedding vector (using OpenAI's text-embedding-3-small model).
Semantic search: Find the most relevant chunks by comparing vector similarity, not keyword matching.
Query embedding: Convert the user's question into the same vector space as the document chunks.
Calculate similarity: Compute cosine similarity between query vector and all document vectors. Score ranges from -1 to 1 (higher = more similar).
Rank by relevance: Sort results by similarity score in descending order (most relevant first).
Return top results: Return only the topK most relevant chunks to use as context for the AI.
Step 6 (Alternative): Pinecone Vector Store
Create src/lib/pinecone-store.ts:
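A sketch using the @pinecone-database/pinecone client as a drop-in replacement for the in-memory store. It assumes an existing index created with 1536 dimensions (to match text-embedding-3-small) and the Pinecone environment variables from the setup step.

```typescript
// src/lib/pinecone-store.ts -- Pinecone-backed vector store (sketch)
import { Pinecone } from "@pinecone-database/pinecone";
import { embedTexts, embedText } from "./embeddings";
import type { DocumentChunk } from "./text-chunker";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
// Assumes an existing index with dimension 1536 (text-embedding-3-small)
const index = pinecone.index(process.env.PINECONE_INDEX ?? "rag-tutorial");

export async function addChunks(chunks: DocumentChunk[]): Promise<void> {
  const vectors = await embedTexts(chunks.map((c) => c.text));
  await index.upsert(
    chunks.map((chunk, i) => ({
      id: chunk.id,
      values: vectors[i],
      metadata: { text: chunk.text, source: chunk.source },
    }))
  );
}

export async function search(query: string, topK = 5) {
  const queryVector = await embedText(query);
  const results = await index.query({
    vector: queryVector,
    topK,
    includeMetadata: true, // we need text + source back for the prompt
  });
  return results.matches.map((m) => ({
    score: m.score ?? 0,
    chunk: {
      id: m.id,
      text: String(m.metadata?.text ?? ""),
      source: String(m.metadata?.source ?? ""),
    },
  }));
}
```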
Step 7: RAG Service
Create src/lib/rag-service.ts:
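A sketch tying the pipeline together. Since the NeurosLink AI call itself isn't shown here, the generate() helper below uses the Anthropic SDK directly as a stand-in; swap in your NeurosLink AI client at that point. The notes that follow explain each stage.

```typescript
// src/lib/rag-service.ts -- indexing pipeline + RAG query flow (sketch)
import { readdir } from "node:fs/promises";
import path from "node:path";
import Anthropic from "@anthropic-ai/sdk";
import { parseDocument } from "./document-parser";
import { chunkText } from "./text-chunker";
import { addChunks, search } from "./vector-store"; // or "./pinecone-store"

// Stand-in for the NeurosLink AI call: this hits the Anthropic SDK directly.
// Replace with your NeurosLink AI client if you are following the tutorial stack.
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY
async function generate(prompt: string): Promise<string> {
  const msg = await anthropic.messages.create({
    model: "claude-3-5-sonnet-latest", // model name may need updating
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
  const block = msg.content[0];
  return block.type === "text" ? block.text : "";
}

export async function indexDocuments(dir: string): Promise<number> {
  // Indexing pipeline: parse -> chunk -> embed -> store
  const files = (await readdir(dir)).filter((f) => /\.(pdf|md|txt)$/i.test(f));
  let total = 0;
  for (const file of files) {
    const doc = await parseDocument(path.join(dir, file));
    const chunks = chunkText(doc.content, doc.source);
    await addChunks(chunks);
    total += chunks.length;
  }
  return total;
}

export async function answerQuestion(question: string) {
  // 1. Retrieve the 5 most relevant chunks via vector similarity
  const results = await search(question, 5);

  // 2. Build augmented context with source labels so the model can cite them
  const context = results
    .map((r, i) => `[Source ${i + 1}: ${r.chunk.source}]\n${r.chunk.text}`)
    .join("\n\n");

  // 3. Structured prompt: stay grounded in the context, cite sources
  const prompt =
    `Answer the question using ONLY the context below. ` +
    `Cite sources as [Source N]. If the answer is not in the context, say so.\n\n` +
    `Context:\n${context}\n\nQuestion: ${question}`;

  // 4. Generate the final answer
  return { answer: await generate(prompt), sources: results };
}
```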
Use Claude for generation: Claude 3.5 Sonnet excels at following instructions and citing sources accurately in RAG applications.
Chunk configuration: 1000 characters per chunk with 200 character overlap to maintain context across chunk boundaries.
Indexing pipeline: Parse documents → chunk text → create embeddings → store in vector database. Run this once when documents change.
Text chunking: Split documents into smaller chunks. Large documents can't fit in context windows, and smaller chunks improve retrieval precision.
Create embeddings: Convert each chunk to a vector representation. This is the most expensive operation (OpenAI API costs ~$0.02/1M tokens).
RAG query flow: Retrieve relevant chunks → build context → generate answer with citations.
Semantic search: Find the 5 most relevant chunks using vector similarity (not keyword matching).
Build augmented context: Format retrieved chunks with source labels to enable the AI to cite sources in its answer.
Structured prompt: Clear instructions help the AI stay grounded in the provided context and cite sources properly.
Generate final answer: NeurosLink AI sends the question + context to Claude, which generates an answer based on the retrieved information.
Step 8: API Routes
Index Documents API
Create src/app/api/index/route.ts:
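A sketch of the indexing route (App Router). It assumes documents live in a docs/ folder at the project root and reuses the indexDocuments helper from Step 7.

```typescript
// src/app/api/index/route.ts -- triggers the indexing pipeline (sketch)
import { NextResponse } from "next/server";
import path from "node:path";
import { indexDocuments } from "@/lib/rag-service";

export async function POST() {
  try {
    // Index everything in the project's docs/ folder
    const chunksIndexed = await indexDocuments(path.join(process.cwd(), "docs"));
    return NextResponse.json({ success: true, chunksIndexed });
  } catch (error) {
    return NextResponse.json(
      { success: false, error: error instanceof Error ? error.message : "Indexing failed" },
      { status: 500 }
    );
  }
}
```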
Query API
Create src/app/api/query/route.ts:
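A matching sketch for the query route, returning the generated answer plus the sources and relevance scores the UI displays.

```typescript
// src/app/api/query/route.ts -- answers a question against the index (sketch)
import { NextResponse } from "next/server";
import { answerQuestion } from "@/lib/rag-service";

export async function POST(request: Request) {
  const { question } = await request.json();

  if (!question || typeof question !== "string") {
    return NextResponse.json({ error: "Missing 'question'" }, { status: 400 });
  }

  const { answer, sources } = await answerQuestion(question);
  return NextResponse.json({
    answer,
    sources: sources.map((s) => ({ source: s.chunk.source, score: s.score })),
  });
}
```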
Step 9: Frontend Interface
Create src/app/page.tsx:
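A bare-bones client page sketch that wires the two routes together: one button to index, one input to ask. Styling and error handling are kept minimal, so adapt it to your own UI.

```tsx
// src/app/page.tsx -- minimal UI: index button + question box (sketch)
"use client";

import { useState } from "react";

export default function Home() {
  const [question, setQuestion] = useState("");
  const [answer, setAnswer] = useState("");
  const [sources, setSources] = useState<{ source: string; score: number }[]>([]);
  const [status, setStatus] = useState("");

  async function indexDocs() {
    setStatus("Indexing...");
    const res = await fetch("/api/index", { method: "POST" });
    const data = await res.json();
    setStatus(data.success ? `Indexed ${data.chunksIndexed} chunks` : "Indexing failed");
  }

  async function ask() {
    setStatus("Thinking...");
    const res = await fetch("/api/query", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ question }),
    });
    const data = await res.json();
    setAnswer(data.answer);
    setSources(data.sources ?? []);
    setStatus("");
  }

  return (
    <main className="mx-auto max-w-2xl space-y-4 p-8">
      <button onClick={indexDocs} className="rounded bg-blue-600 px-4 py-2 text-white">
        Index Documents
      </button>
      <div className="flex gap-2">
        <input
          className="flex-1 rounded border p-2"
          value={question}
          onChange={(e) => setQuestion(e.target.value)}
          placeholder="Ask a question about your documents"
        />
        <button onClick={ask} className="rounded bg-green-600 px-4 py-2 text-white">
          Ask
        </button>
      </div>
      {status && <p className="text-sm text-gray-500">{status}</p>}
      {answer && <p className="whitespace-pre-wrap">{answer}</p>}
      <ul className="text-sm text-gray-600">
        {sources.map((s, i) => (
          <li key={i}>
            [Source {i + 1}] {s.source} (score: {s.score.toFixed(3)})
          </li>
        ))}
      </ul>
    </main>
  );
}
```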
Step 10: Testing
Prepare Test Documents
Create docs/ folder with sample files:
docs/introduction.md:
docs/architecture.md:
Index Documents
Start dev server:
npm run dev
Click "Index Documents"
Wait for completion
Test Queries
Ask a few questions that your sample documents can answer.
Verify:
Relevant sources retrieved
Answer cites sources
Relevance scores make sense
Step 11: Production Enhancements
Add Streaming Responses
Add Document Upload
Add Metadata Filtering
Step 12: MCP Integration (Advanced)
You can use the Model Context Protocol (MCP) to give the AI direct access to the file system for reading documents.
Troubleshooting
Embeddings API Errors
Check that OPENAI_API_KEY is set in .env.local and keep batches within the per-request limit; on rate-limit responses, retry with a short backoff.
Memory Issues with Large Documents
The in-memory store keeps every vector in RAM; for large document sets (roughly more than 10K documents), switch to the Pinecone store from Step 6.
Poor Retrieval Quality
Experiment with chunk size and overlap (Step 7 uses 1000/200 characters), increase topK, and confirm documents are parsed into clean text.
Related Documentation
Feature Guides:
Auto Evaluation - Automated quality scoring for RAG responses
Guardrails - Content filtering for generated answers
Multimodal Chat - Add image/PDF processing to RAG
Tutorials & Examples:
Chat App Tutorial - Build a chat interface
MCP Server Catalog - MCP servers for data retrieval
Summary
You've built a production-ready RAG system with:
✅ Multi-format document ingestion (PDF, MD, TXT)
✅ Text chunking with overlap
✅ Vector embeddings (OpenAI)
✅ Semantic search
✅ AI-powered Q&A with source citations
✅ Relevance scoring
✅ Modern web interface
Cost Analysis:
Embedding: ~$0.02 per 1M tokens
Generation: ~$3 per 1M input tokens (Claude 3.5 Sonnet)
1000 documents → ~$0.50 to index
1000 queries → ~$2
Next Steps:
Add authentication
Implement caching
Add document versioning
Deploy to production