Workshop #11
Kaplan Business School
By the end of this workshop, you will be able to:
TechCorp has accumulated 10 years of customer service transcripts, product documentation, and technical troubleshooting guides. They want to use ChatGPT to help support agents quickly find solutions.
Without RAG (generic LLM):
Question: "How do I reset the TC-5000 router?"
Response: "I don't have specific information about the TC-5000 router. Generally, routers can be reset by..."
Vague, unhelpful, potentially incorrect
With RAG:
Question: "How do I reset the TC-5000 router?"
Response: "According to TechCorp Manual v3.2: Hold the reset button for 15 seconds, then power cycle. The LED will flash amber during reset."
Accurate, specific, cited
ChatGPT and similar models have fundamental limitations for business applications:
LLMs only know what they were trained on. Your company's internal documents, customer data, and proprietary information don't exist in their knowledge base.
Training data has a cutoff date. New products, updated policies, or recent market changes are unknown to the model.
When LLMs don't know the answer, they often generate plausible-sounding but incorrect information—dangerous for business decisions.
The Solution: We need to give LLMs access to specific, relevant information when answering questions. This is where Retrieval-Augmented Generation comes in.
RAG (Retrieval-Augmented Generation)
Concept: Give the LLM access to relevant documents when it answers questions
Analogy: Open-book exam—the model can consult reference materials
Best for: Frequently changing information, need for source citations, limited training data
Fine-Tuning
Concept: Train the LLM on your specific domain and writing style
Analogy: Intensive studying—the model internalizes the knowledge
Best for: Stable knowledge domain, consistent style needed, large training datasets
Today's focus: RAG (We'll cover fine-tuning later)
User asks: "How do I reset the TC-5000?"
System finds relevant documents in database
Pulls TC-5000 manual section
LLM reads docs + question, generates answer
The LLM never "memorizes" your documents. Instead, it's given relevant excerpts in real-time to inform its response. This means it always has access to the latest information.
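A minimal sketch of what "given relevant excerpts in real-time" can look like in code: the retrieved excerpts are simply pasted into the prompt that is sent to the LLM. The helper name `build_prompt`, the prompt wording, and the (source, text) excerpt layout are illustrative assumptions, not the API of any particular library.

```python
def build_prompt(question: str, excerpts: list[tuple[str, str]]) -> str:
    """Combine a user question with retrieved (source, text) excerpts.

    Hypothetical helper for illustration; the prompt wording is an assumption.
    """
    # Label each excerpt with its source so the answer can cite it.
    context = "\n".join(f"[{source}] {text}" for source, text in excerpts)
    return (
        "Answer the question using only the context below. "
        "Cite the source in brackets.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Excerpt taken from the TC-5000 example earlier in the workshop.
excerpts = [
    ("TechCorp Manual v3.2",
     "Hold the reset button for 15 seconds, then power cycle."),
]
prompt = build_prompt("How do I reset the TC-5000 router?", excerpts)
print(prompt)
```

Because the documents live in the prompt rather than in the model's weights, changing the excerpts changes the answer without any retraining.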
Next question: How does the system know which documents are "relevant"?
Example: We know "kitten" and "cat" are related even if the exact words don't match.
Computers can't "read"—they need numerical representations to compare documents.
We convert each document into a unique "fingerprint" made of numbers. Documents with similar content get similar fingerprints, allowing computers to find related information mathematically.
An embedding is a list of numbers that represents the meaning of text. Similar meanings result in similar numbers.
"cat" = [1.6, 2.3, 4.7, ..., 8.2, 17.9]
"kitten" = [1.6, 2.4, 4.7, ..., 8.3, 18.4]
"dog" = [1.5, 2.2, 4.8, ..., 8.1, 17.7]
"puppy" = [1.5, 2.3, 4.8, ..., 8.2, 18.0]
"car" = [8.2, 1.1, 2.3, ..., 4.5, 6.7]
Notice: "cat" and "kitten" have very similar numbers, while "car" is clearly different. The embedding model captures semantic meaning! (The values shown are illustrative.)
For business documents: A product manual and a troubleshooting guide about the same product would have similar embeddings, making them easy to find together.
In reality, embeddings have hundreds of dimensions.
[Figure: simplified 2D visualization of the embedding space]
Key Observation: Similar concepts cluster together. When searching for "plumbing," the system finds the nearest neighbors in vector space—in this case, kitchen/sink/faucet items rather than tools or appliances.
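The nearest-neighbor idea can be sketched with a handful of 2D points. All coordinates below are invented for illustration, including the assumed position of the "plumbing" query. For simplicity this sketch ranks by straight-line (Euclidean) distance rather than cosine similarity.

```python
import math

# Hypothetical 2D coordinates, invented for illustration only.
# Real embeddings have hundreds of dimensions.
points = {
    "sink": (1.0, 1.2),
    "faucet": (1.2, 1.0),
    "kitchen": (1.5, 1.6),
    "hammer": (5.0, 1.0),
    "drill": (5.3, 1.4),
    "toaster": (3.0, 5.0),
}
query = (1.1, 0.9)  # assumed position of the query "plumbing"


def nearest(query, points, k=3):
    """Return the k labels closest to the query by Euclidean distance."""
    ranked = sorted(points, key=lambda label: math.dist(query, points[label]))
    return ranked[:k]


neighbors = nearest(query, points)
print(neighbors)
```

With these made-up coordinates, the three closest items all come from the kitchen/sink/faucet cluster, and the tools and appliances are ranked further away.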
Once we have vector embeddings, we need to measure how similar they are. Cosine similarity is the standard approach:
Similarity = (Vector1 · Vector2) / (||Vector1|| × ||Vector2||)
Range: -1 (opposite direction) to +1 (same direction); values near 0 indicate unrelated content
Document A: "The TC-5000 router supports WiFi 6 and has excellent range."
Document B: "TC-5000 provides WiFi 6 connectivity with extended coverage."
Document C: "Our company picnic will be held in the park next Saturday."
Results:
Similarity(A, B) = 0.94 ← Very similar (same topic)
Similarity(A, C) = 0.12 ← Not similar (different topics)
Don't worry: Python libraries handle these calculations automatically. You just need to understand the concept.
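For those curious, the formula is short enough to write by hand. The sketch below uses only the Python standard library and reuses the toy "cat", "kitten", and "car" vectors from the embedding example. It keeps only the five components that were shown and leaves out the elided middle values, so the scores are illustrative rather than real model output.

```python
import math


def cosine_similarity(v1, v2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


# Only the five visible components of each toy vector; the "..." values
# from the slide are omitted, so these are not real embeddings.
cat = [1.6, 2.3, 4.7, 8.2, 17.9]
kitten = [1.6, 2.4, 4.7, 8.3, 18.4]
car = [8.2, 1.1, 2.3, 4.5, 6.7]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_car = cosine_similarity(cat, car)
print(round(sim_cat_kitten, 3), round(sim_cat_car, 3))
```

The "cat"/"kitten" score comes out higher than the "cat"/"car" score, which is the property a retrieval system relies on.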
User asks question: "How do I troubleshoot TC-5000 connectivity issues?"
System converts the question into a vector: [2.3, 1.7, 4.2, ..., 8.9]
Searches for documents with similar vectors → Finds TC-5000 troubleshooting guide (similarity: 0.92)
Combines: User question + Retrieved documents → Sends to LLM
LLM reads the context and generates: "According to the TC-5000 guide, try these steps..."
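The retrieval steps above can be sketched end to end in a few lines. Two assumptions are made loudly: a simple word-count vector stands in for a real embedding model (it matches shared words, not meaning), and the two-document knowledge base is hypothetical. The final LLM call is left out; the sketch stops at the text that would be sent to the model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (NOT a semantic model)."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical mini knowledge base: title -> text.
documents = {
    "TC-5000 troubleshooting guide":
        "If the TC-5000 has connectivity issues, restart the router "
        "and check the cables.",
    "Company picnic notice":
        "Our company picnic will be held in the park next Saturday.",
}


def retrieve(question: str, documents: dict, k: int = 1):
    """Return the k (title, text) pairs most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(
        documents.items(),
        key=lambda item: cosine(q_vec, embed(item[1])),
        reverse=True,
    )
    return ranked[:k]


question = "How do I troubleshoot TC-5000 connectivity issues?"
top = retrieve(question, documents)
context = "\n".join(f"[{title}] {text}" for title, text in top)
llm_input = f"Context:\n{context}\n\nQuestion: {question}"
print(llm_input)
```

In a real system, `embed` would be replaced by a pre-trained embedding model and the sorted scan by a vector database query.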
HuggingFace
Purpose: Convert text to vectors
Function: Provides pre-trained models that understand semantic meaning across multiple languages and domains
Example: sentence-transformers/all-MiniLM-L6-v2
ChromaDB
Purpose: Store and search vectors
Function: Specialized database optimized for finding similar vectors quickly, even with millions of documents
Benefit: 100x faster than traditional databases for similarity search
LangChain
Purpose: Connect everything
Function: Framework that orchestrates the entire workflow—from question to embedding to retrieval to generation
Benefit: Write less code, focus on business logic
Each component is specialized and best-in-class. Using them together gives you flexibility to swap components as technology improves while maintaining the same overall architecture.
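One way to picture this swap-ability is a pipeline function that only knows the three roles (embed, search, generate), not which product fills each role. Everything below is a hypothetical stand-in written for illustration. None of it is the LangChain, ChromaDB, or HuggingFace API.

```python
from typing import Callable


def answer(
    question: str,
    embed: Callable[[str], list[float]],
    search: Callable[[list[float]], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Same flow regardless of which concrete components are plugged in."""
    vector = embed(question)
    excerpts = search(vector)
    prompt = "\n".join(excerpts) + "\n\nQuestion: " + question
    return generate(prompt)


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_embed(text: str) -> list[float]:
    return [float(len(text))]


def fake_search(vector: list[float]) -> list[str]:
    return ["TC-5000 manual: hold the reset button for 15 seconds."]


def echo_llm(prompt: str) -> str:
    return "DRAFT ANSWER based on:\n" + prompt


def shouting_llm(prompt: str) -> str:
    return echo_llm(prompt).upper()


question = "How do I reset the TC-5000?"
first = answer(question, fake_embed, fake_search, echo_llm)
# Swap only the generation component; the rest of the pipeline is untouched.
second = answer(question, fake_embed, fake_search, shouting_llm)
print(first)
```

Replacing one argument changes one stage while the overall architecture stays the same.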
RAG dramatically reduces hallucinations because responses are grounded in actual documents. Users can verify source citations.
Metric: Studies show 85% reduction in factual errors compared to generic LLMs
Update documents in the database, and the LLM immediately has access to new information—no retraining required.
Example: Add today's product release notes, answer questions about new features tomorrow
No expensive model training. Use existing LLMs with your data. Typical implementation cost is 90% less than fine-tuning.
Comparison: RAG setup: $1,000-5,000 | Fine-tuning: $50,000-200,000
Every answer can be traced to source documents. Critical for compliance, auditing, and building user confidence.
Use case: Financial advice, medical information, legal guidance
| Industry | Application | Business Impact |
|---|---|---|
| Customer Support | Instant access to product manuals, troubleshooting guides, and past support tickets | 67% reduction in average handling time, 45% improvement in first-call resolution |
| Legal Services | Query case law, contracts, and regulations to support legal research | 80% faster document review, 90% cost reduction in junior associate hours |
| Healthcare | Access medical literature, clinical guidelines, and patient history for decision support | 35% improvement in diagnostic accuracy, 50% reduction in research time |
| Finance | Analyze company filings, market reports, and financial regulations | Real-time insights from 10,000+ documents, 70% faster compliance checks |
| Human Resources | Employee handbook, policies, benefits information instantly accessible | 60% reduction in HR inquiry volume, 24/7 employee self-service |
Fine-tuning means continuing the training of a pre-trained LLM on your specific dataset. The model learns patterns, terminology, and style from your data.
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Setup Time | Hours to days | Weeks to months |
| Setup Cost | $1,000 - $5,000 | $50,000 - $200,000 |
| Data Required | Any amount (even 10 docs) | 10,000+ training examples |
| Update Frequency | Real-time (just add docs) | Requires full retraining |
| Transparency | High (shows sources) | Low (black box) |
| Response Speed | Slower (retrieval + generation) | Faster (just generation) |
| Best Use Case | Dynamic knowledge bases, customer support, research | Specialized domains, consistent style, stable knowledge |
Recommended Strategy: Start with RAG. It's faster, cheaper, and more flexible. Consider fine-tuning only when you have specific style/domain requirements and substantial training data.
Collect and organize your company's documents. Clean the data (remove duplicates, fix formatting, remove sensitive information).
Tip: Start with 50-100 documents to test the system before scaling up.
Set up Python environment with LangChain, ChromaDB, and HuggingFace libraries.
Convert each document into vector embeddings using HuggingFace models.
Load embeddings into ChromaDB for efficient similarity search.
Use LangChain to connect the retrieval system to your LLM, creating a complete question-answering pipeline.
A working RAG system that can answer questions about a set of sample documents. You'll see how each component works together.
File: DATA5000_Simple_RAG.ipynb
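What you'll do: Create a complete RAG pipeline with document embeddings, vector storage, and querying capabilities.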
Time: 20-25 minutes
File: DATA5000_Cosine_similarity.ipynb
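What you'll do: See how word embeddings capture semantic meaning and calculate similarity scores between different words.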
Time: 15 minutes
Learning Goal: By the end of the activities, you should understand how to connect your own company documents to an LLM and see why vector embeddings enable semantic search.
TechCorp's customer support team handles 1,000 emails daily about 500+ products. Agents spend 40% of their time searching for information in 10 years of documentation.
Challenge: Average response time of 4 hours, inconsistent answers, high agent frustration
Cost: $2.5M annually in support costs
Data: Ingested 10,000 documents (product manuals, FAQs, past tickets, troubleshooting guides)
System: Built RAG interface integrated with email system
Cost: $25,000 setup + $500/month hosting
✓ Average response time: 45 minutes (81% reduction from 4 hours)
✓ First-response accuracy: 92% (up from 67%)
✓ Agent satisfaction: +45 points
✓ Cost savings: $1.2M annually
ROI: 4,700% in first year (annual savings relative to the $25,000 setup cost)
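The case-study arithmetic can be checked with a short sketch. It assumes ROI is defined as (savings - cost) / cost, which is an assumption about how the figure was derived. It also shows how the result shifts once twelve months of hosting are added to the cost.

```python
# Figures taken from the case study above.
annual_savings = 1_200_000
setup_cost = 25_000
hosting_per_month = 500
minutes_before = 4 * 60   # 4-hour average response time
minutes_after = 45


def roi_percent(gain: float, cost: float) -> float:
    """ROI = (gain - cost) / cost, as a percentage (assumed definition)."""
    return (gain - cost) / cost * 100


roi_setup_only = roi_percent(annual_savings, setup_cost)
first_year_cost = setup_cost + 12 * hosting_per_month
roi_with_hosting = roi_percent(annual_savings, first_year_cost)
response_reduction = (minutes_before - minutes_after) / minutes_before * 100

print(round(roi_setup_only), round(roi_with_hosting), response_reduction)
```

Counting the setup cost alone gives 4,700%. Including a year of hosting lowers the figure to roughly 3,770%, and the drop from 4 hours to 45 minutes works out to about an 81% reduction.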
For maximum performance, some organizations use both approaches together:
Fine-tuned model: Trained on legal writing style, case law structure, and legal reasoning patterns
RAG system: Accesses current case database, recent rulings, and client-specific documents
Result: Natural legal writing (from fine-tuning) with accurate, current case citations (from RAG)
Cost consideration: This approach costs $150,000-300,000 to implement but provides best-in-class performance for mission-critical applications.
| Factor | Success Pattern | Failure Pattern |
|---|---|---|
| Document Quality | Clean, well-organized, accurate source documents with clear structure | Poorly formatted, outdated, or contradictory documents that confuse the system |
| Chunk Size | Optimal: 200-500 words per chunk. Balances context and precision. | Too small (fragments sentences) or too large (retrieves irrelevant content) |
| Metadata Strategy | Rich metadata (date, author, department, document type) enables filtering | No metadata means retrieving potentially outdated or irrelevant documents |
| Monitoring | Track answer quality, user feedback, retrieval accuracy—iterate constantly | "Set and forget"—quality degrades as new docs are added without review |
| User Training | Teach users how to ask good questions and interpret citations | Users don't trust system because they don't understand how it works |
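The chunk-size row can be made concrete with a small word-based splitter. This is a sketch under stated assumptions: the 300-word default is simply a value inside the 200-500 range from the table, and the 50-word overlap between neighboring chunks is an added assumption meant to avoid cutting an idea in half at a boundary.

```python
def chunk_words(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size words with overlapping edges."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks


# Synthetic 700-word document used only to demonstrate the splitter.
sample = " ".join(f"word{i}" for i in range(700))
chunks = chunk_words(sample)
chunk_lengths = [len(chunk.split()) for chunk in chunks]
print(chunk_lengths)
```

Each chunk would then be embedded and stored separately, ideally with metadata such as date and document type attached.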
Assessment 3 requires you to develop a data-driven business recommendation. RAG and fine-tuning are powerful tools you can use:
Build a RAG system using your business case documents, market research, and industry reports to support your analysis and recommendations.
Shows prescriptive analytics capability
Create a custom knowledge base that executives can query to understand your recommendations and supporting evidence.
Demonstrates business value
Use RAG to analyze competitor strategies, customer feedback, or market trends specific to your business case.
Provides data-driven insights
Next week (Week 12): We'll explore advanced LLM applications including agents, tool use, and more sophisticated RAG implementations that you can leverage in your assessments.
Generic LLMs don't have access to your company's specific data, leading to vague or incorrect answers.
Retrieval-Augmented Generation gives LLMs access to relevant documents in real-time, grounding responses in actual information.
Convert documents to vector embeddings → Store in specialized database → Retrieve similar documents when users ask questions → LLM generates informed answer.
HuggingFace (embeddings), ChromaDB (vector storage), LangChain (orchestration), LLM (generation).
85% reduction in factual errors, 60-90% faster response times, 90% lower cost than fine-tuning, always current information.
RAG: Dynamic information, need sources, fast implementation. Fine-tuning: Stable domain, style requirements, large training data. Both: Maximum performance for critical applications.
Complete the two Google Colab notebooks to gain hands-on experience with vector embeddings and RAG systems.
File: DATA5000_Cosine_similarity.ipynb
See how word embeddings capture semantic meaning and calculate similarity scores between different words.
File: DATA5000_Simple_RAG.ipynb
Create a complete RAG pipeline with document embeddings, vector storage, and querying capabilities.
Next week we'll explore advanced LLM applications: agents that can use tools, more sophisticated RAG architectures, and multi-step reasoning systems.
Remember: The goal isn't to memorize syntax—it's to understand how these systems work so you can apply them to real business problems.
Let's discuss how RAG can solve real business challenges
DATA5000 - Week 11