DATA4800 - Workshop 9
Understanding how machines process and generate human language
Understand text pre-processing, classification techniques, and the Naïve Bayes algorithm
Discover transformers, GPT evolution, and real-world use cases in business contexts
Examine responsible use of LLMs in academic and professional environments
Your online retail company receives 10,000 customer reviews daily across multiple platforms
Approach: Feature engineering, statistical models, rule-based systems
Example: Naïve Bayes classifier, Bag-of-Words
Characteristics: Requires labeled data, explainable, limited context understanding
Approach: Deep learning, transformer architecture, pre-trained models
Example: GPT-4, ChatGPT, Claude
Characteristics: Context-aware, minimal training, human-like understanding
Today's Journey: We will explore both approaches to understand when and why to use each method
Natural Language Processing (NLP) is a field of artificial intelligence that enables computers to understand, interpret, and generate human language.
"I saw a man with a telescope"
Did I use a telescope to see him? Or did he have a telescope?
"The bank is closed"
Financial institution or river bank?
"Great! Another meeting..."
Positive words, negative sentiment
"Book" as a noun vs. verb
"Read this book" vs. "Book a flight"
Chatbots, automated support, sentiment analysis
Document classification, information extraction
Multi-language support, localization
Source: Gartner NLP Market Analysis 2024
Before machines can analyze text, raw text must be transformed into a structured format. This involves several essential steps:
Breaking text into individual words or tokens
Remove punctuation, special characters, and noise
Filter out common words with little meaning (a, an, the)
Stemming or lemmatization to standardize word forms
Tokenization breaks continuous text into discrete units (tokens), typically words or subwords.
"The product works well! I would recommend this product to others."
Result: 13 individual tokens (11 words plus 2 punctuation marks) that can be analyzed independently
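A minimal sketch of word-level tokenization for the sentence above, assuming a simple regular-expression rule (the function name `tokenize` and the pattern are illustrative choices, not a specific library's tokenizer):

```python
import re

def tokenize(text):
    # \w+ matches a run of letters or digits; [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The product works well! I would recommend this product to others.")
print(tokens)
print(len(tokens))  # 11 words plus "!" and "." give 13 tokens
```

Production tokenizers handle contractions, URLs, and subwords differently, so counts can vary from tool to tool.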
Stop words are common words that appear frequently but carry little semantic meaning. Removing them reduces noise and computational complexity.
The, product, works, well, I, would, recommend, this, product, to, others
11 tokens
product, works, well, recommend, product, others
6 tokens
Impact: 45% reduction in tokens while preserving core meaning
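The reduction above can be reproduced with a short sketch. The stop-word set here is an assumption, chosen only to cover this sentence; real stop-word lists contain a few hundred entries.

```python
# Assumption: a tiny illustrative stop-word list, not a standard one
STOP_WORDS = {"the", "i", "would", "this", "to"}

tokens = ["The", "product", "works", "well", "I", "would",
          "recommend", "this", "product", "to", "others"]

# Lowercase each token and keep it only if it is not a stop word
filtered = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]
reduction = 1 - len(filtered) / len(tokens)

print(filtered)
print(f"{len(tokens)} -> {len(filtered)} tokens ({reduction:.0%} reduction)")
```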
Normalization reduces words to their base or root form, helping machines recognize that different forms represent the same concept.
Definition: Crude rule-based cutting of word endings
Examples:
Advantage: Fast and simple
Limitation: Can produce non-words
Definition: Linguistically informed reduction to dictionary form
Examples:
Advantage: Always produces valid words
Limitation: Slower; requires a dictionary and part-of-speech information
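To make the contrast concrete, here is a toy sketch. Both the suffix rules and the mini-dictionary are hypothetical and far smaller than what a real stemmer (such as Porter) or lemmatizer uses; the point is only that rule-based cutting can yield non-words while a dictionary lookup cannot.

```python
# Hypothetical suffix rules for illustration; this is not the Porter stemmer
SUFFIXES = ("ing", "ly", "ed", "es", "s")

def crude_stem(word):
    # Strip the first matching suffix, keeping at least 3 characters of stem
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping word forms to dictionary forms
LEMMAS = {"studies": "study", "running": "run", "better": "good"}

def lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged
    return LEMMAS.get(word, word)

for w in ["studies", "running", "better"]:
    print(w, "| stem:", crude_stem(w), "| lemma:", lemmatize(w))
```

Here the stemmer turns "studies" into the non-word "studi", while the lookup returns "study".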
Let's see how all steps transform a customer review:
"The product works amazingly well! I would definitely recommend this product to other customers."
The | product | works | amazingly | well | ! | I | would | definitely | recommend | this | product | to | other | customers | .
product | works | amazingly | well | definitely | recommend | product | customers
product | work | amazing | well | definite | recommend | product | customer
Final Result: Clean, normalized tokens ready for machine learning analysis
The Bag-of-Words (BoW) model represents text as a collection of word frequencies, ignoring grammar and word order.
Create vocabulary of all unique words
Count word occurrences in each document
Represent as numerical vector
["bad", "good", "great", "product", "recommend", "terrible", "well"]
"The product works well! I would recommend the product."
[0, 0, 0, 2, 1, 0, 1]
"Terrible product. Bad quality, not good."
[1, 1, 0, 1, 0, 1, 0]
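The two vectors above can be reproduced with a few lines of code. This is a sketch that assumes lowercase matching against the fixed seven-word vocabulary; the function name `bow_vector` is illustrative.

```python
import re
from collections import Counter

VOCABULARY = ["bad", "good", "great", "product", "recommend", "terrible", "well"]

def bow_vector(text, vocabulary=VOCABULARY):
    # Count every lowercase word, then read off the counts in vocabulary order
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]

doc1 = bow_vector("The product works well! I would recommend the product.")
doc2 = bow_vector("Terrible product. Bad quality, not good.")
print(doc1)  # [0, 0, 0, 2, 1, 0, 1]
print(doc2)  # [1, 1, 0, 1, 0, 1, 0]
```

Words outside the vocabulary, such as "not" or "quality", leave no trace in the vector.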
Sentence 1: "The product is not terrible"
Vector: [0, 0, 0, 1, 0, 1, 0]
Sentence 2: "The product is terrible"
Vector: [0, 0, 0, 1, 0, 1, 0]
Problem: Opposite meanings produce identical vectors!
Despite its limitations, BoW remains effective for many classification tasks and serves as a foundation for more advanced methods.
Naïve Bayes is a probabilistic classification algorithm based on Bayes' Theorem, with an assumption of independence between features.
Given a document with words, calculate the probability it belongs to each category (Positive, Negative, Neutral)
P(Category | Words) = [P(Words | Category) × P(Category)] / P(Words)
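Written out for a document with words w_1, ..., w_n and a category c, the formula above becomes easy to compute once the words are treated as independent given the category, because the likelihood then factorizes into per-word terms. The symbols c, w_i, and n are notation introduced here for illustration:

```latex
P(c \mid w_1, \dots, w_n)
  = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}
  \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The denominator P(w_1, ..., w_n) is the same for every category, so it can be dropped when only the most likely category is needed.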
"Naïve" Assumption: The algorithm assumes words are independent (appearing of one word doesn't affect others). While unrealistic, this simplification makes computation tractable and often works well in practice.
The model learns from labeled training data by calculating word probabilities for each category.
| Review | Words Present | Label |
|---|---|---|
| "Great location, clean rooms" | great, location, clean, room | Positive |
| "Terrible service, dirty bathroom" | terrible, service, dirty, bathroom | Negative |
| "Excellent staff, would recommend" | excellent, staff, recommend | Positive |
| "Poor value, noisy location" | poor, value, noisy, location | Negative |
| "Good breakfast, friendly staff" | good, breakfast, friendly, staff | Positive |
Let's classify a new review: "Great staff and good location"
Words: ["great", "staff", "good", "location"]
For Positive:
For Negative:
Positive: 87% probability
Negative: 13% probability
Classification: POSITIVE
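The classification can be sketched end to end using the five training reviews from the table. This version assumes add-one smoothing and multiplies raw probabilities; because the exact percentages depend on those choices, it is not guaranteed to reproduce the 87% / 13% figures above, only the same winning class.

```python
from collections import Counter

TRAINING = [
    (["great", "location", "clean", "room"], "Positive"),
    (["terrible", "service", "dirty", "bathroom"], "Negative"),
    (["excellent", "staff", "recommend"], "Positive"),
    (["poor", "value", "noisy", "location"], "Negative"),
    (["good", "breakfast", "friendly", "staff"], "Positive"),
]

def train(examples):
    # Count documents per label and word occurrences per label
    doc_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in doc_counts}
    for words, label in examples:
        word_counts[label].update(words)
    vocabulary = {w for words, _ in examples for w in words}
    return doc_counts, word_counts, vocabulary

def classify(words, doc_counts, word_counts, vocabulary):
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        score = doc_counts[label] / total_docs  # prior P(Category)
        total_words = sum(word_counts[label].values())
        for w in words:
            # Assumption: add-one smoothing, so unseen words do not zero the score
            score *= (word_counts[label][w] + 1) / (total_words + len(vocabulary))
        scores[label] = score
    norm = sum(scores.values())  # plays the role of P(Words)
    return {label: s / norm for label, s in scores.items()}

model = train(TRAINING)
posterior = classify(["great", "staff", "good", "location"], *model)
print(posterior)
print(max(posterior, key=posterior.get))
```

With longer documents, implementations usually sum log-probabilities instead of multiplying, to avoid numerical underflow.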
What happens when a word never appears in training data for a particular category?
New review: "The breakfast was excellent"
If "excellent" never appeared in negative reviews during training, then:
P(excellent | Negative) = 0
Result: Entire probability calculation becomes 0, regardless of other words!
Add a small count (typically 1) to all word frequencies, ensuring no probability is exactly zero.
"excellent" in negative: 0/100 = 0%
Causes complete failure
"excellent" in negative: 1/105 = 0.95%
Small but non-zero probability
Impact: Smoothing prevents model breakdown while minimally affecting overall accuracy
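A small sketch of the smoothing arithmetic. The vocabulary size of 5 is an assumption inferred from the denominators above (100 becoming 105 when 1 is added for each word); the function and parameter names are illustrative.

```python
def smoothed_probability(word_count, total_count, vocabulary_size, alpha=1):
    # Add alpha to the word's count and alpha per vocabulary word to the total
    return (word_count + alpha) / (total_count + alpha * vocabulary_size)

unsmoothed = 0 / 100
smoothed = smoothed_probability(0, 100, vocabulary_size=5)

print(f"without smoothing: {unsmoothed:.2%}")
print(f"with smoothing:    {smoothed:.2%}")
```

Setting `alpha` below 1 gives a lighter correction, which some practitioners prefer for large vocabularies.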
Bag-of-Words, Naïve Bayes, TF-IDF
Limitation: No understanding of context or meaning
Word2Vec, GloVe - words represented as dense vectors
Breakthrough: Captured semantic relationships (king - man + woman ≈ queen)
"Attention is All You Need" paper introduced transformers
Innovation: Attention mechanism allows models to focus on relevant words
GPT series, BERT, Claude, DeepSeek
Capability: Human-like text understanding and generation
Transformers fundamentally changed how machines process language by introducing the "attention mechanism"
Traditional models process words sequentially (left to right)
Transformers process all words simultaneously, with attention determining which words are most relevant to each other
Processes word-by-word, may lose context of what "it" refers to by the time it reaches "tired"
Simultaneously considers all words, correctly identifies "it" refers to "animal" (not "street") based on "tired"
Result: Much better understanding of context, relationships, and meaning across entire documents
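The core computation behind attention can be sketched in plain Python as scaled dot-product attention. This is a simplification under stated assumptions: the three 2-dimensional token vectors are made up, and the same vectors serve as queries, keys, and values, whereas a real transformer derives each from learned projections.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # how strongly this token attends to each token
        # Weighted average of the value vectors
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors for three tokens
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, and every token's row is computed from all tokens at once rather than left to right.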
"Translate this to Spanish"
Understands input context
Multiple layers with attention
Generates output
Attends to encoder + previous words
"Traduce esto al español"
When translating "The cat sat on the mat":
Each word dynamically focuses on the most relevant other words
Large Language Models are transformer-based neural networks trained on massive text datasets to understand and generate human language.
Parameters are the learned weights in the neural network. More parameters generally mean:
For context: The human brain has approximately 100 trillion synaptic connections
Key Capability: LLMs can perform tasks they weren't explicitly trained for through "emergent abilities" - complex behaviors arising from scale
| Model | Release | Parameters | Key Capability | Business Impact |
|---|---|---|---|---|
| GPT-2 | Feb 2019 | 1.5 billion | Coherent text generation | Proof of concept |
| GPT-3 | May 2020 | 175 billion | Few-shot learning, improved reasoning | First commercial applications |
| GPT-3.5 | Jan 2022 | 175 billion | Reduced toxicity, better instruction following | Foundation for ChatGPT |
| GPT-4 | Mar 2023 | ~1 trillion (est.) | Multimodal (text + images), improved reasoning | Professional-grade AI assistant |
| GPT-4o / DeepSeek | 2024-2025 | Undisclosed | Vision, audio, code, multilingual excellence | Enterprise integration, specialized tasks |
Write articles, emails, reports, creative content
Write and debug code in multiple languages
Translate between 100+ languages
Condense long documents into key points
Answer questions based on context
Detect emotion and tone in text
| Aspect | Traditional NLP | Large Language Models |
|---|---|---|
| Training Data | Task-specific labeled data | Massive unlabeled text corpus |
| Context Understanding | Limited (bag-of-words) | Deep contextual understanding |
| Setup Time | Days to weeks for data labeling | Minutes (few examples needed) |
| Performance on New Tasks | Requires retraining | Immediate with prompting |
| Cost | Low (after initial development) | Higher (API fees or compute) |
Task: Predict the next word in billions of text sequences
Data: Large portions of the public internet - books, websites, articles, code repositories
Duration: Months on thousands of GPUs
Cost: $10M - $100M+
Task: Learn to follow instructions and format responses
Data: Human-labeled examples of desired behavior
Example: "Write a professional email rejecting a job candidate" → [Expected response]
Task: Learn to generate responses humans prefer
Process: Human evaluators rank multiple model responses; model learns from preferences
Goal: Helpful, harmless, honest responses
Key Insight: The model is not built to store its training data verbatim - it learns patterns and relationships that allow it to generate novel text, although it can still reproduce passages it saw often during training
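The pre-training task, predicting the next word, can be illustrated at toy scale with a bigram counter. This is a deliberately simplified stand-in and not how an LLM works internally: an LLM uses a transformer over subword tokens and far longer context. The corpus here is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus
corpus = "the product works well . the product arrived late . the staff works well ."

def train_bigrams(text):
    # For each word, count which words follow it
    words = text.split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    # Return the most frequent follower, or None for an unseen word
    candidates = model.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

model = train_bigrams(corpus)
print(predict_next(model, "the"))    # product
print(predict_next(model, "works"))  # well
```

Even this tiny model stores statistics about which words follow which, not the sentences themselves.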
Use Case: AI-powered customer support chatbots
Use Case: Marketing content generation
Use Case: Automated report generation from structured data
Transform databases into narrative insights without manual analysis
Use Case: GitHub Copilot, AI pair programming
Developers report 55% faster task completion (GitHub, 2023)
LLMs can confidently generate false information that sounds plausible
Example: Asked about a non-existent book, may invent realistic-sounding plot summary
Models are trained on data up to a specific date - they don't know what happened after
Impact: Cannot provide current news, recent events, or latest research
LLMs pattern-match rather than comprehend - they don't have mental models of the world
Example: May fail at basic spatial reasoning or logical consistency
Running large models requires significant computing power
Impact: API costs can be substantial for high-volume applications
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| Simple spam detection with 10,000 labeled emails | Traditional (Naïve Bayes) | Fast, cheap, explainable, high accuracy for this specific task |
| Customer service chatbot handling diverse queries | LLM (GPT-4) | Needs context understanding, handles unexpected questions |
| Regulatory document classification (banking) | Traditional + LLM Hybrid | Need explainability for compliance, but benefit from LLM understanding |
| Processing 1 million reviews per day | Traditional (cost considerations) | Traditional: $500/month vs LLM: $50,000/month |
| Multilingual customer support (20+ languages) | LLM | Training 20 traditional models vs 1 LLM |
| Real-time sentiment monitoring dashboard | Traditional | Speed critical, simpler models process faster |
Best Practice: Hybrid approach - use traditional NLP for filtering/preprocessing, then LLM for complex cases requiring nuanced understanding
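One way to picture the hybrid approach is a confidence-based router. Everything in this sketch is hypothetical: the threshold of 0.8, the function names, and the keyword classifier standing in for a trained traditional model. The LLM call itself is left out, since it depends on the provider.

```python
def route_review(review, classify, threshold=0.8):
    # Ask the cheap traditional model first
    label, confidence = classify(review)
    if confidence >= threshold:
        return ("traditional", label)
    # Low confidence: the caller would forward this review to an LLM
    return ("llm", None)

def keyword_classifier(review):
    # Hypothetical stand-in for a trained model such as Naive Bayes
    text = review.lower()
    if "terrible" in text:
        return "Negative", 0.95
    if "great" in text:
        return "Positive", 0.95
    return "Neutral", 0.40

print(route_review("Great staff and good location", keyword_classifier))
print(route_review("Well, that was an experience", keyword_classifier))
```

Raising the threshold sends more reviews to the LLM, which trades higher cost for more nuanced handling.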
LLMs should enhance your learning and thinking, not replace it. Your submissions should reflect your understanding, voice, and effort.
When in doubt, declare your AI use. For example:
"I used ChatGPT to help explain the concept of transformers in simple language, then wrote this explanation in my own words based on my understanding."
Text pre-processing, Bag-of-Words, and Naïve Bayes remain valuable for many business applications - especially when explainability and cost are priorities
The attention mechanism enabled models to understand context and relationships across entire documents, dramatically improving NLP capabilities
LLMs like GPT-4 can perform diverse language tasks with minimal examples, but come with limitations including hallucinations and computational costs
Choose between traditional and modern approaches based on task complexity, budget, explainability needs, and scale
Use AI as a tool to enhance your capabilities, not replace your thinking. Maintain academic integrity and transparency
Next Steps: Hands-on activities with Orange Data Mining and ChatGPT to apply these concepts to real datasets