Natural Language Processing and Generative AI

DATA4800 - Workshop 9

Understanding how machines process and generate human language

Workshop Learning Outcomes

1. Explore Traditional NLP and Machine Learning

Understand text pre-processing, classification techniques, and the Naïve Bayes algorithm

2. Explore Large Language Models and Business Applications

Discover transformers, GPT evolution, and real-world use cases in business contexts

3. Understand Ethical Considerations

Examine responsible use of LLMs in academic and professional environments

The Business Challenge

Scenario: E-commerce Customer Feedback Analysis

Your online retail company receives 10,000 customer reviews daily across multiple platforms

The Problem

Manual review reading is impossible at scale
Response time directly impacts customer satisfaction
Competitive advantage requires real-time insights
Negative reviews need immediate attention

Business Impact

10,000 Daily reviews

24h Manual processing time

$2M Annual manual cost

Two Approaches to Text Analysis

2000-2015

Traditional NLP + Machine Learning

Approach: Feature engineering, statistical models, rule-based systems

Example: Naïve Bayes classifier, Bag-of-Words

Characteristics: Requires labeled data, explainable, limited context understanding

2017-2025

Modern Large Language Models

Approach: Deep learning, transformer architecture, pre-trained models

Example: GPT-4, ChatGPT, Claude

Characteristics: Context-aware, little task-specific training needed, human-like fluency

Today's Journey: We will explore both approaches to understand when and why to use each method

What is Natural Language Processing?

Natural Language Processing (NLP) is a field of artificial intelligence that enables computers to understand, interpret, and generate human language.

Key Challenges in NLP

Ambiguity

"I saw a man with a telescope"

Did I use a telescope to see him? Or did he have a telescope?

Context Dependency

"The bank is closed"

Financial institution or river bank?

Sarcasm and Irony

"Great! Another meeting..."

Positive words, negative sentiment

Multiple Meanings

"Book" as a noun vs. verb

"Read this book" vs. "Book a flight"

Business Applications of NLP

Customer Service

Chatbots, automated support, sentiment analysis

→

Content Analysis

Document classification, information extraction

→

Translation

Multi-language support, localization

Industry Statistics

Customer Service Automation

85% adoption

Sentiment Analysis

72% adoption

Document Processing

68% adoption

Speech Recognition

61% adoption

Source: Gartner NLP Market Analysis 2024

Text Pre-processing Pipeline

Before machines can analyze text, raw text must be transformed into a structured format. This involves several essential steps:

1. Tokenization

Breaking text into individual words or tokens

↓

2. Cleaning

Remove punctuation, special characters, and noise

↓

3. Stop Word Removal

Filter out common words with little meaning (a, an, the)

↓

4. Normalization

Stemming or lemmatization to standardize word forms

Step 1: Tokenization

Tokenization breaks continuous text into discrete units (tokens), typically words or subwords.

Original Text

"The product works well! I would recommend this product to others."

After Tokenization

The product works well ! I would recommend this product to others .

Result: 13 individual tokens that can be analyzed independently
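A minimal sketch of this step, assuming a simple regular-expression tokenizer (production tokenizers handle contractions, URLs, and subwords with more care). The function name `tokenize` is illustrative only.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

review = "The product works well! I would recommend this product to others."
tokens = tokenize(review)
print(tokens)
print(len(tokens))  # 13
```

Punctuation marks are kept as separate tokens here, which is why the count is 13 rather than 11.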

Step 2: Stop Word Removal

Stop words are common words that appear frequently but carry little semantic meaning. Removing them reduces noise and computational complexity.

Common Stop Words

the, a, an, is, to, in, of, I, would, this

Before Removal (punctuation already removed)

The, product, works, well, I, would, recommend, this, product, to, others

11 tokens

After Removal

product, works, well, recommend, product, others

6 tokens

Impact: 45% reduction in tokens while preserving core meaning
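The filtering step can be sketched as below, using only the ten stop words listed on this slide. Real stop-word lists are considerably longer, so treat `STOP_WORDS` as an illustrative assumption.

```python
# Stop-word list copied from the slide; library lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "to", "in", "of", "i", "would", "this"}

def remove_stop_words(tokens):
    # Compare in lowercase so "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

before = ["The", "product", "works", "well", "I", "would",
          "recommend", "this", "product", "to", "others"]
after = remove_stop_words(before)
reduction = 1 - len(after) / len(before)
print(after)
print(f"{len(before)} -> {len(after)} tokens ({reduction:.0%} reduction)")
```

The reduction is 5 of 11 tokens, which rounds to the 45% quoted above.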

Step 3: Text Normalization

Normalization reduces words to their base or root form, helping machines recognize that different forms represent the same concept.

Stemming

Definition: Crude rule-based cutting of word endings

Examples:

running → run
worked → work
better → better
caring → car (error: the stem is a different word)

Advantage: Fast and simple

Limitation: Can produce non-words

Lemmatization

Definition: Linguistically informed reduction to dictionary form

Examples:

running → run
worked → work
better → good
caring → care

Advantage: Always produces valid words

Limitation: Slower, requires a dictionary and part-of-speech information
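To make the contrast concrete, here is a toy comparison. `naive_stem` is a deliberately crude suffix stripper (it is not the Porter algorithm, which would handle "caring" better), and `LEMMAS` is a hypothetical mini-dictionary standing in for a real lemmatizer's lexicon.

```python
def naive_stem(word: str) -> str:
    """Crude suffix stripping; a toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

# Hypothetical lookup table; a real lemmatizer uses a full dictionary.
LEMMAS = {"running": "run", "worked": "work", "better": "good", "caring": "care"}

def lookup_lemma(word: str) -> str:
    return LEMMAS.get(word, word)

for w in ["running", "worked", "better", "caring"]:
    print(f"{w:8} stem={naive_stem(w):7} lemma={lookup_lemma(w)}")
```

The stemmer only looks at letters, so it cannot link "better" to "good" and it cuts "caring" down to "car". The lemmatizer gets both right, but only because the dictionary tells it the answer.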

Complete Pre-processing Pipeline

Let's see how all steps transform a customer review:

Original Review

"The product works amazingly well! I would definitely recommend this product to other customers."

↓

Step 1: Tokenization

↓

Step 2: Remove Punctuation & Stop Words

↓

Step 3: Lemmatization

Final Result: Clean, normalized tokens ready for machine learning analysis
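As a rough end-to-end sketch of the three steps on this review, assuming the short stop-word list shown earlier and a hypothetical two-entry lemma table (a different stop-word list or lemmatizer would give a slightly different result):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "in", "of", "i", "would", "this"}
# Hypothetical lemma table covering only the words in this example.
LEMMAS = {"works": "work", "customers": "customer"}

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())  # Step 1: tokenize
    words = [t for t in tokens if t.isalnum()]          # Step 2a: drop punctuation
    kept = [t for t in words if t not in STOP_WORDS]    # Step 2b: drop stop words
    return [LEMMAS.get(t, t) for t in kept]             # Step 3: lemmatize

review = ("The product works amazingly well! I would definitely "
          "recommend this product to other customers.")
print(preprocess(review))
```

With these assumptions the 16 raw tokens shrink to 9 normalized ones, starting with "product", "work", "amazingly".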

Knowledge Check: Text Pre-processing

Which pre-processing step would be most important for reducing the vocabulary size in a sentiment analysis model?

A) Tokenization only

B) Stop word removal only

C) Lemmatization combined with stop word removal

D) Keeping all words in their original form

Bag-of-Words Model

The Bag-of-Words (BoW) model represents text as a collection of word frequencies, ignoring grammar and word order.

How It Works

Step 1

Create vocabulary of all unique words

→

Step 2

Count word occurrences in each document

→

Step 3

Represent as numerical vector

Example: Vocabulary

["bad", "good", "great", "product", "recommend", "terrible", "well"]

Review 1

"The product works well! I would recommend the product."

[0, 0, 0, 2, 1, 0, 1]

Review 2

"Terrible product. Bad quality, not good."

[1, 1, 0, 1, 0, 1, 0]
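The two vectors above can be reproduced with a few lines of code. This is a minimal sketch that lowercases the text and counts only words in the fixed vocabulary; the helper name `bow_vector` is illustrative.

```python
import re
from collections import Counter

VOCAB = ["bad", "good", "great", "product", "recommend", "terrible", "well"]

def bow_vector(text: str, vocab=VOCAB) -> list[int]:
    """Count how often each vocabulary word occurs; other words are ignored."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return [counts[word] for word in vocab]

review_1 = "The product works well! I would recommend the product."
review_2 = "Terrible product. Bad quality, not good."
print(bow_vector(review_1))  # [0, 0, 0, 2, 1, 0, 1]
print(bow_vector(review_2))  # [1, 1, 0, 1, 0, 1, 0]
```

Words outside the vocabulary, such as "not" or "quality", simply vanish from the vector, so word order and negation are lost.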

Limitations of Bag-of-Words

What BoW Captures

Word frequency information
Vocabulary presence/absence
Simple pattern recognition
Computational efficiency

What BoW Misses

Word order and grammar
Context and relationships
Semantic meaning
Negations ("not good")

Critical Example

Sentence 1: "The product is not terrible"

Vector: [0, 0, 0, 1, 0, 1, 0]

Sentence 2: "The product is terrible"

Vector: [0, 0, 0, 1, 0, 1, 0]

Problem: Opposite meanings produce identical vectors!

Despite these limitations, BoW remains effective for many classification tasks and serves as a foundation for more advanced methods.

Naïve Bayes Classifier

Naïve Bayes is a probabilistic classification algorithm based on Bayes' Theorem, with an assumption of independence between features.

Core Concept

Given a document with words, calculate the probability it belongs to each category (Positive, Negative, Neutral)

Bayes' Theorem (Simplified)

P(Category | Words) = [P(Words | Category) × P(Category)] / P(Words)

P(Category | Words): Probability review is positive given these words
P(Words | Category): Probability these words appear in positive reviews
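P(Category): Overall proportion of positive reviews
P(Words): Overall probability of these words, identical for every category, so it can be ignored when comparing categories

"Naïve" Assumption: The algorithm assumes words are independent of one another within a category (the appearance of one word doesn't affect the others). While unrealistic, this simplification makes computation tractable and often works well in practice.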

Training a Naïve Bayes Classifier

The model learns from labeled training data by calculating word probabilities for each category.
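These per-category word probabilities are what the factorized form of Bayes' theorem consumes. In the notation assumed here (c for a category, w_1 ... w_n for the words in a document), the independence assumption turns P(Words | Category) into a product of single-word terms, and the denominator P(Words) drops out because it is the same for every category:

```latex
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

Training therefore only needs to estimate P(c) and each P(w_i | c) from counts.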

Training Dataset Example (Hotel Reviews)

Review	Words Present	Label
"Great location, clean rooms"	great, location, clean, room	Positive
"Terrible service, dirty bathroom"	terrible, service, dirty, bathroom	Negative
"Excellent staff, would recommend"	excellent, staff, recommend	Positive
"Poor value, noisy location"	poor, value, noisy, location	Negative
"Good breakfast, friendly staff"	good, breakfast, friendly, staff	Positive

60% Positive Reviews

40% Negative Reviews
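A small sketch of what "training" amounts to for this table: counting reviews per label and, for each label, how many reviews contain each word. The variable names are illustrative.

```python
from collections import Counter, defaultdict

# (words present, label) pairs copied from the table above.
TRAINING = [
    (["great", "location", "clean", "room"], "Positive"),
    (["terrible", "service", "dirty", "bathroom"], "Negative"),
    (["excellent", "staff", "recommend"], "Positive"),
    (["poor", "value", "noisy", "location"], "Negative"),
    (["good", "breakfast", "friendly", "staff"], "Positive"),
]

doc_counts = Counter(label for _, label in TRAINING)
priors = {label: n / len(TRAINING) for label, n in doc_counts.items()}

# word_docs[label][word] = number of reviews with that label containing the word
word_docs = defaultdict(Counter)
for words, label in TRAINING:
    word_docs[label].update(set(words))

print(priors)  # {'Positive': 0.6, 'Negative': 0.4}
print(word_docs["Positive"]["staff"], "of", doc_counts["Positive"])
print(word_docs["Negative"]["location"], "of", doc_counts["Negative"])
```

The priors match the 60% / 40% split shown above, and the per-word counts are the raw material for the prediction step.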

Making Predictions with Naïve Bayes

Let's classify a new review: "Great staff and good location"

Step 1: Extract Words

Words: ["great", "staff", "good", "location"]

↓

Step 2: Calculate Probabilities

For Positive:

"great" appears in 1/3 positive reviews
"staff" appears in 2/3 positive reviews
"good" appears in 1/3 positive reviews
"location" appears in 1/3 positive reviews

For Negative:

"great" appears in 0/2 negative reviews
"staff" appears in 0/2 negative reviews
"good" appears in 0/2 negative reviews
"location" appears in 1/2 negative reviews

↓

Step 3: Final Prediction

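Positive: 87% probability

Negative: 13% probability

(Illustrative figures. With the raw counts above, the Negative score would be exactly zero, so a non-zero result assumes smoothing, covered in the next section; exact values depend on the smoothing used.)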

Classification: POSITIVE
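The calculation can be sketched as follows. The counts are taken directly from the training table ("great" occurs in one positive review). Two assumptions are made that the slide does not specify: add-one smoothing over the two outcomes "word present / word absent", and scoring only the words that occur in the new review. Under those assumptions the result is close to, but not identical to, the illustrative percentages above.

```python
# Per-class review counts and per-word review counts from the training table.
N_DOCS = {"Positive": 3, "Negative": 2}
WORD_DOCS = {
    "Positive": {"great": 1, "staff": 2, "good": 1, "location": 1},
    "Negative": {"great": 0, "staff": 0, "good": 0, "location": 1},
}

def posterior(words, alpha=1.0):
    """Normalized class probabilities using only the words in the new review."""
    total_docs = sum(N_DOCS.values())
    scores = {}
    for label, n in N_DOCS.items():
        score = n / total_docs  # prior P(Category)
        for w in words:
            # Add-alpha smoothing over present/absent (an assumption).
            score *= (WORD_DOCS[label].get(w, 0) + alpha) / (n + 2 * alpha)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

result = posterior(["great", "staff", "good", "location"])
print({label: round(p, 2) for label, p in result.items()})
```

Calling `posterior(..., alpha=0)` switches smoothing off, and the Negative probability collapses to exactly zero because three of the four words never occur in a negative review.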

The Zero-Probability Problem

What happens when a word never appears in training data for a particular category?

Problem Scenario

New review: "The breakfast was excellent"

If "excellent" never appeared in negative reviews during training, then:

P(excellent | Negative) = 0

Result: Entire probability calculation becomes 0, regardless of other words!

Solution: Laplace Smoothing

Add a small count (typically 1) to all word frequencies, ensuring no probability is exactly zero.

Without Smoothing

"excellent" in negative: 0/100 = 0%

Causes complete failure

With Smoothing

"excellent" in negative: (0 + 1) / (100 + 5) = 1/105 ≈ 0.95% (add 1 to the count and the vocabulary size, here 5 words, to the total)

Small but non-zero probability

Impact: Smoothing prevents model breakdown while minimally affecting overall accuracy
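The smoothing rule itself is a one-line formula. In this sketch the vocabulary size of 5 is inferred from the slide's denominator (100 + 5 = 105) rather than stated anywhere, so treat it as an assumption.

```python
def smoothed_probability(word_count, total_count, vocab_size, alpha=1):
    """Add-alpha (Laplace when alpha=1) estimate of P(word | category)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# 100 observed words in the category, 5-word vocabulary (inferred assumption).
print(smoothed_probability(0, 100, 5))           # 1/105, small but non-zero
print(smoothed_probability(0, 100, 5, alpha=0))  # 0.0 without smoothing
```

Larger values of `alpha` pull all word probabilities closer together, while values below 1 smooth more gently.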

Naïve Bayes in Practice

Strengths

Fast training and prediction
Works well with small datasets
Handles high-dimensional data
Transparent and explainable
Requires minimal tuning

Limitations

Independence assumption often violated
Cannot capture word order
Struggles with sarcasm
Limited context understanding
Requires labeled training data

Real-World Performance Metrics

Spam Detection

95% accuracy

Sentiment Analysis (Simple)

82% accuracy

Document Classification

88% accuracy

Sentiment Analysis (Complex)

68% accuracy

Knowledge Check: Naïve Bayes

Why is Laplace smoothing necessary in Naïve Bayes classifiers?

A) To increase the accuracy of the model

B) To prevent zero probabilities when a word hasn't appeared in the training data for a category

C) To reduce computational complexity

D) To handle multiple languages

The Evolution: From Statistics to Neural Networks

2000s

Statistical NLP

Bag-of-Words, Naïve Bayes, TF-IDF

Limitation: No understanding of context or meaning

2013

Word Embeddings

Word2Vec, GloVe - words represented as dense vectors

Breakthrough: Captured semantic relationships (king - man + woman ≈ queen)

2017

Transformer Revolution

"Attention Is All You Need" paper introduced transformers

Innovation: Attention mechanism allows models to focus on relevant words

2018-2025

Large Language Models Era

GPT series, BERT, Claude, DeepSeek

Capability: Human-like text understanding and generation

Key Innovation: The Transformer Architecture

Transformers fundamentally changed how machines process language by introducing the "attention mechanism"

The Core Concept

Traditional models process words sequentially (left to right)

Transformers process all words simultaneously, with attention determining which words are most relevant to each other

Example: Translating "The animal didn't cross the street because it was too tired"

Traditional Approach

Processes word-by-word, may lose context of what "it" refers to by the time it reaches "tired"

Transformer Approach

Simultaneously considers all words, correctly identifies "it" refers to "animal" (not "street") based on "tired"

Result: Much better understanding of context, relationships, and meaning across entire documents

Transformer Components

Input Text

"Translate this to Spanish"

→

Encoder

Understands input context

Multiple layers with attention

→

Decoder

Generates output

Attends to encoder + previous words

→

Output Text

"Traduce esto al español"

Attention Mechanism in Action

When translating "The cat sat on the mat":

"The" pays high attention to "cat" (the noun's gender decides le/la in French)
"sat" pays high attention to "cat" (subject-verb agreement)
"mat" pays attention to "on" (preposition relationship)

Each word dynamically focuses on the most relevant other words
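The core computation behind this can be sketched as scaled dot-product self-attention. This is a heavily simplified illustration: the 2-d word vectors are hand-picked and hypothetical, and queries, keys, and values are all the same vectors, whereas real transformers learn separate projections and use many attention heads.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values."""
    d = len(vectors[0])
    weights, outputs = [], []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all value vectors.
        outputs.append([sum(wi * v[j] for wi, v in zip(w, vectors)) for j in range(d)])
    return weights, outputs

# Hypothetical 2-d embeddings, chosen by hand purely for illustration.
words = ["cat", "sat", "mat"]
vectors = [[1.0, 0.2], [0.9, 0.4], [0.1, 1.0]]
weights, _ = self_attention(vectors)
for word, row in zip(words, weights):
    print(word, [round(x, 2) for x in row])
```

Each printed row sums to 1 and shows how much that word "looks at" every word in the sentence, including itself.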

Knowledge Check: Transformers

What is the primary advantage of the attention mechanism in transformers compared to traditional sequential models?

A) It processes text faster

B) It requires less training data

C) It can consider relationships between all words simultaneously, regardless of distance

D) It uses less computer memory

Large Language Models (LLMs)

Large Language Models are transformer-based neural networks trained on massive text datasets to understand and generate human language.

What Makes Them "Large"?

175B Parameters (GPT-3)

~1T Parameters (GPT-4, unconfirmed estimate)

What are Parameters?

Parameters are the learned weights in the neural network. More parameters generally mean:

Greater capacity to capture complex patterns
Better understanding of nuanced language
Ability to perform diverse tasks without specific training

For context: The human brain has approximately 100 trillion synaptic connections

Key Capability: LLMs can perform tasks they weren't explicitly trained for through "emergent abilities" - complex behaviors arising from scale

Evolution of GPT Models

Model	Release	Parameters	Key Capability	Business Impact
GPT-2	Feb 2019	1.5 billion	Coherent text generation	Proof of concept
GPT-3	May 2020	175 billion	Few-shot learning, improved reasoning	First commercial applications
GPT-3.5	2022	175 billion	Reduced toxicity, better instruction following	Foundation for ChatGPT
GPT-4	Mar 2023	~1 trillion (est.)	Multimodal (text + images), improved reasoning	Professional-grade AI assistant
GPT-4o / DeepSeek	2024-2025	Undisclosed (GPT-4o)	Vision, audio, code, multilingual excellence	Enterprise integration, specialized tasks

667× Parameter increase (GPT-2 to GPT-4)

$100M+ Estimated training cost (GPT-4)

What LLMs Can Do

Text Generation

Write articles, emails, reports, creative content

Code Generation

Write and debug code in multiple languages

Translation

Translate between 100+ languages

Summarization

Condense long documents into key points

Question Answering

Answer questions based on context

Sentiment Analysis

Detect emotion and tone in text

Comparison: Traditional NLP vs. LLMs

Aspect	Traditional NLP	Large Language Models
Training Data	Task-specific labeled data	Massive unlabeled text corpus
Context Understanding	Limited (bag-of-words)	Deep contextual understanding
Setup Time	Days-weeks for data labeling	Minutes (few examples needed)
Performance on New Tasks	Requires retraining	Immediate with prompting
Cost	Low (after initial development)	Higher (API fees or compute)

How LLMs Are Trained

Stage 1: Pre-training

Task: Predict the next word in billions of text sequences

Data: Large-scale text from the internet and other sources - books, websites, articles, code repositories

Duration: Months on thousands of GPUs

Cost: $10M - $100M+
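Next-word prediction is commonly written as minimizing the negative log-likelihood of each word given the words before it. The notation below is assumed for illustration (w_t is the t-th word or token in a sequence of length T, and θ stands for the model's parameters); in practice models predict subword tokens rather than whole words.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(w_t \mid w_1, \dots, w_{t-1}\right)
```

A lower loss means the model assigned higher probability to the words that actually came next.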

↓

Stage 2: Supervised Fine-tuning

Task: Learn to follow instructions and format responses

Data: Human-labeled examples of desired behavior

Example: "Write a professional email rejecting a job candidate" → [Expected response]

↓

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

Task: Learn to generate responses humans prefer

Process: Human evaluators rank multiple model responses; model learns from preferences

Goal: Helpful, harmless, honest responses

Key Insight: The model is not a database of its training text - it mainly learns patterns and relationships that allow it to generate novel text, although it can sometimes reproduce memorized passages

Knowledge Check: Large Language Models

What is the primary advantage of LLMs over traditional supervised learning approaches for new NLP tasks?

A) They are always more accurate

B) They can perform new tasks with few or no task-specific training examples

C) They cost less to operate

D) They process text faster

LLMs in Business: Real-World Applications

Customer Service

Use Case: AI-powered customer support chatbots

Handle 70% of common inquiries automatically
24/7 availability in multiple languages
Escalate complex issues to humans

40% Cost reduction (Gartner 2024)

Content Creation

Use Case: Marketing content generation

Product descriptions
Social media posts
Email campaigns

10× Content output increase

Data Analysis

Use Case: Automated report generation from structured data

Transform databases into narrative insights without manual analysis

Code Development

Use Case: GitHub Copilot, AI pair programming

Developers report 55% faster task completion (GitHub, 2023)

Current Limitations of LLMs

1. Hallucinations

LLMs can confidently generate false information that sounds plausible

Example: Asked about a non-existent book, an LLM may invent a realistic-sounding plot summary

2. Knowledge Cutoff

Models are trained on data up to a specific date - they don't know what happened after

Impact: Without external tools such as web search, cannot provide current news, recent events, or latest research

3. No True Understanding

LLMs pattern-match rather than comprehend - they don't have mental models of the world

Example: May fail at basic spatial reasoning or logical consistency

4. Computational Cost

Running large models requires significant computing power

Impact: API costs can be substantial for high-volume applications

Decision Framework: Traditional NLP vs. LLMs

Scenario	Recommended Approach	Rationale
Simple spam detection with 10,000 labeled emails	Traditional (Naïve Bayes)	Fast, cheap, explainable, high accuracy for this specific task
Customer service chatbot handling diverse queries	LLM (GPT-4)	Needs context understanding, handles unexpected questions
Regulatory document classification (banking)	Traditional + LLM Hybrid	Need explainability for compliance, but benefit from LLM understanding
Processing 1 million reviews per day	Traditional (cost considerations)	Traditional: $500/month vs LLM: $50,000/month
Multilingual customer support (20+ languages)	LLM	Training 20 traditional models vs 1 LLM
Real-time sentiment monitoring dashboard	Traditional	Speed critical, simpler models process faster

Best Practice: Hybrid approach - use traditional NLP for filtering/preprocessing, then LLM for complex cases requiring nuanced understanding
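One way the hybrid idea might be wired up is a confidence-based router: a cheap classifier handles every review, and only low-confidence cases are escalated to an LLM. Everything here is a hypothetical sketch. `route_review`, the 0.8 threshold, and both stand-in classifiers are illustrative; a real system would plug in a trained model and an actual LLM API call.

```python
from typing import Callable, Tuple

def route_review(
    text: str,
    cheap_classifier: Callable[[str], Tuple[str, float]],
    llm_classifier: Callable[[str], str],
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Return (label, source); escalate to the LLM only when confidence is low."""
    label, confidence = cheap_classifier(text)
    if confidence >= threshold:
        return label, "traditional"
    return llm_classifier(text), "llm"

# Stand-in classifiers for demonstration only.
def toy_classifier(text: str) -> Tuple[str, float]:
    if "terrible" in text.lower():
        return "Negative", 0.95
    return "Positive", 0.55

def toy_llm(text: str) -> str:
    return "Negative" if "great, another" in text.lower() else "Positive"

print(route_review("Terrible service", toy_classifier, toy_llm))
print(route_review("Great, another broken charger", toy_classifier, toy_llm))
```

The threshold controls the cost trade-off: raising it sends more reviews to the LLM, lowering it keeps more on the cheap path.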

Ethical Considerations: LLMs in Academic Work

Acceptable Uses

Grammar and spelling checking
Brainstorming and idea generation
Understanding complex concepts
Generating practice questions
Debugging code errors
Translating technical concepts

Unacceptable Uses

Copying AI-generated text as your own
Having AI write entire essays/reports
Using AI to complete assessments
Submitting AI code without understanding
Generating fake references

General Principle

LLMs should enhance your learning and thinking, not replace it. Your submissions should reflect your understanding, voice, and effort.

Transparency Rule

When in doubt, declare your AI use. For example:

"I used ChatGPT to help explain the concept of transformers in simple language, then wrote this explanation in my own words based on my understanding."

Knowledge Check: Ethical AI Use

Which of the following represents the most ethical use of ChatGPT for a university assignment?

A) Asking ChatGPT to write the entire assignment and submitting it with minor edits

B) Using ChatGPT to explain difficult concepts, then writing your assignment in your own words based on your understanding

C) Having ChatGPT generate an outline and using it verbatim as your submission structure

D) Asking ChatGPT to generate references for sources you haven't read

Workshop Summary

Key Takeaways

1. Traditional NLP Foundation

Text pre-processing, Bag-of-Words, and Naïve Bayes remain valuable for many business applications - especially when explainability and cost are priorities

2. Transformer Revolution

The attention mechanism enabled models to understand context and relationships across entire documents, dramatically improving NLP capabilities

3. Large Language Models

LLMs like GPT-4 can perform diverse language tasks with minimal examples, but come with limitations including hallucinations and computational costs

4. Strategic Tool Selection

Choose between traditional and modern approaches based on task complexity, budget, explainability needs, and scale

5. Ethical Responsibility

Use AI as a tool to enhance your capabilities, not replace your thinking. Maintain academic integrity and transparency

Next Steps: Hands-on activities with Orange Data Mining and ChatGPT to apply these concepts to real datasets