Natural Language Processing and Generative AI

DATA4800 - Workshop 9

Understanding how machines process and generate human language

Workshop Learning Outcomes

1. Explore Traditional NLP and Machine Learning

Understand text pre-processing, classification techniques, and the Naïve Bayes algorithm

2. Explore Large Language Models and Business Applications

Discover transformers, GPT evolution, and real-world use cases in business contexts

3. Understand Ethical Considerations

Examine responsible use of LLMs in academic and professional environments

The Business Challenge

Scenario: E-commerce Customer Feedback Analysis

Your online retail company receives 10,000 customer reviews daily across multiple platforms

The Problem

  • Manual review reading is impossible at scale
  • Response time directly impacts customer satisfaction
  • Competitive advantage requires real-time insights
  • Negative reviews need immediate attention

Business Impact

10,000 Daily reviews
24h Manual processing time
$2M Annual manual cost

Two Approaches to Text Analysis

2000-2015

Traditional NLP + Machine Learning

Approach: Feature engineering, statistical models, rule-based systems

Example: Naïve Bayes classifier, Bag-of-Words

Characteristics: Requires labeled data, explainable, limited context understanding

2017-2025

Modern Large Language Models

Approach: Deep learning, transformer architecture, pre-trained models

Example: GPT-4, ChatGPT, Claude

Characteristics: Context-aware, minimal training, human-like understanding

Today's Journey: We will explore both approaches to understand when and why to use each method

What is Natural Language Processing?

Natural Language Processing (NLP) is a field of artificial intelligence that enables computers to understand, interpret, and generate human language.

Key Challenges in NLP

Ambiguity

"I saw a man with a telescope"

Did I use a telescope to see him? Or did he have a telescope?

Context Dependency

"The bank is closed"

Financial institution or river bank?

Sarcasm and Irony

"Great! Another meeting..."

Positive words, negative sentiment

Multiple Meanings

"Book" as a noun vs. verb

"Read this book" vs. "Book a flight"

Business Applications of NLP

Customer Service

Chatbots, automated support, sentiment analysis

Content Analysis

Document classification, information extraction

Translation

Multi-language support, localization

Industry Statistics

  • Customer Service Automation: 85% adoption
  • Sentiment Analysis: 72% adoption
  • Document Processing: 68% adoption
  • Speech Recognition: 61% adoption

Source: Gartner NLP Market Analysis 2024

Text Pre-processing Pipeline

Before machines can analyze text, raw text must be transformed into a structured format. This involves several essential steps:

1. Tokenization

Breaking text into individual words or tokens

2. Cleaning

Remove punctuation, special characters, and noise

3. Stop Word Removal

Filter out common words with little meaning (a, an, the)

4. Normalization

Stemming or lemmatization to standardize word forms

Step 1: Tokenization

Tokenization breaks continuous text into discrete units (tokens), typically words or subwords.

Original Text

"The product works well! I would recommend this product to others."

After Tokenization

The product works well ! I would recommend this product to others .

Result: 13 individual tokens that can be analyzed independently
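
The split above can be reproduced with a short regular expression. This is a minimal sketch, assuming words are runs of letters or digits and every punctuation mark becomes its own token; the function name `tokenize` is illustrative, and production tokenizers handle many more cases (contractions, URLs, subwords).

```python
import re

def tokenize(text):
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

review = "The product works well! I would recommend this product to others."
tokens = tokenize(review)

print(tokens)
print(len(tokens))  # 13
```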

Step 2: Stop Word Removal

Stop words are common words that appear frequently but carry little semantic meaning. Removing them reduces noise and computational complexity.

Common Stop Words

the a an is to in of I would this

Before Removal

The, product, works, well, I, would, recommend, this, product, to, others

11 tokens

After Removal

product, works, well, recommend, product, others

6 tokens

Impact: 45% reduction in tokens while preserving core meaning
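
A minimal sketch of the filtering step, assuming the ten stop words listed on this slide and a case-insensitive match. The names `STOP_WORDS` and `remove_stop_words` are illustrative; real stop word lists are much longer.

```python
# Stop words taken from the slide; lower-cased so "The" and "I" also match.
STOP_WORDS = {"the", "a", "an", "is", "to", "in", "of", "i", "would", "this"}

def remove_stop_words(tokens):
    # Keep only tokens whose lower-cased form is not a stop word.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["The", "product", "works", "well", "I", "would",
          "recommend", "this", "product", "to", "others"]
kept = remove_stop_words(tokens)
reduction = 1 - len(kept) / len(tokens)

print(kept)
print(f"{len(tokens)} -> {len(kept)} tokens ({reduction:.0%} reduction)")
```

Five of the eleven tokens are dropped, which is where the 45% figure comes from.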

Step 3: Text Normalization

Normalization reduces words to their base or root form, helping machines recognize that different forms represent the same concept.

Stemming

Definition: Crude rule-based cutting of word endings

Examples:

  • running → run
  • worked → work
  • better → better
  • caring → car (error!)

Advantage: Fast and simple

Limitation: Can produce non-words

Lemmatization

Definition: Linguistically informed reduction to dictionary form

Examples:

  • running → run
  • worked → work
  • better → good
  • caring → care

Advantage: Always produces valid words

Limitation: Slower; requires a dictionary and part-of-speech information
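
The contrast can be sketched in a few lines. The stemmer below is a deliberately crude suffix stripper written for this example (not the Porter algorithm), and `LEMMAS` is a hypothetical four-entry dictionary standing in for the lexicon a real lemmatizer would consult.

```python
VOWELS = set("aeiou")

def crude_stem(word):
    """Strip a few common suffixes using rules only, with no dictionary."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
                stem = stem[:-1]
            return stem
    return word

# Hypothetical mini-dictionary; a real lemmatizer uses a full lexicon.
LEMMAS = {"running": "run", "worked": "work", "better": "good", "caring": "care"}

def lookup_lemma(word):
    # Fall back to the word itself when it is not in the dictionary.
    return LEMMAS.get(word, word)

for w in ["running", "worked", "better", "caring"]:
    print(f"{w}: stem={crude_stem(w)}, lemma={lookup_lemma(w)}")
```

The rule-based version has no way to know that "caring" comes from "care", while the dictionary lookup does, at the cost of needing the dictionary in the first place.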

Complete Pre-processing Pipeline

Let's see how all steps transform a customer review:

Original Review

"The product works amazingly well! I would definitely recommend this product to other customers."

Step 1: Tokenization

The | product | works | amazingly | well | ! | I | would | definitely | recommend | this | product | to | other | customers | .

Step 2: Remove Punctuation & Stop Words

product | works | amazingly | well | definitely | recommend | product | customers

Step 3: Lemmatization

product | work | amazingly | well | definitely | recommend | product | customer

Final Result: Clean, normalized tokens ready for machine learning analysis

Knowledge Check: Text Pre-processing

Which pre-processing step would be most important for reducing the vocabulary size in a sentiment analysis model?
A) Tokenization only
B) Stop word removal only
C) Lemmatization combined with stop word removal
D) Keeping all words in their original form

Bag-of-Words Model

The Bag-of-Words (BoW) model represents text as a collection of word frequencies, ignoring grammar and word order.

How It Works

Step 1

Create vocabulary of all unique words

Step 2

Count word occurrences in each document

Step 3

Represent as numerical vector

Example: Vocabulary

["bad", "good", "great", "product", "recommend", "terrible", "well"]

Review 1

"The product works well! I would recommend the product."

[0, 0, 0, 2, 1, 0, 1]

Review 2

"Terrible product. Bad quality, not good."

[1, 1, 0, 1, 0, 1, 0]
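
The two vectors can be reproduced with a short counting routine. This is a minimal sketch that assumes the seven-word vocabulary above is fixed in advance and that lower-casing plus a letters-only regular expression is enough pre-processing; `bow_vector` is an illustrative name.

```python
import re
from collections import Counter

VOCABULARY = ["bad", "good", "great", "product", "recommend", "terrible", "well"]

def bow_vector(text, vocabulary=VOCABULARY):
    # Lower-case, keep alphabetic runs only, then count each vocabulary word.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [counts[word] for word in vocabulary]

review_1 = "The product works well! I would recommend the product."
review_2 = "Terrible product. Bad quality, not good."

print(bow_vector(review_1))  # [0, 0, 0, 2, 1, 0, 1]
print(bow_vector(review_2))  # [1, 1, 0, 1, 0, 1, 0]
```

Words outside the vocabulary ("works", "quality", "not") are simply ignored, which already hints at the limitations of the model.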

Limitations of Bag-of-Words

What BoW Captures

  • Word frequency information
  • Vocabulary presence/absence
  • Simple pattern recognition
  • Computational efficiency

What BoW Misses

  • Word order and grammar
  • Context and relationships
  • Semantic meaning
  • Negations ("not good")

Critical Example

Sentence 1: "The product is not terrible"

Vector: [0, 0, 0, 1, 0, 1, 0]

Sentence 2: "The product is terrible"

Vector: [0, 0, 0, 1, 0, 1, 0]

Problem: Opposite meanings produce identical vectors!
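
The collision is easy to check. A minimal sketch, assuming the same seven-word vocabulary as in the earlier example and simple whitespace splitting:

```python
VOCABULARY = ["bad", "good", "great", "product", "recommend", "terrible", "well"]

def bow_vector(text):
    # Count how often each vocabulary word occurs; all other words are dropped.
    words = text.lower().split()
    return [words.count(term) for term in VOCABULARY]

with_negation = bow_vector("The product is not terrible")
without_negation = bow_vector("The product is terrible")

print(with_negation)
print(without_negation)
print(with_negation == without_negation)  # True
```

Because "not" is outside the vocabulary and word order is discarded, nothing in the vector records the negation.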

Despite these limitations, BoW remains effective for many classification tasks and serves as a foundation for more advanced methods.

Naïve Bayes Classifier

Naïve Bayes is a probabilistic classification algorithm based on Bayes' Theorem, with an assumption of independence between features.

Core Concept

Given a document with words, calculate the probability it belongs to each category (Positive, Negative, Neutral)

Bayes' Theorem (Simplified)

P(Category | Words) = [P(Words | Category) × P(Category)] / P(Words)

  • P(Category | Words): Probability review is positive given these words
  • P(Words | Category): Probability these words appear in positive reviews
  • P(Category): Overall proportion of positive reviews

"Naïve" Assumption: The algorithm assumes that words are independent of one another within a category (the appearance of one word doesn't affect the probability of any other). While unrealistic, this simplification makes computation tractable and often works well in practice.

Training a Naïve Bayes Classifier

The model learns from labeled training data by calculating word probabilities for each category.

Training Dataset Example (Hotel Reviews)

Review | Words Present | Label
"Great location, clean rooms" | great, location, clean, room | Positive
"Terrible service, dirty bathroom" | terrible, service, dirty, bathroom | Negative
"Excellent staff, would recommend" | excellent, staff, recommend | Positive
"Poor value, noisy location" | poor, value, noisy, location | Negative
"Good breakfast, friendly staff" | good, breakfast, friendly, staff | Positive

60% Positive Reviews
40% Negative Reviews
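
In symbols, training amounts to estimating two kinds of quantities from a table like the one above. The notation is an assumption introduced here for illustration: c is a category, w a word, N the total number of reviews, N_c the number of reviews labeled c, and N_{c,w} the number of reviews labeled c that contain w. Other variants of Naïve Bayes count word occurrences rather than reviews.

```latex
% Prior: share of training reviews in category c
P(c) = \frac{N_c}{N}

% Word probability: share of category-c reviews that contain word w
P(w \mid c) = \frac{N_{c,w}}{N_c}

% Prediction under the independence assumption
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

For the table above, P(Positive) = 3/5 and P(Negative) = 2/5, which matches the 60% / 40% split.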

Making Predictions with Naïve Bayes

Let's classify a new review: "Great staff and good location"

Step 1: Extract Words

Words: ["great", "staff", "good", "location"]

Step 2: Calculate Probabilities

For Positive:

  • "great" appears in 1/3 positive reviews
  • "staff" appears in 2/3 positive reviews
  • "good" appears in 1/3 positive reviews
  • "location" appears in 1/3 positive reviews

For Negative:

  • "great" appears in 0/2 negative reviews
  • "staff" appears in 0/2 negative reviews
  • "good" appears in 0/2 negative reviews
  • "location" appears in 1/2 negative reviews

Step 3: Final Prediction

Positive: 87% probability

Negative: 13% probability

Classification: POSITIVE
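
The calculation can be sketched end to end on the five hotel reviews. This is one possible variant, not necessarily the one behind the percentages on the slide: it assumes add-one smoothing on review counts, so that the zero counts for Negative do not wipe out the product, and it multiplies only the probabilities of words present in the new review. With these assumptions the result is about 88% Positive, close to but not identical with the slide's figure.

```python
TRAINING = [
    ({"great", "location", "clean", "room"}, "Positive"),
    ({"terrible", "service", "dirty", "bathroom"}, "Negative"),
    ({"excellent", "staff", "recommend"}, "Positive"),
    ({"poor", "value", "noisy", "location"}, "Negative"),
    ({"good", "breakfast", "friendly", "staff"}, "Positive"),
]

def posterior(words, training=TRAINING, alpha=1):
    """Return normalized P(label | words) for each label in the training set."""
    labels = sorted({label for _, label in training})
    scores = {}
    for label in labels:
        docs = [doc for doc, doc_label in training if doc_label == label]
        score = len(docs) / len(training)  # prior P(Category)
        for word in words:
            containing = sum(word in doc for doc in docs)
            # Smoothed share of this category's reviews that contain the word.
            score *= (containing + alpha) / (len(docs) + 2 * alpha)
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

result = posterior(["great", "staff", "good", "location"])
prediction = max(result, key=result.get)

print({label: round(p, 3) for label, p in result.items()})
print(prediction)  # Positive
```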

The Zero-Probability Problem

What happens when a word never appears in training data for a particular category?

Problem Scenario

New review: "The breakfast was excellent"

If "excellent" never appeared in negative reviews during training, then:

P(excellent | Negative) = 0

Result: Entire probability calculation becomes 0, regardless of other words!

Solution: Laplace Smoothing

Add a small count (typically 1) to all word frequencies, ensuring no probability is exactly zero.

Without Smoothing

"excellent" in negative: 0/100 = 0%

Causes complete failure

With Smoothing

"excellent" in negative: 1/105 = 0.95%

Small but non-zero probability

Impact: Smoothing prevents model breakdown while minimally affecting overall accuracy
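
The two figures above follow from one formula. A minimal sketch, assuming add-one smoothing over a vocabulary of 5 words, which is what the denominator of 105 implies; `word_probability` is an illustrative name.

```python
def word_probability(count, total, vocab_size, alpha=0):
    # alpha = 0 gives the raw frequency; alpha = 1 gives Laplace (add-one) smoothing.
    return (count + alpha) / (total + alpha * vocab_size)

unsmoothed = word_probability(0, 100, vocab_size=5)
smoothed = word_probability(0, 100, vocab_size=5, alpha=1)

print(f"without smoothing: {unsmoothed:.2%}")  # 0.00%
print(f"with smoothing:    {smoothed:.2%}")    # 0.95%

# One zero factor wipes out the whole product, however strong the other words are.
print(0.6 * 0.5 * unsmoothed)  # 0.0
```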

Naïve Bayes in Practice

Strengths

  • Fast training and prediction
  • Works well with small datasets
  • Handles high-dimensional data
  • Transparent and explainable
  • Requires minimal tuning

Limitations

  • Independence assumption often violated
  • Cannot capture word order
  • Struggles with sarcasm
  • Limited context understanding
  • Requires labeled training data

Real-World Performance Metrics

  • Spam Detection: 95% accuracy
  • Sentiment Analysis (Simple): 82% accuracy
  • Document Classification: 88% accuracy
  • Sentiment Analysis (Complex): 68% accuracy

Knowledge Check: Naïve Bayes

Why is Laplace smoothing necessary in Naïve Bayes classifiers?
A) To increase the accuracy of the model
B) To prevent zero probabilities when a word hasn't appeared in training data
C) To reduce computational complexity
D) To handle multiple languages

The Evolution: From Statistics to Neural Networks

2000s

Statistical NLP

Bag-of-Words, Naïve Bayes, TF-IDF

Limitation: No understanding of context or meaning

2013

Word Embeddings

Word2Vec, GloVe - words represented as dense vectors

Breakthrough: Captured semantic relationships (king - man + woman ≈ queen)

2017

Transformer Revolution

"Attention is All You Need" paper introduced transformers

Innovation: Attention mechanism allows models to focus on relevant words

2018-2025

Large Language Models Era

GPT series, BERT, Claude, DeepSeek

Capability: Human-like text understanding and generation

Key Innovation: The Transformer Architecture

Transformers fundamentally changed how machines process language by introducing the "attention mechanism"

The Core Concept

Traditional models process words sequentially (left to right)

Transformers process all words simultaneously, with attention determining which words are most relevant to each other

Example: Translating "The animal didn't cross the street because it was too tired"

Traditional Approach

Processes word-by-word, may lose context of what "it" refers to by the time it reaches "tired"

Transformer Approach

Simultaneously considers all words, correctly identifies "it" refers to "animal" (not "street") based on "tired"

Result: Much better understanding of context, relationships, and meaning across entire documents

Transformer Components

Input Text

"Translate this to Spanish"

Encoder

Understands input context

Multiple layers with attention

Decoder

Generates output

Attends to encoder + previous words

Output Text

"Traduce esto al español"

Attention Mechanism in Action

When translating "The cat sat on the mat":

Each word dynamically focuses on the most relevant other words
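
The idea of each word weighting every other word can be sketched with scaled dot-product attention. This is a simplified illustration: the 2-dimensional embeddings are hand-picked hypothetical values, and the word vectors are used directly as queries, keys, and values, whereas a real transformer learns separate projections for each.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values."""
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    # One row of weights per word: how much it attends to every word.
    weights = [softmax([dot(q, k) / scale for k in vectors]) for q in vectors]
    # Each output is the attention-weighted average of all word vectors.
    outputs = [
        [sum(w * v[d] for w, v in zip(row, vectors)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs

# Hypothetical embeddings, chosen so that "cat" and "mat" point in similar directions.
words = ["cat", "sat", "mat"]
embeddings = [[1.0, 0.2], [0.1, 1.0], [0.9, 0.3]]

weights, outputs = self_attention(embeddings)
for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

Every row of weights sums to 1, and all rows are computed from the full sentence at once rather than word by word.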

Knowledge Check: Transformers

What is the primary advantage of the attention mechanism in transformers compared to traditional sequential models?
A) It processes text faster
B) It requires less training data
C) It can consider relationships between all words simultaneously, regardless of distance
D) It uses less computer memory

Large Language Models (LLMs)

Large Language Models are transformer-based neural networks trained on massive text datasets to understand and generate human language.

What Makes Them "Large"?

175B Parameters (GPT-3)
~1T Parameters (GPT-4, unofficial estimate)

What are Parameters?

Parameters are the learned weights in the neural network. More parameters generally mean greater capacity to capture patterns in language, along with higher training and running costs.

For context: The human brain has approximately 100 trillion synaptic connections

Key Capability: LLMs can perform tasks they weren't explicitly trained for through "emergent abilities" - complex behaviors arising from scale

Evolution of GPT Models

Model | Release | Parameters | Key Capability | Business Impact
GPT-2 | Feb 2019 | 1.5 billion | Coherent text generation | Proof of concept
GPT-3 | May 2020 | 175 billion | Few-shot learning, improved reasoning | First commercial applications
GPT-3.5 | Jan 2022 | 175 billion | Reduced toxicity, better instruction following | Foundation for ChatGPT
GPT-4 | Mar 2023 | ~1 trillion (est.) | Multimodal (text + images), improved reasoning | Professional-grade AI assistant
GPT-4o / DeepSeek | 2024-2025 | Undisclosed | Vision, audio, code, multilingual excellence | Enterprise integration, specialized tasks

~667× Parameter increase (GPT-2 to GPT-4, based on the ~1 trillion estimate)
$100M+ Estimated training cost (GPT-4)

What LLMs Can Do

Text Generation

Write articles, emails, reports, creative content

Code Generation

Write and debug code in multiple languages

Translation

Translate between 100+ languages

Summarization

Condense long documents into key points

Question Answering

Answer questions based on context

Sentiment Analysis

Detect emotion and tone in text

Comparison: Traditional NLP vs. LLMs

Aspect | Traditional NLP | Large Language Models
Training Data | Task-specific labeled data | Massive unlabeled text corpus
Context Understanding | Limited (bag-of-words) | Deep contextual understanding
Setup Time | Days-weeks for data labeling | Minutes (few examples needed)
Performance on New Tasks | Requires retraining | Immediate with prompting
Cost | Low (after initial development) | Higher (API fees or compute)

How LLMs Are Trained

Stage 1: Pre-training

Task: Predict the next word in billions of text sequences

Data: A large share of publicly available text - books, websites, articles, code repositories

Duration: Months on thousands of GPUs

Cost: $10M - $100M+

Stage 2: Supervised Fine-tuning

Task: Learn to follow instructions and format responses

Data: Human-labeled examples of desired behavior

Example: "Write a professional email rejecting a job candidate" → [Expected response]

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

Task: Learn to generate responses humans prefer

Process: Human evaluators rank multiple model responses; model learns from preferences

Goal: Helpful, harmless, honest responses

Key Insight: The model is not designed to store its training data verbatim - it learns patterns and relationships that allow it to generate novel text, although it can occasionally reproduce passages it saw many times
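
The Stage 1 objective, predicting the next word, can be illustrated with a toy model that simply counts which word follows which. This is only a sketch of the objective: the three-sentence corpus is hypothetical, and real LLMs replace the counting with a transformer network trained on vastly more text.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    # For every word, count the words that immediately follow it.
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def predict_next(model, word):
    # Return the most frequent follower, or None for an unseen word.
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

corpus = [
    "the product works well",
    "the product arrived late",
    "the product works great",
]
model = train_bigram_model(corpus)

print(predict_next(model, "product"))  # works
print(predict_next(model, "refund"))   # None
```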

Knowledge Check: Large Language Models

What is the primary advantage of LLMs over traditional supervised learning approaches for new NLP tasks?
A) They are always more accurate
B) They can perform new tasks with few or no task-specific training examples
C) They cost less to operate
D) They process text faster

LLMs in Business: Real-World Applications

Customer Service

Use Case: AI-powered customer support chatbots

  • Handle 70% of common inquiries automatically
  • 24/7 availability in multiple languages
  • Escalate complex issues to humans
40% Cost reduction (Gartner 2024)

Content Creation

Use Case: Marketing content generation

  • Product descriptions
  • Social media posts
  • Email campaigns
10× Content output increase

Data Analysis

Use Case: Automated report generation from structured data

Transform databases into narrative insights without manual analysis

Code Development

Use Case: GitHub Copilot, AI pair programming

Developers report 55% faster task completion (GitHub, 2023)

Current Limitations of LLMs

1. Hallucinations

LLMs can confidently generate false information that sounds plausible

Example: Asked about a non-existent book, may invent realistic-sounding plot summary

2. Knowledge Cutoff

Models are trained on data up to a specific date - they don't know what happened after

Impact: Cannot provide current news, recent events, or latest research

3. No True Understanding

LLMs pattern-match rather than comprehend - they don't have mental models of the world

Example: May fail at basic spatial reasoning or logical consistency

4. Computational Cost

Running large models requires significant computing power

Impact: API costs can be substantial for high-volume applications

Decision Framework: Traditional NLP vs. LLMs

Scenario | Recommended Approach | Rationale
Simple spam detection with 10,000 labeled emails | Traditional (Naïve Bayes) | Fast, cheap, explainable, high accuracy for this specific task
Customer service chatbot handling diverse queries | LLM (GPT-4) | Needs context understanding, handles unexpected questions
Regulatory document classification (banking) | Traditional + LLM Hybrid | Need explainability for compliance, but benefit from LLM understanding
Processing 1 million reviews per day | Traditional (cost considerations) | Traditional: $500/month vs LLM: $50,000/month
Multilingual customer support (20+ languages) | LLM | Training 20 traditional models vs 1 LLM
Real-time sentiment monitoring dashboard | Traditional | Speed critical, simpler models process faster

Best Practice: Hybrid approach - use traditional NLP for filtering/preprocessing, then LLM for complex cases requiring nuanced understanding
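
One way to picture the hybrid approach is a confidence-based router: the cheap model answers when it is confident, and only the remaining cases go to the LLM. Everything below is a hypothetical sketch: `keyword_classifier` and `fake_llm` are stand-ins for a trained classifier and a real LLM API call, and the 0.8 threshold is an arbitrary assumption to be tuned on real data.

```python
def route_review(text, cheap_classifier, llm_classifier, threshold=0.8):
    """Use the cheap model when it is confident; otherwise escalate to the LLM."""
    label, confidence = cheap_classifier(text)
    if confidence >= threshold:
        return label, "traditional"
    return llm_classifier(text), "llm"

def keyword_classifier(text):
    # Hypothetical stand-in for a trained Naïve Bayes model returning (label, confidence).
    lowered = text.lower()
    if "terrible" in lowered:
        return "Negative", 0.95
    if "excellent" in lowered:
        return "Positive", 0.95
    return "Unknown", 0.30

def fake_llm(text):
    # Placeholder for a real LLM call; always answers "Negative" in this sketch.
    return "Negative"

print(route_review("Terrible service", keyword_classifier, fake_llm))
print(route_review("Well, that was an experience", keyword_classifier, fake_llm))
```

The first review is settled by the cheap model; the second falls below the threshold and is escalated, so LLM costs are only incurred for the harder cases.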

Ethical Considerations: LLMs in Academic Work

Acceptable Uses

  • Grammar and spelling checking
  • Brainstorming and idea generation
  • Understanding complex concepts
  • Generating practice questions
  • Debugging code errors
  • Translating technical concepts

Unacceptable Uses

  • Copying AI-generated text as your own
  • Having AI write entire essays/reports
  • Using AI to complete assessments
  • Submitting AI code without understanding
  • Generating fake references

General Principle

LLMs should enhance your learning and thinking, not replace it. Your submissions should reflect your understanding, voice, and effort.

Transparency Rule

When in doubt, declare your AI use. For example:

"I used ChatGPT to help explain the concept of transformers in simple language, then wrote this explanation in my own words based on my understanding."

Knowledge Check: Ethical AI Use

Which of the following represents the most ethical use of ChatGPT for a university assignment?
A) Asking ChatGPT to write the entire assignment and submitting it with minor edits
B) Using ChatGPT to explain difficult concepts, then writing your assignment in your own words based on your understanding
C) Having ChatGPT generate an outline and using it verbatim as your submission structure
D) Asking ChatGPT to generate references for sources you haven't read

Workshop Summary

Key Takeaways

1. Traditional NLP Foundation

Text pre-processing, Bag-of-Words, and Naïve Bayes remain valuable for many business applications - especially when explainability and cost are priorities

2. Transformer Revolution

The attention mechanism enabled models to understand context and relationships across entire documents, dramatically improving NLP capabilities

3. Large Language Models

LLMs like GPT-4 can perform diverse language tasks with minimal examples, but come with limitations including hallucinations and computational costs

4. Strategic Tool Selection

Choose between traditional and modern approaches based on task complexity, budget, explainability needs, and scale

5. Ethical Responsibility

Use AI as a tool to enhance your capabilities, not replace your thinking. Maintain academic integrity and transparency

Next Steps: Hands-on activities with Orange Data Mining and ChatGPT to apply these concepts to real datasets