Week 7: Generative AI

DATA5000 - Artificial Intelligence Programming in Business Analytics

Understanding how AI generates content from data

From prediction to creation: The technology behind ChatGPT, Claude, and Gemini

Where We Are in DATA5000

Week	Topic	Key Learning	Data Type
1-2	Predictive ML	Pattern recognition in structured data	Numbers, categories
3	Deep Learning & Transformers	Neural networks with attention mechanism	Time series, sequences
4-6	Causal AI	Understanding cause and effect	Treatment effects, interventions
7	Generative AI	Creating new content from patterns	Text, images, code
8	Prompt Engineering	Controlling AI outputs effectively	Instructions, context

Today's Focus: Moving from analyzing data to generating new content based on learned patterns

Recap: Transformers from Week 3

What You Already Know

In Week 3, you used Temporal Fusion Transformers (TFT) to forecast Australia's inflation rate.

What TFT Did

Analyzed multiple time series together
Used attention to focus on important time periods
Predicted future inflation values
Provided confidence intervals

Key Technology

Attention Mechanism: Weighs importance of different inputs
Sequential Processing: Understands order and context
Pattern Learning: Finds relationships in historical data

Sample Data from Week 3

Quarter	GDP Growth	Unemployment	Interest Rate	Inflation
2023-Q1	2.3%	3.5%	3.6%	7.8%
2023-Q2	1.8%	3.6%	4.1%	6.0%

TFT Output: Predicted inflation for Q3 with 95% confidence: 5.2% ± 0.8%

The Evolution to Generative AI

Week 2-3

Neural Networks

Predict single output

Example: House price

Input: 5 features → Output: $450,000

→

Week 3

Transformers

Predict sequences

Example: Inflation forecast

Input: Time series → Output: Next 4 quarters

→

Week 7

Generative AI

Generate new content

Example: Business report

Input: Prompt → Output: Full text/code

Key Insight: Generative AI uses the same transformer architecture you've already seen, but applies it to predict the next word in a sequence rather than the next number. By repeatedly predicting "what comes next," it can generate entire documents, code, or analyses.

What is Generative AI?

Generative AI creates new content by learning patterns from massive datasets, then using those patterns to generate text, images, code, or other outputs that appear human-created.

Three Core Capabilities

Pattern Recognition: Learns from billions of examples
Context Understanding: Comprehends relationships between words and concepts
Creative Generation: Produces new content following learned patterns

Traditional AI (Predictive)

Question: "Will customer X churn?"

Output: 0.73 (73% probability)

Fixed, numerical output

Generative AI

Question: "Why might customer X churn?"

Output: "Based on usage patterns, customer X shows declining engagement over 3 months, with support tickets about pricing. Recommend targeted retention offer..."

Flexible, contextual explanation

How Text Becomes Data: Tokenization

Tokenization is the process of breaking text into smaller units (tokens) that can be converted into numbers for the AI model to process.

Why Computers Need Numbers

Neural networks only understand numbers, not text
Every word, punctuation mark, and space must be converted
This conversion must be consistent and reversible

Step 1: Original Text

"Customer satisfaction increased by 15% this quarter."

Step 2: Break into Tokens

["Customer", "satisfaction", "increased", "by", "15", "%", "this", "quarter", "."]

Result: 9 tokens

Step 3: Convert to Token IDs (numbers)

[2456, 8901, 1234, 89, 45, 67, 312, 5678, 91]

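Each token maps to one ID in the model's vocabulary, and the same token always maps to the same ID (the IDs shown here are illustrative). Real tokenizers often split long or rare words into several sub-word tokens.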
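The three steps above can be sketched in a few lines of Python. This is a toy illustration only: the `TOY_VOCAB` dictionary and the regular expression are assumptions made for this example, and the IDs are the illustrative ones from the slide, not those of any real model's tokenizer.

```python
import re

# Hypothetical toy vocabulary using the illustrative IDs from the slide.
TOY_VOCAB = {
    "Customer": 2456, "satisfaction": 8901, "increased": 1234, "by": 89,
    "15": 45, "%": 67, "this": 312, "quarter": 5678, ".": 91,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def tokenize(text):
    """Step 2: split text into word, number, and punctuation tokens."""
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)


def encode(tokens):
    """Step 3: look up each token's ID in the vocabulary."""
    return [TOY_VOCAB[token] for token in tokens]


def decode(token_ids):
    """Reverse lookup: the conversion is consistent and reversible."""
    return [ID_TO_TOKEN[token_id] for token_id in token_ids]


text = "Customer satisfaction increased by 15% this quarter."
tokens = tokenize(text)
token_ids = encode(tokens)

print(tokens)
print(len(tokens), "tokens")
print(token_ids)
```

Production tokenizers learn their vocabulary from data and work on sub-word pieces, so real token boundaries and IDs will differ from this sketch.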

Tokenization: Real Examples with Data

Understanding Token Counts

Rule of Thumb: 1 token ≈ 0.75 words in English

Or approximately: 100 words = 133 tokens
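As a rough sketch, the rule of thumb can be turned into a small helper. The function name `estimate_tokens` is made up for this example, and the 0.75 words-per-token ratio is only the approximation stated above; actual counts depend on the tokenizer and the text.

```python
WORDS_PER_TOKEN = 0.75  # rule-of-thumb ratio for English text


def estimate_tokens(word_count):
    """Estimate the token count for a given number of English words."""
    return round(word_count / WORDS_PER_TOKEN)


for words in (100, 150, 300, 500):
    print(f"{words} words ≈ {estimate_tokens(words)} tokens")
```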

Business Document Token Analysis

Document Type	Word Count	Token Count	Notes
Short email	150 words	~200 tokens	Simple vocabulary
Business memo	500 words	~667 tokens	Professional language
Technical report	2,000 words	~2,800 tokens	Complex terminology, more tokens per word
Customer chat log	100 words	~150 tokens	Casual language, abbreviations

Why This Matters for Business

Cost: AI APIs charge per token (e.g., $0.03 per 1,000 input tokens for GPT-4)
Limits: Models have maximum token limits (e.g., 128,000 tokens for GPT-4 Turbo)
Performance: More tokens = longer processing time

Embeddings: Converting Tokens to Vectors

Embeddings convert each token into a vector (list of numbers) that captures its meaning in mathematical space. Similar words have similar vectors.

From Token IDs to Meaningful Representations

Example: Token "Customer" (ID: 2456)

Embedding vector (768 dimensions): [0.23, -0.45, 0.67, 0.12, -0.89, 0.34, ..., 0.45, -0.23, 0.78]

Example: Token "Client" (ID: 3892)

Embedding vector (768 dimensions): [0.25, -0.43, 0.69, 0.15, -0.87, 0.36, ..., 0.47, -0.21, 0.76]

Notice: Similar meanings = similar numbers!

Why 768 Dimensions?

Each dimension captures a different aspect of meaning
More dimensions = more nuanced understanding
GPT-3: 12,288 dimensions per token
Allows mathematical operations on meaning (e.g., "King" - "Man" + "Woman" ≈ "Queen")

Visualizing Word Relationships in Vector Space

Similar Words Cluster Together

When we reduce 768 dimensions to 3D for visualization, words with similar meanings appear close together:

Business Financial Terms - Distance in Vector Space

Word Pair	Cosine Similarity	Relationship
"revenue" ↔ "income"	0.87	Very similar
"profit" ↔ "revenue"	0.72	Related concepts
"profit" ↔ "loss"	0.41	Opposite but related
"profit" ↔ "customer"	0.23	Weakly related
"profit" ↔ "bicycle"	0.05	Unrelated

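Scale: 1.0 = vectors point in the same direction (near-identical meaning), 0.0 = unrelated. Cosine similarity can in principle range from -1.0 to 1.0.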
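Cosine similarity itself is a short calculation. The sketch below applies it to the first six values shown earlier for "Customer" and "Client"; since the slide elides most of the 768 dimensions, the result only illustrates the method and is not the true similarity of the full vectors.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Only the first six of the 768 dimensions shown on the slide.
customer = [0.23, -0.45, 0.67, 0.12, -0.89, 0.34]
client = [0.25, -0.43, 0.69, 0.15, -0.87, 0.36]

similarity = cosine_similarity(customer, client)
print(f"Similarity on the six visible dimensions: {similarity:.3f}")
```

Semantic search and clustering rely on the same calculation, applied to full-length embeddings returned by an embedding model.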

Business Applications

Semantic Search: Find documents by meaning, not just keywords
Recommendation Systems: Suggest similar products based on description embeddings
Clustering: Group customer feedback by topic automatically
Anomaly Detection: Flag unusual text patterns in reports

Training Data at Scale

The Foundation of Generative AI

Generative AI models learn by analyzing massive amounts of text data from books, websites, articles, and code repositories. The scale is unprecedented in computing history.

Training Dataset Statistics

Model	Training Tokens	Equivalent Books	Training Cost
GPT-3 (2020)	300 billion	~600,000 books	~$4.6 million
GPT-4 (2023)	~13 trillion	~26 million books	~$100 million (estimated)
Claude 3 (2024)	Similar scale	Tens of millions of books	Similar magnitude

For Context: One book ≈ 500,000 tokens (about 375,000 words)
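The "Equivalent Books" column follows directly from this conversion. A minimal check, using only the token counts from the table and the 500,000 tokens-per-book assumption stated above:

```python
TOKENS_PER_BOOK = 500_000  # conversion factor used on this slide

# Training token counts as listed in the table (the GPT-4 figure is an estimate).
training_tokens = {
    "GPT-3 (2020)": 300_000_000_000,
    "GPT-4 (2023)": 13_000_000_000_000,
}

equivalent_books = {
    model: tokens // TOKENS_PER_BOOK for model, tokens in training_tokens.items()
}

for model, books in equivalent_books.items():
    print(f"{model}: about {books:,} books")
```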

What the Training Data Includes

Books: Fiction, non-fiction, technical manuals
Web Content: Wikipedia, news articles, blogs, forums
Code: GitHub repositories, Stack Overflow discussions
Professional Documents: Research papers, business reports

Training Process: Real Numbers

The Computational Challenge

GPT-4 Training Estimated Specifications

Number of GPUs: 25,000+ NVIDIA A100s

Training Duration: 90-100 days (continuous)

Power Consumption: ~50 megawatts (like a small city)

Electricity Cost: ~$10 million

Total Training Cost: $100+ million

Why Understanding Scale Matters for Business

You're using a tool that cost millions to create
Using the API is far cheaper than building your own model
Understanding limits helps you use it effectively
Explains why some queries take longer than others

Pattern Learning: How Models Understand Language

Learning from Billions of Examples

During training, the model learns patterns by analyzing how words appear together millions of times across different contexts. It builds statistical understanding of language structure.

Examples of Learned Patterns

Pattern 1: Business Context

Learned: "quarterly revenue" is often followed by:

"increased" (positive context)
"decreased" (negative context)
"reached" (neutral reporting)

Model assigns probabilities based on training frequency

Pattern 2: Sentence Structure

Learned: After "The customer", likely words:

Verbs: "purchased", "requested", "complained"
Less likely: adjectives or prepositions
Grammar rules emerge from examples

No explicit grammar rules programmed

Training Example Frequency

Pattern: "The company reported [X] earnings"

Seen "strong earnings": 487,293 times
Seen "weak earnings": 142,847 times
Seen "purple earnings": 0 times

Result: Model learns "strong" is more likely than "purple" in this context
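A simplified way to picture this is to turn the counts into relative frequencies. The sketch below does exactly that with the three counts above. A real model does not store a count table like this; it learns parameters that produce similar probabilities, so treat this purely as an illustration.

```python
# Counts from the slide for the pattern "The company reported [X] earnings".
counts = {"strong": 487_293, "weak": 142_847, "purple": 0}

total = sum(counts.values())
probabilities = {word: count / total for word, count in counts.items()}

for word, probability in probabilities.items():
    print(f"{word}: {probability:.1%}")
```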

Next Token Prediction with Probabilities

The Core Mechanism

Generative AI works by repeatedly predicting the next most likely token based on all previous tokens. This simple process, when repeated, creates coherent text.

Example: Completing a Business Sentence

Input Context: "The quarterly revenue"

Candidate next tokens and their probabilities:

increased: 45%
decreased: 22%
reached: 18%
exceeded: 10%
remained: 5%

Selected Token: "increased" (highest probability)

Next Step: Model now predicts next token after "The quarterly revenue increased"

This process repeats until a complete response is generated
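The predict-select-repeat loop can be sketched with a lookup table standing in for the model. Only the first distribution comes from the slide; every other entry in `TOY_MODEL`, including the `<end>` stop marker, is invented for this example. A real model computes these probabilities with a neural network instead of looking them up.

```python
# Toy "model": maps the text so far to next-token probabilities.
TOY_MODEL = {
    # From the slide:
    "The quarterly revenue": {
        "increased": 0.45, "decreased": 0.22, "reached": 0.18,
        "exceeded": 0.10, "remained": 0.05,
    },
    # Hypothetical continuations, invented for illustration:
    "The quarterly revenue increased": {"by": 0.60, "significantly": 0.25, "sharply": 0.15},
    "The quarterly revenue increased by": {"15%": 0.50, "nearly": 0.30, "over": 0.20},
    "The quarterly revenue increased by 15%": {"<end>": 0.70, "this": 0.30},
}


def generate(prompt, max_new_tokens=10):
    """Repeatedly append the highest-probability next token (greedy decoding)."""
    text = prompt
    for _ in range(max_new_tokens):
        distribution = TOY_MODEL.get(text)
        if distribution is None:
            break  # the toy table has no entry for this context
        next_token = max(distribution, key=distribution.get)
        if next_token == "<end>":
            break  # the model signals that the response is complete
        text = f"{text} {next_token}"
    return text


result = generate("The quarterly revenue")
print(result)
```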

Architecture: Building on Transformers

Same Foundation, Different Application

Week 3: Temporal Fusion Transformer

Purpose: Predict future numerical values

Input: Time series data (GDP, unemployment, etc.)

Output: Next quarter's inflation rate

Attention: Which past time periods are important?

Week 7: Generative AI (GPT/Claude)

Purpose: Generate text, code, analysis

Input: Text prompt (tokenized)

Output: Next token (repeated for full text)

Attention: Which previous words are important?

Key Architectural Components

Input Embedding Layer: Converts tokens to 768+ dimensional vectors
Multiple Transformer Blocks: Each with self-attention and feed-forward networks
- GPT-3: 96 transformer blocks
- GPT-4: 120+ transformer blocks (estimated)
Output Layer: Converts final hidden state to probability distribution over all possible next tokens

Processing Flow: Text → Tokens → Embeddings → 96+ Transformer Layers → Next Token Probabilities → Select Token → Repeat

Self-Attention Mechanism in Action

Understanding Context and Relationships

Self-Attention allows each word to "look at" other words in the input to understand context. This is how AI understands that "Apple" in "Apple stock rose" refers to the company, not the fruit.

Example: Business Context Understanding

Sentence: "Apple stock rose despite market concerns"

Word	Attends Most To	Attention Weight	Why?
Apple	stock	0.85	Determines it's company context
stock	Apple, rose	0.78, 0.62	Subject and action relationship
rose	stock, despite	0.82, 0.43	Main action and contrast
despite	rose, concerns	0.71, 0.69	Contrast marker
market	concerns	0.88	Modifies concerns
concerns	market, rose	0.83, 0.47	Type and contrast

Multiple Attention Heads

Each transformer block has 12-96 attention heads
Each head learns different types of relationships (syntax, semantics, context)
GPT-3 has 96 attention heads per layer × 96 layers = 9,216 parallel attention mechanisms
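For intuition, here is a stripped-down sketch of a single attention head in plain Python. It is deliberately simplified: real transformers first multiply each embedding by learned query, key, and value matrices, which are omitted here, and the 2-dimensional vectors are made-up numbers, not real embeddings.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(vectors):
    """Scaled dot-product attention using the embeddings as Q, K, and V."""
    dim = len(vectors[0])
    all_weights, outputs = [], []
    for query in vectors:
        # Score every word against the current word, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New representation: weighted average of all word vectors.
        output = [
            sum(w * vector[j] for w, vector in zip(weights, vectors))
            for j in range(dim)
        ]
        all_weights.append(weights)
        outputs.append(output)
    return all_weights, outputs


words = ["Apple", "stock", "rose"]
toy_vectors = [[1.0, 0.2], [0.9, 0.4], [0.1, 1.0]]  # hypothetical embeddings

attention_weights, new_vectors = self_attention(toy_vectors)
for word, weights in zip(words, attention_weights):
    print(word, [round(w, 2) for w in weights])
```

Each row of weights plays the role of the "Attention Weight" column in the table above, although the numbers from this toy example will not match the slide.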

Quick Check: Understanding Tokenization

A business email contains 300 words. Approximately how many tokens will this require for processing in a generative AI model?

A) 225 tokens (using 1 token = 1.33 words)

B) 400 tokens (using 1 token ≈ 0.75 words)

C) 300 tokens (using 1 token = 1 word exactly)

D) 600 tokens (using 1 token = 0.5 words)

Why This Matters

At $0.03 per 1,000 input tokens (GPT-4 pricing):

300 words ≈ 400 tokens = $0.012 per email
10,000 emails = $120 processing cost
Understanding token counts helps estimate AI costs accurately

Model Parameters & Size

What Makes Models "Large"

Parameters are the numbers the model adjusts during training to learn patterns. More parameters generally means more capacity to learn complex patterns, but also higher costs.

Major Model Comparison

Model	Parameters	Release	Context Window	Relative Speed
GPT-3	175 billion	2020	2,048 tokens	Fast
GPT-3.5	175 billion	2022	4,096 tokens	Very fast
GPT-4	~1.8 trillion (estimated)	2023	128,000 tokens (GPT-4 Turbo)	Slower
Claude 3 Opus	Unknown (comparable)	2024	200,000 tokens	Medium
Claude 3.5 Sonnet	Unknown	2024	200,000 tokens	Fast

Size vs. Performance Trade-offs

Larger models: Better reasoning, more knowledge, slower, more expensive
Smaller models: Faster responses, lower cost, may miss nuance
Context window: How much text the model can consider at once
Business decision: Match model size to task complexity
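One practical use of the context window figure is checking whether a document will fit before sending it. The helper below is a rough sketch: `fits_in_context` is a made-up name, the 0.75 words-per-token ratio is the rule of thumb from earlier, and it assumes the prompt and the reserved response length must both fit within the window.

```python
def fits_in_context(prompt_words, context_window,
                    max_output_tokens=1024, words_per_token=0.75):
    """Return True if the estimated prompt plus the reserved output fits."""
    prompt_tokens = round(prompt_words / words_per_token)
    return prompt_tokens + max_output_tokens <= context_window


# A 2,000-word report and a 5,000-word report against two window sizes
# taken from the comparison table.
for words in (2_000, 5_000):
    for window in (4_096, 200_000):
        verdict = "fits" if fits_in_context(words, window) else "too long"
        print(f"{words:,} words, {window:,}-token window: {verdict}")
```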

Temperature Settings & Output Control

Controlling Randomness in Generation

Temperature is a parameter that controls how deterministic vs. creative the model's outputs are. The allowed range depends on the provider (0.0 to 2.0 in the OpenAI API, 0.0 to 1.0 in the Anthropic API). It affects how the model samples from its probability distribution.

Low Temperature (0.0 - 0.3)

Effect: Strongly favors the highest-probability token (at 0.0, it always picks it)

Output: Consistent, predictable, focused

Best For:

Data analysis
Code generation
Factual responses
Classification tasks

High Temperature (0.7 - 1.5)

Effect: Samples more randomly from probabilities

Output: Creative, varied, exploratory

Best For:

Creative writing
Brainstorming
Multiple perspectives
Marketing copy

Temperature in Action

Prompt: "The quarterly revenue"

Temperature	Next Token	Explanation
0.0	"increased" (45%)	Always picks highest probability
0.7	"decreased" (22%)	Occasionally picks 2nd or 3rd option
1.5	"remained" (5%)	Even low-probability options possible
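The effect of temperature on the probabilities can be sketched numerically. The code below uses one common formulation: divide the log-probabilities by the temperature and renormalize. The function name `apply_temperature` and the fixed random seed are choices made for this example, and real APIs may combine temperature with other sampling settings.

```python
import math
import random

# Next-token probabilities for "The quarterly revenue" from the earlier slide.
base_probs = {"increased": 0.45, "decreased": 0.22, "reached": 0.18,
              "exceeded": 0.10, "remained": 0.05}


def apply_temperature(probs, temperature):
    """Rescale a probability distribution by the given temperature."""
    if temperature == 0:
        # Temperature 0 is treated as "always pick the most likely token".
        best = max(probs, key=probs.get)
        return {token: (1.0 if token == best else 0.0) for token in probs}
    scaled = {t: math.exp(math.log(p) / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    return {t: value / total for t, value in scaled.items()}


for temperature in (0.0, 0.5, 1.0, 1.5):
    adjusted = apply_temperature(base_probs, temperature)
    summary = ", ".join(f"{t} {p:.0%}" for t, p in adjusted.items())
    print(f"T={temperature}: {summary}")

# Draw one token at temperature 1.5 (seeded so the run is repeatable).
rng = random.Random(0)
hot = apply_temperature(base_probs, 1.5)
sampled = rng.choices(list(hot), weights=list(hot.values()), k=1)[0]
print("Sampled at T=1.5:", sampled)
```

Lower temperatures concentrate probability on "increased", while higher temperatures flatten the distribution so that tokens such as "remained" are chosen more often.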

Temperature Examples: Same Prompt, Different Outputs

Business Email Generation

Prompt: "Write a brief email to the team about Q3 revenue results"

Temperature = 0.2 (Focused)

Subject: Q3 Revenue Results

Team,

Q3 revenue reached $2.4M, representing a 15% increase over Q2. Strong performance in enterprise sales (+28%) offset declines in small business segment (-8%). We remain on track for annual targets.

Key drivers: New product launch, expanded sales team, improved customer retention.

Thank you for your continued efforts.

Temperature = 0.9 (Creative)

Subject: Q3 Wins & Learnings 🎯

Hey everyone,

Exciting news on the Q3 front! We've hit $2.4M in revenue - that's a solid 15% jump from last quarter. Our enterprise team absolutely crushed it with 28% growth, while our small business segment gave us some valuable lessons (down 8% but rich with insights).

The new product launch? Game-changer. Our expanded sales squad? Bringing the energy. Customer retention? Better than ever.

Let's keep this momentum going. Proud of this team!

Cost Calculations: Real Business Impact

Understanding AI API Pricing

Pricing Model: Most AI APIs charge separately for input tokens (what you send) and output tokens (what you receive)

GPT-4 Pricing (January 2025)

Token Type	Cost per 1,000 tokens	Typical Use
Input (Prompt)	$0.03	Your questions, context, data
Output (Generated)	$0.06	AI's responses
Cached Input	$0.015	Repeated context (50% discount)

Example: Customer Service Chatbot

Monthly Conversations: 10,000

Average Input per Chat: 500 tokens (375 words)

Average Output per Chat: 300 tokens (225 words)

Input Cost: 10,000 × 500 × $0.03/1000 = $150

Output Cost: 10,000 × 300 × $0.06/1000 = $180

Total Monthly Cost: $330

ROI Consideration: If each conversation saves 5 minutes of human agent time ($0.50 labor cost), monthly savings = $5,000. Net benefit = $4,670/month
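The chatbot arithmetic above can be reproduced with a short script, which makes it easy to try other volumes or prices. The function name `monthly_api_cost` is made up for this example, and the default rates are simply the per-1,000-token prices from the table above.

```python
def monthly_api_cost(conversations, input_tokens, output_tokens,
                     input_rate=0.03, output_rate=0.06):
    """Return (input_cost, output_cost) in dollars; rates are per 1,000 tokens."""
    input_cost = conversations * input_tokens * input_rate / 1000
    output_cost = conversations * output_tokens * output_rate / 1000
    return input_cost, output_cost


conversations = 10_000
input_cost, output_cost = monthly_api_cost(conversations, 500, 300)
total_cost = input_cost + output_cost

labor_saving_per_chat = 0.50  # 5 minutes of agent time, as stated above
monthly_savings = conversations * labor_saving_per_chat
net_benefit = monthly_savings - total_cost

print(f"Input cost:  ${input_cost:,.2f}")
print(f"Output cost: ${output_cost:,.2f}")
print(f"Total cost:  ${total_cost:,.2f}")
print(f"Net benefit: ${net_benefit:,.2f} per month")
```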

API Call Structure & Token Usage

What Happens Behind the Scenes

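import anthropic

# Illustrative per-1,000-token rates copied from the GPT-4 pricing table above.
# Assumption: replace them with the current prices for the model you actually call.
INPUT_RATE_PER_1K = 0.03
OUTPUT_RATE_PER_1K = 0.06

def calculate_cost(input_tokens, output_tokens):
    """Estimate the cost of one request in US dollars."""
    return (input_tokens * INPUT_RATE_PER_1K + output_tokens * OUTPUT_RATE_PER_1K) / 1000

client = anthropic.Anthropic(api_key="your-api-key")

# Make API call with specific parameters
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    temperature=0.7,
    messages=[
        {
            "role": "user",
            "content": "Analyze this customer churn data..."
        }
    ]
)

# Access response and usage metrics
response_text = message.content[0].text
input_tokens = message.usage.input_tokens
output_tokens = message.usage.output_tokens

print(f"Response: {response_text}")
print(f"Input tokens: {input_tokens}")
print(f"Output tokens: {output_tokens}")
print(f"Cost: ${calculate_cost(input_tokens, output_tokens):.4f}")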

Key Parameters You Control

model: Which AI model to use (affects capability and cost)
max_tokens: Maximum length of response (1024 = ~750 words)
temperature: Randomness level (0.0-1.0 for the Anthropic API used here; the OpenAI API accepts 0.0-2.0)
messages: Your conversation history and current prompt

Quick Check: Cost Estimation

Your business needs to analyze 5,000 customer reviews per month. Each review is 200 words and you need a 150-word summary response for each. Using GPT-4 pricing ($0.03 input / $0.06 output per 1K tokens), what's the approximate monthly cost?

A) $50

B) $100

C) $160

D) $300

Calculation Breakdown

Input: 200 words × 1.33 ≈ 267 tokens per review
Output: 150 words × 1.33 ≈ 200 tokens per summary
Input cost: 5,000 × 267 × $0.03/1,000 ≈ $40
Output cost: 5,000 × 200 × $0.06/1,000 = $60
Total: ≈ $100 per month, so the calculated answer is B
In practice, API overhead, retries, and longer responses add to this figure, so a realistic budget may be closer to $160 (answer C)

Limitations: Hallucinations

When AI Generates False Information

Hallucination: When the model generates information that sounds plausible but is factually incorrect, not supported by training data, or fabricated.

Why Hallucinations Occur

Pattern Completion: Model is trained to generate plausible text, not necessarily true text
No External Verification: Model doesn't fact-check against real-time sources
Training Data Gaps: Missing or outdated information leads to educated guesses
Overfitting Patterns: Model may combine patterns incorrectly

Hallucination Examples in Business Context

Category	Hallucination Example	Risk Level
Statistics	"Studies show 73% of customers prefer..." (no such study exists)	High
Citations	"According to Smith et al. (2023)..." (paper doesn't exist)	High
Product Features	"This software includes blockchain integration" (it doesn't)	Medium
Company Details	"ABC Corp was founded in 1995" (actually 1998)	Medium
Technical Specs	"The API supports 10,000 requests/sec" (actual limit: 1,000)	High

Measured Hallucination Rates

GPT-4: ~15-20% on factual questions (without retrieval)
Claude 3: ~12-18% on similar benchmarks
Rate increases with: obscure topics, specific dates/numbers, recent events

Hallucination Detection: Practice Exercise

Which of These AI Outputs Contain Hallucinations?

Scenario: You asked AI to summarize your company's Q3 performance. Review these 5 statements:

AI-Generated Statements

Statement A: "Q3 revenue increased 15% compared to Q2, driven primarily by enterprise sales growth of 28%."
✓ CORRECT - Can be verified against actual data
Statement B: "According to the Harvard Business Review 2024 study, companies with similar growth patterns have a 78% chance of sustained growth."
✗ HALLUCINATION - Fabricated citation, specific statistic
Statement C: "The small business segment declined 8% due to increased competition and pricing pressure."
✓ CORRECT - If supported by your data
Statement D: "Industry analysts predict your company will reach $10M revenue by Q4 based on current trajectory."
✗ HALLUCINATION - Specific prediction without basis
Statement E: "Customer retention improved from 82% to 89% following the new product launch."
⚠ CHECK - Verify exact numbers against your metrics

Red Flags for Hallucinations

Specific statistics without clear sources
Citations to studies, papers, or reports you can't verify
Precise predictions about future performance
Technical specifications that seem too detailed
Confident assertions about very specific details
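A few of these red flags can be turned into a crude automatic screen. The sketch below is a hypothetical keyword-and-pattern check written for this example. It cannot tell whether a statement is true; it only marks statements that contain something a human should verify.

```python
import re

# Hypothetical patterns, one per red-flag type (checked in this order).
RED_FLAG_PATTERNS = [
    ("citation", re.compile(r"according to|\bet al\b|\bstud(y|ies)\b", re.IGNORECASE)),
    ("statistic", re.compile(r"\d+(\.\d+)?\s*%")),
    ("prediction", re.compile(r"\b(predicts?|forecasts?|will reach)\b", re.IGNORECASE)),
]


def red_flags(statement):
    """Return the names of all red-flag patterns found in the statement."""
    return [name for name, pattern in RED_FLAG_PATTERNS if pattern.search(statement)]


statements = {
    "A": "Q3 revenue increased 15% compared to Q2, driven primarily by "
         "enterprise sales growth of 28%.",
    "B": "According to the Harvard Business Review 2024 study, companies with "
         "similar growth patterns have a 78% chance of sustained growth.",
    "D": "Industry analysts predict your company will reach $10M revenue by Q4 "
         "based on current trajectory.",
}

for label, text in statements.items():
    print(f"Statement {label}: verify {red_flags(text)}")
```

Statement A is flagged too, even though it is correct: any specific number should be checked against your own data before it goes into a report.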

Real Business Application: Customer Feedback Analysis

TeleConnect Customer Review Categorization

Business Problem: TeleConnect receives 1,000 customer reviews weekly. Manual categorization takes 5 minutes per review. Can generative AI automate this accurately?

Sample Reviews (from actual dataset)

Review ID	Customer Review Text	Length
001	"The service is reliable but customer support response time is terrible. Waited 3 days for callback."	16 words
002	"Love the new mobile app features! Much easier to manage my account now."	13 words
003	"Pricing is too high compared to competitors. Considering switching despite good service quality."	13 words

Analysis Results: Temperature Comparison

Setting	Temperature	Accuracy	Consistency	Speed
Configuration 1	0.2	94%	Very High	~50 tokens/sec
Configuration 2	0.7	87%	Medium	~50 tokens/sec
Configuration 3	1.2	76%	Low	~50 tokens/sec

Conclusion: Temperature 0.2 optimal for classification tasks requiring consistency
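To put the business problem in numbers, the following sketch works out the weekly manual workload and how many reviews each configuration would be expected to misclassify. It uses only the figures stated above; treating accuracy as a fixed percentage of 1,000 reviews is a simplifying assumption.

```python
REVIEWS_PER_WEEK = 1_000
MINUTES_PER_REVIEW = 5

manual_hours = REVIEWS_PER_WEEK * MINUTES_PER_REVIEW / 60
print(f"Manual categorization: {manual_hours:.1f} hours per week")

# Accuracy (in percent) for each temperature, from the results table.
accuracy_by_temperature = {0.2: 94, 0.7: 87, 1.2: 76}

expected_errors = {
    temperature: REVIEWS_PER_WEEK * (100 - accuracy) // 100
    for temperature, accuracy in accuracy_by_temperature.items()
}

for temperature, errors in expected_errors.items():
    print(f"Temperature {temperature}: about {errors} misclassified reviews per week")
```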

Hands-On: Google Colab Setup

Getting Started with Generative AI APIs

Today's Lab: You'll use Google Colab to interact with Anthropic's Claude API and see tokenization, temperature, and costs in action.

Setup Steps

Access Colab: Go to colab.research.google.com
Install Library:
!pip install anthropic
Import and Configure:
import anthropic
client = anthropic.Anthropic(api_key="your-key-here")
Run First Query: Send a simple prompt and examine token usage

Experiments You'll Perform

Experiment 1: Same prompt, different temperatures (0.2, 0.7, 1.0; the Anthropic API accepts values up to 1.0)
Experiment 2: Measure token counts for various text lengths
Experiment 3: Calculate actual costs for realistic business scenarios
Experiment 4: Test hallucination detection on factual questions

API Key Safety

Never share your API key publicly
Don't commit keys to GitHub or other version control
Use environment variables in production
Monitor usage to prevent unexpected costs
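One way to follow the environment-variable advice is sketched below. The helper name `load_api_key` is made up for this example; `ANTHROPIC_API_KEY` is the variable name the `anthropic` library looks for by default when no key is passed to the client.

```python
import os


def load_api_key(env_var="ANTHROPIC_API_KEY"):
    """Read the API key from an environment variable instead of the source code."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key


# Typical use (not executed here because it needs a real key):
# client = anthropic.Anthropic(api_key=load_api_key())
```

Keeping the key out of the notebook means it cannot leak when the notebook is shared or committed to version control.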

Connection to Assessment 2

Generative AI Startup Project (35%)

Due Week 9 - How this week's content directly applies to your group project

Assessment Requirements You Can Now Address

Technical Requirements

Model Selection: Choose appropriate model based on task complexity
Cost Estimation: Calculate monthly operating costs using token analysis
Temperature Settings: Justify your choice (0.2 for factual, 0.7+ for creative)
Token Limits: Design prompts within context window constraints

Business Justification

ROI Analysis: Cost savings vs. AI API costs
Accuracy Metrics: Expected error rates and mitigation
Scalability: How costs scale with usage
Limitations: Acknowledge hallucination risks and solutions

Example Startup Ideas Using Week 7 Concepts

AutoReview: Automated product review sentiment analysis (use temperature 0.2 for consistency)
LegalDraft: Contract template generation (acknowledge hallucination risks, require human review)
CodeHelper: Programming assistant for business analysts (calculate token costs for code generation)
InsightWriter: Automated business report generation from data (use embeddings for semantic search of past reports)

Ethical Considerations & Best Practices

Responsible Use of Generative AI

Critical Principles

Always Verify: Never trust AI-generated facts without verification
Disclose AI Use: Be transparent when content is AI-generated or AI-assisted
Maintain Accountability: You are responsible for outputs, not the AI
Respect Privacy: Never input confidential or personally identifiable information

Data Privacy Risks

Never Input Into AI

Customer personal information (names, emails, phone numbers)
Financial data (credit cards, bank accounts)
Proprietary business strategies
Confidential client information
Employee personal records

Safe to Use

Anonymized, aggregated data
Public information
General business questions
Hypothetical scenarios
Sample data for testing

Academic Integrity (Important for Assessments)

Permitted: Using AI to understand concepts, generate ideas, check code
Required: Documenting AI usage, showing your intellectual contribution
Prohibited: Submitting AI-generated work as entirely your own without disclosure
Best Practice: Include an "AI Usage Statement" explaining how and why you used AI

Week 7 Key Takeaways

Essential Concepts

Generative AI uses transformer architecture you've already seen (Week 3 TFT), but generates text by predicting next tokens instead of numerical forecasts
Tokenization converts text to numbers: ~1 token per 0.75 words, critical for understanding costs and limits
Embeddings represent meaning: 768+ dimensional vectors allow mathematical operations on language
Training scale is massive: Trillions of tokens, months of training, millions in cost
Pattern learning through frequency: Models learn what words typically follow others across billions of examples
Temperature controls output: 0.2 for factual/consistent, 0.7+ for creative/varied
Costs are token-based: Input and output priced separately, typically $0.03-$0.06 per 1K tokens for GPT-4
Hallucinations are real: 15-20% error rate on factual questions, always verify important information
API usage is measurable: You can track exact token counts and costs for every request
Ethical use requires vigilance: Verify facts, respect privacy, disclose AI assistance, maintain accountability

For Assessment 2

Choose model based on task complexity and cost constraints
Calculate realistic operating costs using token analysis
Set appropriate temperature for your use case
Acknowledge and mitigate hallucination risks
Document all assumptions and limitations

Next Week: Prompt Engineering (Week 8)

Building on Generative AI Fundamentals

Now that you understand how generative AI works (tokens, embeddings, temperature, patterns), next week you'll learn how to use it effectively through prompt engineering.

Week 8 Preview: What You'll Learn

Prompt Structure: Role, context, task, format, constraints, examples
Techniques: Zero-shot, few-shot, chain-of-thought prompting
Frameworks: CRAFT and other structured approaches
Business Applications: Data analysis, report generation, code assistance
Quality Control: Evaluating and iterating on prompts
Integration: Incorporating AI into analytics workflows

This Week's Foundation

Understanding tokenization
Knowing temperature effects
Recognizing hallucinations
Calculating costs

Next Week's Application

Designing effective prompts
Choosing optimal settings
Preventing errors
Maximizing value

Preparation for Next Week

Complete the Google Colab exercises from today
Experiment with different temperatures on the same prompt
Start thinking about your Assessment 2 project use case
Practice estimating token counts for business documents