
Week 7: Generative AI

DATA5000 - Artificial Intelligence Programming in Business Analytics

Understanding how AI generates content from data

From prediction to creation: The technology behind ChatGPT, Claude, and Gemini

Where We Are in DATA5000

Week | Topic | Key Learning | Data Type
1-2 | Predictive ML | Pattern recognition in structured data | Numbers, categories
3 | Deep Learning & Transformers | Neural networks with attention mechanism | Time series, sequences
4-6 | Causal AI | Understanding cause and effect | Treatment effects, interventions
7 | Generative AI | Creating new content from patterns | Text, images, code
8 | Prompt Engineering | Controlling AI outputs effectively | Instructions, context

Today's Focus: Moving from analyzing data to generating new content based on learned patterns

Recap: Transformers from Week 3

What You Already Know

In Week 3, you used Temporal Fusion Transformers (TFT) to forecast Australia's inflation rate.

What TFT Did

  • Analyzed multiple time series together
  • Used attention to focus on important time periods
  • Predicted future inflation values
  • Provided confidence intervals

Key Technology

  • Attention Mechanism: Weighs importance of different inputs
  • Sequential Processing: Understands order and context
  • Pattern Learning: Finds relationships in historical data

Sample Data from Week 3

Quarter | GDP Growth | Unemployment | Interest Rate | Inflation
2023-Q1 | 2.3% | 3.5% | 3.6% | 7.8%
2023-Q2 | 1.8% | 3.6% | 4.1% | 6.0%

TFT Output: Predicted inflation for Q3 with 95% confidence: 5.2% ± 0.8%

The Evolution to Generative AI

Week 2-3

Neural Networks

Predict single output

Example: House price

Input: 5 features → Output: $450,000

Week 3

Transformers

Predict sequences

Example: Inflation forecast

Input: Time series → Output: Next 4 quarters

Week 7

Generative AI

Generate new content

Example: Business report

Input: Prompt → Output: Full text/code

Key Insight: Generative AI uses the same transformer architecture you've already seen, but applies it to predict the next word in a sequence rather than the next number. By repeatedly predicting "what comes next," it can generate entire documents, code, or analyses.

What is Generative AI?

Generative AI creates new content by learning patterns from massive datasets, then using those patterns to generate text, images, code, or other outputs that appear human-created.

Three Core Capabilities

  1. Pattern Recognition: Learns from billions of examples
  2. Context Understanding: Comprehends relationships between words and concepts
  3. Creative Generation: Produces new content following learned patterns

Traditional AI (Predictive)

Question: "Will customer X churn?"

Output: 0.73 (73% probability)

Fixed, numerical output

Generative AI

Question: "Why might customer X churn?"

Output: "Based on usage patterns, customer X shows declining engagement over 3 months, with support tickets about pricing. Recommend targeted retention offer..."

Flexible, contextual explanation

How Text Becomes Data: Tokenization

Tokenization is the process of breaking text into smaller units (tokens) that can be converted into numbers for the AI model to process.

Why Computers Need Numbers

Step 1: Original Text
"Customer satisfaction increased by 15% this quarter."
Step 2: Break into Tokens
["Customer", "satisfaction", "increased", "by", "15", "%", "this", "quarter", "."]

Result: 9 tokens

Step 3: Convert to Token IDs (numbers)
[2456, 8901, 1234, 89, 45, 67, 312, 5678, 91]

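Each token gets a unique number from the model's vocabulary. The IDs above are illustrative; production tokenizers also split uncommon words into smaller subword pieces.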
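
The three steps above can be sketched in a few lines of Python. This is a toy word-level tokenizer, not the subword tokenizer a production model uses. The regular expression and the `build_vocab` helper are hypothetical, and the IDs it assigns will not match the illustrative IDs on the slide.

```python
import re

def tokenize(text):
    """Split text into words, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

text = "Customer satisfaction increased by 15% this quarter."
tokens = tokenize(text)                      # Step 2: break into tokens
vocab = build_vocab(tokens)                  # toy vocabulary for this sentence only
token_ids = [vocab[tok] for tok in tokens]   # Step 3: convert to token IDs

print(tokens)
print(len(tokens), "tokens")
print(token_ids)
```

A real vocabulary is fixed before training and holds tens of thousands of entries, so the same token always maps to the same ID across documents.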

Tokenization: Real Examples with Data

Understanding Token Counts

Rule of Thumb: 1 token ≈ 0.75 words in English

Or approximately: 100 words = 133 tokens
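
The rule of thumb can be turned into a quick estimator. This is a minimal sketch that assumes plain English text; the function name `estimate_tokens` is hypothetical, and an actual tokenizer will give different counts for technical or informal text.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the slide, English text only

def estimate_tokens(word_count, words_per_token=WORDS_PER_TOKEN):
    """Rough token estimate from a word count."""
    return round(word_count / words_per_token)

for words in (100, 150, 300, 500):
    print(f"{words} words -> about {estimate_tokens(words)} tokens")
```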

Business Document Token Analysis

Document Type | Word Count | Token Count | Notes
Short email | 150 words | ~200 tokens | Simple vocabulary
Business memo | 500 words | ~667 tokens | Professional language
Technical report | 2,000 words | ~2,800 tokens | Complex terminology, more tokens per word
Customer chat log | 100 words | ~150 tokens | Casual language, abbreviations

Why This Matters for Business

Embeddings: Converting Tokens to Vectors

Embeddings convert each token into a vector (list of numbers) that captures its meaning in mathematical space. Similar words have similar vectors.

From Token IDs to Meaningful Representations

Example: Token "Customer" (ID: 2456)

Embedding vector (768 dimensions): [0.23, -0.45, 0.67, 0.12, -0.89, 0.34, ..., 0.45, -0.23, 0.78]

Example: Token "Client" (ID: 3892)

Embedding vector (768 dimensions): [0.25, -0.43, 0.69, 0.15, -0.87, 0.36, ..., 0.47, -0.21, 0.76]

Notice: Similar meanings = similar numbers!

Why 768 Dimensions?

Visualizing Word Relationships in Vector Space

Similar Words Cluster Together

When we reduce 768 dimensions to 3D for visualization, words with similar meanings appear close together:

Business Financial Terms - Distance in Vector Space

Word Pair | Cosine Similarity | Relationship
"revenue" ↔ "income" | 0.87 | Very similar
"profit" ↔ "revenue" | 0.72 | Related concepts
"profit" ↔ "loss" | 0.41 | Opposite but related
"profit" ↔ "customer" | 0.23 | Weakly related
"profit" ↔ "bicycle" | 0.05 | Unrelated

Scale: 1.0 = identical direction, 0.0 = unrelated (cosine similarity can also go negative, down to -1.0, for vectors pointing in opposite directions)
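
Cosine similarity itself is a short formula: the dot product of two vectors divided by the product of their lengths. The sketch below applies it to only the first six values shown earlier for "Customer" and "Client", so the result is an illustration of the calculation, not the similarity of the full 768-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First six values of the example vectors (truncated for illustration)
customer = [0.23, -0.45, 0.67, 0.12, -0.89, 0.34]
client = [0.25, -0.43, 0.69, 0.15, -0.87, 0.36]

print(f"customer vs client: {cosine_similarity(customer, client):.3f}")
print(f"orthogonal vectors: {cosine_similarity([1.0, 0.0], [0.0, 1.0]):.3f}")
```

Because the two truncated vectors differ only slightly in each position, their similarity comes out very close to 1.0.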

Business Applications

Training Data at Scale

The Foundation of Generative AI

Generative AI models learn by analyzing massive amounts of text data from books, websites, articles, and code repositories. The scale is unprecedented in computing history.

Training Dataset Statistics

Model | Training Tokens | Equivalent Books | Training Cost
GPT-3 (2020) | 300 billion | ~600,000 books | ~$4.6 million
GPT-4 (2023) | ~13 trillion | ~26 million books | ~$100 million (estimated)
Claude 3 (2024) | Similar scale | Tens of millions of books | Similar magnitude

For Context: One book ≈ 500,000 tokens (about 375,000 words)
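
The "Equivalent Books" column follows directly from this conversion. A minimal check, assuming the slide's figure of 500,000 tokens per book (the helper name `book_equivalents` is hypothetical):

```python
TOKENS_PER_BOOK = 500_000  # the slide's assumption

def book_equivalents(training_tokens, tokens_per_book=TOKENS_PER_BOOK):
    """Convert a training-token count into a number of book-sized units."""
    return training_tokens // tokens_per_book

print(f"GPT-3: {book_equivalents(300_000_000_000):,} books")
print(f"GPT-4 (estimated): {book_equivalents(13_000_000_000_000):,} books")
```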

What the Training Data Includes

Training Process: Real Numbers

The Computational Challenge

GPT-4 Training Estimated Specifications

Number of GPUs: 25,000+ NVIDIA A100s
Training Duration: 90-100 days (continuous)
Power Consumption: ~50 megawatts (like a small city)
Electricity Cost: ~$10 million
Total Training Cost: $100+ million

Why Understanding Scale Matters for Business

  • You're using a tool that cost millions to create
  • Using the API is far cheaper than building your own model
  • Understanding limits helps you use it effectively
  • Explains why some queries take longer than others

Pattern Learning: How Models Understand Language

Learning from Billions of Examples

During training, the model learns patterns by analyzing how words appear together millions of times across different contexts. It builds statistical understanding of language structure.

Examples of Learned Patterns

Pattern 1: Business Context

Learned: "quarterly revenue" is often followed by:

  • "increased" (positive context)
  • "decreased" (negative context)
  • "reached" (neutral reporting)

Model assigns probabilities based on training frequency

Pattern 2: Sentence Structure

Learned: After "The customer", likely words:

  • Verbs: "purchased", "requested", "complained"
  • Less likely: adjectives or prepositions
  • Grammar rules emerge from examples

No explicit grammar rules programmed

Training Example Frequency

Pattern: "The company reported [X] earnings"

  • Seen "strong earnings": 487,293 times
  • Seen "weak earnings": 142,847 times
  • Seen "purple earnings": 0 times

Result: Model learns "strong" is more likely than "purple" in this context
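
The link between frequency and probability can be illustrated by normalizing the counts above. This is a simplification: a real model does not store counts but learns parameters that produce probabilities over its whole vocabulary, and only the three candidate words from the slide are considered here.

```python
# Counts from the slide for "The company reported [X] earnings"
counts = {"strong": 487_293, "weak": 142_847, "purple": 0}

def to_probabilities(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

probs = to_probabilities(counts)
for word, p in probs.items():
    print(f"{word}: {p:.3f}")
```

Among these three candidates, "strong" receives roughly 77% of the probability and "purple" receives none.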

Next Token Prediction with Probabilities

The Core Mechanism

Generative AI works by repeatedly predicting the next most likely token based on all previous tokens. This simple process, when repeated, creates coherent text.

Example: Completing a Business Sentence

Input Context: "The quarterly revenue"

  • increased: 45%
  • decreased: 22%
  • reached: 18%
  • exceeded: 10%
  • remained: 5%

Selected Token: "increased" (highest probability)

Next Step: Model now predicts next token after "The quarterly revenue increased"

This process repeats until a complete response is generated
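
The repeat-until-done loop can be sketched with a lookup table standing in for the model. Only the first table entry uses the slide's probabilities; the later entries and the `<end>` marker are made up for illustration, and a real model computes these probabilities with a neural network instead of looking them up.

```python
# Toy next-token table. The first entry uses the slide's probabilities;
# the remaining entries are hypothetical.
NEXT_TOKEN_PROBS = {
    "The quarterly revenue": {
        "increased": 0.45, "decreased": 0.22, "reached": 0.18,
        "exceeded": 0.10, "remained": 0.05,
    },
    "The quarterly revenue increased": {"by": 0.70, "sharply": 0.30},
    "The quarterly revenue increased by": {"15%": 0.60, "10%": 0.40},
    "The quarterly revenue increased by 15%": {"<end>": 1.0},
}

def generate_greedy(prompt, table, max_tokens=10):
    """Repeatedly append the highest-probability next token."""
    text = prompt
    for _ in range(max_tokens):
        probs = table.get(text)
        if probs is None:          # no prediction available for this context
            break
        next_token = max(probs, key=probs.get)
        if next_token == "<end>":  # the model signals the response is complete
            break
        text = f"{text} {next_token}"
    return text

print(generate_greedy("The quarterly revenue", NEXT_TOKEN_PROBS))
```

Always taking the top token is called greedy decoding; sampling from the probabilities instead is what the temperature setting controls.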

Architecture: Building on Transformers

Same Foundation, Different Application

Week 3: Temporal Fusion Transformer

Purpose: Predict future numerical values

Input: Time series data (GDP, unemployment, etc.)

Output: Next quarter's inflation rate

Attention: Which past time periods are important?

Week 7: Generative AI (GPT/Claude)

Purpose: Generate text, code, analysis

Input: Text prompt (tokenized)

Output: Next token (repeated for full text)

Attention: Which previous words are important?

Key Architectural Components

  1. Input Embedding Layer: Converts tokens to 768+ dimensional vectors
  2. Multiple Transformer Blocks: Each with self-attention and feed-forward networks
    • GPT-3: 96 transformer blocks
    • GPT-4: 120+ transformer blocks (estimated)
  3. Output Layer: Converts final hidden state to probability distribution over all possible next tokens

Processing Flow: Text → Tokens → Embeddings → 96+ Transformer Layers → Next Token Probabilities → Select Token → Repeat

Self-Attention Mechanism in Action

Understanding Context and Relationships

Self-Attention allows each word to "look at" other words in the input to understand context. This is how AI understands that "Apple" in "Apple stock rose" refers to the company, not the fruit.

Example: Business Context Understanding

Sentence: "Apple stock rose despite market concerns"

Word | Attends Most To | Attention Weight | Why?
Apple | stock | 0.85 | Determines it's company context
stock | Apple, rose | 0.78, 0.62 | Subject and action relationship
rose | stock, despite | 0.82, 0.43 | Main action and contrast
despite | rose, concerns | 0.71, 0.69 | Contrast marker
market | concerns | 0.88 | Modifies concerns
concerns | market, rose | 0.83, 0.47 | Type and contrast
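
The weights in the table are illustrative. In a transformer they come from scaled dot-product attention: each word's query vector is compared with every word's key vector, and a softmax turns the scores into weights that sum to 1 per word. The sketch below uses hypothetical 2-dimensional vectors and reuses the same vectors as queries and keys, whereas a real model applies learned projection matrices first.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    return weights

tokens = ["Apple", "stock", "rose"]
# Hypothetical vectors: "Apple" and "stock" point in similar directions
vectors = [[1.0, 0.2], [0.9, 0.4], [0.1, 1.0]]

weights = attention_weights(vectors, vectors)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

With these vectors, "Apple" places more weight on "stock" than on "rose", which mirrors the company-context example.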

Multiple Attention Heads

Quick Check: Understanding Tokenization

A business email contains 300 words. Approximately how many tokens will this require for processing in a generative AI model?
A) 225 tokens (using 1 token = 1.33 words)
B) 400 tokens (using 1 token ≈ 0.75 words)
C) 300 tokens (using 1 token = 1 word exactly)
D) 600 tokens (using 1 token = 0.5 words)

Why This Matters

At $0.03 per 1,000 input tokens (GPT-4 pricing):

  • 300 words ≈ 400 tokens = $0.012 per email
  • 10,000 emails = $120 processing cost
  • Understanding token counts helps estimate AI costs accurately

Model Parameters & Size

What Makes Models "Large"

Parameters are the numbers the model adjusts during training to learn patterns. More parameters generally means more capacity to learn complex patterns, but also higher costs.

Major Model Comparison

Model | Parameters | Release | Context Window | Relative Speed
GPT-3 | 175 billion | 2020 | 2,048 tokens | Fast
GPT-3.5 | 175 billion (not officially confirmed) | 2022 | 4,096 tokens | Very fast
GPT-4 | ~1.8 trillion (estimated) | 2023 | 8,192 tokens at launch; 128,000 with GPT-4 Turbo | Slower
Claude 3 Opus | Unknown (comparable) | 2024 | 200,000 tokens | Medium
Claude 3.5 Sonnet | Unknown | 2024 | 200,000 tokens | Fast
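
The context window limits how much text fits in one request. A rough pre-check, assuming the prompt and the generated output share the same window and using the 0.75 words-per-token rule of thumb (the helper `fits_context` is hypothetical):

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb, English text only

def fits_context(prompt_words, max_output_tokens, context_window):
    """Estimate whether prompt plus requested output fits in the window."""
    prompt_tokens = round(prompt_words / WORDS_PER_TOKEN)
    return prompt_tokens + max_output_tokens <= context_window

# A 2,000-word report with a 1,024-token reply in a 4,096-token window
print(fits_context(2_000, 1_024, 4_096))
# A 3,000-word report no longer fits the same window
print(fits_context(3_000, 1_024, 4_096))
# The same report fits easily in a 200,000-token window
print(fits_context(3_000, 1_024, 200_000))
```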

Size vs. Performance Trade-offs

Temperature Settings & Output Control

Controlling Randomness in Generation

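Temperature is a parameter that controls how deterministic vs. creative the model's outputs are by changing how the model samples from its probability distribution. The allowed range depends on the provider: OpenAI's API accepts 0.0 to 2.0, while Anthropic's API accepts 0.0 to 1.0.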

Low Temperature (0.0 - 0.3)

Effect: Strongly favors the highest-probability token (at 0.0 it effectively always picks it)

Output: Consistent, predictable, focused

Best For:

  • Data analysis
  • Code generation
  • Factual responses
  • Classification tasks

High Temperature (0.7 - 1.5)

Effect: Samples more randomly from probabilities

Output: Creative, varied, exploratory

Best For:

  • Creative writing
  • Brainstorming
  • Multiple perspectives
  • Marketing copy

Temperature in Action

Prompt: "The quarterly revenue"

Temperature | Next Token | Explanation
0.0 | "increased" (45%) | Always picks highest probability
0.7 | "decreased" (22%) | Occasionally picks 2nd or 3rd option
1.5 | "remained" (5%) | Even low-probability options possible
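
One common way to implement temperature is to raise each probability to the power 1/T and renormalize, which is equivalent to dividing the model's logits by T before the softmax. The sketch below applies that to the slide's five probabilities. Treating temperature 0 as "always take the top token" is a convention assumed here, and the function names are hypothetical.

```python
import random

probs = {"increased": 0.45, "decreased": 0.22, "reached": 0.18,
         "exceeded": 0.10, "remained": 0.05}

def apply_temperature(probs, temperature):
    """Sharpen (T < 1) or flatten (T > 1) a probability distribution."""
    if temperature == 0:
        best = max(probs, key=probs.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in probs}
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}

def sample_token(probs, temperature, rng=random):
    """Draw one token from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    tokens = list(adjusted)
    return rng.choices(tokens, weights=[adjusted[t] for t in tokens], k=1)[0]

for t in (0.0, 0.2, 0.7, 1.0, 1.5):
    adjusted = apply_temperature(probs, t)
    print(t, {tok: round(p, 3) for tok, p in adjusted.items()})
```

At low temperature nearly all probability moves to "increased"; at 1.5 the distribution flattens, so "remained" is chosen more often than its original 5%.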

Temperature Examples: Same Prompt, Different Outputs

Business Email Generation

Prompt: "Write a brief email to the team about Q3 revenue results"

Temperature = 0.2 (Focused)

Subject: Q3 Revenue Results

Team,

Q3 revenue reached $2.4M, representing a 15% increase over Q2. Strong performance in enterprise sales (+28%) offset declines in small business segment (-8%). We remain on track for annual targets.

Key drivers: New product launch, expanded sales team, improved customer retention.

Thank you for your continued efforts.

Temperature = 0.9 (Creative)

Subject: Q3 Wins & Learnings 🎯

Hey everyone,

Exciting news on the Q3 front! We've hit $2.4M in revenue - that's a solid 15% jump from last quarter. Our enterprise team absolutely crushed it with 28% growth, while our small business segment gave us some valuable lessons (down 8% but rich with insights).

The new product launch? Game-changer. Our expanded sales squad? Bringing the energy. Customer retention? Better than ever.

Let's keep this momentum going. Proud of this team!

Cost Calculations: Real Business Impact

Understanding AI API Pricing

Pricing Model: Most AI APIs charge separately for input tokens (what you send) and output tokens (what you receive)

GPT-4 Pricing (January 2025)

Token Type | Cost per 1,000 tokens | Typical Use
Input (Prompt) | $0.03 | Your questions, context, data
Output (Generated) | $0.06 | AI's responses
Cached Input | $0.015 | Repeated context (50% discount)

Example: Customer Service Chatbot

Monthly Conversations: 10,000
Average Input per Chat: 500 tokens (375 words)
Average Output per Chat: 300 tokens (225 words)
Input Cost: 10,000 × 500 × $0.03/1000 = $150
Output Cost: 10,000 × 300 × $0.06/1000 = $180
Total Monthly Cost: $330

ROI Consideration: If each conversation saves 5 minutes of human agent time ($0.50 labor cost), monthly savings = $5,000. Net benefit = $4,670/month
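
The chatbot arithmetic generalizes to a small helper. The sketch below reproduces the figures above using the per-1,000-token rates from the pricing table; the function name `monthly_cost` and its parameters are hypothetical, and the rates should be replaced with current prices for whichever model is used.

```python
def monthly_cost(conversations, input_tokens, output_tokens,
                 input_rate=0.03, output_rate=0.06):
    """Return (input cost, output cost, total) in USD.

    Rates are USD per 1,000 tokens, taken from the pricing table in this deck.
    """
    input_cost = conversations * input_tokens * input_rate / 1000
    output_cost = conversations * output_tokens * output_rate / 1000
    return input_cost, output_cost, input_cost + output_cost

input_cost, output_cost, total = monthly_cost(10_000, 500, 300)
labor_savings = 10_000 * 0.50          # $0.50 of agent time saved per chat
net_benefit = labor_savings - total

print(f"Input cost:  ${input_cost:,.2f}")
print(f"Output cost: ${output_cost:,.2f}")
print(f"Total cost:  ${total:,.2f}")
print(f"Net benefit: ${net_benefit:,.2f}")
```

Output tokens cost twice as much as input tokens at these rates, so shortening responses reduces cost faster than shortening prompts.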

API Call Structure & Token Usage

What Happens Behind the Scenes

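    import anthropic

    # Prices in USD per 1,000 tokens. The defaults are the GPT-4 rates quoted
    # in this deck; substitute the current rates for the model you call.
    def calculate_cost(input_tokens, output_tokens, input_rate=0.03, output_rate=0.06):
        return input_tokens / 1000 * input_rate + output_tokens / 1000 * output_rate

    client = anthropic.Anthropic(api_key="your-api-key")

    # Make API call with specific parameters
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,     # upper limit on generated tokens
        temperature=0.7,     # sampling randomness
        messages=[
            {"role": "user", "content": "Analyze this customer churn data..."}
        ],
    )

    # Access response and usage metrics
    response_text = message.content[0].text
    input_tokens = message.usage.input_tokens
    output_tokens = message.usage.output_tokens

    print(f"Response: {response_text}")
    print(f"Input tokens: {input_tokens}")
    print(f"Output tokens: {output_tokens}")
    print(f"Cost: ${calculate_cost(input_tokens, output_tokens):.4f}")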

Key Parameters You Control

Quick Check: Cost Estimation

Your business needs to analyze 5,000 customer reviews per month. Each review is 200 words and you need a 150-word summary response for each. Using GPT-4 pricing ($0.03 input / $0.06 output per 1K tokens), what's the approximate monthly cost?
A) $50
B) $100
C) $160
D) $300

Calculation Breakdown

  • Input: 200 words × 1.33 = 267 tokens per review
  • Output: 150 words × 1.33 = 200 tokens per summary
  • Input cost: 5,000 × 267 × $0.03/1,000 = $40
  • Output cost: 5,000 × 200 × $0.06/1,000 = $60
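  • Total: $40 + $60 = $100, so the answer is B
  • In production, retries, system prompts, and longer-than-planned responses add overhead, so budget above this base figure; option C ($160) would be a padded estimate, not the calculated cost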

Limitations: Hallucinations

When AI Generates False Information

Hallucination: When the model generates information that sounds plausible but is factually incorrect, not supported by training data, or fabricated.

Why Hallucinations Occur

  1. Pattern Completion: Model is trained to generate plausible text, not necessarily true text
  2. No External Verification: Model doesn't fact-check against real-time sources
  3. Training Data Gaps: Missing or outdated information leads to educated guesses
  4. Overfitting Patterns: Model may combine patterns incorrectly

Hallucination Examples in Business Context

Category | Hallucination Example | Risk Level
Statistics | "Studies show 73% of customers prefer..." (no such study exists) | High
Citations | "According to Smith et al. (2023)..." (paper doesn't exist) | High
Product Features | "This software includes blockchain integration" (it doesn't) | Medium
Company Details | "ABC Corp was founded in 1995" (actually 1998) | Medium
Technical Specs | "The API supports 10,000 requests/sec" (actual limit: 1,000) | High

Measured Hallucination Rates

Hallucination Detection: Practice Exercise

Which of These AI Outputs Contain Hallucinations?

Scenario: You asked AI to summarize your company's Q3 performance. Review these 5 statements:

AI-Generated Statements

  1. Statement A: "Q3 revenue increased 15% compared to Q2, driven primarily by enterprise sales growth of 28%."
    ✓ CORRECT - Can be verified against actual data
  2. Statement B: "According to the Harvard Business Review 2024 study, companies with similar growth patterns have a 78% chance of sustained growth."
    ✗ HALLUCINATION - Fabricated citation, specific statistic
  3. Statement C: "The small business segment declined 8% due to increased competition and pricing pressure."
    ✓ CORRECT - If supported by your data
  4. Statement D: "Industry analysts predict your company will reach $10M revenue by Q4 based on current trajectory."
    ✗ HALLUCINATION - Specific prediction without basis
  5. Statement E: "Customer retention improved from 82% to 89% following the new product launch."
    ⚠ CHECK - Verify exact numbers against your metrics

Red Flags for Hallucinations

  • Specific statistics without clear sources
  • Citations to studies, papers, or reports you can't verify
  • Precise predictions about future performance
  • Technical specifications that seem too detailed
  • Confident assertions about very specific details

Real Business Application: Customer Feedback Analysis

TeleConnect Customer Review Categorization

Business Problem: TeleConnect receives 1,000 customer reviews weekly. Manual categorization takes 5 minutes per review. Can generative AI automate this accurately?

Sample Reviews (from actual dataset)

Review ID | Customer Review Text | Length
001 | "The service is reliable but customer support response time is terrible. Waited 3 days for callback." | 92 words
002 | "Love the new mobile app features! Much easier to manage my account now." | 48 words
003 | "Pricing is too high compared to competitors. Considering switching despite good service quality." | 67 words

Analysis Results: Temperature Comparison

Setting | Temperature | Accuracy | Consistency | Speed
Configuration 1 | 0.2 | 94% | Very High | ~50 tokens/sec
Configuration 2 | 0.7 | 87% | Medium | ~50 tokens/sec
Configuration 3 | 1.2 | 76% | Low | ~50 tokens/sec

Conclusion: Temperature 0.2 optimal for classification tasks requiring consistency
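
The slide does not say how accuracy and consistency were measured. One plausible approach, sketched here under that assumption, is to classify each review several times, compare one run against hand-labeled categories for accuracy, and count how many reviews received the same label in every run for consistency. The category names and run results below are hypothetical, not TeleConnect data.

```python
def accuracy(predictions, truth):
    """Share of predictions that match the hand-labeled category."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)

def consistency(runs):
    """Share of items that received the same label in every run."""
    per_item = list(zip(*runs))
    stable = sum(len(set(labels)) == 1 for labels in per_item)
    return stable / len(per_item)

# Hypothetical labels for three reviews, classified three times
truth = ["support", "app", "pricing"]
runs = [
    ["support", "app", "pricing"],
    ["support", "app", "service"],   # third review drifted in this run
    ["support", "app", "pricing"],
]

print(f"Accuracy of run 2: {accuracy(runs[1], truth):.2f}")
print(f"Consistency across runs: {consistency(runs):.2f}")
```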

Hands-On: Google Colab Setup

Getting Started with Generative AI APIs

Today's Lab: You'll use Google Colab to interact with Anthropic's Claude API and see tokenization, temperature, and costs in action.

Setup Steps

  1. Access Colab: Go to colab.research.google.com
  2. Install Library:
    !pip install anthropic
  3. Import and Configure:
    import anthropic
    client = anthropic.Anthropic(api_key="your-key-here")
  4. Run First Query: Send a simple prompt and examine token usage

Experiments You'll Perform

API Key Safety

  • Never share your API key publicly
  • Don't commit keys to GitHub or other version control
  • Use environment variables in production
  • Monitor usage to prevent unexpected costs
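
Reading the key from an environment variable keeps it out of notebooks and repositories. A minimal sketch: `ANTHROPIC_API_KEY` is the variable name the Anthropic Python SDK looks for by default, while the helper `load_api_key` is hypothetical.

```python
import os

def load_api_key(env=None, name="ANTHROPIC_API_KEY"):
    """Read the API key from the environment instead of hardcoding it."""
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key

# Usage (requires the anthropic package and a real key):
# import anthropic
# client = anthropic.Anthropic(api_key=load_api_key())
```

Failing early with a clear message is easier to debug than an authentication error raised later by the API.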

Connection to Assessment 2

Generative AI Startup Project (35%)

Due Week 9 - How this week's content directly applies to your group project

Assessment Requirements You Can Now Address

Technical Requirements

  • Model Selection: Choose appropriate model based on task complexity
  • Cost Estimation: Calculate monthly operating costs using token analysis
  • Temperature Settings: Justify your choice (0.2 for factual, 0.7+ for creative)
  • Token Limits: Design prompts within context window constraints

Business Justification

  • ROI Analysis: Cost savings vs. AI API costs
  • Accuracy Metrics: Expected error rates and mitigation
  • Scalability: How costs scale with usage
  • Limitations: Acknowledge hallucination risks and solutions

Example Startup Ideas Using Week 7 Concepts

Ethical Considerations & Best Practices

Responsible Use of Generative AI

Critical Principles

  1. Always Verify: Never trust AI-generated facts without verification
  2. Disclose AI Use: Be transparent when content is AI-generated or AI-assisted
  3. Maintain Accountability: You are responsible for outputs, not the AI
  4. Respect Privacy: Never input confidential or personally identifiable information

Data Privacy Risks

Never Input Into AI

  • Customer personal information (names, emails, phone numbers)
  • Financial data (credit cards, bank accounts)
  • Proprietary business strategies
  • Confidential client information
  • Employee personal records

Safe to Use

  • Anonymized, aggregated data
  • Public information
  • General business questions
  • Hypothetical scenarios
  • Sample data for testing

Academic Integrity (Important for Assessments)

Week 7 Key Takeaways

Essential Concepts

  1. Generative AI uses the transformer architecture you've already seen (Week 3 TFT), but generates text by predicting the next token instead of producing numerical forecasts
  2. Tokenization converts text to numbers: ~1 token per 0.75 words, critical for understanding costs and limits
  3. Embeddings represent meaning: 768+ dimensional vectors allow mathematical operations on language
  4. Training scale is massive: Trillions of tokens, months of training, millions in cost
  5. Pattern learning through frequency: Models learn what words typically follow others across billions of examples
  6. Temperature controls output: 0.2 for factual/consistent, 0.7+ for creative/varied
  7. Costs are token-based: Input and output priced separately, typically $0.03-$0.06 per 1K tokens for GPT-4
  8. Hallucinations are real: 15-20% error rate on factual questions, always verify important information
  9. API usage is measurable: You can track exact token counts and costs for every request
  10. Ethical use requires vigilance: Verify facts, respect privacy, disclose AI assistance, maintain accountability

For Assessment 2

Next Week: Prompt Engineering (Week 8)

Building on Generative AI Fundamentals

Now that you understand how generative AI works (tokens, embeddings, temperature, patterns), next week you'll learn how to use it effectively through prompt engineering.

Week 8 Preview: What You'll Learn

  1. Prompt Structure: Role, context, task, format, constraints, examples
  2. Techniques: Zero-shot, few-shot, chain-of-thought prompting
  3. Frameworks: CRAFT and other structured approaches
  4. Business Applications: Data analysis, report generation, code assistance
  5. Quality Control: Evaluating and iterating on prompts
  6. Integration: Incorporating AI into analytics workflows

This Week's Foundation

  • Understanding tokenization
  • Knowing temperature effects
  • Recognizing hallucinations
  • Calculating costs

Next Week's Application

  • Designing effective prompts
  • Choosing optimal settings
  • Preventing errors
  • Maximizing value

Preparation for Next Week

  • Complete the Google Colab exercises from today
  • Experiment with different temperatures on the same prompt
  • Start thinking about your Assessment 2 project use case
  • Practice estimating token counts for business documents