Week 10: Reinforcement Learning
Learning Through Trial and Error
DATA5000: Artificial Intelligence and Machine Learning
Learning Outcomes
  • Understand how reinforcement learning differs from supervised and unsupervised learning
  • Identify the key components of RL: agent, environment, state, action, and reward
  • Recognize business applications where RL provides competitive advantage
  • Understand the exploration vs exploitation tradeoff in decision-making
  • Evaluate when to use RL versus traditional machine learning approaches
Quick Recap: Three Pillars of Machine Learning
Learning Type | Question It Answers | Business Example | Has Labels?
Supervised Learning | "What will happen?" | Predict customer churn, forecast sales | ✓ Yes (historical outcomes)
Unsupervised Learning | "What patterns exist?" | Customer segmentation, anomaly detection | ✗ No (find hidden structures)
Reinforcement Learning | "What should I do?" | Dynamic pricing, recommendations | ⭐ Learn from rewards/penalties
The Critical Distinction
  • Supervised Learning looks backward → "What do past outcomes predict?" (learns from historical data with known outcomes)
  • Unsupervised Learning looks inward → "What structure exists in my data?" (finds hidden patterns without labels)
  • Reinforcement Learning looks forward → "What should I do next to maximize long-term value?" (learns optimal actions through trial and error)
RL is fundamentally about sequential decision-making in dynamic environments
Business Motivation: The Dynamic Pricing Challenge
Scenario: Uber's Surge Pricing Algorithm

Every 2 minutes, Uber must decide: Should we increase the price multiplier to 1.8x? Keep it at 1.4x? Lower it to 1.2x?


If price is too high: Customers cancel rides, demand drops (penalty: lost revenue)

If price is too low: Not enough drivers available, long wait times (penalty: customer dissatisfaction)

If price is optimal: Supply meets demand, rides completed efficiently (reward: revenue + satisfaction)


The Problem: No dataset tells us the "correct" price for every moment. The optimal decision depends on constantly changing conditions (traffic, weather, events, competitor pricing).

Why Supervised Learning Falls Short
Challenge | Supervised Learning Limitation | RL Solution
Dynamic Environment | Trained on static historical data; can't adapt to new conditions | Continuously learns from live interactions
No "Correct" Answers | Requires labeled examples of optimal decisions | Discovers good actions through exploration
Sequential Decisions | Treats each prediction independently | Optimizes long-term cumulative reward
Delayed Feedback | Assumes the outcome is visible immediately | Built to account for delayed consequences
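The "long-term cumulative reward" in the table is commonly formalised as a discounted sum of future rewards. The sketch below is illustrative only: the discount factor gamma and the reward numbers are assumptions, not values from the Uber scenario.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backward so each step folds in the discounted value of what follows.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical pricing episode: a small loss now, larger gains later.
print(discounted_return([-1.0, 2.0, 3.0], gamma=0.5))  # -1 + 0.5*2 + 0.25*3 = 0.75
```

A gamma close to 1 makes the agent patient (future rewards count almost fully); a gamma close to 0 makes it short-sighted.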
Reinforcement Learning: Core Components

  • Agent: The decision-maker (e.g., recommendation algorithm, pricing engine, trading bot)
  • Environment: The external system the agent interacts with (e.g., users, market, inventory)
  • State: Current situation/context (e.g., user browsing history, current inventory levels, time of day)
  • Action: Choices available to the agent (e.g., which product to recommend, price to set, route to take)
  • Reward: Feedback signal indicating action quality (e.g., revenue, clicks, user engagement)
  • Policy: The agent's strategy mapping states to actions (e.g., "recommend similar items when user browses category X")

The Reinforcement Learning Loop
Example: Netflix Recommendation System
Step | Component | Details
1 | State | User just watched "Stranger Things" S1, typically watches sci-fi thrillers, watches evenings
2 | Action | Agent recommends "Dark" (German sci-fi thriller) in top 3 position
3 | Environment Response | User clicks on "Dark", watches 3 episodes (2.5 hours engagement)
4 | Reward | +15 points (click: +5, watch time: +10, based on engagement metrics)
5 | New State | User now has "Dark" in watch history, demonstrated preference for foreign sci-fi
6 | Learning | Agent updates policy: "For users who watch Stranger Things, recommending similar international sci-fi yields high engagement"
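One pass through the six steps can be traced in a few lines of code. This is a minimal sketch, not how Netflix implements it: the `RecommenderAgent` class, the learning rate, and the 4-points-per-hour weighting (chosen so that 2.5 hours yields the +10 in the table) are all assumptions made for illustration.

```python
from collections import defaultdict


def engagement_reward(clicked, hours_watched, click_points=5.0, points_per_hour=4.0):
    """Reward for one recommendation; the point weights are assumed, not sourced."""
    return (click_points if clicked else 0.0) + points_per_hour * hours_watched


class RecommenderAgent:
    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.value = defaultdict(float)  # estimated reward for each (state, action)

    def recommend(self, state, candidates):
        # Policy: pick the candidate with the highest estimated reward in this state.
        return max(candidates, key=lambda title: self.value[(state, title)])

    def learn(self, state, action, reward):
        # Nudge the estimate toward the reward that was actually observed.
        key = (state, action)
        self.value[key] += self.learning_rate * (reward - self.value[key])


agent = RecommenderAgent()
state = "watched Stranger Things S1"                          # step 1: state
action = agent.recommend(state, ["Dark", "The Crown"])        # step 2: action
reward = engagement_reward(clicked=True, hours_watched=2.5)   # steps 3-4: response, reward
agent.learn(state, action, reward)                            # step 6: learning
print(action, reward, agent.value[(state, action)])           # Dark 15.0 1.5
```

After this single update the agent's estimate for recommending "Dark" in that state rises from 0 to 1.5, so the same recommendation becomes more likely the next time a similar state occurs.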
Knowledge Check 1
In a reinforcement learning system for automated email marketing, which component represents the REWARD?
  • The email subject line and content being sent
  • User engagement metrics (opens, clicks, conversions) after sending the email
  • The user's demographic profile and past behavior
  • The algorithm that decides when to send emails
Business Application: Recommender Systems
Why Recommender Systems Are Perfect for RL:
  • Sequential Nature: Each recommendation influences future user behavior and preferences
  • Delayed Rewards: A good recommendation today might lead to subscription renewal next month
  • Exploration Needed: Must balance showing proven popular items vs discovering niche preferences
  • Dynamic Preferences: User tastes evolve over time; RL adapts continuously
Reported Impact (widely cited industry figures):
  • Amazon: roughly 35% of revenue is commonly attributed to its recommendation engine (not all of it RL-based)
  • Netflix: 80% of watched content comes from recommendations
  • Spotify: Discover Weekly increased user retention by 24%
Data-Driven Example: E-commerce Product Recommendations
  • 127K user interactions
  • 15 product categories
  • $2.8M revenue increase
Figure: Cumulative reward over 30 days of RL training (RL vs Random vs Rule-based)
The Exploration vs Exploitation Dilemma
The Restaurant Analogy

You're visiting a new city with 10 restaurants. How do you decide where to eat?


Exploitation: Always go to the ONE restaurant you know is good (safe, but you might miss something better)


Exploration: Try new restaurants every time (might discover amazing food, but might waste money on bad meals)


Optimal Strategy (explore heavily at first, then shift toward exploitation):
  • Week 1: Try 5-6 new places (heavy exploration)
  • Week 2: Return to the best place from Week 1 and try 2 new ones
  • Week 3+: Mostly stick to your top 2, occasionally trying something new (about 80% exploitation, 20% exploration)

Multi-Armed Bandit Problem
You have 3 website layouts (A, B, and C). Which generates the most conversions? Each visitor shown a layout counts as one trial for that layout.
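A common way to attack the three-layout problem is an epsilon-greedy strategy: with probability epsilon show a random layout (explore), otherwise show the layout with the best observed conversion rate so far (exploit). The sketch below is a simulation under assumed conditions: the "true" conversion rates are invented, and epsilon = 0.2 mirrors the 80/20 split from the restaurant analogy.

```python
import random


def run_bandit(true_rates, visitors=10_000, epsilon=0.2, seed=42):
    """Simulate epsilon-greedy layout testing; returns trials and conversions per layout."""
    rng = random.Random(seed)
    trials = {name: 0 for name in true_rates}
    conversions = {name: 0 for name in true_rates}

    def observed_rate(name):
        return conversions[name] / trials[name] if trials[name] else 0.0

    for _ in range(visitors):
        if rng.random() < epsilon:
            layout = rng.choice(list(true_rates))         # explore: random layout
        else:
            layout = max(true_rates, key=observed_rate)   # exploit: best so far
        trials[layout] += 1
        if rng.random() < true_rates[layout]:             # simulated visitor converts
            conversions[layout] += 1
    return trials, conversions


# Hypothetical conversion rates; in a real test these are unknown to the agent.
trials, conversions = run_bandit({"Layout A": 0.04, "Layout B": 0.06, "Layout C": 0.10})
for name in trials:
    print(name, "trials:", trials[name], "conversions:", conversions[name])
```

Unlike a fixed A/B/C split, most traffic drifts toward the best-performing layout while the test is still running, so fewer visitors are spent on weaker layouts.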
Key Business Applications
Application | State | Action | Reward
Dynamic Pricing (Airlines, Ride-sharing) | Demand, inventory, time, competitors | Set price level | Revenue - customer churn penalty
Inventory Management (Retail, Warehousing) | Stock levels, demand forecast, lead times | Order quantity | Sales profit - holding costs - stockout costs
Ad Placement (Google, Facebook) | User profile, context, ad history | Which ad to show | Click-through rate × bid price
Customer Service Routing (Call Centers) | Agent skills, queue length, customer priority | Route to specific agent | Customer satisfaction - resolution time penalty
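To make one of these reward definitions concrete, the inventory row (sales profit - holding costs - stockout costs) can be written as a small function. The cost parameters below are placeholder assumptions; a real system would take them from the business's own cost data.

```python
def inventory_reward(demand, stock, unit_margin=5.0, holding_cost=0.5, stockout_cost=2.0):
    """Reward for one period: sales profit - holding costs - stockout costs."""
    units_sold = min(demand, stock)
    leftover = max(stock - demand, 0)   # unsold units incur holding cost
    unmet = max(demand - stock, 0)      # unserved demand incurs stockout cost
    return unit_margin * units_sold - holding_cost * leftover - stockout_cost * unmet


print(inventory_reward(demand=80, stock=100))   # 5*80 - 0.5*20 = 390.0
print(inventory_reward(demand=120, stock=100))  # 5*100 - 2*20 = 460.0
```

How the penalties are weighted shapes what the agent learns: a high stockout cost pushes it to over-order, while a high holding cost pushes it to run lean.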
Knowledge Check 2
A company wants to optimize its email send times for maximum engagement. They have historical data showing that Tuesday at 10am had the highest open rates last year. What is the PRIMARY risk of only using this historical insight?
  • The data sample size might be too small
  • Customer behavior and preferences change over time; past optimal times may not remain optimal
  • Different time zones might affect the results
  • The email content quality is more important than timing
Q-Learning: Building a "Cheat Sheet"
Q-Learning creates a table that stores "how good is each action from each state?"

Example: Route Planning App

The agent learns Q-values by trying routes and getting rewards (fast arrival time = high reward).

Higher Q-value = Better action. Over time, the agent discovers optimal routes through trial and error.
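The table is filled in with the standard Q-learning update: Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') - Q(s, a)]. Below is a minimal sketch on a made-up road network; the place names, travel times, and hyperparameters (alpha, gamma, epsilon) are all assumptions chosen for illustration.

```python
import random

# Hypothetical road network: ROUTES[state][action] = (next_state, minutes)
ROUTES = {
    "Home": {"highway": ("Office", 30), "side street": ("Midtown", 10)},
    "Midtown": {"main road": ("Office", 10), "detour": ("Office", 25)},
}


def train_q_table(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.2, seed=1):
    rng = random.Random(seed)
    q = {state: {action: 0.0 for action in actions} for state, actions in ROUTES.items()}
    for _ in range(episodes):
        state = "Home"
        while state != "Office":                      # "Office" ends the trip
            actions = list(ROUTES[state])
            if rng.random() < epsilon:
                action = rng.choice(actions)          # explore a random road
            else:
                action = max(actions, key=lambda a: q[state][a])  # best known road
            next_state, minutes = ROUTES[state][action]
            reward = -minutes                         # faster arrival = higher reward
            best_next = max(q[next_state].values()) if next_state in q else 0.0
            # Q-learning update: move the estimate toward reward + best future value.
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = next_state
    return q


q_table = train_q_table()
for state, actions in q_table.items():
    print(state, {action: round(value, 1) for action, value in actions.items()})
```

In this toy network the highway looks attractive as a single hop, but the learned Q-values should favour the side street from Home, because its total travel time via Midtown's main road (20 minutes) beats the highway's 30.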

Decision Framework: When to Use RL
✓ Use Reinforcement Learning | ✗ Use Supervised/Unsupervised Learning
Sequential decisions with delayed consequences | One-shot predictions (single classification/regression)
Environment changes based on actions taken | Static environment; historical data sufficient
Need to balance exploration and exploitation | Clear optimal solution exists in training data
No dataset of "correct" actions available | Labeled examples of correct outcomes available
Long-term value maximization critical | Immediate prediction accuracy is the goal
Rule of Thumb: If you're asking "What should I do next?" repeatedly → Consider RL. If you're asking "What is this?" → Use supervised/unsupervised learning.
Challenges and Practical Considerations
  • Data Requirements: RL often requires millions of interactions to learn effectively. Solution: Use simulators or offline RL with historical data
  • Reward Function Design: Defining "success" mathematically can be challenging. Poor reward design leads to unexpected behaviors
  • Safety Concerns: Cannot explore dangerous actions on real systems (e.g., self-driving cars). Solution: Train in simulation first
  • Computational Cost: Training can be expensive. Consider simpler approaches first if supervised learning works well
  • Delayed Rewards: It is hard to attribute outcomes to specific actions when feedback arrives much later (the credit assignment problem)
Summary and Career Applications
Key Takeaways:
  • RL solves sequential decision problems through trial-and-error learning
  • Core components: agent, environment, state, action, reward, policy
  • Critical for dynamic systems: recommendations, pricing, inventory, routing
  • Exploration vs exploitation is fundamental to RL strategy
  • Use RL when decisions are sequential and environment is dynamic
Business Analytics Career Relevance:
  • Digital marketing: Ad optimization, campaign timing
  • E-commerce: Personalization engines, dynamic pricing
  • Supply chain: Inventory optimization, logistics
  • Finance: Algorithmic trading, portfolio management
  • Growing field with increasing business adoption