Week 10: Reinforcement Learning

Learning Through Trial and Error

DATA5000: Artificial Intelligence and Machine Learning

Learning Outcomes

Understand how reinforcement learning differs from supervised and unsupervised learning
Identify the key components of RL: agent, environment, state, action, and reward
Recognize business applications where RL provides competitive advantage
Understand the exploration vs exploitation tradeoff in decision-making
Evaluate when to use RL versus traditional machine learning approaches

Quick Recap: Three Pillars of Machine Learning

Learning Type	Question It Answers	Business Example	Has Labels?
Supervised Learning	"What will happen?"	Predict customer churn, forecast sales	✓ Yes (historical outcomes)
Unsupervised Learning	"What patterns exist?"	Customer segmentation, anomaly detection	✗ No (find hidden structures)
Reinforcement Learning	"What should I do?"	Dynamic pricing, recommendations	✗ No (learns from rewards/penalties)

The Critical Distinction

Supervised Learning looks backward → "What do past outcomes predict?" (learns from historical data with known outcomes)
Unsupervised Learning looks inward → "What structure exists in my data?" (finds hidden patterns without labels)
Reinforcement Learning looks forward → "What should I do next to maximize long-term value?" (learns optimal actions through trial and error)

RL is fundamentally about sequential decision-making in dynamic environments

Business Motivation: The Dynamic Pricing Challenge

Scenario: Uber's Surge Pricing Algorithm

Every 2 minutes, Uber must decide: Should we increase the price multiplier to 1.8x? Keep it at 1.4x? Lower it to 1.2x?

If price is too high: Customers cancel rides, demand drops (penalty: lost revenue)

If price is too low: Not enough drivers available, long wait times (penalty: customer dissatisfaction)

If price is optimal: Supply meets demand, rides completed efficiently (reward: revenue + satisfaction)

The Problem: No dataset tells us the "correct" price for every moment. The optimal decision depends on constantly changing conditions (traffic, weather, events, competitor pricing).

Why Supervised Learning Falls Short

Challenge	Supervised Learning Limitation	RL Solution
Dynamic Environment	Trained on static historical data; can't adapt to new conditions	Continuously learns from live interactions
No "Correct" Answers	Requires labeled examples of optimal decisions	Discovers optimal actions through exploration
Sequential Decisions	Treats each prediction independently	Optimizes long-term cumulative reward
Delayed Feedback	Assumes immediate outcome visibility	Optimizes for delayed consequences (though credit assignment remains hard)

Reinforcement Learning: Core Components

Agent

The decision-maker (e.g., recommendation algorithm, pricing engine, trading bot)

Environment

The external system the agent interacts with (e.g., users, market, inventory)

State

Current situation/context (e.g., user browsing history, current inventory levels, time of day)

Action

Choices available to the agent (e.g., which product to recommend, price to set, route to take)

Reward

Feedback signal indicating action quality (e.g., revenue, clicks, user engagement)

Policy

The agent's strategy mapping states to actions (e.g., "recommend similar items when user browses category X")

The Reinforcement Learning Loop


Example: Netflix Recommendation System

Step	Component	Details
1	State	User just watched "Stranger Things" S1, typically watches sci-fi thrillers, watches evenings
2	Action	Agent recommends "Dark" (German sci-fi thriller) in top 3 position
3	Environment Response	User clicks on "Dark", watches 3 episodes (2.5 hours engagement)
4	Reward	+15 points (click: +5, watch time: +10, based on engagement metrics)
5	New State	User now has "Dark" in watch history, demonstrated preference for foreign sci-fi
6	Learning	Agent updates policy: "For users who watch Stranger Things, recommending similar international sci-fi yields high engagement"
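
The six steps in the table can be sketched as a short simulation. This is a minimal illustration, not Netflix's actual system: the genre list, the viewer model in `user_response`, and the watch probabilities are all made-up assumptions, and the agent simply averages the immediate reward for each (state, action) pair rather than running a full RL algorithm.

```python
import random
from collections import defaultdict

# Hypothetical genres and viewer model; every number here is an assumption.
GENRES = ["sci-fi", "drama", "comedy"]

def user_response(state, action, rng):
    """Environment: simulated viewer who favors the genre they watched last."""
    p_watch = 0.8 if action == state else 0.2
    watched = rng.random() < p_watch
    reward = 15.0 if watched else 0.0  # click (+5) plus watch time (+10)
    # New state: the watched genre, or a random drift if the pick was ignored.
    new_state = action if watched else rng.choice(GENRES)
    return reward, new_state

def train(steps=6000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = defaultdict(float)  # average reward seen for each (state, action)
    count = defaultdict(int)
    state = "sci-fi"
    for _ in range(steps):
        # Steps 1-2: observe the state, then choose an action.
        if rng.random() < epsilon:
            action = rng.choice(GENRES)  # occasional exploration
        else:
            action = max(GENRES, key=lambda a: value[(state, a)])
        # Steps 3-5: the environment responds with a reward and a new state.
        reward, new_state = user_response(state, action, rng)
        # Step 6: learning, here a running average of observed rewards.
        key = (state, action)
        count[key] += 1
        value[key] += (reward - value[key]) / count[key]
        state = new_state
    # Learned policy: the best-looking action for each state.
    return {s: max(GENRES, key=lambda a: value[(s, a)]) for s in GENRES}

policy = train()
print(policy)
```

The returned dictionary is the policy in its simplest form: a lookup from state (last genre watched) to action (genre to recommend next).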

Knowledge Check 1

In a reinforcement learning system for automated email marketing, which component represents the REWARD?

The email subject line and content being sent
User engagement metrics (opens, clicks, conversions) after sending the email
The user's demographic profile and past behavior
The algorithm that decides when to send emails

Business Application: Recommender Systems

Why Recommender Systems Are Perfect for RL:

Sequential Nature: Each recommendation influences future user behavior and preferences
Delayed Rewards: A good recommendation today might lead to subscription renewal next month
Exploration Needed: Must balance showing proven popular items vs discovering niche preferences
Dynamic Preferences: User tastes evolve over time; RL adapts continuously

Real Impact:

Amazon: a widely cited industry estimate attributes about 35% of purchases to its recommendation engine (not all of it RL-based)
Netflix: 80% of watched content comes from recommendations
Spotify: Discover Weekly increased user retention by 24%

Data-Driven Example: E-commerce Product Recommendations

User interactions: 127K
Product categories: 15
Revenue increase: $2.8M

[Figure: Cumulative reward over 30 days of RL training (RL vs. random vs. rule-based)]

The Exploration vs Exploitation Dilemma

The Restaurant Analogy

You're visiting a new city with 10 restaurants. How do you decide where to eat?

Exploitation: Always go to the ONE restaurant you know is good (safe, but you might miss something better)

Exploration: Try new restaurants every time (might discover amazing food, but might waste money on bad meals)

Optimal Strategy:

• Week 1: Try 5-6 new places (heavy exploration)

• Week 2: Go back to best from Week 1, try 2 new ones

• Week 3+: Mostly stick to top 2, occasionally try something new (80% exploitation, 20% exploration)

Multi-Armed Bandit Problem

You have 3 website layouts (A, B, and C). Which generates the most conversions? Each time a layout is shown to a visitor counts as one trial for that layout.
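
The three-layout test can be simulated with an epsilon-greedy strategy, one common way to balance exploration and exploitation. The sketch below is illustrative only: the conversion rates in `TRUE_RATES` are invented, and `epsilon=0.2` is chosen to mirror the 80% exploitation / 20% exploration split from the restaurant analogy.

```python
import random

# Assumed true conversion rates. In practice these are unknown to the agent.
TRUE_RATES = {"A": 0.02, "B": 0.12, "C": 0.06}

def run_bandit(trials=10000, epsilon=0.2, seed=42):
    rng = random.Random(seed)
    layouts = list(TRUE_RATES)
    shows = {name: 0 for name in layouts}        # trials per layout
    conversions = {name: 0 for name in layouts}  # successes per layout

    def observed_rate(name):
        # Conversion rate measured so far (0.0 before the first trial).
        return conversions[name] / shows[name] if shows[name] else 0.0

    for _ in range(trials):
        if rng.random() < epsilon:
            choice = rng.choice(layouts)              # explore: random layout
        else:
            choice = max(layouts, key=observed_rate)  # exploit: best so far
        shows[choice] += 1
        if rng.random() < TRUE_RATES[choice]:         # simulated visitor converts
            conversions[choice] += 1
    return shows, conversions

shows, conversions = run_bandit()
best = max(shows, key=shows.get)
print("Trials per layout:", shows)
print("Most-shown layout:", best)
```

Because the agent exploits most of the time, the layout with the highest observed conversion rate ends up with most of the trials, while the occasional random choice keeps the estimates for the other layouts from going stale.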

Key Business Applications

Application	State	Action	Reward
Dynamic Pricing (Airlines, Ride-sharing)	Demand, inventory, time, competitors	Set price level	Revenue - customer churn penalty
Inventory Management (Retail, Warehousing)	Stock levels, demand forecast, lead times	Order quantity	Sales profit - holding costs - stockout costs
Ad Placement (Google, Facebook)	User profile, context, ad history	Which ad to show	Click-through rate × bid price
Customer Service Routing (Call Centers)	Agent skills, queue length, customer priority	Route to specific agent	Customer satisfaction - resolution time
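
To make the Reward column concrete, here is a minimal sketch of the inventory-management reward (sales profit minus holding costs minus stockout costs). The function name `inventory_reward` and the default cost parameters are hypothetical placeholders; real values would come from the business.

```python
def inventory_reward(stock, demand, unit_margin=5.0,
                     holding_cost=0.5, stockout_cost=2.0):
    """Reward for one period: profit minus holding and stockout penalties.

    All cost parameters are illustrative assumptions, in dollars per unit.
    """
    units_sold = min(stock, demand)   # cannot sell more than is on hand
    leftover = stock - units_sold     # unsold units incur holding cost
    unmet = demand - units_sold       # missed sales incur stockout cost
    return (units_sold * unit_margin
            - leftover * holding_cost
            - unmet * stockout_cost)

overstocked = inventory_reward(stock=100, demand=80)   # 400 - 10 - 0
understocked = inventory_reward(stock=50, demand=80)   # 250 - 0 - 60
print(overstocked, understocked)
```

With these assumed costs, running short is penalized more heavily than carrying extra stock, so an agent trained on this reward would lean toward ordering slightly more than it expects to sell.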

Knowledge Check 2

A company wants to optimize its email send times for maximum engagement. They have historical data showing that Tuesday at 10am had the highest open rates last year. What is the PRIMARY risk of only using this historical insight?

The data sample size might be too small
Customer behavior and preferences change over time; past optimal times may not remain optimal
Different time zones might affect the results
The email content quality is more important than timing

Q-Learning: Building a "Cheat Sheet"

Q-Learning builds a table (the Q-table) that stores, for each state and each action, an estimate of the long-term reward: "how good is each action from each state?"

Example: Route Planning App

The agent learns Q-values by trying routes and getting rewards (fast arrival time = high reward).

Higher Q-value = Better action. Over time, the agent discovers optimal routes through trial and error.
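
A minimal sketch of tabular Q-learning on a tiny route-planning problem follows. The road network in `ROUTES`, the travel times, and the settings `alpha` (learning rate), `gamma` (discount factor), and `epsilon` (exploration rate) are all assumptions chosen for illustration. Rewards are negative travel minutes, so a faster arrival means a higher reward.

```python
import random

# Hypothetical road network: state -> {action: (next_state, reward)}.
# Rewards are negative travel minutes; "Office" is the destination.
ROUTES = {
    "Home":     {"highway": ("Junction", -10), "backstreet": ("Park", -5)},
    "Junction": {"express": ("Office", -5),    "local": ("Office", -15)},
    "Park":     {"bridge": ("Office", -20),    "tunnel": ("Office", -12)},
}

def q_learning(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # The Q-table: one entry per (state, action), starting at zero.
    q = {(s, a): 0.0 for s, acts in ROUTES.items() for a in acts}
    for _ in range(episodes):
        state = "Home"
        while state in ROUTES:  # loop ends on arrival at "Office"
            actions = list(ROUTES[state])
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])  # exploit
            next_state, reward = ROUTES[state][action]
            # Best estimated value from the next state (0 at the destination).
            future = max((q[(next_state, a)]
                          for a in ROUTES.get(next_state, {})), default=0.0)
            # Q-learning update: nudge the estimate toward reward + future value.
            q[(state, action)] += alpha * (reward + gamma * future
                                           - q[(state, action)])
            state = next_state
    return q

q = q_learning()
first_move = max(ROUTES["Home"], key=lambda a: q[("Home", a)])
print("Best first move from Home:", first_move)
```

In this made-up network the backstreet has the better immediate reward (-5 versus -10), but the highway leads to the faster overall trip (15 minutes versus 17). The Q-values capture that long-term difference, which is why the learned first move is the highway.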

Decision Framework: When to Use RL

✓ Use Reinforcement Learning	✗ Use Supervised/Unsupervised Learning
Sequential decisions with delayed consequences	One-shot predictions (single classification/regression)
Environment changes based on actions taken	Static environment; historical data sufficient
Need to balance exploration and exploitation	Clear optimal solution exists in training data
No dataset of "correct" actions available	Labeled examples of correct outcomes available
Long-term value maximization critical	Immediate prediction accuracy is the goal

Rule of Thumb: If you're asking "What should I do next?" repeatedly → Consider RL. If you're asking "What is this?" → Use supervised/unsupervised learning.

Challenges and Practical Considerations

Data Requirements: RL often requires millions of interactions to learn effectively. Solution: Use simulators or offline RL with historical data
Reward Function Design: Defining "success" mathematically can be challenging. Poor reward design leads to unexpected behaviors
Safety Concerns: Cannot explore dangerous actions on real systems (e.g., self-driving cars). Solution: Train in simulation first
Computational Cost: Training can be expensive. Consider simpler approaches first if supervised learning works well
Delayed Rewards: Attributing outcomes to specific actions is difficult when feedback comes much later (the credit assignment problem)
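
One common way to spread a delayed reward back over the actions that preceded it is the discounted return, where a hypothetical discount factor `gamma` (between 0 and 1) shrinks the credit given to earlier steps. The sketch below is illustrative; the function name and the example numbers are assumptions.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return, for each step, its reward plus discounted future rewards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backward so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Three actions with no immediate feedback, then a reward of 10 at the end.
credit = discounted_returns([0, 0, 0, 10], gamma=0.5)
print(credit)  # [1.25, 2.5, 5.0, 10.0]
```

Each earlier action receives a share of the final reward, with actions closer to the outcome credited more. This does not solve credit assignment, but it gives the learning algorithm a signal at every step instead of only at the end.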

Summary and Career Applications

Key Takeaways:

RL solves sequential decision problems through trial-and-error learning
Core components: agent, environment, state, action, reward, policy
Critical for dynamic systems: recommendations, pricing, inventory, routing
Exploration vs exploitation is fundamental to RL strategy
Use RL when decisions are sequential and environment is dynamic

Business Analytics Career Relevance:

Digital marketing: Ad optimization, campaign timing
E-commerce: Personalization engines, dynamic pricing
Supply chain: Inventory optimization, logistics
Finance: Algorithmic trading, portfolio management
Growing field with increasing business adoption