| Learning Type | Question It Answers | Business Example | Has Labels? |
|---|---|---|---|
| Supervised Learning | "What will happen?" | Predict customer churn, forecast sales | ✓ Yes (historical outcomes) |
| Unsupervised Learning | "What patterns exist?" | Customer segmentation, anomaly detection | ✗ No (find hidden structures) |
| Reinforcement Learning | "What should I do?" | Dynamic pricing, recommendations | ✗ No (learns from rewards/penalties) |
Every 2 minutes, Uber must decide: Should we increase the price multiplier to 1.8x? Keep it at 1.4x? Lower it to 1.2x?
If the price is too high: Customers cancel rides and demand drops (penalty: lost revenue)
If the price is too low: Too few drivers are available and wait times grow (penalty: customer dissatisfaction)
If the price is optimal: Supply meets demand and rides are completed efficiently (reward: revenue + satisfaction)
The Problem: No dataset tells us the "correct" price for every moment. The optimal decision depends on constantly changing conditions (traffic, weather, events, competitor pricing).
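
The three outcomes above can be sketched as a toy reward function. Everything in it is an assumption made for illustration: the linear rider and driver responses, the base fare, and the `unserved_penalty` weight are invented numbers, not Uber's actual model.

```python
def pricing_reward(multiplier, riders_at_base=100, drivers_at_base=60,
                   base_fare=10.0, unserved_penalty=10.0):
    """Toy reward for one pricing interval (all parameters are hypothetical)."""
    # Assumed demand response: fewer riders request a ride as the price rises.
    riders = max(0.0, riders_at_base * (2.0 - multiplier))
    # Assumed supply response: more drivers come online as the price rises.
    drivers = drivers_at_base * multiplier
    completed = min(riders, drivers)
    unserved = riders - completed  # riders left waiting when the price is too low
    revenue = completed * base_fare * multiplier
    return revenue - unserved_penalty * unserved


candidates = [1.2, 1.4, 1.8]
best = max(candidates, key=pricing_reward)
print(best, [round(pricing_reward(m), 1) for m in candidates])
```

In a real system the agent never sees a function like this. It only observes the reward after each pricing decision, and the underlying curves shift with traffic, weather, and events, which is why the price has to be learned rather than looked up.
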
| Challenge | Supervised Learning Limitation | RL Solution |
|---|---|---|
| Dynamic Environment | Trained on static historical data; can't adapt to new conditions | Continuously learns from live interactions |
| No "Correct" Answers | Requires labeled examples of optimal decisions | Discovers optimal actions through exploration |
| Sequential Decisions | Treats each prediction independently | Optimizes long-term cumulative reward |
| Delayed Feedback | Assumes immediate outcome visibility | Handles delayed consequences naturally |
Agent: The decision-maker (e.g., recommendation algorithm, pricing engine, trading bot)
Environment: The external system the agent interacts with (e.g., users, market, inventory)
State: The current situation or context (e.g., user browsing history, current inventory levels, time of day)
Action: A choice available to the agent (e.g., which product to recommend, price to set, route to take)
Reward: The feedback signal indicating action quality (e.g., revenue, clicks, user engagement)
Policy: The agent's strategy mapping states to actions (e.g., "recommend similar items when user browses category X")
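
As a rough illustration of how these pieces fit together, the sketch below runs a hand-written policy against a toy environment. The class `ToyUserEnvironment`, its click probabilities, and the action names are all hypothetical, chosen only to make the loop runnable.

```python
import random


class ToyUserEnvironment:
    """Hypothetical environment: a user who clicks similar items more often in the evening."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        self.state = self.rng.choice(["morning", "evening"])
        return self.state

    def step(self, action):
        # Assumed click behaviour; the probabilities are invented for the example.
        likes_it = self.state == "evening" and action == "similar_item"
        click_probability = 0.6 if likes_it else 0.2
        reward = 1.0 if self.rng.random() < click_probability else 0.0
        self.state = self.rng.choice(["morning", "evening"])  # next state
        return self.state, reward


def policy(state):
    """A fixed policy: a plain mapping from state to action."""
    return "similar_item" if state == "evening" else "popular_item"


def run(env, policy, steps=100):
    state = env.reset()
    total_reward = 0.0
    for _ in range(steps):
        action = policy(state)            # agent picks an action for the current state
        state, reward = env.step(action)  # environment returns the new state and a reward
        total_reward += reward
    return total_reward


print(run(ToyUserEnvironment(seed=1), policy))
```

The policy here never changes. A learning agent would use the observed rewards to adjust the state-to-action mapping over time.
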
| Step | Component | Details |
|---|---|---|
| 1 | State | User just watched "Stranger Things" S1, typically watches sci-fi thrillers, watches evenings |
| 2 | Action | Agent recommends "Dark" (German sci-fi thriller) in top 3 position |
| 3 | Environment Response | User clicks on "Dark", watches 3 episodes (2.5 hours engagement) |
| 4 | Reward | +15 points (click: +5, watch time: +10, based on engagement metrics) |
| 5 | New State | User now has "Dark" in watch history, demonstrated preference for foreign sci-fi |
| 6 | Learning | Agent updates policy: "For users who watch Stranger Things, recommending similar international sci-fi yields high engagement" |
You're visiting a new city with 10 restaurants. How do you decide where to eat?
Exploitation: Always go to the ONE restaurant you know is good (safe, but you might miss something better)
Exploration: Try new restaurants every time (might discover amazing food, but might waste money on bad meals)
A balanced strategy (explore early, exploit later):
• Week 1: Try 5-6 new places (heavy exploration)
• Week 2: Go back to best from Week 1, try 2 new ones
• Week 3+: Mostly stick to top 2, occasionally try something new (80% exploitation, 20% exploration)
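
One common way to implement this kind of schedule is an epsilon-greedy rule whose exploration rate decays to a floor. The sketch below is a minimal version under stated assumptions: the decay schedule, the noise level, and the restaurant quality scores are invented, and only the 20% exploration floor comes from the schedule above.

```python
import random


def choose_restaurant(avg_rating, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(avg_rating))  # explore: pick any restaurant
    # exploit: pick the restaurant with the best average rating so far
    return max(range(len(avg_rating)), key=lambda i: avg_rating[i])


def simulate(true_quality, meals=200, seed=0):
    rng = random.Random(seed)
    n = len(true_quality)
    avg_rating = [0.0] * n
    visits = [0] * n
    for meal in range(meals):
        # Assumed schedule: start fully exploratory, decay to a 20% floor.
        epsilon = max(0.2, 1.0 - meal / 20)
        i = choose_restaurant(avg_rating, epsilon, rng)
        rating = true_quality[i] + rng.gauss(0, 0.5)  # noisy meal experience
        visits[i] += 1
        avg_rating[i] += (rating - avg_rating[i]) / visits[i]  # running mean
    return avg_rating, visits


# Ten hypothetical restaurants; the last one is clearly the best.
ratings, visits = simulate([1, 1, 1, 1, 1, 1, 1, 1, 1, 9], meals=1000)
print(visits)
```

Because exploration never drops to zero, the agent keeps a small chance of noticing when a previously poor option improves.
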
| Application | State | Action | Reward |
|---|---|---|---|
| Dynamic Pricing (Airlines, Ride-sharing) | Demand, inventory, time, competitors | Set price level | Revenue - customer churn penalty |
| Inventory Management (Retail, Warehousing) | Stock levels, demand forecast, lead times | Order quantity | Sales profit - holding costs - stockout costs |
| Ad Placement (Google, Facebook) | User profile, context, ad history | Which ad to show | Click-through rate × bid price |
| Customer Service Routing (Call Centers) | Agent skills, queue length, customer priority | Route to specific agent | Customer satisfaction - resolution time |
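
To make one row concrete, the inventory reward (sales profit minus holding and stockout costs) could be computed along these lines. The function name and the cost parameters `unit_margin`, `holding_cost`, and `stockout_cost` are placeholders, not figures from any real retailer.

```python
def inventory_reward(stock, order_qty, demand,
                     unit_margin=5.0, holding_cost=0.5, stockout_cost=2.0):
    """Reward for one period: sales profit - holding costs - stockout costs."""
    available = stock + order_qty          # state (stock) plus action (order quantity)
    sold = min(available, demand)
    leftover = available - sold            # units carried over, each incurs a holding cost
    unmet = demand - sold                  # lost sales, each incurs a stockout cost
    reward = unit_margin * sold - holding_cost * leftover - stockout_cost * unmet
    return reward, leftover                # leftover becomes the next period's stock


print(inventory_reward(stock=10, order_qty=20, demand=25))
print(inventory_reward(stock=0, order_qty=10, demand=25))
```

Returning the leftover stock alongside the reward reflects the sequential nature of the problem: today's order changes the state the agent faces tomorrow.
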
Example: Route Planning App
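A Q-value is the agent's estimate of the long-term reward it can expect from taking a given action in a given state. The agent learns Q-values by trying routes and receiving rewards (fast arrival time = high reward).
Higher Q-value = Better action. Over time, the agent discovers optimal routes through trial and error.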
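
A minimal sketch of the route example, using the standard tabular Q-learning update Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') − Q(s, a)]. The road network, travel times, and hyperparameters below are made up for illustration, and the reward is assumed to be the negative travel time in minutes.

```python
import random

# Hypothetical road network: minutes to drive from each location to the next stop.
TRAVEL_MINUTES = {
    "Home": {"Highway": 5, "Downtown": 2},
    "Highway": {"Office": 5},
    "Downtown": {"Office": 15},
}
GOAL = "Office"


def train_q_table(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in nxt} for s, nxt in TRAVEL_MINUTES.items()}
    for _ in range(episodes):
        state = "Home"
        while state != GOAL:
            actions = list(q[state])
            if rng.random() < epsilon:
                action = rng.choice(actions)              # explore
            else:
                action = max(actions, key=q[state].get)   # exploit best known action
            reward = -TRAVEL_MINUTES[state][action]       # faster leg = higher reward
            next_state = action                           # driving there moves the agent
            future = 0.0 if next_state == GOAL else max(q[next_state].values())
            # Q-learning update: nudge the estimate toward reward + best future value.
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q


q = train_q_table()
print(q["Home"])
print(max(q["Home"], key=q["Home"].get))
```

In this toy network the Downtown leg looks better at first (2 minutes versus 5), but its Q-value ends up lower because the estimate accounts for the slow 15-minute leg that follows. That is the sense in which Q-values capture long-term rather than immediate reward.
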
| ✓ Use Reinforcement Learning | ✗ Use Supervised/Unsupervised Learning |
|---|---|
| Sequential decisions with delayed consequences | One-shot predictions (single classification/regression) |
| Environment changes based on actions taken | Static environment; historical data sufficient |
| Need to balance exploration and exploitation | Clear optimal solution exists in training data |
| No dataset of "correct" actions available | Labeled examples of correct outcomes available |
| Long-term value maximization critical | Immediate prediction accuracy is the goal |