AI Programming in Business Analytics
Kaplan Business School
KBS
Overview
What You Will Learn Today
By the end of this session you will be able to:
Explain how a computer sees an image as numbers
Describe what a convolution does and why it is useful
Trace the layers of a CNN: Conv → ReLU → Pool → Flatten → Dense
Explain how CNNs learn features automatically
Describe the object detection problem and how YOLO solves it in one pass
Identify real business applications for CNNs and YOLO
Big Idea
You do not need to memorise equations. Focus on the intuition — understanding why each layer exists is more important than the maths.
Today's Agenda
# | Topic | Approx. Time
1 | Images as numbers: how computers see | 15 min
2 | Convolution: the core operation | 25 min
3 | Building a full CNN, layer by layer | 30 min
4 | YOLO: detecting objects in real time | 25 min
5 | Real-world applications & wrap-up | 15 min
Section 1
How Computers See Images
Images are just grids of numbers — and that changes everything about how we process them.
1.1 An Image is a Grid of Numbers
Every image on your screen is made up of tiny squares called pixels. Each pixel stores a brightness value.
A greyscale image: one number per pixel (0 = black, 255 = white)
A colour image: three numbers per pixel — Red, Green, Blue (RGB)
Example
A 28 × 28 greyscale image (like a handwritten digit) = 784 numbers.
A 224 × 224 colour photo = 224 × 224 × 3 = 150,528 numbers.
The computer never actually "sees" a cat. It just sees thousands of numbers that represent brightness values across the image.
5 × 5 greyscale pixel grid — each cell is a number 0–255
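The counts above can be checked with NumPy arrays. This is a minimal sketch: the pixel values in the 5 × 5 grid are made up for illustration, and the two larger arrays are filled with zeros purely to show their shapes.

```python
import numpy as np

# Hypothetical 5 x 5 greyscale image: one brightness value (0-255) per pixel.
grey = np.array([
    [0,  50, 100, 150, 200],
    [10, 60, 110, 160, 210],
    [20, 70, 120, 170, 220],
    [30, 80, 130, 180, 230],
    [40, 90, 140, 190, 255],
], dtype=np.uint8)

# Placeholder images (all black) with the sizes used in the example above.
digit = np.zeros((28, 28), dtype=np.uint8)       # greyscale: height x width
photo = np.zeros((224, 224, 3), dtype=np.uint8)  # colour: height x width x RGB

print(grey.shape)   # (5, 5)
print(digit.size)   # 784
print(photo.size)   # 150528
```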
1.2 Why Can't We Just Use a Normal Neural Network?
A regular (fully connected) neural network connects every input to every neuron.
Problem
A 224 × 224 colour photo has 150,528 input values (224 × 224 pixels × 3 colour channels).
If the first hidden layer has 1,000 neurons, that's about 150 million weights to learn, just for one layer!
Too many parameters → the model is huge and slow to train
No spatial awareness → the network doesn't know that nearby pixels are related
Not translation-invariant → to a fully connected network, a cat in the top-left corner looks completely different from a cat in the bottom-right
The Solution
CNNs solve all three problems by:
Sharing weights (one filter scans the whole image)
Using local connections (only look at nearby pixels)
Detecting the same pattern anywhere in the image
Flat network vs CNN — fewer parameters, smarter design
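To make the parameter saving concrete, the sketch below compares weight counts. The dense layer uses the numbers from the problem statement; the convolutional layer is an assumed example (32 filters of size 3 × 3 over 3 colour channels), and biases are ignored in both cases.

```python
# Fully connected layer: every input value connects to every neuron.
inputs = 224 * 224 * 3          # 150,528 input values
hidden_neurons = 1000
dense_weights = inputs * hidden_neurons

# Convolutional layer (assumed sizes): each filter is 3 x 3 x 3 and is
# reused at every position in the image, so image size does not matter.
filter_height, filter_width, channels, num_filters = 3, 3, 3, 32
conv_weights = filter_height * filter_width * channels * num_filters

print(dense_weights)  # 150528000
print(conv_weights)   # 864
```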
Section 2
The Convolution Operation
A small filter slides over the image, detecting one type of pattern at a time.
2.1 What is Convolution? The Spotlight Analogy
Think of convolution as shining a small spotlight (called a filter or kernel) across an image, one step at a time.
The filter is a small grid — typically 3 × 3 or 5 × 5 pixels
At each position, it multiplies its values with the image pixels underneath and adds them up
The result is a single number that summarises "how strongly this pattern appeared here"
The filter then slides across to the next position and repeats
Definition
A filter (kernel) is a small matrix of learnable numbers. During training, the CNN automatically learns the best filter values to detect useful features.
The red box slides across the image; each position produces one output number
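The sliding multiply-and-sum can be sketched in a few lines of NumPy. This is an illustrative version with no padding and a step size of 1; the image and the averaging filter are made-up values, and the function name `convolve2d` is chosen here for the example. (Strictly speaking this is cross-correlation, which is what deep learning libraries compute under the name "convolution".)

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + kh, col:col + kw]       # pixels under the filter
            feature_map[row, col] = np.sum(patch * kernel)  # multiply, then add up
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)  # hypothetical 5 x 5 image
kernel = np.ones((3, 3)) / 9.0                    # hypothetical averaging filter
feature_map = convolve2d(image, kernel)

print(feature_map.shape)  # (3, 3): one output number per filter position
```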
2.2 Worked Example — One Convolution Step
Suppose the filter sits over the top-left 3×3 patch of the image:
The number 1 is placed in the top-left cell of the feature map. The filter then slides right and we repeat for every position.
This particular filter (called a Sobel filter) detects vertical edges. High output = strong vertical edge at that location.
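As a rough illustration of a vertical-edge filter, the sketch below applies the standard horizontal-gradient Sobel kernel to a made-up 5 × 5 image that is dark on the left and bright on the right. The image values are assumptions for this example and are not the patch from the slide.

```python
import numpy as np

# Sobel kernel that responds to left-to-right brightness changes (vertical edges).
sobel_vertical = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
])

# Hypothetical image: two dark columns (0) followed by three bright columns (10).
image = np.array([[0, 0, 10, 10, 10]] * 5)

out_size = image.shape[0] - 3 + 1
feature_map = np.zeros((out_size, out_size), dtype=int)
for row in range(out_size):
    for col in range(out_size):
        patch = image[row:row + 3, col:col + 3]
        feature_map[row, col] = np.sum(patch * sobel_vertical)

print(feature_map)
# Each row is [40, 40, 0]: strong response where the patch covers the edge,
# zero where the patch is uniformly bright.
```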
2.3 What Filters Actually Detect
Different filter values detect different visual patterns. A CNN learns the best filters automatically during training.
Figure 2.3: Three manually-designed filters — in practice, CNNs learn hundreds of such filters
2.4 Feature Maps — Seeing the Image Differently
A single filter produces one feature map. In practice, a convolutional layer uses many filters at once.
Figure 2.4: One convolutional block — 32 filters each produce a feature map
Intuition
Early filters detect edges and colours. Deeper layers combine these to detect textures and simple shapes, then object parts (like ears, eyes, wheels), and finally whole objects.
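A convolutional layer with many filters can be sketched as one stacked operation. The example below assumes a 28 × 28 greyscale input and 32 random 3 × 3 filters (random only because nothing has been trained); it shows that the output is a stack of 32 feature maps.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.random((28, 28))         # hypothetical greyscale image
filters = rng.normal(size=(32, 3, 3))  # 32 untrained 3 x 3 filters

# Every 3 x 3 patch of the image: shape (26, 26, 3, 3).
windows = sliding_window_view(image, (3, 3))

# Multiply each patch by each filter and sum: one feature map per filter.
feature_maps = np.einsum("hwij,fij->hwf", windows, filters)

print(feature_maps.shape)  # (26, 26, 32)
```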
2.5 ReLU — Keeping Only the Positives
After each convolution, we apply an activation function. The most common one is ReLU (Rectified Linear Unit).
ReLU Formula
\(\text{ReLU}(x) = \max(0,\, x)\)
If the value is positive → keep it. If the value is negative → set it to 0.
Why? Negative convolution outputs mean "this pattern was NOT found here." We simply ignore them. This also introduces non-linearity, letting the network learn complex patterns.
Figure 2.5: ReLU keeps positives, zeros out negatives
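In code, ReLU is a one-line operation. The feature map values below are made up for illustration.

```python
import numpy as np

def relu(x):
    """Keep positive values, replace negative values with 0."""
    return np.maximum(0, x)

feature_map = np.array([[3.0, -1.0],
                        [-4.0, 2.0]])  # hypothetical convolution output

activated = relu(feature_map)
print(activated)
# [[3. 0.]
#  [0. 2.]]
```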
2.6 Pooling — Downsizing Without Losing Information
After ReLU, a pooling layer shrinks each feature map to make the network faster and more robust.
Max Pooling is the most common type. It slides a 2×2 window over the feature map and keeps only the largest value in each window.
Why Max?
We want to know whether a feature was detected somewhere in that region — not exactly where. Max pooling answers "was this feature present?" with the strongest signal.
Effect
A 4×4 feature map becomes a 2×2 map after 2×2 max pooling — 4× fewer numbers to process, while preserving the most important detections.
Max pooling picks the largest value from each 2×2 region
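A minimal NumPy sketch of 2×2 max pooling follows. The 4×4 feature map is a made-up example, and the helper name `max_pool_2x2` is chosen here for illustration; it assumes non-overlapping windows.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2 x 2 window."""
    h, w = feature_map.shape
    trimmed = feature_map[:h // 2 * 2, :w // 2 * 2]   # drop an odd last row/column
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)    # group pixels into 2 x 2 blocks
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 6],
])

pooled = max_pool_2x2(feature_map)
print(pooled)
# [[6 4]
#  [7 9]]
```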
Knowledge Check — Sections 1 & 2
Q1: A colour image (RGB) that is 100 × 100 pixels is represented as how many numbers?
A colour image has 3 channels (R, G, B). Each pixel requires 3 values, so a 100×100 colour image = 100 × 100 × 3 = 30,000 numbers.
Q2: What does a convolutional filter do?
A filter (kernel) slides across the image performing multiply-and-sum at each position, producing a feature map that shows where that pattern appears. Removing negatives is ReLU; resizing is pooling.
Figure 3.1: The standard CNN pipeline from image to prediction
Layer | What it does | Output format
Conv | Detects local patterns using filters | 3D: width × height × channels
ReLU | Removes negatives (non-linearity) | 3D: same shape
Pool | Shrinks spatially, keeps strongest signal | 3D: smaller width × height
Flatten | Converts 3D volume to a 1D list of numbers | 1D vector
Dense | Standard neural network layer | 1D vector (smaller)
Output | Final classification probabilities | 1D: one value per class
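To see how the output format changes from layer to layer, the sketch below traces shapes through one Conv → ReLU → Pool → Flatten → Dense pass. All sizes are assumptions for illustration: a 28 × 28 greyscale input, 32 filters of 3 × 3 with no padding, 2 × 2 pooling, and 10 output classes.

```python
def conv_shape(height, width, num_filters, kernel=3):
    """Output shape of a convolution with no padding and stride 1."""
    return height - kernel + 1, width - kernel + 1, num_filters

def pool_shape(height, width, channels, window=2):
    """Output shape of non-overlapping pooling."""
    return height // window, width // window, channels

input_shape = (28, 28, 1)                                # hypothetical greyscale image
after_conv = conv_shape(28, 28, num_filters=32)          # ReLU keeps this shape
after_pool = pool_shape(*after_conv)
flat_length = after_pool[0] * after_pool[1] * after_pool[2]
num_classes = 10                                         # assumed number of classes

print("input  ", input_shape)     # (28, 28, 1)
print("conv   ", after_conv)      # (26, 26, 32)
print("pool   ", after_pool)      # (13, 13, 32)
print("flatten", flat_length)     # 5408
print("output ", num_classes)     # 10
```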
3.2 Flatten and Dense — The Decision Stage
Flatten
After several Conv+Pool blocks, we have a 3D volume (width × height × depth). We need to convert it into a simple list to feed into a standard neural network.
Example
A 7 × 7 × 128 feature map has 7 × 7 × 128 = 6,272 numbers.
Flatten simply lays them end-to-end → a vector of length 6,272.
No learning happens here — it's just a reshape.
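The flatten step from the example can be reproduced with a NumPy reshape. The volume here is filled with zeros because only the shape matters.

```python
import numpy as np

volume = np.zeros((7, 7, 128))   # placeholder 7 x 7 x 128 feature volume
flat = volume.reshape(-1)        # lay all values end-to-end

print(flat.shape)  # (6272,)
```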
Dense (Fully Connected) Layers
This is the "decision-making" part of the CNN. The flat vector is passed through one or more standard neural network layers.
Each neuron connects to every value in the vector
The network combines all the detected features to make a final prediction
Final Output (Softmax)
The last dense layer outputs one probability per class. E.g., for cat/dog/bird:
[Cat: 0.85, Dog: 0.12, Bird: 0.03]
→ Predicted class: Cat
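Softmax turns the raw scores of the last dense layer into probabilities that sum to 1. In the sketch below the raw scores are invented so that the result lands close to the cat/dog/bird example; they are not taken from a real model.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

classes = ["Cat", "Dog", "Bird"]
raw_scores = np.array([3.0, 1.0, -0.4])   # hypothetical outputs of the last dense layer

probabilities = softmax(raw_scores)
predicted = classes[int(np.argmax(probabilities))]

print(np.round(probabilities, 2))  # roughly 0.86, 0.12, 0.03
print(predicted)                   # Cat
```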
3.3 How Does a CNN Actually Learn?
A CNN learns its filter values and dense layer weights through a process called backpropagation, guided by a loss function.
Forward pass — image goes through all layers, producing a prediction (e.g., "Cat: 62%")
Calculate loss — compare prediction to true label. If it said "Cat: 62%" but it was actually a Dog, the loss is high.
Backward pass — the error is traced back through every layer, and each weight is adjusted slightly to reduce the loss
Repeat for thousands of images → filters become meaningful detectors
The training loop — repeat until loss is minimised
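The four steps can be sketched on a much smaller problem. The example below trains a single dense layer (not a full CNN) with softmax and cross-entropy loss on randomly generated toy data; the data, learning rate, and number of steps are all assumptions chosen to keep the sketch short. The same forward, loss, backward, update cycle applies to the filters of a CNN.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # toy data: 100 samples, 4 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy labels: class 0 or 1
W = np.zeros((4, 2))                          # weights to learn
learning_rate = 0.5
losses = []

for step in range(50):
    # 1. Forward pass: scores -> probabilities (softmax).
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # 2. Calculate loss: cross-entropy against the true labels.
    loss = -np.mean(np.log(probs[np.arange(len(y)), y]))
    losses.append(loss)

    # 3. Backward pass: gradient of the loss with respect to the weights.
    grad_logits = probs.copy()
    grad_logits[np.arange(len(y)), y] -= 1
    grad_W = X.T @ grad_logits / len(y)

    # 4. Update: adjust each weight slightly to reduce the loss.
    W -= learning_rate * grad_W

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The first loss equals ln 2 (about 0.693) because untrained weights give each class a 50% probability; it then falls as the weights are adjusted.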
3.4 Famous CNN Architectures — Shoulders of Giants
You don't need to build a CNN from scratch. Many pre-trained models are freely available.
Architecture | Year | Layers | Notable For
LeNet-5 | 1998 | 7 | One of the first successful CNNs; handwritten digit recognition
AlexNet | 2012 | 8 | Launched the deep learning revolution; won ImageNet by a wide margin
VGG-16 | 2014 | 16 | Simple and uniform design; widely used as a baseline
ResNet-50 | 2015 | 50 | Skip connections, which allowed very deep networks to train without vanishing gradients
EfficientNet | 2019 | varies | Strong accuracy-to-compute ratio; widely used in production
Transfer Learning
These models are pre-trained on millions of images. In practice, we reuse their learned filters and only retrain the final output layer for our specific task — saving weeks of computation.
Knowledge Check — Section 3
Q3: What is the purpose of the Flatten layer in a CNN?
The Flatten layer performs a reshape operation — no weights, no learning. It converts a 3D volume (e.g., 7×7×128) into a 1D vector (6,272) that a fully-connected Dense layer can accept as input.
Q4: What is Transfer Learning?
Transfer learning reuses the lower layers (which detect general edges, shapes, textures) from a large pre-trained model. Only the final classification layers need fine-tuning for the new task — making it practical even with small datasets.
Section 4
YOLO — You Only Look Once
Detecting and locating multiple objects in a single forward pass through the network.
4.1 Classification vs Object Detection
Image Classification
Question asked
"What is in this image?"
→ Answer: "Cat" (with 92% confidence)
One label for the whole image
No information about where the object is
Example: "Is this an invoice or a receipt?"
Object Detection
Question asked
"What is in this image and where?"
→ Answer: "Cat at [120, 80, 200, 160]"
Returns a bounding box around every detected object
Can detect multiple objects at different locations
Example: "Find all vehicles and pedestrians in this CCTV frame"
Figure 4.1: Classification gives one label; detection gives labels AND locations for every object
4.2 The Challenge — Objects Can Be Anywhere
Object detection is harder than classification because:
Objects can appear anywhere in the image
There can be multiple objects of different sizes
We need to output both a class label AND a bounding box for each one
Traditional Approach (Slow)
Older methods like R-CNN would:
Propose ~2,000 candidate regions in the image
Run a CNN on each region separately
This was accurate but very slow — up to 47 seconds per image!
YOLO's Insight
Instead of looking at thousands of regions one at a time, YOLO asks: "What if we run the CNN just once on the whole image?"
The result: real-time detection at 30–100+ frames per second.
YOLO divides the image into a grid (e.g., 7 × 7 = 49 cells). Each cell is responsible for detecting objects whose centre falls within it.
For each cell, the network predicts:
Bounding box coordinates: x, y (centre), width, height
Confidence score: how sure is it that an object is here?
Class probabilities: is it a car, person, dog, etc.?
Bounding Box
A rectangle defined by four numbers: the centre coordinates (x, y) and the box dimensions (width, height). Expressed as fractions of the image size (0 to 1).
All 49 cells make their predictions simultaneously in a single forward pass — that's why it's "You Only Look Once."
Highlighted cells are "responsible" for each detected object
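The grid assignment and the bounding box format can be sketched in plain Python. The function names, the 640 × 480 image size, and the example box are assumptions for illustration; coordinates are fractions of the image size (0 to 1), as in the bounding box definition above.

```python
def responsible_cell(x_centre, y_centre, grid_size=7):
    """Return (row, col) of the grid cell that contains the object's centre."""
    col = min(int(x_centre * grid_size), grid_size - 1)
    row = min(int(y_centre * grid_size), grid_size - 1)
    return row, col

def to_pixel_corners(x, y, w, h, image_width, image_height):
    """Convert a (centre, size) box in fractions to (left, top, right, bottom) in pixels."""
    left = (x - w / 2) * image_width
    top = (y - h / 2) * image_height
    right = (x + w / 2) * image_width
    bottom = (y + h / 2) * image_height
    return left, top, right, bottom

# Hypothetical object centred in the image, a quarter as wide and half as tall.
box = (0.5, 0.5, 0.25, 0.5)

print(responsible_cell(box[0], box[1]))   # (3, 3): the middle cell of a 7 x 7 grid
print(to_pixel_corners(*box, image_width=640, image_height=480))
# (240.0, 120.0, 400.0, 360.0)
```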
4.4 YOLO Output — Predictions and Filtering
What the network outputs
For each grid cell, YOLO outputs a vector:
Per Cell Output
[x, y, w, h, confidence, P(class₁), P(class₂), …, P(classN)]
x, y: centre of the box (relative to the cell)
w, h: width and height (relative to image)
confidence: P(object) × IoU, i.e. the probability that the box contains an object multiplied by its predicted overlap with the ground-truth box
P(classᵢ): probability it belongs to each class
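One way to read a single cell's output vector is sketched below. It follows the simplified layout above (one box per cell); the vector values, the cell position, and the three-class setup are invented for this example, and `decode_cell` is a hypothetical helper name.

```python
def decode_cell(vector, row, col, grid_size=7):
    """Turn one cell's output vector into an image-relative box, a class, and a score."""
    x_cell, y_cell, w, h, confidence = vector[:5]
    class_probs = vector[5:]

    # x, y are relative to the cell, so shift by the cell position and rescale.
    x_img = (col + x_cell) / grid_size
    y_img = (row + y_cell) / grid_size

    best_class = max(range(len(class_probs)), key=lambda i: class_probs[i])
    score = confidence * class_probs[best_class]   # class-specific confidence
    return (x_img, y_img, w, h), best_class, score

# Hypothetical output for the cell in row 3, column 2, with three classes.
cell_output = [0.5, 0.5, 0.2, 0.4, 0.9, 0.1, 0.8, 0.1]

box, best_class, score = decode_cell(cell_output, row=3, col=2)
print(box)         # centre at roughly (0.357, 0.5), size (0.2, 0.4)
print(best_class)  # 1
print(round(score, 2))  # 0.72
```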
Non-Maximum Suppression (NMS)
Many cells near an object all predict a box. We end up with dozens of overlapping boxes.
NMS cleans this up:
Keep the box with the highest confidence
Remove any boxes that overlap it too much (IoU > threshold)
Repeat for the next highest
Result
One clean, final bounding box per detected object.
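The NMS steps above can be sketched in plain Python. Boxes here are given as (left, top, right, bottom) pixel corners; the example boxes, scores, and the 0.5 threshold are assumptions for illustration, and real detectors typically run NMS separately per class.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (left, top, right, bottom) boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive NMS."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # keep the highest-confidence box
        keep.append(best)
        # remove remaining boxes that overlap it too much, then repeat
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Two overlapping boxes on the same object, plus one separate object.
boxes = [(100, 100, 200, 200), (105, 105, 205, 205), (300, 300, 400, 400)]
scores = [0.9, 0.8, 0.7]

print(non_max_suppression(boxes, scores))  # [0, 2]
```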
4.5 YOLO Versions — A Brief History
Version | Year | Key Improvement | Speed / Accuracy
YOLOv1 | 2016 | Original one-pass detection concept | Fast, struggled with small/clustered objects
YOLOv3 | 2018 | Multi-scale detection (detects small objects better) | Good balance; popular baseline
YOLOv5 | 2020 | PyTorch-native, easy to use and fine-tune | Very popular in industry
YOLOv8 | 2023 | Unified framework: detection, segmentation, pose | State-of-the-art accuracy + speed
YOLOv11 | 2024 | Further efficiency gains, edge-device optimised | Fastest yet; production deployments
Code: running YOLOv8 in a few lines
from ultralytics import YOLO
model = YOLO("yolov8n.pt") # load a pre-trained model
results = model("street.jpg") # detect objects in one image
results[0].show() # display with bounding boxes
Knowledge Check — Section 4
Q5: What does "You Only Look Once" mean in the context of YOLO?
YOLO's key innovation is running the CNN once on the full image (divided into a grid), predicting bounding boxes and class labels for all cells simultaneously — unlike older methods that ran the CNN separately on thousands of candidate regions.
Q6: What problem does Non-Maximum Suppression (NMS) solve?
Since multiple neighbouring cells predict boxes for the same object, NMS keeps the highest-confidence box and suppresses all others that overlap it by more than a set threshold (IoU). The result is one clean bounding box per object.
Section 5
Real-World Applications
Where are CNNs and YOLO actually used in business today?
5.1 CNNs & YOLO in Business
CNNs — Classification & Analysis
Healthcare: Detecting tumours in X-rays and MRI scans (in some studies, with accuracy comparable to specialists)
Retail: Visual product search — "find similar shoes"
Finance: Cheque and document verification; fraud detection via signature analysis
Manufacturing: Defect detection on production lines (faster than human inspection)
Agriculture: Identifying crop disease from drone imagery
YOLO — Real-Time Detection
Autonomous vehicles: Detecting cars, pedestrians, cyclists, traffic signs in real time
Retail analytics: Customer counting and shelf stock monitoring from CCTV
Security: Intruder detection and perimeter monitoring
Sport analytics: Tracking player positions and ball movement
Logistics: Reading package labels and sorting items on conveyor belts
5.2 Limitations and Practical Considerations
Technical limitations
Data hungry: CNNs typically need thousands (or millions) of labelled images to train from scratch
Computationally expensive: Training requires GPUs and significant time/cost
Sensitive to distribution shift: A model trained on daytime images may fail on night images
Small objects: Early YOLO versions struggle to detect very small or tightly clustered objects
Ethical and business considerations
Bias: Models trained on unrepresentative data embed that bias (e.g., facial recognition systems performing poorly on darker skin tones)
Privacy: Real-time surveillance raises serious privacy and consent issues
Interpretability: CNNs are largely "black boxes" — hard to explain why a decision was made
Edge cases: High confidence ≠ correct — models can fail unexpectedly on unusual inputs
Key Principle
Always ask: What happens when the model is wrong? Who is affected?
5.3 CNN vs YOLO — When to Use Which?
Aspect | Standard CNN | YOLO
Task | Image classification | Object detection
Output | One label per image | Labels + bounding boxes for all objects
Speed | Fast | Real-time capable
Use when | You need to know what is in an image | You need to know what and where
Example | Is this a defective product? (yes/no) | Mark every defect on the product image
Training data | Labels only | Labels + bounding box coordinates
Business Scenario
A retailer wants to check if shelves are stocked.
→ CNN: "Is this shelf full or empty?" (classification)
→ YOLO: "Which specific products are missing and where?" (detection)
Week 5 Summary — Key Takeaways
CNNs
Images are grids of numbers (pixels)
Filters detect local patterns; shared weights make CNNs efficient
Early layers = edges; deep layers = complex objects
Transfer learning lets you reuse pre-trained models
YOLO
Object detection = what + where
Divides image into a grid; each cell predicts boxes simultaneously
Single forward pass → real-time speed
NMS removes duplicate overlapping predictions
YOLOv8 / v11: state-of-the-art, 3-line Python API
The Big Picture
CNNs gave computers the ability to understand images. YOLO made that understanding fast enough for the real world. Together, they underpin much of modern AI in business — from medical imaging to autonomous vehicles to retail analytics.
Looking Ahead
Next Week — Week 6
We move from vision to language: Generative AI and Large Language Models.
How do models like GPT and Gemini generate text?
What is a Transformer? (attention mechanism)
Pre-training vs fine-tuning vs prompting
Business applications: summarisation, Q&A, code generation
Before Next Session
Recommended Practice
Open the Week 5 Colab notebook and run the CNN demo
Try changing the number of filters and observe the effect on accuracy
Run the YOLO detection example on your own image
Reflection question
Think of a business problem in your industry where knowing where something is (not just what it is) would be valuable. Could YOLO address it?
Week 5 · Data5000
Questions?
CNN Architecture & YOLO
Key terms: pixel, filter, convolution, feature map, ReLU, max pooling, flatten, dense layer, transfer learning, bounding box, grid, NMS