
Data5000 · Week 5

Convolutional Neural Networks & YOLO

Teaching computers to see: from pixels to objects

AI Programming in Business Analytics
Kaplan Business School

KBS
0.1
Overview

What You Will Learn Today

By the end of this session you will be able to:

Big Idea
You do not need to memorise equations. Focus on the intuition — understanding why each layer exists is more important than the maths.

Today's Agenda

# | Topic | Approx. Time
1 | Images as numbers: how computers see | 15 min
2 | Convolution: the core operation | 25 min
3 | Building a full CNN, layer by layer | 30 min
4 | YOLO: detecting objects in real time | 25 min
5 | Real-world applications & wrap-up | 15 min
Section 1

How Computers See Images

Images are just grids of numbers — and that changes everything about how we process them.


1.1 An Image is a Grid of Numbers

Every image on your screen is made up of tiny squares called pixels. Each pixel stores a brightness value.

  • A greyscale image: one number per pixel (0 = black, 255 = white)
  • A colour image: three numbers per pixel — Red, Green, Blue (RGB)
Example
A 28 × 28 greyscale image (like a handwritten digit) = 784 numbers.
A 224 × 224 colour photo = 224 × 224 × 3 = 150,528 numbers.

The computer never actually "sees" a cat. It just sees thousands of numbers that represent brightness values across the image.

 20  80 160 210 240
 50 120 185 220 250
 10  60 100 155 225
  5  30  75 130 190
 25  85 145 195 245
5 × 5 greyscale pixel grid — each cell is a number 0–255
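The same idea can be sketched in code. This is a minimal illustration, assuming NumPy is available; the array names (`digit`, `photo`, `grid`) are made up for the example, and the two blank images are placeholders rather than real data.

```python
import numpy as np

# Greyscale: one number per pixel (values 0-255 fit in an unsigned 8-bit integer).
digit = np.zeros((28, 28), dtype=np.uint8)

# Colour: three numbers (R, G, B) per pixel.
photo = np.zeros((224, 224, 3), dtype=np.uint8)

# The 5 x 5 greyscale grid from the figure above.
grid = np.array([
    [20, 80, 160, 210, 240],
    [50, 120, 185, 220, 250],
    [10, 60, 100, 155, 225],
    [5, 30, 75, 130, 190],
    [25, 85, 145, 195, 245],
], dtype=np.uint8)

print(digit.size)   # 784
print(photo.size)   # 150528
print(grid.shape)   # (5, 5)
```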

1.2 Why Can't We Just Use a Normal Neural Network?

A regular (fully connected) neural network connects every input to every neuron.

Problem
A 224 × 224 colour photo has 150,528 input values (50,176 pixels × 3 colour channels).
If the first hidden layer has 1,000 neurons, that's 150 million weights to learn — just for one layer!
  • Too many parameters → the model is huge and slow to train
  • No spatial awareness → the network doesn't know that nearby pixels are related
  • Not translation-invariant → to the network, a cat in the top-left corner looks completely different from a cat in the bottom-right
The Solution

CNNs solve all three problems by:

  • Sharing weights (one filter scans the whole image)
  • Using local connections (only look at nearby pixels)
  • Detecting the same pattern anywhere in the image
[Figure: a flat network, where all inputs connect to every neuron, beside a CNN, where one shared filter scans the entire image to produce a feature map]
Flat network vs CNN — fewer parameters, smarter design
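A rough count makes the difference concrete. The sketch below ignores bias terms, and the convolutional layer (32 filters of size 3 × 3 over 3 colour channels) is an illustrative assumption, not a fixed rule.

```python
# Inputs: one number per colour value in a 224 x 224 RGB photo.
n_inputs = 224 * 224 * 3

# Fully connected layer: every input connects to each of 1,000 neurons.
dense_weights = n_inputs * 1000

# Convolutional layer (assumed for illustration): 32 filters, each 3 x 3 across 3 channels.
# The same filter weights are reused at every position in the image.
conv_weights = 32 * (3 * 3 * 3)

print(f"{dense_weights:,}")  # 150,528,000
print(f"{conv_weights:,}")   # 864
```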
Section 2

The Convolution Operation

A small filter slides over the image, detecting one type of pattern at a time.


2.1 What is Convolution? The Spotlight Analogy

Think of convolution as shining a small spotlight (called a filter or kernel) across an image, one step at a time.

  • The filter is a small grid — typically 3 × 3 or 5 × 5 pixels
  • At each position, it multiplies its values with the image pixels underneath and adds them up
  • The result is a single number that summarises "how strongly this pattern appeared here"
  • The filter then slides across to the next position and repeats
Definition
A filter (kernel) is a small matrix of learnable numbers. During training, the CNN automatically learns the best filter values to detect useful features.
[Figure: a 3×3 filter positioned over a 5×5 image; multiply and sum gives one number in the feature map]

Image (5×5):
1 2 0 1 3
0 3 1 2 1
2 1 0 3 2
1 0 2 1 0
3 1 0 2 1

Filter (Sobel edge detector):
1 0 −1
2 0 −2
1 0 −1
The red box slides across the image; each position produces one output number

2.2 Worked Example — One Convolution Step

Suppose the filter sits over the top-left 3×3 patch of the image:

Image patch (top-left 3×3):

1 2 0
0 3 1
2 1 0

Filter:

1 0 −1
2 0 −2
1 0 −1

Calculation:

Multiply each pair, then sum:

Step by Step
(1×1) + (2×0) + (0×−1)
+ (0×2) + (3×0) + (1×−2)
+ (2×1) + (1×0) + (0×−1)

= 1 + 0 + 0 + 0 + 0 − 2 + 2 + 0 + 0
= 1
Output
The number 1 is placed in the top-left cell of the feature map. The filter then slides right and we repeat for every position.

This particular filter (called a Sobel filter) detects vertical edges. A large output (positive or negative) means a strong vertical edge at that location; an output near 0 means no edge.
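The worked step above can be repeated for every filter position in a few lines. This is a minimal sketch, assuming NumPy, a stride of 1 and no padding; the function name `convolve2d_valid` is made up for the example. Like most deep learning libraries, it slides the filter without flipping it.

```python
import numpy as np

# The 5 x 5 image and Sobel filter from the figure in section 2.1.
image = np.array([
    [1, 2, 0, 1, 3],
    [0, 3, 1, 2, 1],
    [2, 1, 0, 3, 2],
    [1, 0, 2, 1, 0],
    [3, 1, 0, 2, 1],
])
sobel_x = np.array([
    [1, 0, -1],
    [2, 0, -2],
    [1, 0, -1],
])

def convolve2d_valid(img, kernel):
    """Slide the kernel over img (stride 1, no padding); multiply and sum at each position."""
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=img.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

feature_map = convolve2d_valid(image, sobel_x)
print(feature_map[0, 0])  # 1, matching the hand calculation
print(feature_map)        # the full 3 x 3 feature map
```

A 5 × 5 image and a 3 × 3 filter give a 3 × 3 feature map, because the filter fits in only three positions along each side.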


2.3 What Filters Actually Detect

Different filter values detect different visual patterns. A CNN learns the best filters automatically during training.

[Figure: three example filters]

Edge detector: detects vertical edges (left − right intensity) → high activation at edges
1 0 −1
2 0 −2
1 0 −1

Blur / smooth: averages nearby pixels (reduces noise) → smoother, blurred image

Sharpen: emphasises the centre pixel (enhances fine details) → crisper edges
0 −1 0
−1 5 −1
0 −1 0

In a CNN, filters are NOT hand-designed. The network learns them automatically!
Figure 2.3: Three manually-designed filters — in practice, CNNs learn hundreds of such filters

2.4 Feature Maps — Seeing the Image Differently

A single filter produces one feature map. In practice, a convolutional layer uses many filters at once.

[Figure: Input image (224 × 224 × 3) → Conv layer (32 filters, each 3×3) → 32 maps (224×224×32) → ReLU (negatives → 0) → Pool → 32 maps (112×112×32, halved). Each map highlights where one type of pattern appears in the image.]
Figure 2.4: One convolutional block — 32 filters each produce a feature map
Intuition
Early filters detect edges and colours. Deeper layers combine these to detect textures, then shapes, then object parts (like ears, eyes, wheels).
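To see how several filters produce a stack of feature maps, here is a small sketch, assuming NumPy. It uses a tiny random 8 × 8 × 3 "image" and 4 random filters as stand-ins for a real photo and learned filters, and the function name `conv_layer` is made up. It applies no padding, so each map shrinks from 8 × 8 to 6 × 6; the figure above keeps 224 × 224, which implies the edges are padded.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                 # tiny stand-in for a 224 x 224 x 3 photo
filters = rng.standard_normal((4, 3, 3, 3))   # 4 filters, each 3 x 3 across 3 channels

def conv_layer(img, kernels):
    """Apply every filter at every position (stride 1, no padding)."""
    n, kh, kw, _ = kernels.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    maps = np.zeros((out_h, out_w, n))
    for f in range(n):
        for i in range(out_h):
            for j in range(out_w):
                # One filter covers all colour channels of the patch at once.
                maps[i, j, f] = np.sum(img[i:i + kh, j:j + kw, :] * kernels[f])
    return maps

feature_maps = conv_layer(image, filters)
print(feature_maps.shape)  # (6, 6, 4): one 6 x 6 map per filter
```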

2.5 ReLU — Keeping Only the Positives

After each convolution, we apply an activation function. The most common one is ReLU (Rectified Linear Unit).

ReLU Formula
\(\text{ReLU}(x) = \max(0,\, x)\)

If the value is positive → keep it.
If the value is negative → set it to 0.

Why? Negative convolution outputs mean "this pattern was NOT found here." We simply ignore them. This also introduces non-linearity, letting the network learn complex patterns.

Example
Input: [−3, 0, 2, −1, 5, −0.5]
After ReLU: [0, 0, 2, 0, 5, 0]
[Figure: graph of ReLU(x): the output is 0 for x < 0 (cut off) and equals x for x ≥ 0 (keep)]
Figure 2.5: ReLU keeps positives, zeros out negatives
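The example above takes one line with NumPy. This is a minimal sketch; the helper name `relu` is chosen for the example.

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x): keep positives, set negatives to 0."""
    return np.maximum(0, x)

values = np.array([-3, 0, 2, -1, 5, -0.5])
activated = relu(values)
print(activated)  # zeros wherever the input was negative
```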

2.6 Pooling: Downsizing While Keeping the Strongest Signals

After ReLU, a pooling layer shrinks each feature map to make the network faster and more robust.

Max Pooling is the most common type. It slides a 2×2 window over the feature map and keeps only the largest value in each window.

Why Max?
We want to know whether a feature was detected somewhere in that region — not exactly where. Max pooling answers "was this feature present?" with the strongest signal.
Effect
A 4×4 feature map becomes a 2×2 map after 2×2 max pooling — 4× fewer numbers to process, while preserving the most important detections.
4×4 feature map:
4 2 7 1
3 1 6 8
2 5 1 3
1 3 2 6

2×2 output after max pooling:
4 8
5 6
Max pooling picks the largest value from each 2×2 region
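The pooling example can be reproduced with a short sketch, assuming NumPy and a feature map with even height and width. The function name `max_pool_2x2` is made up for the example.

```python
import numpy as np

feature_map = np.array([
    [4, 2, 7, 1],
    [3, 1, 6, 8],
    [2, 5, 1, 3],
    [1, 3, 2, 6],
])

def max_pool_2x2(fm):
    """2 x 2 max pooling with stride 2; assumes even height and width."""
    h, w = fm.shape
    # Split the map into non-overlapping 2 x 2 blocks, then take each block's maximum.
    blocks = fm.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

pooled = max_pool_2x2(feature_map)
print(pooled)
# [[4 8]
#  [5 6]]
```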

Knowledge Check — Sections 1 & 2

Q1: A colour image (RGB) that is 100 × 100 pixels is represented as how many numbers?
A colour image has 3 channels (R, G, B). Each pixel requires 3 values, so a 100×100 colour image = 100 × 100 × 3 = 30,000 numbers.
Q2: What does a convolutional filter do?
A filter (kernel) slides across the image performing multiply-and-sum at each position, producing a feature map that shows where that pattern appears. Removing negatives is ReLU; resizing is pooling.
Section 3

The Full CNN Architecture

Stacking layers: Conv → ReLU → Pool → Flatten → Dense → Output


3.1 The CNN Pipeline — Layer by Layer

[Figure: Input image (e.g. 224×224×3) → CONV (filter slides over image, produces feature maps) → ReLU (max(0, x), zeroes negatives) → POOL (2×2 max, halves size) → (repeat Conv/ReLU/Pool 2–5 times) → FLATTEN (2D maps → 1D vector, reshape only) → DENSE (fully connected layers combine features) → OUTPUT (softmax / sigmoid, class probabilities)]
Figure 3.1: The standard CNN pipeline from image to prediction
Layer | What it does | Output format
Conv | Detects local patterns using filters | 3D: width × height × channels
ReLU | Removes negatives (non-linearity) | 3D: same shape
Pool | Shrinks spatially, keeps strongest signal | 3D: smaller width × height
Flatten | Converts 3D volume to a 1D list of numbers | 1D vector
Dense | Standard neural network layer | 1D vector (smaller)
Output | Final classification probabilities | 1D: one value per class
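One way to follow the table is to track the output shape after each layer. The sketch below is plain Python with no learning involved. It assumes convolutions that keep the spatial size (padded edges) and 2 × 2 max pooling. The function name `trace_shapes` and the filter counts per block are hypothetical.

```python
def trace_shapes(height, width, channels, filters_per_block):
    """Track the output shape through Conv -> ReLU -> Pool blocks, then Flatten.

    Assumes padded convolutions (spatial size unchanged) and 2 x 2 max pooling.
    """
    shapes = [("Input", (height, width, channels))]
    for n_filters in filters_per_block:
        channels = n_filters                      # Conv: one output channel per filter
        shapes.append(("Conv + ReLU", (height, width, channels)))
        height, width = height // 2, width // 2   # Pool: halve width and height
        shapes.append(("Pool", (height, width, channels)))
    shapes.append(("Flatten", (height * width * channels,)))
    return shapes

# Hypothetical filter counts for five Conv/ReLU/Pool blocks.
shapes = trace_shapes(224, 224, 3, [16, 32, 64, 128, 128])
for name, shape in shapes:
    print(f"{name:12s} {shape}")
```

With these assumed settings the last pooled volume is 7 × 7 × 128, which flattens to a vector of 6,272 numbers.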

3.2 Flatten and Dense — The Decision Stage

Flatten

After several Conv+Pool blocks, we have a 3D volume (width × height × depth). We need to convert it into a simple list to feed into a standard neural network.

Example
A 7 × 7 × 128 feature map has
7 × 7 × 128 = 6,272 numbers.
Flatten simply lays them end-to-end → a vector of length 6,272.

No learning happens here — it's just a reshape.

Dense (Fully Connected) Layers

This is the "decision-making" part of the CNN. The flat vector is passed through one or more standard neural network layers.

  • Each neuron connects to every value in the vector
  • The network combines all the detected features to make a final prediction
Final Output (Softmax)
The last dense layer outputs one probability per class. E.g., for cat/dog/bird:
[Cat: 0.85, Dog: 0.12, Bird: 0.03]
→ Predicted class: Cat
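Softmax is what turns the last layer's raw scores into probabilities. This is a minimal sketch, assuming NumPy; the raw scores (`logits`) are invented for illustration and are not taken from a trained model.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

classes = ["Cat", "Dog", "Bird"]
logits = np.array([3.0, 1.0, -0.5])     # hypothetical raw scores from the last dense layer
probs = softmax(logits)
predicted = classes[int(np.argmax(probs))]

print(dict(zip(classes, probs.round(2))))
print("Predicted class:", predicted)
```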

3.3 How Does a CNN Actually Learn?

A CNN learns its filter values and dense layer weights by repeatedly adjusting them to reduce a loss function. Backpropagation is the process that works out how much each weight contributed to the error.

  1. Forward pass — image goes through all layers, producing a prediction (e.g., "Cat: 62%")
  2. Calculate loss — compare prediction to true label. If it said "Cat: 62%" but it was actually a Dog, the loss is high.
  3. Backward pass — the error is traced back through every layer, and each weight is adjusted slightly to reduce the loss
  4. Repeat for thousands of images → filters become meaningful detectors
Forward Pass → Compute Loss → Backward Pass → Update Weights → (repeat)
The training loop — repeat until loss is minimised
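The four steps can be sketched on a deliberately tiny model. The example below is not a CNN: it trains a single-layer classifier on made-up data with NumPy, with the gradients written out by hand. Deep learning frameworks compute these gradients automatically for every layer. The data, learning rate and step count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 examples with 2 features; the label is 1 when the features sum to more than 0.
X = rng.standard_normal((200, 2))
y = (X.sum(axis=1) > 0).astype(float)

w = np.zeros(2)        # weights to learn
b = 0.0                # bias to learn
learning_rate = 0.5
losses = []

for step in range(100):
    # 1. Forward pass: predicted probability for each example.
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))

    # 2. Loss: binary cross-entropy between predictions and true labels.
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    losses.append(loss)

    # 3. Backward pass: gradient of the loss with respect to each parameter.
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)

    # 4. Update: nudge each parameter against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```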

3.4 Famous CNN Architectures — Shoulders of Giants

You don't need to build a CNN from scratch. Many pre-trained models are freely available.

Architecture | Year | Layers | Notable For
LeNet-5 | 1998 | 7 | One of the first successful CNNs; handwritten digit recognition
AlexNet | 2012 | 8 | Launched the deep learning revolution; won ImageNet by a wide margin
VGG-16 | 2014 | 16 | Simple and uniform design; widely used as a baseline
ResNet-50 | 2015 | 50 | Skip connections: allowed very deep networks without vanishing gradients
EfficientNet | 2019 | varies | Strong accuracy-to-compute ratio; widely used in production
Transfer Learning
These models are pre-trained on millions of images. In practice, we reuse their learned filters and only retrain the final output layer for our specific task — saving weeks of computation.

Knowledge Check — Section 3

Q3: What is the purpose of the Flatten layer in a CNN?
The Flatten layer performs a reshape operation — no weights, no learning. It converts a 3D volume (e.g., 7×7×128) into a 1D vector (6,272) that a fully-connected Dense layer can accept as input.
Q4: What is Transfer Learning?
Transfer learning reuses the lower layers (which detect general edges, shapes, textures) from a large pre-trained model. Only the final classification layers need fine-tuning for the new task — making it practical even with small datasets.
Section 4

YOLO — You Only Look Once

Detecting and locating multiple objects in a single forward pass through the network.


4.1 Classification vs Object Detection

Image Classification

Question asked
"What is in this image?"
→ Answer: "Cat" (with 92% confidence)
  • One label for the whole image
  • No information about where the object is
  • Example: "Is this an invoice or a receipt?"

Object Detection

Question asked
"What is in this image and where?"
→ Answer: "Cat at [120, 80, 200, 160]"
  • Returns a bounding box around every detected object
  • Can detect multiple objects at different locations
  • Example: "Find all vehicles and pedestrians in this CCTV frame"
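[Figure: the same image passed through a CNN. Classification outputs whole-image scores (Cat: 92%, Dog: 6%); detection draws a labelled box around the cat and another around the dog.]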
Figure 4.1: Classification gives one label; detection gives labels AND locations for every object

4.2 The Challenge — Objects Can Be Anywhere

Object detection is harder than classification because:

  • Objects can appear anywhere in the image
  • There can be multiple objects of different sizes
  • We need to output both a class label AND a bounding box for each one

Traditional Approach (Slow)

Older methods like R-CNN would:

  1. Propose ~2,000 candidate regions in the image
  2. Run a CNN on each region separately

This was accurate but very slow: up to 47 seconds per image!
YOLO's Insight

Instead of looking at thousands of regions one at a time, YOLO asks: "What if we run the CNN just once on the whole image?"

The result: real-time detection at 30–100+ frames per second.

Speed comparison
R-CNN: ~47 seconds/image
Fast R-CNN: ~2 seconds/image
YOLO: <0.03 seconds/image (real time!)

4.3 How YOLO Works — The Grid

YOLO divides the image into a grid (e.g., 7 × 7 = 49 cells). Each cell is responsible for detecting objects whose centre falls within it.

For each cell, the network predicts:

  • Bounding box coordinates: x, y (centre), width, height
  • Confidence score: how sure is it that an object is here?
  • Class probabilities: is it a car, person, dog, etc.?
Bounding Box
A rectangle defined by four numbers: the centre coordinates (x, y) and the box dimensions (width, height). Expressed as fractions of the image size (0 to 1).

All 49 cells make their predictions simultaneously in a single forward pass — that's why it's "You Only Look Once."

[Figure: image overlaid with a 7×7 grid (49 cells total); detected boxes are labelled "car 0.91" and "person 0.87"]
Highlighted cells are "responsible" for each detected object
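Two small helpers illustrate the grid idea: which cell is responsible for a box, and how a centre-format box maps to pixel corners. This is a sketch that assumes all four box numbers are fractions of the image size, as in the bounding-box definition above. The function names, the example box and the 640 × 480 image size are hypothetical.

```python
def responsible_cell(x_centre, y_centre, grid_size=7):
    """Return (row, col) of the grid cell containing a box centre given as image fractions."""
    col = min(int(x_centre * grid_size), grid_size - 1)
    row = min(int(y_centre * grid_size), grid_size - 1)
    return row, col

def to_pixel_corners(x_centre, y_centre, width, height, img_w, img_h):
    """Convert a normalised centre-format box to pixel (x_min, y_min, x_max, y_max)."""
    x_min = (x_centre - width / 2) * img_w
    y_min = (y_centre - height / 2) * img_h
    x_max = (x_centre + width / 2) * img_w
    y_max = (y_centre + height / 2) * img_h
    return x_min, y_min, x_max, y_max

# Hypothetical box: centred in the image, a quarter as wide and half as tall.
box = (0.5, 0.5, 0.25, 0.5)
cell = responsible_cell(box[0], box[1])
corners = to_pixel_corners(*box, img_w=640, img_h=480)

print(cell)     # (3, 3): the middle cell of a 7 x 7 grid
print(corners)  # (240.0, 120.0, 400.0, 360.0)
```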

4.4 YOLO Output — Predictions and Filtering

What the network outputs

For each grid cell, YOLO outputs a vector:

Per Cell Output
[x, y, w, h, confidence, P(class₁), P(class₂), …, P(classN)]
  • x, y: centre of the box (relative to the cell)
  • w, h: width and height (relative to image)
  • confidence: P(object) × overlap (IoU) between the predicted box and the ground-truth box
  • P(classᵢ): probability it belongs to each class
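Reading one cell's vector might look like the sketch below, which assumes NumPy. The class list and all the numbers are invented for illustration. It follows the simplified one-box-per-cell layout shown above; the original YOLOv1 network predicts two boxes per cell.

```python
import numpy as np

class_names = ["car", "person", "dog"]   # hypothetical class list

# One cell's output: [x, y, w, h, confidence, P(car), P(person), P(dog)] (made-up values).
cell_output = np.array([0.4, 0.6, 0.2, 0.3, 0.9, 0.8, 0.15, 0.05])

x, y, w, h, confidence = cell_output[:5]
class_probs = cell_output[5:]

# Class-specific score: box confidence multiplied by each class probability.
scores = confidence * class_probs
best = int(np.argmax(scores))

print(class_names[best], round(float(scores[best]), 2))  # car 0.72

# Total numbers predicted for a 7 x 7 grid under this simplified layout.
total_outputs = 7 * 7 * (5 + len(class_names))
print(total_outputs)  # 392
```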

Non-Maximum Suppression (NMS)

Many cells near an object all predict a box. We end up with dozens of overlapping boxes.

NMS cleans this up:

  1. Keep the box with the highest confidence
  2. Remove any boxes that overlap it too much (Intersection over Union, IoU, above a chosen threshold)
  3. Repeat for the next highest
Result
One clean, final bounding box per detected object.
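The three NMS steps can be sketched in plain Python. This is an illustrative version, not the implementation used inside any particular YOLO release. Boxes are assumed to be (x_min, y_min, x_max, y_max) in pixels, the 0.5 threshold is an assumed default, and the detections are made up. In practice NMS is usually applied separately for each class.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)   # 1. keep the highest-confidence remaining box
        keep.append(best)
        # 2. drop boxes that overlap it too much; 3. the loop repeats with what is left
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Made-up detections: two overlapping boxes on one object, one box on another.
boxes = [(100, 100, 200, 200), (110, 105, 205, 210), (300, 120, 380, 220)]
scores = [0.91, 0.75, 0.87]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```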

4.5 YOLO Versions — A Brief History

Version | Year | Key Improvement | Speed / Accuracy
YOLOv1 | 2016 | Original one-pass detection concept | Fast, struggled with small/clustered objects
YOLOv3 | 2018 | Multi-scale detection (detects small objects better) | Good balance; popular baseline
YOLOv5 | 2020 | PyTorch-native, easy to use and fine-tune | Very popular in industry
YOLOv8 | 2023 | Unified framework: detection, segmentation, pose | State-of-the-art accuracy + speed
YOLOv11 | 2024 | Further efficiency gains, edge-device optimised | Fastest yet; production deployments
Code: running YOLOv8 in a few lines
# Requires the ultralytics package: pip install ultralytics
from ultralytics import YOLO
model = YOLO("yolov8n.pt")     # load a pre-trained model (downloaded on first use)
results = model("street.jpg")  # detect objects in one image
results[0].show()              # display the image with bounding boxes

Knowledge Check — Section 4

Q5: What does "You Only Look Once" mean in the context of YOLO?
YOLO's key innovation is running the CNN once on the full image (divided into a grid), predicting bounding boxes and class labels for all cells simultaneously — unlike older methods that ran the CNN separately on thousands of candidate regions.
Q6: What problem does Non-Maximum Suppression (NMS) solve?
Since multiple neighbouring cells predict boxes for the same object, NMS keeps the highest-confidence box and suppresses all others that overlap it by more than a set threshold (IoU). The result is one clean bounding box per object.
Section 5

Real-World Applications

Where are CNNs and YOLO actually used in business today?


5.1 CNNs & YOLO in Business

CNNs — Classification & Analysis

  • Healthcare: Detecting tumours in X-rays and MRI scans (often matching specialist accuracy)
  • Retail: Visual product search — "find similar shoes"
  • Finance: Cheque and document verification; fraud detection via signature analysis
  • Manufacturing: Defect detection on production lines (faster than human inspection)
  • Agriculture: Identifying crop disease from drone imagery

YOLO — Real-Time Detection

  • Autonomous vehicles: Detecting cars, pedestrians, cyclists, traffic signs in real time
  • Retail analytics: Customer counting and shelf stock monitoring from CCTV
  • Security: Intruder detection and perimeter monitoring
  • Sport analytics: Tracking player positions and ball movement
  • Logistics: Reading package labels and sorting items on conveyor belts

5.2 Limitations and Practical Considerations

Technical limitations

  • Data hungry: CNNs typically need thousands (or millions) of labelled images to train from scratch
  • Computationally expensive: Training requires GPUs and significant time/cost
  • Sensitive to distribution shift: A model trained on daytime images may fail on night images
  • Small objects: early YOLO versions struggle to detect very small or tightly clustered objects

Ethical and business considerations

  • Bias: Models trained on unrepresentative data embed that bias (e.g., facial recognition systems performing poorly on darker skin tones)
  • Privacy: Real-time surveillance raises serious privacy and consent issues
  • Interpretability: CNNs are largely "black boxes" — hard to explain why a decision was made
  • Edge cases: High confidence ≠ correct — models can fail unexpectedly on unusual inputs
Key Principle
Always ask: What happens when the model is wrong? Who is affected?

5.3 CNN vs YOLO — When to Use Which?

 | Standard CNN | YOLO
Task | Image classification | Object detection
Output | One label per image | Labels + bounding boxes for all objects
Speed | Fast | Real-time capable
Use when | You need to know what is in an image | You need to know what and where
Example | Is this a defective product? (yes/no) | Mark every defect on the product image
Training data | Labels only | Labels + bounding box coordinates
Business Scenario
A retailer wants to check if shelves are stocked.
CNN: "Is this shelf full or empty?" (classification)
YOLO: "Which specific products are missing and where?" (detection)

Week 5 Summary — Key Takeaways

CNNs

  • Images are grids of numbers (pixels)
  • Filters detect local patterns; shared weights make CNNs efficient
  • Pipeline: Conv → ReLU → Pool → Flatten → Dense → Output
  • Early layers = edges; deep layers = complex objects
  • Transfer learning lets you reuse pre-trained models

YOLO

  • Object detection = what + where
  • Divides image into a grid; each cell predicts boxes simultaneously
  • Single forward pass → real-time speed
  • NMS removes duplicate overlapping predictions
  • YOLOv8 / v11: state-of-the-art, 3-line Python API
The Big Picture
CNNs gave computers the ability to understand images. YOLO made that understanding fast enough for the real world. Together, they underpin much of modern AI in business — from medical imaging to autonomous vehicles to retail analytics.

Looking Ahead

Next Week — Week 6

We move from vision to language: Generative AI and Large Language Models.

  • How do models like GPT and Gemini generate text?
  • What is a Transformer? (attention mechanism)
  • Pre-training vs fine-tuning vs prompting
  • Business applications: summarisation, Q&A, code generation

Before Next Session

Recommended Practice
  • Open the Week 5 Colab notebook and run the CNN demo
  • Try changing the number of filters and observe the effect on accuracy
  • Run the YOLO detection example on your own image
Reflection question
Think of a business problem in your industry where knowing where something is (not just what it is) would be valuable. Could YOLO address it?

Questions?

CNN Architecture & YOLO

Key terms: pixel, filter, convolution, feature map, ReLU, max pooling,
flatten, dense layer, transfer learning, bounding box, grid, NMS
