Data5000 · Week 5

Convolutional Neural Networks & YOLO

Teaching computers to see: from pixels to objects

AI Programming in Business Analytics
Kaplan Business School

KBS


Overview

What You Will Learn Today

By the end of this session you will be able to:

Explain how a computer sees an image as numbers
Describe what a convolution does and why it is useful
Trace the layers of a CNN: Conv → ReLU → Pool → Flatten → Dense
Explain how CNNs learn features automatically
Describe the object detection problem and how YOLO solves it in one pass
Identify real business applications for CNNs and YOLO

Big Idea

You do not need to memorise equations. Focus on the intuition — understanding why each layer exists is more important than the maths.


Today's Agenda

#	Topic	Approx. Time
1	Images as numbers — how computers see	15 min
2	Convolution — the core operation	25 min
3	Building a full CNN — layer by layer	30 min
4	YOLO — detecting objects in real time	25 min
5	Real-world applications & wrap-up	15 min


Section 1

How Computers See Images

Images are just grids of numbers — and that changes everything about how we process them.


1.1 An Image is a Grid of Numbers

Every image on your screen is made up of tiny squares called pixels. Each pixel stores one or more brightness values.

A greyscale image: one number per pixel (0 = black, 255 = white)
A colour image: three numbers per pixel — Red, Green, Blue (RGB)

Example

A 28 × 28 greyscale image (like a handwritten digit) = 784 numbers.
A 224 × 224 colour photo = 224 × 224 × 3 = 150,528 numbers.

The computer never actually "sees" a cat. It just sees thousands of numbers that represent brightness values across the image.
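
One way to check these counts is to build arrays with the same shapes. This is a minimal sketch assuming NumPy is available; random values stand in for a real image, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Greyscale: one brightness value (0-255) per pixel.
grey = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Colour: three values (R, G, B) per pixel.
colour = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

print(grey.shape, grey.size)      # (28, 28) 784
print(colour.shape, colour.size)  # (224, 224, 3) 150528
print(colour[0, 0])               # the three RGB values of the top-left pixel
```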

5 × 5 greyscale pixel grid — each cell is a number 0–255


1.2 Why Can't We Just Use a Normal Neural Network?

A regular (fully connected) neural network connects every input to every neuron.

Problem

A 224 × 224 colour photo has 150,528 input values (224 × 224 pixels × 3 colour channels).
If the first hidden layer has 1,000 neurons, that's about 150 million weights to learn, just for one layer!

Too many parameters → the model is huge and slow to train
No spatial awareness → the network doesn't know that nearby pixels are related
Not translation-invariant → to the network, a cat in the top-left corner looks completely different from a cat in the bottom-right

The Solution

CNNs solve all three problems by:

Sharing weights (one filter scans the whole image)
Using local connections (only look at nearby pixels)
Detecting the same pattern anywhere in the image
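
The saving from weight sharing can be seen with simple arithmetic. The sketch below compares the fully connected layer from the problem above with a convolutional layer; the choice of 32 filters of size 3 × 3 is an assumption for illustration, and bias terms are ignored.

```python
# Fully connected: every input value connects to every neuron.
inputs = 224 * 224 * 3          # 150,528 input values
hidden_neurons = 1_000
dense_weights = inputs * hidden_neurons

# Convolutional: each filter is reused at every position in the image.
# Assumed for illustration: 32 filters of size 3 x 3 over 3 colour channels.
num_filters = 32
conv_weights = num_filters * (3 * 3 * 3)

print(f"Dense layer weights: {dense_weights:,}")  # 150,528,000
print(f"Conv layer weights:  {conv_weights:,}")   # 864
```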

Flat network vs CNN — fewer parameters, smarter design


Section 2

The Convolution Operation

A small filter slides over the image, detecting one type of pattern at a time.


2.1 What is Convolution? The Spotlight Analogy

Think of convolution as shining a small spotlight (called a filter or kernel) across an image, one step at a time.

The filter is a small grid — typically 3 × 3 or 5 × 5 pixels
At each position, it multiplies its values with the image pixels underneath and adds them up
The result is a single number that summarises "how strongly this pattern appeared here"
The filter then slides across to the next position and repeats

Definition

A filter (kernel) is a small matrix of learnable numbers. During training, the CNN automatically learns the best filter values to detect useful features.
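
The sliding multiply-and-sum can be sketched in a few lines of NumPy. This is a simplified illustration (no padding, a step of one pixel, a single greyscale channel), not how a deep learning library implements it internally. Like most such libraries, it does not flip the filter. The 5 × 5 image and the averaging filter are made-up examples.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + kh, col:col + kw]
            # Multiply matching cells, then add everything up.
            feature_map[row, col] = np.sum(patch * kernel)
    return feature_map

image = np.arange(25).reshape(5, 5)   # hypothetical 5 x 5 greyscale image
kernel = np.ones((3, 3)) / 9          # hypothetical averaging filter

result = convolve2d(image, kernel)
print(result.shape)  # (3, 3): a 3 x 3 filter fits in 3 x 3 positions on a 5 x 5 image
```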

The red box slides across the image; each position produces one output number


2.2 Worked Example — One Convolution Step

Suppose the filter sits over the top-left 3×3 patch of the image:

Image patch (top-left 3×3):

1	2	0
0	3	1
2	1	0

Filter:

1	0	−1
2	0	−2
1	0	−1

Calculation:

Multiply each pair, then sum:

Step by Step

(1×1) + (2×0) + (0×−1)
+ (0×2) + (3×0) + (1×−2)
+ (2×1) + (1×0) + (0×−1)

= 1 + 0 + 0 + 0 + 0 − 2 + 2 + 0 + 0
= 1

Output

The number 1 is placed in the top-left cell of the feature map. The filter then slides right and we repeat for every position.

This particular filter (called a Sobel filter) detects vertical edges. High output = strong vertical edge at that location.
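
The arithmetic above can be checked directly. The sketch below repeats the same step in NumPy, then applies the filter to a second, made-up patch with a sharp vertical edge to show how much larger the response becomes.

```python
import numpy as np

patch = np.array([[1, 2, 0],
                  [0, 3, 1],
                  [2, 1, 0]])

sobel_vertical = np.array([[1, 0, -1],
                           [2, 0, -2],
                           [1, 0, -1]])

# One convolution step: multiply matching cells, then sum.
step_output = int(np.sum(patch * sobel_vertical))
print(step_output)  # 1

# Hypothetical patch with a sharp vertical edge: bright on the left, dark on the right.
edge_patch = np.array([[9, 9, 0],
                       [9, 9, 0],
                       [9, 9, 0]])
edge_output = int(np.sum(edge_patch * sobel_vertical))
print(edge_output)  # 36
```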


2.3 What Filters Actually Detect

Different filter values detect different visual patterns. A CNN learns the best filters automatically during training.

Figure 2.3: Three manually-designed filters — in practice, CNNs learn hundreds of such filters


2.4 Feature Maps — Seeing the Image Differently

A single filter produces one feature map. In practice, a convolutional layer uses many filters at once.

Figure 2.4: One convolutional block — 32 filters each produce a feature map

Intuition

Early filters detect edges and colours. Deeper layers combine these to detect textures and simple shapes, then object parts (like ears, eyes, wheels), and finally whole objects.
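
To see how 32 filters produce 32 feature maps, the sketch below applies a stack of filters to one image in a single call. It assumes NumPy 1.20 or later (for `sliding_window_view`); the filters are random and untrained, and the 28 × 28 image size is an arbitrary choice.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(seed=0)
image = rng.random((28, 28))            # hypothetical greyscale image
filters = rng.normal(size=(32, 3, 3))   # 32 random 3 x 3 filters (untrained)

# All 3 x 3 patches of the image: shape (26, 26, 3, 3).
patches = sliding_window_view(image, (3, 3))

# Multiply-and-sum every patch with every filter in one call.
feature_maps = np.einsum("hwij,fij->hwf", patches, filters)

print(feature_maps.shape)  # (26, 26, 32): one 26 x 26 feature map per filter
```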


2.5 ReLU — Keeping Only the Positives

After each convolution, we apply an activation function. The most common one is ReLU (Rectified Linear Unit).

ReLU Formula

\(\text{ReLU}(x) = \max(0,\, x)\)

If the value is positive → keep it.
If the value is negative → set it to 0.

Why? Negative convolution outputs mean "this pattern was NOT found here." We simply ignore them. This also introduces non-linearity, letting the network learn complex patterns.

Example

Input: [−3, 0, 2, −1, 5, −0.5]
After ReLU: [0, 0, 2, 0, 5, 0]
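
The same example in NumPy, as a one-line sketch:

```python
import numpy as np

values = np.array([-3, 0, 2, -1, 5, -0.5])
activated = np.maximum(0, values)  # element-wise max(0, x)
print(activated.tolist())          # [0.0, 0.0, 2.0, 0.0, 5.0, 0.0]
```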

Figure 2.5: ReLU keeps positives, zeros out negatives


2.6 Pooling — Downsizing Without Losing Information

After ReLU, a pooling layer shrinks each feature map to make the network faster and more robust.

Max Pooling is the most common type. It slides a 2×2 window over the feature map and keeps only the largest value in each window.

Why Max?

We want to know whether a feature was detected somewhere in that region — not exactly where. Max pooling answers "was this feature present?" with the strongest signal.

Effect

A 4×4 feature map becomes a 2×2 map after 2×2 max pooling — 4× fewer numbers to process, while preserving the most important detections.
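
A minimal sketch of 2×2 max pooling on a made-up 4×4 feature map. The reshape trick assumes the height and width are both even; libraries handle the general case for you.

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 5],
                        [0, 1, 7, 2],
                        [2, 6, 3, 4]])   # hypothetical 4 x 4 feature map

# Split into non-overlapping 2 x 2 blocks, then take the max of each block.
# Axes after reshape: (block_row, row_in_block, block_col, col_in_block)
blocks = feature_map.reshape(2, 2, 2, 2)
pooled = blocks.max(axis=(1, 3))

print(pooled)
# [[4 5]
#  [6 7]]
```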

Max pooling picks the largest value from each 2×2 region


Knowledge Check — Sections 1 & 2

Q1: A colour image (RGB) that is 100 × 100 pixels is represented as how many numbers?

A colour image has 3 channels (R, G, B). Each pixel requires 3 values, so a 100×100 colour image = 100 × 100 × 3 = 30,000 numbers.

Q2: What does a convolutional filter do?

A filter (kernel) slides across the image performing multiply-and-sum at each position, producing a feature map that shows where that pattern appears. Removing negatives is ReLU; resizing is pooling.


Section 3

The Full CNN Architecture

Stacking layers: Conv → ReLU → Pool → Flatten → Dense → Output


3.1 The CNN Pipeline — Layer by Layer

Figure 3.1: The standard CNN pipeline from image to prediction

Layer	What it does	Output format
Conv	Detects local patterns using filters	3D: width × height × channels
ReLU	Removes negatives (non-linearity)	3D: same shape
Pool	Shrinks spatially, keeps strongest signal	3D: smaller width × height
Flatten	Converts 3D volume to a 1D list of numbers	1D vector
Dense	Standard neural network layer	1D vector (smaller)
Output	Final classification probabilities	1D: one value per class
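
The shapes in the table can be traced with a toy forward pass. This is a sketch under stated assumptions, not a trainable model: the weights are random, there is one conv block with 8 filters, a single dense layer doubles as the output layer, and the 10 classes are hypothetical. It assumes NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(seed=0)

def conv(image, filters):
    """Apply a stack of k x k filters to a 2D image -> (height, width, n_filters)."""
    k = filters.shape[1]
    patches = sliding_window_view(image, (k, k))
    return np.einsum("hwij,fij->hwf", patches, filters)

def relu(x):
    return np.maximum(0, x)

def max_pool(x):
    """2 x 2 max pooling over height and width (both must be even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

image = rng.random((28, 28))            # stand-in for a greyscale digit
filters = rng.normal(size=(8, 3, 3))    # 8 untrained 3 x 3 filters

x = conv(image, filters)
print("Conv   ", x.shape)   # (26, 26, 8)
x = relu(x)
print("ReLU   ", x.shape)   # (26, 26, 8)
x = max_pool(x)
print("Pool   ", x.shape)   # (13, 13, 8)
x = x.reshape(-1)
print("Flatten", x.shape)   # (1352,)

dense_weights = rng.normal(size=(x.size, 10)) * 0.01   # 10 hypothetical classes
probs = softmax(x @ dense_weights)
print("Output ", probs.shape)  # (10,)
```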


3.2 Flatten and Dense — The Decision Stage

Flatten

After several Conv+Pool blocks, we have a 3D volume (width × height × depth). We need to convert it into a simple list to feed into a standard neural network.

Example

A 7 × 7 × 128 feature map has
7 × 7 × 128 = 6,272 numbers.
Flatten simply lays them end-to-end → a vector of length 6,272.

No learning happens here — it's just a reshape.
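
A short NumPy sketch of the same reshape, using a placeholder volume filled with counting numbers:

```python
import numpy as np

volume = np.arange(7 * 7 * 128).reshape(7, 7, 128)  # placeholder feature-map volume
flat = volume.reshape(-1)                            # lay all values end-to-end

print(volume.shape, "->", flat.shape)  # (7, 7, 128) -> (6272,)
# Reshaping back recovers the original volume: nothing was learned or lost.
print(np.array_equal(flat.reshape(7, 7, 128), volume))  # True
```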

Dense (Fully Connected) Layers

This is the "decision-making" part of the CNN. The flat vector is passed through one or more standard neural network layers.

Each neuron connects to every value in the vector
The network combines all the detected features to make a final prediction

Final Output (Softmax)

The last dense layer outputs one probability per class. E.g., for cat/dog/bird:
[Cat: 0.85, Dog: 0.12, Bird: 0.03]
→ Predicted class: Cat
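
Softmax itself is a small calculation. The sketch below uses made-up raw scores for the three classes, so the printed probabilities will be close to, but not exactly, the ones shown above.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Cat", "Dog", "Bird"]
scores = [2.0, 0.0, -1.5]   # hypothetical raw outputs of the last dense layer
probs = softmax(scores)

for name, p in zip(classes, probs):
    print(f"{name}: {p:.2f}")

predicted = classes[probs.index(max(probs))]
print("Predicted class:", predicted)  # Predicted class: Cat
```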


3.3 How Does a CNN Actually Learn?

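A CNN learns its filter values and dense layer weights by trial and correction. A loss function measures how wrong each prediction is, and backpropagation works out how much each weight contributed to that error so it can be adjusted.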

Forward pass — image goes through all layers, producing a prediction (e.g., "Cat: 62%")
Calculate loss — compare prediction to true label. If it said "Cat: 62%" but it was actually a Dog, the loss is high.
Backward pass — the error is traced back through every layer, and each weight is adjusted slightly to reduce the loss
Repeat for thousands of images → filters become meaningful detectors
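
The four steps can be sketched on a deliberately tiny problem: learning one 3 × 3 filter. This is a toy stand-in for a real CNN, with assumptions chosen for illustration: random patches play the role of training images, the "correct answers" come from the vertical-edge filter in 2.2, and with only one layer the backward pass is a single line.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical training data: 500 random 3 x 3 patches, flattened to 9 values each.
patches = rng.normal(size=(500, 9))
target_filter = np.array([1, 0, -1, 2, 0, -2, 1, 0, -1], dtype=float)
targets = patches @ target_filter          # the "true labels" for this toy task

weights = np.zeros(9)                      # the filter starts with no knowledge
learning_rate = 0.1

for step in range(200):
    predictions = patches @ weights                    # forward pass
    errors = predictions - targets
    loss = np.mean(errors ** 2)                        # calculate loss (mean squared error)
    gradient = 2 * patches.T @ errors / len(patches)   # backward pass
    weights -= learning_rate * gradient                # adjust each weight slightly
    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss:.4f}")

print(weights.reshape(3, 3).round(2))  # close to the vertical-edge filter from 2.2
```

The loss shrinks at every printed step, and the learned values end up matching the filter that generated the targets.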

The training loop — repeat until loss is minimised


3.4 Famous CNN Architectures — Shoulders of Giants

You don't need to build a CNN from scratch. Many pre-trained models are freely available.

Architecture	Year	Layers	Notable For
LeNet-5	1998	7	First successful CNN — handwritten digit recognition
AlexNet	2012	8	Launched the deep learning revolution; won ImageNet by a wide margin
VGG-16	2014	16	Simple and uniform design; widely used as a baseline
ResNet-50	2015	50	Skip connections — allowed very deep networks without vanishing gradients
EfficientNet	2019	varies	Strong accuracy-to-compute ratio when released; widely used in production

Transfer Learning

These models are pre-trained on millions of images. In practice, we reuse their learned filters and only retrain the final output layer for our specific task — saving weeks of computation.


Knowledge Check — Section 3

Q3: What is the purpose of the Flatten layer in a CNN?

The Flatten layer performs a reshape operation — no weights, no learning. It converts a 3D volume (e.g., 7×7×128) into a 1D vector (6,272) that a fully-connected Dense layer can accept as input.

Q4: What is Transfer Learning?

Transfer learning reuses the lower layers (which detect general edges, shapes, textures) from a large pre-trained model. Only the final classification layers need fine-tuning for the new task — making it practical even with small datasets.


Section 4

YOLO — You Only Look Once

Detecting and locating multiple objects in a single forward pass through the network.


4.1 Classification vs Object Detection

Image Classification

Question asked

"What is in this image?"
→ Answer: "Cat" (with 92% confidence)

One label for the whole image
No information about where the object is
Example: "Is this an invoice or a receipt?"

Object Detection

Question asked

"What is in this image and where?"
→ Answer: "Cat at [120, 80, 200, 160]"

Returns a bounding box around every detected object
Can detect multiple objects at different locations
Example: "Find all vehicles and pedestrians in this CCTV frame"

Figure 4.1: Classification gives one label; detection gives labels AND locations for every object


4.2 The Challenge — Objects Can Be Anywhere

Object detection is harder than classification because:

Objects can appear anywhere in the image
There can be multiple objects of different sizes
We need to output both a class label AND a bounding box for each one

Traditional Approach (Slow)

Older methods like R-CNN would:

Propose ~2,000 candidate regions in the image
Run a CNN on each region separately
This was accurate but very slow — up to 47 seconds per image!

YOLO's Insight

Instead of looking at thousands of regions one at a time, YOLO asks: "What if we run the CNN just once on the whole image?"

The result: real-time detection, typically 30–100+ frames per second on a GPU, depending on the model size.

Speed comparison

R-CNN: ~47 seconds/image
Fast R-CNN: ~2 seconds/image
YOLO: <0.03 seconds/image (real time!)


4.3 How YOLO Works — The Grid

YOLO divides the image into a grid (e.g., 7 × 7 = 49 cells). Each cell is responsible for detecting objects whose centre falls within it.

For each cell, the network predicts:

Bounding box coordinates: x, y (centre), width, height
Confidence score: how sure is it that an object is here?
Class probabilities: is it a car, person, dog, etc.?

Bounding Box

A rectangle defined by four numbers: the centre coordinates (x, y) and the box dimensions (width, height). Expressed as fractions of the image size (0 to 1).
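
To draw a box on an image, the fractions have to be turned back into pixel positions. The helper below is a hypothetical sketch of that conversion (the function name and the 640 × 480 image size are made up for illustration).

```python
def to_pixel_corners(box, image_width, image_height):
    """Convert (x_centre, y_centre, width, height) fractions into pixel corners."""
    x, y, w, h = box
    left = (x - w / 2) * image_width
    top = (y - h / 2) * image_height
    right = (x + w / 2) * image_width
    bottom = (y + h / 2) * image_height
    return left, top, right, bottom

# Hypothetical box: centred in the image, half as wide and a quarter as tall.
corners = to_pixel_corners((0.5, 0.5, 0.5, 0.25), image_width=640, image_height=480)
print(corners)  # (160.0, 180.0, 480.0, 300.0)
```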

All 49 cells make their predictions simultaneously in a single forward pass — that's why it's "You Only Look Once."
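
Finding the cell that is "responsible" for an object is a matter of checking where its centre lands on the grid. A minimal sketch, assuming the 7 × 7 grid from the example and centre coordinates given as fractions of the image size; the function name is hypothetical.

```python
GRID_SIZE = 7  # a 7 x 7 grid, as in the example above

def responsible_cell(x_centre, y_centre, grid_size=GRID_SIZE):
    """Return (row, col) of the grid cell containing the object's centre.

    Coordinates are fractions of the image size, from 0 to 1.
    """
    # min() keeps a centre at exactly 1.0 inside the last cell.
    col = min(int(x_centre * grid_size), grid_size - 1)
    row = min(int(y_centre * grid_size), grid_size - 1)
    return row, col

print(responsible_cell(0.5, 0.5))   # (3, 3): the middle cell
print(responsible_cell(0.9, 0.1))   # (0, 6): top row, right-most column
```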

Highlighted cells are "responsible" for each detected object


4.4 YOLO Output — Predictions and Filtering

What the network outputs

For each grid cell, YOLO outputs a vector:

Per Cell Output

[x, y, w, h, confidence, P(class₁), P(class₂), …, P(classN)]

x, y: centre of the box (relative to the cell)
w, h: width and height (relative to image)
confidence: P(object) × overlap (IoU) between the predicted box and the ground truth
P(classᵢ): probability it belongs to each class
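
The sketch below splits one cell's vector into its parts and converts the cell-relative centre into image-relative coordinates. It follows the simplified one-box-per-cell layout shown above. The function name, the example numbers and the three class names are assumptions for illustration. Multiplying confidence by the best class probability gives a class-specific score.

```python
GRID_SIZE = 7

def decode_cell(vector, row, col, class_names, grid_size=GRID_SIZE):
    """Split one cell's output vector and express the box centre relative to the image."""
    x, y, w, h, confidence = vector[:5]
    class_probs = vector[5:]

    # x, y are offsets inside the cell, so add the cell position and rescale.
    x_image = (col + x) / grid_size
    y_image = (row + y) / grid_size

    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return {
        "box": (x_image, y_image, w, h),
        "label": class_names[best],
        "score": confidence * class_probs[best],  # class-specific confidence
    }

# Hypothetical output for the cell in row 3, column 3 with three classes.
vector = [0.5, 0.5, 0.2, 0.4, 0.9, 0.1, 0.8, 0.1]
detection = decode_cell(vector, row=3, col=3, class_names=["car", "person", "dog"])
print(detection["label"], detection["box"])
```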

Non-Maximum Suppression (NMS)

Many cells near an object all predict a box. We end up with dozens of overlapping boxes.

NMS cleans this up:

Keep the box with the highest confidence
Remove any boxes that overlap it too much (IoU, Intersection over Union, above a threshold)
Repeat for the next highest

Result

One clean, final bounding box per detected object.
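
The three NMS steps can be sketched in plain Python. This is an illustrative version, not the code a YOLO library runs: boxes are assumed to be (left, top, right, bottom) in pixels, the 0.5 threshold is a common but arbitrary choice, and real systems usually apply NMS separately for each class.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (left, top, right, bottom)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def non_max_suppression(detections, iou_threshold=0.5):
    """detections: list of (box, confidence). Returns the detections that survive NMS."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)   # keep the highest-confidence box
        kept.append(best)
        # Drop every box that overlaps the kept one too much, then repeat.
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept

# Hypothetical detections: two overlapping boxes on one object, one box elsewhere.
detections = [
    ((100, 100, 200, 200), 0.90),
    ((110, 105, 205, 210), 0.75),
    ((300, 300, 380, 360), 0.80),
]
kept = non_max_suppression(detections)
for box, confidence in kept:
    print(box, confidence)
```

Here the 0.75 box overlaps the 0.90 box heavily (IoU of about 0.75), so it is removed; the separate 0.80 box is kept.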


4.5 YOLO Versions — A Brief History

Version	Year	Key Improvement	Speed / Accuracy
YOLOv1	2016	Original one-pass detection concept	Fast, struggled with small/clustered objects
YOLOv3	2018	Multi-scale detection (detects small objects better)	Good balance — popular baseline
YOLOv5	2020	PyTorch-native, easy to use and fine-tune	Very popular in industry
YOLOv8	2023	Unified framework: detection, segmentation, pose	State-of-the-art accuracy + speed
YOLO11 (often written YOLOv11)	2024	Further efficiency gains, edge-device optimised	Fastest yet; production deployments

Code: running YOLOv8 in 4 lines

# Requires: pip install ultralytics
from ultralytics import YOLO

model = YOLO("yolov8n.pt")      # load a pre-trained model (weights download on first use)
results = model("street.jpg")   # detect objects in one image
results[0].show()               # display the image with bounding boxes


Knowledge Check — Section 4

Q5: What does "You Only Look Once" mean in the context of YOLO?

YOLO's key innovation is running the CNN once on the full image (divided into a grid), predicting bounding boxes and class labels for all cells simultaneously — unlike older methods that ran the CNN separately on thousands of candidate regions.

Q6: What problem does Non-Maximum Suppression (NMS) solve?

Since multiple neighbouring cells predict boxes for the same object, NMS keeps the highest-confidence box and suppresses all others that overlap it by more than a set threshold (IoU). The result is one clean bounding box per object.


Section 5

Real-World Applications

Where are CNNs and YOLO actually used in business today?


5.1 CNNs & YOLO in Business

CNNs — Classification & Analysis

Healthcare: Detecting tumours in X-rays and MRI scans (in some studies approaching specialist accuracy)
Retail: Visual product search — "find similar shoes"
Finance: Cheque and document verification; fraud detection via signature analysis
Manufacturing: Defect detection on production lines (faster than human inspection)
Agriculture: Identifying crop disease from drone imagery

YOLO — Real-Time Detection

Autonomous vehicles: Detecting cars, pedestrians, cyclists, traffic signs in real time
Retail analytics: Customer counting and shelf stock monitoring from CCTV
Security: Intruder detection and perimeter monitoring
Sport analytics: Tracking player positions and ball movement
Logistics: Reading package labels and sorting items on conveyor belts


5.2 Limitations and Practical Considerations

Technical limitations

Data hungry: CNNs typically need thousands (or millions) of labelled images to train from scratch
Computationally expensive: Training requires GPUs and significant time/cost
Sensitive to distribution shift: A model trained on daytime images may fail on night images
YOLO small object detection: Early versions struggle to detect very small or tightly clustered objects

Ethical and business considerations

Bias: Models trained on unrepresentative data embed that bias (e.g., facial recognition systems performing poorly on darker skin tones)
Privacy: Real-time surveillance raises serious privacy and consent issues
Interpretability: CNNs are largely "black boxes" — hard to explain why a decision was made
Edge cases: High confidence ≠ correct — models can fail unexpectedly on unusual inputs

Key Principle

Always ask: What happens when the model is wrong? Who is affected?


5.3 CNN vs YOLO — When to Use Which?

	Standard CNN	YOLO
Task	Image classification	Object detection
Output	One label per image	Labels + bounding boxes for all objects
Speed	Fast	Real-time capable
Use when	You need to know what is in an image	You need to know what and where
Example	Is this a defective product? (yes/no)	Mark every defect on the product image
Training data	Labels only	Labels + bounding box coordinates

Business Scenario

A retailer wants to check if shelves are stocked.
→ CNN: "Is this shelf full or empty?" (classification)
→ YOLO: "Which specific products are missing and where?" (detection)


Week 5 Summary — Key Takeaways

CNNs

Images are grids of numbers (pixels)
Filters detect local patterns; shared weights make CNNs efficient
Pipeline: Conv → ReLU → Pool → Flatten → Dense → Output
Early layers = edges; deep layers = complex objects
Transfer learning lets you reuse pre-trained models

YOLO

Object detection = what + where
Divides image into a grid; each cell predicts boxes simultaneously
Single forward pass → real-time speed
NMS removes duplicate overlapping predictions
YOLOv8 / v11: state-of-the-art, usable from a few lines of Python

The Big Picture

CNNs gave computers the ability to understand images. YOLO made that understanding fast enough for the real world. Together, they underpin much of modern AI in business — from medical imaging to autonomous vehicles to retail analytics.


Looking Ahead

Next Week — Week 6

We move from vision to language: Generative AI and Large Language Models.

How do models like GPT and Gemini generate text?
What is a Transformer? (attention mechanism)
Pre-training vs fine-tuning vs prompting
Business applications: summarisation, Q&A, code generation

Before Next Session

Recommended Practice

Open the Week 5 Colab notebook and run the CNN demo
Try changing the number of filters and observe the effect on accuracy
Run the YOLO detection example on your own image

Reflection question

Think of a business problem in your industry where knowing where something is (not just what it is) would be valuable. Could YOLO address it?


Questions?

CNN Architecture & YOLO

Key terms: pixel, filter, convolution, feature map, ReLU, max pooling,
flatten, dense layer, transfer learning, bounding box, grid, NMS
