ML One
Lecture 08
Introduction to supervised learning
+
How does AI learn? - Intuitions on gradient descent
Welcome 👩‍🎤🧑‍🎤👨‍🎤
By the end of this lecture, we'll have learnt about:
The theoretical:
- what is supervised learning: the framework of training
- how to train a neural network on a labeled dataset
- aka how to find the right numbers to put in the weight matrices and bias vectors
- Intuitions on gradient descent
The practical:
- MLP implemented in Python
First of all, don't forget to confirm your attendance on Seats App!
as usual, an AI-related fun project to wake us up
Recap
Last lecture we saw how the dots (adding/multiplying matrices, functions) are connected!!!
🧫 Biological neurons: receive charges, accumulate charges, fire off charges
🧫 Biological neurons
-- A neuron is connected to some other neurons.
-- A neuron is charged by other connected neurons.
-- There are usually different levels of charges emitted from different neurons.
🧫 Biological neurons (continued)
-- The received charges within one neuron are accumulated.
-- We can refer to the level of accumulated charges in one neuron as its activation value.
-- Once a neuron is sufficiently charged, it fires off a charge to the next neurons.
🧫 Biological neurons (continued)
-- This "charging, thresholding and firing" neural process is the backbone of taking sensory input, processing it and producing motor output.
🤖 Artificial neurons
- Think of a neuron as a number container that holds a single number, its activation. 🪫🔋
- Neurons are grouped in layers. 🏘️
- Neurons in different layers are connected hierarchically (usually from left to right). 👉
- There are different weights assigned to each link between two connected neurons. 🔗
๐Ÿ˜๏ธ Layers in MLP
- Put each layer's activations into a vector: a layer activation vector ๐Ÿ“›
๐Ÿ˜๏ธ Layers in MLP
- Input layer: each neuron loads one value from the input (e.g. one pixel's greyscale value in the MNIST dataset)๐Ÿ‘๏ธ
๐Ÿ˜๏ธ Layers in MLP
- Output layer: for image classification task, one neuron represents the probability this neural net assigns to one class๐Ÿ—ฃ๏ธ
๐Ÿ˜๏ธ Layers in MLP
- Hidden layer: any layer between input and output layers and the size (number of neuron) is user-specified ๐Ÿดโ€โ˜ ๏ธ
๐Ÿ˜๏ธ Layers in MLP
- Put each layer's activations into a vector: a layer activation vector ๐Ÿ“›
- Input layer: each neuron loads one value from the input (e.g. one pixel's greyscale value in the MNIST dataset)๐Ÿ‘๏ธ
- Output layer: for image classification task, one neuron represents the probability this neural net assigns to one class๐Ÿ—ฃ๏ธ
- Hidden layer: any layer between input and output layers and the size (number of neuron) is user-specified ๐Ÿดโ€โ˜ ๏ธ
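To make the vector picture concrete, here is a tiny NumPy sketch (my own illustration, with made-up sizes: 784 input pixels for a 28x28 MNIST image, a 16-neuron hidden layer chosen just for this example, 10 output classes) of what the layer activation vectors look like as arrays:

```python
import numpy as np

# Input layer: one activation per pixel of a 28x28 greyscale image -> 784 numbers
input_activations = np.random.rand(784)    # stand-in for real pixel values in [0, 1]

# Hidden layer: its size is our choice (16 here, purely for illustration)
hidden_activations = np.zeros(16)

# Output layer: one activation per class (digits 0-9 -> 10 neurons)
output_activations = np.zeros(10)

print(input_activations.shape, hidden_activations.shape, output_activations.shape)
# (784,) (16,) (10,)
```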
🤖🧮 The computational simulation of the "charging, thresholding and firing" neural process
- The computation runs from left to right (a forward pass) 👉
- The computation calculates each layer activation vector one by one from left to right. 👉
- Overall, the computation takes the input layer activation vector as input and returns the output layer activation vector that ideally represents the correct answer. 👉
🤖🧮 The computational simulation of the "charging, thresholding and firing" neural process
-- Zoom in to the function between every pair of consecutive layers 🔬:
--- Multiply the previous layer's activation vector with a weights matrix: "charging" ⚡️
--- Add a bias vector and feed the result into an activation function (e.g. ReLU): "thresholding" 🪜
--- The result is this layer's activation vector. ✌️
--- We then feed this layer's activation vector as the input to the next layer, until it reaches the last layer (the output layer): "firing" ⛓️
🔗 one layer function:
V_output =
ReLU(WeightsMat · V_input + Bias)
⛓️ the big function for an MLP with one hidden layer (chaining each layer function together):
V_output =
ReLU(WeightsMat_o · ReLU(WeightsMat_h · V_input + Bias_h) + Bias_o)
V_input: input layer's activation vector
V_output: output layer's activation vector
WeightsMat_h, Bias_h: weights matrix and bias vector of the hidden layer
WeightsMat_o, Bias_o: weights matrix and bias vector of the output layer
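Here is a minimal NumPy sketch of that big function (one hidden layer, made-up sizes and random parameters, no training yet), just to show how the one-layer function chains into the full forward pass; the variable names mirror the formulas above:

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)                          # "thresholding": negative values are cut to 0

def forward(v_input, weights_h, bias_h, weights_o, bias_o):
    v_hidden = relu(weights_h @ v_input + bias_h)    # hidden layer function: charge, then threshold
    v_output = relu(weights_o @ v_hidden + bias_o)   # output layer function
    return v_output

# made-up sizes: 784 inputs (28x28 pixels), 16 hidden neurons, 10 output classes
rng = np.random.default_rng(0)
WeightsMat_h, Bias_h = rng.normal(size=(16, 784)), rng.normal(size=16)
WeightsMat_o, Bias_o = rng.normal(size=(10, 16)), rng.normal(size=10)

V_input = rng.random(784)                            # pretend pixel values
print(forward(V_input, WeightsMat_h, Bias_h, WeightsMat_o, Bias_o))   # 10 numbers, one per class
```

With random numbers in the weight matrices and bias vectors the output is meaningless, which is exactly the question below: which numbers should we put in there?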
Recall the amazing perceptual adaptation effect: repeated exposure makes things easier and more familiar! 🥸
Wait, but what numbers shall we put in the weights matrices and bias vectors? 🥹
end of recap 👋
Let's forget about neural networks for now and zoom out a little bit
- Supervised (machine) learning (SL) is a subcategory of machine learning.
- Supervised learning involves training a model on a labeled dataset, where the algorithm learns the mapping from input data to corresponding output labels.
💡 keywords: model, labeled dataset
🖇️ Connected to what we have talked about:
- The handwritten digit recognition task on the MNIST dataset we introduced last week is a supervised learning task.
- Neural networks (e.g. the MLP we introduced last week) are models (big functions with weight matrices and bias vectors).
๐Ÿ–‡๏ธConnected to what we have talked about:
- The handwritten digit recognition task on the MNIST dataset we introduced last week is supervised learning.
- Neural networks (e.g. the MLP we introduced last week) are models.
- The missing pieces: what does it mean to train a model? how do we train a model using a labeled dataset?
To be demystified today!
Let's consider a much easier labeled dataset and a much simpler model.
hi I'm a happy caveman,
I eat potatoes 🥔,
and I collect apples 🍎
can I have a model that predicts the number of 🍎 I can collect given the number of 🥔 I eat today?
This is my labeled dataset at hand:
I ate 0.5 🥔 and collected 2 🍎 on Monday
I ate 2 🥔 and collected 5 🍎 on Tuesday
I ate 1 🥔 and collected 3 🍎 on Wednesday
the input: number of 🥔 I ate
the labeled output: number of 🍎 I collected
the task: to learn a model from the dataset, i.e. a function that can predict the number of 🍎 I will collect (the output)
given the number of 🥔 I eat (the input)
Now I'm puzzled... 🤨
Which function shall I choose?
Numbers are a headache to look at, so I want to plot them out and look at the relation between 🥔 and 🍎 visually!
an easy task: if today I ate 1 🥔, make an educated guess on how many 🍎 I can collect today
a harder task: if today I ate 1.5 🥔, make an educated guess on how many 🍎 I can collect today
Looking at the dataset, unfortunately we don't have a data point that corresponds to 1.5 🥔
What if... there were a continuous line going through all the points, so that I could look up the point corresponding to 1.5 🥔 on this line?
Back to the question: 🤨
Which function shall I choose?
- By looking at the dataset, a linear function (whose graph is a straight line) looks like a good fit.
- a linear function: f(x) = wx + b, where w is the weight and b is the bias
- But there are infinitely many straight lines I can draw on the canvas! Which one is THE best?
But there are infinitely many straight lines I can draw on the canvas! Which one is THE best model?
- BTW, the infinitely many straight lines can be parametrised by the weight and bias according to this expression: f(x) = wx + b, where w is the weight and b is the bias
- Being parametrised means that we can tweak the values of the weight and bias to create all the infinitely many straight lines, where each line is determined by a unique combination of weight and bias.
Which one is THE best straight line?
- The line that "weaves" through all the labeled data points! It means that this line perfectly captures the relation between # of 🥔 and # of 🍎 from my existing experience (the labeled dataset)
- And we can hand-calculate THE weight and bias for this best-fit line easily.
SOLVED!!! 🤘 Happy caveman life!
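For the record, here is that hand calculation written out as a tiny Python sketch (my own illustration): with the three data points above, the slope and intercept come out exactly, and it also answers the 1.5 🥔 question:

```python
# labeled dataset: (potatoes eaten, apples collected)
data = [(0.5, 2), (2, 5), (1, 3)]

# slope from any two points: (5 - 2) / (2 - 0.5) = 2
w = (5 - 2) / (2 - 0.5)          # -> 2.0
b = 2 - w * 0.5                  # -> 1.0, so f(x) = 2x + 1

f = lambda x: w * x + b
print(f(1.0))                    # 3.0 apples for 1 potato (matches Wednesday)
print(f(1.5))                    # 4.0 apples for 1.5 potatoes
print(all(f(x) == y for x, y in data))   # True: the line weaves through every data point
```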
- Note that we have made two choices for this SL task: we chose to use a straight line (a linear function) for weaving through the labeled dataset,
- and we chose the best-fit weight and bias for that linear function.
Caution: this is an overly simplified example where it seems trivial to find a model that has the perfect shape (a straight line) and perfect parameters for weaving through all the data points.
an example of real life data on the whiteboard...
In real life, data is a lot messier and does not fall onto a straight line 🙂
For real life data,
to find a good-fit model/function,
we can't just look at the data and draw a nice weaving curve,
we can only start with an initially guessed function and tweak its parameters till it is in a good weaving position.
That's quite a lot, congrats! 🎉
The supervised learning training process (a summary of what we just talked about)
1. have an initially guessed model (often imperfect)
->
2. feed the input data into the (imperfect) model
->
3. get the (imperfect) model output
->
4. measure how wrong this output is compared to the correct answer (labeled output)
->
5. use the measurement to update the model
->
back to step 2 and repeat
Let me demonstrate the training process for the naive caveman 🍎 regression task on the whiteboard.
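And here is a minimal code sketch of those five steps for the caveman regression task (my own toy version, not the whiteboard demo). Note that step 5 is deliberately naive here: we just nudge each parameter in whichever direction makes the error smaller; the principled way to pick that nudge is the gradient descent story coming next.

```python
data = [(0.5, 2), (2, 5), (1, 3)]      # labeled dataset: (potatoes, apples)

w, b = 0.0, 0.0                        # step 1: an initially guessed (imperfect) model f(x) = wx + b
nudge = 0.01

def total_error(w, b):
    # steps 2-4: feed the inputs in, get the model outputs, measure how wrong they are
    return sum((w * x + b - y) ** 2 for x, y in data)

for _ in range(3000):
    # step 5: naive update -- try nudging each parameter both ways, keep whatever helps
    for delta in (nudge, -nudge):
        if total_error(w + delta, b) < total_error(w, b):
            w += delta
        if total_error(w, b + delta) < total_error(w, b):
            b += delta
    # ... back to step 2 and repeat

print(round(w, 2), round(b, 2))        # ends up near the best-fit line w = 2, b = 1
```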
🤘 that is what "learning" in SL is about
what about "supervised" in SL then?
easy, it means learning/training using data that have "correct answers", aka labeled outputs
wait, can learning be done without "correct answers"?
introducing unsupervised learning (the one that uses unlabeled data, stay tuned!) or self-supervised learning (cool kids use this term)
think about how human and animal babies learn, it is amazing...
Connecting to what we talked about last week:
1. have an initially guessed model (often imperfect)
->
2. feed the input data into the (imperfect) model
->
3. get the (imperfect) model output
->
4. measure how wrong this output is compared to the correct answer (labeled output)
->
5. use the measurement to update the model
->
back to step 2 and repeat
🤩 The MLP we talked about last week corresponds to steps 1, 2 and 3 (recall the forward pass with randomly guessed numbers in the weight matrices and bias vectors?).
😎 It also corresponds to our first decision in the caveman example: choosing a linear function as the model.
🥸 It is another function after all, just a more complicated/expressive one.
That's quite a lot, congrats! 🎉
Now let's zoom into steps 4 and 5 😎:
- how to tweak the numbers in weight matrices and bias vectors according to the labeled data
so that the model is in a good weaving state.
Let's forget about neural networks and supervised learning for now.
on whiteboard:
let's start with a game! 🕹️
GAME SETTINGS 🎰
--1. environment: a little 2D creature living on a curvy terrain 🗻
--2. objective: find the valley 🔻
--3. player control: moving along the X axis, left or right (direction)? how far (step size)? 🕹️
--4. world mist: mostly unknown, except some very local information;
all we know is that we can feel the slope under our feet 🌫️
on the whiteboard:
start here,
question 1: shall I go left or right? 😈
on the whiteboard:
answer 1: go with the downward slope direction 😉
on the whiteboard:
question 2: what happens if we feel the slope is flat under our feet? 🧐
on the whiteboard:
answer 2: jackpot! 🥰
no slope means that we have reached the valley!!!
a gentle reminder: avoid being omniscient in this game; we know it is the valley not because we can see it being the lowest point from outside the game world
on the whiteboard:
question 3: back to the start, now we know how to decide the direction but how about our step size?
the dangerous situation with a big step size 🥾
on the whiteboard:
answer 3: a good strategy is that we should decrease our step size when the slope gets flatter
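Here is a little Python simulation of the game so far (a made-up 1D terrain of my own choosing): all the player gets is the slope under its feet, it always walks against the slope, and because the move is proportional to the slope, the steps naturally shrink as the ground flattens out:

```python
def height(x):
    return (x - 3) ** 2 + 1            # a made-up terrain whose valley is at x = 3

def slope(x, eps=1e-6):
    # "feel the slope under our feet": a numerical estimate of the derivative at x
    return (height(x + eps) - height(x - eps)) / (2 * eps)

x = -4.0                               # starting position on the terrain
step_size = 0.1

for _ in range(50):
    s = slope(x)
    x = x - step_size * s              # downhill = opposite sign of the slope;
                                       # flatter ground -> smaller slope -> smaller move

print(round(x, 3))                     # ends up very close to the valley at x = 3
```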
on whiteboard:
question 4: game level up! new terrain unlocked...
What are the flat-slope points in the new terrain? how can we know if we are at THE valley (if a flat slope is all we are looking for)??? 🥲
on the whiteboard:
answer 4 part one:
these are the global minimum, local minima, local maxima and saddle points.
on the whiteboard:
answer 4 part two: NO WE CAN'T 🥹
We can easily get trapped at local minima
on the whiteboard:
BONUS 💰 question 1: start here: is there any chance we end up at a local maximum?
hint: run simulations in your 🧠, follow the
"feel the slope -> decide the direction
-> pick a step size -> jump to the point
-> repeat"
process
on the whiteboard:
BONUS 💰 answer 1: barely possible
don't worry too much about local maxima
on the whiteboard:
BONUS 💰 question 2: start here: is there any chance we end up at the saddle point?
hint: run simulations in your 🧠, follow the
"feel the slope -> decide the direction
-> pick a step size -> jump to the point
-> repeat"
process
on the whiteboard:
BONUS 💰 answer 2: likely!!!
we could get trapped at the saddle point 🪤
what can we do?
larger step size helps us get carried over
MISSION ACCOMPLISHED ❤️‍🔥
wait how about training a neural network (tweaking its parameters)?
that's exactly how we train a neural network
Recall the SL training process:
1. have an initially guessed model (often imperfect)
->
2. feed the input data into the (imperfect) model
->
3. get the (imperfect) model output
->
4. measure how wrong this output is compared to the correct answer (labeled output)
->
5. use the measurement to update the model
->
back to step 2 and repeat
SAME SETTINGS 🎰
--1. curvy terrain 🗻: a loss function measuring the distance between the predicted output and the labeled output;
and we are moving in the space of model parameters
SAME SETTINGS 🎰
--2. objective 🔻: find the parameters that give the lowest loss, which corresponds to the valley on the loss function terrain
SAME SETTINGS 🎰
--3. player control with step size 🕹️: adjust the numbers in weight matrices and bias vectors by deciding how much to increase/decrease each parameter
SAME SETTINGS 🎰
--4. world mist 🌫️: we don't know which parameter values give the perfect solution,
but we can compute the gradient (the slope) given the current parameter values (under our feet)
SAME SETTINGS 🎰
--1. curvy terrain 🗻: a loss function measuring the distance between prediction and ground truth
--2. objective 🔻: navigate in the space of parameters and find the lowest-loss point
--3. player control 🕹️: adjust the numbers in weight matrices and bias vectors by deciding how much to increase/decrease each parameter
--4. world mist 🌫️: we don't know which parameter values would give the perfect solution,
but we can compute the gradient (the slope) given the current parameter values (under our feet)
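To see this "terrain" concretely for the caveman model, here is a small sketch (my own illustration) of such a loss function: the mean squared distance between predictions and labels, viewed as a function of the parameters w and b. Every (w, b) pair is a spot on the terrain, and its loss is the height there:

```python
data = [(0.5, 2), (2, 5), (1, 3)]      # (potatoes, apples)

def loss(w, b):
    # mean squared distance between predicted output (w*x + b) and labeled output y
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

print(loss(0.0, 0.0))    # a bad guess -> high up on the terrain (about 12.7)
print(loss(2.0, 1.0))    # the best-fit line -> the bottom of the valley, loss = 0.0
```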
SAME TECHNIQUES
--1. use the slope (gradient) direction to infer which direction to adjust each parameter in
gradient: derivative, aka the slope
direction really just refers to the plain binary choice of "increase or decrease / + or -"
SAME TECHNIQUES --2. also use the slope (gradient) value as an indicator of how close we are to a potential valley
SAME TECHNIQUES
--3. also use the slope (gradient) value to determine the step size of the adjustment
step size: learning rate
SAME TECHNIQUES
--1. use the slope (gradient) direction to infer which direction to adjust each parameter in
direction really just refers to the plain binary choice of "increase or decrease / + or -"
--2. also use the slope (gradient) value as an indicator of how close we are to a potential valley
--3. also use the slope (gradient) value to determine the step size of the adjustment
step size: learning rate
SAME FINDINGS IMPLIED
-- we mostly end up at local minima
-- reaching the global minimum is practically impossible
-- no need to worry about local maxima
-- extra caution for saddle points (use a larger step size)
RECAP 1️⃣
numberify and rephrase the neural network parameter tweaking process:
1. the goal is to minimize the loss (cost) function by adjusting parameter numbers
RECAP 2️⃣
2. after numberifying the training process, we can then apply some math trick called gradient descent
-- find the steepest decreasing direction
-- one gradient for one parameter
RECAP 3️⃣
3. we multiply the (negative) gradients by some learning rate to decide the parameter adjustment values
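Putting recaps 1, 2 and 3 together in code (again a sketch for the caveman model with the mean-squared-error loss from before, not a neural network yet): compute one gradient per parameter, then adjust each parameter by minus its gradient times a learning rate, and repeat:

```python
data = [(0.5, 2), (2, 5), (1, 3)]

w, b = 0.0, 0.0               # initially guessed parameters
learning_rate = 0.1

for _ in range(1000):
    # one gradient per parameter (derivatives of the mean-squared-error loss)
    grad_w = sum(2 * x * (w * x + b - y) for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # adjust each parameter by (minus gradient) * learning rate
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))   # converges to the best-fit w = 2.0, b = 1.0
```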
DONE 🎉
let's watch some of this video together
to verify our intuitions
and to connect them with the practical process
maybe this as well
well done everyone 🎉
we have gone through MSc-level content
two more jargon terms unlocked:
backpropagation: a scheme for calculating the gradients
optimizer: conventionally in Python DL libraries, all this backprop/GD stuff is handled by an object called an "optimizer"
That's quite a lot, congrats! 🎉
Next, we are going to:
- take a look at how training an MLP on the Fashion-MNIST dataset is implemented in Python with help from NumPy and TensorFlow (a very popular deep learning library in Python)!
Alert: you are going to see quite advanced Python and neural network programming stuff; we are not expected to understand it all at the moment.
Let's take a look at how some ideas we talked about today are reflected in the code,
especially how we choose a model and how we fit the model to the dataset by setting up the loss function and optimizer.
- It is just a one-liner really...
everything is prepared here
Let's take a look at the notebook!

- 1. Make sure you have saved a copy to your GDrive or opened it in playground mode. 🎉
- 2. Most parts are beyond the content we have covered so far.
- 3. We only need to take a look at a few lines in the "Build the Model" and "Feed the Model" sections.
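If you want a taste before opening the notebook, the key lines look roughly like this (an illustrative sketch built from standard tf.keras calls, not copied from the notebook): we choose a model (an MLP), then fit it to the labeled dataset by naming a loss function and an optimizer; the optimizer is the object that does the gradient-descent parameter tweaking for us.

```python
import tensorflow as tf

# the labeled dataset: images (inputs) and class labels (correct answers)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train / 255.0                             # scale greyscale values to [0, 1]

# "Build the Model": an MLP with one hidden layer
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # input layer: 784 pixel values
    tf.keras.layers.Dense(128, activation="relu"),    # hidden layer
    tf.keras.layers.Dense(10),                        # output layer: one neuron per class
])

# the loss function (the terrain) + the optimizer (the valley-finding player)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

# "Feed the Model": forward pass, loss, backprop, parameter update -- all handled for us
model.fit(x_train, y_train, epochs=5)
```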
Today we have looked at:
- The supervised learning process: selecting a parametrised model and training it to fit a labeled dataset
- The training (parameter tweaking) process is done via a technique called gradient descent, with the gradients computed by backpropagation
- Intuitions on gradient descent: feel the slope and find the valley
We'll bring swift and fun applications back next week, see you next Thursday same time and same place!