ML One
Lecture 08
Introduction to supervised learning
+
How does AI learn? - Intuitions on gradient descent
Welcome!
By the end of this lecture, we'll have learnt about:
The theoretical:
- what is supervised learning: the framework of training
- how to train a neural network on a labeled dataset
- aka how to find the right numbers to put in the weight matrices and bias vectors
- Intuitions on gradient descent
The practical:
- an MLP implemented in Python
First of all, don't forget to confirm your attendance on
Seats App!
Recap
Last lecture we saw how the dots (adding/multiplying matrices, chaining functions) are connected!!!
Biological neurons: receive charges, accumulate charges, fire off charges
Biological neurons
-- A neuron is connected with some other neurons.
-- A neuron is charged by other connected neurons.
-- There are usually different levels of charges emitted from different neurons.
Biological neurons (continued)
-- The received charges within one neuron are accumulated.
-- We can refer to the level of accumulated charges in one neuron as its activation value.
-- Once a neuron is sufficiently charged, it fires off a charge to the next neurons.
Biological neurons (continued)
-- This "charging, thresholding and firing" neural process is the backbone of taking sensory input, processing it and conducting motor output.
Artificial neurons
- Think of a neuron as a number container that holds a single number, its activation.
- Neurons are grouped in layers.
- Neurons in different layers are connected hierarchically (usually from left to right).
- There are different weights assigned to each link between two connected neurons.
Layers in MLP
- Put each layer's activations into a vector: a layer activation vector
- Input layer: each neuron loads one value from the input (e.g. one pixel's greyscale value in the MNIST dataset)
- Output layer: for an image classification task, one neuron represents the probability this neural net assigns to one class
- Hidden layer: any layer between the input and output layers; its size (number of neurons) is user-specified
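To make the layer sizes concrete, here is a minimal Python/NumPy sketch (assuming the MNIST-style setup from last week: 28x28 greyscale images and 10 digit classes; the hidden layer size of 64 is just our own arbitrary choice):

import numpy as np

# Input layer: one neuron per pixel -> a 28*28 = 784-dimensional activation vector
image = np.random.rand(28, 28)   # stand-in for one greyscale image
v_input = image.flatten()        # shape: (784,)

hidden_size = 64                 # hidden layer size is user-specified (arbitrary choice here)
num_classes = 10                 # output layer: one neuron per class

print(v_input.shape, hidden_size, num_classes)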
The computational simulation of the "charging, thresholding and firing" neural process
- The computation runs from left to right (a forward pass).
- The computation calculates each layer activation vector one by one, from left to right.
- Overall, the computation takes the input layer activation vector as input and returns the output layer activation vector, which ideally represents the correct answer.
The computational simulation of the "charging, thresholding and firing" neural process
-- Zoom in to the function between every pair of consecutive layers:
--- Multiply the previous layer's activation vector with a weights matrix: "charging"
--- Add a bias vector and feed the result into an activation function (e.g. ReLU): "thresholding"
--- The result is this layer's activation vector.
--- We then feed this layer's activation vector as the input to the next layer, until it reaches the last layer (the output layer): "firing"
one layer function:
V_output =
ReLU(WeightsMat ยท V_input + Bias)
the big function for an MLP with one hidden layer (chaining each layer function together):
V_output =
ReLU(WeightsMat_o ยท ReLU(WeightsMat_h ยท V_input + Bias_h) + Bias_o)
V_input: input layer's activation vector
V_output: output layer's activation vector
WeightsMat_h, Bias_h: weights matrix and bias vector of the hidden layer
WeightsMat_o, Bias_o: weights matrix and bias vector of the output layer
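To connect these formulas to code, here is a minimal NumPy sketch of the forward pass; the layer sizes (784 inputs, 64 hidden neurons, 10 outputs) are assumed MNIST-style values, and the weight matrices and bias vectors hold randomly guessed numbers for now:

import numpy as np

def relu(x):
    return np.maximum(0, x)

# one layer function: V_output = ReLU(WeightsMat . V_input + Bias)
def layer(v_in, weights, bias):
    return relu(weights @ v_in + bias)

rng = np.random.default_rng(0)
W_h, b_h = rng.standard_normal((64, 784)), rng.standard_normal(64)   # hidden layer parameters
W_o, b_o = rng.standard_normal((10, 64)), rng.standard_normal(10)    # output layer parameters

v_input = rng.random(784)            # stand-in for one flattened input image

# the big function: chain the layer functions together (a forward pass, left to right)
v_hidden = layer(v_input, W_h, b_h)
v_output = layer(v_hidden, W_o, b_o)
print(v_output.shape)                # (10,)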
Recall the amazing perceptual adaptation effect: repeated exposure makes things easier and more familiar!
Wait, but what numbers shall we put in the weights matrices and bias vectors?
end of recap
Let's forget about neural networks for now and zoom out a little bit.
- Supervised (machine) learning (SL) is a subcategory of machine learning.
- Supervised learning involves training a model on a labeled dataset, where the algorithm learns the mapping from input data to corresponding output labels.
keywords: model, labeled dataset
Connected to what we have talked about:
- The handwritten digit recognition task on the MNIST dataset we introduced last week is a supervised learning task.
- Neural networks (e.g. the MLP we introduced last week) are models (big functions with weight matrices and bias vectors).
- The missing pieces: what does it mean to train a model? how do we train a model using a labeled dataset?
To be demystified today!
Let's consider a much easier labeled dataset and a much simpler model.
hi I'm a happy caveman,
I eat potatoes 🥔,
and I collect apples 🍎
can I have a model that predicts the number of 🍎 I can collect given the number of 🥔 I eat today?
This is my labeled dataset at hand:
I ate 0.5 🥔 and collected 2 🍎 on Monday
I ate 2 🥔 and collected 5 🍎 on Tuesday
I ate 1 🥔 and collected 3 🍎 on Wednesday
the input: number of 🥔 I ate
the labeled output: number of 🍎 I collected
the task: to learn a model from the dataset, i.e. a function that can predict the number of 🍎 I collect (the output)
given the number of 🥔 I eat (the input)
Now I'm puzzled...
Which function shall I choose?
Numbers are a headache to look at, so I want to plot them out and look at the relation between 🥔 and 🍎 visually!
an easy task: if today I ate 1 🥔, make an educated guess on how many 🍎 I can collect today
a harder task: if today I ate 1.5 🥔, make an educated guess on how many 🍎 I can collect today
Looking at the dataset, unfortunately we don't have a data point that corresponds to 1.5 🥔
What if... there were a continuous line going through all the points, so that I could look up the point corresponding to 1.5 🥔 from this line?
Back to the question:
Which function shall I choose?
- By looking at the dataset, a linear function (whose graph representation is a straight line) looks like a good fit.
- a linear function: f(x) = wx + b, where w is the weight and b is the bias
- But there are infinitely many straight lines I can draw on the canvas! Which one is THE best?
But there are infinitely many straight lines I can draw on the canvas! Which one is THE best model?
- BTW, the infinitely many straight lines can be parametrised by weight and bias according to this expression: f(x) = wx + b, where w is the weight and b is the bias
- Being parametrised means that we can tweak the values of the weight and bias to create all the infinitely many straight lines, where each line is determined by a unique combination of weight and bias.
Which one is THE best straight line?
- The line that "weaves" through all the labeled data points! It means that this line perfectly captures the relation between the number of 🥔 and the number of 🍎 from my existing experience (the labeled dataset)
- And we can hand calculate THE weight and bias for this best-fit line easily (see the sketch below).
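Here is a tiny NumPy sketch of that hand calculation: fitting f(x) = wx + b to the caveman dataset by least squares (for these three points the best-fit line happens to pass through all of them exactly, with w = 2 and b = 1):

import numpy as np

potatoes = np.array([0.5, 2.0, 1.0])   # input: number of 🥔 eaten
apples   = np.array([2.0, 5.0, 3.0])   # labeled output: number of 🍎 collected

# fit f(x) = w*x + b by least squares (a degree-1 polynomial fit)
w, b = np.polyfit(potatoes, apples, 1)
print(w, b)            # roughly 2.0 and 1.0

# answer the harder task: how many 🍎 for 1.5 🥔?
print(w * 1.5 + b)     # roughly 4.0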
SOLVED!!! Happy caveman life!
- Note that we have made two choices for this SL task: we chose to use a straight line (a linear function) for weaving through the labeled dataset,
- and we chose the best-fit weight and bias for the linear function.
Caution: this is an overly simplified example where it seems trivial to find a model that has the perfect-fit shape (a straight line) and the perfect parameters for weaving through all data points.
an example of real life data on the whiteboard...
In real life, data is a lot messier and does not fall onto a straight line
For real life data,
to find a good-fit model/function,
we can't just look at data and draw a nice weaving curve,
we can only start with an initially guessed function and tweak its parameters till it is in a good weaving position.
That's quite a lot, congrats!
The supervised learning training process (a summary of what we just talked about)
1. have an initially guessed model (often imperfect)
->
2. feed the input data into the (imperfect) model
->
3. get the (imperfect) model output
->
4. measure how wrong this output is compared to the correct answer (labeled output)
->
5. use the measurement to update the model
->
back to step 2 and repeat
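In code form, the loop looks roughly like the following Python sketch; predict, loss_fn and update are hypothetical placeholders passed in as arguments, standing in for whatever model and update rule we choose, not a specific library API:

# a schematic sketch of the supervised learning training loop (placeholders, not a real API)
def train(params, dataset, predict, loss_fn, update, num_steps, learning_rate):
    # step 1: params start as an initial guess (an imperfect model)
    for _ in range(num_steps):
        for x, y_true in dataset:                         # labeled data: input and correct answer
            y_pred = predict(params, x)                   # steps 2-3: feed input in, get (imperfect) output
            loss = loss_fn(y_pred, y_true)                # step 4: measure how wrong the output is
            params = update(params, loss, learning_rate)  # step 5: use the measurement to update the model
        # back to step 2 and repeat
    return params

How to do step 5 (the update) is exactly what the rest of the lecture demystifies.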
Let me demonstrate the training process for the naive caveman 🥔 regression task on the whiteboard.
that is what "learning" in SL is about
what about "supervised" in SL then?
easy, it means learning/training using data that have "correct answers", aka labeled output
wait, can learning be done without "correct answers"?
introducing unsupervised learning (the one that uses unlabelled data, stay tuned!) and self-supervised learning (cool kids use this term)
think about how human and animal babies learn, it is amazing...
Connecting to what we talked about last week:
1. have an initially guessed model (often imperfect)
->
2. feed the input data into the (imperfect) model
->
3. get the (imperfect) model output
->
4. measure how wrong this output is compared to the correct answer (labeled output)
->
5. use the measurement to update the model
->
back to step 2 and repeat
The MLP we talked about last week corresponds to steps 1, 2 and 3 (recall the forward pass with randomly guessed numbers in the weight matrices and bias vectors?).
It also corresponds to our first decision of choosing a linear function as the model in the caveman example.
It is another function after all, just a more complicated/expressive one.
That's quite a lot, congrats!
Now let's zoom into steps 4 and 5:
- how to tweak the numbers in weight matrices and bias vectors according to the labeled data
so that the model is in a good weaving state.
Let's forget about neural networks and supervised learning for now.
on whiteboard:
let's start with a game!
GAME SETTINGS
--1. environment: a little 2D creature living on a curvy terrain
--2. objective: find the valley
--3. player control: moving along the X axis, left or right (direction)? how far (step size)?
--4. world mist: the terrain is mostly unknown, except for some very local information:
all we know is that we can feel the slope under our feet
on the whiteboard:
start here,
question 1: shall I go left or right?
on the whiteboard:
answer 1: go with the downward slope direction
on the whiteboard:
question 2: what happens if we feel the slope is flat under our feet?
on the whiteboard:
answer 2: jackpot!
no slope means that we have reached the valley!!!
a gentle reminder: avoid being omniscient in this game; we know it is the valley not because we can see it being the lowest point from outside the game world
on the whiteboard:
question 3: back to the start, now we know how to decide the direction but how about our step size?
the dangerous situation with a big step size
on the whiteboard:
answer 3: a good strategy is that we should decrease our step size when the slope gets flatter
on whiteboard:
question 4: game level up! new terrain unlocked...
What are the flat-slope points in the new terrain? How can we know if we are at THE valley (if a flat slope is all we are looking for)?
on the whiteboard:
answer 4 part one:
these are the global minimum, local minima, local maxima and saddle points.
on the whiteboard:
answer 4 part two: NO WE CAN'T
We can easily get trapped at a local minimum
on the whiteboard:
BONUS question 1: start here; is there any chance we end up at a local maximum?
hint: run simulations in your head, follow the
"feel the slope -> decide the direction
-> pick a step size -> jump to the point
-> repeat"
process
on the whiteboard:
BONUS answer 1: barely possible
don't worry too much about the local maxima
on the whiteboard:
BONUS question 2: start here; is there any chance we end up at the saddle point?
hint: run simulations in your head, follow the
"feel the slope -> decide the direction
-> pick a step size -> jump to the point
-> repeat"
process
on the whiteboard:
BONUS answer 2: likely!!!
we could get trapped at the saddle point
what can we do?
a larger step size helps us get carried over
MISSION ACCOMPLISHED
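Here is a minimal Python sketch of the game we just played, on a made-up terrain f(x) = (x - 3)^2 + 1 whose valley sits at x = 3: feel the slope (the derivative) under our feet, step in the downhill direction, and let the step shrink as the slope flattens.

def terrain(x):
    # a made-up curvy terrain with its valley at x = 3
    return (x - 3) ** 2 + 1

def slope(x):
    # the slope we "feel under our feet" (the derivative of the terrain)
    return 2 * (x - 3)

x = -4.0          # starting position
step_size = 0.1   # how far we move relative to the slope

for _ in range(100):
    x = x - step_size * slope(x)   # go with the downward slope direction

print(x, terrain(x))   # x ends up close to 3.0, the valley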
wait how about training a neural network (tweaking its parameters)?
that's exactly how we train a neural network
Recall the SL training process:
1. have an initially guessed model (often imperfect)
->
2. feed the input data into the (imperfect) model
->
3. get the (imperfect) model output
->
4. measure how wrong this output is compared to the correct answer (labeled output)
->
5. use the measurement to update the model
->
back to step 2 and repeat
SAME SETTINGS
--1. curvy terrain: a loss function measuring the distance between the predicted output and the labeled output,
and we are moving in the space of model parameters
--2. objective: find the parameters that give the lowest loss, which corresponds to the valley on the loss function terrain
--3. player control with step size: adjust the numbers in the weight matrices and bias vectors by deciding how much to increase/decrease each parameter
--4. world mist: we are agnostic of what parameter values would give the perfect solution,
but we can compute the gradient (the slope) given the current parameter values (under our feet)
SAME TECHNIQUES
--1. use the slope (gradient) direction to infer which direction to adjust each parameter in
gradient: the derivative, aka the slope
direction really just refers to the plain binary choice of "increase or decrease / + or -"
--2. also use the slope (gradient) value as an indicator of how close we are to a potential valley
--3. also use the slope (gradient) value to determine the step size of the adjustment
step size: learning rate
SAME FINDINGS IMPLIED
-- we mostly end up at local minima
-- the global minimum is practically impossible to guarantee
-- no need to worry about local maxima
-- extra caution for saddle points (use a larger step size)
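To tie the game back to training, here is a small NumPy sketch that trains the caveman linear model f(x) = wx + b by gradient descent on a mean squared error loss; the gradients are written out by hand for this tiny model (for a neural network a library would compute them via backpropagation), and the learning rate of 0.1 and the step count are arbitrary choices.

import numpy as np

potatoes = np.array([0.5, 2.0, 1.0])   # inputs
apples   = np.array([2.0, 5.0, 3.0])   # labeled outputs

w, b = 0.0, 0.0        # initially guessed (imperfect) parameters
learning_rate = 0.1    # the step size

for step in range(2000):
    pred = w * potatoes + b                   # forward pass of the model
    error = pred - apples
    loss = np.mean(error ** 2)                # the terrain height: how wrong we are
    grad_w = np.mean(2 * error * potatoes)    # slope of the loss with respect to w
    grad_b = np.mean(2 * error)               # slope of the loss with respect to b
    w -= learning_rate * grad_w               # step downhill in parameter space
    b -= learning_rate * grad_b

print(w, b)   # approaches the best-fit values, roughly 2.0 and 1.0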
RECAP 1
numberify and rephrase the neural network parameter tweaking process:
1. the goal is to minimize the loss (cost) function by adjusting parameter numbers
RECAP 2
2. after numberifying the training process, we can then apply some math trick called gradient descent
-- find the steepest decreasing direction
-- one gradient for one parameter
RECAP 3
3. we multiply the (minus) gradients with some learning rate to decide the parameter adjustment values
DONE
let's watch some of
this video together
to verify our intuitions
and to connect them with the practical process
well done everyone
we have gone through MSc-level content
two more jargon terms unlocked:
backpropagation: a scheme for calculating gradients
optimizer: conventionally, in Python DL libraries all this backprop/GD stuff is handled by an object called an "optimizer"
That's quite a lot, congrats!
Next, we are going to:
- take a look at how training an MLP on the Fashion-MNIST dataset is implemented in Python with help from NumPy and TensorFlow (a very popular deep learning library in Python)!
Alert: you are going to see quite advanced Python and neural network programming; we are not expected to understand it all at the moment.
Let's take a look at how some ideas we talked about today are reflected in the code,
especially how we choose a model and how we fit the model to the dataset by setting up the loss function and optimizer.
- It is just a one-liner really...
everything is prepared here
Let's take a look at the notebook!
- 1. Make sure you have saved a copy to your GDrive or opened it in playground mode.
- 2. Most parts are beyond the content we have covered so far.
- 3. We only need to look at a few lines in the "Build the Model" and "Feed the Model" sections.
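For orientation, the key lines in the "Build the Model" and "Feed the Model" sections typically look something like the Keras sketch below; the exact layer sizes, loss and number of epochs in our notebook may differ, so treat this as an illustrative guess rather than the notebook verbatim. Choosing the model corresponds to defining the layers; fitting it corresponds to picking a loss function and an optimizer and calling fit.

import tensorflow as tf

# load the Fashion-MNIST dataset (images and their labels)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train / 255.0   # scale pixel values to [0, 1]

# "Build the Model": choose the model, here an MLP with one hidden layer
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # input layer: one neuron per pixel
    tf.keras.layers.Dense(128, activation="relu"),   # hidden layer (size is our choice)
    tf.keras.layers.Dense(10),                       # output layer: one neuron per class
])

# set up the loss function (how wrong are we) and the optimizer (the GD/backprop machinery)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# "Feed the Model": fit the parameters to the labeled dataset
model.fit(x_train, y_train, epochs=5)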
Today we have looked at:
- The supervised learning process: select a parametrised model and train it to fit a labeled dataset
- The training (parameter tweaking) process is done by a technique called gradient descent, with gradients calculated via backpropagation
- Intuitions on gradient descent: feel the slope and find the valley
We'll bring swift and fun applications back next week, see you next Thursday same time and same place!