Neural Network with One Parameter – We say activation functions learn decision boundaries, but what do we mean by learning? In the context of neural networks, learning means adjusting the network’s parameters so that it classifies data into the correct categories and makes accurate predictions. In this post, you will learn how we can train a neural network to learn those parameters.
Let’s get started.
Overview
- How do we train a neural network to learn just one parameter, such as a weight?
- Steps for training a neural network
- Neural Network example
1 – How to Train a Neural Network to Learn One Parameter?
Activation functions make learning possible by introducing non-linearities into the network’s computations, enabling it to capture complex relationships and patterns in the data. Learning itself involves iteratively updating the network’s parameters during training. Let’s first understand what those parameters are.
For the sake of simplicity, let’s take only one parameter of the neural network, i.e., a single weight, and see how the model learns it.
Here’s an example of a simple neural network with three inputs and one neuron with a non-linear activation function followed by an output layer.
So, how do we train the model to learn these weights?
Training a neural network is an optimization problem!
2 – What is an Optimization Problem?
Training a neural network is an optimization problem, where the objective is to minimize the predefined loss function.
Loss Function – Tells us how well our model is doing on training data. During training, the network is presented with training examples, and its predictions are compared to the actual target values. The difference between the predicted and actual values is used to compute the loss.
If you want to learn more about different types of loss functions for regression and classification problems, click on the post below:
Loss Functions for Regression and Classification in Deep Learning
3 – Steps for Training a Neural Network
Here’s how we train a model that has a single weight to learn:
i. Defining a Loss Function:
We define a loss function that tells us how well the model is doing. In this post, we are working with the Mean Squared Error (MSE) loss function.
MSE = (y_i - \hat{y}_i)^2 = (xw - y)^2
where y_i is the true value for the i^{th} sample and \hat{y}_i = xw is the predicted value for the i^{th} sample. Strictly speaking, MSE averages this squared error over all n samples; since we update on one sample at a time in this post, we work with the per-sample error.
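To make this concrete, here’s a minimal Python sketch of the per-sample squared error for the linear model y = xw (the helper names predict and squared_error are our own, not from any library):

```python
def predict(x, w):
    # Linear model with a single weight and no bias: y_hat = x * w
    return x * w

def squared_error(x, y, w):
    # Per-sample squared error: (y_hat - y)^2
    return (predict(x, w) - y) ** 2

print(squared_error(x=2, y=4, w=0.5))  # (1 - 4)^2 = 9.0
```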
ii. Calculate the derivative of error w.r.t. weights:
\frac{\partial E}{\partial w}
where E represents the error function and w denotes the weight.
iii. Using the derivative to define an update rule
w \leftarrow w - \alpha \frac{\partial E}{\partial w}
where w is the current weight, \alpha is the learning rate that specifies the magnitude of update at every iteration, \frac{\partial E}{\partial w} is the derivative of the error function E with respect to the weight w.
The update rule incrementally moves the weight in the direction opposite to the gradient, which decreases the loss.
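Before moving to the example, here’s a minimal Python sketch of one update step under these definitions; it uses the MSE gradient 2x(xw - y), which we derive in the example below (the helper names gradient and update_weight are our own):

```python
def gradient(x, y, w):
    # dE/dw for E = (x*w - y)^2, obtained via the chain rule: 2x(xw - y)
    return 2 * x * (x * w - y)

def update_weight(w, x, y, lr):
    # Gradient descent update rule: w <- w - alpha * dE/dw
    return w - lr * gradient(x, y, w)
```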
Let’s now train a simple neural network and see how we can optimize the weight (parameter) to minimize the loss function.
4 – Training a Neural Network with One Parameter – Example
Take an example of data points plotted on (x, y) coordinates, such as (1, 2), (2, 4), (3, 6), and (4, 8). We have to learn a function that could generate similar data points. Here’s what it looks like:
Here’s how we would calculate the value of y, assuming the weight is randomly initialized to w = 0.5:
y = xw
Putting the data point (x, y) = (2, 4) into the equation for y:
y = (2)(0.5) = 1 for x=2
But the actual (target) output for x=2 is 4, as you can see in the plot above. So, how to fix this? How to tell the model to update weights in the right direction?
By using the loss function
Let’s consider MSE loss function in this example:
MSE = (y_i - \hat{y}_i)^2 = (xw - y)^2
MSE = ((2)(0.5) - (4))^2 = (2 * 0.5 - 4)^2 = (-3)^2 = 9
Let’s now calculate the derivative of the error with respect to the weight. Applying the chain rule to E = (xw - y)^2:
\frac{\partial E}{\partial w} = 2x(xw - y)
\frac{\partial E}{\partial w} = 2 * 2 * (2 * 0.5 - 4) = -12
Now, calculate the update rule for weight w, assuming \alpha = 0.1 :
w \leftarrow w - \alpha \frac{\partial E}{\partial w}
w \leftarrow 0.5 - 0.1 * 2 * 2 * (2 * 0.5 - 4) = 0.5 - 0.1 * (-12) = 1.7
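As a quick sanity check, this self-contained snippet reproduces the update above:

```python
w, lr, x, y = 0.5, 0.1, 2, 4
grad = 2 * x * (x * w - y)  # 2 * 2 * (2 * 0.5 - 4) = -12
w = w - lr * grad           # 0.5 - 0.1 * (-12)
print(w)                    # 1.7
```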
Iterate Over Multiple Data Samples:
This is just one data sample. Here’s how you do this for multiple data samples:
| Data sample | Weight | Update rule |
| --- | --- | --- |
| (2, 4) | w = 0.5 | w \leftarrow 0.5 - 0.1 * 2 * 2 * (2 * 0.5 - 4) = 1.7 |
| (1, 2) | w = 1.7 | w \leftarrow 1.7 - 0.1 * 2 * 1 * (1 * 1.7 - 2) = 1.76 |
| (3, 6) | w = 1.76 | w \leftarrow 1.76 - 0.1 * 2 * 3 * (3 * 1.76 - 6) = 2.192 |
| (4, 8) | w = 2.192 | w \leftarrow 2.192 - 0.1 * 2 * 4 * (4 * 2.192 - 8) = 1.5776 |
You repeat this process over the data points until the loss converges to an acceptable value (for this convex, one-parameter problem, that is the global minimum). This optimization algorithm is called stochastic gradient descent (SGD).
Training a model this way updates the parameters in the direction that reduces the loss function. Stochastic gradient descent optimization thus enables the neural network to learn from the training data and improve its performance over time.
The derivative tells us the slope of the loss function at a given point. It helps in determining whether the weight needs to be decreased or increased to reach the minimum cost.
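Putting the steps together, here’s a short sketch of the full stochastic gradient descent loop for this example (the epoch count and variable names are our own illustrative choices):

```python
# One-parameter SGD on the model y = x * w with per-sample squared error
data = [(2, 4), (1, 2), (3, 6), (4, 8)]  # (x, y) samples from the plot
w, lr = 0.5, 0.1                         # initial weight and learning rate

for epoch in range(20):                  # repeat passes over the data
    for x, y in data:
        grad = 2 * x * (x * w - y)       # dE/dw for E = (x*w - y)^2
        w = w - lr * grad                # update rule
print(w)  # approaches 2.0, the slope that generated the data
```

The first pass over the four samples leaves w at 1.5776, matching the last row of the table; repeated passes drive it towards 2.0.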
Here’s a figure that summarizes each step:
5 – Ways to Iterate Over Data Samples
Here are some terminologies that you need to be familiar with:
- Online Learning: Iterate over the samples one by one. Training models this way can lead to noisy weight updates and may slow down convergence. The optimization algorithm we use with this is stochastic gradient descent.
- Batch Learning: Use all data points at once, and average the loss over all data points at every iteration. The optimization algorithm we use with this is gradient descent. However, in many applications this approach is not feasible, since the dataset is too big to fit into memory. Even if the entire dataset fits in memory, it may not be useful to use all of the data at every step.
- Mini-batch Learning: Pick a mini-batch consisting of a number of samples and average the loss over the samples in the mini-batch at every iteration. The number of samples in a mini-batch is called the ‘batch size’. See the sketch after this list.
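Here’s a minimal sketch of the mini-batch strategy for the same one-parameter example (the batch size, learning rate, and step count are our own illustrative choices):

```python
import random

data = [(2, 4), (1, 2), (3, 6), (4, 8)]
w, lr, batch_size = 0.5, 0.05, 2

for step in range(200):
    batch = random.sample(data, batch_size)  # draw a mini-batch
    # Average the gradient of E = (x*w - y)^2 over the mini-batch
    grad = sum(2 * x * (x * w - y) for x, y in batch) / batch_size
    w = w - lr * grad
print(w)  # approaches 2.0
```

Setting batch_size = 1 recovers online learning (stochastic gradient descent), while batch_size = len(data) recovers full-batch gradient descent.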
We now know that learning a single parameter, i.e., the weight in this example, is pretty straightforward, but what do we do when we have many parameters to learn?
In the next tutorial, you will learn how training deep neural networks differs from training shallow ones and develop an intuition for how we can train these models:
How to Train a Neural Network with Multiple Parameters
Summary
In this post, you learned:
- The basic steps of training a neural network are: defining a loss function to calculate the error, taking the derivative of the error with respect to the weights, and using that derivative in an update rule that adjusts the weights to minimize the loss.
- For simplicity, we omitted the bias and the non-linear activation in this example.
- The derivative tells us the slope of the loss function at a given point.
Related Articles
- How to Choose the Best Activation Functions for Hidden Layers and Output Layers in Deep Learning
- Understanding Optimization Algorithms In Deep Learning
- Understanding Linear and Non-linear Activation Functions in Deep Learning
Related Videos
- Deep Learning crash course – Leo Isikdogan