Our human brain is the most complex organ in the entire human body. It contains roughly 86 billion neurons, each of which is linked to the others through 'synaptic connections' in a total of 100-500 trillion connections. We as humans don't even fully understand how our own brain works, has consciousness and is able to store memories.
The idea of modern neural networks was an attempt to recreate the intelligence of our brains using a simplified mathematical model. This enables them to learn patterns very quickly and generate quite accurate results, unlike previous techniques like SVMs, and they have since been used in large-scale models like GPT or Stable Diffusion.
Neural networks have been around since the 1940s, but they only started to gain serious popularity in the early 2010s because of the availability of more powerful hardware like GPUs.
Every large model you see online is built with neurons arranged in different ways, and all models learn using the same underlying algorithm, which we are going to explore today: backpropagation.
We're going to build a neural network from the ground up with just NumPy. To really get the most out of this blog, I expect you to have a basic understanding of differential calculus, some Python fundamentals and some basic experience with NumPy, though I have revisited a few concepts in this blog.
I would like to be clear that this work is NOT a recreation of any online content that you would see, such as Andrej Karpathy's Zero to Hero series or Andrew Ng's Deep Learning Specialisation. Though I have taken inspiration from various sources, I have built my own understanding and this is my own independent work built upon my knowledge.
This is going to be a rather long and technical blog. To keep things simple, I'm going to skip over a lot of concepts you would usually learn in a deep learning course, such as Dropout, Early Stopping, Learning rate schedulers or Quantisation.
A Neuron is the fundamental unit of computation of Neural Networks. They're also called perceptrons or units. Each neuron takes a set of inputs, associates a 'weight' corresponding to those inputs as well as a 'bias ', and returns a dot product of the inputs and weights added with the bias. This dot product is further passed into an 'activation function ', which is the final neuron output.
Weights and biases are just numerical values which signify the strength of an input or a connection. Together, they're called the 'parameters' of the model.
Production frameworks such as PyTorch further break neurons down to n-dimensional tensors and provide a Python object to experiment, train and build larger models. These tensors are also pre-arranged in various shapes such as Convolutional layers, Embeddings or Linear layers.
For the scope of our blog, we're going to assume the smallest unit of a Neural network is a neuron, and all the inputs, weights and biases are stored as single values or 1-D scalars.
Let's take a neuron which takes 3 inputs, has 3 weights and a bias and returns 1 output. Here's a diagram of how this example neuron would look:
The above image can be represented mathematically as follows:
where:
In any notebook environment, we can create a Python 'Neuron' class using NumPy:
import numpy as np
class Neuron:
def __init__(self, inputs=[], act='relu'):
self.input_sources = inputs
self.inputs = np.array([n() if callable(n) else n for n in inputs])
self.weights = np.random.random(len(self.inputs))
self.bias = np.random.random()
self.act = act
def set_input(self,new_inputs):
# set new inputs to the neuron if we want to update.
self.input_sources = new_inputs
def return_dot(self):
# Compute the summation of products of inputs and weights whenever it's called as a function
dot_prod = np.dot(self.inputs, self.weights) + self.bias
return dot_prod
def __call__(self):
f_x = self.return_dot()
n_x = 0
if self.act == 'relu':
n_x = np.maximum(0,f_x)
elif self.act == 'tanh':
n_x = np.tanh(f_x)
elif self.act == 'linear':
pass
return n_x
This enables us to instantiate a Neuron class with an array of inputs which can either be floating point numbers or other neuron instances. We're storing two input arrays: input_sources which just store the inputs as they are passed and inputs array which are the numerical values computed from previous neuron inputs.
n = Neuron(inputs=[0,1,2])
n1 = Neuron(inputs=[n], act='relu') # n1 has the input n
In the input_sources array, we have information about which neurons the current neuron receives input from. The instance stores only the preceding neurons, not the succeeding ones, to efficiently run the backpropagation algorithm, which we will explore later.
Activation functions are some functions attached to the end of a neuron which determine whether the output should be 'activated' or not depending on the input from the dot product. By introducing non-linearity through activation functions, neural networks can learn and map complex patterns.
Without activation functions, neural networks are just a large-scale linear combination of inputs. This means the model would never be able to learn nuances in images, videos or natural language.
Some of the most common activation functions, their derivatives, and plots are presented below. The blue line is the original function and the orange line is the derivative of the function.
ReLU: Rectified Linear Unit
TanH: Hyperbolic Tangent
Sigmoid: Sigmoid function
Looking at the plot of derivatives, we can see how the function behaves as the input changes. The differential also tells us the 'slope' of a curve, which means how much the function output is changing with respect to the input.
In higher dimensions, when multiple variables are involved, this slope is called the gradient of the curve. A gradient simply tells you how much an output is changing with respect to one particular variable.
In deep learning, our concern is to find and understand how much a function changes with respect to the model weights. This is where the concept of gradient becomes really helpful.
Sometimes it is better to not use an activation function by using a linear activation function i.e. simply multiplying the output by 1 instead of passing it through a fancy function.
The drawback with the complex activation functions that we just saw is that they squash the incoming values to a fixed set of outputs. (Sigmoid and TanH ) ReLU even gets rid of negative numbers entirely by only returning 0 or positive numbers.
If your model is predicting a large complex number (house prices, weather forecast, stocks) or a negative value (temperature, directional velocity), you are better off using linear activations.
Another case could be that your model is predicting a set of outputs for classifying objects. This would mean that our model has multiple output neurons. You're almost always better off letting the loss function work with the raw outputs. We will learn about this more in detail.
Before we get started working with the neuron object and initialising a small neural network, we have to understand what the chain rule is and why exactly it is used.
The chain rule is a fundamental concept in calculus and it powers the backpropagation algorithm. Without it, there is almost no way we could change the weights of the model and determine how much that change in weight changes the final output.
If we have two functions and where is a composite function, we can find out how much changes if we just change :
Using chain rule, we can determine how much the output changes if we slightly change . Since is indirectly related to , we can simply multiply all the way inside until we get a function that is directly related to and find out the how much that function ( in this case) changes w.r.t. .
The chain rule still holds true if the function is related to more than 1 variables.
Suppose we have a function and and are further two functions where and , we can find out how much the output changes w.r.t. original inputs and .
Here, the symbols and both represent the same thing: differential of z with respect to or . Conventially, you use the delta symbol if a function has multiple inputs and we're finding out the derivative with respect to one of those inputs. Otherwise we simply use .
If and had both sharing the same input , then the equation would now be:
This time, we are adding instead of leaving them separate to accumulate and consider how much the same variable changes the output since is going through both and .
With the above equations, we can now find out how much the output changes if the inputs or change.
This is the most important concept to understand in order to build and understand neural networks. Without having an intuition of what the chain rule is telling us, it would be very difficult to understand how exactly these models 'learn' the dataset.
In our case when I was building and experimenting with the Neuron object, majority of the bugs and miscalculations arose from implementing the backward pass and it quickly got difficult to fix these issues when I started to work with larger networks where more parameters and variables were involved.
Let's start working with the Neuron that we defined by instantiating a few neurons, giving our model an input, predicting an output and finding out the 'loss' of that output. Our neural network is going to take a number and predict the square of that number.
The image below represents the model architecture that we're going to start with:
Let's start by setting a datapoint and and pass it to the model:
x1 = 3
y1 = 9
np.random.seed(0) # for reproducibility since the weights are randomly generated
n1 = Neuron([x1])
n2 = Neuron([n1])
n3 = Neuron([n2], act='linear')
pred = n3() # the prediction
If you were to call the neuron n3(), you would expect the output to be 9, but instead it comes out as a random value. The value 3 went through all the neurons, and the final neuron (or layer, as we will see later) returned a value.
Whenever you pass an input to the model, that input goes through all the model layers and neurons and all the complex processing before we finally get an output. This is called the forward pass.
The reason that we're getting a random value instead of the expected output of 9 is that we haven't trained/optimised the model yet.
Every model is supposed to have a loss function. A loss function tells the model how far its predicted outputs are from the real outputs. Remember: we are building a model which predicts a numerical output, and there has to be some tool to tell the model if it's learning the outputs correctly or not.
For our use case, we are going to use Mean Squared Error (MSE) as our loss function. . This is also called the L2 loss.
def mean_square_loss(pred, real):
# returning the mean square error
upstream = 2 * (pred - real)
out = (pred - real) ** 2
return out, upstream
Notice that we're also returning an upstream variable. The upstream here means : the change in loss with respect to the output.
For this particular neural network, since it has one single neuron as the output. But they can be different as we will explore in another example.
One of the main objectives in model training is to minimise the loss by updating model parameters. And this would mean we want the out variable to approach 0 by backpropagating through the model to update the parameters (weights and biases).
Notice we didn't choose finding out the absolute difference as the loss (also called the L1 loss): . This is because the derivative of this function is a piece wise linear function, and that does not help the model learn the complex nature of the dataset.
This derivative is only telling us whether we are headed in the right direction or not, but it doesn't tells us how much exactly we are off from the real predictions.
Finding out the loss for our model, we define the loss function and pass it as an arguement with the real datapoint.
loss, upstream = mean_square_loss(pred, y1)
We're going to use the upstream variable to start the backward pass which runs the backpropagation algorithm.
This is the most important part of model training, and this is where the backpropagation algorithm is implemented. Our model takes the loss function and gets the upstream from the loss function to update the model parameters, and this is where the concept of the chain rule comes in.
Initially, we obtained the gradient, which is what I'm referring to upstream here. What we want to do is to find out the gradient of w.r.t. weights as well as biases . Since all three neurons are connected in a chain, we can use the chain rule to find out those gradients.
We can add a new gradients property array as well as a .backward() method in our Neuron class. Here we are storing the gradient of the neuron output w.r.t. weights as well as the bias.
class Neuron:
def __init__(self, inputs=[], act='relu') -> None:
... our previous variables here
self.gradients = np.array([])
...
def backward(self, upstream):
# this is where the backward pass is going to happen
Starting with we find out the gradient of loss w.r.t. the weight and bias using the chain rule.
We already have computed as the upstream variable so all we need to compute is the differential of the output .
Since is a composite function, we use the chain rule again to find out the gradient of w.r.t. and :
Notice that we also have to differentiate the activation function. When I was doing the math on paper as well as writing out the code, I almost always used to forget that we have to differentiate the activation function first before the dot product.
Finding out the differential of ReLU function (I've provided the formula before), we are able to differentiate w.r.t. as follows:
Yeah that's correct. It's not neat but it's correct :)
So combining that we get the gradients of :
In our .backward method, we write the code for above mathematical operations:
def backward(self, upstream):
n = self.__call__() # the n(x) from the __call__ that we defined
dot = self.return_dot() # the f(x) i.e. the dot product
g = 1 # the value of differential of activation function
if self.act == 'relu':
g = 1 if dot > 0 else 0
elif self.act == 'tanh':
g = 1 - np.tanh(dot)**2
self.gradients = self.inputs.copy() # df/dw = inputs n
# we're also storing the bias in the same array. a design choice
self.gradients.resize(self.inputs.size + 1)
self.gradients[-1] = 1 # df/db = 1
self.gradients = upstream * g * self.gradients
# upstream dL/dn * dn/df * [df/dw, df/db]
Now we have computed and stored the neuron's gradients in self.gradients in the form of : .
Since we only have 1 input, we have one weight. But in larger neural networks, self.gradients is going to be stored as:
where:
We now finally have the local gradients for , and using this we can easily compute the loss gradient with respect to the parameters of this neuron.
We pass it through the neuron to get our first new and updated upstream:
dl_dn2 = n3.backward(upstream)
Let's see why do we need to update the upstream.
Now let's find out the gradients of . Since it's preceding , we have a slightly longer expression of chain rule.
If you have noticed, on the right hand side we can already compute the last term of this expression in the exact same way as we did for . We also get the upstream gradient as an arguement to the backward method. But we haven't computed the term
Can you think of how can we derive the second term exactly? Think about differentiating the activation function.
We know that is an input for the neuron , so we have the expression . Let's differentiate w.r.t. to the input .
We have derived the second term mathematically, and if you have noticed this looks like the derivative of the activation function which we have stored in our code as g. We already know g is for any neuron with respect dot product i.e. the differential of the activation function.
We can use the same g variable, and instead of multiplying it by inputs, we multiply it by the weights and the previous upstream to return the new upstream.
We have to return this new upstream for the next neuron's backward as an arguement. So we add two more lines at the end of backward method.
def backward(self, upstream):
... the gradient calculation
new_upstream = upstream * g * self.weights
return new_upstream
This makes our backward method complete. (Or does it?). We have fully implemented the backward pass by writing the chain rule and writing out the backpropagation algorithm in Python.
We pass it through the neuron once again to get our updated upstream:
dl_dn1 = n2.backward(dl_n2)
Since most of the work has been done above, I suggest you try to find out the gradients of yourself as an exercise. Do the calculations on paper as well as in a notebook, and then see if it's working with the code that we have written so far.
In Python, we can simply write this out as:
dl_dx1 = n1.backward(dl_dn2)
Though our codebase is not going to require the last upstream dl_dx1, in different types of neural networks, such as adversarial networks or stacked neural networks, we often have to nudge the input itself from the loss.
In our case, the input x1 is a floating-point number, but it can also be an output of an entirely different neural network! And that's why I'm mentioning it here.
So far, we have found out the gradients of a neuron, i.e. how much the outputs are changing w.r.t. weights and biases by implementing the backpropagation algorithm.
This is, in fact, the heaviest and most demanding stage of model training. The backward pass requires a huge amount of memory from the forward pass to store all the gradients and activations to efficiently perform the chain rule. And this is why training a model requires significantly more compute than model inferencing.
But now that we have the gradients ready, what do we do with them?
We use the calculated gradients to update the weights. To update the weights, we simply subtract all the old weights from the gradients multiplied by a factor . This factor is called the learning rate of the model.
This learning rate is used alongside the gradient to update the weight as follows:
If you remember in 2D curves, the slope of the graph at a particular point was telling us where the curve was increasing. Our gradients work in the same way: they point to the opposite direction of the minima of the loss function. And this is the reason why we are subtracting from the old weight instead of adding to get the new weight.
The gradients tell inform us about the direction of the weight update, and the learning rate tell us by how much factor we should update our weights.
In our Neuron class, we add another method that updates all the model weights:
def step(self, lr):
self.weights = self.weights - lr * self.gradients[:-1]
self.bias = self.bias - lr * self.gradients[-1]
Now in our before main training run and even before we initialise our model, we can introduce a new variable lr which is the learning rate. We will use this to update our model parameters.
lr = 1e-4
... model initialisation and backpropagation code
# update the parameters for each neurons
n1.step(lr)
n2.step(lr)
n3.step(lr)
We want to minimise the loss function, which is a complex function of all the weights and biases. Having a large learning rate would mean the model would very easily overshoot the minima and this can result in catastrophic results.
What we discussed so far is also called the gradient descent algorithm. We are using the gradients to descend to the minimum value of the loss function.
In our blog, we have learned the simplest and easiest optimiser that updates the weights, calledSGD: Stochastic Gradient Descent. In production and larger models, you would use more advanced optimisers such as Adam or RMSProp to train models more efficiently. But the core idea remains the same: calculate the gradients, and use them along with a learning rate to update the model parameters.
For our use case, since the model is extremely small, we can get away with larger learning rates from 0.01 to even 10.
So far, our training pass looks something like this. I haven't talked about zeroing out the gradients just yet:
pred = n3()
# calculate the model loss
loss, upstream = mean_square_loss(pred, y1)
# print(loss)
# perform backpropagation
dl_dn2 = n3.backward(upstream)
dl_dn1 = n2.backward(dl_dn2)
dl_dx1 = n1.backward(dl_dn1)
# update the model weights
n1.step(lr)
n2.step(lr)
n3.step(lr)
We fed an input to our model 3 and we expected the output to be 9. But the actual output that we saw came out as a random value. And since the model prediction and the real value y1 are different, we get a loss value (e.g. 39.403).
We have used the upstream variable to run the backward pass and update the model weights. If in a separate cell we compute the model loss once again after running the pass, we get a different loss, which is lower than the original loss (e.g. 13.823).
We can't run the training cell again and again, so we wrap all of this inside a loop and plot :
n_iters = 1000
lr = 0.01
losses = []
for i in range(n_iters):
clear_output(wait=True)
pred = n3()
loss, upstream = mean_square_loss(pred, y1)
losses.append(loss)
... the entire code
n3.step(lr)
plt.plot(losses)
plt.title(f"Epoch: {i+1}, Loss: {loss:.4f}")
plt.xlabel("Epochs")
plt.ylabel("Loss")
print(loss)
If we print out the loss at the end of the training loop, we see that our loss is very close to 0. If you have run the cell a couple of times, you can even decrease the learning rate again manually and get an even lower loss.
And now, if we try to predict the output by calling the last neuron n3(), we can get an output which is very close (or almost the same as) what we wanted = 9.
Perfect! Our model has finally learnt that the square of 3 is 9. If you have followed along so far, Congratulations π.
Does this mean that our model can predict the square of any number? Let's see!
Since our neuron doesn't accept any new inputs, we can add another method to set the neuron's inputs.
def set_input(self, new_inputs):
self.input_sources = new_inputs
And in a new cell, we can set the input to our model via n1 and get the output by calling n3().
We get the wrong predictions, and it seems like our model didn't learn how to square numbers. It only learned that if the input is 3, the output has to be 9.
Our model understood one data point but has absolutely no idea about the other. One could also say that the model is overfitting on our dataset. How do we mitigate this?
To have our model actually learn the pattern that it needs to return the square of the input, we feed it a large number of data points called a dataset. The larger the dataset you have, the better the model turns out to be.
It's a common practice to split the large dataset into 3 different parts: training, validation and testing. But for this blog, we're just going to work on one large dataset and test it ourselves if the training has been successful at all or not.
k = 50
x_train = [i for i in range(1, k+1)]
y_train = [i*i for i in range(1, k+1)]
The new training loop now becomes:
n = 100 # number of iterations
lr = 0.01 # learning rate
losses = []
for i in range(n):
clear_output(wait=True)
for x, y in zip(x_train, y_train):
n1.set_input([x])
pred = n3()
# calculate the model loss
loss, upstream = mean_square_loss(pred, y)
# the remaining optimisation
# plot the loss
plt.plot(losses)
plt.title(f"Epoch: {i+1}, Loss: {losses[-1]:.4f}")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.show()
print(losses[-1])
After running the cell and testing the model, this is what we get:
The training looks very unstable. This is because the inputs and the outputs are extremely large with respect to the weights and gradients.
We normalise the dataset to lie within so that the model training becomes more stable.
k = 50
x_train = [i/k for i in range(1, k+1)]
y_train = [i*i/(k*k) for i in range(1, k+1)]
Now, if we re-run the training loop, plot the loss and test the output, we get a relatively more stable graph and a better (still incorrect) prediction.
Our dataset is sufficiently large for the model to learn the pattern of taking the input, squaring it and returning it as a prediction. But we are still getting the wrong results, and even with a significant increase in the number of iterations, we barely get close to the real outputs.
Clearly, our simple neural network is not working. And the reason for this is that the plot of is parabolic. But we're only using 3 neurons, and the first two are using ReLU activations and the last one is using a linear activation function.
Just by looking at the plot, we can understand it's hard for such a simple and small network whose outputs are linear to approximate a smooth curve.
This means that we now have to update the model architecture by increasing the width of the network. What I mean by updating the model architecture is that we are now choosing to add (or remove) neurons and rearrange them in other ways to get a better output.
In large models, we can do this in two ways:
For our example to predict the square of a number, we can increase the width of the first layer by adding more neurons to the first layer, and all the neurons feed into one output neuron.
I have set the number of neurons to 12, and the final neuron, which is the output, is n0:
# Layer 1
layer1 = [Neuron([0]) for i in range(12)]
# layer 2: the output layer
n0 = Neuron(layer1, act='linear')
Since our current Neuron class is designed in such a way that it has to take an input, I've set the input to 0. But this shouldn't matter since we're already updating the inputs in the training loop. The rest of the code remains the same.
for i in range(n_iters):
for x, y in zip(x_train, y_train):
for l in layer1:
l.set_input([x])
pred = n0()
# calculate the model loss
loss, upstream = mean_square_loss(pred, y)
# perform backpropagation
dl_dn0 = n0.backward(upstream) # perform for the output neuron
# then for the first layer
for k, neuron in enumerate(layer1):
neuron.backward(dl_dn0[k])
...
Since the final neuron has the same inputs, each of those neurons gets its corresponding upstream variable (dl_dn0 in this case). I haven't plotted the loss function for now. Just as before, you run the training loop in a cell, print the loss, go back to the previous cell to update the learning rate and run the loop again.
Eventually, you'll get a very low loss in the order of . Here's the code to check whether our models are predicting correctly or not:
# pick a random input and see how the model performs
idx = np.random.randint(0, k)
x = x_train[idx]
print("Input: ", x * 50)
for l in layer1:
l.set_input([x])
output = n0()
print(f"The model's predicted output: {output * 2500 }")
print(f"Actual output: {y_train[idx] * 2500 }")
Here is a table of some of the outputs that I'm getting:
| Input | Predicted | Real |
|---|---|---|
| 47 | 2225.901 | 2209.000 |
| 36 | 1311.276 | 1296.000 |
| 39 | 1526.413 | 1521.000 |
| 12 | 189.529 | 144.000 |
| 15 | 297.657 | 225.00 |
For most of the inputs, the predictions aren't very far off from the real values. By having an even wider model, you can get even more accurate predictions.
For inputs < 10, you will see that the predictions are terribly wrong. You would expect the model to return the square of 2 as 4, but instead we get something like -170.8!
This is because of how we are processing our training data. Small inputs like 2 are converted as (2/50=0.04) and it's output (4/2500=0.0016). These inputs are way too small for activation functions and so they don't make any difference when calculating the gradients.
Furthermore, the loss is dominated by big-target points (x=40, target 1600). Being off by 100 there barely matters relative to the huge numbers; being off by 100 at x=7 (target 49) is catastrophic but contributes almost nothing to the total loss. This means the gradient descent totally ignores these small values.
Our choice of loss function prioritises large numbers over smaller ones, and this results in our model being unable to predict the correct square for smaller numbers.
In this blog, I have defined what a Neuron is, what's the maths behind it, how we can define one in Python as a Neuron class and how we can use multiple instances of a neuron to create a neural network.
What we specifically learnt is a Feed Forward Neural Network (FNN). This type of neural network is the easiest to learn and get started with, and it helps to build an intuitive understanding of what different types of neural networks are like, such as Convolutional NN or Recurrent NN.
We defined and trained a simple neural network that predicts the square of a number by understanding all the stages of model training:
The GitHub repository for this entire writeup is here: Link
There is another step of model training right before the backward pass that I've deliberately skipped - the zeroing out of the gradients. But it becomes necessary when our model grows bigger, and more gradients start to accumulate from multiple paths. More on that in Part 2.
If you've made it this far, I have a challenge for you. I want you to take this Neuron class and use it to train a simple MLP model on the Iris dataset.
You can use scikit-learn to get the dataset, but you must not use any deep learning framework such as PyTorch or MLX, and only build the model using the Neuron class that we have defined here!
My solution for this challenge is already on my GitHub, so you can cross-check your work there!
As you'll see, our model is going to be much larger and more complex to handle than before. So the next part of the blog is going to be built upon the solution for the Iris dataset!
I hope this blog gave you a gentle introduction to Neural Networks and a better idea of how model training on a large scale actually happens.
Thanks for the read, and keep building. π