Building Neural Network from Scratch in Python

Understanding Deep Learning from Scratch

Akil Ahmed
ITNEXT


This is the second post in the Understanding Deep Learning from Scratch series. The first post explains in detail how to build a single neuron (perceptron); we will refer to it several times here, so please keep it open in another tab. The code for this article is on this GitHub Link, in the building_dnn.ipynb file, and the NumPy dataset is available at this link.

In this post, we will create a shallow neural network in Python from scratch and then train a classification model to predict whether an image shows a dog or a cat. Please refer to my previous post for the dataset setup. Topics already explained there, such as the sigmoid function, forward propagation, backward propagation, the cost function, and gradient descent, come up again in this post, but in less detail.

Let’s look at the architecture of our model.

image-1

Unlike in our perceptron, the data (X) is not connected directly to the output neuron. It is connected to a middle layer (A), called a hidden layer, and each unit of X is connected to every unit of A. Here we will create a network with 1 input layer, 1 hidden layer, and 1 output layer; we can increase the number of hidden layers if we want to. A is calculated like this:

equation - 1
equation - 2
image-2

Like last time, we compute the Z vector with equation 1, where the superscript l denotes the layer number. W, X, and b are the weights, the sample data, and the bias, respectively. Equation 2 then applies the activation function to Z to produce A. We will use the ReLU activation function in the hidden layers and the sigmoid function in the output layer.
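Written out, and assuming the standard layer-wise notation (with $A^{[0]} = X$ and $g^{[l]}$ standing for the activation function of layer $l$, both of which are assumptions here), equations 1 and 2 would read roughly as follows:

```latex
\begin{align}
Z^{[l]} &= W^{[l]} A^{[l-1]} + b^{[l]} \tag{1} \\
A^{[l]} &= g^{[l]}\left(Z^{[l]}\right) \tag{2}
\end{align}
```

For the first layer, $A^{[l-1]}$ is simply the sample data X.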

All 5 units of the hidden layer are then connected to our output unit, which uses the sigmoid function as before to classify into the right class.

As before, we first initialize our weights and biases; the data then passes through the hidden layer and finally the output layer. Let’s go through the code in detail.

  • First, in the initialize_parameters_deep function, we initialize the parameters, but unlike before we initialize W with random numbers. Why? If W starts at zero for every unit (neuron) of a layer, every unit computes the same output during forward propagation and receives the same gradient during backward propagation. The units are updated identically because they start from the same point, so they never learn different features, and the layer behaves like a single neuron with extra computation.
  • If we initialize W with random numbers, every unit starts from a different point, so each neuron can learn something different, which is what we want. We will see how soon.
  • We create a parameters dictionary that stores W and b (initialized to 0) for each layer, and return it.
  • Then we create the sigmoid and ReLU functions, used in the output layer and the hidden layers respectively.
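The steps above can be sketched as follows. This is a minimal sketch rather than the article's exact code: the 0.01 scale factor, the `seed` argument, and the use of NumPy's `default_rng` are assumptions made for illustration.

```python
import numpy as np

def initialize_parameters_deep(layer_dims, seed=1):
    """Return {'W1': ..., 'b1': ..., ...} for the layer sizes in layer_dims."""
    rng = np.random.default_rng(seed)  # assumption: fixed seed for reproducibility
    parameters = {}
    for l in range(1, len(layer_dims)):
        # Small random weights break the symmetry between units of a layer.
        # The 0.01 scale is an assumption, not a value taken from the article.
        parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        # Biases can safely start at zero.
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

def sigmoid(Z):
    """Squash Z into (0, 1); used in the output layer."""
    return 1.0 / (1.0 + np.exp(-Z))

def relu(Z):
    """Element-wise max(0, Z); used in the hidden layers."""
    return np.maximum(0, Z)

params = initialize_parameters_deep([3072, 5, 1])
print({name: value.shape for name, value in params.items()})
```

Each W has shape (units in this layer, units in the previous layer), so a layer's output can be computed with a single matrix product.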

Our forward propagation step is very similar to before. The only difference is that previously we did it for a single neuron, and now we do it for every neuron of every layer. So if the model has 2 hidden layers with 5 units each, we have 11 neurons in total (2 x 5 + 1 output neuron).

Let’s understand the code in detail,

  • In the linear_forward function, we implement equation 1: the dot product of the current layer’s weights (W) with the result from the previous layer (A_prev), or with the sample data (X) for the first layer only, plus the bias (b).
  • In the linear_activation_forward function, we pass A_prev (the result from the previous layer), W, b, and activation as arguments. Here activation is a string (sigmoid or relu) that decides which activation function to use in the current layer. We call linear_forward to get Z, then compute A for the current layer with the chosen activation function, following equations 1 and 2.
  • In the L_model_forward function, we initialize A as X (the dataset) for the first layer, create an empty list named caches that keeps track of the parameters (W and b), A_prev, and Z, and store the number of layers in the variable L. I will let you think about why L = len(parameters) // 2 gives the number of layers.
  • In the for loop, the hidden layers (A) are computed with ReLU activation, and for each new layer A_prev is set to the current A (line 28). So for the first layer the input is X, and for every later layer the input is the output of the previous layer (A_prev).
  • After the loop, we calculate the output of the final layer (AL) with the sigmoid activation function.
  • L_model_forward returns AL and the caches that were appended throughout the forward propagation step.
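The forward pass described above can be sketched like this. The function names follow the article, but the exact signatures and the layout of each cache tuple are assumptions, and the toy input at the bottom is made up for illustration.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)

def linear_forward(A_prev, W, b):
    """Linear part of a layer: Z = W . A_prev + b."""
    Z = W @ A_prev + b
    return Z, (A_prev, W, b)  # keep the inputs for backprop

def linear_activation_forward(A_prev, W, b, activation):
    """Linear step followed by the activation named in the string."""
    Z, linear_cache = linear_forward(A_prev, W, b)
    if activation == "sigmoid":
        A = sigmoid(Z)
    elif activation == "relu":
        A = relu(Z)
    else:
        raise ValueError("activation must be 'sigmoid' or 'relu'")
    return A, (linear_cache, Z)  # assumed cache layout

def L_model_forward(X, parameters):
    """ReLU for layers 1..L-1, sigmoid for layer L."""
    caches = []
    A = X
    L = len(parameters) // 2
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(
            A_prev, parameters["W" + str(l)], parameters["b" + str(l)], "relu")
        caches.append(cache)
    AL, cache = linear_activation_forward(
        A, parameters["W" + str(L)], parameters["b" + str(L)], "sigmoid")
    caches.append(cache)
    return AL, caches

# Toy example: 4 input features, 3 samples, one hidden layer of 5 units.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
parameters = {
    "W1": rng.standard_normal((5, 4)) * 0.01, "b1": np.zeros((5, 1)),
    "W2": rng.standard_normal((1, 5)) * 0.01, "b2": np.zeros((1, 1)),
}
AL, caches = L_model_forward(X, parameters)
print(AL.shape, len(caches))  # (1, 3) 2
```

AL holds one probability per sample, and caches holds one entry per layer for later use in backprop.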

You might ask why we use ReLU in the hidden layers and not sigmoid. Since ReLU is calculated as max(0, Z), it is computationally cheaper than sigmoid or tanh because it involves no exponential operation. It also mitigates the vanishing gradient problem: its gradient is 1 for every positive Z, whereas the sigmoid gradient is never larger than 0.25.
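A small numeric check illustrates the gradient argument. The sample values of Z are arbitrary and chosen only for illustration.

```python
import numpy as np

Z = np.array([-4.0, -1.0, 0.5, 4.0])  # arbitrary sample pre-activations

s = 1.0 / (1.0 + np.exp(-Z))
sigmoid_grad = s * (1.0 - s)        # derivative of sigmoid, peaks at 0.25 when Z = 0
relu_grad = (Z > 0).astype(float)   # derivative of ReLU: 1 for Z > 0, else 0

print(sigmoid_grad.max() <= 0.25)   # True
print(relu_grad)                    # [0. 0. 1. 1.]
```

Multiplying several factors of at most 0.25 across layers shrinks the gradient quickly, while a factor of 1 passes it through unchanged.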

So, we are done with our Forward Propagation Step.

Now it is time for the cost function (J), which is computed from the output of the last layer. The previous article explains the cost function in detail, so here we will only see how the formula changes for multiple units and hidden layers.

equation-3

The only change here is the [L] superscript, which denotes the Lth (last) layer of the network. Let’s see the code.

  • The code is pretty straightforward: in the compute_cost function, we take the final result of the forward propagation step (AL) and compute the cost against the actual labels (Y).
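A minimal sketch of such a function, assuming the cost is the binary cross-entropy used for the perceptron, averaged over the m samples. The small `eps` constant is an assumption added here to avoid taking log(0); the sample values are made up.

```python
import numpy as np

def compute_cost(AL, Y):
    """Binary cross-entropy between predictions AL and labels Y, both of shape (1, m)."""
    m = Y.shape[1]
    eps = 1e-12  # assumption: guards against log(0)
    cost = -np.sum(Y * np.log(AL + eps) + (1 - Y) * np.log(1 - AL + eps)) / m
    return float(np.squeeze(cost))

AL = np.array([[0.9, 0.2, 0.5]])  # made-up predicted probabilities
Y = np.array([[1, 0, 1]])         # made-up labels
print(compute_cost(AL, Y))
```

If every prediction is 0.5, the cost is ln 2 ≈ 0.693, which is close to the cost reported at iteration 0 in the training output later in this article.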

Now we have to run backward propagation, which calculates the gradient of the loss function with respect to the parameters. During forward and backward propagation, we traverse something called a computational graph. Please refer to this article to understand computational graphs in detail. In this graph, whatever we did in forward propagation we now do in the reverse direction, calculating the gradient of the loss function with respect to the parameters along the way.

First, we find the gradient of the loss function with respect to Z, the input of the activation function (dZ). The activation was the last thing we computed in each layer during forward propagation, so in backprop we handle it first, and so on. By the chain rule, dZ is the product of the derivative of the loss function with respect to A (dA) and the derivative of the activation function with respect to Z (equation 4).

equation-4

We know the activation function in the output layer, which is the first layer we visit during backprop, was sigmoid, and in every other layer it was ReLU. Hence, when the activation function is sigmoid, dZ = dA * A * (1 - A), and when it is ReLU, dZ = dA for Z > 0 and dZ = 0 for Z ≤ 0. Here is the code:
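A minimal sketch of the two rules. The name `relu_backwards` is used later in the article; `sigmoid_backwards` and both signatures (taking dA and the cached Z directly) are assumptions for illustration.

```python
import numpy as np

def sigmoid_backwards(dA, Z):
    """dZ = dA * A * (1 - A), where A = sigmoid(Z)."""
    A = 1.0 / (1.0 + np.exp(-Z))
    return dA * A * (1.0 - A)

def relu_backwards(dA, Z):
    """dZ = dA where Z > 0, and 0 where Z <= 0."""
    dZ = np.array(dA, copy=True)  # copy so the caller's dA is not modified
    dZ[Z <= 0] = 0
    return dZ

Z = np.array([[-2.0, 0.0, 3.0]])
dA = np.array([[1.0, 1.0, 1.0]])
print(relu_backwards(dA, Z))     # [[0. 0. 1.]]
print(sigmoid_backwards(dA, Z))
```

At Z = 0 the sigmoid rule gives exactly 0.25 times dA, the largest factor it can produce.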

I will not go into the derivation of the equations below; if you want me to, please comment. Here l is the current layer and l-1 is the previous layer. The derivatives of the loss function (J) with respect to the weights (dW), the biases (db), and the previous layer’s output (dA_prev) are:

equation-5
equation-6
equation-7
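Assuming the standard vectorized form over m samples, where $dZ^{[l](i)}$ denotes the column of $dZ^{[l]}$ belonging to sample $i$ (this notation is an assumption), equations 5, 6, and 7 would read roughly as follows:

```latex
\begin{align}
dW^{[l]} &= \frac{1}{m}\, dZ^{[l]} \left(A^{[l-1]}\right)^{T} \tag{5} \\
db^{[l]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)} \tag{6} \\
dA^{[l-1]} &= \left(W^{[l]}\right)^{T} dZ^{[l]} \tag{7}
\end{align}
```

Here $dA^{[l-1]}$ is what the text calls dA_prev.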

Let’s see the code,

  • As we can see, the linear_backward function takes dZ along with the values cached during forward propagation (A_prev, W, and b), and returns dW, db, and dA_prev according to equations 5, 6, and 7.
  • In forward propagation, we took W and b of the lth layer and A of the (l-1)th layer to calculate A of the lth layer. In backward propagation, we take dA of the lth layer to calculate dW and db of the lth layer and dA of the (l-1)th layer, so the process happens in reverse order. Hopefully this helps you build intuition for it.
  • In the linear_activation_backward function, we first check which activation function the backprop should go through.
  • The activation’s backprop returns dZ; linear_backward then computes and returns dA_prev, dW, and db.
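One way to sketch these two functions is shown below. The cache layout ((A_prev, W, b), Z) and the signatures are assumptions, and the activation derivatives are inlined to keep the example self-contained.

```python
import numpy as np

def linear_backward(dZ, linear_cache):
    """Gradients of the linear step, averaged over the m samples."""
    A_prev, W, b = linear_cache
    m = A_prev.shape[1]
    dW = (dZ @ A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

def linear_activation_backward(dA, cache, activation):
    """Backprop through the activation first, then through the linear step."""
    linear_cache, Z = cache
    if activation == "relu":
        dZ = np.where(Z > 0, dA, 0.0)
    elif activation == "sigmoid":
        A = 1.0 / (1.0 + np.exp(-Z))
        dZ = dA * A * (1.0 - A)
    else:
        raise ValueError("activation must be 'sigmoid' or 'relu'")
    return linear_backward(dZ, linear_cache)

# Toy layer: 4 inputs, 5 units, 3 samples (random values for illustration).
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((4, 3))
W = rng.standard_normal((5, 4))
b = np.zeros((5, 1))
Z = W @ A_prev + b
dA = rng.standard_normal((5, 3))

dA_prev, dW, db = linear_activation_backward(dA, ((A_prev, W, b), Z), "relu")
print(dA_prev.shape, dW.shape, db.shape)  # (4, 3) (5, 4) (5, 1)
```

Each gradient has the same shape as the quantity it belongs to, which is a quick sanity check when debugging backprop.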

In the initialize_parameters_deep function, we stored all our weights (W) and biases (b) in the parameters dictionary. Likewise, we store dW, db, and dA_prev of every layer in a dictionary called grads, where each key is dW + str(l) (the stringified layer number) and the value is the dW array of that specific layer. I took dW as an example; db and dA are stored the same way.

From forward propagation we got W, b, and AL; from backward propagation we got dW, db, and dA_prev. Now we have to update W and b of every layer in order to fit our data. As in the previous article, we update W and b by subtracting dW and db, each multiplied by the learning rate (alpha), from W and b respectively. Refer to the gradient descent part of the previous article to understand it better.

Let’s look at the code for gradient storing and parameter updates,

  • In the L_model_backward function, we first initialize grads as an empty dictionary that will store dW, db, and dA_prev. L is the number of layers, m is the number of samples, taken from the shape of AL (the probability vector output by forward propagation), and Y is the label vector containing 0 or 1 values.
  • dAL is the gradient of the loss function with respect to the output of the last layer (AL). It is passed to the linear_activation_backward function to calculate the grads for the last layer (line 10).
  • As you remember, the last layer of forward prop (the output layer) used the sigmoid function, so we do backprop through the sigmoid activation as well and store the results in the grads dictionary.
  • Then we loop from the second-to-last layer down to the first layer, calculating and storing the grads for all the other layers. These layers used ReLU activation, so we use the relu_backwards function that we created previously. At the end, we return grads.
  • In the update_parameters function, we pass params (weights and biases), grads (the gradients of the loss function w.r.t. the params, dW and db), and the learning rate.
  • Then we loop from the first layer to the last layer and update the weights and biases according to equations 9 and 10. This is the same thing we did in the gradient_decent function of our perceptron model in the previous article, but here we do it for multiple layers and multiple neurons (perceptrons).
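The two functions described above might look roughly like this. It is a sketch under assumptions: the cache layout ((A_prev, W, b), Z), the formula used for dAL (the derivative of binary cross-entropy), and the toy network at the bottom are not taken from the article's code.

```python
import numpy as np

def linear_activation_backward(dA, cache, activation):
    """Return dA_prev, dW, db for one layer (assumed cache layout)."""
    (A_prev, W, b), Z = cache
    if activation == "relu":
        dZ = np.where(Z > 0, dA, 0.0)
    else:  # "sigmoid"
        A = 1.0 / (1.0 + np.exp(-Z))
        dZ = dA * A * (1.0 - A)
    m = A_prev.shape[1]
    return W.T @ dZ, (dZ @ A_prev.T) / m, np.sum(dZ, axis=1, keepdims=True) / m

def L_model_backward(AL, Y, caches):
    grads = {}
    L = len(caches)
    Y = Y.reshape(AL.shape)
    # Derivative of binary cross-entropy w.r.t. AL (assumed loss).
    dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    # Output layer: sigmoid.
    dA_prev, dW, db = linear_activation_backward(dAL, caches[L - 1], "sigmoid")
    grads["dA" + str(L - 1)], grads["dW" + str(L)], grads["db" + str(L)] = dA_prev, dW, db
    # Hidden layers, from second-to-last down to the first: ReLU.
    for l in reversed(range(1, L)):
        dA_prev, dW, db = linear_activation_backward(grads["dA" + str(l)], caches[l - 1], "relu")
        grads["dA" + str(l - 1)], grads["dW" + str(l)], grads["db" + str(l)] = dA_prev, dW, db
    return grads

def update_parameters(parameters, grads, learning_rate):
    """One gradient descent step: W := W - alpha * dW, b := b - alpha * db."""
    updated = {}
    for l in range(1, len(parameters) // 2 + 1):
        updated["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        updated["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return updated

# Toy network [4, 5, 1] with a hand-written forward pass to build the caches.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
Y = np.array([[1, 0, 1]])
parameters = {
    "W1": rng.standard_normal((5, 4)) * 0.01, "b1": np.zeros((5, 1)),
    "W2": rng.standard_normal((1, 5)) * 0.01, "b2": np.zeros((1, 1)),
}
Z1 = parameters["W1"] @ X + parameters["b1"]
A1 = np.maximum(0, Z1)
Z2 = parameters["W2"] @ A1 + parameters["b2"]
AL = 1.0 / (1.0 + np.exp(-Z2))
caches = [((X, parameters["W1"], parameters["b1"]), Z1),
          ((A1, parameters["W2"], parameters["b2"]), Z2)]

grads = L_model_backward(AL, Y, caches)
new_parameters = update_parameters(parameters, grads, learning_rate=0.2)
print(sorted(grads))
```

Returning a new dictionary from update_parameters, instead of modifying the input in place, is a design choice made here so the old and new parameters can be compared.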

Let’s assemble all the functions till now to create the L_layer_model function,

  • We create an empty costs list, which will store the cost for each iteration.
  • We initialize our parameters (weights and biases) from the layer dimensions, an array like [3072, 5, 1]. The first entry is the input layer with 3072 units (the flattened images), followed by 1 hidden layer with 5 units and finally 1 output layer with 1 unit (neuron) that gives us the classification probability. We can increase the number of hidden layers just by changing the layer dimensions array: if we change it to [3072, 5, 5, 1], the network has two hidden layers, and so on.
  • Inside the loop, we iterate for the number of iterations we have provided.
  • First, we forward propagate, which gives us the result of the output layer (AL) and the caches, which store the parameters (weights and biases) and the output of every layer (A).
  • Then we compute the cost from AL and the actual labels (Y).
  • After that, we compute our gradients (dW, db, and dA_prev).
  • Next, we update our parameters and print the cost.
  • In lines 13–14, I reduce the learning rate after a certain number of iterations so that it does not overshoot.
  • At last, we return our updated parameters (W and b) and the costs list.
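The assembly described above can be sketched as a compact training loop. Several things here are assumptions for illustration: `forward` and `backward` are condensed stand-ins for L_model_forward and L_model_backward, the learning rate schedule (halving every `decay_every` iterations) is a guess at what the article's code does, and the toy dataset is synthetic rather than the cat/dog images.

```python
import numpy as np

def initialize_parameters_deep(layer_dims, seed=1):
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

def forward(X, parameters):
    """ReLU hidden layers, sigmoid output; caches hold (A_prev, W, Z) per layer."""
    caches, A, L = [], X, len(parameters) // 2
    for l in range(1, L + 1):
        W, b = parameters["W" + str(l)], parameters["b" + str(l)]
        Z = W @ A + b
        caches.append((A, W, Z))
        A = np.maximum(0, Z) if l < L else 1.0 / (1.0 + np.exp(-Z))
    return A, caches

def compute_cost(AL, Y):
    eps = 1e-12  # assumption: guards against log(0)
    return float(-np.mean(Y * np.log(AL + eps) + (1 - Y) * np.log(1 - AL + eps)))

def backward(AL, Y, caches):
    grads, m = {}, Y.shape[1]
    dZ = AL - Y  # sigmoid output combined with cross-entropy loss
    for l in range(len(caches), 0, -1):
        A_prev, W, Z = caches[l - 1]
        grads["dW" + str(l)] = (dZ @ A_prev.T) / m
        grads["db" + str(l)] = np.sum(dZ, axis=1, keepdims=True) / m
        if l > 1:
            Z_prev = caches[l - 2][2]
            dZ = np.where(Z_prev > 0, W.T @ dZ, 0.0)  # ReLU backprop
    return grads

def L_layer_model(X, Y, layer_dims, learning_rate=0.2, num_iterations=1500,
                  decay_every=500, decay_rate=0.5, print_cost=False):
    costs = []
    parameters = initialize_parameters_deep(layer_dims)
    for i in range(num_iterations):
        AL, caches = forward(X, parameters)
        cost = compute_cost(AL, Y)
        grads = backward(AL, Y, caches)
        for l in range(1, len(layer_dims)):
            parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
            parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
        if i > 0 and i % decay_every == 0:
            learning_rate *= decay_rate  # assumed schedule: shrink the step over time
        if i % 100 == 0:
            costs.append(cost)
            if print_cost:
                print("Cost after iteration {}: {}".format(i, cost))
    return parameters, costs

# Synthetic toy data: label is 1 when the two features sum to a positive number.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)
parameters, costs = L_layer_model(X, Y, [2, 5, 1])
print(costs[0], costs[-1])
```

On this toy problem the recorded cost should start near ln 2 and fall as training proceeds; the real dataset would use layer dimensions such as [3072, 5, 1].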

Now, let’s train and test our model,

  • First, we call the L_layer_model function, which trains the model. It takes as arguments the training dataset, the layer dimensions [3072, 5, 5, 1], a learning rate of 0.2, and 10000 iterations, and returns the updated parameters and costs.
  • Then, in the predict function, we convert the probability values to 0 and 1. If the result of forward propagation (L_model_forward) with the updated parameters is greater than 0.5, the prediction is 1, otherwise 0.
  • It prints the accuracy on the dataset by comparing the predictions with the actual values y.
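The thresholding step can be sketched as below. The `forward` helper is a condensed stand-in for L_model_forward, and the hand-written parameters and inputs are made up so the result is easy to follow; none of it comes from the trained model.

```python
import numpy as np

def forward(X, parameters):
    """ReLU hidden layers, sigmoid output; returns the probabilities AL."""
    A, L = X, len(parameters) // 2
    for l in range(1, L + 1):
        Z = parameters["W" + str(l)] @ A + parameters["b" + str(l)]
        A = np.maximum(0, Z) if l < L else 1.0 / (1.0 + np.exp(-Z))
    return A

def predict(X, y, parameters):
    """Turn probabilities into 0/1 labels and report accuracy against y."""
    AL = forward(X, parameters)
    predictions = (AL > 0.5).astype(int)
    accuracy = float(np.mean(predictions == y))
    print("Accuracy:", accuracy)
    return predictions

# Hand-written toy parameters: the output is sigmoid(x0 - x1) for non-negative inputs.
parameters = {
    "W1": np.array([[1.0, 0.0], [0.0, 1.0]]), "b1": np.zeros((2, 1)),
    "W2": np.array([[1.0, -1.0]]), "b2": np.zeros((1, 1)),
}
X = np.array([[2.0, 0.0, 3.0],
              [0.0, 2.0, 1.0]])
y = np.array([[1, 0, 0]])
predictions = predict(X, y, parameters)
```

Here two of the three toy samples are classified correctly; the third is deliberately labeled so that the accuracy is below 1.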

Output from the training run:

Cost after iteration 0: 0.6965011895457693
Cost after iteration 1000: 0.6368951308414399
Cost after iteration 2000: 0.6219993389332111
Cost after iteration 3000: 0.6054223197992217
Cost after iteration 4000: 0.6010027596861128
Cost after iteration 5000: 0.5801166825629971
Cost after iteration 6000: 0.578504474487532
Cost after iteration 7000: 0.5665113489929966
Cost after iteration 8000: 0.5620262967779565
Cost after iteration 9000: 0.5597852629281578
Cost after iteration 9999: 0.5575603590876632
test Accuracy: 0.6150000000000001
train Accuracy: 0.7849999999999998

We can clearly see that our model with multiple neurons and layers improved by just 3% over our previous model, which is not good at all, although we have done much more work than before. So what is going on?

The model is overfitting the training data: the train accuracy (about 78.5%) is well above the test accuracy (about 61.5%). To avoid that, we have to use regularization methods and tune a few more hyperparameters to optimize our algorithm, which you can read about in the third article of this series. Thanks for reading. If you have any questions, please ask in the comment section or on whichever social media you prefer.

Credit: https://www.coursera.org/specializations/deep-learning
