Neural networks learn through training.
In training, a neural network is given an input and produces an output. It learns from the difference between its output and the desired output.
Here is what happens initially during the Feedforward stage of learning:
The input feeds forward through each layer, carrying its value through a set of
Weights and Activations unique to each layer.
During this process each neuron calculates its value based on the summation of
Weights * Inputs
This summation is then fed through the unit's Activation function, usually
Act = Sigmoid(Weights * Inputs)
Sigmoid = 1 / (1 + Exponent(-Sum)) - This produces a smooth step which switches the neuron on or off by providing a value between 0 and 1.
This process is called feedforward and is the first step in learning. Remember the Sigmoid as this is important for Training.
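The feedforward step above can be sketched in a few lines of Python (the function names here are my own, chosen for illustration):

```python
import math

def sigmoid(x):
    # Smooth step: squashes any input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights):
    # Summation of Weights * Inputs, fed through the Activation function
    total = sum(w * i for w, i in zip(weights, inputs))
    return sigmoid(total)

def feedforward(inputs, layers):
    # layers is a list of weight matrices; each row holds one neuron's weights.
    # The activations of one layer become the inputs to the next.
    activations = inputs
    for layer in layers:
        activations = [neuron_output(activations, w) for w in layer]
    return activations
```

For example, `feedforward([1.0, 0.0], [[[0.5, 0.5], [-0.5, 0.5]], [[1.0, 1.0]]])` pushes an input through a two-neuron hidden layer and a single output neuron.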
Then the Error made by the neural network is calculated:
Error = Target (Desired Output) - Output of the net
This error is then used to create a measure of how the Network performed called
Loss
An example of how we can do this is the Loss function; here it is:
Loss = Sum of squared(Error) / Number of Outputs
This makes the Network Error easier to use in training. Squaring makes it positive only.
Summing and dividing by the number of Outputs is like taking the average.
This method is called Mean Squared Error, or MSE.
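MSE is short enough to write out directly; a minimal sketch (the function name is my own):

```python
def mse_loss(targets, outputs):
    # Mean Squared Error: square each error so it is positive only,
    # then average over the number of outputs
    errors = [t - o for t, o in zip(targets, outputs)]
    return sum(e * e for e in errors) / len(errors)
```

For example, targets of `[1, 0]` against outputs of `[0.5, 0.5]` give a loss of 0.25.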
Now that we have the Loss calculated we need to feedback the Loss from the output Layer to all the hidden layers of the Neural Network.
This is called Backpropagation.
In mathematics there is a method called the chain rule:
If a change in z depends on y, and
changes in y depend on x, then it follows that changes in z must depend on x too.
dz/dx = dz/dy * dy/dx
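We can check the chain rule numerically with a tiny example, say z = y^2 and y = 3x (a finite-difference sketch, not part of the network itself):

```python
# Check dz/dx = dz/dy * dy/dx numerically for z = y**2, y = 3*x
def y(x):
    return 3 * x

def z(y_val):
    return y_val ** 2

x0 = 2.0
h = 1e-6
# Direct derivative of z with respect to x, by finite difference
dz_dx = (z(y(x0 + h)) - z(y(x0))) / h
# Chain rule: dz/dy * dy/dx = (2*y) * 3
chain = 2 * y(x0) * 3
```

Both routes give the same answer (here 36), which is all the chain rule claims.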
Using the chain rule we can Feedback the Loss from the Output layer to all other layers.
if we think of the change in Loss with respect to the weights that connect each layer:
l = loss at lower layer
ul = loss at upper layer
a = activity of lower layer neuron
w = weight between upper and lower layer
dl/dw = dl/dul * dul/dw
the upper loss depends on the weight
the lower loss depends on the upper loss
then it follows that...
the lower loss depends on the weight
this breaks down to
y = sum(ul * w)
Here we differentiate the Sigmoid activation function for each neuron; for a neuron with activation a, the Sigmoid's derivative is a * (1 - a), giving:
l = y * a * (1 - a)
The chain rule allows us to connect a change in loss at one layer with that of another ie the output loss with the input loss.
So now we have used the chain rule to feedback or backprop error from an upper layer to a lower layer. In this way we have a value for loss at each layer for each neuron.
Hurrah.
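The backprop step above can be sketched for a single neuron (a minimal sketch; the function name is my own, and it assumes the Sigmoid activations from the feedforward stage):

```python
def backprop_loss(upper_losses, weights, activation):
    # y = sum(ul * w): upper-layer losses weighted by the connections
    # leaving this neuron
    y = sum(ul * w for ul, w in zip(upper_losses, weights))
    # Scale by the Sigmoid derivative a * (1 - a) of this neuron's activation
    return y * activation * (1.0 - activation)
```

Calling this for every neuron in a layer, and then repeating layer by layer towards the input, gives a loss value for each neuron in the network.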
Now that we have loss for each layer and each neuron we can then change the weights that connect between each layer in such a way to reduce the loss and thereby train the neural network to provide an output that better resembles the desired one - the target.
We will change the Weights at each layer using differentiation. We will change them according to the loss at each layer, giving us a change in each weight of:
dw = loss * a * lrt
lrt is our learning rate - how fast we want it to learn.
This is called Gradient Descent because we are following the gradient of the loss. As the loss gradient lessens, so do the updates to each weight. We follow the path of least resistance until we reach our goal, which is the least amount of loss.
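Under these definitions the update rule is one line (a sketch; the function name and default learning rate are my own):

```python
def update_weight(weight, loss, activation, lrt=0.1):
    # dw = loss * a * lrt: nudge the weight in the direction that
    # reduces the loss, scaled by the learning rate
    return weight + lrt * loss * activation
```

As the loss shrinks towards zero, so does `lrt * loss * activation`, and the weights settle.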
Too fast and the updates can overshoot good solutions, so the network never settles. A related problem is overfitting, which is when the neural network fails to generalise to new unseen data. We can compensate for overfitting by using noise in our neural network. This helps the Gradient Descent to avoid falling into suboptimal solutions that overfit the data. The noise works by causing the Descent to move a little randomly, jump up out of these suboptima, and explore other solutions that are better.
Another way to look at it is that Backprop is the reverse process of Feedforward.
In Backprop we use the derivative of the neuron's activation function (the Sigmoid) to produce error from the summation of upper error and weight. Reverse.
We Feedforward activations using the same activation function on the summation of weights and lower activations in the forward direction. Forward.
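Putting all the pieces together, here is a tiny end-to-end training loop: a single Sigmoid neuron learning the OR truth table with the feedforward, error, and weight-update steps described above (a minimal sketch; the bias term, dataset, epoch count, and learning rate are my own choices):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# OR truth table: inputs and their desired outputs (targets)
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w = [0.0, 0.0]  # weights
b = 0.0         # bias
lrt = 1.0       # learning rate

for epoch in range(2000):
    for inputs, target in data:
        # Feedforward: Act = Sigmoid(Weights * Inputs)
        out = sigmoid(sum(wi * xi for wi, xi in zip(w, inputs)) + b)
        # Error scaled by the Sigmoid derivative out * (1 - out)
        delta = (target - out) * out * (1.0 - out)
        # Gradient Descent: dw = loss * a * lrt
        w = [wi + lrt * delta * xi for wi, xi in zip(w, inputs)]
        b += lrt * delta
```

After training, the neuron's output is below 0.5 for (0, 0) and above 0.5 for the other three input pairs, which is the OR function.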
And that's how the neural network learns.