This repository provides a detailed mathematical derivation of deep learning components. The goal is to break down the complex mathematics underlying neural networks for better understanding and application.
- 1.1. Objective Function and Model Formulation
- 1.2. Gradients of Weights and Biases
- 1.3. Backpropagation
A fully connected neural network, which acts as a parametric estimator, can be mathematically expressed as (written here for the three-layer network used throughout this derivation):

$$\hat{Y} = W_3\,\sigma\big(W_2\,\sigma(W_1 X + b_1) + b_2\big) + b_3$$
Where:

- $\hat{Y}$ represents the predicted output of the network.
- $W_i$ represents the weight matrix associated with layer $i$.
- $b_i$ denotes the bias vector corresponding to layer $i$.
- $\sigma$ denotes the activation function, which introduces non-linearity into the model.
- $X$ represents the input data.
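The MATLAB listings in the following sections call `ReLU` and `ReLU_deriv` helpers that are not defined there. A minimal sketch of these helpers, together with a direct evaluation of the three-layer model, could look as follows (it assumes `W1`, `W2`, `W3`, `b1`, `b2`, `b3`, and an input batch `x` are already in the workspace):

```matlab
% Element-wise ReLU activation and its derivative (assumed helpers;
% the original listings use them but do not define them)
ReLU       = @(z) max(z, 0);
ReLU_deriv = @(z) double(z > 0);

% Direct evaluation of the three-layer model
% y_hat = W3 * sigma(W2 * sigma(W1 * x + b1) + b2) + b3
y_hat = W3 * ReLU(W2 * ReLU(W1 * x + b1) + b2) + b3;
```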
To measure the model's performance, we use the Mean Squared Error (MSE) loss function, defined as:

$$\mathcal{L} = \frac{1}{N}\sum_{j=1}^{N}\left(\hat{y}_j - y_j\right)^2,$$

where $y_j$ is the ground-truth target for the $j$-th sample, $\hat{y}_j$ is the corresponding prediction, and $N$ is the number of samples in the batch (`batch_size` in the code below).
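For concreteness, a minimal sketch of this loss for a mini-batch, reusing the variable names `y_pred`, `y_true`, and `batch_size` from the listings below:

```matlab
% Batch MSE: average of the squared residuals over the mini-batch
loss = sum((y_pred - y_true).^2, 'all') / batch_size;
```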
The training objective is to minimize the loss function with respect to the weight matrices and bias vectors:

$$\min_{\{W_i,\,b_i\}_{i=1}^{3}} \; \mathcal{L}\left(\hat{Y}, Y\right).$$
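In practice this minimization is typically carried out with gradient-based updates. One common choice is a plain gradient-descent step (a sketch; the learning rate $\eta$ is an assumption, not fixed by the original text):

$$W_i \leftarrow W_i - \eta\,\frac{\partial \mathcal{L}}{\partial W_i}, \qquad b_i \leftarrow b_i - \eta\,\frac{\partial \mathcal{L}}{\partial b_i}, \qquad i = 1, 2, 3.$$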
To optimize the neural network, we need to compute the gradients of the loss function with respect to each parameter. The gradient of the loss function with respect to the network output is

$$\frac{\partial \mathcal{L}}{\partial \hat{Y}} = \frac{2}{N}\,E, \qquad \text{where } E \equiv \hat{Y} - Y.$$

Applying the chain rule, we can derive the gradients for the parameters in the output layer, $W_3$ and $b_3$:

$$\frac{\partial \mathcal{L}}{\partial W_3} = \frac{2}{N}\,E\,\sigma\!\big(W_2\,\sigma(W_1 X + b_1) + b_2\big)^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial b_3} = \frac{2}{N}\sum_{j=1}^{N} E_{j},$$

where $E_{j}$ denotes the $j$-th column of $E$, so the bias gradient sums the residuals over the batch.

The gradients with respect to the second hidden layer's parameters, $W_2$ and $b_2$, are

$$\frac{\partial \mathcal{L}}{\partial W_2} = \frac{2}{N}\Big[\big(W_3^{\top} E\big) \odot \sigma'\!\big(W_2\,\sigma(W_1 X + b_1) + b_2\big)\Big]\,\sigma(W_1 X + b_1)^{\top},$$

$$\frac{\partial \mathcal{L}}{\partial b_2} = \frac{2}{N}\sum_{j=1}^{N}\Big[\big(W_3^{\top} E\big) \odot \sigma'\!\big(W_2\,\sigma(W_1 X + b_1) + b_2\big)\Big]_{j},$$

where $\odot$ denotes element-wise (Hadamard) multiplication.

Finally, the gradients with respect to the first layer's parameters, $W_1$ and $b_1$, are

$$\frac{\partial \mathcal{L}}{\partial W_1} = \frac{2}{N}\Big[W_2^{\top}\Big(\big(W_3^{\top} E\big) \odot \sigma'\!\big(W_2\,\sigma(W_1 X + b_1) + b_2\big)\Big) \odot \sigma'(W_1 X + b_1)\Big]\,X^{\top},$$

$$\frac{\partial \mathcal{L}}{\partial b_1} = \frac{2}{N}\sum_{j=1}^{N}\Big[W_2^{\top}\Big(\big(W_3^{\top} E\big) \odot \sigma'\!\big(W_2\,\sigma(W_1 X + b_1) + b_2\big)\Big) \odot \sigma'(W_1 X + b_1)\Big]_{j}.$$
These calculations can be implemented in MATLAB code as follows:

```matlab
% Compute gradients manually (error = y_pred - y_true, as in the
% backpropagation listing further below)
% Gradients for W3 and b3
dW3 = 2/batch_size * (error) * ReLU(W2 * ReLU(W1 * x + b1) + b2)';
db3 = 2/batch_size * sum((error), 2);
% Gradients for W2 and b2
dW2 = 2/batch_size * (W3' * (error)) .* ReLU_deriv(W2 * ReLU(W1 * x + b1) + b2) * ReLU(W1 * x + b1)';
db2 = 2/batch_size * sum((W3' * (error)) .* ReLU_deriv(W2 * ReLU(W1 * x + b1) + b2), 2);
% Gradients for W1 and b1
dW1 = 2/batch_size * (W2' * (W3' * (error) .* ReLU_deriv(W2 * ReLU(W1 * x + b1) + b2))) .* ReLU_deriv(W1 * x + b1) * x';
db1 = 2/batch_size * sum((W2' * (W3' * (error) .* ReLU_deriv(W2 * ReLU(W1 * x + b1) + b2))) .* ReLU_deriv(W1 * x + b1), 2);
```
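As a sanity check on these expressions (not part of the original derivation), a single entry of `dW3` can be compared against a central finite difference of the batch loss. This sketch assumes `W1`, `W2`, `W3`, `b1`, `b2`, `b3`, `x`, `y_true`, and `batch_size` are in the workspace:

```matlab
% Central finite-difference check for one entry of dW3
eps_fd  = 1e-6;
loss_fn = @(W3p) sum((W3p * ReLU(W2 * ReLU(W1 * x + b1) + b2) + b3 - y_true).^2, 'all') / batch_size;
W3_plus  = W3;  W3_plus(1,1)  = W3_plus(1,1)  + eps_fd;
W3_minus = W3;  W3_minus(1,1) = W3_minus(1,1) - eps_fd;
dW3_numeric = (loss_fn(W3_plus) - loss_fn(W3_minus)) / (2 * eps_fd);
% dW3_numeric should agree with dW3(1,1) up to finite-difference error
```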
Backpropagation is computationally efficient because it reuses quantities that have already been computed (the activations from the forward pass and the error signals propagated from deeper layers) and systematically applies the chain rule, so each parameter's gradient does not have to be evaluated from scratch as in the nested expressions above.
Consider the forward pass through each layer, expressed as follows:

$$Z_1 = W_1 X + b_1, \qquad A_1 = \sigma(Z_1),$$
$$Z_2 = W_2 A_1 + b_2, \qquad A_2 = \sigma(Z_2),$$
$$Z_3 = W_3 A_2 + b_3, \qquad \hat{Y} = Z_3,$$

which correspond to the variables z1, a1, z2, a2, z3, and y_pred in the MATLAB listing below.

In the backpropagation process, we start by computing the gradient of the loss function with respect to the pre-activation of the output layer. Since $\hat{Y} = Z_3$, this is the error signal at the output:

$$\delta_3 = \frac{\partial \mathcal{L}}{\partial Z_3} = \frac{2}{N}\left(\hat{Y} - Y\right).$$

First, we backpropagate the gradient of the loss with respect to the output-layer parameters $W_3$ and $b_3$. Given that $Z_3 = W_3 A_2 + b_3$, we obtain

$$\frac{\partial \mathcal{L}}{\partial W_3} = \delta_3\,A_2^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial b_3} = \sum_{j=1}^{N} \delta_{3,j}.$$

Second, we backpropagate the gradient of the loss with respect to the second hidden layer's parameters $W_2$ and $b_2$. Given that $Z_3 = W_3 A_2 + b_3$, $A_2 = \sigma(Z_2)$, and $Z_2 = W_2 A_1 + b_2$, the error signal at the second hidden layer is

$$\delta_2 = \frac{\partial \mathcal{L}}{\partial Z_2} = \left(W_3^{\top}\delta_3\right) \odot \sigma'(Z_2),$$

which gives

$$\frac{\partial \mathcal{L}}{\partial W_2} = \delta_2\,A_1^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial b_2} = \sum_{j=1}^{N} \delta_{2,j}.$$

Finally, we backpropagate the gradient of the loss with respect to the first layer's parameters $W_1$ and $b_1$. Given that $Z_2 = W_2 A_1 + b_2$, $A_1 = \sigma(Z_1)$, and $Z_1 = W_1 X + b_1$, the error signal at the first hidden layer is

$$\delta_1 = \frac{\partial \mathcal{L}}{\partial Z_1} = \left(W_2^{\top}\delta_2\right) \odot \sigma'(Z_1),$$

and therefore

$$\frac{\partial \mathcal{L}}{\partial W_1} = \delta_1\,X^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial b_1} = \sum_{j=1}^{N} \delta_{1,j},$$

where $\delta_{k,j}$ denotes the $j$-th column of $\delta_k$, so each bias gradient sums its error signal over the batch.
These expressions can be implemented in MATLAB code as follows:

```matlab
% Forward pass (i and batch_end index the current mini-batch in the
% surrounding training loop, which is not shown here)
x = X(i:batch_end, :)';
y_true = y(i:batch_end)';
z1 = W1 * x + b1;
a1 = ReLU(z1);
z2 = W2 * a1 + b2;
a2 = ReLU(z2);
z3 = W3 * a2 + b3;
y_pred = z3;
% Prediction error (residual); the gradients below only need this,
% not the scalar loss value itself
error = y_pred - y_true;
% Output-layer error signal: delta3 = dL/dz3
delta3 = 2/batch_size * (error);
% Gradients for W3 and b3
dW3 = delta3 * a2';
db3 = sum(delta3, 2);
% Backpropagate to second hidden layer
delta2 = (W3' * delta3) .* ReLU_deriv(z2);
% Gradients for W2 and b2
dW2 = delta2 * a1';
db2 = sum(delta2, 2);
% Backpropagate to first hidden layer
delta1 = (W2' * delta2) .* ReLU_deriv(z1);
% Gradients for W1 and b1
dW1 = delta1 * x';
db1 = sum(delta1, 2);
```
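To close the training loop, the computed gradients would then be applied to the parameters. A minimal sketch of a vanilla gradient-descent step, assuming a learning rate variable lr chosen by the user:

```matlab
% Vanilla gradient-descent parameter update (lr is an assumed learning rate)
W3 = W3 - lr * dW3;   b3 = b3 - lr * db3;
W2 = W2 - lr * dW2;   b2 = b2 - lr * db2;
W1 = W1 - lr * dW1;   b1 = b1 - lr * db1;
```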