In this article I want to make a brief introduction to normal equation method. The article is following the Coursera structure, summarizing the following aspects:
- some theoretical points
- two practical examples
- general case
- advantages and disadvantages between gradient descent and normal equation
I think, the best way to understand how normal equation works is to take a planetary example. Normal equation solves for θ analytically. So let’s consider a very simplified cost function J(θ) as a quadratic function of a real number θ:
Intuition: If 1D (θ ∈ ℝ):
J(θ) = aθ2 + bθ + c
Next, we have to minimize this quadratic function, so we have to set derivatives equal to zero and solve for θ:
Fortunately, most of the time, the things are a bit more challenging, and our θ is not just a scalar. 😀
For example, θ could be this n + 1 dimensional vector and the cost function is a function of this vector:
Intuition: θ ∈ ℝn+1
Now, we have to solve for: θ0, θ1,…, θn. This would give us the values of θ that minimize the cost function J.
For a better understanding of this method, I’ll illustrate the example provided by Coursera in their course about ML.
Therefore, let’s assume that we have m = 4 training examples:
In order to implement the normal equation, we have introduced an extra column that correspond to our extra feature x0 – the constant term. Now, we can construct the matrix X that will hold our independent features and also an Y vector for the dependent variable (the values we are trying to predict). Thus, X is going to be an m × (n+1) dimensional matrix and Y is going to be an m – dimensional vector, where m is the number of training examples and n is the number of features we have.
The value of θ that minimize the cost function for normal equation is given by the following expression:
where XT is the transpose of matrix X and (XTX)-1 is the inverse for matrix (XTX).
In the general case, let’s say we have m training examples (x(1), y(1)), …, (x(m), y(m)) and n features. Each of the training example x(i) will look like an n+1 dimensional vector:
Now, in order to construct the global matrix X (also called the design matrix) we have to take the transpose of the first training example (that is an n+1 dimensional vector) and put it as the first row of the global matrix. We have to do so for each feature in our training set (see below):
In this way, we obtained an m × (n+1) dimensional matrix.
Example. For a better understanding let’s assume we have only one feature other than x0 (which is always equal to 1). Our feature vectors – x(i) and design matrix – X (m × 2 dimensional matrix) will look like:
The vector y is obtained by taking all the observations for the variable we are trying to predict and stacking them up into an m – dimensional vector. Now we can just compute θ as in expression 1.1.
Broadly, this article could be a good start for understanding multiple normal equation. It is good to specify that if you are using this method, feature scaling is not necessary.
Finally, here are some advantages and disadvantages regarding gradient descent and normal equation: