Normal Equation
Given a matrix equation, the normal equation is the one that minimizes the sum of the squared differences between the left and right sides.
Basics of Machine Learning Series
Introduction
Gradient descent is an algorithm used to reach an optimal solution iteratively using the gradient of the loss function (or cost function). In contrast, the normal equation is a method that solves for the parameters analytically: instead of approaching the solution iteratively, the solution for the parameter \(\theta\) is obtained directly by solving the normal equation.
Intuition
Consider a one-dimensional equation for the cost function given by,

\[ J(\theta) = a\theta^2 + b\theta + c \tag{1} \]
According to calculus, one can find the minimum of this function by computing the derivative and setting it equal to zero, i.e.

\[ \frac{d}{d\theta} J(\theta) = 0 \tag{2} \]
Similarly, extending (1) to the multi-dimensional setup, the cost function is given by,

\[ J(\theta_0, \theta_1, \cdots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \tag{3} \]
And similar to (2), the minimum of (3) can be found by taking partial derivatives w.r.t. the individual \(\theta_i \; \forall i \in \{0, 1, 2, \cdots, n\}\) and solving the equations obtained by setting them to zero, i.e.

\[ \frac{\partial}{\partial \theta_i} J(\theta_0, \theta_1, \cdots, \theta_n) = 0 \tag{4} \]
Through derivation one can find that \(\theta\) is given by,

\[ \theta = (X^T X)^{-1} X^T y \tag{5} \]
Feature scaling is not necessary for the normal equation method. Feature scaling is used to prevent skew in the contour plot of the cost function, which slows gradient descent, but the analytical solution via the normal equation does not suffer from that drawback.
Comparison between Gradient Descent and Normal Equation
Given m training examples, and n features
| Gradient Descent | Normal Equation |
|---|---|
| Proper choice of \(\alpha\) is important | \(\alpha\) is not needed |
| Iterative Method | Direct Solution |
| Works well even when n is large. Complexity of the algorithm is \(O(kn^2)\) | Slow for large n: needs to compute \((X^TX)^{-1}\), and the cost of computing the inverse is generally \(O(n^3)\) |
Generally, if the number of features is less than about 10,000, one can use the normal equation to get the solution; beyond that, the \(O(n^3)\) growth of the inversion makes the computation very slow.
Non-invertibility
Matrices that do not have an inverse are called singular or degenerate.
Reasons for non-invertibility:
- Redundant features, i.e. linearly dependent features (for example, one feature is a constant multiple of another).
- Too many features relative to training examples (\(m \le n\)).
Calculating the pseudo-inverse instead of the inverse also resolves the issue of non-invertibility.
Implementation
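The implementation can be sketched in NumPy as follows; the synthetic data and variable names are my own, not taken from the original article:

```python
import numpy as np

# Synthetic data: y = 4 + 3*x plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.1, size=100)

# Add the intercept column x0 = 1
X_b = np.c_[np.ones((100, 1)), X]

# Normal equation: theta = (X^T X)^-1 X^T y
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta)  # approximately [4, 3]
```

On well-conditioned problems `np.linalg.inv` is fine; for singular or near-singular \(X^TX\), `np.linalg.pinv` is the safer choice, as discussed in the non-invertibility section.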
Derivation of Normal Equation
Given the hypothesis,

\[ h_\theta(x) = \theta^T x \]
Let \(X\) be the design matrix, wherein the \(i^{th}\) row corresponds to the features of the \(i^{th}\) training example.
Since \(X\theta\) and \(y\) both are vectors, \((X\theta)^Ty = y^T(X\theta)\). So (7) can be further simplified as,
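The intermediate equations did not survive in this copy; a sketch of the remaining algebra, following the standard least-squares derivation in the same notation:

```latex
J(\theta) = \frac{1}{2m}\,(X\theta - y)^T (X\theta - y)
          = \frac{1}{2m}\left(\theta^T X^T X \theta - 2\,\theta^T X^T y + y^T y\right)

\nabla_\theta J(\theta) = \frac{1}{m}\left(X^T X \theta - X^T y\right) = 0
\quad\Rightarrow\quad X^T X \theta = X^T y
\quad\Rightarrow\quad \theta = (X^T X)^{-1} X^T y
```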
Normal Equation in Linear Regression
Author(s): Saniya Parveez
Machine Learning
Gradient descent is a very popular first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The Normal Equation is another way of doing this minimization: it minimizes without resorting to an iterative algorithm. The Normal Equation method minimizes J by explicitly taking its derivatives with respect to each \(\theta_j\) and setting them to zero.
Below is a dataset used to predict house prices:
Gradient Descent Vs Normal Equation
Gradient Descent
Normal Equation
Linear Regression with Normal Equation
Load the Portland data
Visualize The Area against the Price:
Visualize the Number of Rooms against the Price of the House:
Here, the relationship between the number of rooms and the price of the house appears to be linear.
Define Feature Matrix, and Outcome/Target Vector:
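This step might look like the following sketch; the five rows are illustrative values in the style of the Portland dataset, not necessarily the article's exact data:

```python
import numpy as np

# Illustrative rows: [area in sq ft, number of rooms]; prices in dollars
area_rooms = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
price = np.array([399900, 329900, 369000, 232000, 539900], dtype=float)

# Feature matrix with an intercept column x0 = 1, and the outcome/target vector
X = np.c_[np.ones(len(area_rooms)), area_rooms]
y = price
print(X.shape, y.shape)  # (5, 3) (5,)
```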
Visualize Cost Function:
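The cost function being visualized can be computed with a small helper like this (my own sketch, not the article's code):

```python
import numpy as np

def cost(theta, X, y):
    """Mean squared error cost J(theta) = 1/(2m) * ||X @ theta - y||^2."""
    m = len(y)
    residual = X @ theta - y
    return residual @ residual / (2 * m)

# Tiny check: a perfect fit (y = 1 + 2x) has zero cost
X = np.array([[1.0, 1.0], [1.0, 2.0]])
y = np.array([3.0, 5.0])
print(cost(np.array([1.0, 2.0]), X, y))  # 0.0
```

Plotting J over a grid of \(\theta\) values then gives the bowl-shaped surface typically shown at this step.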

Split Data
Normal Equation
Prediction using Normal Equation theta value
Prediction using Linear Regression
Here, the predictions from the Normal Equation and from the linear regression model are the same.
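This agreement can be sketched as follows. The article compares against a fitted LinearRegression model; `np.linalg.lstsq` plays that role here to keep the sketch dependency-free, and the house data is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical house data: intercept column, area (1000s of sq ft), rooms
m = 50
X = np.c_[np.ones(m), rng.uniform(0.8, 3.5, m), rng.integers(1, 6, m)]
y = X @ np.array([50.0, 130.0, 20.0]) + rng.normal(0, 1.0, m)  # price in $1000s

# Parameters from the normal equation vs. an ordinary least-squares solver
theta_normal = np.linalg.inv(X.T @ X) @ X.T @ y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_normal, theta_lstsq))  # True
```

Both approaches solve the same least-squares problem, so their fitted parameters, and therefore their predictions, coincide up to floating-point error.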
Normal Equation Non-Invertibility
A square matrix that does not have an inverse is called singular; a matrix is singular if and only if its determinant is zero.
The inverse of a matrix:

Problem due to Non-Invertibility:
How to solve if there are too many features?
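One remedy is to drop redundant features; another is to use the pseudo-inverse. A sketch with a deliberately singular \(X^TX\) caused by a duplicated feature (data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=20)
# The third column is exactly twice the second, so X^T X is singular
X = np.c_[np.ones(20), x, 2 * x]
y = 1 + 3 * x

# The ordinary inverse of X^T X does not exist here; the pseudo-inverse still works
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(np.allclose(X @ theta, y))  # True
```

`pinv` returns the minimum-norm solution: theta itself is not unique with linearly dependent features, but the fitted values `X @ theta` still reproduce y.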
Conclusion
Gradient Descent gives one way to minimize J. The Normal Equation is another way of doing the minimization: it minimizes without resorting to an iterative algorithm. But the Normal Equation is very slow if the dataset is very large.
Normal Equation in Linear Regression was originally published in Towards AI – Multidisciplinary Science Journal on Medium.
ML | Normal Equation in Linear Regression
The Normal Equation is an analytical approach to Linear Regression with a least-squares cost function. We can directly find the value of θ without using Gradient Descent. Following this approach is an effective and time-saving option when working with a dataset with a small number of features.
The Normal Equation is as follows:

\[ \theta = (X^T X)^{-1} X^T Y \]
In the above equation,
θ: the hypothesis parameters that fit the data best.
X: the input feature values of each instance.
Y: the output value of each instance.
Maths Behind the Equation
Given the hypothesis function

\[ h(\theta) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n \]
where,
n: the number of features in the data set.
\(x_0\): 1 (for vector multiplication).
Notice that this is a dot product between the θ and x values. So, for convenience, we can write it as:

\[ h(\theta) = \theta^T x \]
The motive in Linear Regression is to minimize the cost function:

\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \]
where,
\(x^{(i)}\): the input values of the \(i^{th}\) training example.
m: the number of training instances.
n: the number of data-set features.
\(y^{(i)}\): the expected result of the \(i^{th}\) instance.
Let us represent the cost function in vector form:

\[ \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2 \]

We have ignored \(1/2m\) here, as it makes no difference to the minimizer: it was included for mathematical convenience when deriving gradient descent, but it is no longer needed here.
\(x_j^{(i)}\): the value of the \(j^{th}\) feature in the \(i^{th}\) training example.
This can further be reduced to \(X\theta - y\).
But each residual value must be squared, and we cannot simply square the above expression, since the square of a vector/matrix is not equal to the square of each of its entries. To get the squared values, we multiply the vector by its transpose. So the final expression derived is \((X\theta - y)^T (X\theta - y)\).
Therefore, the cost function is

\[ J(\theta) = (X\theta - y)^T (X\theta - y) \]
So, now we obtain the value of θ by taking the derivative of the cost function, setting it to zero, and solving for θ:

\[ \theta = (X^T X)^{-1} X^T Y \]

This is the finally derived Normal Equation, with θ giving the minimum cost value.
[Machine Learning Notes 1.1] Solving the Normal Equations of Linear Regression
Overview of Linear Regression
Let us first consider the simplest case, i.e. there is only one input attribute, and linear regression tries to learn a function \(f(x_i) = w x_i + b\) such that \(f(x_i) \simeq y_i\) [1].
We now seek the \(\hat{w}^*\) that minimizes \(E(\hat{w})\); the minimum of \(E(\hat{w})\) is found by taking the derivative with respect to \(\hat{w}\) and setting it to zero.
Code Example
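The original code block was not preserved; a minimal sketch for the single-attribute case \(f(x) = wx + b\), using the closed-form least-squares solution (the toy data is my own):

```python
import numpy as np

# Simplest case: one input attribute, fit f(x) = w*x + b
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0   # exact line, no noise

# Closed-form least squares: w = sum((x - mean)*y) / sum((x - mean)^2)
x_mean = x.mean()
w = np.sum((x - x_mean) * y) / np.sum((x - x_mean) ** 2)
b = y.mean() - w * x_mean
print(w, b)  # 2.0 1.0
```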
How to Judge the Quality of the Model
Almost any data set can be modeled by the above method, so how do we judge the quality of these models? [2] Compare the two subplots in the figure below. If you run linear regression on the two data sets, you will get exactly the same model (a straight-line fit). Obviously the data are different, so how effective are the models on these two data sets, and how should we compare the fits? There is a way to compute the degree of agreement between the predicted sequence yHat and the true sequence y: compute the correlation coefficient of the two sequences.
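The degree-of-fit measure described above can be computed directly; the prediction values below are hypothetical placeholders:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])   # hypothetical predictions

# Correlation coefficient between predictions and ground truth
r = np.corrcoef(y_hat, y)[0, 1]
print(r)  # close to 1 for a good fit
```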
Solving with the Normal Equations When \(X^T X\) Is Non-Invertible
A non-invertible matrix is also called a singular or degenerate matrix. A non-invertible matrix usually arises for the following reasons [3-4.7]:
In addition, gradient descent can also be used to find an optimal solution when the matrix is non-invertible (me: the exact solution is the one obtained via the normal equation, the optimal solution the one obtained via gradient descent). A comparison of gradient descent and the normal equation is shown in the following table: [3-4.6]
Normal Equation in Python: The Closed-Form Solution for Linear Regression
Machine Learning from scratch: Part 3
Mar 23 · 5 min read
In this article, we will implement the Normal Equation which is the closed-form solution for the Linear Regression algorithm where we can find the optimal value of theta in just one step without using the Gradient Descent algorithm.
We will first recap with Gradient Descent Algorithm, then talk about calculating theta using a formula called Normal Equation and finally, see the Normal Equation in Action and plot predictions for our randomly generated data.
Machine Learning from scratch series:
Linear Regression from scratch in Python
Machine Learning from Scratch: Part 1
Locally Weighted Linear Regression in Python
Machine Learning from Scratch: Part 2
Gradient Descent Recap
Gradient Descent Algorithm
First, we initialize the parameter theta randomly or with all zeros. Then, we repeatedly apply the update \(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\) for every \(j\) simultaneously, until convergence.
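The recap above can be sketched in NumPy; the learning rate, iteration count, and toy data are my own choices:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n) design matrix (intercept column included), y: (m,) targets.
    """
    m, n = X.shape
    theta = np.zeros(n)                       # initialize with all zeros
    for _ in range(iters):
        gradient = X.T @ (X @ theta - y) / m  # gradient of 1/(2m)*||X@theta - y||^2
        theta -= alpha * gradient
    return theta

X = np.c_[np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])]
y = np.array([1.0, 3.0, 5.0, 7.0])            # y = 1 + 2*x
print(gradient_descent(X, y))  # approximately [1, 2]
```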
Normal Equation
Gradient Descent is an iterative algorithm, meaning you need to take multiple steps to get to the global optimum (to find the optimal parameters). It turns out that for the special case of Linear Regression there is a way to solve for the optimal values of the parameter theta in a single jump to the global optimum, without an iterative algorithm, and this method is called the Normal Equation. It works for Linear Regression but not for most other learning algorithms.
Normal Equation is the Closed-form solution for the Linear Regression algorithm which means that we can obtain the optimal parameters by just using a formula that includes a few matrix multiplications and inversions.
This is the Normal Equation:

\[ \theta = (X^T X)^{-1} X^T y \]
If you know about the matrix derivatives along with a few properties of matrices, you should be able to derive the Normal Equation for yourself.
You might wonder what happens if \(X^TX\) is a non-invertible matrix, which usually happens if you have redundant features, i.e. your features are linearly dependent, probably because the same feature is repeated twice. One thing you can do is find out which features are repeated and fix them; alternatively, you can use the np.linalg.pinv function in NumPy (the pseudo-inverse), which will also give you the right answer.
The Algorithm
Check the shapes of X and y so that the equation matches up.
Normal Equation in Action
Letβs take the following randomly generated data as a motivating example to understand the Normal Equation.
Here, n = 1, which means the matrix X has only 1 column, and m = 500 means X has 500 rows: X is a (500 × 1) matrix and y is a vector of length 500.
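The random data can be reproduced with something like the following; the article's exact distribution and seed are not shown, so these are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 500, 1
X = rng.uniform(-5, 5, size=(m, n))             # (500, 1) feature matrix
y = 3 * X[:, 0] + 7 + rng.normal(0, 1, size=m)  # linear signal plus noise
print(X.shape, y.shape)  # (500, 1) (500,)
```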
Find Theta Function
Letβs write the code to calculate theta using the Normal Equation.
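A sketch of such a function; the name find_theta and its signature are my guesses at the article's stripped code:

```python
import numpy as np

def find_theta(X, y):
    """Solve the normal equation theta = (X^T X)^-1 X^T y.

    Prepends an intercept column of ones to X before solving.
    """
    X_b = np.c_[np.ones(X.shape[0]), X]
    return np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Usage on noise-free data: recovers intercept 7 and slope 3
X = np.linspace(-5, 5, 500).reshape(-1, 1)
y = 3 * X[:, 0] + 7
print(find_theta(X, y))  # approximately [7, 3]
```

Checking the shapes first, as suggested above, avoids silent broadcasting bugs: X should be (m, n) and y should be (m,).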