# Understanding Linear Regression

Linear Regression is often the very first topic people encounter when they start studying Machine Learning. It is a fundamental tool in supervised learning, with many applications, notably in predictive analysis. In this post, I will present the idea of Linear Regression and its problem formulation.

First of all, let's start with a real-world example. Imagine you're asked to do market research on apartment rents in Paris, and one of your tasks is to analyze the effect of an apartment's area on its rental cost. That means you have to find a mathematical relationship between those two variables.

Next, I'd like to show you some real data that I picked randomly from a website in France. Let's look at the data table below. In Machine Learning, such a collection of data is called a data set (or training set).

You'll see that the table has two columns: the first shows each apartment's area, measured in square meters, and the second shows the corresponding rental cost, in euros. Next, we draw a scatter plot of the data in the training set, as shown in the figure on the right.

In the scatter plot, each data point represents a row in the data table. The horizontal axis is the area in square meters, and the vertical axis is the rental cost in euros. Normally, regression analysis requires far more data than this, but for the purpose of the illustrations that follow, I'd like to show you only a portion of the data set. Keep in mind, however, that the full data set is much larger.
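To make the setting concrete, here is a tiny data set in code. The values below are made up for illustration only, mimicking the kind of table described above; they are not the real data from the post.

```python
# Purely illustrative data set (made-up values, not the real data).
areas = [25, 32, 40, 48, 55, 63, 70]                # x: apartment area in m^2
rents = [950, 1180, 1600, 1750, 2000, 2300, 2550]   # y: rental cost in euros

# Each (area, rent) pair is one data point, i.e. one row of the table.
M = len(areas)  # total number of data points in the training set
print(M)  # 7
```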

Let x denote the input area in square meters, and let y denote the rental cost in euros. Then, we have to find a function f(x) such that, given an input area x, f(x) is the predicted value of the rental cost y.

In Linear Regression, f(x) must be a linear function, thus we have: $f(x) = \phi_1 x + \phi_0$, where $\phi_0, \phi_1 \in \mathbb{R}$ are the parameters.
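In code, this model is a one-line function. The parameter values used in the example call are arbitrary, chosen only to show how a prediction is computed; they are not the optimal ones.

```python
def f(x, phi0, phi1):
    """Linear model: predicted rental cost (euros) for an area x (m^2)."""
    return phi1 * x + phi0

# Example with arbitrary (not yet optimal) parameters phi0=200, phi1=30:
print(f(40, 200, 30))  # 30 * 40 + 200 = 1400
```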

We say that f(x) models the linear relationship between the input variable x and the output variable y. Furthermore, the graph of this function is a straight line, as you can see in the scatter plot below.

Of course, different values of the parameters give different straight lines, which leads to a question: which straight line best fits the data points? In other words: how can we find the values of $\phi_0, \phi_1$ that give the best-fitting line?

Suppose that we have found a linear function f(x) that predicts y. For any data point $(x_i, y_i)$ in the training set, where $i = 1, 2, ..., M$ is the index of the data point and M is the total number of data points in the training set, let: $\hat y_i = f(x_i) = \phi_1 x_i + \phi_0$

That means $\hat y_i$ is computed from the linear function f given $x = x_i$. In this context, $x_i$ is our input variable, and $y_i$ is the actual output corresponding to $x_i$, because it is the true output observed in the training set. $\hat y_i$, on the other hand, is called the predicted output: it is the value of the function f at the input $x_i$.

Then, the difference between the predicted output and the actual output is called the error: $e_i = \hat y_i - y_i$

Obviously, if f(x) is the best estimator of y, this error must be as close to zero as possible. In other words, the best-fitting line, i.e. the best values of $\phi_0, \phi_1$, will minimize the difference between the actual outputs and the predicted outputs.

To build more intuition, let's look at the scatter plot below. We take a data point $(x_i, y_i)$ (the green point). In this example, $x_i$ equals 40 (m²), and $y_i$ equals 1600 (euros). Now look at the red point: it is the prediction for $(x_i, y_i)$ given the solution $f(x) = \phi_1 x + \phi_0$. Note that the coordinates of the red point are $(x_i, \hat y_i)$. Then, the difference between $\hat y_i$ and $y_i$ is the prediction error (the length of the two-headed pink arrow), and it is the value that needs to be as close to zero as possible.
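We can work through this example in code, using the data point from the plot ($x_i = 40$, $y_i = 1600$). The parameter values are hypothetical, chosen only to illustrate a non-zero error; they are not the best-fitting ones.

```python
# Hypothetical parameters for illustration (not the optimal ones).
phi0, phi1 = 250, 30

x_i, y_i = 40, 1600          # the green point from the plot
y_hat_i = phi1 * x_i + phi0  # predicted rent: 30 * 40 + 250 = 1450
error = y_hat_i - y_i        # prediction error e_i = y_hat_i - y_i
print(error)  # -150: this line under-predicts the rent by 150 euros
```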

In the same way, we can calculate the prediction error for every data point in the training set. The problem now becomes finding the linear function f(x) that minimizes those errors. How can we do that?

Let’s define a cost function: $C(\phi_0, \phi_1) = \frac{1}{M} \sum_{i=1}^{M} (\hat y_i - y_i)^2$

This function is also called the Mean Squared Error (MSE): it computes the average of the squared prediction errors over the training set. Obviously, the smaller the cost value, the better the fitting line.

Note that the cost function has two variables ($\phi_0, \phi_1$), and we have to find the values of $\phi_0, \phi_1$ that minimize C. Once we find those optimal values $\phi_{0_{best}}, \phi_{1_{best}}$, the linear function $f(x) = \phi_{1_{best}} x + \phi_{0_{best}}$ will best model the linear relationship between x and y.
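As a preview of that minimization (derived properly later), Simple Linear Regression has a well-known closed-form solution: the best slope is the covariance of x and y divided by the variance of x, and the best intercept makes the line pass through the means. A quick sketch, again on made-up data:

```python
def fit(xs, ys):
    """Closed-form least-squares fit for Simple Linear Regression."""
    M = len(xs)
    x_mean = sum(xs) / M
    y_mean = sum(ys) / M
    # Best slope: sample covariance of (x, y) over sample variance of x.
    phi1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
         / sum((x - x_mean) ** 2 for x in xs)
    phi0 = y_mean - phi1 * x_mean  # the best line passes through the means
    return phi0, phi1

# Illustrative (made-up) data:
xs = [25, 32, 40, 48, 55]
ys = [950, 1180, 1600, 1750, 2000]
phi0_best, phi1_best = fit(xs, ys)  # roughly 90.5 euros and 35.1 euros/m^2
```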

So, the idea of Linear Regression is not so complicated: it all comes down to basic Linear Algebra! The next step in this lesson is to explore how the cost function is minimized to find the best parameters $\phi_0, \phi_1$. Additionally, in this example we use only one input variable x (the apartment's area), which is why this is called "Simple Linear Regression"; in reality, we often have multiple input variables, in which case the problem is called "Multiple Linear Regression". How do we approach those?

If you really want to explore the full mystery of Linear Regression, I strongly recommend checking out this AI & Machine Learning course from scratch. Beyond Linear Regression, in this very comprehensive course you will also learn the fundamentals of Machine Learning such as Logistic Regression, Artificial Neural Networks, Deep Learning, Clustering, and more, STEP-by-STEP, with practical exercises on real-world problems. It also covers Fuzzy Logic and Evolutionary Computation. In other words, the course gives you a solid background in Artificial Intelligence, which is a must if you seriously want to start a career in AI & ML, the leading technologies of today.

Thank you for reading this post and please share it if you find it interesting.