In the ML Pipeline, Regression is the next step after Classification. We are going to discuss a specific kind of Regression called Linear Regression. Linear regression is an approach to building a model of the relationship between input and output numerical variables: it tells you how the typical value of the dependent variable changes when any one of the independent variables is varied. It is used when we have independent variables (features that are fed into the model, e.g. number of rooms, location of the house, year the house was built) and another variable that depends on them (e.g. the price of the house). As the values of the independent variables (features) change, the value of the dependent variable changes, and regression lets us find the value of the dependent variable for any given values of the independent variables. The way regression works and how its error is measured will be discussed in the conclusion. In a nutshell, Regression allows you to find Y given X.
Hey guys! This is Manas from csopensource.com – “Your one-stop destination for everything computer science”. I have been busy for a few days and so could not upload any content. Today I managed to find some free time, so I decided to write an article for you guys to read! So here we go..
Understanding using an Example
Back to the topic: let us take the example of predicting the cost of a new house. We cannot use classification here, as it simply does not make sense: we are predicting the cost of the house, not the category the house belongs to. First we need to input the features, i.e. all the factors which affect the final price of the house. Then we apply Linear Regression to find the relationship between those factors and the final cost of the house.
Using regression we can predict the cost of the house given its features. This is not classification: in classification we predict which category a given object belongs to, whereas here we are estimating the cost of the house, a continuous value that depends on the features.
Solve a Regression Problem
I could go on for hours about the way regression works, but it makes more sense to show you how to solve a given problem using regression. This keeps things simple for you, and it lets me make sure I'm speaking less jargon and more useful things. As mentioned before, we will work on the house-cost problem: we give certain parameters (features) to the model, and it gives us the predicted cost of a house with those features. This can be used as an estimate of the relative cost of the house.
First, we need a house-pricing dataset. An easily available one is the Boston housing dataset, which can be loaded from the Python module sklearn (note: `load_boston` was removed in scikit-learn 1.2, so this requires an older version). We use this Python code to load the dataset:
from sklearn.datasets import load_boston
The Boston dataset has 13 input features: CRIM (per-capita crime rate), ZN (proportion of residential land zoned for large lots), INDUS (proportion of non-retail business acres), CHAS (Charles River dummy variable), NOX (nitric-oxide concentration), RM (average number of rooms per dwelling), AGE (proportion of older owner-occupied units), DIS (weighted distance to employment centres), RAD (index of accessibility to radial highways), TAX (property-tax rate), PTRATIO (pupil-teacher ratio), B, and LSTAT (% lower-status population). The target, MEDV, is the median house value in units of $1,000. These are all the features that we have to specify to get the predicted cost of the house; I have written them out just to make it easy for you guys to understand what to specify as input to our model.
Before we write the code, open a terminal (or cmd) and enter the following:
- pip install numpy
- pip install pandas
- pip install scikit-learn
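Here is a minimal sketch of the kind of script described below. The variable name `test_array` and the `math.ceil` rounding follow the description; the training data here is made up purely for illustration, since `load_boston` is no longer available in recent scikit-learn versions:

```python
import math
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in data: [number of rooms, house age in years] -> price in $1000s.
# (Illustrative values only; the original post trains on the Boston dataset.)
X = np.array([[3, 40], [4, 20], [5, 10], [6, 5], [4, 35], [5, 25]])
y = np.array([15.0, 22.0, 30.0, 38.0, 18.0, 26.0])

model = LinearRegression()
model.fit(X, y)

# Sample input values used to test the trained model.
test_array = np.array([[4, 30]])
prediction = model.predict(test_array)[0]

# Multiply by 1000 to get full dollar amounts (the prices are in $1000s),
# and round up with math.ceil to keep the output readable.
print(math.ceil(float(prediction)) * 1000)
```

With the real Boston dataset, `X` and `y` would come from the loaded dataset's data and target arrays instead of the hand-made values above.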
Notice that we have a variable called ‘test_array’. This array contains the sample values that I have provided to serve as test input. The code is relatively short and easy to understand. You may also have noticed that in the last line I multiply the prediction by 1000: the dataset's prices are given in units of $1,000, so this converts the output into actual dollar amounts. I also convert the prediction to a float and apply ‘math.ceil’ to round it off and make it more readable. Now let us look at the output, i.e. the prediction the model makes from the features we give it.
Prediction or Output
With this we can conclude that a house with the feature values in our ‘test_array’ variable is predicted to cost about $21,000. We can experiment further with the model by changing the values in our ‘test_array’ variable.
Conclusion & More information
So, to sum it all up: Regression is a technique borrowed from Statistics. It involves finding the relationship between the independent variables and the dependent variable, and it does this using some simple high-school coordinate geometry: fitting a straight line to the data.
It creates a line using the formula y = mx + b, where y is the prediction, m is the slope (gradient) of the line, x is the input value, and b is the y-intercept of the line.
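For instance, with made-up slope and intercept values, the line predicts like this:

```python
# Line of best fit with illustrative (made-up) parameter values.
m = 2.5   # slope / gradient
b = 5.0   # y-intercept

def predict(x):
    """Predict y for a given input x using the line y = mx + b."""
    return m * x + b

print(predict(4))  # 2.5 * 4 + 5 = 15.0
```

In real linear regression, m and b are not chosen by hand; they are the values that minimise the error over the training data.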
This line is known as the line of best fit. The distance between the data points and the line is the error, and it is commonly measured with the Mean Squared Error (MSE): the average of (predicted value − actual value)² over all data points.
So here, for each data point we take the difference between the predicted value and the actual value (i.e. the distance between them), square it, and then average the squared differences over all points; that average is the Mean Squared Error. Regression has many variants, such as polynomial regression, but we have gone with Linear Regression as it is the simplest and most widely used. Using regression we can find the values of the dependent variable from the independent variables. Regression is also used for weather prediction, stock-price prediction, and so on. This algorithm comes under Supervised Learning and is very easy to learn. The line tries to fit the data in such a way that any further predictions are made according to the line, so the algorithm is most useful for data in which the variables have a roughly linear relationship.
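The Mean Squared Error described above can be computed in a couple of lines (the prices here are made-up example values):

```python
import numpy as np

actual = np.array([21.0, 34.0, 15.0])     # true house prices (in $1000s)
predicted = np.array([22.0, 30.0, 16.0])  # hypothetical model predictions

# Mean Squared Error: average of the squared differences.
mse = np.mean((predicted - actual) ** 2)
print(mse)  # (1 + 16 + 1) / 3 = 6.0
```

A lower MSE means the line sits closer to the data points, so fitting a regression line amounts to finding the m and b that minimise this value.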
I hope you learnt something from this post! Please comment if you have any queries, and stay up to date with csopensource.com for more exciting content. In the next post we will be discussing Support Vector Machines.