Simple and Multiple Linear Regression: Maths, Calculating the Intercept and Coefficients, and Implementation Using Sklearn

Nitin · Published in Analytics Vidhya · 4 min read · Apr 8, 2020

Linear Regression is one of the oldest methods in probability and statistics. It works by fitting the line that best describes the relationship between a dependent and an independent variable. Let’s get familiar with some common terms.

Best Fit Line

Intercept (b0): the point where the best fit line crosses the y-axis, i.e. the predicted value of y when x = 0.

Slope (b1): the amount the y value changes for a one-unit change in x.

Let’s use a sample dataset so we can understand the maths behind Simple Linear Regression.

x=[1,2,4,3,5] #independent variable
y=[1,3,3,2,5] #dependent variable

Y=b0+b1*(x), where:

b0 = intercept
b1 = coefficient (slope) of the independent variable
x = independent variable
Y = target variable

b1 = sum((xi-mean(x))*(yi-mean(y))) / sum((xi-mean(x))^2)
b0 = mean(y) - b1*mean(x)

Consider xi-mean(x) as a and yi-mean(y) as b.
In the numerator we sum a*b over all n points,
and in the denominator we sum a*a over all n points.
With mean(x)=3 and mean(y)=2.8, the numerator comes to 8 and the denominator to 10, so b1 = 8/10 = 0.8.
b0=mean(y)-b1*(mean(x))
b0=2.8-0.8*(3)
b0=0.4
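
To make the arithmetic concrete, here is a quick sanity check of those two numbers (a minimal sketch; the only inputs are the five sample points above):

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

mean_x = sum(x) / len(x)  # 3.0
mean_y = sum(y) / len(y)  # 2.8

# numerator: sum of a*b, denominator: sum of a*a
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))  # 8.0
denominator = sum((xi - mean_x) ** 2 for xi in x)                       # 10.0

b1 = numerator / denominator  # 0.8
b0 = mean_y - b1 * mean_x     # 0.4
print(b1, b0)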

We now have the equation for Linear Regression for our X and Y values.

Y=0.4+0.8(X)

Let’s substitute some values of X to check our predictions.

##when x=1
Y=0.4+0.8*(1)=1.2
##when x=2
Y=0.4+0.8*(2)=2.0
##when x=4
Y=0.4+0.8*(4)=3.6
##when x=3
Y=0.4+0.8*(3)=2.8
##when x=5
Y=0.4+0.8*(5)=4.4

Comparing these predictions against the actual y values gives an RMSE of about 0.692.
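
Here is how that RMSE figure can be reproduced (a minimal sketch using only the standard library):

import math

y_true = [1, 3, 3, 2, 5]
y_pred = [1.2, 2.0, 3.6, 2.8, 4.4]

# RMSE = square root of the mean squared error
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
print(math.sqrt(mse))  # 0.6928...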

I have created a Python program to calculate the intercept, slope and predictions for Simple Linear Regression.

def slope(x1, y1):
    # b1 = sum((xi - mean_x) * (yi - mean_y)) / sum((xi - mean_x)^2)
    mean_x = sum(x1) / len(x1)
    mean_y = sum(y1) / len(y1)
    numerator = sum((i - mean_x) * (j - mean_y) for i, j in zip(x1, y1))
    denominator = sum((i - mean_x) ** 2 for i in x1)
    return numerator / denominator

def intercept(x2, y2):
    # b0 = mean(y) - b1 * mean(x)
    mean_x = sum(x2) / len(x2)
    mean_y = sum(y2) / len(y2)
    return round(mean_y - slope(x2, y2) * mean_x, 2)

def prediction(b0, b1, x):
    # Y = b0 + b1 * x
    return round(b0 + b1 * x, 2)

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]
prediction(intercept(x, y), slope(x, y), 4)
#returns 3.6
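
As a cross-check (not part of the program above), numpy’s polyfit fits the same least-squares line and should return the same slope and intercept:

import numpy as np

# a degree-1 polynomial fit returns [slope, intercept]
print(np.polyfit([1, 2, 4, 3, 5], [1, 3, 3, 2, 5], 1))  # [0.8 0.4]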

Now that we know how Simple Linear Regression works, Multiple Linear Regression is easy to pick up. There are only minor changes to the formula. Suppose we have two independent features, Age and Experience, and one dependent feature, Salary.

Y=b0+b1*x1+b2*x2
where:
b1=Age coefficient
b2=Experience coefficient
#with more than one feature, the coefficients are estimated jointly by least squares (the normal equations), not one at a time with the simple b1 formula above
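
For reference, the least-squares solution can be written in closed form as b = (XᵀX)⁻¹Xᵀy, where X is the design matrix with a leading column of ones for the intercept. A minimal numpy sketch, using the Age/Experience/Salary data from the next section:

import numpy as np

Age = [21, 25, 28, 30, 35]
Experience = [1, 4, 7, 9, 14]
Salary = [2000, 4000, 8000, 10000, 20000]

# design matrix: a column of ones (for the intercept) plus the two features
X = np.column_stack([np.ones(len(Age)), Age, Experience])
y = np.array(Salary)

# least-squares solve of X b = y, equivalent to b = (X^T X)^(-1) X^T y
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # [b0, b1, b2] — matches sklearn's intercept_ and coef_ below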

Since calculating the intercept and coefficients for Multiple Linear Regression by hand is complex and time-consuming, I am going to use sklearn’s LinearRegression model.

#implementing Multiple Linear Regression using sklearn
import pandas as pd
from sklearn.linear_model import LinearRegression as LR

Age = [21, 25, 28, 30, 35]
Experience = [1, 4, 7, 9, 14]
Salary = [2000, 4000, 8000, 10000, 20000]

data = pd.DataFrame({
    'Ages': Age,
    'Experiences': Experience,
    'Salary': Salary
})

# all columns except the target
X = data.loc[:, data.columns != 'Salary']
Y = data.Salary

model = LR()
model.fit(X, Y)

model.intercept_  # 70962.26415094335
model.coef_       # array([-3528.30188679, 5132.0754717])

def predictions_multiple(x, y):
    # Y = b0 + b1*Age + b2*Experience, with the fitted values plugged in
    pred = 70962.26415094335 + (-3528.30188679 * x) + (5132.0754717 * y)
    return round(pred, 0)

predictions_multiple(21, 1)   # 2000.0
predictions_multiple(25, 4)   # 3283.0
predictions_multiple(28, 7)   # 8094.0
predictions_multiple(30, 9)   # 11302.0
predictions_multiple(35, 14)  # 19321.0
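
Rather than hard-coding the fitted numbers, the same predictions come straight from the fitted model (model.predict is the standard sklearn call):

# predicts for all five rows at once; should match the values above
print(model.predict(X))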

There you have it: we now have a good idea of the maths that goes on behind Linear Regression. It is an effective model that is easy to implement and understand. To make the predictions more accurate, a few pre-processing techniques can reduce errors and help the model learn the underlying mapping function from the input to the output. Here are some tips to prepare data for Linear Regression:

  • Linear Assumption: The model benefits from a linear relationship between the dependent and independent features. If the relationship is non-linear, transforming the data (for example with a log transform) can help.
  • Remove Collinearity: Linear regression can overfit when multicollinearity is present. Consider removing the most highly correlated independent features.
  • Remove Noise: Clean the data properly and remove outliers; removing outliers from the output variable is especially helpful.
  • Rescale Inputs: The model will often make more reliable predictions if we standardize or normalize the data (see the sketch after this list).
  • Gaussian Distribution: The model benefits from Gaussian-distributed inputs; try a log or Box-Cox transformation.
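
As a minimal sketch of those last two tips (StandardScaler and PowerTransformer are standard sklearn preprocessors; the data below is the Age/Experience table from earlier):

import numpy as np
from sklearn.preprocessing import StandardScaler, PowerTransformer

X = np.array([[21, 1], [25, 4], [28, 7], [30, 9], [35, 14]], dtype=float)

# rescale inputs: zero mean, unit variance per column
X_scaled = StandardScaler().fit_transform(X)

# push features towards a Gaussian shape; Box-Cox requires strictly positive values
X_gaussian = PowerTransformer(method='box-cox').fit_transform(X)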

If you find this article helpful, or you think there is room for improvement, please tell me in the comments section below. If you are new to the Data Science field, visit my Github; I have some great tutorials and concepts to get you started.

Github: https://github.com/nitin689

Some Problems to practice Linear Regression: https://mathbitsnotebook.com/Algebra1/StatisticsReg/ST2LinRegPractice.html
