In this notebook, we'll build a simple linear regression model to predict 'Sales' from the 'TV' marketing budget, using 'TV' as the single predictor variable.
Let's start by loading the data:
import pandas as pd
advertising = pd.read_csv("tvmarketing.csv")
Now, let's check the structure of the advertising dataset.
# Display the first 5 rows
advertising.head()
# Display the last 5 rows
advertising.tail()
# Let's check the columns
advertising.info()
# Check the shape of the DataFrame (rows, columns)
advertising.shape
# Let's look at some statistical information about the dataframe.
advertising.describe()
# Conventional way to import seaborn
import seaborn as sns
# To visualise in the notebook
%matplotlib inline
# Visualise the relationship between the features and the response using scatterplots
sns.pairplot(advertising, x_vars=['TV'], y_vars='Sales',height=7, aspect=0.7, kind='scatter')
Equation of linear regression
$y = c + m_1x_1 + m_2x_2 + ... + m_nx_n$
In our case:
$y = c + m_1 \times TV$
The $m$ values are called the model coefficients or model parameters.
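For reference, in the single-feature case ordinary least squares has a well-known closed-form solution for these parameters (a standard result, not specific to this notebook; LinearRegression arrives at the same values):
$m_1 = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$, $\quad c = \bar{y} - m_1 \bar{x}$
where $\bar{x}$ and $\bar{y}$ are the sample means of TV and Sales.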
sklearn
Before you read further, it is good to understand the generic structure of modeling using the scikit-learn library. Broadly, the steps to build any model can be divided as follows (a compact sketch of this pattern is shown right after the list):
1. Prepare X (the feature matrix) and y (the response vector)
2. Split the data into training and test sets
3. Instantiate the model and fit it on the training data
4. Make predictions on the test set
5. Evaluate the model
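Here is a minimal, self-contained sketch of that generic fit/predict pattern, using made-up arrays purely for illustration (the names X_demo, y_demo, etc. are hypothetical; the actual TV/Sales workflow follows step by step below):
# Minimal sketch of the generic scikit-learn workflow (illustrative data, not the advertising dataset)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_demo = np.arange(10).reshape(-1, 1)   # 10 observations, 1 feature (2-D, as scikit-learn expects)
y_demo = 3.0 * X_demo.ravel() + 2.0     # a perfectly linear response, just for illustration
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=100)
model = LinearRegression()    # 1. instantiate
model.fit(X_tr, y_tr)         # 2. fit on the training data
preds = model.predict(X_te)   # 3. predict on the test data
print(model.intercept_, model.coef_, preds)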
# Putting feature variable to X
X = advertising['TV']
# Print the first 5 rows
X.head()
# Putting response variable to y
y = advertising['Sales']
# Print the first 5 rows
y.head()
# random_state is the seed used by the random number generator; it can be any integer.
# train_size=0.7 keeps 70% of the rows for training and 30% for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=100)
print(type(X_train))
print(type(X_test))
print(type(y_train))
print(type(y_test))
train_test_split  # Press Tab to auto-complete the name
# Press Shift+Tab to read the documentation
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
# It is a general convention in scikit-learn that observations are rows, while features are columns,
# so X must be two-dimensional. This reshape is needed only when you are using a single feature; in this case, 'TV'.
import numpy as np
# Convert the pandas Series to NumPy arrays and add a second axis: shape (n,) -> (n, 1)
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
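As an aside, the reshape above can be avoided by selecting the feature as a one-column DataFrame (double brackets) instead of a Series, so X is 2-D from the start. A small sketch of that equivalent alternative (the names X_2d, X_tr2, etc. are illustrative only):
# Alternative, equivalent approach: keep X 2-D from the start, so no manual reshape is needed
X_2d = advertising[['TV']]      # DataFrame with shape (n_samples, 1)
y_1d = advertising['Sales']     # Series with shape (n_samples,)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_2d, y_1d, train_size=0.7, random_state=100)
print(X_tr2.shape, X_te2.shape) # already (n, 1), ready for LinearRegression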
# import LinearRegression from sklearn
from sklearn.linear_model import LinearRegression
# Create a LinearRegression object called lr
lr = LinearRegression()
# Fit the model using lr.fit()
lr.fit(X_train, y_train)
# Print the intercept and coefficients
print(lr.intercept_)
print(lr.coef_)
$y = 6.989 + 0.0464 \times TV $
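As a quick sanity check, we can plug a hypothetical TV budget (say 100 units) into this equation by hand and compare it with lr.predict; the coefficients below are rounded from the output above, so a small difference is expected.
# Sanity check: evaluate the fitted line by hand for a hypothetical TV budget of 100
tv_budget = 100
manual_sales = 6.989 + 0.0464 * tv_budget   # rounded coefficients from the output above
model_sales = lr.predict([[tv_budget]])[0]  # exact prediction from the fitted model
print(manual_sales, model_sales)            # ~11.63 vs the model's exact value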
Now, let's use this equation to predict our sales.
# Making predictions on the testing set
y_pred = lr.predict(X_test)
type(y_pred)
# Actual vs Predicted
import matplotlib.pyplot as plt
c = [i for i in range(1, len(y_test) + 1)]  # index for the test observations
fig = plt.figure()
plt.plot(c,y_test, color="blue", linewidth=2.5, linestyle="-")
plt.plot(c,y_pred, color="red", linewidth=2.5, linestyle="-")
fig.suptitle('Actual and Predicted', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Sales', fontsize=16) # Y-label
# Error terms
c = [i for i in range(1, len(y_test) + 1)]  # index for the test observations
fig = plt.figure()
plt.plot(c,y_test-y_pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('y_test - y_pred', fontsize=16) # Y-label
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)
print('Mean Squared Error:', mse)
print('R-squared:', r_squared)
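Since the MSE is in squared units of Sales, it can also help to report the root mean squared error, which is back in the original units (a small addition to the metrics above):
# RMSE: error in the same units as Sales, easier to interpret than MSE
rmse = np.sqrt(mse)
print('Root Mean Squared Error:', rmse)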
# Scatter plot of actual vs predicted values (plt was already imported above)
plt.scatter(y_test, y_pred)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
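Finally, it can be instructive to overlay the fitted line on the TV-vs-Sales scatter for the test set. This is an optional extra visualization, not part of the original flow; it reuses X_test, y_test and y_pred defined above.
# Optional: visualize the fitted regression line against the test data
plt.figure()
plt.scatter(X_test, y_test, label='Actual')                       # TV budget vs actual Sales
order = X_test.ravel().argsort()                                  # sort by TV so the line draws cleanly
plt.plot(X_test.ravel()[order], y_pred[order], color='red', label='Fitted line')
plt.xlabel('TV')
plt.ylabel('Sales')
plt.legend()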