Objective of the Assignment:
There are several reasons for wanting to consider the effects of multiple variables on an outcome of interest.
- To understand the multiple factors that can influence the outcome.
- To recognize and adjust for confounding. If other factors that influence the outcome are unevenly distributed between the groups, these other factors can distort the apparent association between the outcome and the primary exposure of interest; this is what is meant by confounding. When there is confounding, multivariable methods can be used to estimate the association between an exposure and an outcome after adjusting for, or taking into account, the impact of one or more confounding factors (other risk factors). In essence, multiple variable analysis allows us to assess the independent effect of each of the exposures.
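The adjustment idea can be made concrete with a minimal synthetic sketch (this is not the advertising data; the variable names, coefficients, and sample size are all invented for illustration). A confounder z drives both the exposure x and the outcome y, so the unadjusted slope of y on x is biased until z enters the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=n)                       # confounder
x = 0.8 * z + rng.normal(size=n)             # exposure, partly driven by z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true effect of x on y is 1.0

# Unadjusted model: y ~ x  (the slope absorbs part of z's effect)
X1 = np.column_stack([np.ones(n), x])
b_unadj = np.linalg.lstsq(X1, y, rcond=None)[0]

# Adjusted model: y ~ x + z  (the slope recovers the true effect of x)
X2 = np.column_stack([np.ones(n), x, z])
b_adj = np.linalg.lstsq(X2, y, rcond=None)[0]

print("unadjusted slope of x:", round(b_unadj[1], 2))  # biased upward
print("adjusted slope of x:  ", round(b_adj[1], 2))    # close to the true value 1.0
```

Once z is in the model, the coefficient of x estimates its independent effect, which is exactly what multivariable analysis is after.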
# Importing packages required for this tutorial
import pandas as pd
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
# In the following example we will use the Advertising dataset, which consists of
# product sales and the advertising budget in three different media: TV, radio, newspaper.
# Importing the dataset for the assignment
df_adv = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# List of variables in the dataset
list(df_adv.columns.values)
['TV', 'Radio', 'Newspaper', 'Sales']
# Viewing the first few observations in the dataset
df_adv.head()
# Hypotheses:
# 1. As advertising spend increases, sales will increase
# 2. There are no confounding variables in the dataset
Regression Model for a Single Variable
# In line with Hypothesis 1, let us build a simple OLS model where the response
# variable is Sales and the predictor variable is TV.
# Note: no constant is added here, so this model is fitted without an intercept.
TV = df_adv['TV']
Y = df_adv['Sales']
SalesTVModel = sm.OLS(Y, TV).fit()
Summarizing the Model
# Model Summary
SalesTVModel.summary()
[OLS summary output: Date: Sat, 12 Dec 2015; Prob (F-statistic): 3.52e-100;
coefficient table columns: coef, std err, t, P>|t|, 95.0% Conf. Int.]
# The model summary reveals a significant (p-value < 0.05) positive relationship
# between Sales and advertising spend on TV.
# Regression model: Sales (Y) = 0.0832 * TV_Spend (TV)
# Mathematically: Y = 0.0832 * TV
## According to this model, every additional unit spent on TV advertising is
## associated with an increase of about 0.0832 units of Sales.
Constructing the Multiple Regression
# Now we will build a regression model using multiple variables (in this case TV and Radio).
# In line with Hypothesis 1, we would like to learn whether advertising spend on TV
# and Radio will increase sales.
# We also want to learn whether either of the two variables confounds the other.
# Creating a subset
X = df_adv[['TV', 'Radio']]
y = df_adv['Sales']
# The multiple regression model describes the response as a weighted sum of the predictors:
print('Sales = β0 + β1 × TV + β2 × Radio')
Sales = β0 + β1 × TV + β2 × Radio
## Fit an OLS model with an intercept on TV and Radio
X = sm.add_constant(X)
# Fitting the OLS model
MultMod = sm.OLS(y, X).fit()
# Model Summary
MultMod.summary()
[OLS summary output: Date: Sat, 12 Dec 2015; Prob (F-statistic): 4.83e-98;
coefficient table columns: coef, std err, t, P>|t|, 95.0% Conf. Int.]
# The R-squared value is approximately 0.90, which means the model explains
# about 90% of the variability in Sales.
# Also, the p-values of the coefficients of TV, Radio and the intercept are close to zero,
# and all of the coefficients are positive, which means there is a significant positive
# relationship between each predictor and Sales.
# Therefore, the final regression model is:
print('Sales = 2.9211 + 0.0458 * TV + 0.1880 * Radio')
Sales = 2.9211 + 0.0458 * TV + 0.1880 * Radio
# Use of the model:
# Suppose TV = 250 and Radio = 40. What would Sales be?
TV = 250
Radio = 40
Sales = 2.9211 + 0.0458 * TV + 0.1880 * Radio
print("Sales:", Sales, "Dollars")
Sales: 21.8911 Dollars
Confounding is a distortion (inaccuracy) in the estimated measure of association that occurs when the primary exposure of interest is mixed up with some other factor that is associated with the outcome.
There are three conditions that must be present for confounding to occur:
The confounding factor must be associated with both the variable(s) of interest and the outcome.
The confounding factor must be distributed unequally among the groups being compared.
A confounder cannot be an intermediary step in the causal pathway from the exposure of interest to the outcome of interest.
As a rule of thumb, if the coefficient of the exposure from the simple linear regression model changes by more than 10% when a second variable X2 enters the OLS model, then X2 is said to be a confounder.
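This rule of thumb can be wrapped in a small helper. The coefficient values below are the TV coefficients reported by the two models in this tutorial; note, as a caveat, that the single-variable model was fitted without an intercept, so the two coefficients are not strictly on a comparable footing and the resulting percentage should be read cautiously:

```python
def coef_change_pct(b_single, b_multi):
    """Percent change in the exposure coefficient after the second variable enters."""
    return abs(b_single - b_multi) / abs(b_single) * 100

# TV coefficients from the two models above (the single-variable model has no
# intercept, so interpret this comparison cautiously)
change = coef_change_pct(0.0832, 0.0458)
print("TV coefficient changed by %.1f%%" % change)
```

A cleaner application of the rule would refit the single-variable model with an intercept before comparing coefficients.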
To check that, we will do the following:
From Single variable model, we have :
Y = 0.0832 * TV
And from Multiple variable model, we have :
Sales = 2.9211 + 0.0458 TV + 0.1880 Radio
Now we will use the same value for TV and see the effect on Sales of adding advertising spend on Radio.
TV = 100
Radio = 1

# Prediction from the single-variable model (Y = 0.0832 * TV)
Y = 0.0832 * TV
print("Sales (Dollars), when TV spend is 100 :", round(Y, 4))

Sales (Dollars), when TV spend is 100 : 8.32
# Prediction from the multiple-variable model
Sales = 2.9211 + 0.0458 * TV + 0.1880 * Radio
print("Sales (Dollars), when TV spend is 100 and Radio is 1:", round(Sales, 4))

Sales (Dollars), when TV spend is 100 and Radio is 1: 7.6891
In the multiple-variable model, every additional unit spent on Radio advertising increases predicted Sales by about 0.188 units. The coefficient of TV does change from 0.0832 to 0.0458 between the two models, but the single-variable model was fitted without an intercept, so the two coefficients are not on a directly comparable footing; the 10% rule should strictly be applied to models fitted on the same basis.

On that basis, we conclude that there is no confounding between TV and Radio advertising spend.
Q-Q Plot of residuals
res = MultMod.resid # residuals
from scipy import stats
%matplotlib inline
fig = sm.qqplot(res, stats.t, fit=True, line='r')
plt.show()
The plot reveals that the residuals mostly follow the reference line, so we can assume they are approximately normally distributed and that the model is a reasonable fit.
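The visual impression from the Q-Q plot can be backed up with a formal normality test. A minimal sketch using SciPy's Jarque-Bera test; `res` below is a synthetic stand-in series, whereas in the tutorial it would be the residuals `MultMod.resid`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
res = rng.normal(size=200)  # stand-in for MultMod.resid

# Jarque-Bera tests whether skewness and kurtosis match a normal distribution
jb_stat, jb_pvalue = stats.jarque_bera(res)
print("Jarque-Bera p-value:", round(jb_pvalue, 3))
# a p-value above 0.05 means we cannot reject normality of the residuals
```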
stdres = pd.DataFrame(MultMod.resid_pearson)
fig = plt.plot(stdres, 'o', ls='none')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation')
print(fig)
[<matplotlib.lines.Line2D object at 0x000000002102A240>]
In the above plot we can see that most of the residuals fall within 1 standard deviation of the mean, and almost all are within 2 standard deviations (roughly the 95% range expected for well-behaved residuals).
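This reading of the plot can be checked numerically: for a well-behaved model roughly 95% of standardized residuals should fall within 2 standard deviations. A sketch using a synthetic stand-in for `MultMod.resid_pearson`:

```python
import numpy as np

rng = np.random.default_rng(1)
stdres = rng.normal(size=200)  # stand-in for the standardized (Pearson) residuals

# fraction of standardized residuals with absolute value below 2
within_2sd = np.mean(np.abs(stdres) < 2)
print("fraction within 2 SD:", within_2sd)
```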
# Cook's distance
influence = MultMod.get_influence()
# c is the distance and p is the p-value
(c, p) = influence.cooks_distance
plt.stem(np.arange(len(c)), c, markerfmt=",")
<Container object of 3 artists>
from statsmodels.graphics.regressionplots import plot_leverage_resid2
plot_leverage_resid2(MultMod)
Most observations have leverage values close to zero, meaning that although there are some outlying observations, in general they do not have undue influence on the estimated parameters of the regression model.
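A common follow-up to the influence plots is to flag observations whose Cook's distance exceeds a conventional cutoff such as 4/n. A minimal NumPy sketch on synthetic data (the data, the planted outlier, and the cutoff are illustrative assumptions, not part of the original analysis):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)
y[0] += 10  # plant one influential outlier

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
h = np.diag(H)                         # leverage values
resid = y - H @ y                      # raw residuals
p = X.shape[1]
mse = resid @ resid / (n - p)

# Cook's distance: D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2
cooks = resid**2 / (p * mse) * h / (1 - h) ** 2

flagged = np.where(cooks > 4 / n)[0]
print("flagged observations:", flagged)
```

Observations exceeding the cutoff (the planted outlier among them) are candidates for a closer look, not for automatic removal.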