Objective of the Assignment:

There are several reasons for wanting to consider the effects of multiple variables on an outcome of interest.

  1. To understand the multiple factors that can influence the outcome.
  2. To recognize and adjust for confounding. If other factors that influence the outcome are unevenly distributed between the groups being compared, these other factors can distort the apparent association between the outcome and the primary exposure of interest; this is what is meant by confounding. When there is confounding, multivariable methods can be used to estimate the association between an exposure and an outcome after adjusting for, or taking into account, the impact of one or more confounding factors (other risk factors). In essence, multivariable analysis allows us to assess the independent effect of each exposure.
In [14]:
# Importing packages required for this tutorial
import pandas as pd
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
In [15]:
# In the following example we will use the Advertising dataset, which records the sales of a product
# (in thousands of units) together with its advertising budgets (in thousands of dollars) in three
# different media: TV, radio, and newspaper.

Importing Data

In [16]:
# Importing the dataset for this assignment
df_adv = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
In [17]:
# List of variables in the dataset
list(df_adv.columns.values)
Out[17]:
['TV', 'Radio', 'Newspaper', 'Sales']

Data View

In [18]:
# Viewing first few observations in the dataset
df_adv.head()
Out[18]:
      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9

Hypothesis

In [19]:
# 1. As advertising spend increases, sales will increase.

# 2. There are no confounding variables among the predictors in the dataset.

Regression Model for a Single Variable

In [20]:
# In line with Hypothesis 1, let us build a simple OLS model where the response variable is Sales
# and the predictor variable is TV.
# Note: no constant is added here, so this model is fit without an intercept
# (the multiple regression below will include one).

TV = df_adv['TV']
Y = df_adv['Sales']

SalesTVModel = sm.OLS(Y, TV).fit()

Summarizing the Model

In [21]:
# Model Summary

SalesTVModel.summary()
Out[21]:
OLS Regression Results
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     1733.
Date:                Sat, 12 Dec 2015   Prob (F-statistic):          3.52e-100
Time:                        02:45:09   Log-Likelihood:                -597.51
No. Observations:                 200   AIC:                             1197.
Df Residuals:                     199   BIC:                             1200.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
TV             0.0832      0.002     41.633      0.000         0.079     0.087
==============================================================================
Omnibus:                       20.228   Durbin-Watson:                   1.707
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               23.930
Skew:                          -0.846   Prob(JB):                     6.36e-06
Kurtosis:                       3.086   Cond. No.                         1.00
==============================================================================

Discussions

In [22]:
# The model summary reveals a significant (p-value < 0.05) positive relationship between Sales
# and advertising spend on TV.

# The fitted regression model (with no intercept) is:
# Sales (Y) = 0.0832 * TV_Spend (TV)
# Mathematically, the fitted values are:
Y_hat = 0.0832 * TV

## According to the model, each additional unit spent on TV advertising is associated with an
## increase of about 0.083 units of Sales.
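
Rather than reading values off the summary table, the coefficient, its p-value and confidence interval, and R-squared can be pulled straight from the fitted results object. A minimal sketch using standard statsmodels results attributes:

# Extract key quantities from the fitted single-variable model
tv_coef = SalesTVModel.params['TV']         # estimated slope for TV
tv_pval = SalesTVModel.pvalues['TV']        # p-value of the TV coefficient
tv_ci = SalesTVModel.conf_int().loc['TV']   # 95% confidence interval bounds
print("TV coefficient: %.4f (p-value: %.3g)" % (tv_coef, tv_pval))
print("95%% CI: [%.3f, %.3f]" % (tv_ci[0], tv_ci[1]))
print("R-squared: %.3f" % SalesTVModel.rsquared)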

Constructing the Multiple Regression

In [23]:
# Now we will build a regression model using multiple variables (in this case, TV and Radio).

# In line with Hypothesis 1, we would like to learn whether advertising spend on TV and Radio increases sales.

# We also want to learn whether there is any confounding between the two.
In [24]:
# Creating a subset

X = df_adv[['TV', 'Radio']]

y = df_adv['Sales']
In [25]:
# The multiple regression model describes the response as a weighted sum of the predictors:
print 'Sales = β0 + β1 × TV + β2 × Radio'
Sales = β0 + β1 × TV + β2 × Radio
In [26]:
## Fit an OLS model with an intercept on TV and Radio

# Add the intercept (constant) column to the design matrix
X = sm.add_constant(X)

# Fitting OLS model
MultMod = sm.OLS(y, X).fit()
In [27]:
# Model Summary
MultMod.summary()
Out[27]:
OLS Regression Results
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     859.6
Date:                Sat, 12 Dec 2015   Prob (F-statistic):           4.83e-98
Time:                        02:45:15   Log-Likelihood:                -386.20
No. Observations:                 200   AIC:                             778.4
Df Residuals:                     197   BIC:                             788.3
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          2.9211      0.294      9.919      0.000         2.340     3.502
TV             0.0458      0.001     32.909      0.000         0.043     0.048
Radio          0.1880      0.008     23.382      0.000         0.172     0.204
==============================================================================
Omnibus:                       60.022   Durbin-Watson:                   2.081
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              148.679
Skew:                          -1.323   Prob(JB):                     5.19e-33
Kurtosis:                       6.292   Cond. No.                         425.
==============================================================================
In [28]:
# The R-squared value is approximately 0.90, which means the model explains about 90% of the
# variability in Sales.
In [29]:
# Also, the p-values of the coefficients for TV, Radio, and the intercept are close to zero, and
# all of the coefficients are positive, which indicates significant positive relationships between
# each predictor and Sales.
In [30]:
# Therefore, final regression model will be:

print 'Sales = 2.9211 + 0.0458 * TV + 0.1880 * Radio'
Sales = 2.9211 + 0.0458 * TV + 0.1880 * Radio
In [31]:
# Use of the model:

# Suppose (recall that in this dataset the budgets are in thousands of dollars)
TV = 250
Radio = 40
# What would Sales be?

Sales = 2.9211 + 0.0458 * TV + 0.1880 * Radio

print "Predicted Sales:", Sales, "(thousands of units)"
Predicted Sales: 21.8911 (thousands of units)
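
The same prediction can be obtained from the fitted results object instead of hard-coding the coefficients; a minimal sketch (the regressors must be supplied in design-matrix order: const, TV, Radio):

# Predict with the fitted model; the leading 1 matches the constant column added by sm.add_constant
predicted = MultMod.predict([1, 250, 40])
print("Predicted Sales: %.4f" % predicted[0])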

Confounding

Confounding is a distortion (inaccuracy) in the estimated measure of association that occurs when the primary exposure of interest is mixed up with some other factor that is associated with the outcome.

There are three conditions that must be present for confounding to occur:

  1. The confounding factor must be associated with both the variable(s) of interest and the outcome.
  2. The confounding factor must be distributed unequally among the groups being compared.
  3. A confounder cannot be an intermediary step in the causal pathway from the exposure of interest to the outcome of interest.

As a rule of thumb, if the regression coefficient of the primary exposure from the simple linear regression model changes by more than 10% when a second variable is added to the model, then that second variable is said to be a confounder.

To check that, we will do the following.

From the single-variable model, we have:

Y = 0.0832 * TV

And from the multiple-variable model, we have:

Sales = 2.9211 + 0.0458 * TV + 0.1880 * Radio

Now we will plug the same TV value into both models and see how the predicted sales change once advertising spend on Radio is added.

In [32]:
TV = 100

Radio = 1
In [33]:
# Predicted sales from the single-variable model for TV = 100
Y = 0.0832 * TV

print "Sales, when TV spend is 100:", Y
Sales, when TV spend is 100: 8.32
In [34]:
print "Sales (Dollars), when Price of TV is 100 and Radio is 1:", Sales
Sales (Dollars), when Price of TV is 100 and Radio is 1: 21.8911

Comparing the two models, the TV coefficient drops from 0.0832 to 0.0458, which at first glance looks like far more than a 10% change. However, the comparison is not like-for-like: the single-variable model was fit without an intercept, which inflates its TV coefficient. Refitting the single-variable model with an intercept (see the sketch below) gives a TV coefficient of roughly 0.0475, which changes by less than 10% when Radio enters the model.

So we can conclude that there are no confounding variables among the TV and Radio advertising spends: per the multiple model, each additional unit of Radio spend is associated with about 0.188 additional units of Sales, holding TV fixed.
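A minimal sketch of the 10% rule of thumb, assuming both models include an intercept so their TV coefficients are directly comparable (the variable names here are ours, not from the original notebook):

# Refit the single-variable model WITH an intercept so its TV coefficient
# is comparable to the one from the multiple regression
X_tv = sm.add_constant(df_adv['TV'])
SimpleWithConst = sm.OLS(df_adv['Sales'], X_tv).fit()

b_simple = SimpleWithConst.params['TV']   # TV coefficient, single-variable model
b_multi = MultMod.params['TV']            # TV coefficient, multiple-variable model

pct_change = 100.0 * abs(b_simple - b_multi) / b_simple
print("TV coefficient: %.4f (simple) vs %.4f (multiple)" % (b_simple, b_multi))
print("Change: %.1f%% (rule of thumb: > 10%% suggests confounding)" % pct_change)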

Q-Q Plot of Residuals

In [35]:
res = MultMod.resid # residuals
In [40]:
from scipy import stats
%matplotlib inline  
In [48]:
fig = sm.qqplot(res, stats.t, fit=True, line='r')
plt.show()
[Figure: Q-Q plot of the model residuals against a fitted t distribution]

The plot shows the residuals lying mostly along the reference line, which suggests they are approximately normally distributed, consistent with the assumptions of the OLS model.
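
As a numeric companion to the visual check, we can run a normality test on the residuals (a sketch using scipy's D'Agostino-Pearson test; the Jarque-Bera figures in the model summary serve the same purpose):

# D'Agostino-Pearson test: the null hypothesis is that the residuals are normally distributed
stat, pval = stats.normaltest(res)
print("Normality test statistic: %.3f, p-value: %.3g" % (stat, pval))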

Standardized Residuals Plot

In [49]:
stdres = pd.DataFrame(MultMod.resid_pearson)
In [50]:
fig = plt.plot(stdres, 'o', ls='none')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation')
print fig
[<matplotlib.lines.Line2D object at 0x000000002102A240>]
[Figure: standardized residuals plotted against observation index, with a horizontal line at zero]

In the plot above, most of the standardized residuals fall within 1 standard deviation of the mean, and almost all fall within 2 standard deviations. This is what we expect if the residuals are roughly normal: about 95% of standardized residuals should lie within 2 standard deviations of the mean.
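
We can compute these proportions directly from the standardized residuals (a sketch using the `stdres` DataFrame defined above):

# Fraction of standardized residuals within 1 and within 2 standard deviations of zero
within_1 = (stdres.abs() <= 1).mean()[0]
within_2 = (stdres.abs() <= 2).mean()[0]
print("Within 1 SD: %.1f%%; within 2 SD: %.1f%%" % (100 * within_1, 100 * within_2))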

In [43]:
# Cook's distance
influence = MultMod.get_influence()

# c holds the Cook's distance for each observation; p holds the corresponding p-values
(c, p) = influence.cooks_distance

plt.stem(np.arange(len(c)), c, markerfmt=",")
Out[43]:
<Container object of 3 artists>
[Figure: stem plot of Cook's distance for each observation]
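
A common rule of thumb flags observations whose Cook's distance exceeds 4/n as potentially influential; a minimal sketch applying it to the distances computed above:

# Flag observations whose Cook's distance exceeds the conventional 4/n cutoff
threshold = 4.0 / len(c)
flagged = np.where(c > threshold)[0]
print("Observations above the 4/n threshold (%.4f): %d of %d" % (threshold, len(flagged), len(c)))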
In [45]:
from statsmodels.graphics.regressionplots import plot_leverage_resid2, influence_plot
plot_leverage_resid2(MultMod)
Out[45]:
[Figure: leverage vs. normalized squared residuals]

Most observations have leverage values close to zero, meaning that although there are some outlying observations, in general they do not have undue influence on the estimated parameters of the regression model.
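
The leverage values themselves can be pulled from the same influence object; a sketch using the conventional 2p/n cutoff, where p is the number of estimated parameters (here 3, including the intercept):

# Hat-matrix diagonal gives the leverage of each observation
leverage = influence.hat_matrix_diag
cutoff = 2.0 * len(MultMod.params) / len(leverage)  # 2p/n rule of thumb
high = np.where(leverage > cutoff)[0]
print("Observations with leverage above 2p/n (%.4f): %d" % (cutoff, len(high)))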

In [46]:
influence_plot(MultMod)
Out[46]:
[Figure: influence plot (studentized residuals vs. leverage)]