## Objective of the Assignment:

There are several reasons for wanting to consider the effects of multiple variables on an outcome of interest.

- To understand the multiple factors that can influence the outcome.
- To recognize and adjust for confounding. If other factors that influence the outcome are unevenly distributed between the groups, these other factors can distort the apparent association between the outcome and the primary exposure of interest; this is what is meant by confounding. When there is confounding, multivariable methods can be used to estimate the association between an exposure and an outcome after adjusting for, or taking into account, the impact of one or more confounding factors (other risk factors). In essence, multiple variable analysis allows us to assess the independent effect of each of the exposures.

```
# Importing packages required for this tutorial
import pandas as pd
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
```

```
# In the following example we will use the Advertising dataset, which consists of product sales and the
# corresponding advertising budgets in three different media: TV, radio, and newspaper.
```

## Importing Data

```
# Importing the data set for the assignment purpose
df_adv = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
```

```
# List of variables in the dataset
list(df_adv.columns.values)
```

## Data View

```
# Viewing first few observations in the dataset
df_adv.head()
```

## Hypothesis

```
# 1. Sales increase with advertising spend
# 2. There are no confounding variables in the dataset
```

## Regression Model for single variable

```
# In line with Hypothesis 1, let us build a simple OLS model where the response variable is Sales
# and the predictor variable is TV.
# Note: sm.OLS does not add an intercept automatically, so we add one with sm.add_constant.
TV = sm.add_constant(df_adv['TV'])
Y = df_adv['Sales']
SalesTVModel = sm.OLS(Y, TV).fit()
```

## Summarizing the Model

```
# Model Summary
SalesTVModel.summary()
```

## Discussions

```
# The model summary reveals a significant (p-value < 0.05) positive relationship between Sales and
# advertising spend on TV.
# The fitted regression model is approximately:
# Sales (Y) = 7.0326 + 0.0475 * TV_Spends (TV)
# Mathematically:
# Y = 7.0326 + 0.0475 * TV
# According to the model, every additional unit spent on TV advertising is associated with an
# increase in Sales of about 0.0475 units.
```

## Constructing the Multiple Regression

```
# Now we will build a regression model using multiple variables (in this case TV and Radio).
# In line with Hypothesis 1, we would like to learn whether advertising spend on TV and Radio increases sales.
# We also want to learn whether either of the two variables confounds the other.
```

```
# Creating a subset
X = df_adv[['TV', 'Radio']]
y = df_adv['Sales']
```

```
# The multiple regression model describes the response as a weighted sum of the predictors:
print('Sales = β0 + β1 × TV + β2 × Radio')
```

```
# Fit an OLS model with an intercept on TV and Radio
X = sm.add_constant(X)
# Fitting OLS model
MultMod = sm.OLS(y, X).fit()
```

```
# Model Summary
MultMod.summary()
```

```
# The R-squared value is approximately 0.90, which means the model explains about 90% of the
# variability in Sales.
```

```
# Also, the p-values of the coefficients of TV, Radio and the intercept are close to zero, and all of
# the coefficients are positive, which means there is a significant positive relationship in each case.
```

```
# Therefore, final regression model will be:
print('Sales = 2.9211 + 0.0458 * TV + 0.1880 * Radio')
```

```
# Use of the model:
# Suppose
TV = 250
Radio = 40
# What would Sales be?
Sales = 2.9211 + 0.0458 * TV + 0.1880 * Radio
print("Sales:", Sales, "(thousands of units)")
```

## Confounding

Confounding is a distortion (inaccuracy) in the estimated measure of association that occurs when the primary exposure of interest is mixed up with some other factor that is associated with the outcome.

There are three conditions that must be present for confounding to occur:

- The confounding factor must be associated with both the variable(s) of interest and the outcome.
- The confounding factor must be distributed unequally among the groups being compared.
- A confounder cannot be an intermediary step in the causal pathway from the exposure of interest to the outcome of interest.

As a rule of thumb, if the regression coefficient from the simple linear regression model changes by more than 10% when a second variable enters the OLS model, then that second variable is said to be a confounder.

To check that, we will do the following:

### From the single variable model, we have:

Sales = 7.0326 + 0.0475 * TV

### And from the multiple variable model, we have:

Sales = 2.9211 + 0.0458 * TV + 0.1880 * Radio

Now we compare the TV coefficient before and after Radio enters the model:

```
# Rule-of-thumb check: percent change in the TV coefficient when Radio is added
b_simple = 0.0475
b_multiple = 0.0458
print("Change in TV coefficient:", round(abs(b_simple - b_multiple) / b_simple * 100, 1), "%")
```

The TV coefficient changes by roughly 3.6%, which is well under the 10% rule of thumb.

### So we can conclude that there is no confounding between TV and Radio advertising spends.

### Q-Q Plot of residuals

```
res = MultMod.resid # residuals
```

```
from scipy import stats
%matplotlib inline
```

```
fig = sm.qqplot(res, stats.t, fit=True, line='r')
plt.show()
```

The plot shows that the residuals mostly follow the reference line, so the distributional assumption on the residuals appears reasonable and the model fit is acceptable.
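A formal normality test can back up the visual Q-Q check. This sketch runs the Jarque-Bera test from `scipy.stats` on a simulated residual vector (the data here are invented); the same call works on `res` from the model above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
resid = rng.normal(size=500)          # stand-in for model residuals

stat, pvalue = stats.jarque_bera(resid)
print("JB statistic:", round(stat, 2), " p-value:", round(pvalue, 3))
# A large p-value gives no evidence against normality of the residuals.
```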

### Standardized Residuals Plot

```
stdres = pd.DataFrame(MultMod.resid_pearson)
```

```
fig = plt.plot(stdres, 'o', ls='none')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation')
plt.show()
```

#### In the plot above, most residuals fall within 1 standard deviation of the mean, and almost all fall within 2 standard deviations (as expected, since roughly 95% of standardized residuals should lie within 2 standard deviations of the mean).
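The "almost all within 2 standard deviations" expectation follows from the normal distribution: about 95% of standardized residuals should fall in (-2, 2). A quick simulated check (the sample here is synthetic; on the real model, apply the same line to `stdres`):

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=10_000)               # stand-in for standardized residuals
share = np.mean(np.abs(z) < 2)
print(f"Share within 2 SD: {share:.1%}")  # close to 95%
```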

```
# Cook's distance
influence = MultMod.get_influence()
# c is the distance and p is the p-value
(c, p) = influence.cooks_distance
plt.stem(np.arange(len(c)), c, markerfmt=",")
plt.show()
```

```
from statsmodels.graphics.regressionplots import plot_leverage_resid2, influence_plot
plot_leverage_resid2(MultMod)
```

### Most observations are close to zero leverage, meaning that although there are outlying observations, in general they do not have undue influence on the estimation of the regression model's parameters.

```
influence_plot(MultMod)
```