Assignment: Test a Logistic Regression Model

1 Introduction

    Logistic regression, also called a logit model, is used to model dichotomous outcome variables. 
    
    In the logit model the log odds of the outcome is modeled as a linear combination of the predictor 
    variables.
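
    As a minimal sketch of the relationship described above, the Python snippet below converts 
    between probabilities and log odds. The coefficients b0 and b1 and the predictor value x are 
    hypothetical and chosen only for illustration; they are not fitted values from this assignment.

```python
import math


def logit(p: float) -> float:
    """Log odds of a probability p in the open interval (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(eta: float) -> float:
    """Map a linear predictor (log odds) back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical intercept, slope and predictor value (illustration only).
b0, b1 = -1.0, 0.5
x = 2.0

eta = b0 + b1 * x    # linear combination of predictors = log odds
p = inv_logit(eta)   # implied probability of the outcome

print(f"log odds = {eta:.3f}, probability = {p:.3f}")
```

    The model is linear on the log-odds scale, so a fixed change in a predictor shifts the log odds 
    by a fixed amount, while the change in probability depends on the starting point.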

1.1 Problem Statement

    I am interested in learning how variables such as GRE (Graduate Record Exam scores), 
    GPA (grade point average) and prestige of the undergraduate institution affect admission into graduate 
    school. 
    
    The response variable, admit/don't admit, is a binary variable.
    

1.2 Description of the data

    For this assignment, I will expand on the above problem statement about getting admitted 
    into graduate school. 
    
    I will use a hypothetical dataset, which can be downloaded from:
    
    http://www.ats.ucla.edu/stat/data/binary.csv
    
    
    1. This dataset has a binary response (outcome, dependent) variable called 'admit' (1: admitted, 0: not admitted). 
    
    2. There are three predictor variables: 'gre', 'gpa' and 'rank'. I will treat the variables 'gre' 
       and 'gpa' as continuous. 
    
    3. The variable 'rank' takes on the values 1 through 4. Institutions with a rank of 1 have the highest 
       prestige, while those with a rank of 4 have the lowest.
    
    4. I will also show basic descriptive statistics for the entire dataset in the summary section. 
    

1.2.1 Viewing a few observations in the dataset

##    admit gre  gpa rank
## 1      0 380 3.61    3
## 2      1 660 3.67    3
## 3      1 800 4.00    1
## 4      1 640 3.19    4
## 5      0 520 2.93    4
## 6      1 760 3.00    2
## 7      1 560 2.98    1
## 8      0 400 3.08    2
## 9      1 540 3.39    3
## 10     0 700 3.92    2

1.2.2 Summary of Data

##      admit             gre             gpa             rank      
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000

1.2.3 Standard Deviation of Data

##       admit         gre         gpa        rank 
##   0.4660867 115.5165364   0.3805668   0.9444602

1.2.4 Two-way contingency table of categorical outcome and predictors

##      rank
## admit  1  2  3  4
##     0 28 97 93 55
##     1 33 54 28 12
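
    The contingency table is easier to read as admission rates per rank. The Python sketch below 
    only re-uses the counts printed above; it does not reload the dataset.

```python
# Counts copied from the contingency table: rank -> (not admitted, admitted)
counts = {1: (28, 33), 2: (97, 54), 3: (93, 28), 4: (55, 12)}

rates = {}
for rank, (rejected, admitted) in counts.items():
    total = rejected + admitted
    rates[rank] = admitted / total
    print(f"rank {rank}: {admitted}/{total} admitted = {rates[rank]:.3f}")

# Overall admission rate; this should agree with the mean of 'admit'
# reported in the data summary.
n_total = sum(r + a for r, a in counts.values())
n_admit = sum(a for _, a in counts.values())
overall_rate = n_admit / n_total
print(f"overall: {n_admit}/{n_total} admitted = {overall_rate:.4f}")
```

    The raw admission rate falls steadily from rank 1 to rank 4, which is the pattern the 
    logistic model is later asked to quantify while also accounting for 'gre' and 'gpa'.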

1.3 Building Logistic Regression Model

    First, we convert 'rank' to a factor to indicate that 'rank' should be treated as a categorical variable.
    
    Below is the conversion result.
##  Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
    We now build the logistic model.
    
    Here 'admit' is the outcome variable and 'gre','gpa', and 'rank' are predictors.
    
    Summary of the built model is shown below:
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4
    In the above results (model summary), we can see that:
    
    'gre', 'gpa', and the 'rank' levels 2, 3 and 4 (each compared with rank 1) are statistically 
    significant at the 5% level (p < 0.05).
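
    The z values and p-values in the summary can be recomputed by hand as a check. The Python 
    sketch below assumes the usual Wald test (estimate divided by standard error, compared with a 
    standard normal distribution) and copies the estimates and standard errors from the 
    coefficient table.

```python
import math

# (estimate, standard error) copied from the coefficient table
coefs = {
    "(Intercept)": (-3.989979, 1.139951),
    "gre": (0.002264, 0.001094),
    "gpa": (0.804038, 0.331819),
    "rank2": (-0.675443, 0.316490),
    "rank3": (-1.340204, 0.345306),
    "rank4": (-1.551464, 0.417832),
}


def wald_test(estimate: float, std_error: float) -> tuple[float, float]:
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p


results = {name: wald_test(est, se) for name, (est, se) in coefs.items()}
for name, (z, p) in results.items():
    flag = "significant" if p < 0.05 else "not significant"
    print(f"{name:<12} z = {z: .3f}  p = {p:.6f}  ({flag} at the 5% level)")
```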
    
    Below are the 95% profile-likelihood confidence intervals for the coefficient estimates in the above logit model.
## Waiting for profiling to be done...
##                     2.5 %       97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre          0.0001375921  0.004435874
## gpa          0.1602959439  1.464142727
## rank2       -1.3008888002 -0.056745722
## rank3       -2.0276713127 -0.670372346
## rank4       -2.4000265384 -0.753542605
    Below are the 95% confidence intervals based on the standard errors of the above logit model.
##                     2.5 %       97.5 %
## (Intercept) -6.2242418514 -1.755716295
## gre          0.0001202298  0.004408622
## gpa          0.1536836760  1.454391423
## rank2       -1.2957512650 -0.055134591
## rank3       -2.0169920597 -0.663415773
## rank4       -2.3703986294 -0.732528724
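
    The standard-error-based intervals above follow the familiar form estimate ± z × SE. The 
    Python sketch below reproduces them from the printed estimates and standard errors, assuming a 
    normal approximation with the 97.5% quantile (about 1.96).

```python
from statistics import NormalDist

# (estimate, standard error) copied from the model summary
coefs = {
    "(Intercept)": (-3.989979, 1.139951),
    "gre": (0.002264, 0.001094),
    "gpa": (0.804038, 0.331819),
    "rank2": (-0.675443, 0.316490),
    "rank3": (-1.340204, 0.345306),
    "rank4": (-1.551464, 0.417832),
}

# 97.5% quantile of the standard normal, for a two-sided 95% interval
z_crit = NormalDist().inv_cdf(0.975)

wald_ci = {
    name: (est - z_crit * se, est + z_crit * se)
    for name, (est, se) in coefs.items()
}

for name, (lower, upper) in wald_ci.items():
    print(f"{name:<12} {lower: .6f} {upper: .6f}")
```

    These intervals are symmetric around the estimate, whereas the profile-likelihood intervals 
    shown earlier need not be, which is why the two tables differ slightly.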

1.4 Odds Ratios

## (Intercept)         gre         gpa       rank2       rank3       rank4 
##   0.0185001   1.0022670   2.2345448   0.5089310   0.2617923   0.2119375
    Odds Ratios and 95% CI :
## Waiting for profiling to be done...
##                    OR       2.5 %    97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre         1.0022670 1.000137602 1.0044457
## gpa         2.2345448 1.173858216 4.3238349
## rank2       0.5089310 0.272289674 0.9448343
## rank3       0.2617923 0.131641717 0.5115181
## rank4       0.2119375 0.090715546 0.4706961
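
    The odds ratios and their intervals are the exponentiated coefficients and interval limits. 
    As a check, the Python sketch below exponentiates the printed estimates and the 
    profile-likelihood limits; the values are copied from the output and rounded as shown there.

```python
import math

# Coefficient estimates copied from the model summary
coef = {
    "(Intercept)": -3.989979,
    "gre": 0.002264,
    "gpa": 0.804038,
    "rank2": -0.675443,
    "rank3": -1.340204,
    "rank4": -1.551464,
}

# Profile-likelihood 95% limits on the log-odds scale, copied from the output
profile_ci = {
    "(Intercept)": (-6.2716202334, -1.792547080),
    "gre": (0.0001375921, 0.004435874),
    "gpa": (0.1602959439, 1.464142727),
    "rank2": (-1.3008888002, -0.056745722),
    "rank3": (-2.0276713127, -0.670372346),
    "rank4": (-2.4000265384, -0.753542605),
}

odds_ratios = {name: math.exp(b) for name, b in coef.items()}
or_ci = {name: (math.exp(lo), math.exp(hi)) for name, (lo, hi) in profile_ci.items()}

for name in coef:
    lower, upper = or_ci[name]
    print(f"{name:<12} OR = {odds_ratios[name]:.4f}  95% CI = ({lower:.4f}, {upper:.4f})")
```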

1.4.1 Odds Ratio interpretation

    From the above results, we can say that for every one-unit increase in 'gpa', holding 'gre' and 
    'rank' constant, the odds of being admitted to graduate school (versus not being admitted) 
    increase by a factor of 2.23.
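
    Odds ratios can be hard to picture, so the Python sketch below turns the fitted coefficients 
    into predicted admission probabilities for each rank. It assumes an applicant with the sample 
    mean 'gre' (587.7) and 'gpa' (3.390) from the data summary; that choice of applicant is mine 
    for illustration, not part of the model output.

```python
import math

# Coefficients copied from the model summary (rank 1 is the reference level)
intercept = -3.989979
b_gre = 0.002264
b_gpa = 0.804038
b_rank = {1: 0.0, 2: -0.675443, 3: -1.340204, 4: -1.551464}

# Illustrative applicant: sample means taken from the data summary
gre, gpa = 587.7, 3.390


def predicted_probability(rank: int) -> float:
    """Predicted probability of admission for the illustrative applicant."""
    eta = intercept + b_gre * gre + b_gpa * gpa + b_rank[rank]  # log odds
    return 1.0 / (1.0 + math.exp(-eta))


probs = {rank: predicted_probability(rank) for rank in b_rank}
for rank, p in probs.items():
    print(f"rank {rank}: predicted probability of admission = {p:.3f}")

# Odds ratios for more practical step sizes than one full unit
or_gre_100 = math.exp(100 * b_gre)   # per 100 GRE points
or_gpa_01 = math.exp(0.1 * b_gpa)    # per 0.1 GPA points
print(f"OR per 100 GRE points: {or_gre_100:.3f}")
print(f"OR per 0.1 GPA points: {or_gpa_01:.3f}")
```

    Rescaling matters here because a full one-point change in 'gpa' spans most of the observed 
    range (2.26 to 4.00), while a one-point change in 'gre' is negligible.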

1.5 Confounding for the Association

    To verify whether there is evidence of confounding in the association between the primary 
    explanatory variable and the response variable, we first remove one variable from the earlier 
    logistic regression model and then examine how the remaining estimates change.
    
    Let's remove 'gpa' and rebuild the model.
## 
## Call:
## glm(formula = admit ~ gre + rank, family = "binomial", data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5199  -0.8715  -0.6588   1.1775   2.1113  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.802365   0.672982  -2.678 0.007402 ** 
## gre          0.003224   0.001019   3.163 0.001562 ** 
## rank2       -0.721737   0.313033  -2.306 0.021132 *  
## rank3       -1.291305   0.340775  -3.789 0.000151 ***
## rank4       -1.602054   0.414932  -3.861 0.000113 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 464.53  on 395  degrees of freedom
## AIC: 474.53
## 
## Number of Fisher Scoring iterations: 4
    We can see from the summary that 'gre' and 'rank' (levels 2 through 4) are still statistically 
    significant at the 5% level.
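
    The two model summaries can also be compared directly through their deviances and AIC values. 
    The Python sketch below uses only the printed figures. It assumes that, for a binary outcome 
    like this one, AIC equals the residual deviance plus twice the number of estimated 
    coefficients, and that the drop in deviance from adding 'gpa' is compared with a chi-square 
    distribution on 1 degree of freedom.

```python
import math

# Residual deviance and number of estimated coefficients, from the two summaries
full_model = {"deviance": 458.52, "n_coef": 6}      # admit ~ gre + gpa + rank
reduced_model = {"deviance": 464.53, "n_coef": 5}   # admit ~ gre + rank


def aic(model: dict) -> float:
    """AIC for a binary-outcome logistic model: deviance + 2 * coefficients."""
    return model["deviance"] + 2 * model["n_coef"]


aic_full = aic(full_model)
aic_reduced = aic(reduced_model)

# Drop in deviance when 'gpa' is added back to the reduced model
lr_stat = reduced_model["deviance"] - full_model["deviance"]
df = full_model["n_coef"] - reduced_model["n_coef"]

# Upper tail of a chi-square distribution with exactly 1 degree of freedom
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print(f"AIC full = {aic_full:.2f}, AIC reduced = {aic_reduced:.2f}")
print(f"deviance drop = {lr_stat:.2f} on {df} df, p = {p_value:.4f}")
```

    The full model has the lower AIC, which is consistent with 'gpa' being significant in the 
    first summary.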
    
    We will now also calculate the odds ratios and 95% CIs for the new model.
    
## (Intercept)         gre       rank2       rank3       rank4 
##   0.1649085   1.0032291   0.4859076   0.2749117   0.2014823
## Waiting for profiling to be done...
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1649085 0.04314803 0.6074323
## gre         1.0032291 1.00125509 1.0052723
## rank2       0.4859076 0.26162841 0.8955744
## rank3       0.2749117 0.13958428 0.5327929
## rank4       0.2014823 0.08667895 0.4446645

1.5.1 Confounding effect result interpretation

    In the earlier (full) model, every one-unit increase in 'gre' multiplies the odds of being 
    admitted to graduate school (versus not being admitted) by 1.0022670. With 'gpa' removed, the 
    factor is 1.0032291, which is approximately the same on the per-point scale, and the two 95% 
    confidence intervals (1.0001-1.0044 and 1.0013-1.0053) overlap.
    
    This suggests that 'gpa' does not strongly confound the association between 'gre' and admission.
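
    Because 'gre' is measured in single points, its per-point odds ratio sits very close to 1 in 
    both models, which can hide differences. The Python sketch below compares the two fits on the 
    coefficient scale and per 100 GRE points instead. The 10% change-in-estimate threshold it 
    prints is a common rule of thumb that I am assuming for illustration; it is not part of the 
    assignment or the model output.

```python
import math

# 'gre' coefficients copied from the two model summaries
b_gre_full = 0.002264      # adjusted for gpa and rank
b_gre_reduced = 0.003224   # adjusted for rank only

# Relative change in the coefficient when 'gpa' is dropped
pct_change = 100.0 * (b_gre_reduced - b_gre_full) / b_gre_full

# Odds ratios per 100 GRE points, a more interpretable step than one point
or100_full = math.exp(100 * b_gre_full)
or100_reduced = math.exp(100 * b_gre_reduced)

THRESHOLD_PCT = 10.0  # assumed rule-of-thumb cut-off, for illustration only

print(f"change in 'gre' coefficient: {pct_change:.1f}%")
print(f"OR per 100 points: full = {or100_full:.3f}, reduced = {or100_reduced:.3f}")
print(f"exceeds {THRESHOLD_PCT:.0f}% rule of thumb: {abs(pct_change) > THRESHOLD_PCT}")
```

    Whether a shift of this size counts as meaningful confounding depends on the scale and 
    threshold chosen, so the per-point comparison above should be read with that in mind.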

1.6 Logistic Regression Results

    After adjusting for the other predictors ('gre' and 'rank'), each one-unit increase in 'gpa' 
    more than doubles the odds of being admitted to graduate school (versus not being admitted) 
    (OR = 2.23, 95% CI = 1.17-4.32, p = 0.015). 
    

 

 
