Attrition Analysis Using R

# For any firm in the world, attrition (churn) of customers can be disastrous in the long term, and firms constantly struggle to maintain their customer base.

# Analysts in the customer relationship management (CRM) departments of most firms take advantage of modern tools to continuously perform data mining and statistical analysis on the data available in their databases. One such data mining project is "attrition analysis" (also known as "churn analysis"), which is about developing a model to find the relationship between customer attrition and the variables that cause it.

# In the real world there could be hundreds (or even thousands) of variables that affect customer attrition: price, service and product quality, advertising, competitors' promotions, distance of the household, family structure, salary, disposable income, job security, and changing tastes, to name a few. Furthermore, the variables may be purely quantitative, purely qualitative, or a mix of both. In most cases, a mix of variables affects the customer's decision to stay with the existing brand or to leave (churn).

# In short, the goal of attrition analysis is to give managers (of the marketing or CRM department) the ability to understand which variables cause attrition and what the likelihood is that a given customer will churn.

# Although attrition analysis models have specific requirements (e.g. finding the mix of important variables), in this tutorial I am using binomial logistic regression as a predictive statistical model to analyze the variables and their contribution to the outcome (churn).

# I have a sample dataset in which the outcome is a binary event: 1 means the customer churned (left), while 0 means the customer is still associated with the brand (or firm).

# The other variables I am considering for this tutorial are the customer's Gender, Age, Income, FamilySize (household size), Education (in years; e.g. an undergraduate degree is 16 years, though this may differ across countries), Calls (how many times the customer has called the service center or customer care department to date), and Visits (how many times the customer has visited the local service center to date).

# You can download the sample dataset in CSV format from my box.net link:
# https://app.box.com/s/fb75bd1yecuvv2jlk4qx
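If the file is unavailable, a few synthetic rows with the same columns are enough to try the code below. The values (and in particular the numeric coding of Gender and Income) are invented for illustration, not taken from the sample file:

```r
# Synthetic rows mirroring the sample dataset's columns
# (values invented; Gender/Income coding is an assumption)
demo <- data.frame(
  Churn      = c(1, 0, 1),
  Gender     = c(1, 0, 1),
  Age        = c(34, 52, 28),
  Income     = c(2, 1, 3),
  FamilySize = c(4, 2, 5),
  Education  = c(16, 12, 18),
  Calls      = c(5, 1, 9),
  Visits     = c(2, 0, 4)
)
str(demo)
```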

# Loading data into memory (file name assumed for the downloaded CSV)
attrdata <- read.csv("attrdata.csv", header = TRUE)


# Fitting a generalized linear model (binomial logistic regression) to the data
fitlogit <- glm(Churn ~ Gender + Age + Income + FamilySize + Education +
                Calls + Visits, family = "binomial", data = attrdata)

# Summarizing results
summary(fitlogit)
## 
## Call:
## glm(formula = Churn ~ Gender + Age + Income + FamilySize + Education + 
##     Calls + Visits, family = "binomial", data = attrdata)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.467  -0.811   0.230   0.745   2.185  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -7.8365     1.6205   -4.84  1.3e-06 ***
## Gender        1.3330     0.4069    3.28  0.00105 ** 
## Age          -0.0255     0.0127   -2.00  0.04513 *  
## Income        1.4998     0.4226    3.55  0.00039 ***
## FamilySize    0.7869     0.2332    3.37  0.00074 ***
## Education     0.2227     0.0829    2.69  0.00722 ** 
## Calls         0.0339     0.0162    2.08  0.03711 *  
## Visits        0.4112     0.1396    2.95  0.00322 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 221.18  on 159  degrees of freedom
## Residual deviance: 156.61  on 152  degrees of freedom
## AIC: 172.6
## 
## Number of Fisher Scoring iterations: 5
# Clearly we can observe in the summary that all the variables are statistically significant at the 5% level (p < 0.05).
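The drop in deviance from the null model (221.18) to the fitted model (156.61) can itself be tested with a chi-squared likelihood-ratio test. A sketch using only the numbers printed by summary() above:

```r
# Likelihood-ratio test of the fitted model against the null model,
# using the deviances and degrees of freedom from the summary output
null_dev  <- 221.18   # null deviance on 159 df
resid_dev <- 156.61   # residual deviance on 152 df
p_value <- pchisq(null_dev - resid_dev, df = 159 - 152, lower.tail = FALSE)
p_value  # far below 0.05: the model improves on the null model
```

With the model object in hand, the same test is `with(fitlogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))`.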

# Analysis of variances
round(x = anova(fitlogit), digits = 4)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Churn
## 
## Terms added sequentially (first to last)
## 
## 
##            Df Deviance Resid. Df Resid. Dev
## NULL                         159        221
## Gender      1    14.03       158        207
## Age         1     3.04       157        204
## Income      1     7.39       156        197
## FamilySize  1    13.57       155        183
## Education   1    13.85       154        169
## Calls       1     3.27       153        166
## Visits      1     9.42       152        157
library(aod)
# Variance-covariance matrix of the estimated coefficients
round(x = vcov(fitlogit), digits = 4)
##             (Intercept)  Gender     Age  Income FamilySize Education
## (Intercept)      2.6259 -0.1550 -0.0031 -0.2834    -0.1863   -0.0946
## Gender          -0.1550  0.1656 -0.0004  0.0279     0.0011    0.0025
## Age             -0.0031 -0.0004  0.0002 -0.0003    -0.0004    0.0000
## Income          -0.2834  0.0279 -0.0003  0.1786     0.0260    0.0053
## FamilySize      -0.1863  0.0011 -0.0004  0.0260     0.0544   -0.0012
## Education       -0.0946  0.0025  0.0000  0.0053    -0.0012    0.0069
## Calls           -0.0062  0.0003  0.0000  0.0004     0.0009   -0.0003
## Visits          -0.0703  0.0067 -0.0003  0.0113     0.0079    0.0002
##               Calls  Visits
## (Intercept) -0.0062 -0.0703
## Gender       0.0003  0.0067
## Age          0.0000 -0.0003
## Income       0.0004  0.0113
## FamilySize   0.0009  0.0079
## Education   -0.0003  0.0002
## Calls        0.0003  0.0003
## Visits       0.0003  0.0195
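As a check, the standard errors reported by summary() are the square roots of the diagonal entries of this variance-covariance matrix; e.g. for the intercept and Gender (values copied from the table above):

```r
# Standard errors are sqrt() of the vcov() diagonal entries
round(sqrt(c(Intercept = 2.6259, Gender = 0.1656)), 4)
# → 1.6205 and 0.4069, matching the Std. Error column in summary()
```

With the fitted model object, the full vector is `sqrt(diag(vcov(fitlogit)))`.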
# Coefficients of the fitted model
round(x = coef(fitlogit), digits = 4)
## (Intercept)      Gender         Age      Income  FamilySize   Education 
##     -7.8365      1.3330     -0.0255      1.4998      0.7869      0.2227 
##       Calls      Visits 
##      0.0339      0.4112
# Confidence intervals based on the profiled log-likelihood
round(x = confint(fitlogit), digits = 4)
## Waiting for profiling to be done...
##                2.5 %  97.5 %
## (Intercept) -11.2326 -4.8424
## Gender        0.5531  2.1568
## Age          -0.0514 -0.0012
## Income        0.6981  2.3632
## FamilySize    0.3515  1.2718
## Education     0.0643  0.3914
## Calls         0.0028  0.0669
## Visits        0.1457  0.6965
# Confidence intervals based on standard errors (Wald intervals)
round(x = confint.default(fitlogit), digits = 4)
##                2.5 %  97.5 %
## (Intercept) -11.0126 -4.6605
## Gender        0.5355  2.1305
## Age          -0.0504 -0.0006
## Income        0.6715  2.3280
## FamilySize    0.3298  1.2440
## Education     0.0602  0.3852
## Calls         0.0020  0.0657
## Visits        0.1377  0.6848
# Calculating odds ratios for the variables
round(x = exp(coef(fitlogit)), digits = 4)
## (Intercept)      Gender         Age      Income  FamilySize   Education 
##      0.0004      3.7924      0.9748      4.4806      2.1965      1.2495 
##       Calls      Visits 
##      1.0344      1.5086
# We can say that:
# For a one-member increase in family size, the odds of churning (leaving the brand) increase by a factor of about 2.2, holding the other variables constant.
# Similarly, each additional visit to the service center (if it led to dissatisfaction) multiplies the odds of churning by about 1.5.

## Calculating odds ratios with 95% confidence interval
round(x = exp(cbind(OR = coef(fitlogit), confint(fitlogit))), digits = 4)
## Waiting for profiling to be done...
##                 OR  2.5 %  97.5 %
## (Intercept) 0.0004 0.0000  0.0079
## Gender      3.7924 1.7386  8.6435
## Age         0.9748 0.9499  0.9988
## Income      4.4806 2.0099 10.6249
## FamilySize  2.1965 1.4212  3.5674
## Education   1.2495 1.0665  1.4791
## Calls       1.0344 1.0028  1.0692
## Visits      1.5086 1.1569  2.0067
# Storing predicted probabilities from the GLM fit in an additional column "prob" in our data frame
attrdata$prob <- predict(fitlogit, type = "response")
# Computing the ROC curve for the predicted churn probabilities
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
g <- roc(Churn ~ prob, data = attrdata)
g
## 
## Call:
## roc.formula(formula = Churn ~ prob, data = attrdata)
## 
## Data: prob in 75 controls (Churn 0) < 85 cases (Churn 1).
## Area under the curve: 0.842
# An AUC of 0.842 indicates that the model is very good, if not excellent.

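The AUC has a useful probabilistic reading: it is the probability that a randomly chosen churner is assigned a higher predicted probability than a randomly chosen non-churner. A minimal base-R sketch of that rank-based definition, using toy scores (no pROC required):

```r
# Rank-based AUC: probability that a random positive case outranks
# a random negative case (equivalent to the Mann-Whitney U statistic)
auc_rank <- function(labels, scores) {
  r  <- rank(scores)            # ranks of all predicted scores
  n1 <- sum(labels == 1)        # number of positives (churners)
  n0 <- sum(labels == 0)        # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # → 0.75
```

On the attrition data, `auc_rank(attrdata$Churn, attrdata$prob)` would reproduce the 0.842 reported by pROC.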

# Plotting Area under Curve (AUC)
library(Deducer)
## Loading required package: ggplot2
## Loading required package: JGR
## Loading required package: rJava
## Loading required package: JavaGD
## Loading required package: iplots
## 
## Please type JGR() to launch console. Platform specific launchers (.exe and .app) can also be obtained at http://www.rforge.net/JGR/files/.
## 
## 
## Loading required package: car
## Loading required package: MASS
## 
## 
## Note Non-JGR console detected:
##  Deducer is best used from within JGR (http://jgr.markushelbig.org/).
##  To Bring up GUI dialogs, type deducer().
## 
## 
## Attaching package: 'Deducer'
## 
## The following object is masked from 'package:stats':
## 
##     summary.lm
modelfit <- rocplot(fitlogit)  # Deducer's rocplot() draws the ROC curve with the AUC
modelfit


# Clearly the graph of the ROC curve and the Area Under the Curve (AUC) value confirm that this is a very good predictive model.

# For reference, the following table shows rules of thumb followed by many researchers and analysts:

# AUC value    Model quality
#   0.5        No discriminating ability; the model needs further improvement
#   0.5-0.7    Acceptable, but overall not a very good model
#   0.7-0.9    Very good prediction model (most models fall within this range)
#   0.9-1.0    Excellent prediction model (rare)
# Using the summary of the logistic model, and having confirmed its validity through the statistical tests above, the following equation for predicting churn is formed:

# Probability of Churn = 1 / (1 + exp(-(-7.8365 - 0.0255 * Age + 0.0339 * Calls + 0.2227 * Education + 0.7869 * FamilySize + 1.3330 * Gender + 1.4998 * Income + 0.4112 * Visits)))
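Plugging illustrative values into this equation gives a point prediction. The customer profile below is invented (and the numeric coding of Gender and Income is an assumption), but the coefficients are those fitted above:

```r
# Churn probability for one hypothetical customer, using the fitted
# coefficients from summary(fitlogit); customer values are illustrative
coefs <- c(-7.8365, 1.3330, -0.0255, 1.4998, 0.7869, 0.2227, 0.0339, 0.4112)
#          Intercept Gender  Age     Income  FamSize Educ    Calls   Visits
x <- c(1, 1, 35, 2, 4, 16, 3, 2)  # leading 1 multiplies the intercept
prob_churn <- 1 / (1 + exp(-sum(coefs * x)))
round(prob_churn, 3)  # → 0.962, a customer very likely to churn
```

With the model object, `predict(fitlogit, newdata, type = "response")` performs the same computation.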
This was one basic example of attrition analysis; real-life cases can be far more complex.

I hope this helps learners develop advanced skills in building attrition analysis models for their firm and, in turn, supports management decision making.
Happy learning....

Manoj Kumar
