**Tags**

Area Under Curve, AUC, Churn, Free, Generalized Linear Models, GLM, Logit, R, Regression, ROC Curve, Tutorial

**Attrition Analysis Using R**

For any firm in the world, attrition (churn) of its customers can be disastrous in the long term, and firms struggle constantly to maintain their customer base.

Analysts in customer relationship management (CRM) departments take advantage of modern tools to continuously perform data mining and statistical analysis on the data available in their databases. One such data mining project is "attrition analysis" (also known as "churn analysis"): developing a model that finds the relations between customer attrition and the variables causing it.

In the real world there can be hundreds (or maybe thousands) of such variables affecting attrition: price, service and product quality, advertising, competitors' promotions, distance of the household, family structure, salary, disposable income, job security, changes in taste, and so on. There may be only quantitative variables, only qualitative variables, or a mix of both; in most cases a mix of variables affects the customer's decision to stay with the existing brand or to leave (churn).

In short, the goal of attrition analysis is to give managers (of the marketing or CRM department) the ability to understand which variables cause attrition and how likely each customer is to churn.

Although attrition analysis models have specific requirements (e.g. finding the mix of important variables), in this tutorial I use a binomial logistic regression as the predictive statistical model to analyze the variables and their contribution to the outcome (churn).

I have a sample dataset in which the outcome is a binary event: the value 1 stands for the event that the customer churned (left), while 0 means the customer is still associated with the brand (or firm). The other variables I consider for this tutorial's purposes are the customer's Gender, Age, Income, FamilySize (household size), Education (in years; e.g. 16 years for an undergraduate degree, though this may differ between countries), Calls (how many times, to date, the customer has called the service center or customer-care department), and Visits (how many times the customer has visited the local service center to date).

You can download the sample dataset in CSV format from my box.net link: https://app.box.com/s/fb75bd1yecuvv2jlk4qx
```
# Loading data into memory
attrdata <- read.csv("attrition.csv")

# Fitting a Generalized Linear Model to the data
fitlogit <- glm(Churn ~ Gender + Age + Income + FamilySize + Education +
                  Calls + Visits, family = "binomial", data = attrdata)

# Summarizing results
summary(fitlogit)
```

```
##
## Call:
## glm(formula = Churn ~ Gender + Age + Income + FamilySize + Education +
## Calls + Visits, family = "binomial", data = attrdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.467 -0.811 0.230 0.745 2.185
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.8365 1.6205 -4.84 1.3e-06 ***
## Gender 1.3330 0.4069 3.28 0.00105 **
## Age -0.0255 0.0127 -2.00 0.04513 *
## Income 1.4998 0.4226 3.55 0.00039 ***
## FamilySize 0.7869 0.2332 3.37 0.00074 ***
## Education 0.2227 0.0829 2.69 0.00722 **
## Calls 0.0339 0.0162 2.08 0.03711 *
## Visits 0.4112 0.1396 2.95 0.00322 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 221.18 on 159 degrees of freedom
## Residual deviance: 156.61 on 152 degrees of freedom
## AIC: 172.6
##
## Number of Fisher Scoring iterations: 5
```

```
# All the variables in the summary are significant at least at the 95% confidence level.

# Analysis of deviance
round(x = anova(fitlogit), digits = 4)
```

```
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Churn
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 159 221
## Gender 1 14.03 158 207
## Age 1 3.04 157 204
## Income 1 7.39 156 197
## FamilySize 1 13.57 155 183
## Education 1 13.85 154 169
## Calls 1 3.27 153 166
## Visits 1 9.42 152 157
```
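The deviance table above shows how the residual deviance falls as each term enters the model, but it carries no p-values. Base R can attach a chi-squared test for each sequential drop via `anova(..., test = "Chisq")`. A minimal, self-contained sketch on simulated data (the variable names mirror the tutorial's dataset, but the data itself is made up here, since the original CSV may not be at hand):

```r
# Simulated stand-in for the attrition data
set.seed(1)
n <- 160
sim <- data.frame(
  Gender     = rbinom(n, 1, 0.5),
  Age        = round(runif(n, 18, 70)),
  FamilySize = sample(1:6, n, replace = TRUE)
)
eta <- -1 + 1.3 * sim$Gender - 0.03 * sim$Age + 0.8 * sim$FamilySize
sim$Churn <- rbinom(n, 1, plogis(eta))

fit <- glm(Churn ~ Gender + Age + FamilySize, family = "binomial", data = sim)

# test = "Chisq" adds a Pr(>Chi) column to the sequential deviance table
anova(fit, test = "Chisq")
```

Each p-value tests whether adding that term significantly reduces the deviance given the terms already in the model, which is why the order of terms in the formula matters.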

```
library(aod)

# Variance-covariance matrix of the estimates
round(x = vcov(fitlogit), digits = 4)
```

```
## (Intercept) Gender Age Income FamilySize Education
## (Intercept) 2.6259 -0.1550 -0.0031 -0.2834 -0.1863 -0.0946
## Gender -0.1550 0.1656 -0.0004 0.0279 0.0011 0.0025
## Age -0.0031 -0.0004 0.0002 -0.0003 -0.0004 0.0000
## Income -0.2834 0.0279 -0.0003 0.1786 0.0260 0.0053
## FamilySize -0.1863 0.0011 -0.0004 0.0260 0.0544 -0.0012
## Education -0.0946 0.0025 0.0000 0.0053 -0.0012 0.0069
## Calls -0.0062 0.0003 0.0000 0.0004 0.0009 -0.0003
## Visits -0.0703 0.0067 -0.0003 0.0113 0.0079 0.0002
## Calls Visits
## (Intercept) -0.0062 -0.0703
## Gender 0.0003 0.0067
## Age 0.0000 -0.0003
## Income 0.0004 0.0113
## FamilySize 0.0009 0.0079
## Education -0.0003 0.0002
## Calls 0.0003 0.0003
## Visits 0.0003 0.0195
```

```
# Coefficients of the variables in the fitted model
round(x = coef(fitlogit), digits = 4)
```

```
## (Intercept) Gender Age Income FamilySize Education
## -7.8365 1.3330 -0.0255 1.4998 0.7869 0.2227
## Calls Visits
## 0.0339 0.4112
```

```
# Confidence intervals using the profiled log-likelihood
round(x = confint(fitlogit), digits = 4)
```

`## Waiting for profiling to be done...`

```
## 2.5 % 97.5 %
## (Intercept) -11.2326 -4.8424
## Gender 0.5531 2.1568
## Age -0.0514 -0.0012
## Income 0.6981 2.3632
## FamilySize 0.3515 1.2718
## Education 0.0643 0.3914
## Calls 0.0028 0.0669
## Visits 0.1457 0.6965
```

```
# Confidence intervals using standard errors (Wald intervals)
round(x = confint.default(fitlogit), digits = 4)
```

```
## 2.5 % 97.5 %
## (Intercept) -11.0126 -4.6605
## Gender 0.5355 2.1305
## Age -0.0504 -0.0006
## Income 0.6715 2.3280
## FamilySize 0.3298 1.2440
## Education 0.0602 0.3852
## Calls 0.0020 0.0657
## Visits 0.1377 0.6848
```

```
# Calculating odds ratios for the variables
round(x = exp(coef(fitlogit)), digits = 4)
```

```
## (Intercept) Gender Age Income FamilySize Education
## 0.0004 3.7924 0.9748 4.4806 2.1965 1.2495
## Calls Visits
## 1.0344 1.5086
```

```
# We can say that:
# For a one-member increase in family size, the odds of churning (leaving the
# brand) increase by a factor of about 2.2.
# Similarly, one visit to the service center, if it led to dissatisfaction,
# would increase the odds of attrition by a factor of about 1.5.

# Calculating odds ratios with 95% confidence intervals
round(x = exp(cbind(OR = coef(fitlogit), confint(fitlogit))), digits = 4)
```

`## Waiting for profiling to be done...`

```
## OR 2.5 % 97.5 %
## (Intercept) 0.0004 0.0000 0.0079
## Gender 3.7924 1.7386 8.6435
## Age 0.9748 0.9499 0.9988
## Income 4.4806 2.0099 10.6249
## FamilySize 2.1965 1.4212 3.5674
## Education 1.2495 1.0665 1.4791
## Calls 1.0344 1.0028 1.0692
## Visits 1.5086 1.1569 2.0067
```
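As a quick sanity check, each odds ratio in the table is just the exponentiated coefficient, and its confidence limits are the exponentiated coefficient limits. For the FamilySize row, using the numbers from the earlier summary and profiled intervals:

```r
# Odds ratio for FamilySize = exp(coefficient) from the model summary
exp(0.7869)              # reproduces the OR of about 2.1965

# OR confidence limits = exp() of the profiled coefficient limits
exp(c(0.3515, 1.2718))   # reproduces about 1.4212 and 3.5674
```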

```
# Storing predicted probabilities from the GLM fit in an additional
# column "prob" in our data frame
attrdata$prob <- predict(fitlogit, type = "response")

# Plotting probabilities of churning for the customers in the dataset
library(pROC)
```

```
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
```

```
g <- roc(Churn ~ prob, data = attrdata)
g
```

```
##
## Call:
## roc.formula(formula = Churn ~ prob, data = attrdata)
##
## Data: prob in 75 controls (Churn 0) < 85 cases (Churn 1).
##
## Area under the curve: 0.842
```

An area under the curve of 0.842 indicates that the model is very good, if not excellent!

```
# Plotting the Area Under the Curve (AUC)
library(Deducer)
```

```
## Loading required package: ggplot2
## Loading required package: JGR
## Loading required package: rJava
## Loading required package: JavaGD
## Loading required package: iplots
```

```
##
## Please type JGR() to launch console. Platform specific launchers (.exe and .app) can also be obtained at http://www.rforge.net/JGR/files/.
##
##
## Loading required package: car
## Loading required package: MASS
##
##
## Note Non-JGR console detected:
## Deducer is best used from within JGR (http://jgr.markushelbig.org/).
## To Bring up GUI dialogs, type deducer().
##
##
## Attaching package: 'Deducer'
##
## The following object is masked from 'package:stats':
##
## summary.lm
```

```
modelfit <- rocplot(fitlogit)
modelfit
```

```
# The graph of the ROC curve and the Area Under Curve (AUC) value confirm
# a "very good predictive model".
```

For reference, the following table shows the standards followed by most researchers and analysts:

| AUC value | Model quality |
| --- | --- |
| 0.5 | No discriminative ability; the model needs further improvement |
| 0.5–0.7 | Acceptable, but overall not a very good model |
| 0.7–0.9 | Very good prediction model (most models fall in this range) |
| 0.9–1.0 | Excellent prediction model (rare) |
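The AUC itself can be computed in base R via the rank-sum (Wilcoxon) identity: it is the probability that a randomly chosen churner receives a higher predicted probability than a randomly chosen non-churner. A minimal sketch (the helper name `auc_rank` is mine, not part of any package):

```r
# AUC via the rank-sum identity: the probability that a random positive
# case's score exceeds a random negative case's score
auc_rank <- function(labels, scores) {
  r  <- rank(scores)                 # mid-ranks handle tied scores
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# A perfectly separated toy example gives AUC = 1
auc_rank(c(0, 0, 1, 1), c(0.1, 0.2, 0.8, 0.9))
```

Applied to the `Churn` column and the stored `prob` column, this reproduces what `pROC::roc` reports for a two-class outcome.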

Using the summary of the logistic model, and having confirmed the validity of the model through the various statistical tests above, the following equation for predicting churn is formed:

```
Probability of Churn = 1 / (1 + exp(-(-7.8365 + 1.3330 * Gender - 0.0255 * Age
                       + 1.4998 * Income + 0.7869 * FamilySize
                       + 0.2227 * Education + 0.0339 * Calls + 0.4112 * Visits)))
```
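The equation above can be wrapped in a small base-R helper; this is equivalent to calling `predict(fitlogit, newdata, type = "response")` on the fitted model. The coefficients are taken verbatim from the summary output, while the example customer's values are made up for illustration:

```r
# Predicted churn probability from the fitted coefficients
churn_prob <- function(Gender, Age, Income, FamilySize, Education, Calls, Visits) {
  eta <- -7.8365 + 1.3330 * Gender - 0.0255 * Age + 1.4998 * Income +
    0.7869 * FamilySize + 0.2227 * Education + 0.0339 * Calls + 0.4112 * Visits
  1 / (1 + exp(-eta))   # inverse logit, same as plogis(eta)
}

# Example: a hypothetical customer
churn_prob(Gender = 1, Age = 40, Income = 2, FamilySize = 3, Education = 16,
           Calls = 5, Visits = 2)
```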

This was one basic example of attrition analysis; real-life cases can be far more complex. I hope this helps learners develop advanced skills in building attrition analysis models for their firms, and hence supports management decision making.

Happy learning....Manoj Kumar

MANOJ KUMAR

said: In addition, we can set up different hypotheses to test on the data and build other models, for example:

H0: Young males are more tolerant of poor customer service than older females

H0: Younger males have higher churn rates

H0:

It would allow, for example, the organisation to focus its recruitment on low-churn customers and/or give more assistance to the higher risk groups.

However, knowing why they are at risk of churn could lead to more fundamental improvements.

Unfortunately, this almost certainly requires some form of attitudinal analysis which is unlikely to be possible using static customer data alone.

(Thanks to Norman Jessup at Analytic Advantage for his valuable suggestions and inputs)


Kalyan

said: Good explanation, thanks for sharing. Please tell me: after building the model, what are the next steps? How can we score live data to predict whether a customer will churn or not?


Navaneeth

said:Hi,

Excellent explanation.

Just a small clarification. In the data, the ratio between churn and non-churn is approximately 50/50, and hence the model's AUC of 84% is excellent. However, what if this ratio is skewed towards one class? For example, if the ratio between churn and non-churn is 80/20, would a model that predicts up to 90% be considered good?


MANOJ KUMAR

said: Are you talking about a 90% area under the curve? If so, yes, the model's prediction is excellent.


Szabolcs Máj

said: If you are working on heavily skewed data (10/90, closer to anomaly detection) you should switch to Precision/Sensitivity instead of Specificity/Sensitivity. The goal should always be to maximize the AUC, whichever pair you are using.


MANOJ KUMAR

said:Thanks so much Dear Szabolcs for the useful addition to the blog.

Best regards,


Hank

said:Hi Manoj, great work! One question concerning the equation for predicting the “churning rate”:

Shouldn’t the probability function be p(Y=1) = exp(η) / (1 + exp(η)) with Y=1 for churning, and p(Y=0) = 1 − p(Y=1) = 1 / (1 + exp(η)) with Y=0 for not churning?


Hank

said:Ah I missed the “-” there…


MANOJ KUMAR

said:🙂


shan

said:Hi..

Nice explanation.

Searching for good resources on boosting trees and ensemble learning.

Please help.

Thanks in anticipation.


MANOJ KUMAR

said: Sure, Shan. Will be happy to see some posts here on my blog, if possible.


Amit

said: Excellent explanation, very useful.


Hank

said:If you want to extend your analysis, you should check out this set of data:

http://www.sigkdd.org/kdd-cup-2009-customer-relationship-prediction


MANOJ KUMAR

said:Thanks very much Hank…


Jacquin Axel

said: Could you please repost the data?

Thanks


MANOJ KUMAR

said:Sure….

Here is the new data source…

http://idatasciencer.com/files/attrition.csv


Niall Wynne

said: Hi. Thanks for this. I’ve been looking for some sort of churn analysis explanation for weeks. I’m new to R and I’m not sure of the full range of online help. I have a question about the above: how do I do the section ‘# Storing predicted probabilities for GLM fits in an additional column “prob” in our dataframe’? I’m not sure exactly what it means. Any help would be greatly appreciated.


MANOJ KUMAR

said: The GLM fit produces predicted probabilities; we store these probabilities in another column (named “prob”).


Shilpa S

said: I was trying to access the dataset but couldn’t find it on idatasciencer.com. Kindly post the link.


DataScientist

said:Actually I discontinued that domain. Thanks for reminding, I will upload it elsewhere and post the link asap.


Shilpa S

said:Thank you!


DataScientist

said:You can still download from the box.net link:

https://app.box.com/s/fb75bd1yecuvv2jlk4qx


inanckilic

said:# loading data into memory

attrdata

# Fitting Generalized Linear Model to the data

fitlogit

# Summerizing results

summary(fitlogit)

I did not get it, sir.
