Credit Score in R

Because a customer may default on payments, lending firms such as banks and credit card companies use “Credit Scores” to evaluate the risk of lending money to a customer and to reduce losses from bad debts (defaults).

These credit scores are used to determine whether an applicant qualifies for a loan (or is eligible for a credit card) and, if so, at what interest rate (or with what credit limit).

Defaults and bad debts are not the only risk lenders face; they are also exposed to the risk of prepayment. When a borrower repays a loan early, the bank loses the interest revenue it expected to earn. Lenders therefore also use credit scores to determine which customers are likely to bring in the most revenue.

A credit score can also be described as the creditworthiness of an individual applicant.

Although there is no single standard way in which lenders calculate the creditworthiness of their prospective and existing customers, data mining techniques can be quite useful for analyzing it. The main point of care is the selection of the variables (independent, and as relevant as possible) to include when building the model.

Most lending firms choose independent variables such as the following (not an exhaustive list):

1. Borrower’s Age
2. Borrower’s Gender
3. Borrower’s Education Qualification
4. Borrower’s Job Type (e.g. Private, Govt, Professional such as Doctor, Lawyer etc)
5. Number of Years in Current Job
6. Borrower’s Total Experience
7. Borrower’s Current Income
8. Borrower’s Spouse and Family Details (e.g. Age, Qualifications, Whether Working, Income, Number of Children and their Age etc)
9. Borrower’s Previous Credit History (such as Current Obligations, EMIs, Payment Default Cases etc)
10. Borrower’s Health Conditions and Insurances
11. Borrower’s Total Worth (e.g. Savings and Assets)

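Creditworthiness is modelled as a linear function of variables such as those listed above, and can be written mathematically as:

$$\text{Creditworthiness} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$

where $X_1, \dots, X_n$ are the independent variables and $\beta_0, \dots, \beta_n$ are the coefficients to be estimated.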

The above equation is then fitted either as a linear regression, where the output variable is the “Credit Score” and the independent variables are the inputs, to obtain a final score for an individual customer; or as a logistic regression (binary output), to decide whether a customer should be given a loan (YES) or not (NO).
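The two formulations can be sketched in R as follows. This is a minimal illustration on simulated data, not the model built in this tutorial: the variable names (`age`, `income`, `years_in_job`), the coefficients used to simulate the data, and the 0.5 cut-off are all assumptions made for the example.

```r
# Simulated borrowers; every name and number here is an illustrative assumption.
set.seed(42)
n <- 500
borrowers <- data.frame(
  age          = round(runif(n, 21, 65)),
  income       = round(rnorm(n, mean = 50, sd = 15), 1),  # in thousands
  years_in_job = round(runif(n, 0, 20))
)

# Continuous target: a score that is linear in the inputs, plus noise.
borrowers$score <- 300 + 2 * borrowers$age + 3 * borrowers$income +
  5 * borrowers$years_in_job + rnorm(n, sd = 20)

# Binary target: approval drawn with a probability that rises with the inputs.
eta <- -4 + 0.03 * borrowers$age + 0.04 * borrowers$income +
  0.08 * borrowers$years_in_job
borrowers$approved <- rbinom(n, size = 1, prob = plogis(eta))

# Linear regression: predicts the score itself.
lin_fit <- lm(score ~ age + income + years_in_job, data = borrowers)

# Logistic regression: predicts the probability of approval.
log_fit <- glm(approved ~ age + income + years_in_job,
               data = borrowers, family = binomial)

# Score a new (hypothetical) applicant with both models.
new_applicant   <- data.frame(age = 35, income = 60, years_in_job = 8)
predicted_score <- predict(lin_fit, newdata = new_applicant)
p_approve       <- predict(log_fit, newdata = new_applicant, type = "response")
decision        <- ifelse(p_approve > 0.5, "YES", "NO")
```

The linear model returns a number on the score scale, while the logistic model returns a probability that still has to be turned into YES or NO with a cut-off.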

A third approach, widely used in recent times, is machine learning, for example with neural networks.

In this tutorial we will use the well-known “German Credit” data set (Asuncion et al, 2007) (source: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/statlog/german).

The data can be downloaded from the following link:


This data set has 300 bad loans and 700 good loans. We will model the data to decide whether or not to grant a loan.

Attribute description for the German Credit data set

Attribute 1: (qualitative) Status of existing checking account
A11 : … < 0 DM
A12 : 0 <= … < 200 DM
A13 : [… >= 200 DM / salary assignments for at least 1 year]
A14 : no checking account

Attribute 2: (numerical) Duration in months

Attribute 3: (qualitative) Credit history
A30 : no credits taken / all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account / other credits existing (not at this bank)

Attribute 4: (qualitative) Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation – does not exist?)
A48 : retraining
A49 : business
A410 : others

Attribute 5: (numerical) Credit amount

Attribute 6: (qualitative) Savings account/bonds
A61 : … < 100 DM
A62 : 100 <= … < 500 DM
A63 : 500 <= … < 1000 DM
A64 : … >= 1000 DM
A65 : unknown/ no savings account

Attribute 7: (qualitative) Present employment since
A71 : unemployed
A72 : … < 1 year
A73 : 1 <= … < 4 years
A74 : 4 <= … < 7 years
A75 : … >= 7 years

Attribute 8: (numerical) Installment rate in percentage of disposable income

Attribute 9: (qualitative) Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single

Attribute 10: (qualitative) Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor

Attribute 11: (numerical) Present residence since (no. of years)

Attribute 12: (qualitative) Property
A121 : real estate
A122 : if not A121 : building society savings agreement / life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property

Attribute 13: (numerical) Age in years

Attribute 14: (qualitative) Other installment plans
A141 : bank
A142 : stores
A143 : none

Attribute 15: (qualitative) Housing
A151 : rent
A152 : own
A153 : for free / employer provided

Attribute 16: (numerical) Number of existing credits at this bank

Attribute 17: (qualitative) Job
A171 : unemployed/ unskilled – non-resident
A172 : unskilled – resident
A173 : skilled employee / official
A174 : management/ self-employed/ highly qualified employee/ officer

Attribute 18: (numerical) Number of people being liable to provide maintenance for

Attribute 19: (qualitative) Telephone
A191 : none
A192 : yes, registered under the customer’s name

Attribute 20: (qualitative) foreign worker
A201 : yes
A202 : no
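The coded attributes above can be read into R as factors and decoded into readable labels. The sketch below assumes each record is space-separated and holds the 20 attributes followed by the class label (1 = Good, 2 = Bad). The column names are hypothetical labels chosen for this example (the raw file has no header), and the two sample rows are illustrative records written in that layout; in practice `text = sample_rows` would be replaced by the path to the downloaded file.

```r
# Hypothetical column names; the raw data file carries no header row.
col_names <- c("checking_status", "duration_months", "credit_history", "purpose",
               "credit_amount", "savings", "employment_since", "installment_rate",
               "personal_status_sex", "other_debtors", "residence_since",
               "property", "age_years", "other_installment_plans", "housing",
               "existing_credits", "job", "dependents", "telephone",
               "foreign_worker", "class")

# Two illustrative records in the assumed space-separated layout.
sample_rows <- "A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1
A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1 A191 A201 2"

# Coded (qualitative) attributes become factors; numerical ones stay numeric.
credit <- read.table(text = sample_rows, col.names = col_names,
                     stringsAsFactors = TRUE)

# Recode the class label: 1 = Good, 2 = Bad.
credit$class <- factor(credit$class, levels = c(1, 2), labels = c("Good", "Bad"))

# Decode one attribute (Attribute 1) using the description above.
checking_labels <- c(A11 = "< 0 DM",
                     A12 = "0 to < 200 DM",
                     A13 = ">= 200 DM or salary assignments",
                     A14 = "no checking account")
credit$checking_desc <- unname(checking_labels[as.character(credit$checking_status)])

str(credit)
```

Keeping the original codes as factor levels and adding decoded columns only where needed keeps the data compact while making summaries easier to read.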

Cost Matrix

This dataset requires use of a cost matrix (see below):
      1     2
1     0     1
2     5     0

(1 = Good, 2 = Bad)

The rows represent the actual classification and the columns the predicted classification. It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).
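The cost matrix can be applied in R roughly as follows. This is a sketch under stated assumptions: `total_cost` and `classify` are hypothetical helper names introduced here, and the threshold rule assumes the model outputs a probability that a customer is bad.

```r
# Cost matrix from the text: rows = actual class, columns = predicted class.
cost <- matrix(c(0, 1,
                 5, 0),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("Good", "Bad"),
                               predicted = c("Good", "Bad")))

# Total cost of a set of predictions: confusion counts weighted by the costs.
total_cost <- function(actual, predicted) {
  lv <- c("Good", "Bad")
  confusion <- table(actual    = factor(actual, levels = lv),
                     predicted = factor(predicted, levels = lv))
  sum(confusion * cost)
}

# Two naive baselines for the 700 good / 300 bad loans in this data set.
actual        <- rep(c("Good", "Bad"), times = c(700, 300))
cost_all_good <- total_cost(actual, rep("Good", 1000))  # each bad loan costs 5
cost_all_bad  <- total_cost(actual, rep("Bad", 1000))   # each good loan costs 1

# Cost-minimising cut-off for a predicted probability of "Bad":
# predicting Good has expected cost 5 * p_bad, predicting Bad has 1 * (1 - p_bad),
# so Bad is the cheaper prediction whenever p_bad > 1 / (1 + 5).
threshold <- cost["Good", "Bad"] / (cost["Good", "Bad"] + cost["Bad", "Good"])
classify  <- function(p_bad) ifelse(p_bad > threshold, "Bad", "Good")
```

With these costs, approving everyone costs 300 × 5 = 1500 and rejecting everyone costs 700 × 1 = 700, so a useful model has to come in below 700. The asymmetric costs also move the cut-off from 0.5 down to 1/6: a customer is classed as bad as soon as the predicted probability of being bad exceeds about 0.167.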