Project motivation

Credit scores inform us about a person’s reliability in repaying debt. For individuals, checking one’s credit score regularly can help with making better financial decisions. For banks, credit scores help determine the riskiness of a borrower. But what factors actually determine one’s credit score? Can banks detect “bad” debt and reduce losses based on these factors? We explore these questions in this project.


Dataset Description

The German credit scoring data is a dataset provided by Prof. Hofmann in the file german.data.

The dataset contains 20 predictor variables and 1000 observations; each person is classified as either a good or a bad credit risk.

Data Structure

Data contains:

  • Attribute 1: chk_acct (qualitative)

    • Status of existing checking account

    • A11 : … < 0 DM

    • A12 : 0 <= … < 200 DM

    • A13 : … >= 200 DM / salary assignments for at least 1 year

    • A14 : no checking account

  • Attribute 2: duration (numerical)

    • Duration in months
  • Attribute 3: credit_his (qualitative)

    • Credit history

    • A30 : no credits taken/all credits paid back duly

    • A31 : all credits at this bank paid back duly

    • A32 : existing credits paid back duly till now

    • A33 : delay in paying off in the past

    • A34 : critical account/other credits existing (not at this bank)

  • Attribute 4: purpose (qualitative)

    • Purpose

    • A40 : car (new)

    • A41 : car (used)

    • A42 : furniture/equipment

    • A43 : radio/television

    • A44 : domestic appliances

    • A45 : repairs

    • A46 : education

    • A47 : (vacation - does not exist?)

    • A48 : retraining

    • A49 : business

    • A410 : others

  • Attribute 5: amount (numerical)

    • Credit amount
  • Attribute 6: saving_acct (qualitative)

    • Savings account/bonds

    • A61 : … < 100 DM

    • A62 : 100 <= … < 500 DM

    • A63 : 500 <= … < 1000 DM

    • A64 : … >= 1000 DM

    • A65 : unknown / no savings account

  • Attribute 7: present_emp (qualitative)

    • Present employment since

    • A71 : unemployed

    • A72 : … < 1 year

    • A73 : 1 <= … < 4 years

    • A74 : 4 <= … < 7 years

    • A75 : … >= 7 years

  • Attribute 8: installment_rate (numerical)

    • Installment rate in percentage of disposable income
  • Attribute 9: sex (qualitative)

    • Personal status and sex

    • A91 : male : divorced/separated

    • A92 : female : divorced/separated/married

    • A93 : male : single

    • A94 : male : married/widowed

    • A95 : female : single

  • Attribute 10: other_debtor (qualitative)

    • Other debtors / guarantors

    • A101 : none

    • A102 : co-applicant

    • A103 : guarantor

  • Attribute 11: present_resid (numerical)

    • Present residence since
  • Attribute 12: property (qualitative)

    • Property

    • A121 : real estate

    • A122 : if not A121 : building society savings agreement/life insurance

    • A123 : if not A121/A122 : car or other, not in attribute 6

    • A124 : unknown / no property

  • Attribute 13: age (numerical)

    • Age in years
  • Attribute 14: other_install (qualitative)

    • Other installment plans

    • A141 : bank

    • A142 : stores

    • A143 : none

  • Attribute 15: housing (qualitative)

    • Housing

    • A151 : rent

    • A152 : own

    • A153 : for free

  • Attribute 16: n_credits (numerical)

    • Number of existing credits at this bank
  • Attribute 17: job (qualitative)

    • Job

    • A171 : unemployed/ unskilled - non-resident

    • A172 : unskilled - resident

    • A173 : skilled employee / official

    • A174 : management/ self-employed / highly qualified employee/ officer

  • Attribute 18: n_people (numerical)

    • Number of people being liable to provide maintenance for
  • Attribute 19: telephone (qualitative)

    • Telephone

    • A191 : none

    • A192 : yes, registered under the customer’s name

  • Attribute 20: foreign (qualitative)

    • Foreign worker

    • A201 : yes

    • A202 : no

  • Response: risk (qualitative)

    • 0 : good record

    • 1 : bad record

Data Loading

Data retrieved from the UCI Machine Learning Repository (the URL appears in the code below).

# download data
data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")

# data preprocessing
colnames(data) = c("chk_acct", "duration", "credit_his", "purpose","amount",
                   "saving_acct", "present_emp","installment_rate", "sex",
                   "other_debtor", "present_resid", "property", "age",
                   "other_install","housing", "n_credits", "job", "n_people",
                   "telephone", "foreign", "risk")

#str(data)

# recode response: raw codes are 1 = good, 2 = bad; shift to 0 = good, 1 = bad
data$risk = as.factor(data$risk - 1)
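In german.data the raw response column is coded 1 = good and 2 = bad, so subtracting 1 yields the 0/1 factor used throughout. A toy check of the recoding (standalone, not using the downloaded data):

```r
# raw codes 1 (good) / 2 (bad) become factor levels "0" / "1"
raw  <- c(1, 2, 2, 1)
risk <- as.factor(raw - 1)
levels(risk)   # "0" "1"
table(risk)
```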

Dataset Summary

summary(data)
##  chk_acct     duration    credit_his    purpose        amount      saving_acct
##  A11:274   Min.   : 4.0   A30: 40    A43    :280   Min.   :  250   A61:603    
##  A12:269   1st Qu.:12.0   A31: 49    A40    :234   1st Qu.: 1366   A62:103    
##  A13: 63   Median :18.0   A32:530    A42    :181   Median : 2320   A63: 63    
##  A14:394   Mean   :20.9   A33: 88    A41    :103   Mean   : 3271   A64: 48    
##            3rd Qu.:24.0   A34:293    A49    : 97   3rd Qu.: 3972   A65:183    
##            Max.   :72.0              A46    : 50   Max.   :18424              
##                                      (Other): 55                              
##  present_emp installment_rate  sex      other_debtor present_resid   property  
##  A71: 62     Min.   :1.000    A91: 50   A101:907     Min.   :1.000   A121:282  
##  A72:172     1st Qu.:2.000    A92:310   A102: 41     1st Qu.:2.000   A122:232  
##  A73:339     Median :3.000    A93:548   A103: 52     Median :3.000   A123:332  
##  A74:174     Mean   :2.973    A94: 92                Mean   :2.845   A124:154  
##  A75:253     3rd Qu.:4.000                           3rd Qu.:4.000             
##              Max.   :4.000                           Max.   :4.000             
##                                                                                
##       age        other_install housing      n_credits       job     
##  Min.   :19.00   A141:139      A151:179   Min.   :1.000   A171: 22  
##  1st Qu.:27.00   A142: 47      A152:713   1st Qu.:1.000   A172:200  
##  Median :33.00   A143:814      A153:108   Median :1.000   A173:630  
##  Mean   :35.55                            Mean   :1.407   A174:148  
##  3rd Qu.:42.00                            3rd Qu.:2.000             
##  Max.   :75.00                            Max.   :4.000             
##                                                                     
##     n_people     telephone  foreign    risk   
##  Min.   :1.000   A191:596   A201:963   0:700  
##  1st Qu.:1.000   A192:404   A202: 37   1:300  
##  Median :1.000                                
##  Mean   :1.155                                
##  3rd Qu.:1.000                                
##  Max.   :2.000                                
## 

Data Visualization

The first step is to understand the underlying data structure. The variables discussed below differ noticeably between the risky and non-risky groups.

EDA - checking account

Individuals with an empty checking account (group A11) are the riskiest, while individuals who keep money in their checking account (groups A12 and A13) are less risky. From the plot below, we can conclude that chk_acct is an important predictor.
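The plots in this section compare the share of bad records across groups; the same comparison can be tabulated directly. A minimal base-R sketch with toy values (the real call would use `data$chk_acct` and `data$risk`):

```r
# share of bad records (risk = 1) within each checking-account group
toy <- data.frame(chk_acct = c("A11", "A11", "A12", "A14", "A14", "A14"),
                  risk     = c(1,     1,     0,     0,     0,     1))
p <- prop.table(table(toy$chk_acct, toy$risk), margin = 1)
round(p, 2)   # rows sum to 1; the "1" column is each group's bad-record rate
```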

EDA - Duration

The riskier group tends to have longer loan durations than the less risky group. This difference suggests that duration will contribute substantially in the model-fitting stage.


EDA - Credit History

The credit_his variable differs substantially between good and bad records.

People with no credits taken or all credits paid back duly (A30) are riskier, as are people whose credits at this bank were all paid back duly (A31).

EDA - Purpose

The risk varies considerably across purposes. Radio/television (A43) is the most common purpose and carries lower risk, while borrowing for education (A46) tends to be riskier.

EDA - Saving Account

People who have a savings account but keep no money in it (group A61) are riskier than others, while people with more than 1000 DM in savings (group A64) are less likely to have bad credit.

EDA - Sex

The credit risk differences between the sex/personal-status groups are negligible. There is no strong evidence that any group is riskier than another, though divorced or separated males show slightly higher credit risk.

EDA - Age

The age difference between risky and non-risky people is small, though risky people tend to be somewhat younger.



Model Fitting

After getting familiar with the dataset, the following steps are taken.

Data preprocessing

The data is split randomly into a training set of 700 observations (70%) and a testing set of 300 observations (30%).

set.seed(123)
# draw the training indices once so that the test set is the exact complement
idx = sample(nrow(data), size = round(0.7 * nrow(data)))
train = data[idx, ]
test = data[-idx, ]

Logistic Regression

First build the logistic regression model.

  • Findings:

    • We can conclude that credit history, duration, checking and savings account status, installment rate, and purpose are the most important factors informing a person’s credit risk.

    • A good credit history and having a checking/savings account lead to better credit.

    • Duration has a negative impact on credit: the longer a person needs to repay a loan, the higher the probability of default.

    • Installment rate has a positive relationship with risk: a higher installment rate coincides with riskier individuals.

    • From the test prediction table, 41 risky people were predicted as non-risky while 22 non-risky people were categorized as risky. Since the aim is to reduce bad-debt losses, we are more concerned about misclassification of the risky group, which induces direct losses for banks.

    • The test accuracy in this example is 0.79 and the train accuracy is 0.801.

Results and code are shown below.

logit_model <- glm(risk ~ ., family = binomial, train)
summary(logit_model)$coefficients
##                        Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)       -1.1313739220 1.370505235 -0.82551594 4.090787e-01
## chk_acctA12       -0.2300316391 0.270689061 -0.84980028 3.954361e-01
## chk_acctA13       -0.7905931015 0.457856632 -1.72672633 8.421682e-02
## chk_acctA14       -1.6066696764 0.288732270 -5.56456568 2.628061e-08
## duration           0.0331562356 0.011129655  2.97908913 2.891067e-03
## credit_hisA31      0.3886107895 0.704817003  0.55136410 5.813841e-01
## credit_hisA32     -0.6437996628 0.571162420 -1.12717441 2.596687e-01
## credit_hisA33     -0.6974179943 0.629916567 -1.10715931 2.682251e-01
## credit_hisA34     -1.4985808031 0.574797372 -2.60714623 9.130036e-03
## purposeA41        -1.8709435982 0.458936755 -4.07669156 4.568102e-05
## purposeA410       -1.4477249834 0.990378793 -1.46178916 1.437990e-01
## purposeA42        -0.8020469082 0.328382527 -2.44241652 1.458930e-02
## purposeA43        -0.8683073660 0.310794631 -2.79383001 5.208785e-03
## purposeA44        -0.1960169526 0.967232477 -0.20265754 8.394027e-01
## purposeA45         0.3884673663 0.703061011  0.55253721 5.805804e-01
## purposeA46        -0.5849829967 0.513110375 -1.14007244 2.542561e-01
## purposeA48        -1.4515074995 1.357061985 -1.06959558 2.848014e-01
## purposeA49        -0.9211803035 0.423890758 -2.17315496 2.976866e-02
## amount             0.0001460432 0.000053184  2.74599889 6.032697e-03
## saving_acctA62    -0.4909556606 0.366239717 -1.34053091 1.800728e-01
## saving_acctA63    -0.9304392855 0.527763253 -1.76298611 7.790283e-02
## saving_acctA64    -1.7510920404 0.670342556 -2.61223463 8.995250e-03
## saving_acctA65    -1.2035607335 0.338948729 -3.55086369 3.839692e-04
## present_empA72     0.6837668787 0.540265477  1.26561275 2.056518e-01
## present_empA73     0.3936069049 0.522510470  0.75329956 4.512699e-01
## present_empA74    -0.4145354963 0.553699299 -0.74866538 4.540589e-01
## present_empA75    -0.1297280190 0.523454991 -0.24783032 8.042657e-01
## installment_rate   0.3943849329 0.109040387  3.61687027 2.981867e-04
## sexA92             0.0522930399 0.474402609  0.11022924 9.122276e-01
## sexA93            -0.6667295197 0.467245316 -1.42693676 1.535981e-01
## sexA94            -0.0179217466 0.552976628 -0.03240959 9.741454e-01
## other_debtorA102   0.5692649114 0.542228896  1.04986089 2.937821e-01
## other_debtorA103  -0.9705508423 0.514396683 -1.88677508 5.919058e-02
## present_resid      0.0244825354 0.107337266  0.22808980 8.195764e-01
## propertyA122       0.1890957315 0.319299425  0.59222071 5.537028e-01
## propertyA123       0.4269841684 0.286877712  1.48838390 1.366497e-01
## propertyA124       1.1096637808 0.648381501  1.71143652 8.700056e-02
## age                0.0018871447 0.010963330  0.17213244 8.633334e-01
## other_installA142 -0.0394384509 0.534419938 -0.07379674 9.411721e-01
## other_installA143 -0.2443294572 0.300172814 -0.81396264 4.156664e-01
## housingA152       -0.5696057202 0.294743344 -1.93254821 5.329188e-02
## housingA153       -1.0612312760 0.691793185 -1.53402968 1.250224e-01
## n_credits          0.2888989801 0.254380665  1.13569552 2.560840e-01
## jobA172           -0.3075138905 0.807656577 -0.38074833 7.033900e-01
## jobA173           -0.0291265746 0.778331071 -0.03742183 9.701487e-01
## jobA174           -0.2059283987 0.780777613 -0.26374783 7.919742e-01
## n_people           0.4692257063 0.309568019  1.51574348 1.295842e-01
## telephoneA192     -0.4318596955 0.251922462 -1.71425641 8.648164e-02
## foreignA202       -1.1362124212 0.759652238 -1.49570075 1.347317e-01
## ......... Prediction over the testing set.........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  300 
## 
##  
##              | test_pred_lr 
##    test$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       182 |        22 |       204 | 
##              |     0.892 |     0.108 |     0.680 | 
##              |     0.816 |     0.286 |           | 
##              |     0.607 |     0.073 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        41 |        55 |        96 | 
##              |     0.427 |     0.573 |     0.320 | 
##              |     0.184 |     0.714 |           | 
##              |     0.137 |     0.183 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       223 |        77 |       300 | 
##              |     0.743 |     0.257 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
## ......... Prediction over the training set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  700 
## 
##  
##              | train_pred_lr 
##   train$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       448 |        48 |       496 | 
##              |     0.903 |     0.097 |     0.709 | 
##              |     0.831 |     0.298 |           | 
##              |     0.640 |     0.069 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        91 |       113 |       204 | 
##              |     0.446 |     0.554 |     0.291 | 
##              |     0.169 |     0.702 |           | 
##              |     0.130 |     0.161 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       539 |       161 |       700 | 
##              |     0.770 |     0.230 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
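The code that produced the prediction tables above is not shown; a minimal sketch of the prediction step, assuming a 0.5 probability cutoff (the table layout suggests gmodels::CrossTable was used for printing), illustrated here on synthetic data so it runs standalone:

```r
# fit a logistic model and classify with an assumed 0.5 cutoff
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- factor(rbinom(200, 1, plogis(2 * toy$x)))

fit  <- glm(y ~ x, family = binomial, data = toy)
prob <- predict(fit, newdata = toy, type = "response")  # P(y = 1)
pred <- ifelse(prob > 0.5, 1, 0)

table(toy$y, pred)                                   # confusion matrix
acc <- mean(pred == as.integer(as.character(toy$y))) # overall accuracy
acc
```

For the credit model, the same pattern applied to `logit_model` with `newdata = test` would yield `test_pred_lr`.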

SVM

SVM can sometimes produce better results than logistic regression on this kind of classification task, so we try it next.

  • Did:

    • I built three SVM models: one with an RBF kernel (rbfdot), one with a linear kernel (vanilladot), and a caret-tuned RBF model (svm_train).
  • Expected:

    • SVM models may work better than the logistic model, with higher accuracy and fewer misclassified risky cases.
  • Found:

    • All three SVM models tend to misclassify high-risk individuals. The svm_train model misclassified the fewest high-risk individuals (44); the vanilladot and rbfdot kernels misclassified 45 and 57, respectively.

    • SVM does not provide better results than logistic regression here, which could lead to increased losses for the bank.

SVM - rbfdot

library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:modeltools':
## 
##     prior
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
svm_model <- ksvm(risk ~ ., data = train, kernel = "rbfdot")
## ......... Prediction over the testing set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  300 
## 
##  
##              | test_pred_svm 
##    test$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       196 |         8 |       204 | 
##              |     0.961 |     0.039 |     0.680 | 
##              |     0.775 |     0.170 |           | 
##              |     0.653 |     0.027 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        57 |        39 |        96 | 
##              |     0.594 |     0.406 |     0.320 | 
##              |     0.225 |     0.830 |           | 
##              |     0.190 |     0.130 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       253 |        47 |       300 | 
##              |     0.843 |     0.157 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
## ......... Prediction over the training set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  700 
## 
##  
##              | train_pred_svm 
##   train$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       483 |        13 |       496 | 
##              |     0.974 |     0.026 |     0.709 | 
##              |     0.813 |     0.123 |           | 
##              |     0.690 |     0.019 |           | 
## -------------|-----------|-----------|-----------|
##            1 |       111 |        93 |       204 | 
##              |     0.544 |     0.456 |     0.291 | 
##              |     0.187 |     0.877 |           | 
##              |     0.159 |     0.133 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       594 |       106 |       700 | 
##              |     0.849 |     0.151 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

The test accuracy in this example is 0.783 and the train accuracy is 0.823 (the diagonal cells of the tables above divided by the totals).
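The accuracies follow directly from the confusion matrices: correct predictions are the diagonal cells. For the rbfdot test table above:

```r
# test accuracy for the rbfdot SVM: diagonal (correct) cells over the total
acc_rbf <- (196 + 39) / 300
round(acc_rbf, 3)   # 0.783
```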

SVM - vanilladot

svm_model_van <- ksvm(risk ~ ., data = train, kernel = "vanilladot")
## ......... Prediction over the testing set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  300 
## 
##  
##              | test_pred_svm_van 
##    test$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       183 |        21 |       204 | 
##              |     0.897 |     0.103 |     0.680 | 
##              |     0.803 |     0.292 |           | 
##              |     0.610 |     0.070 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        45 |        51 |        96 | 
##              |     0.469 |     0.531 |     0.320 | 
##              |     0.197 |     0.708 |           | 
##              |     0.150 |     0.170 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       228 |        72 |       300 | 
##              |     0.760 |     0.240 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
## ......... Prediction over the training set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  700 
## 
##  
##              | train_pred_svm_van 
##   train$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       450 |        46 |       496 | 
##              |     0.907 |     0.093 |     0.709 | 
##              |     0.823 |     0.301 |           | 
##              |     0.643 |     0.066 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        97 |       107 |       204 | 
##              |     0.475 |     0.525 |     0.291 | 
##              |     0.177 |     0.699 |           | 
##              |     0.139 |     0.153 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       547 |       153 |       700 | 
##              |     0.781 |     0.219 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

The test accuracy in this example is 0.78 and the train accuracy is 0.796.

SVM - Train

svm_model_train <- caret::train(risk ~ ., data = train, 
                                  method = "svmRadialSigma", 
                                  metric = "Kappa")
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
## ......... Prediction over the testing set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  300 
## 
##  
##              | test_pred_svm_train 
##    test$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       194 |        10 |       204 | 
##              |     0.951 |     0.049 |     0.680 | 
##              |     0.815 |     0.161 |           | 
##              |     0.647 |     0.033 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        44 |        52 |        96 | 
##              |     0.458 |     0.542 |     0.320 | 
##              |     0.185 |     0.839 |           | 
##              |     0.147 |     0.173 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       238 |        62 |       300 | 
##              |     0.793 |     0.207 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
## ......... Prediction over the training set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  700 
## 
##  
##              | train_pred_svm_train 
##   train$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       490 |         6 |       496 | 
##              |     0.988 |     0.012 |     0.709 | 
##              |     0.860 |     0.046 |           | 
##              |     0.700 |     0.009 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        80 |       124 |       204 | 
##              |     0.392 |     0.608 |     0.291 | 
##              |     0.140 |     0.954 |           | 
##              |     0.114 |     0.177 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       570 |       130 |       700 | 
##              |     0.814 |     0.186 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

The test accuracy in this example is 0.82 and the train accuracy is 0.877.

Tree

  • Did:

    • I built four tree models: a simple Gini decision tree, a boosted C5.0 tree, an AdaBoost model, and a random forest.
  • Expected:

    • Tree models may work better than the previous models, with higher accuracy and fewer misclassified risky cases.
  • Found:

    • Tree models are well suited to this project; their predictions are better than the previous models’.

    • In the random forest model, only 18 risky people are misclassified and the overall prediction accuracy is 0.91, while the boosting tree model reaches 0.92, the highest among all the models. Because the accuracy is already high and the dataset is relatively small, even a single-layer neural network would likely overfit, so there is no need to build one; we can go with the boosting tree or the random forest model.

    • From the decision tree model, the most important variable is checking account status: the tree splits first on chk_acct, and certain checking-account groups are classified as non-risky outright. Other crucial variables are duration and purpose.

    • From the random forest model, the variables with the highest importance scores are amount, checking account, and duration, which largely overlaps with the logistic regression result.

You can explore more results from the code.

Simple Gini Tree

# TREE
library(rpart)    # recursive partitioning trees
library(rattle)   # fancyRpartPlot
set.seed(123)
tree_gini = rpart(risk ~ ., data = train,
                  parms = list(split = "gini"), method = "class", cp = 0.03)
fancyRpartPlot(tree_gini)

From the decision tree above, the most important variable is checking account status; in this simple tree, a person’s checking-account group largely determines the predicted risk.

The model is simple and its test error is higher than the other models’, but it shows that the checking account is important for determining a person’s risk status. Other crucial variables are duration and purpose.

The tree model above can give us an initial understanding of classification. The boosting model will be utilized later for more accurate results.

Boosting Tree

# boosted C5.0 decision tree
library(C50)
set.seed(123)

## build model: drop the response column (21) from the predictors; 10 boosting rounds
boost_tree <- C5.0(train[-21], train$risk, trials = 10)
## ......... Prediction over the testing set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  300 
## 
##  
##              | test_pred_tree 
##    test$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       200 |         4 |       204 | 
##              |     0.980 |     0.020 |     0.680 | 
##              |     0.913 |     0.049 |           | 
##              |     0.667 |     0.013 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        19 |        77 |        96 | 
##              |     0.198 |     0.802 |     0.320 | 
##              |     0.087 |     0.951 |           | 
##              |     0.063 |     0.257 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       219 |        81 |       300 | 
##              |     0.730 |     0.270 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
## ......... Prediction over the training set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  700 
## 
##  
##              | train_pred_tree 
##   train$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       496 |         0 |       496 | 
##              |     1.000 |     0.000 |     0.709 | 
##              |     0.978 |     0.000 |           | 
##              |     0.709 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        11 |       193 |       204 | 
##              |     0.054 |     0.946 |     0.291 | 
##              |     0.022 |     1.000 |           | 
##              |     0.016 |     0.276 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       507 |       193 |       700 | 
##              |     0.724 |     0.276 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

The test accuracy in this example is 0.923 and the train accuracy is 0.984.

Ada tree

library(ada)   # AdaBoost
tree_ada <- ada(risk ~ ., data = train)
## ......... Prediction over the testing set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  300 
## 
##  
##              | test_pred_ada 
##    test$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       197 |         7 |       204 | 
##              |     0.966 |     0.034 |     0.680 | 
##              |     0.868 |     0.096 |           | 
##              |     0.657 |     0.023 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        30 |        66 |        96 | 
##              |     0.312 |     0.688 |     0.320 | 
##              |     0.132 |     0.904 |           | 
##              |     0.100 |     0.220 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       227 |        73 |       300 | 
##              |     0.757 |     0.243 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
## ......... Prediction over the training set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  700 
## 
##  
##              | train_pred_ada 
##   train$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       490 |         6 |       496 | 
##              |     0.988 |     0.012 |     0.709 | 
##              |     0.897 |     0.039 |           | 
##              |     0.700 |     0.009 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        56 |       148 |       204 | 
##              |     0.275 |     0.725 |     0.291 | 
##              |     0.103 |     0.961 |           | 
##              |     0.080 |     0.211 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       546 |       154 |       700 | 
##              |     0.780 |     0.220 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

The test accuracy in this example is 0.877 and the train accuracy is 0.911 (computed from the confusion tables above: 263/300 and 638/700).
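The accuracy is the proportion of correctly classified observations, i.e. the diagonal of each confusion table over its total. Reading the counts off the tables above:

```r
# Test set: 197 true 0s and 66 true 1s correct out of 300 observations
test_acc_ada  <- (197 + 66) / 300
# Training set: 490 + 148 correct out of 700
train_acc_ada <- (490 + 148) / 700
round(c(test = test_acc_ada, train = train_acc_ada), 3)
# test 0.877, train 0.911
```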

Random Forest

set.seed(123)
rf <- randomForest(risk ~ ., data = train)
rf$importance
##                  MeanDecreaseGini
## chk_acct               29.6205021
## duration               28.2332828
## credit_his             19.1507272
## purpose                26.7650263
## amount                 34.3047750
## saving_acct            15.0074013
## present_emp            18.1846648
## installment_rate       11.0119763
## sex                    11.5023921
## other_debtor            5.5159841
## present_resid          11.3279983
## property               13.8667032
## age                    25.1164445
## other_install           6.5396332
## housing                 7.0007572
## n_credits               5.3752129
## job                     9.0736041
## n_people                3.5164749
## telephone               4.9136172
## foreign                 0.8767939

From the random forest model, the top three variables by mean decrease in Gini are credit amount, checking account status, and loan duration; checking account status and duration also ranked among the most important predictors in the logistic regression.
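The ranking can be read off `rf$importance` directly; a self-contained sketch using the (rounded) values printed above:

```r
# Mean decrease in Gini impurity, rounded, copied from rf$importance above
gini <- c(chk_acct = 29.62, duration = 28.23, credit_his = 19.15,
          purpose = 26.77, amount = 34.30, saving_acct = 15.01,
          present_emp = 18.18, installment_rate = 11.01, age = 25.12)
# Sort descending: the largest mean decrease marks the most important variable
names(sort(gini, decreasing = TRUE))[1:3]
# "amount" "chk_acct" "duration"
```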

## ......... Prediction over the testing set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  300 
## 
##  
##              | test_pred_rf 
##    test$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       197 |         7 |       204 | 
##              |     0.966 |     0.034 |     0.680 | 
##              |     0.916 |     0.082 |           | 
##              |     0.657 |     0.023 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        18 |        78 |        96 | 
##              |     0.188 |     0.812 |     0.320 | 
##              |     0.084 |     0.918 |           | 
##              |     0.060 |     0.260 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       215 |        85 |       300 | 
##              |     0.717 |     0.283 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
## ......... Prediction over the training set .........
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  700 
## 
##  
##              | train_pred_rf 
##   train$risk |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       496 |         0 |       496 | 
##              |     1.000 |     0.000 |     0.709 | 
##              |     1.000 |     0.000 |           | 
##              |     0.709 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##            1 |         0 |       204 |       204 | 
##              |     0.000 |     1.000 |     0.291 | 
##              |     0.000 |     1.000 |           | 
##              |     0.000 |     0.291 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       496 |       204 |       700 | 
##              |     0.709 |     0.291 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

The test accuracy in this example is 0.917 and the train accuracy is 1. The perfect training accuracy is largely because the dataset is fairly small and the random forest memorizes (overfits) the training data.
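A less optimistic in-sample estimate than the resubstitution accuracy of 1 is the out-of-bag (OOB) error that `print(rf)` reports, since it is computed on observations left out of each bootstrap sample. One can also curb the memorization by growing shallower trees, e.g. by raising `nodesize` (the value below is illustrative, not tuned):

```r
set.seed(123)
# Larger terminal nodes give shallower trees and less memorization;
# nodesize = 20 is an illustrative value, not a tuned one
rf_shallow <- randomForest(risk ~ ., data = train, nodesize = 20)
mean(predict(rf_shallow, test) == test$risk)   # test accuracy of the shallower forest
```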


Stacked Model2

  • Expected:

    • A model that outperforms the individual models above, since stacking usually yields a more accurate predictor.
  • Found:

    • From the cumulative distribution function plot of model agreement, the random forest model performs substantially better than the other methods.

    • The random forest model works well on its own, and the accuracy of the stacked model is 0.917.

# Helper: recode a 0/1 prediction vector as a factor with levels "no"/"yes"
ConvertToYesNo <- function(myprediction) {
  result <- myprediction %>% as.factor()
  levels(result) <- c("no", "yes")
  result
}
# Convert each set of predictions to factors
test_pred_lr %<>% ConvertToYesNo()
test_pred_svm %<>% ConvertToYesNo()
test_pred_rf %<>% ConvertToYesNo()
test_pred_tree %<>% ConvertToYesNo()
test_pred_ada %<>% ConvertToYesNo()
test_pred_svm_train %<>% ConvertToYesNo()
test_pred_svm_van %<>% ConvertToYesNo()

# Collect the six test-set prediction vectors into one data frame
stacked_data <- data.frame(test_pred_lr, test_pred_svm, test_pred_svm_van, test_pred_svm_train, test_pred_tree, test_pred_ada) %>% as.tbl()


# For each model column, flag whether its prediction matches the true test label
agreement_matrix <- lapply(stacked_data, function(x) {
  as.integer(x) == as.integer(test$risk)
}) %>% as.data.frame() %>% as.tbl 

# For each row, count how many models got it right, then drop the rows where
# every model is correct so we can study the distribution of disagreements
rowSums(agreement_matrix) %>% {.[. != ncol(agreement_matrix)]} %>% hist(main = "When predictors disagree, how many get the solution correct?", xlab = "Number of models correct in disagreement (unanimously correct rows left out)")

rowSums(agreement_matrix) %>% {.[. != ncol(agreement_matrix)]} %>% ecdf %>% plot(main = "Empirical Cumulative Distribution Function of Model Agreement", xlab = "Number of models that agree"); abline(h = 0.5)

Since the stacked model selects only the random forest model whenever it is included, I will drop the random forest model and see how the other models behave.

# First we need to create a feature matrix so that all of the names will match our prediction!
model_combined_results <- data.frame(train_pred_lr, train_pred_svm, train_pred_svm_van, train_pred_svm_train, train_pred_tree, train_pred_ada) %>% as.tbl()
# Swap out the names
names(model_combined_results) <- names(stacked_data)
# Create the model -> Stacking with a Ctree to find out if it helps any!
stacked_model <- ctree(train$risk ~ . + 1, data = model_combined_results, controls = ctree_control(mincriterion = .7)) %T>% plot

The stacked model uses the simple tree model's predictions as its main splitting variable and refines the prediction using the svm_train predictions, which contribute only marginally.

If we include all the models, the stacked model relies only on the random forest predictions. So we can conclude that the random forest model works well on its own, and the accuracy of the stacked model is 0.917.
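The 0.917 figure can be reproduced by scoring the stacked tree on the held-out predictions, reusing `stacked_model` and `stacked_data` from above (this assumes the factor levels of the training and test prediction columns line up with `test$risk`):

```r
# Predict with the stacked ctree on the six test-set prediction columns
stacked_pred <- predict(stacked_model, newdata = stacked_data)
mean(stacked_pred == test$risk)   # stacked-model test accuracy
```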

To wrap up

We compared the behavior of seven different models. On the training data, random forest is the most accurate; its training error is 0, which strongly suggests overfitting.

On the test data, the tree model performs best, even better than the random forest model.

Since we care most about predictive performance on new data rather than training fit, we judge the models by test accuracy, and by that measure the tree model gives the best predictions.

Testing and Training Evaluation for 7 Methods

|                   | Logistic Regression | SVM-rbfdot | SVM-vanilladot | SVM-Train | Tree  | Ada Tree | Random Forest |
|-------------------|---------------------|------------|----------------|-----------|-------|----------|---------------|
| Training Accuracy | 0.801               | 0.796      | 0.796          | 0.879     | 0.984 | 0.911    | 1.000         |
| Testing Accuracy  | 0.790               | 0.783      | 0.780          | 0.820     | 0.923 | 0.877    | 0.917         |

  • Conclusion:

    • Across the different models, checking account status, savings account balance, loan duration, credit history, and loan purpose are the most important variables in determining an individual’s credit.

      • Owners of empty checking/savings accounts are more likely to receive bad credit ratings; people tend to have better credit when they have more money in their accounts.

      • The longer the loan duration, the riskier the borrower.

      • A good credit history is an important feature.

      • Individuals who take out loans to buy cars tend to be less risky than those who borrow for education.

    • The tree model is the best model so far, as its test accuracy reaches 0.923. The random forest model also provides a high test accuracy of 0.917.

    • Logistic regression and the plain SVM models give test accuracies of about 0.78 to 0.79. In contrast, more flexible methods such as svm_train and the tree models give more accurate results.


  1. The data description is from the UCI website.

  2. The stacked model code is adapted from modelSelection.html, from one of our lectures.