Credit scores inform us about a person’s reliability in repaying debt. For individuals, checking one’s credit score regularly can support better financial decisions. For banks, credit scores help determine how risky a borrower is. But which factors actually determine a person’s credit score? Can banks detect “bad” debt and reduce losses based on these factors? We explore these questions in this project.
The German credit scoring data is a dataset provided by Prof. Hofmann in the file german.data.
The dataset contains 20 predictor variables for 1000 loan applicants, each labeled as a good or bad credit risk.
Data contains:
Attribute 1: chk_acct (qualitative)
Status of existing checking account
A11 : … < 0 DM
A12 : 0 <= … < 200 DM
A13 : … >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: duration (numerical)
Attribute 3: credit_his (qualitative)
Credit history
A30 : no credits taken/all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/other credits existing (not at this bank)
Attribute 4: purpose (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: amount (numerical)
Attribute 6: saving_acct (qualitative)
Savings account/bonds
A61 : … < 100 DM
A62 : 100 <= … < 500 DM
A63 : 500 <= … < 1000 DM
A64 : … >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: present_emp (qualitative)
Present employment since
A71 : unemployed
A72 : … < 1 year
A73 : 1 <= … < 4 years
A74 : 4 <= … < 7 years
A75 : … >= 7 years
Attribute 8: installment_rate (numerical)
Attribute 9: sex (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: other_debtor (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: present_resid (numerical)
Attribute 12: property (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: age (numerical)
Attribute 14: other_install (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: housing (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: n_credits (numerical)
Attribute 17: job (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed / highly qualified employee/ officer
Attribute 18: n_people (numerical)
Attribute 19: telephone (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customer’s name
Attribute 20: foreign (qualitative)
Foreign worker
A201 : yes
A202 : no
Response: risk (qualitative)
0 : good record
1 : bad record
Data retrieved from the UCI Machine Learning Repository (URL in the code below).
# download data
data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
# data preprocessing
colnames(data) = c("chk_acct", "duration", "credit_his", "purpose","amount",
"saving_acct", "present_emp","installment_rate", "sex",
"other_debtor", "present_resid", "property", "age",
"other_install","housing", "n_credits", "job", "n_people",
"telephone", "foreign", "risk")
#str(data)
# recode the response: the raw file codes 1 = good, 2 = bad; shift to 0 = good, 1 = bad
data$risk = as.factor(data$risk - 1)
summary(data)
## chk_acct duration credit_his purpose amount saving_acct
## A11:274 Min. : 4.0 A30: 40 A43 :280 Min. : 250 A61:603
## A12:269 1st Qu.:12.0 A31: 49 A40 :234 1st Qu.: 1366 A62:103
## A13: 63 Median :18.0 A32:530 A42 :181 Median : 2320 A63: 63
## A14:394 Mean :20.9 A33: 88 A41 :103 Mean : 3271 A64: 48
## 3rd Qu.:24.0 A34:293 A49 : 97 3rd Qu.: 3972 A65:183
## Max. :72.0 A46 : 50 Max. :18424
## (Other): 55
## present_emp installment_rate sex other_debtor present_resid property
## A71: 62 Min. :1.000 A91: 50 A101:907 Min. :1.000 A121:282
## A72:172 1st Qu.:2.000 A92:310 A102: 41 1st Qu.:2.000 A122:232
## A73:339 Median :3.000 A93:548 A103: 52 Median :3.000 A123:332
## A74:174 Mean :2.973 A94: 92 Mean :2.845 A124:154
## A75:253 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :4.000 Max. :4.000
##
## age other_install housing n_credits job
## Min. :19.00 A141:139 A151:179 Min. :1.000 A171: 22
## 1st Qu.:27.00 A142: 47 A152:713 1st Qu.:1.000 A172:200
## Median :33.00 A143:814 A153:108 Median :1.000 A173:630
## Mean :35.55 Mean :1.407 A174:148
## 3rd Qu.:42.00 3rd Qu.:2.000
## Max. :75.00 Max. :4.000
##
## n_people telephone foreign risk
## Min. :1.000 A191:596 A201:963 0:700
## 1st Qu.:1.000 A192:404 A202: 37 1:300
## Median :1.000
## Mean :1.155
## 3rd Qu.:1.000
## Max. :2.000
##
The first step is to understand the underlying data structure. The variables discussed below differ between the risk and non-risk groups.
Individuals whose checking account is overdrawn (group A11, balance below 0 DM) are the riskiest group, while individuals who keep money in their checking account (groups A12 and A13) are less risky. From the corresponding plot, we can conclude that chk_acct is an important predictor.
The riskier group tends to have longer loan durations than the less risky group, which suggests that duration will contribute substantially to the model fit.
The credit_his variable differs substantially between good and bad records.
People who have taken no credits or have paid all credits back duly (A30) are riskier, as are those whose credits at this bank were all paid back duly (A31).
Risk varies considerably across purposes. Buying a radio/television (A43) is the most common purpose and carries lower risk, while borrowing for education (A46) tends to be riskier.
People who have a savings account with less than 100 DM in it (group A61) are riskier than others, while people who keep more than 1000 DM in savings (group A64) are less likely to have bad credit.
The credit-risk differences between the personal status and sex groups are negligible: there is no strong evidence that one group is riskier than another, although divorced or separated males may carry slightly higher risk.
The age difference between risky and non-risky people is small, though risky people tend to be younger.
After getting familiar with the dataset, we take the following modeling steps.
The data is split randomly into a training set of 700 observations and a testing set of 300 observations.
set.seed(123)
# store the sampled indices so the test set is the exact complement of the training set
# (calling sample() twice, as before, would draw two different random samples)
train_idx = sample(nrow(data), size = round(0.7 * nrow(data)))
train = data[train_idx, ]  # 700 observations
test = data[-train_idx, ]  # 300 observations
First, we build a logistic regression model.
Findings:
We can conclude that credit history, duration, checking-account and savings-account status, installment rate, and purpose are the most important factors for a person’s credit risk.
A good credit history and well-funded checking/savings accounts are associated with better credit.
Duration hurts credit: the longer a loan remains unpaid, the higher the probability of default.
The installment rate has a positive relationship with risk: higher installment rates coincide with riskier individuals.
In the test prediction table, 41 risky people are predicted as non-risky while 22 non-risky people are assigned to the risky group. Since the goal is to reduce bad-debt losses, we care most about misclassification of the risky group, because those errors cost the bank the most.
The test accuracy in this example is 0.79 and the train accuracy is 0.801.
Results and code are shown below.
logit_model <- glm(risk ~ ., family = binomial, train)
summary(logit_model)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1313739220 1.370505235 -0.82551594 4.090787e-01
## chk_acctA12 -0.2300316391 0.270689061 -0.84980028 3.954361e-01
## chk_acctA13 -0.7905931015 0.457856632 -1.72672633 8.421682e-02
## chk_acctA14 -1.6066696764 0.288732270 -5.56456568 2.628061e-08
## duration 0.0331562356 0.011129655 2.97908913 2.891067e-03
## credit_hisA31 0.3886107895 0.704817003 0.55136410 5.813841e-01
## credit_hisA32 -0.6437996628 0.571162420 -1.12717441 2.596687e-01
## credit_hisA33 -0.6974179943 0.629916567 -1.10715931 2.682251e-01
## credit_hisA34 -1.4985808031 0.574797372 -2.60714623 9.130036e-03
## purposeA41 -1.8709435982 0.458936755 -4.07669156 4.568102e-05
## purposeA410 -1.4477249834 0.990378793 -1.46178916 1.437990e-01
## purposeA42 -0.8020469082 0.328382527 -2.44241652 1.458930e-02
## purposeA43 -0.8683073660 0.310794631 -2.79383001 5.208785e-03
## purposeA44 -0.1960169526 0.967232477 -0.20265754 8.394027e-01
## purposeA45 0.3884673663 0.703061011 0.55253721 5.805804e-01
## purposeA46 -0.5849829967 0.513110375 -1.14007244 2.542561e-01
## purposeA48 -1.4515074995 1.357061985 -1.06959558 2.848014e-01
## purposeA49 -0.9211803035 0.423890758 -2.17315496 2.976866e-02
## amount 0.0001460432 0.000053184 2.74599889 6.032697e-03
## saving_acctA62 -0.4909556606 0.366239717 -1.34053091 1.800728e-01
## saving_acctA63 -0.9304392855 0.527763253 -1.76298611 7.790283e-02
## saving_acctA64 -1.7510920404 0.670342556 -2.61223463 8.995250e-03
## saving_acctA65 -1.2035607335 0.338948729 -3.55086369 3.839692e-04
## present_empA72 0.6837668787 0.540265477 1.26561275 2.056518e-01
## present_empA73 0.3936069049 0.522510470 0.75329956 4.512699e-01
## present_empA74 -0.4145354963 0.553699299 -0.74866538 4.540589e-01
## present_empA75 -0.1297280190 0.523454991 -0.24783032 8.042657e-01
## installment_rate 0.3943849329 0.109040387 3.61687027 2.981867e-04
## sexA92 0.0522930399 0.474402609 0.11022924 9.122276e-01
## sexA93 -0.6667295197 0.467245316 -1.42693676 1.535981e-01
## sexA94 -0.0179217466 0.552976628 -0.03240959 9.741454e-01
## other_debtorA102 0.5692649114 0.542228896 1.04986089 2.937821e-01
## other_debtorA103 -0.9705508423 0.514396683 -1.88677508 5.919058e-02
## present_resid 0.0244825354 0.107337266 0.22808980 8.195764e-01
## propertyA122 0.1890957315 0.319299425 0.59222071 5.537028e-01
## propertyA123 0.4269841684 0.286877712 1.48838390 1.366497e-01
## propertyA124 1.1096637808 0.648381501 1.71143652 8.700056e-02
## age 0.0018871447 0.010963330 0.17213244 8.633334e-01
## other_installA142 -0.0394384509 0.534419938 -0.07379674 9.411721e-01
## other_installA143 -0.2443294572 0.300172814 -0.81396264 4.156664e-01
## housingA152 -0.5696057202 0.294743344 -1.93254821 5.329188e-02
## housingA153 -1.0612312760 0.691793185 -1.53402968 1.250224e-01
## n_credits 0.2888989801 0.254380665 1.13569552 2.560840e-01
## jobA172 -0.3075138905 0.807656577 -0.38074833 7.033900e-01
## jobA173 -0.0291265746 0.778331071 -0.03742183 9.701487e-01
## jobA174 -0.2059283987 0.780777613 -0.26374783 7.919742e-01
## n_people 0.4692257063 0.309568019 1.51574348 1.295842e-01
## telephoneA192 -0.4318596955 0.251922462 -1.71425641 8.648164e-02
## foreignA202 -1.1362124212 0.759652238 -1.49570075 1.347317e-01
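The prediction tables below come from a chunk that is not echoed above. A minimal sketch of the assumed pattern, using a 0.5 probability cutoff and gmodels::CrossTable (the names test_pred_lr and train_pred_lr reappear in the stacking section):
library(gmodels)
# class predictions at an assumed 0.5 probability cutoff
test_pred_lr = as.integer(predict(logit_model, test, type = "response") > 0.5)
train_pred_lr = as.integer(predict(logit_model, train, type = "response") > 0.5)
# confusion tables (row/column/table proportions, no chi-square contributions)
CrossTable(test$risk, test_pred_lr, prop.chisq = FALSE)
CrossTable(train$risk, train_pred_lr, prop.chisq = FALSE)
# overall accuracies (factor levels "0"/"1" compare as characters)
mean(test_pred_lr == test$risk)
mean(train_pred_lr == train$risk)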
## ......... Prediction over the testing set.........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 300
##
##
## | test_pred_lr
## test$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 182 | 22 | 204 |
## | 0.892 | 0.108 | 0.680 |
## | 0.816 | 0.286 | |
## | 0.607 | 0.073 | |
## -------------|-----------|-----------|-----------|
## 1 | 41 | 55 | 96 |
## | 0.427 | 0.573 | 0.320 |
## | 0.184 | 0.714 | |
## | 0.137 | 0.183 | |
## -------------|-----------|-----------|-----------|
## Column Total | 223 | 77 | 300 |
## | 0.743 | 0.257 | |
## -------------|-----------|-----------|-----------|
##
##
## ......... Prediction over the training set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 700
##
##
## | train_pred_lr
## train$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 448 | 48 | 496 |
## | 0.903 | 0.097 | 0.709 |
## | 0.831 | 0.298 | |
## | 0.640 | 0.069 | |
## -------------|-----------|-----------|-----------|
## 1 | 91 | 113 | 204 |
## | 0.446 | 0.554 | 0.291 |
## | 0.169 | 0.702 | |
## | 0.130 | 0.161 | |
## -------------|-----------|-----------|-----------|
## Column Total | 539 | 161 | 700 |
## | 0.770 | 0.230 | |
## -------------|-----------|-----------|-----------|
##
##
SVMs are often expected to produce better results than logistic regression, so we fit three SVM variants next.
All three SVM models tend to misclassify the higher-risk individuals. The svm_train model misclassified the fewest high-risk individuals, at 44; the vanilladot and rbfdot kernel models misclassified 45 and 57, respectively.
Here, SVM does not provide better results than logistic regression, which could translate into increased losses for the bank.
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:modeltools':
##
## prior
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
svm_model <- ksvm(risk ~ ., data = train, kernel = "rbfdot")
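The corresponding prediction chunk is again not echoed; presumably the same predict-plus-CrossTable pattern, with ksvm returning class labels directly (the vanilladot model below follows the same pattern):
# predicted classes from the rbfdot SVM (assumed pattern; original chunk not shown)
test_pred_svm = predict(svm_model, test)
train_pred_svm = predict(svm_model, train)
CrossTable(test$risk, test_pred_svm, prop.chisq = FALSE)
CrossTable(train$risk, train_pred_svm, prop.chisq = FALSE)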
## ......... Prediction over the testing set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 300
##
##
## | test_pred_svm
## test$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 196 | 8 | 204 |
## | 0.961 | 0.039 | 0.680 |
## | 0.775 | 0.170 | |
## | 0.653 | 0.027 | |
## -------------|-----------|-----------|-----------|
## 1 | 57 | 39 | 96 |
## | 0.594 | 0.406 | 0.320 |
## | 0.225 | 0.830 | |
## | 0.190 | 0.130 | |
## -------------|-----------|-----------|-----------|
## Column Total | 253 | 47 | 300 |
## | 0.843 | 0.157 | |
## -------------|-----------|-----------|-----------|
##
##
## ......... Prediction over the training set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 700
##
##
## | train_pred_svm
## train$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 483 | 13 | 496 |
## | 0.974 | 0.026 | 0.709 |
## | 0.813 | 0.123 | |
## | 0.690 | 0.019 | |
## -------------|-----------|-----------|-----------|
## 1 | 111 | 93 | 204 |
## | 0.544 | 0.456 | 0.291 |
## | 0.187 | 0.877 | |
## | 0.159 | 0.133 | |
## -------------|-----------|-----------|-----------|
## Column Total | 594 | 106 | 700 |
## | 0.849 | 0.151 | |
## -------------|-----------|-----------|-----------|
##
##
The test accuracy in this example is 0.783 and the train accuracy is 0.796.
svm_model_van <- ksvm(risk ~ ., data = train, kernel = "vanilladot")
## ......... Prediction over the testing set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 300
##
##
## | test_pred_svm_van
## test$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 183 | 21 | 204 |
## | 0.897 | 0.103 | 0.680 |
## | 0.803 | 0.292 | |
## | 0.610 | 0.070 | |
## -------------|-----------|-----------|-----------|
## 1 | 45 | 51 | 96 |
## | 0.469 | 0.531 | 0.320 |
## | 0.197 | 0.708 | |
## | 0.150 | 0.170 | |
## -------------|-----------|-----------|-----------|
## Column Total | 228 | 72 | 300 |
## | 0.760 | 0.240 | |
## -------------|-----------|-----------|-----------|
##
##
## ......... Prediction over the training set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 700
##
##
## | train_pred_svm_van
## train$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 450 | 46 | 496 |
## | 0.907 | 0.093 | 0.709 |
## | 0.823 | 0.301 | |
## | 0.643 | 0.066 | |
## -------------|-----------|-----------|-----------|
## 1 | 97 | 107 | 204 |
## | 0.475 | 0.525 | 0.291 |
## | 0.177 | 0.699 | |
## | 0.139 | 0.153 | |
## -------------|-----------|-----------|-----------|
## Column Total | 547 | 153 | 700 |
## | 0.781 | 0.219 | |
## -------------|-----------|-----------|-----------|
##
##
The test accuracy in this example is 0.78 and the train accuracy is 0.796.
svm_model_train <- caret::train(risk ~ ., data = train,
method = "svmRadialSigma",
metric = "Kappa")
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
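caret::train tunes the kernel width sigma and the cost C over an internal grid; the selected pair and the class predictions can be obtained as follows (a sketch, since the original chunk is not shown):
# selected (sigma, C) combination from caret's grid search
svm_model_train$bestTune
# predicted classes for both sets (assumed pattern)
test_pred_svm_train = predict(svm_model_train, test)
train_pred_svm_train = predict(svm_model_train, train)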
## ......... Prediction over the testing set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 300
##
##
## | test_pred_svm_train
## test$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 194 | 10 | 204 |
## | 0.951 | 0.049 | 0.680 |
## | 0.815 | 0.161 | |
## | 0.647 | 0.033 | |
## -------------|-----------|-----------|-----------|
## 1 | 44 | 52 | 96 |
## | 0.458 | 0.542 | 0.320 |
## | 0.185 | 0.839 | |
## | 0.147 | 0.173 | |
## -------------|-----------|-----------|-----------|
## Column Total | 238 | 62 | 300 |
## | 0.793 | 0.207 | |
## -------------|-----------|-----------|-----------|
##
##
## ......... Prediction over the training set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 700
##
##
## | train_pred_svm_train
## train$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 490 | 6 | 496 |
## | 0.988 | 0.012 | 0.709 |
## | 0.860 | 0.046 | |
## | 0.700 | 0.009 | |
## -------------|-----------|-----------|-----------|
## 1 | 80 | 124 | 204 |
## | 0.392 | 0.608 | 0.291 |
## | 0.140 | 0.954 | |
## | 0.114 | 0.177 | |
## -------------|-----------|-----------|-----------|
## Column Total | 570 | 130 | 700 |
## | 0.814 | 0.186 | |
## -------------|-----------|-----------|-----------|
##
##
The test accuracy in this example is 0.82 and the train accuracy is 0.879.
Tree models are well suited to this project, and their predictions are better than those of the previous models.
In the random forest model, only 18 risky people are misclassified and the overall test accuracy is 0.91, while the boosted tree model reaches 0.92, the highest among all the models. Because the accuracy is already high and the dataset is relatively small, even a single-layer neural network could overfit the data, so there is no need to build one; we can go with the boosted tree or the random forest model.
From the decision tree model, the most important variable is the person’s checking-account status: in this simple tree, that status largely decides whether someone is labeled non-risky. Other crucial variables are duration and purpose.
From the random forest model, the top three variables by importance are amount, checking-account status, and duration, which overlap substantially with the significant predictors from the logistic regression.
You can explore more results in the code below.
# TREE
library(rpart)  # recursive partitioning trees
library(rattle) # fancyRpartPlot
set.seed(123)
tree_gini = rpart(risk ~ ., data = train,
parms = list(split = "gini"), method = "class", cp = 0.03)
fancyRpartPlot(tree_gini)
From the decision tree above, we again see that the most important variable is the person’s checking-account status: in this simple tree, it largely determines whether someone is classified as non-risky.
The model is simple and its test error is higher than that of the other models, but it shows that the checking account is key to determining a person’s risk status. Other crucial variables are duration and purpose.
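Its test error is easy to check directly (a sketch; tree_pred is a hypothetical name):
# class predictions from the simple tree and the resulting test accuracy
tree_pred = predict(tree_gini, test, type = "class")
mean(tree_pred == test$risk)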
The tree model above gives an initial understanding of the classification. A boosting model is used next for more accurate results.
# boosted C5.0 decision tree
library(C50) # C5.0 with adaptive boosting
set.seed(123)
## build model: drop the response column (21) from the features, 10 boosting iterations
boost_tree <- C5.0(train[-21], train$risk, trials = 10)
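The tables below again presumably come from a predict-plus-CrossTable chunk; the same pattern yields the tables for tree_ada and rf further down:
# predicted classes from the boosted C5.0 model (assumed pattern)
test_pred_tree = predict(boost_tree, test)
train_pred_tree = predict(boost_tree, train)
CrossTable(test$risk, test_pred_tree, prop.chisq = FALSE)
CrossTable(train$risk, train_pred_tree, prop.chisq = FALSE)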
## ......... Prediction over the testing set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 300
##
##
## | test_pred_tree
## test$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 200 | 4 | 204 |
## | 0.980 | 0.020 | 0.680 |
## | 0.913 | 0.049 | |
## | 0.667 | 0.013 | |
## -------------|-----------|-----------|-----------|
## 1 | 19 | 77 | 96 |
## | 0.198 | 0.802 | 0.320 |
## | 0.087 | 0.951 | |
## | 0.063 | 0.257 | |
## -------------|-----------|-----------|-----------|
## Column Total | 219 | 81 | 300 |
## | 0.730 | 0.270 | |
## -------------|-----------|-----------|-----------|
##
##
## ......... Prediction over the training set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 700
##
##
## | train_pred_tree
## train$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 496 | 0 | 496 |
## | 1.000 | 0.000 | 0.709 |
## | 0.978 | 0.000 | |
## | 0.709 | 0.000 | |
## -------------|-----------|-----------|-----------|
## 1 | 11 | 193 | 204 |
## | 0.054 | 0.946 | 0.291 |
## | 0.022 | 1.000 | |
## | 0.016 | 0.276 | |
## -------------|-----------|-----------|-----------|
## Column Total | 507 | 193 | 700 |
## | 0.724 | 0.276 | |
## -------------|-----------|-----------|-----------|
##
##
The test accuracy in this example is 0.923 and the train accuracy is 0.984.
library(ada) # stochastic boosting (AdaBoost)
tree_ada <- ada(risk ~ ., data = train)
## ......... Prediction over the testing set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 300
##
##
## | test_pred_ada
## test$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 197 | 7 | 204 |
## | 0.966 | 0.034 | 0.680 |
## | 0.868 | 0.096 | |
## | 0.657 | 0.023 | |
## -------------|-----------|-----------|-----------|
## 1 | 30 | 66 | 96 |
## | 0.312 | 0.688 | 0.320 |
## | 0.132 | 0.904 | |
## | 0.100 | 0.220 | |
## -------------|-----------|-----------|-----------|
## Column Total | 227 | 73 | 300 |
## | 0.757 | 0.243 | |
## -------------|-----------|-----------|-----------|
##
##
## ......... Prediction over the training set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 700
##
##
## | train_pred_ada
## train$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 490 | 6 | 496 |
## | 0.988 | 0.012 | 0.709 |
## | 0.897 | 0.039 | |
## | 0.700 | 0.009 | |
## -------------|-----------|-----------|-----------|
## 1 | 56 | 148 | 204 |
## | 0.275 | 0.725 | 0.291 |
## | 0.103 | 0.961 | |
## | 0.080 | 0.211 | |
## -------------|-----------|-----------|-----------|
## Column Total | 546 | 154 | 700 |
## | 0.780 | 0.220 | |
## -------------|-----------|-----------|-----------|
##
##
The test accuracy in this example is 0.883 and the train accuracy is 0.917.
library(randomForest)
set.seed(123)
rf <- randomForest(risk ~ ., data = train)
rf$importance
## MeanDecreaseGini
## chk_acct 29.6205021
## duration 28.2332828
## credit_his 19.1507272
## purpose 26.7650263
## amount 34.3047750
## saving_acct 15.0074013
## present_emp 18.1846648
## installment_rate 11.0119763
## sex 11.5023921
## other_debtor 5.5159841
## present_resid 11.3279983
## property 13.8667032
## age 25.1164445
## other_install 6.5396332
## housing 7.0007572
## n_credits 5.3752129
## job 9.0736041
## n_people 3.5164749
## telephone 4.9136172
## foreign 0.8767939
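Sorting the importance scores makes the ranking explicit (a sketch):
# variables ranked by mean decrease in Gini impurity
imp = rf$importance
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]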
From the random forest model, the top three variables by importance are amount, chk_acct, and duration (credit_his ranks only sixth), which overlap substantially with the significant predictors from the logistic regression.
## ......... Prediction over the testing set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 300
##
##
## | test_pred_rf
## test$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 197 | 7 | 204 |
## | 0.966 | 0.034 | 0.680 |
## | 0.916 | 0.082 | |
## | 0.657 | 0.023 | |
## -------------|-----------|-----------|-----------|
## 1 | 18 | 78 | 96 |
## | 0.188 | 0.812 | 0.320 |
## | 0.084 | 0.918 | |
## | 0.060 | 0.260 | |
## -------------|-----------|-----------|-----------|
## Column Total | 215 | 85 | 300 |
## | 0.717 | 0.283 | |
## -------------|-----------|-----------|-----------|
##
##
## ......... Prediction over the training set .........
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 700
##
##
## | train_pred_rf
## train$risk | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 496 | 0 | 496 |
## | 1.000 | 0.000 | 0.709 |
## | 1.000 | 0.000 | |
## | 0.709 | 0.000 | |
## -------------|-----------|-----------|-----------|
## 1 | 0 | 204 | 204 |
## | 0.000 | 1.000 | 0.291 |
## | 0.000 | 1.000 | |
## | 0.000 | 0.291 | |
## -------------|-----------|-----------|-----------|
## Column Total | 496 | 204 | 700 |
## | 0.709 | 0.291 | |
## -------------|-----------|-----------|-----------|
##
##
The test accuracy in this example is 0.917 and the train accuracy is 1.000. The training accuracy is perfect mainly because the dataset is not very big, so the random forest overfits the training data.
From the empirical cumulative distribution function plot of model agreement, we can see that the random forest model behaves considerably better than the other methods.
The random forest on its own already works well; the accuracy of the stacked model is 0.917.
# the stacking code below uses magrittr pipes (%>%, %<>%, %T>%) and dplyr's as.tbl
library(magrittr)
library(dplyr)
# Convert 0/1 predictions to yes/no factors
ConvertToYesNo <- function(myprediction) {
  result <- myprediction %>% as.factor()
  levels(result) <- c("no", "yes")
  result
}
# Convert each set of predictions to factors
test_pred_lr %<>% ConvertToYesNo()
test_pred_svm %<>% ConvertToYesNo()
test_pred_rf %<>% ConvertToYesNo()
test_pred_tree %<>% ConvertToYesNo()
test_pred_ada %<>% ConvertToYesNo()
test_pred_svm_train %<>% ConvertToYesNo()
test_pred_svm_van %<>% ConvertToYesNo()
stacked_data = data.frame(test_pred_lr, test_pred_svm, test_pred_svm_van, test_pred_svm_train, test_pred_tree, test_pred_ada) %>% as.tbl()
# TRUE where a model's prediction matches the true test label
agreement_matrix <- lapply(stacked_data, function(x) {
  as.integer(x) == as.integer(test$risk)
}) %>% as.data.frame() %>% as.tbl
# For each row, count how many models are right, then drop the unanimously
# correct rows so we can see the distribution of disagreements
# (the original filtered on != 10, which never triggers with 6 models)
n_models = ncol(agreement_matrix)
rowSums(agreement_matrix) %>% {.[. != n_models]} %>% hist(main = "When predictors disagree, how many get the solution correct?", xlab = "Number of models correct in disagreement (unanimous rows left out)")
rowSums(agreement_matrix) %>% {.[. != n_models]} %>% ecdf %>% plot(main = "Empirical Cumulative Distribution Function of Model Agreement", xlab = "Number of models that agree"); abline(h = 0.5)
Since the stacked model selects only the random forest model whenever I include it, I drop the random forest and see how the other models behave.
# First we need a feature matrix whose names match our test-side predictions!
model_combined_results <- data.frame(train_pred_lr, train_pred_svm, train_pred_svm_van, train_pred_svm_train, train_pred_tree, train_pred_ada) %>% as.tbl()
# Swap out the names
names(model_combined_results) <- names(stacked_data)
# Convert the train-side predictions too, so the column types match stacked_data
# (assumed step; without it, predicting on the yes/no test factors would fail)
model_combined_results[] <- lapply(model_combined_results, ConvertToYesNo)
# Create the model -> stacking with a ctree to find out if it helps any!
library(party) # conditional inference trees (ctree)
stacked_model <- ctree(train$risk ~ . + 1, data = model_combined_results, controls = ctree_control(mincriterion = .7)) %T>% plot
The stacked model uses the simple tree model as its base and refines the prediction with the svm_train model, though the svm_train model’s contribution is limited.
If we include all the models in the stack, it selects only the random forest, so we can conclude that the random forest works well on its own; the accuracy of the stacked model is 0.917.
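Under these assumptions, the stacked model’s test accuracy can be checked directly (a sketch; stacked_pred is a hypothetical name):
# predict on the test-side prediction matrix and score against the true labels
stacked_pred = predict(stacked_model, newdata = stacked_data)
mean(stacked_pred == test$risk)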
We can now compare the behavior of all 7 models. On the training data, the random forest is the most accurate, with a training error of 0, which strongly suggests overfitting.
On the test data, the boosted tree model behaves best, even better than the random forest.
We need not worry much about the training-set overfitting here, because we care about predictive performance; by that measure, the boosted tree model gives us the best prediction result.
| | Logistic Regression | SVM-rbfdot | SVM-vanilladot | SVM-Train | Boosted Tree | Ada Tree | Random Forest |
|---|---|---|---|---|---|---|---|
| Training Accuracy | 0.801 | 0.796 | 0.796 | 0.879 | 0.984 | 0.917 | 1.000 |
| Testing Accuracy | 0.790 | 0.783 | 0.780 | 0.820 | 0.923 | 0.883 | 0.917 |
Conclusion:
Across the models, checking-account status, savings, duration, credit history, and purpose are the most important variables for determining an individual’s credit.
Owners of poorly funded checking/savings accounts have a higher chance of bad credit; people tend to have better credit when they keep more money in their accounts.
The longer the loan duration, the riskier the borrower.
A good credit history is an important feature.
Individuals who take out loans to purchase cars tend to be less risky than those who borrow for education.
The boosted tree model (boost_tree) is the best model so far, with a test accuracy of 0.92; the random forest also reaches a high test accuracy of 0.91.
Logistic regression and the plain SVM models give test accuracies of about 0.8; more refined methods such as the tuned SVM (svm_train) and the tree models give more accurate results.
The data description is from the UCI website.↩
The stacked model code is adapted from modelSelection.html, from one of our lectures.↩