The explanatory variables in the logistic regression are the type of loan and the borrowing amount.
Figure 1: The summary output of the logistic regression based on the type of loan and the borrowing amount.
The logistic regression shows statistical significance at the 0.01 level for the loan amount and for the purpose indicators for a used car and for a radio/television (Figure 1). The fitted equation, on the log-odds scale, is:
logit(P(default)) = -0.9321 + 0.0001330(amount) - 1.56(purpose = used car) - 0.6499(purpose = radio/TV)
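As a minimal sketch of how the equation is applied (the used-car loan of amount 5000 is hypothetical), the log-odds are converted to a default probability with the inverse logit:
logodds = -0.9321 + 0.0001330*5000 - 1.56*1 - 0.6499*0 # hypothetical used-car loan of amount 5000
1/(1 + exp(-logodds)) # predicted probability of default, roughly 0.14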
Figure 2: The comparative output of the logistic regression prediction versus actual results.
When comparing the predictions to the actual values (Figure 2), the mean and minimum are similar between the two, but the other summary statistics differ. When the predicted probabilities are rounded to the nearest whole number (0 or 1), the model classifies 73% of the test cases correctly.
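As a minimal sketch of that check, assuming the ytest and testingdata$defaultPrediction objects built in the Code section below:
predicted = round(testingdata$defaultPrediction) # turn predicted probabilities into 0/1 labels
table(actual = ytest, predicted = predicted)     # confusion matrix over the 100 test cases
100 * mean(ytest == predicted)                   # percent correct; 73% as reported above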
K-means clustering on the three continuous variables: duration, amount, and installment.
In k-means clustering, each observation is assigned to the cluster whose center is nearest in Euclidean distance, and the centers are chosen to minimize the within-cluster sum of squares (Ahlemeyer-Stubbe & Coleman, 2014). In this exercise there are two clusters, of sizes 825 and 175, and the between-cluster sum of squares accounts for 69.78% of the total sum of squares. The matrix of cluster centers is shown below (Figure 3).
Figure 3: K-means cluster center values, per variable.
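As a minimal sketch of the assignment step (the applicant values below are hypothetical, and kmeansresult comes from the Code section), a point joins the cluster whose center is nearest in Euclidean distance:
x = c(duration = 24, amount = 3000, installment = 3) # hypothetical applicant
d = apply(kmeansresult$centers, 1, function(ctr) sqrt(sum((x - ctr)^2)))
which.min(d) # index of the nearest cluster center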
Hold-out validation of the k-nearest neighbor classifier with k = 5.
K-nearest neighbor (k = 5) classifies a data point by letting its five nearest neighbors vote on its label, and it is particularly useful when the outcome is binary or categorical (Berson, Smith, & Thearling, 1999). In this exercise, the trained classifier labels 69% of the test cases correctly. Logistic regression produced a higher rate of 73%, so for this exercise and this data set, logistic regression was more useful for predicting the default rate than the k-nearest neighbor algorithm at k = 5.
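A minimal sketch of that vote with the class package, assuming the xtrain, xtest, and ytrain objects from the Code section (all three design-matrix columns are used here for illustration):
library(class)
pred = knn(train = xtrain, test = xtest, cl = ytrain, k = 5) # majority vote of the 5 nearest neighbors
100 * mean(ytest == pred)                                    # percent classified correctly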
Code
#
## The German credit data contains attributes and outcomes on 1,000 loan applications.
## Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data
## Metadata file: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc
#
## Reading the data from source and displaying the top five entries.
credits = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", header = F, sep = " ")
head(credits)
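## Quick shape check (not in the original write-up): 1,000 applications and 21 columns expected
dim(credits)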
#
##
### ———————————————————————————————————-
## The two outcomes are success (defaulting on the loan, coded 1) and failure (not defaulting, coded 0).
## The explanatory variables in the logistic regression are both the type of loan and the borrowing amount.
### ———————————————————————————————————-
##
#
## Defining and re-leveling the variables (Taddy, n.d.)
default = credits$V21 - 1 # V21 is coded 1/2; subtract 1 so default = 1
amount = credits$V5
purpose = factor(credits$V4, levels = c("A40","A41","A42","A43","A44","A45","A46","A48","A49","A410"))
levels(purpose) = c("newcar", "usedcar", "furniture/equip", "radio/TV", "apps", "repairs", "edu", "retraining", "biz", "other")
## Create a new data frame called "cred" with the 3 defined variables (Taddy, n.d.)
credits$default = default
credits$amount = amount
credits$purpose = purpose
cred = credits[,c("default","amount","purpose")]
head(cred)
summary(cred)
## Create a design matrix, such that factor variables are turned into indicator variables
Xcred = model.matrix(default~., data=cred)[,-1]
Xcred[1:5,]
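## Not in the original: list the columns model.matrix created (amount plus one indicator per non-baseline purpose level)
colnames(Xcred)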
## Creating training and prediction datasets: select 900 rows for estimation and 100 for testing
set.seed(1)
train = sample(1:1000,900)
## Defining which x and y values in the design matrix will be for training and for testing
xtrain = Xcred[train,]
xtest = Xcred[-train,]
ytrain = cred$default[train]
ytest = cred$default[-train]
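## Sanity check (not in the original): expect 900 training rows and 100 test observations
dim(xtrain); length(ytest)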
## Logistic regression
datas=data.frame(default=ytrain,xtrain)
creditglm=glm(default~., family=binomial, data=datas)
summary(creditglm)
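## Aside (not in the original analysis): exponentiating coefficients gives odds ratios
## relative to the baseline purpose (new car); e.g., exp(-1.56) is about 0.21 for used-car loans
exp(coef(creditglm))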
## Predicting default from the test data (Alice, 2015; UCLA: Statistical Consulting Group, 2007)
testdata = data.frame(default=ytest, xtest)
testdata[1:5,]
testingdata = testdata[,2:11] # removing the variable default from the data matrix
testingdata$defaultPrediction = predict(creditglm, newdata=testdata, type = "response")
results = data.frame(ytest, testingdata$defaultPrediction)
summary(results)
head(results, 10)
## Percent correct on the 100 test cases, after rounding predicted probabilities to 0/1
percentOfCorrect = 100*(sum(ytest == round(testingdata$defaultPrediction))/100)
percentOfCorrect
#
##
### ———————————————————————————————————-
## K-means clustering on the three continuous variables: duration, amount, and installment.
### ———————————————————————————————————-
##
#
install.packages("class")
library(class)
## Defining and re-leveling the variables (Taddy, n.d.)
default = credits$V21 - 1 # V21 is coded 1/2; subtract 1 so default = 1
duration = credits$V2
amount = credits$V5
installment = credits$V8
## Create a new data frame called "creds" with the 4 defined variables (Taddy, n.d.)
credits$default = default
credits$amount = amount
credits$installment = installment
credits$duration = duration
creds = credits[,c("duration","amount","installment","default")]
head(creds)
summary(creds)
## K-means clustering on the three continuous variables (R, n.d.a.)
## Note: the variables are not scaled, so amount, the largest in magnitude, dominates the distances
kmeansclass = cbind(creds$duration, creds$amount, creds$installment)
kmeansresult = kmeans(kmeansclass, 2)
kmeansresult$cluster
kmeansresult$size
kmeansresult$centers
kmeansresult$betweenss/kmeansresult$totss
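## Quick check (not in the original write-up): how defaults fall across the two clusters
table(cluster = kmeansresult$cluster, default = creds$default)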
#
##
### ———————————————————————————————————-
## Hold-out validation of the k-nearest neighbor classifier with k = 5.
### ———————————————————————————————————-
##
#
## Create a design matrix, such that factor variables are turned into indicator variables
Xcreds = model.matrix(default~., data=creds)[,-1]
Xcreds[1:5,]
## Creating training and prediction datasets: select 900 rows for estimation and 100 for testing
set.seed(1)
train = sample(1:1000,900)
## Defining which x and y values in the design matrix will be for training and for testing
xtrain = Xcreds[train,]
xtest = Xcreds[-train,]
ytrain = creds$default[train]
ytest = creds$default[-train]
## K-nearest neighbor classification (R, n.d.b.)
## Note: only column 2 of the design matrix (amount) is used as the predictor here
nearestFive = knn(train = xtrain[,2,drop=F], test = xtest[,2,drop=F], cl = ytrain, k = 5)
knnresults = cbind(ytest+1, nearestFive) # cbind coerces the factor nearestFive to its integer codes (1 and 2), so 1 is added to ytest to match
percentOfCorrect = 100*(sum(ytest == nearestFive)/100)
percentOfCorrect
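## Quick check (not in the original write-up): confusion matrix for the k = 5 classifier
table(actual = ytest, predicted = nearestFive)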
References
- Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781118981863/
- Alice, M. (2015). How to perform a logistic regression in R. Retrieved from http://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
- Berson, A., Smith, S., & Thearling, K. (1999). Building Data Mining Applications for CRM. McGraw-Hill. Retrieved from http://www.thearling.com/text/dmtechniques/dmtechniques.htm
- Taddy, M. (n.d.). credit.R: German Credit Data. Retrieved from http://faculty.chicagobooth.edu/matt.taddy/teaching/credit.R
- UCLA: Statistical Consulting Group. (2007). R data analysis examples: Logit Regression. Retrieved from http://www.ats.ucla.edu/stat/r/dae/logit.htm
- R (n.d.a.). K-Means Clustering. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
- R (n.d.b.). k-Nearest Neighbour Classification. Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/class/html/knn.html