Adv Quant: Association Rules in R

Introduction

Online radio services track everything a listener plays and use that history to recommend additional music. This dataset, with ~300,000 records from ~15,000 users, was mined with the arules package in R to find artist associations that could be used to recommend new music to this community of radio listeners.

Results

 5ip1.PNG

Figure 1. The output of the apriori command, which filtered the data for rules with a minimum support of 0.01, a minimum confidence of 0.5, and a maximum length of 3.

5ip2.PNG

Figure 2. The output of the apriori command for the first rule set (Figure 1), restricted to subsets of rules: (a) all rules with lift greater than 5, (b) all rules with confidence greater than 0.6, (c) all rules with support greater than 0.02 and confidence greater than 0.6, (d) all rules where Rihanna appears on the right-hand side, and (e) the ten rules with the largest lift.

 5ip3.PNG

Figure 3. The output of the apriori command, which filtered the data for rules with a minimum support of 0.001, a minimum confidence of 0.5, and a maximum length of 2.

5ip4.1.PNG5ip4.2.PNG5ip4.3

Figure 4. The output of the apriori command for the second rule set (Figure 3), restricted to subsets of rules: (a) all rules with lift greater than 5, (b) all rules with confidence greater than 0.6, (c) all rules with support greater than 0.02 and confidence greater than 0.6, (d) all rules where Rihanna appears on the right-hand side, and (e) the ten rules with the largest lift.

Discussion

The dataset contains 289,956 records from 15,001 unique users listening to 1,005 unique artists.  With a minimum support of 0.01, a minimum confidence of 0.5, and a maximum length of 3, apriori finds 48 rules.  Inspecting the first five rules (Figure 1) shows each rule with its corresponding support, confidence, and lift.  With a minimum support of 0.001, a minimum confidence of 0.5, and a maximum length of 2, apriori finds 93 rules, and the first five are shown in the same way (Figure 3).

Apriori counts itemsets across the transactions in the “playtrans” object.  According to Hahsler et al. (n.d.), the most commonly used constraints for apriori are support and confidence: the lower the minimum confidence or support, the more rules the algorithm generates.  The two rule sets illustrate this relationship, since the higher support threshold produced fewer rules.  Support is the proportion (%) of transactions in the data set that contain all the items in a rule, whereas confidence is the proportion (%) of transactions containing the left-hand side that also contain the right-hand side (Hahsler et al., n.d.).  The effect of varying only the support threshold can be seen in the number of rules in each subset (Figure 2 & 4).  Lowering the support threshold increased the number of rules with Rihanna on the right-hand side (Figure 2d & 4d), and the same happened for the other subsets, even though a rule that appears in both rule sets keeps the same support, confidence, and lift.

Finally, the greater the lift value, the stronger the association rule (Hahsler et al., n.d.).  When the constraints were relaxed, higher lift values could be observed (Figure 1-4), because weaker constraints admit more rules, some of which have larger lift.  Comparing the top 10 lift values of both rule sets (Figure 2e and 4e), the top rule under the stricter constraints doesn’t appear in the top 10 under the relaxed constraints.  Under the stricter constraints (Figure 2e), the strongest association is that users who listen to “the pussycat dolls” are more likely to listen to “rihanna”.  Under the relaxed constraints (Figure 4e), the strongest association is that users who listen to “madvillain” are more likely to listen to “mf doom”, and this rule has a higher lift than the “the pussycat dolls”-“rihanna” rule.  Similar associations can be read from the data in Figures 1-4.
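To make the three measures concrete, here is a minimal base-R sketch that computes support, confidence, and lift by hand for one rule. The toy playlists and the `support()` helper are hypothetical and only for illustration; they are not the lastfm data or part of arules.

```r
# Toy transactions (hypothetical listeners and artists, not the lastfm data)
playlists <- list(
  u1 = c("a", "b"),
  u2 = c("a", "b", "c"),
  u3 = c("a", "c"),
  u4 = c("b"),
  u5 = c("a", "b")
)

# Proportion of transactions that contain every item in `items`
support <- function(items, trans) {
  mean(vapply(trans, function(tr) all(items %in% tr), logical(1)))
}

lhs <- "a"
rhs <- "b"

supp_rule <- support(c(lhs, rhs), playlists)      # support of the rule {a} => {b}
conf_rule <- supp_rule / support(lhs, playlists)  # share of "a" listeners who also play "b"
lift_rule <- conf_rule / support(rhs, playlists)  # confidence relative to the base rate of "b"

c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
```

A lift below 1, as in this toy case, means the two artists co-occur less often than independence would predict; the rules discussed above all have lift well above 1.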

 Code

setwd("C:/Users/fj998d/Documents/R/dataSets")

LastFM=read.csv("lastfm.csv", header = F, sep = ",") ## (Celma, 2009)

#

##

###—————————————————————————————————————-

## Variables: UserID = V1; ArtistID = V2; ArtistName = V3; PlayCount = V4

###—————————————————————————————————————-

## Apriori info (Hahsler, Grun, Hornik, & Buchta, n.d.):

##   Constraints for apriori are known as support and confidence; the lower the confidence or support, the more rules.

##     * Support is the proportion (%) of transactions in the data set that contain all the items in the rule.

##     * Confidence is the proportion (%) of transactions containing the left-hand side that also contain the right-hand side.

##   The greater the lift, the stronger the association rule; lift measures the deviation of the rule's

##   support from the support expected under independence.

##   Other constraints used

##     * Max length defines the maximum number of items in a mined rule.

###—————————————————————————————————————-

##

#

head(LastFM)

length(LastFM$V1)

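length(unique(LastFM$V1)) # number of unique users (works whether or not V1 was read in as a factor)

length(unique(LastFM$V2)) # number of unique artists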

## arules package for association rules

install.packages("arules")

library(arules)

## Computational environment for mining association rules and frequent item sets

## we need to manipulate the data a bit before using arules: we split the data in the vector

## x into groups defined in vector f. (Hahsler, Grun, Hornik, & Buchta, n.d.)

playlists = split(x=LastFM[,"V2"],f=LastFM$V1) # Split the artists into a list with one element per user (R, n.d.c.)

playlists = lapply(playlists,unique)           # Remove duplicate artists within each user's playlist (R, n.d.a.; R, n.d.b.)

playtrans = as(playlists,"transactions")       # Coerce the list into an arules transactions object

## Create association rules with a support of 0.01 and confidence of 0.5, with a max length of 3

## which will show the support that listening to one artist gives to other artists; in other words,

## providing lift to an associated artist.

musicrules = apriori(playtrans, parameter=list(support=0.01, confidence=0.5, maxlen=3)) # filter the data for rules

musicrules

inspect(musicrules[1:5])

## Choose any subset

inspect(subset(musicrules, subset=lift>5))                        # tell me all the rules with a lift > 5

inspect(subset(musicrules, subset=confidence>0.6))                # tell me all the rules with a confidence greater than 0.6

inspect(subset(musicrules, subset=support>0.02 & confidence>0.6)) # tell me the rules with support > 0.02 and confidence > 0.6

inspect(subset(musicrules, subset=rhs%in%"rihanna"))              # tell me all the rules with rihanna on the right-hand side

inspect(head(musicrules, n=10, by="lift"))                        # tell me the top 10 rules with the largest lift

## Create association rules with a support of 0.001 and confidence of 0.5, with a max length of 2

artrules = apriori(playtrans, parameter=list(support=0.001, confidence=0.5, maxlen=2)) # filter the data for rules

artrules

inspect(artrules[1:5])

 ## Choose any subset

inspect(subset(artrules, subset=lift>5))

inspect(subset(artrules, subset=confidence>0.6))

inspect(subset(artrules, subset=support>0.02& confidence >0.6))

inspect(subset(artrules, subset=rhs%in%"rihanna"))

inspect(head(artrules, n=10, by="lift"))

## Write down all the rules into a CSV file for co

write(musicrules, file="musicRulesFromApriori.csv", sep = ",", col.names = NA)

write(artrules, file="artistRulesFromApriori.csv", sep = ",", col.names = NA)

 Reference

Adv Quant: Decision Trees in R

Classification, Regression, and Conditional Tree Growth Algorithms

The variables used for tree growth algorithms are the log of benign prostatic hyperplasia amount (lbph), log of prostate-specific antigen (lpsa), Gleason score (gleason), log of capsular penetration (lcp) and log of the cancer volume (lcavol) to understand and predict tumor spread (seminal vesicle invasion=svi).

Results

5db3f1.PNG

Figure 1: Visualization of cross-validation results, for the classification tree (left) and regression tree (right).

5db3f2

Figure 2: Classification tree (left), regression tree (center), and conditional tree (right).

5db3f3.PNG

Figure 3: Summarization of tree data: (a) classification tree, (b) regression tree, and (c) conditional tree.

Discussion

For the classification tree growth algorithm, the response variable is seminal vesicle invasion, which indicates the tumor spread in this dataset.  The cross-validation results show that there is only one split in the tree, with a cross-validated relative error for the first split of 0.71429 (Figure 1 & Figure 3a) and a cross-validated standard deviation of 0.16957 (Figure 3a).  The variable used to split the tree was the log of capsular penetration (Figure 2), with the split at a log of capsular penetration < 1.791.

Next, for the regression tree growth algorithm, there are three leaf nodes, because the algorithm split the data twice.  In this case, the relative error for the first split is 1.00931 with a standard deviation of 0.18969, and for the second split the relative error is 0.69007 with a standard deviation of 0.15773 (Figure 1 & Figure 3b).  The tree was split first on the log of capsular penetration at < 1.791, and then on the log of prostate-specific antigen at < 2.993 (Figure 2).  It is interesting that the first split occurred at the same value for these two different tree growth algorithms, but that the relative errors and standard deviations were different and that the regression tree created one more level.

Finally, the conditional tree growth algorithm produced a split at < 1.749 on the log of capsular penetration at the 0.001 significance level and at < 2.973 on the log of prostate-specific antigen, also at the 0.001 significance level (Figure 2 & Figure 3c).  The results are similar to the regression tree, with the same number of leaf nodes and similar split values, but more information is gained from the conditional tree growth algorithm than from the classification and regression tree growth algorithms, because it also reports the significance level of each split.
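The cross-validated error and standard deviation quoted above come from the complexity-parameter table that rpart stores on the fitted tree. The sketch below shows one way to read that table and prune to the row with the smallest cross-validated error. It uses the `kyphosis` data bundled with rpart purely as a stand-in (an assumption, so the example runs without lasso2), not the Prostate data.

```r
library(rpart)  # rpart is normally installed with R as a recommended package

set.seed(1)     # the cross-validation folds are assigned at random
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

cp_table <- fit$cptable   # the same table that printcp() displays
cp_table                  # columns: CP, nsplit, rel error, xerror, xstd

# Pick the row with the smallest cross-validated error and prune to its cp
best   <- which.min(cp_table[, "xerror"])
pruned <- prune(fit, cp = cp_table[best, "CP"])
```

Choosing the minimum `xerror` row is one common convention; another is the simplest tree within one `xstd` of that minimum.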

Code

#

### ———————————————————————————————————-

## Use the prostate cancer dataset available in R, in which biopsy results are given for 97 men.

## Goal:  Predict tumor spread in this dataset of 97 men who had undergone a biopsy.

## The measures to be used for prediction are BPH=lbph, PSA=lpsa, Gleason Score=gleason, CP=lcp,

## and cancer volume=lcavol.

### ———————————————————————————————————-

##

install.packages("lasso2")

library(lasso2)

data("Prostate")

install.packages("rpart")

library(rpart)

## Grow a classification tree

classification = rpart(svi~lbph+lpsa+gleason+lcp+lcavol, data=Prostate, method="class")

printcp(classification) # display the results

plotcp(classification)  # visualize cross-validation results

plot(classification, uniform = T, main="Classification Tree for prostate cancer") # plot tree

text(classification, use.n = T, all = T, cex=.8)                                  # create text on the tree

## Grow a regression tree

Regression = rpart(svi~lbph+lpsa+gleason+lcp+lcavol, data=Prostate, method="anova")

printcp(Regression) # display the results

plotcp(Regression)  # visualize cross-validation results

plot(Regression, uniform = T, main="Regression Tree for prostate cancer") # plot tree

text(Regression, use.n = T, all = T, cex=.8)                              # create text on the tree

install.packages("party")

library(party)

## Grow a conditional inference tree

conditional = ctree(svi~lbph+lpsa+gleason+lcp+lcavol, data=Prostate)

conditional # display the results

plot(conditional, main="Conditional inference tree for prostate cancer")

References

Adv Quant: Bayesian analysis in R

Introduction

Bayes’ theory is a conditional probability that takes into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015).  The formulation of Bayes’ theory is $p(\theta \mid y) = \dfrac{p(\theta)\, p(y \mid \theta)}{\sum_{\theta} p(\theta)\, p(y \mid \theta)}$, where $p(\theta)$ are the prior probabilities and $p(y \mid \theta)$ are the likelihoods (Cowles, Kass, & O’Hagan, 2009).
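As a small worked instance of this formula, the sketch below applies it to a two-state θ. The prior and likelihood values are assumed purely for illustration and are not taken from the flight data.

```r
# Hypothetical two-state example: theta is either "delayed" or "ontime"
prior      <- c(delayed = 0.25, ontime = 0.75)  # p(theta), assumed values
likelihood <- c(delayed = 0.60, ontime = 0.20)  # p(y | theta), assumed values

# p(theta | y): prior times likelihood, normalized over all states of theta
posterior <- prior * likelihood / sum(prior * likelihood)
posterior
```

With these numbers both products equal 0.15, so the evidence moves the probability of "delayed" from the prior 0.25 up to 0.5.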

The Delayed Airplanes Dataset consists of airplane flights from Washington D.C. into New York City.  The date range for this data is for the entire month of February 2016, and there are 702 cases to be studied.

Results

4ip1.PNG

Figure 1: Histogram showcasing the density of flight delays that are 15 minutes or longer.

4ip2.PNG

Figure 2: Shows summary data for the variables in this Bayesian Analysis before training and testing.

4ip3.PNG

Figure 3: Bayesian Prediction of the flight delay data from Washington, D.C. to New York City, NY.

4ip4

Figure 4: Bayesian prediction results versus the test data results, where false negatives are encircled in blue, while false positives are encircled in red.

Discussion

The histogram (Figure 1) shows that there are almost three times as many on-time departures as delayed ones for flights from Washington, D.C. to New York City, NY.  The summary data confirm this (Figure 2).

The above summary (Figure 2) states that 77.813% of the flights were not delayed by 15 minutes or more, for the cases we do have data on. There is missing data in the departure time, delayed 15 minutes or more, and weather delay variables.  To know the percentage of flights per day of the week, carrier, destination, etc., the prior probabilities need to be calculated, as is done in the code below.

About 77.2973% of the training data didn’t have a delay, but 22.7027% did have a delay of 15 minutes or more (from the tdelay variable).  These values are close to those in the summary above (Figure 2). Thus the training data could be trusted, even though a random sampling wasn’t taken.  The reason for not taking a random sampling is to be able to predict into the future, given 60% of the data is already collected.

Comparing both histograms (Figure 1 and Figure 3), the distribution of the first histogram is binomial.  However, the posterior distribution, shown in the second histogram, is shaped like a positively skewed distribution.  This was an expected result described by Smith (2015), who states that the prior distribution has an effect on the posterior distribution.

The Bayesian prediction results tend to produce many false negatives compared to the real data, thus indicating more type II error than type I error.  From the code below, the proportion of test cases that are misclassified when the posterior probability is rounded at 0.5 is 15.302%.
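The false-positive and false-negative counts can be read off a confusion table of actual outcomes against thresholded predictions. The sketch below uses made-up outcomes and probabilities (hypothetical values, not the flight data) and a 0.5 threshold.

```r
# Hypothetical actual outcomes and posterior probabilities (not the flight data)
actual <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0)
prob   <- c(0.1, 0.2, 0.6, 0.3, 0.4, 0.7, 0.2, 0.1, 0.3, 0.4)

predicted <- as.integer(prob >= 0.5)   # threshold the probabilities at 0.5

# Fix both levels so the table is always 2 x 2, even if a class is never predicted
confusion <- table(actual    = factor(actual,    levels = 0:1),
                   predicted = factor(predicted, levels = 0:1))
confusion

false_pos  <- confusion["0", "1"]  # predicted delayed, actually on time (type I)
false_neg  <- confusion["1", "0"]  # predicted on time, actually delayed (type II)
error_rate <- as.numeric(false_pos + false_neg) / length(actual)
```

Fixing the factor levels matters: if no case is predicted as 1, a plain `table()` would drop that column and indexing by position would fail or silently pick the wrong cell.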

Code

#

## Locate the data, filter out the data, and pull it into R from the computer (R, n.d.b.)

#

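setwd("C:/Users/XXX/Documents/R/dataSets")

airplaneData=read.csv("022016DC2NYC_1022370032_T_ONTIME.csv", header = T, sep = ",", stringsAsFactors = TRUE) # text columns (e.g. ORIGIN, DEST) must be factors for the cbind() calls below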

#

##

### ———————————————————————————————————-

##  Data Source: http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time

##        Dependent:   Departure Delay Indicator, 15 minutes or more (Dep_Del15)

##        Independent: Departure airports of Baltimore-BWI, Dulles-IAD, and Reagan-DCA (Origin)

##        Independent: Arrival airports of Newark-EWR, Kennedy-JFK, and LaGuardia-LGA (Dest)

##        Independent: Carriers (Carrier)

##        Independent: Hours of departure (Dep_Time)

##        Independent: Weather conditions (Weather_Delay)

##        Independent: Monday = 1, Tuesday = 2, …Sunday = 7 (Day_Of_Week)

### ———————————————————————————————————-

##  bayes theory => p(theta|y)= p(theta)*P(y|theta)/(SUM(P(theta)*P(y|theta))) (Cowles, Kass, & O’Hagan, 2009)

### ———————————————————————————————————-

##

#

## Create a data.frame

delay = data.frame(airplaneData)

## Factoring and labeling the variables (Taddy, n.d.)

delay$DEP_TIME = factor(floor(delay$DEP_TIME/100))

delay$DAY_OF_WEEK = factor(delay$DAY_OF_WEEK, labels = c("M", "T", "W", "R", "F", "S", "U"))

delay$DEP_DEL15 = factor(delay$DEP_DEL15)

delay$WEATHER_DELAY= factor(ifelse(delay$WEATHER_DELAY>=1,1,0)) # (R, n.d.a.)

delay$CARRIER = factor(delay$CARRIER, levels = c("AA","B6","DL","EV","UA"))

levels(delay$CARRIER) = c("American", "JetBlue", "Delta", "ExpressJet", "UnitedAir")

## Quick look at the data

delayed15 = as.numeric(levels(delay$DEP_DEL15)[delay$DEP_DEL15])

hist(delayed15, freq=F, main = "Histogram of Delays of 15 mins or longer", xlab = "time >= 15 mins (1) or time < 15 (0)")

summary(delay)

### Create the training and testing data (60/40%)

ntotal=length(delay$DAY_OF_WEEK)    # Total number of datapoints assigned dynamically

ntrain = sample(1:ntotal,floor(ntotal*(0.6))) # Row indices of the training cases (60% of the data, drawn with sample())

ntest = ntotal-floor(ntotal*(0.6))       # The number of test cases (40% of the data)

trainingData = cbind(delay$DAY_OF_WEEK[ntrain], delay$CARRIER[ntrain],delay$ORIGIN[ntrain],delay$DEST[ntrain],delay$DEP_TIME[ntrain],delay$WEATHER_DELAY[ntrain],delayed15[ntrain])

testingData  = cbind(delay$DAY_OF_WEEK[-ntrain], delay$CARRIER[-ntrain],delay$ORIGIN[-ntrain],delay$DEST[-ntrain],delay$DEP_TIME[-ntrain],delay$WEATHER_DELAY[-ntrain],delayed15[-ntrain])

## Partitioning the train data by half

trainFirst= trainingData[trainingData[,7]<0.5,]

trainSecond= trainingData[trainingData[,7]>0.5,]

### Prior probabilities = p(theta) (Cowles, Kass, & O’Hagan, 2009)

## Dependent variable: time delayed >= 15

tdelay=table(delayed15[ntrain])/sum(table(delayed15[ntrain]))

### Prior probabilities between the partitioned training data

## Independent variable: Day of the week (% of flights that occurred on each day of the week)

tday1=table(trainFirst[,1])/sum(table(trainFirst[,1]))

tday2=table(trainSecond[,1])/sum(table(trainSecond[,1]))

## Independent variable: Carrier (% of flights that occurred with each carrier)

tcarrier1=table(trainFirst[,2])/sum(table(trainFirst[,2]))

tcarrier2=table(trainSecond[,2])/sum(table(trainSecond[,2]))

## Independent variable: Origin (% of flights that occurred from each originating airport)

tOrigin1=table(trainFirst[,3])/sum(table(trainFirst[,3]))

tOrigin2=table(trainSecond[,3])/sum(table(trainSecond[,3]))

## Independent variable: Destination (% of flights that occurred to each destination airport)

tdest1=table(trainFirst[,4])/sum(table(trainFirst[,4]))

tdest2=table(trainSecond[,4])/sum(table(trainSecond[,4]))

## Independent variable: Departure Time (% of flights that occurred in each hour of the day)

tTime1=table(trainFirst[,5])/sum(table(trainFirst[,5]))

tTime2=table(trainSecond[,5])/sum(table(trainSecond[,5]))

## Independent variable: Weather (% flights delayed because of adverse weather conditions)

twx1=table(trainFirst[,6])/sum(table(trainFirst[,6]))

twx2=table(trainSecond[,6])/sum(table(trainSecond[,6]))

### likelihoods = p(y|theta) (Cowles, Kass, & O’Hagan, 2009)

likelihood1=tday1[testingData[,1]]*tcarrier1[testingData[,2]]*tOrigin1[testingData[,3]]*tdest1[testingData[,4]]*tTime1[testingData[,5]]*twx1[testingData[,6]]

likelihood2=tday2[testingData[,1]]*tcarrier2[testingData[,2]]*tOrigin2[testingData[,3]]*tdest2[testingData[,4]]*tTime2[testingData[,5]]*twx2[testingData[,6]]

### Predictions using bayes theory = p(theta|y)= p(theta)*P(y|theta)/(SUM(P(theta)*P(y|theta))) (Cowles, Kass, & O’Hagan, 2009)

Bayes=(likelihood2*tdelay[2])/(likelihood2*tdelay[2]+likelihood1*tdelay[1])

hist(Bayes, freq=F, main="Bayesian Analysis of flight delay data")

plot(delayed15[-ntrain]~Bayes, main="Bayes results versus actual results for flights delayed >= 15 mins", xlab="Bayes Analysis Prediction of which cases will be delayed", ylab="Actual results from test data showing delayed cases")

## Misclassification rate when the posterior probability is rounded at 0.5

densityMeasure = table(delayed15[-ntrain],floor(Bayes+0.5))

probabilityOfXlarger=(densityMeasure[1,2]+densityMeasure[2,1])/ntest

probabilityOfXlarger

References

Adv Quant: K-means classification in R

The explanatory variables in the logistic regression are both the type of loan and the borrowing amount.

4dbf1.PNG

Figure 1: The summary output of the logistic regression based on the type of loan and the borrowing amount.

The logistic regression shows statistical significance at the 0.01 level for the amount variable and for loans whose purpose is a used car or a radio/television (Figure 1).  Thus, keeping only these significant terms, the regression equation for the log-odds of default comes out to be:

logit(P(default)) = -0.9321 + 0.0001330(amount) - 1.56(purpose is for used car) - 0.6499(purpose is for radio/television)
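Because logistic regression models the log-odds, the inverse logit turns the right-hand side of this equation into a predicted probability of default. The sketch below wraps the reported coefficients in a hypothetical helper, `p_default()`; the loan amount of 5000 is an arbitrary illustrative input, and the other purpose categories are treated as zero because their coefficients are not part of the reduced equation.

```r
# Coefficients as reported in the reduced regression equation
b0       <- -0.9321
b_amount <-  0.0001330
b_used   <- -1.56
b_radio  <- -0.6499

# Predicted probability of default; usedcar and radiotv are 0/1 indicator flags
p_default <- function(amount, usedcar = 0, radiotv = 0) {
  plogis(b0 + b_amount * amount + b_used * usedcar + b_radio * radiotv)
}

p_default(5000)               # any purpose other than used car or radio/TV
p_default(5000, usedcar = 1)  # used car loan
p_default(5000, radiotv = 1)  # radio/television loan
```

Both purpose coefficients are negative, so for the same amount these two loan types carry a lower predicted probability of default, with used cars the lowest.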

4dbf2.PNG

Figure 2: The comparative output of the logistic regression prediction versus actual results.

When comparing the predictions to the actual values (Figure 2), the mean and minimum values of both are similar, but all other summary values are not. When the predicted values are rounded to the nearest whole number, the prediction accuracy is 73%.

K-means classification, on the 3 continuous variables: duration, amount, and installment.

In K-means classification the data is clustered by the Euclidean distance of each point to the cluster means (Ahlemeyer-Stubbe & Coleman, 2014).  In this exercise, there are two clusters. The cluster sizes are 825 (no defaults) and 175 (defaults), and the between-cluster sum of squares is 69.78% of the total sum of squares.  The matrix of cluster centers is shown below (Figure 3).
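A minimal sketch of reading the same quantities (cluster sizes, centers, and the between/total ratio) from a `kmeans()` fit follows. The data are simulated and hypothetical, and standardizing with `scale()` is an assumption added here rather than something the original analysis does; without it, a large-valued variable such as amount would dominate the Euclidean distance.

```r
set.seed(42)  # kmeans starts from randomly chosen centers

# Hypothetical data: two well-separated groups measured on three numeric variables
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 3),
           matrix(rnorm(60, mean = 5), ncol = 3))
colnames(x) <- c("duration", "amount", "installment")

km <- kmeans(scale(x), centers = 2, nstart = 10)

km$size                           # number of observations in each cluster
km$centers                        # cluster centers, in standardized units
ratio <- km$betweenss / km$totss  # share of total sum of squares between clusters
ratio
```

A ratio close to 1 indicates that most of the variation lies between the clusters rather than within them.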

4dbf3

Figure 3: K means center values, per variable

Cross-validation with k = 5 for the nearest neighbor.

K-nearest neighbor (k = 5) classifies a data point by having its 5 nearest neighbors vote on its class, and it is particularly useful if the outcome is binary or categorical (Berson, Smith, & Thearling, 1999).  In this exercise, the percentage of correct classifications on the test data is 69%.  However, logistic regression in this scenario produced a higher prediction rate of 73%; thus, for this exercise and this data set, logistic regression was more useful in predicting default than the k-nearest neighbor algorithm at k=5.
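The voting step can be sketched in a few lines of base R. `knn_vote()` below is a hypothetical helper written for illustration, not the `knn()` function from the class package, and the training points are made up.

```r
# Classify one point by majority vote among its k nearest training points
knn_vote <- function(train_x, train_y, new_x, k = 5) {
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))  # Euclidean distance to every training row
  nearest <- order(d)[seq_len(k)]                 # indices of the k closest rows
  names(which.max(table(train_y[nearest])))       # most frequent label among them
}

# Hypothetical training data: one numeric predictor and a binary label
train_x <- matrix(c(1, 2, 3, 4, 5, 20, 21, 22, 23, 24), ncol = 1)
train_y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

knn_vote(train_x, train_y, new_x = 4.5)
knn_vote(train_x, train_y, new_x = 19)
```

The helper returns the winning label as a character string; an odd k avoids tied votes when there are only two classes.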

Code

#

## The German credit data contains attributes and outcomes on 1,000 loan applications.

## Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data

## Metadata file: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc

#

## Reading the data from source and displaying the top five entries.

credits=read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", header = F, sep = " ")

head(credits)

#

##

### ———————————————————————————————————-

## The two outcomes are success (defaulting on the loan) and failure (not defaulting).

## The explanatory variables in the logistic regression are both the type of loan and the borrowing amount.

### ———————————————————————————————————-

##

#

## Defining and re-leveling the variables (Taddy, n.d.)

default = credits$V21 - 1 # set default true when = 2

amount = credits$V5

purpose = factor(credits$V4, levels = c("A40","A41","A42","A43","A44","A45","A46","A48","A49","A410"))

levels(purpose) = c("newcar", "usedcar", "furniture/equip", "radio/TV", "apps", "repairs", "edu", "retraining", "biz", "other")

## Create a new data frame called "cred" with the 3 defined variables (Taddy, n.d.)

credits$default = default

credits$amount  = amount

credits$purpose = purpose

cred = credits[,c("default","amount","purpose")]

head(cred[,])

summary(cred[,])

## Create a design matrix, such that factor variables are turned into indicator variables

Xcred = model.matrix(default~., data=cred)[,-1]

Xcred[1:5,]

## Creating training and prediction datasets: Select 900 rows for estimation and 100 for testing

set.seed(1)

train = sample(1:1000,900)

## Defining which x and y values in the design matrix will be for training and for testing

xtrain = Xcred[train,]

xtest = Xcred[-train,]

ytrain = cred$default[train]

ytest = cred$default[-train]

## logistic regression

datas=data.frame(default=ytrain,xtrain)

creditglm=glm(default~., family=binomial, data=datas)

summary(creditglm)

## Predicting default from the test data (Alice, 2015; UCLA: Statistical Consulting Group., 2007)

testdata=data.frame(default=ytest,xtest)

testdata[1:5,]

testingdata=testdata[,2:11] #removing the variable default from the data matrix

testingdata$defaultPrediction = predict(creditglm, newdata=testdata, type = "response")

results = data.frame(ytest,testingdata$defaultPrediction)

summary(results)

head(results,10)

## Percentage of correct predictions (must run after defaultPrediction has been created)

percentOfCorrect=100*(sum(ytest==round(testingdata$defaultPrediction))/100)

percentOfCorrect

#

##

### ———————————————————————————————————-

##  K-means classification, on the 3 continuous variables: duration, amount, and installment.

### ———————————————————————————————————-

##

#

install.packages("class")

library(class)

## Defining and re-leveling the variables (Taddy, n.d.)

default = credits$V21 - 1 # set default true when = 2

duration = credits$V2

amount = credits$V5

installment = credits$V8

## Create a new data frame called "creds" with the 4 defined variables (Taddy, n.d.)

credits$default = default

credits$amount  = amount

credits$installment = installment

credits$duration = duration

creds = credits[,c("duration","amount","installment","default")]

head(creds[,])

summary(creds[,])

## K means classification (R, n.d.a.)

kmeansclass= cbind(creds$default,creds$duration,creds$amount,creds$installment)

kmeansresult= kmeans(kmeansclass,2)

kmeansresult$cluster

kmeansresult$size

kmeansresult$centers

kmeansresult$betweenss/kmeansresult$totss

#

##

### ———————————————————————————————————-

##  Cross-validation with k = 5 for the nearest neighbor. 

### ———————————————————————————————————-

##

#

## Create a design matrix, such that factor variables are turned into indicator variables

Xcreds = model.matrix(default~., data=creds)[,-1]

Xcreds[1:5,]

## Creating training and prediction datasets: Select 900 rows for estimation and 100 for testing

set.seed(1)

train = sample(1:1000,900)

## Defining which x and y values in the design matrix will be for training and for testing

xtrain = Xcreds[train,]

xtest = Xcreds[-train,]

ytrain = creds$default[train]

ytest = creds$default[-train]

## K-nearest neighbor classification (R, n.d.b.)

nearestFive=knn(train = xtrain[,2,drop=F],test=xtest[,2,drop=F],cl=ytrain,k=5)

knnresults=cbind(ytest+1,nearestFive) # 1 is added to ytest because cbind converts the factor nearestFive to its integer codes (1 and 2 instead of 0 and 1).

percentOfCorrect=100*(sum(ytest==nearestFive)/100)

References

Adv Quant: Bayesian Analysis

Uncertainty in making decisions

Generalizing something that is specific from a statistical standpoint, is the problem of induction, and that can cause uncertainty in making decisions (Spiegelhalter & Rice, 2009). Uncertainty in making a decision could also arise from not knowing how to incorporate new data with old assumptions (Hubbard, 2010).

According to Hubbard (2010) conventional statistics assumes:

(1)    The researcher has no prior information about the range of possible values (which is never true) or,

(2)    The researcher does have prior knowledge about the distribution of the population, and it is never any of the messy ones (which is not true more often than not)

Thus, knowledge before data collection and the knowledge gained from data collection doesn’t tell the full story until they are combined, hence the need for Bayes’ analysis (Hubbard, 2010).  Bayes’ theory can be reduced to a conditional probability that aims to take into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003).  Bayesian analysis avoids overconfidence and underconfidence from ignoring prior data or ignoring new data (Hubbard, 2010), through the implementation of the equation below:

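$$P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{data} \mid \text{hypothesis})\, P(\text{hypothesis})}{P(\text{data})} \qquad (1)$$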

Where P(hypothesis|data) is the posterior probability, P(hypothesis) is the prior probability of the hypothesis/distribution before the data is introduced, P(data) is the marginal probability of the data, and P(data|hypothesis) is the likelihood of observing the data if the hypothesis/distribution is true (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003).  This forces the researcher to think about the likelihood that different and new observations could impact a current hypothesis (Hubbard, 2010). Equation (1) shows that evidence is usually a result of two conditional probabilities, where the strongest evidence comes from a low probability that the new data could have led to X (Yudkowsky, 2003).  From these two conditional probabilities, the resultant value is approximately the average of the prior assumptions and the new data gained (Hubbard, 2010; Smith, 2015).  Smith (2015) describes this approximation in the following simplified relationship (equation 2):

 eq5.PNG                                            (2)

Therefore, from equation (2), the type of prior assumption influences the posterior result. A Uniform, Constant, or Normal prior distribution results in a Normal posterior distribution, and a Beta or Binomial distribution results in a Beta posterior distribution (Smith, 2015).  To use Bayesian Analysis one must take into account the analysis’ assumptions.
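The Beta case can be illustrated with the standard conjugate update: a Beta(a, b) prior on a proportion combined with k successes in n binomial trials gives a Beta(a + k, b + n - k) posterior. The prior parameters and data in this sketch are assumed values chosen only for illustration.

```r
# Beta(a, b) prior on a proportion; all numbers here are assumed for illustration
a <- 2;  b <- 8     # prior mean a / (a + b) = 0.2
k <- 30; n <- 100   # hypothetical new data: 30 successes in 100 trials

# Conjugate update: the posterior is Beta(a + k, b + n - k)
a_post <- a + k
b_post <- b + (n - k)

prior_mean <- a / (a + b)
data_mean  <- k / n
post_mean  <- a_post / (a_post + b_post)

c(prior = prior_mean, data = data_mean, posterior = post_mean)
```

The posterior mean lands between the prior mean and the observed proportion, weighted toward whichever carries more information, which is the sense in which the prior shapes the posterior.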

Basic Assumptions of Bayesian Analysis

Though these three assumptions are great to have for Bayesian Analysis, it has been argued that they are quite unrealistic for real-life data, particularly unstructured text-based data (Lewis, 1998; Turhan & Bener, 2009):

  • Each of the new data samples is independent of the others and identically distributed (Lewis, 1998; Nigam & Ghani, 2000; Turhan & Bener, 2009)
  • Each attribute has equal importance (Turhan & Bener, 2009)
  • The new data is compatible with the target posterior (Nigam & Ghani, 2000; Smith, 2015).

Applications of Bayesian Analysis

There are typically three main situations where Bayesian Analysis is used (Spiegelhalter & Rice, 2009):

  • Small data situations: The researcher has no choice but to include prior quantitative information, because of a lack of data, or lack of a distribution model.
  • Moderate size data situations: The researcher has multiple sources of data and can create a hierarchical model on the assumption of similar prior distributions.
  • Big data situations: There are huge joint probability models, with 1000s of data points or parameters, which can then be used to help make inferences about unknown aspects of the data.

Pros and Cons

Applying Bayesian Analysis to data has its advantages and disadvantages.  The advantages and disadvantages of Bayesian Analysis as identified by SAS (n.d.) are:

Advantages

+    Allows for a combination of prior information with data, for stronger decision-making

+    No reliance on asymptotic approximation, thus the inferences are conditional on the data

+    Provides easily interpretable results.

Disadvantages

– Posteriors are heavily influenced by their priors.

– This method doesn’t help the researcher to select the proper prior, given how much influence it has on the posterior.

– Computationally expensive with large data sets.

The key takeaway from this discussion is that the prior knowledge can heavily influence the posterior, which can easily be seen in equation (2).  That is because knowledge before data collection and the knowledge gained from data collection doesn’t tell the full story unless they are combined.

Reference

Adv Quant: Logistic Regression in R

Introduction

The German credit data contains attributes and outcomes on 1,000 loan applications. The data are available at this Web site, where datasets are provided for the machine learning community.

Results

IP3F1.PNG

Figure 1: Image shows the first six entries in the German credit data.

IP3F2.png

Figure 2: Matrix scatter plot, showing the pairwise relationships between all the variables within the German credit data.

IP3F3.png

Figure 3: A summary of the credit data with the variables of interest.

IP3F3.png

Figure 4: Shows the entries in the design matrix, which will be used for the logistic regression analysis.

IP3F4

Figure 5: Summarized logistic regression information based on the training data.

IP3F6.1.pngIP3F6.2.png

Figure 6: The coefficients’ confidence intervals at the 95% level using log-likelihood values, with the values to the right based on the standard errors.

IP3F7.png

Figure 7: Wald Test statistic to test the significance level of the entire ranked variable.

IP3F8.png

Figure 8: The odds ratio for each independent variable along with the 95% confidence interval for those odds ratios.

IP3F9.png

Figure 9: Part of the summarized test data set for the logistic regression model.

IP3F10.png

Figure 10: The ROC curve, which illustrates the false positive rate versus the true positive rate of the prediction model.

Discussion

The results from Figure 1 mean that the data needed to be formatted before any analysis could be conducted.  Hence, the variable definitions in the Code section were needed to redefine the variables in the German data set.  Given the data output (Figure 1), the matrix scatter plot (Figure 2) shows that duration, amount, and age are continuous variables, while the other five variables are factor variables, which have categorized scatterplots.  Even though installment and default show box plot data in the summary (Figure 3), they weren’t factored like history, purpose, or rent, and thus they don’t show counts.  From the count data (Figure 3), about 30% of the loans are for cars, whereas 28% are for TVs.  In this German credit data, about 82% of those asking for credit do not rent, and about 53% of borrowers have an ok credit history, with 29.3% having a horrible credit history.  The mean default rate is 30%.

Variables (Figure 5) that have statistical significance at the 0.10 level include duration, amount, installment, age, history (per category), rent, and some of the purpose categories.  Though it is preferable to see a large difference between the null deviance and the residual deviance, there is still a difference.  The 95% confidence intervals for the logistic regression coefficients don’t show much spread from their central tendencies (Figure 6).  Thus, the logistic regression model is (from Figure 5):

IP3F11.PNG

The odds ratio measures the constant strength of association between the independent and dependent variables (Huck, 2011; Smith, 2015).  This is similar to the correlation coefficient (r) and coefficient of determination (r²) values for linear regression.  According to UCLA: Statistical Consulting Group (2007), if the p-value is less than 0.05, then the overall effect of the ranked term is statistically significant (Figure 7), which is the case for all three main terms.  The odds ratio data (Figure 8) are promising, as values closer to one are desirable for this prediction model (Field, 2013).  If the value of the odds ratio is greater than 1, then as the independent variable value increases, so do the odds that the dependent variable outcome (Y = n) occurs, and vice versa (Field, 2013).
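To make the odds ratio interpretation above concrete, here is a minimal sketch using hypothetical counts (they are not taken from the German credit data). It computes the odds ratio by hand from a 2×2 table and then recovers the same value from exp(coef()) of a logistic regression, which is how the Code section obtains its odds ratios.

```r
# Hypothetical 2x2 counts, NOT from the German credit data
#               default  no default
# renter             30          70
# non-renter         20         180
tab <- matrix(c(30, 20, 70, 180), nrow = 2,
              dimnames = list(rent = c("yes", "no"), default = c("yes", "no")))

# Odds of default within each group, then their ratio
odds_renter     <- tab["yes", "yes"] / tab["yes", "no"]
odds_non_renter <- tab["no", "yes"]  / tab["no", "no"]
odds_ratio      <- odds_renter / odds_non_renter

# The same quantity from a logistic regression on the grouped counts
d   <- data.frame(rent = c(1, 1, 0, 0), default = c(1, 0, 1, 0),
                  n = c(30, 70, 20, 180))
fit <- glm(default ~ rent, family = binomial, weights = n, data = d)
model_or <- unname(exp(coef(fit))["rent"])

print(c(by_hand = odds_ratio, from_glm = model_or))
```

An odds ratio above 1 here means the odds of default are higher for renters than for non-renters in this made-up table.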

Moving into the testing phase of the logistic regression model, the 100-case test data set needs to be extracted, and whether or not there will be a default on each loan is predicted.  Comparing the training and the test data sets, the maximum values are not the same for the duration and amount of the loan.  All other variables and statistical distributions are similar between the training and the test data.  Thus, the random sampling algorithm in R was effective.

The area underneath the ROC curve (Figure 10) is 0.6994048, which is closer to 0.50 than it is to one; thus this regression does better than pure chance, but it is far from perfect (Alice, 2015).

In conclusion, the regression model has an area under the ROC curve of 0.699, and the purpose, history, and rent ranked categorical variables were statistically significant as a whole.  Therefore, the logistic regression on these eight variables shows more promise than pure chance in predicting who will and will not default on their loan.
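The area under the ROC curve can also be read as the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. The sketch below computes it in base R from ranks; the function name auc_by_rank and the six scores are hypothetical, and this is an illustration of the idea rather than the ROCR computation used in the Code section.

```r
# Rank-based AUC: equivalent to the Mann-Whitney U statistic scaled to [0, 1]
auc_by_rank <- function(scores, labels) {
  r     <- rank(scores)            # average ranks are used for tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true outcomes (1 = default)
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c(1,   1,   1,    0,   0,   0)

auc_example <- auc_by_rank(scores, labels)
print(auc_example)
```

Of the nine defaulter/non-defaulter pairs in this toy example, eight are ordered correctly, so the value is 8/9; a value of 0.5 would correspond to random ordering.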

Code

#

## The German credit data contains attributes and outcomes on 1,000 loan applications.

##    •   You need to use random selection for 900 cases to train the program, and then the other 100 cases will be used for testing.

##    •   Use duration, amount, installment, and age in this analysis, along with loan history, purpose, and rent.

### ———————————————————————————————————-

## Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data

## Metadata file: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc

#

#

## Reading the data from source and displaying the top six entries.

#

credits = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", header = FALSE, sep = " ")

head(credits)

#

## Defining the variables (Taddy, n.d.)

#

default = credits$V21 - 1 # set default true when = 2

duration = credits$V2

amount = credits$V5

installment = credits$V8

age = credits$V13

history = factor(credits$V3, levels = c("A30", "A31", "A32", "A33", "A34"))

purpose = factor(credits$V4, levels = c("A40","A41","A42","A43","A44","A45","A46","A48","A49","A410"))

rent = factor(credits$V15=="A151") # renting status only

# rent = factor(credits$V15 , levels = c("A151","A152","A153")) # full property status

#

## Re-leveling the variables (Taddy, n.d.)

#

levels(history) = c("great", "good", "ok", "poor", "horrible")

levels(purpose) = c("newcar", "usedcar", "furniture/equip", "radio/TV", "apps", "repairs", "edu", "retraining", "biz", "other")

# levels(rent) = c("rent", "own", "free") # full property status

#

## Create a new matrix called “cred” with the 8 defined variables (Taddy, n.d.)

#

credits$default = default

credits$duration= duration

credits$amount  = amount

credits$installment = installment

credits$age     = age

credits$history = history

credits$purpose = purpose

credits$rent    = rent

cred = credits[,c("default","duration","amount","installment","age","history","purpose","rent")]

#

##  Plotting & reading to make sure the data was transferred correctly into this dataset and present summary stats (Taddy, n.d.)

#

plot(cred)

cred[1:3,]

summary(cred[,])

#

## Create a design matrix, such that factor variables are turned into indicator variables

#

Xcred = model.matrix(default~., data=cred)[,-1]

Xcred[1:3,]

#

## Creating training and prediction datasets: Select 900 rows for estimation and 100 for testing

#

set.seed(1)

train = sample(1:1000,900)

## Defining which x and y values in the design matrix will be for training and for testing

xtrain = Xcred[train,]

xnew = Xcred[-train,]

ytrain = cred$default[train]

ynew = cred$default[-train]

#

## logistic regression

#

datas=data.frame(default=ytrain,xtrain)

creditglm=glm(default~., family=binomial, data=datas)

summary(creditglm)

#

## Confidence Intervals (UCLA: Statistical Consulting Group, 2007)

#

confint(creditglm)

confint.default(creditglm)

#

## Overall effect of the rank using the wald.test function from the aod library (UCLA: Statistical Consulting Group, 2007)

#

install.packages("aod")

library(aod)

wald.test(b=coef(creditglm), Sigma = vcov(creditglm), Terms = 6:9) # for all ranked terms for history

wald.test(b=coef(creditglm), Sigma = vcov(creditglm), Terms = 10:18) # for all ranked terms for purpose

wald.test(b=coef(creditglm), Sigma = vcov(creditglm), Terms = 19) # for the ranked term for rent

#

## Odds Ratio for model analysis (UCLA: Statistical Consulting Group, 2007)

#

exp(coef(creditglm))

exp(cbind(OR=coef(creditglm), confint(creditglm))) # odds ratio next to the 95% confidence interval for odds ratios

#

## Predicting default from the test data (Alice, 2015; UCLA: Statistical Consulting Group., 2007)

#

newdatas=data.frame(default=ynew,xnew)

newestdata=newdatas[,2:19] #removing the variable default from the data matrix

newestdata$defaultPrediction = predict(creditglm, newdata=newestdata, type = "response")

summary(newdatas)

#

## Plotting the true positive rate against the false positive rate (ROC Curve) (Alice, 2015)

#

install.packages("ROCR")

library(ROCR)

pr  = prediction(newestdata$defaultPrediction, newdatas$default)

prf = performance(pr, measure="tpr", x.measure="fpr")

plot(prf)

## Area under the ROC curve (Alice, 2015)

auc= performance(pr, measure = "auc")

auc= auc@y.values[[1]]

auc # The closer this value is to 1 the better; 0.5 is no better than chance

 

 

References

Adv Quant: More on Logistic Regression

Logistic regression is another flavor of multi-variable regression, where one or more independent variables, which are continuous or categorical, are used to predict a dichotomous/binary/categorical dependent variable (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).  Zheng and Agresti (2000) define predictive power as a measure that helps compare competing regressions by analyzing the importance of the independent variables.  For linear regression and multiple linear regression, the correlation coefficient and coefficient of determination are adequate measures of predictive power (Field, 2013; Zheng & Agresti, 2000).  Collecting more data could yield stronger predictive power (Field, 2013).  Predictive power is used to sell the relationships between variables to management (Ahlemeyer-Stubbe & Coleman, 2014).

For logistic regression, the predictive power of the independent variables can be evaluated by the concept of the odds ratio for each independent variable (Huck, 2011). Field (2013) and Schumacker (2014) explained that when the logistic regression is calculated, the categorical/binary variables are transformed into ln(odds ratio) and a regression is then performed on this newly scaled variable (scale factor seen in equation 1):

eq1.PNG                                                 (1)

Since the probability of one categorical outcome varies between 0 and 0.999…, the odds ratio value can vary between 0 and 999.999… (Schumacker, 2014).  If the value of the odds ratio is greater than 1, then as the independent variable value increases, so do the odds that the dependent variable outcome (Y = n) occurs, and vice versa (Field, 2013).  Thus, the odds ratio measures the constant strength of association between the independent and dependent variables (Huck, 2011; Smith, 2015).  Due to this ln(odds ratio) transformation, logistic regression should be used for binary outcomes.
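Equation (1) is an image in the source and is not reproduced here. As a hedged reading, a standard definition of the odds that is consistent with the ranges quoted above (a probability of 0.999 gives odds of 999) is:

```latex
\text{odds} = \frac{P(Y)}{1 - P(Y)},
\qquad
\ln(\text{odds}) = \ln\!\left(\frac{P(Y)}{1 - P(Y)}\right)
```

For example, P(Y) = 0.5 gives odds of 1 and ln(odds) of 0, while P(Y) = 0.999 gives odds of 0.999 / 0.001 = 999.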

Field (2013) and Schumacker (2014) further explained that, given that this ln(odds ratio) transformation needs to be made on the variables, the way to predict categorical outcomes from the regression formula (2),

  eq2.PNG                                            (2)

is best expressed as the probability of the categorical outcome value one is trying to calculate:

eq3                                                (3)

The probability equation (3) can be expressed in multiple ways through typical algebraic manipulations.  Thus, the probability/likelihood of the dependent variable Y is defined between 0 and 100%, and the odds ratio is used to discuss the strength of these relationships.
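The algebraic equivalence mentioned above can be checked numerically. In this sketch the intercept a, slope b1, and predictor values x1 are hypothetical, and the two closed forms are the standard logistic expressions, assumed here because equations (2) and (3) are images in the source.

```r
# Hypothetical coefficients and predictor values (for illustration only)
a  <- -1.5
b1 <- 0.04
x1 <- c(10, 40, 70)

# Linear predictor on the ln(odds) scale
log_odds <- a + b1 * x1

# Two algebraically equivalent ways to turn ln(odds) into a probability
p_manual  <- 1 / (1 + exp(-log_odds))
p_alt     <- exp(log_odds) / (1 + exp(log_odds))

# Base R's built-in logistic function gives the same result
p_builtin <- plogis(log_odds)

# Converting the probability back to odds recovers the linear predictor
odds <- p_manual / (1 - p_manual)

print(data.frame(x1, log_odds, probability = p_manual, odds))
```

Every probability stays strictly between 0 and 1, no matter how large or small the linear predictor becomes.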

References

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
  • Gall, M. D., Gall, J. P., & Borg, W. R. (2006). Educational Research: An Introduction, 8th Edition. [VitalSource Bookshelf Online].
  • Huck, S. W. (2011). Reading Statistics and Research, 6th Edition. [VitalSource Bookshelf Online].
  • Schumacker, R. E. (2014). Learning Statistics Using R, 1st Edition. [VitalSource Bookshelf Online].
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
  • Zheng, B., & Agresti, A. (2000). Summarizing the predictive power of a generalized linear model. Retrieved from http://www.stat.ufl.edu/~aa/articles/zheng_agresti.pdf

Adv Quant: Logistic Vs Linear Regression

To generalize the results of research, the insights gained from a sample of data need to rest on the correct mathematical procedures for using probabilities and information, that is, statistical inference (Gall et al., 2006; Smith, 2015).  Gall et al. (2006) stated that statistical inference dictates the order of procedures; for instance, a hypothesis and a null hypothesis must be defined before a statistical significance level, which in turn must be defined before calculating a z or t statistic value.  Essentially, statistical inference allows quantitative researchers to make inferences about a population.  Researchers must remember where and how the data on that population were generated and collected during the quantitative research process.  The order of procedures is important when applying statistical inference to regressions; if it is not followed, the prediction formula will not be generalizable.

Logistic regression is another flavor of multi-variable regression, where one or more independent variables, which are continuous or categorical, are used to predict a dichotomous/binary/categorical dependent variable (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).  Logistic regression is an alternative to linear regression, which assumes all variables are continuous (Ahlemeyer-Stubbe & Coleman, 2014).  Both multi-variable linear regression and logistic regression share the general formula (Field, 2013; Schumacker, 2014):

Y = a + b1X1 + b2X2 + …                                                       (1)

The main difference between these two regressions is that the variables in equation (1) represent different types of dependent (Y) and independent variables (Xi).  These different types of variables may have to undergo a transformation before the regression analysis begins (Field, 2013; Schumacker, 2014).  Due to the difference in the types of variables between logistic and linear regression, the assumptions on when to use either regression are also different (Table 1).

Table 1: Summary of the types of assumptions and variables used in both logistic and linear regression, created from Ahlemeyer-Stubbe & Coleman (2014), Field (2013), Gall et al. (2006), Huck (2011), and Schumacker (2014).

Assumptions of Logistic Regression

  • Multicollinearity should be minimized between the independent variables
  • There is no need for linearity between the dependent and independent variables
  • Normality only on the continuous independent variables
  • No need for homogeneity of variance within the categorical variables
  • Error terms are not normally distributed
  • Independent variables don’t have to be continuous
  • There are no missing data (no null values)
  • Variance that is not zero

Assumptions for Linear Regression

  • Multicollinearity should be minimized between the multiple independent variables
  • Linearity exists between all variables
  • Additivity (for multi-variable linear regression)
  • Errors in the dependent variable and its predicted values are independent and uncorrelated
  • All variables are continuous
  • Normality on all variables
  • Normality on the error values
  • Homogeneity of variance
  • Homoscedasticity: the variance of the residuals is constant
  • Variance that is not zero

Variable Types of Logistic Regression

  • 2 or more independent variables
  • Independent variables: continuous, dichotomous, binary, or categorical
  • Dependent variable: dichotomous, binary

Variable Types of Linear Regression

  • 1 or more independent variables
  • Independent variables: continuous
  • Dependent variables: continuous
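The practical consequence of the different dependent-variable types can be sketched in R. The example below uses the built-in mtcars data (an assumption made for illustration; it is not a dataset from this post) and fits both a linear and a logistic regression to the binary variable am.

```r
# Built-in data: am is binary (0 = automatic, 1 = manual), wt is continuous
data(mtcars)

# Same right-hand side, two different regression types
linear_fit   <- lm(am ~ wt, data = mtcars)
logistic_fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Hypothetical new cars spanning light to very heavy weights
new_cars <- data.frame(wt = c(1.5, 3.0, 5.5))

linear_pred   <- predict(linear_fit, newdata = new_cars)
logistic_pred <- predict(logistic_fit, newdata = new_cars, type = "response")

print(data.frame(new_cars, linear_pred, logistic_pred))
```

The linear fit treats the 0/1 outcome as continuous and can predict values outside the 0 to 1 range (here, a negative value for the heaviest car), whereas the logistic fit always returns a probability.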

References

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
  • Gall, M. D., Gall, J. P., & Borg, W. R. (2006). Educational Research: An Introduction, 8th Edition. [VitalSource Bookshelf Online].
  • Huck, S. W. (2011). Reading Statistics and Research, 6th Edition. [VitalSource Bookshelf Online].
  • Schumacker, R. E. (2014). Learning Statistics Using R, 1st Edition. [VitalSource Bookshelf Online].
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html

Adv Quant: Overfitting & Parsimony

Overfitting and Parsimony

Overfitting a regression model is stuffing it with so many variables that have little contributional weight in predicting the dependent variable (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2014).  Thus, to avoid the overfitting problem, the use of parsimony is important in big data analytics.  Parsimony is describing a dependent variable with as few independent variables as possible (Field, 2013; Huck, 2013; Smith, 2015).  The best way to describe this is to apply the “Keep It Simple, Sweetie” concept to the regression model.  The concept of parsimony can be attributed to Occam’s Razor, which states that “plurality should never be posited without necessity” (Duignan, 2015).  Vandekerckhove et al. (2014) describe parsimony as a way of removing the noise from the signal to create better predictive regression models.

Overfitting in General Least Squares Model (GLM)

For multivariate regressions, a correlation matrix could be computed on all the variables to help with identifying parsimony, such that the software will try to maximize the correlation while minimizing the number of variables (Field, 2013).  Smith (2015) stated that the proportion of explained variation should remain high and that the correlation between the separate independent variables should be as low as possible.  If the correlation coefficient between the independent variables is high (0.8 or higher), then there is a chance that there are extraneous variables (Smith, 2015).  Another technique to achieve parsimony is the backward stepwise method, which runs a regression model with all variables and removes those variables that don’t contribute significantly to the model.  Alternatively, in a forward stepwise method, the model starts with one variable and adds variables until it has maximized correlation and variance (Field, 2013; Huck, 2013).
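The 0.8 correlation screen described above can be sketched in a few lines of R. The built-in mtcars data and the four chosen predictors are assumptions made for illustration only.

```r
# Candidate independent variables from the built-in mtcars data
predictors <- mtcars[, c("disp", "hp", "wt", "drat")]

# Pairwise Pearson correlations between the predictors
cm <- cor(predictors)

# Flag each pair once (upper triangle) whose |r| meets the 0.8 threshold
high <- which(abs(cm) >= 0.8 & upper.tri(cm), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)

print(round(cm, 2))
print(flagged)
```

Any pair listed in flagged is a candidate for dropping one of its two variables; which one to drop remains a judgment call for the researcher.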

Unfortunately, there is still a problem of overfitting when using a backward stepwise method, forward stepwise method, or correlation matrix in multivariate linear models.  That is because computers tend to remove, add, or consider variables systematically and mathematically, not based on human knowledge (Field, 2013; Huck, 2013).  Thus, it is still important to have a human evaluate the computational output for logic, consistency, and reliability.  However, if the focus is on reducing overfitting, it should be noted that underfitting should also be avoided.  Underfitting a regression model happens when the model leaves out key independent variables that can help predict the dependent variable (Field, 2013).
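A backward stepwise search of the kind discussed above can be sketched with base R's step() function. Note two assumptions: the mtcars data and the starting predictors are chosen for illustration, and step() drops terms according to AIC rather than significance tests, so it is a stand-in for, not a copy of, the procedure described by Field (2013).

```r
# Start from a deliberately large model on the built-in mtcars data
full_model <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

# Backward elimination: repeatedly drop the term whose removal lowers AIC most
reduced_model <- step(full_model, direction = "backward", trace = 0)

# Compare what was kept against what went in
print(formula(full_model))
print(formula(reduced_model))
print(c(full = AIC(full_model), reduced = AIC(reduced_model)))
```

The algorithm only sees the numbers, so the variables it keeps should still be reviewed by a person for logic and consistency with subject-matter knowledge.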

Hierarchical regression methods

When researchers build a multivariate regression model in stages, they tend to add known independent variables first and newer independent variables afterward to avoid overfitting, in a technique called hierarchical regression (Austin, Goel, & van Walraven, 2001; Field, 2013; Huck, 2013).  The new and unknown independent variables could be entered through a stepwise algorithm as mentioned above, or another step could be created in which suspected new variables that may contribute highly to the predictability of the dependent variable are added next (Field, 2013).  Hierarchical regression methods allow the researcher to analyze the differing hierarchical levels by examining not only the correlations between the levels but also the intercepts and slopes, helping drive valid statistical inferences (Austin et al., 2001).

Vandekerckhove et al. (2014) listed these three hierarchical methods for model selection, each of which balances goodness of fit against parsimony:

  • Akaike’s Information Criterion (AIC) considers how much the observed data influence the belief in one model over the other, but is unreliable with huge amounts of data
  • Bayesian Information Criterion (BIC) considers how much the observed data influence the belief in one model over the other and can handle huge amounts of data, but is known to underfit
  • Minimum Description Length (MDL) considers how much a model can compress the observed data, through identifying regularity within the data values

Vandekerckhove et al. (2014) also stated that the model with the lowest AIC and/or BIC score would be the best to choose.

In conclusion, under parsimony, if adding another variable does not improve the regression formula, then it should not be added into the assessment, to avoid overfitting (Field, 2013).  General Least Squares Models have issues with overfitting because computers conduct their analysis systematically and mathematically and lack the human knowledge needed to keep removing unneeded variables from the equation.  Hierarchical regression methods can help minimize overfitting through indirect calculation of a parsimony value (Vandekerckhove et al., 2014).
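The staged entry of variables described above can be sketched as a sequence of nested models. The mtcars data and the order in which the blocks are entered are assumptions for illustration; in practice the order comes from prior knowledge.

```r
# Block 1: a predictor assumed to be well established
block1 <- lm(mpg ~ wt, data = mtcars)

# Block 2: add a second predictor
block2 <- lm(mpg ~ wt + hp, data = mtcars)

# Block 3: add a newer, less certain predictor
block3 <- lm(mpg ~ wt + hp + drat, data = mtcars)

# F tests for whether each added block improves on the previous one
print(anova(block1, block2, block3))

# Change in R-squared contributed by each new block
r2 <- sapply(list(block1, block2, block3), function(m) summary(m)$r.squared)
r2_change <- diff(r2)
print(r2_change)
```

A block whose F test is not significant and whose change in R-squared is negligible is a candidate to leave out on parsimony grounds.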
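Choosing the model with the lowest AIC or BIC can be sketched with base R's AIC() and BIC() functions. The three candidate models on the built-in mtcars data are hypothetical and serve only to show the mechanics.

```r
# Three candidate models of increasing size on the built-in mtcars data
candidates <- list(
  small  = lm(mpg ~ wt, data = mtcars),
  medium = lm(mpg ~ wt + hp, data = mtcars),
  large  = lm(mpg ~ wt + hp + disp + drat + qsec + gear + carb, data = mtcars)
)

# Lower is better for both criteria; BIC penalizes extra parameters more
# heavily than AIC whenever log(n) > 2, i.e. for 8 or more observations
scores <- data.frame(
  model = names(candidates),
  AIC   = sapply(candidates, AIC),
  BIC   = sapply(candidates, BIC)
)

best_by_aic <- scores$model[which.min(scores$AIC)]
best_by_bic <- scores$model[which.min(scores$BIC)]

print(scores)
print(c(AIC = best_by_aic, BIC = best_by_bic))
```

Because the two criteria weigh model size differently, they do not always agree on the best model.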

References

  • Austin, P. C., Goel, V., & van Walraven, C. (2001). An introduction to multilevel regression models. Canadian Journal of Public Health, 92(2), 150.
  • Duignan, B. (2015). Occam’s razor. Encyclopaedia Britannica. Retrieved from https://www.britannica.com/topic/Occams-razor
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
  • Huck, S. W. (2013). Reading Statistics and Research (6th ed.). Pearson Learning Solutions. [VitalSource Bookshelf Online].
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
  • Vandekerckhove, J., Matzke, D., & Wagenmakers, E. J. (2014). Model comparison and the principle of parsimony.

Adv Quant: Polynomial Regression in R

Introduction

For this local polynomial regression, the “oldfaithful.csv” will be used from the open-source data. The eruption times (in minutes) and the waiting time to the next eruption (in minutes) of 272 eruptions are provided for the Old Faithful geyser.

Results

IP2F1.PNG

Figure 1: Density Histograms for eruptions times and eruption waiting times.

IP2F2.PNG

Figure 2: Smoothed density histogram from local polynomial regression.

IP2F3.png

Figure 3: Intercomparison of linear regression (blue), lowess regression (red), and polynomial regression (green) on the eruption data.

IP2F4

Figure 4: Residual plots for both linear and polynomial regression.

Discussion

The histogram plots (Figure 1) illustrate that both variables, eruption time and eruption waiting time, have bimodal distributions.  Thus, a linear regression (Figure 3) would not capture the relationship between these two variables.  A polynomial-smoothed version of the bimodal curve (Figure 2) shows that for low values of the geyser’s magnitude, there is a low wait time for the next occurrence, and vice versa.  The smoothed density curve shows the estimated values of the geyser’s variable distributions better than the bar histogram.

LOCFIT (locally fitted regression) and LOWESS (locally weighted scatterplot smoothing regression) are assessed alongside the typical LM (linear regression).  LOCFIT is based on LOWESS, which allows the end user to specify the smoothing parameter and neighborhood size, but LOCFIT affords the end user more control over the other smoothing parameters (Futschik & Crompton, 2004).  Both LOCFIT and LOWESS are regression methods that use a nearest-neighbor-based model (Field, 2013; Futschik & Crompton, 2004; Loader, 2013; Smith, 2015).  This analysis will look at all three.

The goal is to see if there is a relationship between the waiting time to the next eruption and the magnitude of the eruption (per eruption time).  Through the linear regression algorithm, the linear model is eruptions = 0.075628 (waiting) - 1.874016.  Pearson’s correlation coefficient is 0.9008112; thus, 81.15% of the variation could be explained by a linear regression model.  The lowess regression appears not to capture the distribution of data at smaller eruption times, but it is better than the linear regression model, since its correlation is 0.9809684 and it can explain 96.23% of the variation between the variables.

Finally, to evaluate the effectiveness of the linear model and the polynomial model, the residuals must be assessed (Figure 4).  Neither of the residual plots shows any discernible pattern.  However, the residuals are closer to zero in the polynomial regression, suggesting that it does a better job of explaining the variance between the eruption magnitude and the next eruption wait time.  In conclusion, the best regression for this data set appears to be the polynomial regression.
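The role of the LOWESS smoothing parameter mentioned above can be sketched with base R alone. This example uses R's built-in faithful dataset (assumed here to match the CSV loaded in the Code section) and compares two values of the smoother span f; the value 0.2 and the helper rmse_of are hypothetical choices for illustration.

```r
# Built-in copy of the Old Faithful data (272 eruptions)
data(faithful)

# f is the fraction of points used in each local fit (the neighborhood size)
smooth_wide   <- lowess(faithful$waiting, faithful$eruptions, f = 2/3)
smooth_narrow <- lowess(faithful$waiting, faithful$eruptions, f = 0.2)

# lowess() returns fitted values ordered by x, so sort the observations to match
observed <- faithful$eruptions[order(faithful$waiting)]
rmse_of  <- function(fit) sqrt(mean((observed - fit$y)^2))

print(c(wide = rmse_of(smooth_wide), narrow = rmse_of(smooth_narrow)))

# Overlay both smooths on the scatterplot
plot(faithful$waiting, faithful$eruptions,
     xlab = "waiting [min]", ylab = "eruptions [min]")
lines(smooth_wide, col = "red")
lines(smooth_narrow, col = "darkgreen")
```

A smaller span follows local structure more closely, while a larger span gives a smoother curve; neither is automatically the better choice.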
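The link between the correlation coefficient and the share of variation explained can be verified directly. This sketch uses R's built-in faithful dataset, assumed to contain the same 272 eruptions as the CSV loaded in the Code section.

```r
# Built-in copy of the Old Faithful data
data(faithful)

# Simple linear regression of eruption time on waiting time
linear_fit <- lm(eruptions ~ waiting, data = faithful)

# Pearson's r and its square
r <- cor(faithful$eruptions, faithful$waiting)

# For a one-predictor linear model, R-squared equals r squared
comparison <- c(r = r,
                r_squared = r^2,
                lm_r_squared = summary(linear_fit)$r.squared)

print(coef(linear_fit))
print(comparison)
```

Squaring Pearson's r gives the coefficient of determination for a single-predictor linear model, which is where the percentage of explained variation comes from.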
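The residual comparison above can be given a numeric summary. This sketch swaps in base R's loess() as the local polynomial fit instead of locfit, so that it runs without extra packages; the built-in faithful data, the degree of 2, and the rmse helper are assumptions for illustration.

```r
# Built-in copy of the Old Faithful data
data(faithful)

# Global straight line versus a local quadratic fit (loess, not locfit)
linear_fit <- lm(eruptions ~ waiting, data = faithful)
local_fit  <- loess(eruptions ~ waiting, data = faithful, degree = 2)

# Root mean squared residual: a single-number summary of residual size
rmse <- function(fit) sqrt(mean(residuals(fit)^2))
residual_size <- c(linear = rmse(linear_fit), local = rmse(local_fit))

print(residual_size)

# Residuals plotted in observation order, as in Figure 4
plot(residuals(linear_fit), main = "linear residuals", ylab = "residuals")
plot(residuals(local_fit), main = "local polynomial residuals", ylab = "residuals")
```

Smaller in-sample residuals support the local fit here, though a more flexible model will usually fit the training data more closely, so this comparison alone does not rule out overfitting.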

Code

#

## Use R to analyze the faithful dataset.

## This is a version of the eruption data from the “Old Faithful” geyser in Yellowstone National Park, Wyoming.

##  •     X (primary key)

##  •     eruptions (eruption time [mins])

##  •     waiting (wait time for this eruptions [mins])

#

fateful = read.csv(file="https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/faithful.csv", header = TRUE, sep = ",")

head(fateful)

# Produce density histograms of eruption times and of waiting times.

hist(fateful$eruptions, freq=FALSE, xlab = "eruptions time [mins]",  main = "Histogram of the eruptions time")

hist(fateful$waiting, freq=FALSE, xlab = "eruptions waiting time [mins]",  main = "Histogram of the eruptions waiting time")

# Produce a smoothed density histogram from local polynomial regression.

install.packages("locfit")

library(locfit)

plot(locfit(~lp(fateful$eruptions),data=fateful), xlab = "eruptions time [mins]",  main = "Histogram of the eruptions time")

plot(locfit(~lp(fateful$waiting),data=fateful), xlab = "eruptions waiting time [mins]",  main = "Histogram of the eruptions waiting time")

# Compare local polynomial regression to regular regression.

lowessRegression = lowess(fateful$waiting, fateful$eruptions, f=2/3)

polynomialRegression = locfit(fateful$eruptions~lp(fateful$waiting))

linearRegression = lm(fateful$eruptions~fateful$waiting)

# Graphing the data

# x-axis is the waiting time and y-axis is the eruption time, matching the plotted variables
plot(fateful$waiting, fateful$eruptions, main = "Eruption Times", xlab="Waiting time to next eruption [min]", ylab = "Eruption time [min]")

lines(lowessRegression, col="red")

abline(linearRegression, col="blue")

lines(polynomialRegression, col="green")

# summary on the regressions

summary(linearRegression)

# correlations on the regressions

cor(fateful$eruptions,fateful$waiting)

cor(lowessRegression$x, lowessRegression$y)

# Plotting residuals

plot(residuals(linearRegression), main = "residuals for the linear regression", ylab = "residuals")

plot(residuals(polynomialRegression), main = "residuals for the polynomial regression", ylab="residuals")

References