Adv Quant: Decision Trees in R

Classification, Regression, and Conditional Tree Growth Algorithms

The variables used for the tree growth algorithms are the log of the benign prostatic hyperplasia amount (lbph), the log of the prostate-specific antigen (lpsa), the Gleason score (gleason), the log of the capsular penetration (lcp), and the log of the cancer volume (lcavol), with the goal of understanding and predicting tumor spread (seminal vesicle invasion = svi).



Figure 1: Visualization of cross-validation results, for the classification tree (left) and regression tree (right).


Figure 2: Classification tree (left), regression tree (center), and conditional tree (right).


Figure 3: Summary of the tree output: (a) classification tree, (b) regression tree, and (c) conditional tree.


For the classification tree growth algorithm, the root node is the seminal vesicle invasion, which indicates tumor spread in this dataset. The cross-validation results show that there is only one split in the tree, with a cross-validated relative error of 0.71429 at the first split (Figure 1 & Figure 3a) and a standard deviation of 0.16957 (Figure 3a).  The variable used to split the tree was the log of capsular penetration, at lcp < 1.791 (Figure 2).

Next, for the regression tree growth algorithm, there are three leaf nodes because the algorithm split the data twice.  In this case, the relative error at the first split is 1.00931 with a standard deviation of 0.18969, and at the second split the relative error is 0.69007 with a standard deviation of 0.15773 (Figure 1 & Figure 3b).  The tree was split first on the log of capsular penetration at lcp < 1.791, and then on the log of prostate-specific antigen at lpsa < 2.993 (Figure 2).  It is interesting that the first split occurred at the same value for these two different tree growth algorithms, yet the relative errors and standard deviations differed and the regression tree created one more level.

Finally, the conditional tree growth algorithm produced a split at lcp < 1.749 for the log of capsular penetration, significant at the 0.001 level, and at lpsa < 2.973 for the log of prostate-specific antigen, also significant at the 0.001 level (Figure 2 & Figure 3c).  The results are similar to the regression tree, with the same number of leaf nodes and nearly the same split values, but more information is gained from the conditional tree growth algorithm than from the classification and regression tree growth algorithms.



### ———————————————————————————————————-

## Use the prostate cancer dataset available in R, in which biopsy results are given for 97 men.

## Goal:  Predict tumor spread in this dataset of 97 men who had undergone a biopsy.

## The measures to be used for prediction are BPH=lbph, PSA=lpsa, Gleason Score=gleason, CP=lcp,

## and size of prostate=lcavol.

### ———————————————————————————————————-







## Grow a classification tree

library(rpart)   # provides the classification and regression tree algorithms
library(party)   # provides the conditional inference tree algorithm used below
library(lasso2)  # assumed source of the Prostate dataset (97 biopsy cases)
data(Prostate)

classification = rpart(svi~lbph+lpsa+gleason+lcp+lcavol, data=Prostate, method="class")

printcp(classification) # display the cross-validation results

plotcp(classification)  # visualize the cross-validation results

plot(classification, uniform = T, main="Classification Tree for prostate cancer") # plot tree

text(classification, use.n = T, all = T, cex=.8)                                  # label the tree nodes

## Grow a regression tree

Regression = rpart(svi~lbph+lpsa+gleason+lcp+lcavol, data=Prostate, method="anova")

printcp(Regression) # display the cross-validation results

plotcp(Regression)  # visualize the cross-validation results

plot(Regression, uniform = T, main="Regression Tree for prostate cancer") # plot tree

text(Regression, use.n = T, all = T, cex=.8)                              # label the tree nodes



## Grow a conditional inference tree

conditional = ctree(svi~lbph+lpsa+gleason+lcp+lcavol, data=Prostate)

conditional # display the results

plot(conditional, main="Conditional inference tree for prostate cancer")


Adv Quant: Ensemble Classifiers and RandomForests

Ensemble classifiers can perform better than a single classifier because they combine multiple classifiers, each with a weight attached, to properly classify new data points (Bauer & Kohavi, 1999; Dietterich, 2000).  The ensemble classifier can include methods such as:

  • Logistic Regression: multivariable regression in which one or more continuous or categorical independent variables are used to predict a dichotomous/binary/categorical dependent variable (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).
  • Nearest Neighbor Methods: K-nearest neighbor (e.g., K = 5) assigns a data point to a group by having its K nearest neighbors vote on it, and it is particularly useful when the data are binary or categorical (Berson, Smith, & Thearling, 1999).
  • Classification Trees: aid in data abstraction and in finding patterns in an intuitive way (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Connolly & Begg, 2014), and aid the decision-making process by mapping out all the paths, solutions, or options available to the decision maker.
  • Bayesian Analysis: can be reduced to a conditional probability that takes prior knowledge into account but updates itself when new data become available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003).
  • Discriminant Analysis: separates the data into several groups based on the independent variables that create the largest separation of the predictions (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013).

As mentioned above, the ensemble classifier can weight each component classifier to improve the accuracy of the total “ensemble classifier result,” through boosting and bagging procedures.  Boosting procedures reduce both the bias and the variance of the different methods, while bagging procedures reduce only the variance (Bauer & Kohavi, 1999; Liaw & Wiener, 2002).

  • Boosting: strengthens weak classification algorithms by running them serially, forcing a reduction in the expected error (Bauer & Kohavi, 1999). The algorithm runs serially because each classifier’s votes on the variables are taken into account by the next classifier’s prediction (Liaw & Wiener, 2002).
  • Bagging (bootstrap aggregating): builds classifiers from different uniform samples of the training data set drawn with replacement, and it can be computed in parallel because no classifier depends on another classifier’s votes to run its prediction (Bauer & Kohavi, 1999; Liaw & Wiener, 2002). This is also known as an averaging method or a random forest (Ahlemeyer-Stubbe & Coleman, 2014); a minimal bagging sketch in R follows this list.
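To make the bagging procedure concrete, below is a minimal sketch in R, assuming a training data frame train with a two-level factor outcome y, a test data frame test, and B = 25 bootstrap resamples; these names and values are illustrative assumptions, not code from the cited works.

library(rpart)

bagging = function(train, test, B = 25){
  votes = matrix(0, nrow(test), B)
  for (b in 1:B){
    boot = train[sample(nrow(train), replace = TRUE), ] # uniform sample with replacement
    tree = rpart(y~., data = boot, method = "class")    # one classifier per resample, grown independently
    votes[, b] = as.numeric(predict(tree, test, type = "class")) - 1
  }
  ifelse(rowMeans(votes) >= 0.5, 1, 0) # average the classifiers' votes (majority wins)
}

Because each of the B trees is grown on its own resample, the loop above could just as well run in parallel, which is the property noted in the bagging bullet.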

Random Forest

According to Ahlemeyer-Stubbe and Coleman (2014), random forests are multiple decision trees built by selecting multiple random samples from the same data set (through either resampled or disjoint sampling), and the variables that appear more frequently in the forest add confidence that those variables have a real influence on the dependent variable.  Liaw and Wiener (2002) affirmed this, stating that a variable’s frequent appearance among many trees in the forest not only adds confidence in its influence but can also help determine its proximity to the root node.  Random forests add a new level of randomness to bagging algorithms and are robust against overfitting, which is a problem with some decision tree algorithms (Ahlemeyer-Stubbe & Coleman, 2014; Liaw & Wiener, 2002).

The use of random forests is most helpful when relationships between the variables are weak or when there is very little data available (Ahlemeyer-Stubbe & Coleman, 2014).  It is also worth considering that the number of trees needed to achieve great performance increases as the number of variables under consideration increases (Liaw & Wiener, 2002). To learn how to run random forest algorithms in the statistical programming language R, Liaw and Wiener (2002) shared some of their coding syntax, as well as observations on how to effectively meet the objectives; a hedged example of that syntax appears below.
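As a hedged illustration of that syntax (a sketch in the spirit of Liaw and Wiener's examples, not their verbatim code), the randomForest package can be applied to the prostate formula used earlier in this series; the choice of 500 trees is an assumption made for the example.

library(randomForest)

forest = randomForest(as.factor(svi)~lbph+lpsa+gleason+lcp+lcavol, data=Prostate, ntree=500, importance=TRUE) # 500 trees is an illustrative choice

print(forest)      # out-of-bag (OOB) error estimate and confusion matrix

importance(forest) # which variables appear most influential across the forest

varImpPlot(forest) # plot the variable importance measures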


  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry (1st ed.). [VitalSource Bookshelf Online].
  • Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2), 105-139.
  • Berson, A., Smith, S., & Thearling, K. (1999). Building Data Mining Applications for CRM. McGraw-Hill.
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview (12th ed.). [VitalSource Bookshelf Online].
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). [VitalSource Bookshelf Online].
  • Dietterich, T. G. (2000). Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems (pp. 1-15). Springer Berlin Heidelberg.
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). [VitalSource Bookshelf Online].

Adv Quant: Decision Trees

Decision Trees

When facing a decision, humans tend to seek out the path, solution, or option that appears closest to the goal (Brookshear & Brylow, 2014). Decision trees are helpful because they are predictive models (Ahlemeyer-Stubbe & Coleman, 2014).  Thus, decision trees aid in data abstraction and in finding patterns in an intuitive way (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Connolly & Begg, 2014), and they aid the decision-making process by mapping out all the paths, solutions, or options available to the decision maker.  Every decision is different and varies in complexity, so there is no way to write one simple, well-thought-out decision tree that fits every case (Sadalage & Fowler, 2012).

Ahlemeyer-Stubbe and Coleman (2014) stated that decision trees are a great way to identify possible variables for inclusion in statistical models that are mutually exclusive and collectively exhaustive, even if the relationship between the target and the inputs is weak. To help facilitate decision making, each node on a decision tree can have a question attached to it that needs to be asked, with leaves associated with each node representing the differing answers (McNurlin, Sprague, & Bui, 2008). The variable with the strongest influence becomes the topmost branch of the decision tree (Ahlemeyer-Stubbe & Coleman, 2014). Chaudhuri, Lo, Loh, and Yang (1995) define regression decision trees as those whose target question/variable is continuous, real, or logistic. Murthy (1998) confirms this definition, while adding that when the target question/variable needs to be split into different, finite, and discrete classes, the result is a classification decision tree.

Aiming to mirror the way the human brain works, classification decision trees can be created using neural network algorithms, which contain a network of nodes that can each have multiple inputs, outputs, and processes (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). Neural network algorithms contrast with typical decision trees, which usually have one input, one output, and one process per node (similar to Figure 1). Once a root question has been identified, the decision tree algorithm keeps recursively iterating through the data in an aim to answer the root question (Ahlemeyer-Stubbe & Coleman, 2014).

However, the larger the decision tree, the weaker its leaves get, because the model tends to overfit the data. Thresholds can therefore be applied to the decision tree modeling algorithm to prune back the unstable leaves (Ahlemeyer-Stubbe & Coleman, 2014).  Thus, when looking for a decision tree algorithm to parse through data, it is best to find one that has pruning capabilities; a minimal pruning sketch appears after Figure 1.


Figure 1: A left-to-right decision tree on whether or not to take an umbrella, assuming the person is going to spend any amount of time outside during the day.
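As a minimal sketch of such pruning in R, assuming an rpart tree object named fit that was grown elsewhere (rpart is one algorithm with pruning capabilities):

library(rpart)

bestCP = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"] # cp value with the lowest cross-validated error

pruned = prune(fit, cp = bestCP) # prune back the unstable leaves beyond that threshold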

Advantages of a decision tree

According to Ahlemeyer-Stubbe and Coleman (2014), some of the advantages of using decision trees are:

+ Few assumptions are needed about the distribution of the data

+ Few assumptions are needed about the linearity

+ Decision trees are not sensitive to outliers

+ Decision trees are best for large data, because of their adaptability and minimal assumptions needed to begin parsing the data

+ For logistic and linear regression trees, parameter estimation and hypothesis testing are possible

+ For neural network (Classification) decision trees, predictive equations can be derived

According to Murthy (1998) the advantages of using classification decision trees are:

+ Pre-classified examples mitigate the needs for a subject matter expert knowledge

+ It is an exploratory method, as opposed to an inferential method

According to Chaudhuri et al. (1995) the advantages of using a regression decision tree are:

+ It can handle model complexity in an easily interpretable way

+ Covariate values are conveyed by the tree structure

+ Statistical properties can be derived and studied


  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry (1st ed.). [VitalSource Bookshelf Online].
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview (12th ed.). [VitalSource Bookshelf Online].
  • Chaudhuri, P., Lo, W. D., Loh, W. Y., & Yang, C. C. (1995). Generalized regression trees. Statistica Sinica, 641-666.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). [VitalSource Bookshelf Online].
  • McNurlin, B., Sprague, R., & Bui, T. (2008). Information Systems Management (8th ed.). [VitalSource Bookshelf Online].
  • Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4), 345-389.
  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence (1st ed.). [VitalSource Bookshelf Online].

Adv Quant: Bayesian analysis in R


Bayes’ theorem expresses a conditional probability that takes prior knowledge into account but updates itself when new data become available (Hubbard, 2010; Smith, 2015).  Its formulation is p(θ|y) = p(θ)·P(y|θ) / Σ(p(θ)·P(y|θ)), where p(θ) is the prior probability and P(y|θ) is the likelihood (Cowles, Kass, & O’Hagan, 2009).
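As a toy numeric illustration of this formulation in R (the probabilities below are invented for the example, not taken from the flight data):

prior      = c(delayed = 0.25, onTime = 0.75)       # p(theta), invented prior probabilities
likelihood = c(delayed = 0.40, onTime = 0.05)       # P(y|theta): chance of adverse weather in each state
posterior  = prior*likelihood/sum(prior*likelihood) # p(theta|y) after observing adverse weather
posterior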

The Delayed Airplanes Dataset consists of airplane flights from Washington D.C. into New York City.  The date range for this data is for the entire month of February 2016, and there are 702 cases to be studied.



Figure 1: Histogram showcasing the density of flight delays that are 15 minutes or longer.


Figure 2: Summary data for the variables in this Bayesian analysis before training and testing.


Figure 3: Bayesian Prediction of the flight delay data from Washington, D.C. to New York City, NY.


Figure 4: Bayesian prediction results versus the test data results, where false negatives are encircled in blue, while false positives are encircled in red.


 The histogram (Figure 1) shows that there are almost three times as many flights departing on time from Washington, D.C. to New York City, NY as there are delayed flights.  The summary data confirm this (Figure 2).

The above summary (Figure 2) states that 77.813% of the flights were not delayed by 15 minutes or more, for the cases we do have data on. There are null values in the departure time, delayed-15-minutes-or-more, and weather delay variables.  To know the percentage of flights per day of the week, carrier, destination, etc., the prior probabilities need to be calculated below.

About 77.2973% of the training data had no delay, while 22.7027% had a delay of 15 minutes or more (from the tdelay variable).  These values are close to the summary proportions above (Figure 2), so the training data can be trusted even though a random sample wasn’t taken.  The reason for not taking a random sample is to be able to predict into the future, given that 60% of the data is already collected.

Comparing the two histograms (Figure 1 and Figure 3), the distribution in the first histogram is binomial, while the posterior distribution in the second histogram takes a similar, positively skewed shape.  This was an expected result described by Smith (2015), which is why the author states that the prior distribution has an effect on the posterior distribution.

The Bayesian prediction results tend to produce many false negatives compared with the actual test data, indicating more Type II error than Type I error.  Per the code below, the probability of finding a result of 0.5 or larger is 15.302%.



## Locate the data, filter it, and pull it into R (R, n.d.b.)



airplaneData=read.csv("022016DC2NYC_1022370032_T_ONTIME.csv", header = T, sep = ",")



### ———————————————————————————————————-

##  Data Source:

##        Dependent:   Departure Delay Indicator, 15 minutes or more (Dep_Del15)

##        Independent: Arrival airports of Newark-EWR, Kennedy-JFK, and LaGuardia-LGA (Origin)

##        Independent: Departure airports of Baltimore-BWI, Dulles-IAD, and Reagan-DCA (Dest)

##        Independent: Carriers (Carrier)

##        Independent: Hours of departure (Dep_Time)

##        Independent: Weather conditions (Weather_Delay)

##        Independent: Monday = 1, Tuesday = 2, …Sunday = 7 (Day_Of_Week)

### ———————————————————————————————————-

##  bayes theory => p(theta|y)= p(theta)*P(y|theta)/(SUM(P(theta)*P(y|theta))) (Cowles, Kass, & O’Hagan, 2009)

### ———————————————————————————————————-



## Create a data.frame

delay = data.frame(airplaneData)

## Factoring and labeling the variables (Taddy, n.d.)

delay$DEP_TIME = factor(floor(delay$DEP_TIME/100))

delay$DAY_OF_WEEK = factor(delay$DAY_OF_WEEK, labels = c("M", "T", "W", "R", "F", "S", "U"))

delay$DEP_DEL15 = factor(delay$DEP_DEL15)

delay$WEATHER_DELAY= factor(ifelse(delay$WEATHER_DELAY>=1,1,0)) # (R, n.d.a.)

delay$CARRIER = factor(delay$CARRIER, levels = c("AA","B6","DL","EV","UA"))

levels(delay$CARRIER) = c("American", "JetBlue", "Delta", "ExpressJet", "UnitedAir")

## Quick understanding the data

delayed15 = as.numeric(levels(delay$DEP_DEL15)[delay$DEP_DEL15])

hist(delayed15, freq=F, main = "Histogram of Delays of 15 mins or longer", xlab = "time >= 15 mins (1) or time < 15 (0)")


### Create the training and testing data (60/40%)

ntotal=length(delay$DAY_OF_WEEK)    # Total number of datapoints assigned dynamically

ntrain = 1:floor(ntotal*(0.6))     # First 60% of the cases; kept sequential rather than random so the later 40% can be predicted into the future

ntest = ntotal-floor(ntotal*(0.6)) # The number of test cases (40% of the data)

trainingData = cbind(delay$DAY_OF_WEEK[ntrain], delay$CARRIER[ntrain],delay$ORIGIN[ntrain],delay$DEST[ntrain],delay$DEP_TIME[ntrain],delay$WEATHER_DELAY[ntrain],delayed15[ntrain])

testingData  = cbind(delay$DAY_OF_WEEK[-ntrain], delay$CARRIER[-ntrain],delay$ORIGIN[-ntrain],delay$DEST[-ntrain],delay$DEP_TIME[-ntrain],delay$WEATHER_DELAY[-ntrain],delayed15[-ntrain])

## Partition the training data by outcome: on time (column 7 = 0) versus delayed (column 7 = 1)

trainFirst= trainingData[trainingData[,7]<0.5,]

trainSecond= trainingData[trainingData[,7]>0.5,]

### Prior probabilities = p(theta) (Cowles, Kass, & O’Hagan, 2009)

## Dependent variable: time delayed >= 15


### Prior probabilities between the partitioned training data

## Independent variable: Day of the week (% of flights that occurred on each day of the week)



## Independent variable: Carrier (% of flights that occurred on each carrier)



## Independent variable: Origin (% of flights that occurred at each originating airport)



## Independent variable: Destination (% of flights that occurred at each destination airport)



## Independent variable: Departure time (% of flights that occurred at each time of day)



## Independent variable: Weather (% flights delayed because of adverse weather conditions)
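
## The prior-probability code was elided from the original post; the lines
## below are a hypothetical sketch of the pattern for two variables, using
## column 7 of trainingData as the delay indicator and column 1 as the day.

pDelay = mean(trainingData[,7])                     # p(theta): share of training flights delayed >= 15 mins

pDay   = table(trainingData[,1])/nrow(trainingData) # % of flights that occurred on each day of the week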



### likelihoods = p(y|theta) (Cowles, Kass, & O’Hagan, 2009)
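
## The likelihood code was also elided; this hypothetical sketch computes the
## day-of-week distribution within the delayed (trainSecond) and on-time
## (trainFirst) partitions created above.

likDayDelay  = table(trainSecond[,1])/nrow(trainSecond) # P(day | delayed)

likDayOnTime = table(trainFirst[,1])/nrow(trainFirst)   # P(day | on time)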



### Predictions using bayes theory = p(theta|y)= p(theta)*P(y|theta)/(SUM(P(theta)*P(y|theta))) (Cowles, Kass, & O’Hagan, 2009)
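
## The original prediction code was elided; this hypothetical one-variable
## (day-of-week) sketch yields a posterior delay probability per test case,
## so the plotting code below has a `Bayes` vector to work with.

testDay = as.character(testingData[,1])

pYes = pDelay*likDayDelay[testDay]      # p(theta)*P(y|theta) for the delayed state

pNo  = (1-pDelay)*likDayOnTime[testDay] # p(theta)*P(y|theta) for the on-time state

Bayes = as.numeric(pYes/(pYes+pNo))     # p(theta|y): posterior probability of delay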


hist(Bayes, freq=F, main="Bayesian Analysis of flight delay data")

plot(delayed15[-ntrain]~Bayes, main="Bayes results versus actual results for flights delayed >= 15 mins", xlab="Bayes Analysis Prediction of which cases will be delayed", ylab="Actual results from test data showing delayed cases")

## The probability of 0.5 or larger

densityMeasure = table(delayed15[-ntrain],floor(Bayes+0.5))