Classification, Regression, and Conditional Tree Growth Algorithms
The predictors used by the tree growth algorithms are the log of the benign prostatic hyperplasia amount (lbph), the log of prostate-specific antigen (lpsa), the Gleason score (gleason), the log of capsular penetration (lcp), and the log of the cancer volume (lcavol). They are used to understand and predict tumor spread, measured as seminal vesicle invasion (svi).
Figure 1: Visualization of cross-validation results for the classification tree (left) and regression tree (right).
Figure 2: Classification tree (left), regression tree (center), and conditional tree (right).
Figure 3: Summarization of tree data: (a) classification tree, (b) regression tree, and (c) conditional tree.
For the classification tree growth algorithm, the response variable is seminal vesicle invasion, which indicates tumor spread in this dataset. The cross-validation results show that the tree has only one split, with a cross-validated relative error (xerror) of 0.71429 for the first split (Figure 1 & Figure 3a) and a cross-validation standard deviation (xstd) of 0.16957 (Figure 3a). The tree was split on the log of capsular penetration, at lcp < 1.791 (Figure 2).
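To show where numbers of this kind come from, the following minimal sketch fits a classification tree and pulls the cross-validation columns out of the fitted object's `cptable`. It runs on synthetic data, not the Prostate data: the data frame `toy`, its simulated values, and the seed are assumptions made only so that the example is self-contained, and its output will not match the figures reported above.

```r
library(rpart)

# Synthetic stand-in for the Prostate data (assumption: simulated values only)
set.seed(42)
n <- 97
toy <- data.frame(lcp = rnorm(n), lpsa = rnorm(n, mean = 2.5))
toy$svi <- rbinom(n, 1, plogis(-1 + 3 * toy$lcp))

fit <- rpart(svi ~ lcp + lpsa, data = toy, method = "class")

# cptable holds one row per candidate tree size.
# "rel error" is scaled so that the root (nsplit = 0) equals 1;
# "xerror" and "xstd" are estimated by cross-validation, so they
# depend on the random folds.
cv_cols <- fit$cptable[, c("nsplit", "xerror", "xstd"), drop = FALSE]
print(cv_cols)
```

These are the same columns that `printcp()` prints and `plotcp()` plots.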
Next, for the regression tree growth algorithm, there are three leaf nodes because the algorithm split the data twice. The cross-validated relative error is 1.00931 with a standard deviation of 0.18969 at the first split, and 0.69007 with a standard deviation of 0.15773 at the second split (Figure 1 & Figure 3b). The tree was split first on the log of capsular penetration at lcp < 1.791, and then on the log of prostate-specific antigen at lpsa < 2.993 (Figure 2). It is interesting that the first split occurred at the same value for these two different tree growth algorithms, while the relative errors and standard deviations differed and the regression tree created one more level.
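Two properties of a regression tree fitted to a 0/1 response can be checked directly: a binary tree has one more leaf than it has splits, and each leaf's prediction is the share of observations with svi = 1 in that leaf. The sketch below checks both on synthetic data. The data frame `toy` and its simulated values are assumptions for illustration and do not reproduce the Prostate results.

```r
library(rpart)

# Synthetic stand-in for the Prostate data (assumption: simulated values only)
set.seed(7)
n <- 97
toy <- data.frame(lcp = rnorm(n), lpsa = rnorm(n, mean = 2.5))
toy$svi <- rbinom(n, 1, plogis(-1 + 3 * toy$lcp))

reg_fit <- rpart(svi ~ lcp + lpsa, data = toy, method = "anova")

# Each row of $frame is a node; leaves are marked "<leaf>"
n_leaves <- sum(reg_fit$frame$var == "<leaf>")
n_splits <- sum(reg_fit$frame$var != "<leaf>")

# $where gives the leaf each observation falls into; averaging the 0/1
# response within a leaf gives the proportion of svi = 1 in that leaf
leaf_share <- ave(toy$svi, reg_fit$where)
fitted_values <- unname(predict(reg_fit))

cat("leaves:", n_leaves, "splits:", n_splits, "\n")
print(all.equal(fitted_values, leaf_share))
```

Because the "anova" method measures error as squared deviation from these leaf means while the "class" method counts misclassifications, the two trees can share a split point and still report different relative errors.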
Finally, the conditional tree growth algorithm split on the log of capsular penetration at <1.749 and on the log of prostate-specific antigen at <2.973, both at the 0.001 significance level (Figure 2 & Figure 3c). The results are similar to the regression tree, with the same number of leaf nodes and the same split variables at nearby values, but the conditional tree conveys more information than the classification and regression trees, such as the significance level of each split.
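The significance levels attached to the conditional tree's splits come from tests of association between each predictor and the response. As a rough illustration of that idea only, the sketch below runs a Monte Carlo permutation test in base R on synthetic data. It is a swapped-in stand-in, not the test that `ctree` itself computes, and `perm_pvalue`, `x`, `noise`, and `y` are hypothetical names.

```r
# Synthetic data (assumption): y depends on x but not on noise
set.seed(1)
n <- 97
x <- rnorm(n)
noise <- rnorm(n)
y <- rbinom(n, 1, plogis(2 * x))

# Monte Carlo permutation p-value for the association between a
# predictor and the response, using absolute correlation as the statistic
perm_pvalue <- function(predictor, response, n_perm = 2000) {
  observed <- abs(cor(predictor, response))
  permuted <- replicate(n_perm, abs(cor(predictor, sample(response))))
  (sum(permuted >= observed) + 1) / (n_perm + 1)
}

p_x <- perm_pvalue(x, y)
p_noise <- perm_pvalue(noise, y)
cat("p-value for x:", p_x, "\n")
cat("p-value for noise:", p_noise, "\n")
```

A predictor that is truly related to the response yields a small p-value, which is the kind of evidence a conditional tree requires before it splits on a variable.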
## Use the prostate cancer dataset available in R, in which biopsy results are given for 97 men.
## Goal: Predict tumor spread in this dataset of 97 men who had undergone a biopsy.
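## The measures to be used for prediction are BPH=lbph, PSA=lpsa, Gleason Score=gleason, CP=lcp,
## and cancer volume=lcavol.
library(rpart)   # rpart(), printcp(), plotcp()
library(party)   # ctree(); assumed here, the original does not name the package
library(lasso2)  # assumed source of the Prostate data
data(Prostate)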
## Grow a classification tree
classification = rpart(svi~lbph+lpsa+gleason+lcp+lcavol, data=Prostate, method="class")
printcp(classification) # display the results
plotcp(classification) # visualize cross-validation results
plot(classification, uniform = TRUE, main="Classification Tree for prostate cancer") # plot tree
text(classification, use.n = TRUE, all = TRUE, cex=.8) # add text labels to the tree
## Grow a regression tree
Regression = rpart(svi~lbph+lpsa+gleason+lcp+lcavol, data=Prostate, method="anova")
printcp(Regression) # display the results
plotcp(Regression) # visualize cross-validation results
plot(Regression, uniform = TRUE, main="Regression Tree for prostate cancer") # plot tree
text(Regression, use.n = TRUE, all = TRUE, cex=.8) # add text labels to the tree
## Grow a conditional inference tree
conditional = ctree(svi~lbph+lpsa+gleason+lcp+lcavol, data=Prostate)
conditional # display the results
plot(conditional, main="Conditional inference tree for prostate cancer")
- lasso2 (n.d.). Prostate Cancer Data. Retrieved from http://www.biostat.jhsph.edu/~ririzarr/Teaching/649/prostate.html
- Quick-R (n.d.). Tree-Based Models. Retrieved from http://www.statmethods.net/advstats/cart.html