Decision Trees
When facing a decision, humans tend to seek out the path, solution, or option that appears closest to the goal (Brookshear & Brylow, 2014). Decision trees are helpful because they are predictive models (Ahlemeyer-Stubbe & Coleman, 2014). Thus, decision trees aid in data abstraction and in finding patterns intuitively (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Connolly & Begg, 2014), and they support the decision-making process by mapping out all the paths, solutions, or options available to the decision maker. Every decision is different and varies in complexity, so no single simple, well-thought-out decision tree fits every case (Sadalage & Fowler, 2012).
Ahlemeyer-Stubbe and Coleman (2014) stated that decision trees are a great way to identify candidate variables for inclusion in statistical models that are mutually exclusive and collectively exhaustive, even when the relationships between the target and the inputs are weak. To facilitate decision making, each node of a decision tree can carry a question that needs to be asked, with leaves attached to that node representing the differing answers (McNurlin, Sprague, & Bui, 2008). The variable with the strongest influence becomes the topmost branch of the decision tree (Ahlemeyer-Stubbe & Coleman, 2014). Chaudhuri, Lo, Loh, and Yang (1995) define regression decision trees as those whose target question/variable is continuous, real-valued, or logistic. Murthy (1998) confirms this definition of regression decision trees, and adds that when the target question/variable must be split into different, finite, discrete classes, the result is a classification decision tree.
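To make the distinction concrete, the sketch below fits both tree types on synthetic data. Scikit-learn is an assumed tool here (none of the cited sources prescribes a library), and the variables and cutoffs are purely illustrative.

```python
# Minimal sketch contrasting the two tree types, assuming scikit-learn
# is available; the data are synthetic and purely illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))            # three candidate input variables

# Regression tree: the target is continuous (Chaudhuri et al.'s case).
y_continuous = 2.0 * X[:, 0] + rng.normal(size=200)   # variable 0 dominates
regressor = DecisionTreeRegressor(max_depth=3).fit(X, y_continuous)

# Classification tree: the target falls into finite, discrete classes (Murthy's case).
y_discrete = (X[:, 0] > 5).astype(int)
classifier = DecisionTreeClassifier(max_depth=3).fit(X, y_discrete)

# The most influential variable rises to the topmost branch of the tree.
print(regressor.feature_importances_)   # importance concentrated on variable 0
print(classifier.tree_.feature[0])      # the root node splits on variable 0
```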
Classification decision trees can also be created with neural network algorithms, which aim to mirror the way the human brain works and consist of connected nodes, each of which can have multiple inputs, outputs, and processes (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). Neural network algorithms contrast with typical decision trees, which usually have one input, one output, and one process per node (similar to Figure 1). Once a root question has been identified, the decision tree algorithm recursively iterates through the data, aiming to answer the root question (Ahlemeyer-Stubbe & Coleman, 2014).
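A minimal sketch of that recursion, under the assumption of a categorical target and a crude purity score (the sources do not specify a splitting criterion), might look like the following; the toy rows loosely mirror the umbrella decision of Figure 1.

```python
# Hypothetical sketch of recursive tree building for a categorical target.
# The purity score below (count of rows disagreeing with the majority answer
# in each branch) is a crude stand-in for measures such as Gini impurity or
# information gain.
from collections import Counter

def build_tree(rows, features, target):
    """Recursively split `rows` (a list of dicts) until each leaf is pure."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not features:        # pure node, or nothing left to split
        return Counter(labels).most_common(1)[0][0]  # leaf: the majority answer

    def impurity(feature):
        groups = {}
        for row in rows:
            groups.setdefault(row[feature], []).append(row[target])
        return sum(len(g) - max(Counter(g).values()) for g in groups.values())

    root = min(features, key=impurity)               # strongest variable goes on top
    remaining = [f for f in features if f != root]
    branches = {}
    for value in {row[root] for row in rows}:        # one branch per answer
        subset = [row for row in rows if row[root] == value]
        branches[value] = build_tree(subset, remaining, target)
    return (root, branches)

# Toy rows loosely mirroring Figure 1's umbrella decision.
data = [
    {"forecast": "rain", "outside": "yes", "umbrella": "take"},
    {"forecast": "rain", "outside": "no",  "umbrella": "leave"},
    {"forecast": "sun",  "outside": "yes", "umbrella": "leave"},
    {"forecast": "sun",  "outside": "no",  "umbrella": "leave"},
]
print(build_tree(data, ["forecast", "outside"], "umbrella"))
# -> ('forecast', {'rain': ('outside', {'yes': 'take', 'no': 'leave'}), 'sun': 'leave'})
#    (branch order may vary)
```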
However, the larger the decision tree grows, the weaker its leaves become, because the model tends to overfit the data. Thresholds can therefore be applied to the decision tree modeling algorithm to prune back the unstable leaves (Ahlemeyer-Stubbe & Coleman, 2014). Thus, when looking for a decision tree algorithm to parse data, it is best to find one with pruning capabilities.
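As one illustration of such a threshold, the sketch below uses scikit-learn's cost-complexity pruning parameter ccp_alpha, an assumed tool and setting rather than one named by the sources, to cut an overgrown tree back.

```python
# Sketch of pruning an overgrown tree with cost-complexity pruning,
# assuming scikit-learn; ccp_alpha is the pruning threshold, and larger
# values prune more aggressively. Data are synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

# The unpruned tree grows many unstable leaves and typically scores worse
# on held-out data than the pruned tree.
print(full.get_n_leaves(), full.score(X_test, y_test))
print(pruned.get_n_leaves(), pruned.score(X_test, y_test))
```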
Figure 1: A left-to-right decision tree on whether or not to take an umbrella, assuming the person is going to spend any amount of time outside during the day.
Advantages of a decision tree
According to Ahlemeyer-Stubbe and Coleman (2014), some of the advantages of using decision trees are:
+ Few assumptions are needed about the distribution of the data
+ Few assumptions are needed about linearity
+ Decision trees are not sensitive to outliers
+ Decision trees are well suited to large data sets because of their adaptability and the minimal assumptions needed to begin parsing the data
+ For logistic and linear regression trees, parameter estimation and hypothesis testing are possible
+ For neural network (classification) decision trees, predictive equations can be derived
According to Murthy (1998), the advantages of using classification decision trees are:
+ Pre-classified examples mitigate the need for subject-matter expert knowledge
+ They are an exploratory method as opposed to an inferential one
According to Chaudhuri et al. (1995), the advantages of using a regression decision tree are:
+ They can handle model complexity in an easily interpretable way
+ Covariate values are conveyed by the tree structure
+ Statistical properties can be derived and studied
References
- Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
- Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, 12th Edition. [VitalSource Bookshelf Online].
- Chaudhuri, P., Lo, W. D., Loh, W. Y., & Yang, C. C. (1995). Generalized regression trees. Statistica Sinica, 641-666. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.133.4786&rep=rep1&type=pdf
- Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
- McNurlin, B., Sprague, R., & Bui, T. (2008). Information Systems Management, 8th Edition. [VitalSource Bookshelf Online].
- Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4), 345-389. Retrieved from http://cmapspublic3.ihmc.us/rid=1MVPFT7ZQ-15Z1DTZ-14TG/Murthy%201998%20DMKD%20Automatic%20Construction%20of%20Decision%20Trees.pdf
- Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [VitalSource Bookshelf Online].