Compelling Topics: Summary and Definitions
- Supervised machine learning algorithms: models that require training and testing data sets and that validate their predictions against some predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014).
- Unsupervised machine learning algorithms: models that also require training and testing data sets but, unlike supervised learning, do not validate against a predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). Instead, unsupervised learning tries to find the natural relationships in the input data (Ahlemeyer-Stubbe & Coleman, 2014). (A sketch contrasting the two appears below.)
- General Least Squares Model (GLM): the line of best fit for linear regression modeling, along with its corresponding correlations (Smith, 2015). There are five assumptions to a linear regression model: additivity, linearity, independent errors, homoscedasticity, and normally distributed errors. (Sketch below.)
- Overfitting: stuffing a regression model with so many variables of little predictive weight that the model ends up describing noise rather than helping to predict the dependent variable (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2014). To avoid the overfitting problem, parsimony is important in big data analytics.
- Parsimony: describing a dependent variable with as few independent variables as possible (Field, 2013; Huck, 2011; Smith, 2015). The concept can be attributed to Occam’s Razor, which states that “plurality ought never be posited without necessity” (Duignan, 2015). Vandekerckhove et al. (2014) describe parsimony as a way of removing the noise from the signal to create better predictive regression models. (Sketch below.)
- Hierarchical Regression: when building a multivariate regression model, the researcher builds it in stages, adding well-established independent variables first and newer independent variables later, in order to avoid overfitting (Austin, Goel, & van Walraven, 2001; Field, 2013; Huck, 2011). (Sketch below.)
- Logistic Regression: a multi-variable regression in which one or more continuous or categorical independent variables are used to predict a dichotomous/binary/categorical dependent variable (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011). (Sketch below.)
- Nearest Neighbor Methods: K-nearest neighbors (e.g., K = 5) assigns a data point to a group by having its K nearest neighbors vote on that point, and it is particularly useful when the data are binary or categorical (Berson, Smith, & Thearling, 1999). (Sketch below.)
- Classification Trees: aid in data abstraction and in finding patterns in an intuitive way (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Connolly & Begg, 2014), and they aid the decision-making process by mapping out all the paths, solutions, or options available to the decision maker. (Sketch below.)
- Bayesian Analysis: can be reduced to a conditional probability that takes prior knowledge into account and updates itself when new data become available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003). (Sketch below.)
- Discriminant Analysis: asks how data should best be separated into several groups, based on the independent variables that create the largest separation in the prediction (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013). (Sketch below.)
- Ensemble Models: can perform better than a single classifier because they combine multiple classifiers, each with a weight attached, to properly classify new data points (Bauer & Kohavi, 1999; Dietterich, 2000), through techniques like bagging and boosting. Boosting procedures help reduce both the bias and the variance of the different methods, while bagging procedures reduce only the variance (Bauer & Kohavi, 1999; Liaw & Wiener, 2002). (Sketch below.)
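The short Python sketches below illustrate several of the techniques summarized above. The libraries (numpy, scikit-learn, statsmodels) and the example data sets are assumptions made for illustration only; they are not drawn from the cited sources. First, a minimal sketch contrasting supervised learning (validated against known output labels) with unsupervised learning (finding natural groupings with no predetermined output):

```python
# A minimal sketch contrasting supervised and unsupervised learning.
# scikit-learn and the iris data set are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the model is trained and validated against known output labels (y).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("supervised accuracy on held-out data:", clf.score(X_test, y_test))

# Unsupervised: no predetermined output value; the algorithm looks for
# natural groupings in the input data alone.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("unsupervised cluster labels (first 10):", clusters[:10])
```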
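A minimal sketch of a least squares line of best fit with its corresponding correlation, as described in the GLM entry; the data values are invented for illustration:

```python
# Ordinary least squares line of best fit and the Pearson correlation.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Slope and intercept from the least squares fit (degree-1 polynomial).
slope, intercept = np.polyfit(x, y, deg=1)
r = np.corrcoef(x, y)[0, 1]  # Pearson correlation between x and y

print(f"best fit: y = {slope:.2f}x + {intercept:.2f}, r = {r:.3f}, R^2 = {r**2:.3f}")
```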
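A minimal sketch of overfitting versus parsimony: a model stuffed with noise predictors can look good in-sample yet cross-validate worse than a parsimonious model. The simulated data and the choice of cross-validated R² as the yardstick are assumptions:

```python
# Overfitting vs. parsimony: compare cross-validated R^2 for a model with
# only the meaningful predictors against one padded with noise predictors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 60
signal = rng.normal(size=(n, 2))   # two predictors that matter
noise = rng.normal(size=(n, 40))   # forty predictors that do not
y = 3 * signal[:, 0] - 2 * signal[:, 1] + rng.normal(size=n)

parsimonious = cross_val_score(LinearRegression(), signal, y, cv=5).mean()
overfit = cross_val_score(LinearRegression(), np.hstack([signal, noise]), y, cv=5).mean()
print(f"cross-validated R^2, 2 predictors:  {parsimonious:.3f}")
print(f"cross-validated R^2, 42 predictors: {overfit:.3f}")
```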
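A minimal sketch of hierarchical regression, assuming statsmodels is available: established predictors are entered in the first block, a newer predictor in the second, and the change in R² indicates whether the addition earns its place:

```python
# Hierarchical (blockwise) regression on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
age, income, new_var = rng.normal(size=(3, n))   # hypothetical predictors
y = 1.5 * age + 0.8 * income + 0.3 * new_var + rng.normal(size=n)

# Block 1: well-established predictors only.
step1 = sm.OLS(y, sm.add_constant(np.column_stack([age, income]))).fit()
# Block 2: add the newer predictor and inspect the change in R^2.
step2 = sm.OLS(y, sm.add_constant(np.column_stack([age, income, new_var]))).fit()

print(f"R^2 step 1: {step1.rsquared:.3f}")
print(f"R^2 step 2: {step2.rsquared:.3f} (change: {step2.rsquared - step1.rsquared:.3f})")
```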
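A minimal sketch of logistic regression predicting a dichotomous outcome from continuous predictors; the breast cancer data set and the standardization step are illustrative choices:

```python
# Logistic regression for a binary dependent variable.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # y is binary: malignant vs. benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
print("predicted probabilities (first 3):", model.predict_proba(X_test[:3]))
```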
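A minimal sketch of K-nearest neighbors with K = 5, where the five closest training points vote on each new point's class; the iris data set is an example only:

```python
# K-nearest neighbors classification with K = 5 voting neighbors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
print("predicted class for the first test point:", knn.predict(X_test[:1]))
```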
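A minimal sketch of a classification tree; the printed rules show the intuitive mapping of paths and options that the notes describe. The shallow depth and the iris data set are illustrative choices:

```python
# A small classification tree whose printed rules map out decision paths.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# Each branch is a path/option the decision maker could follow.
print(export_text(tree, feature_names=list(data.feature_names)))
```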
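A minimal sketch of Bayesian updating as a conditional probability: a prior belief is revised as new evidence arrives, and the posterior becomes the next prior. The probabilities are made up for illustration:

```python
# Bayes' rule: update a prior belief with the likelihood of new evidence.
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | evidence) from a prior P(H) and the two likelihoods."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / evidence

prior = 0.01   # prior knowledge: 1% of cases are positive (hypothetical)
posterior = bayes_update(prior, p_evidence_given_h=0.95, p_evidence_given_not_h=0.05)
print(f"after one positive result:    {posterior:.3f}")

# The posterior becomes the new prior when more data become available.
posterior2 = bayes_update(posterior, 0.95, 0.05)
print(f"after a second positive result: {posterior2:.3f}")
```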
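A minimal sketch of linear discriminant analysis, which looks for the combination of independent variables that best separates the groups; the wine data set is an example only:

```python
# Linear discriminant analysis: project data onto axes of maximum group separation.
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print("accuracy:", lda.score(X_test, y_test))
# The fitted discriminants re-express the predictors along separating axes.
print("projected shape:", lda.transform(X_test).shape)
```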
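A minimal sketch of ensemble models: bagging and boosting combine many weak classifiers (here, decision stumps) into a weighted vote that typically outperforms any single classifier. The data set and estimator counts are illustrative choices:

```python
# Ensembles vs. a single classifier: bagging and boosting decision stumps.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
single = DecisionTreeClassifier(max_depth=1, random_state=0)      # one weak learner

bagged = BaggingClassifier(single, n_estimators=100, random_state=0)   # reduces variance
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)          # reduces bias and variance

for name, model in [("single stump", single), ("bagging", bagged), ("boosting", boosted)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:12s} cross-validated accuracy: {score:.3f}")
```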
References
- Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
- Austin, P. C., Goel, V., & van Walraven, C. (2001). An introduction to multilevel regression models. Canadian Journal of Public Health, 92(2), 150.
- Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2), 105-139.
- Berson, A., Smith, S., & Thearling, K. (1999). Building Data Mining Applications for CRM. McGraw-Hill. Retrieved from http://www.thearling.com/text/dmtechniques/dmtechniques.htm
- Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, 12th Edition. [VitalSource Bookshelf Online].
- Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
- Dietterich, T. G. (2000). Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems (pp. 1-15). Springer Berlin Heidelberg.
- Duignan, B. (2015). Occam’s razor. Encyclopaedia Britannica. Retrieved from https://www.britannica.com/topic/Occams-razor
- Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
- Gall, M. D., Gall, J. P., & Borg, W. R. (2006). Educational Research: An Introduction, 8th Edition. [VitalSource Bookshelf Online].
- Hubbard, D. W. (2010). How to measure anything: Finding the value of “intangibles” in business (2nd ed.). Hoboken, NJ: John Wiley & Sons.
- Huck, S. W. (2011). Reading Statistics and Research, 6th Edition. [VitalSource Bookshelf Online].
- Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22.
- Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
- Spiegelhalter, D., & Rice, K. (2009). Bayesian statistics. Retrieved from http://www.scholarpedia.org/article/Bayesian_statistics
- Vandekerckhove, J., Matzke, D., & Wagenmakers, E. J. (2014). Model comparison and the principle of parsimony.
- Yudkowsky, E.S. (2003). An intuitive explanation of Bayesian reasoning. Retrieved from http://yudkowsky.net/rational/bayes