Adv Quant: Compelling Topics

Compelling topics summary/definitions

  • Supervised machine learning algorithms: models that require both a training and a testing data set and that must validate their predictions against some predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014).
  • Unsupervised machine learning algorithms: models that also require a training and a testing data set but, unlike supervised learning, do not validate against a predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). Instead, unsupervised learning tries to find the natural relationships in the input data (Ahlemeyer-Stubbe & Coleman, 2014).
  • General Least Squares Model (GLM): the line of best fit for linear regression modeling, along with its corresponding correlations (Smith, 2015). A linear regression model rests on five assumptions: additivity, linearity, independent errors, homoscedasticity, and normally distributed errors.
  • Overfitting: stuffing a regression model with many variables that have little contributional weight toward predicting the dependent variable (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2014). Thus, to avoid the overfitting problem, parsimony is important in big data analytics.
  • Parsimony: describing a dependent variable with as few independent variables as possible (Field, 2013; Huck, 2013; Smith, 2015). The concept of parsimony can be attributed to Occam’s Razor, which states that “plurality ought never be posited without necessity” (Duignan, 2015). Vandekerckhove et al. (2014) describe parsimony as a way of removing the noise from the signal to create better predictive regression models.
  • Hierarchical Regression: a technique in which the researcher builds a multivariate regression model in stages, entering known independent variables first and adding newer independent variables afterward in order to avoid overfitting (Austin, Goel, & van Walraven, 2001; Field, 2013; Huck, 2013).
  • Logistic Regression: a multi-variable regression in which one or more continuous or categorical independent variables are used to predict a dichotomous/binary/categorical dependent variable (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).
  • Nearest Neighbor Methods: k-nearest neighbor (e.g., k = 5) clusters a data point into a group by having its 5 nearest neighbors vote on that data point; it is particularly useful if the data are binary or categorical (Berson, Smith, & Thearling, 1999).
  • Classification Trees: aid in data abstraction and in finding patterns in an intuitive way (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Connolly & Begg, 2014), and aid the decision-making process by mapping out all the paths, solutions, or options available to the decision maker.
  • Bayesian Analysis: can be reduced to a conditional probability that aims to take into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003).
  • Discriminant Analysis: determines how data should best be separated into several groups, based on the independent variables that create the largest separation in the prediction (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013).
  • Ensemble Models: can perform better than a single classifier, since they are built as a combination of classifiers with weights attached to them to properly classify new data points (Bauer & Kohavi, 1999; Dietterich, 2000), through techniques like bagging and boosting. Boosting procedures help reduce both the bias and the variance of the different methods, whereas bagging procedures reduce just the variance (Bauer & Kohavi, 1999; Liaw & Wiener, 2002); see the sketch after this list.
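
None of the cited texts provide code for these methods, so the following is only a minimal sketch, assuming scikit-learn and synthetic data, that contrasts several of the classifiers defined above: logistic regression, k-nearest neighbors with k = 5, a single classification tree, and bagged and boosted tree ensembles, all scored on the same held-out split. The data set, parameter settings, and model choices are illustrative assumptions rather than anything prescribed by these sources.

```python
# Minimal comparison sketch (scikit-learn assumed; data and settings are illustrative)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

# Synthetic dichotomous outcome with a handful of informative predictors
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors (k=5)": KNeighborsClassifier(n_neighbors=5),
    "classification tree": DecisionTreeClassifier(max_depth=4, random_state=1),
    # Bagging: averages trees fit on bootstrap resamples, reducing variance
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=1),
    # Boosting: reweights misclassified points each round, reducing bias and variance
    "boosted trees": AdaBoostClassifier(n_estimators=100, random_state=1),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name:28s} test accuracy = {model.score(X_test, y_test):.3f}")
```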

 

References

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Austin, P. C., Goel, V., & van Walraven, C. (2001). An introduction to multilevel regression models. Canadian Journal of Public Health, 92(2), 150.
  • Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1–2), 105–139.
  • Berson, A., Smith, S., & Thearling, K. (1999). Building Data Mining Applications for CRM. McGraw-Hill. Retrieved from http://www.thearling.com/text/dmtechniques/dmtechniques.htm
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, 12th Edition. [VitalSource Bookshelf Online].
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • Dietterich, T. G. (2000). Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems (pp. 1–15). Springer Berlin Heidelberg.
  • Duignan, B. (2015). Occam’s razor. Encyclopaedia Britannica. Retrieved from https://www.britannica.com/topic/Occams-razor
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
  • Gall, M. D., Gall, J. P., & Borg, W. R. (2006). Educational Research: An Introduction, 8th Edition. [VitalSource Bookshelf Online].
  • Hubbard, D. W. (2010). How to measure anything: Finding the values of “intangibles” in business (2nd ed.). New Jersey: John Wiley & Sons, Inc.
  • Huck, S. W. (2011). Reading Statistics and Research, 6th Edition. [VitalSource Bookshelf Online].
  • Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
  • Spiegelhalter, D. & Rice, K. (2009) Bayesian statistics. Retrieved from http://www.scholarpedia.org/article/Bayesian_statistics
  • Vandekerckhove, J., Matzke, D., & Wagenmakers, E. J. (2014). Model comparison and the principle of parsimony.
  • Yudkowsky, E.S. (2003). An intuitive explanation of Bayesian reasoning. Retrieved from http://yudkowsky.net/rational/bayes

Adv Quant: Overfitting & Parsimony

Overfitting and Parsimony

Overfitting a regression model means stuffing it with many variables that have little contributional weight toward predicting the dependent variable (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2014).  Thus, to avoid the overfitting problem, parsimony is important in big data analytics.  Parsimony is describing a dependent variable with as few independent variables as possible (Field, 2013; Huck, 2013; Smith, 2015).  The best way to describe this is to apply the “Keep It Simple, Stupid” concept to the regression model.  The concept of parsimony can be attributed to Occam’s Razor, which states that “plurality ought never be posited without necessity” (Duignan, 2015).  Vandekerckhove et al. (2014) describe parsimony as a way of removing the noise from the signal to create better predictive regression models.
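
As a minimal sketch of the overfitting problem (assuming NumPy and scikit-learn; the data, sample sizes, and variable counts are invented purely for illustration), the model below that is stuffed with 40 noise predictors fits the training data better than a parsimonious three-predictor model but predicts held-out data worse:

```python
# Overfit vs. parsimonious linear regression on synthetic data (illustrative only)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n, p_useful, p_noise = 120, 3, 40
X_useful = rng.normal(size=(n, p_useful))               # predictors that truly drive y
X_noise = rng.normal(size=(n, p_noise))                 # predictors with little contributional weight
y = X_useful @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=n)
X_all = np.hstack([X_useful, X_noise])

X_tr, X_te, y_tr, y_te = train_test_split(X_all, y, test_size=0.5, random_state=7)

parsimonious = LinearRegression().fit(X_tr[:, :p_useful], y_tr)   # only the 3 real predictors
overfit = LinearRegression().fit(X_tr, y_tr)                      # all 43 predictors

print("training R^2: parsimonious =", round(parsimonious.score(X_tr[:, :p_useful], y_tr), 3),
      " overfit =", round(overfit.score(X_tr, y_tr), 3))
print("held-out R^2: parsimonious =", round(parsimonious.score(X_te[:, :p_useful], y_te), 3),
      " overfit =", round(overfit.score(X_te, y_te), 3))
```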

Overfitting in the General Least Squares Model (GLM)

For multivariate regressions, a correlation matrix can be computed on all the variables to help identify parsimony, such that the software tries to maximize the correlation while minimizing the number of variables (Field, 2013).  Smith (2015) stated that the proportion of variation explained should remain high, while the correlation between the separate independent variables should be as low as possible. If the correlation coefficient between independent variables is high (0.8 or higher), then there is a chance that there are extraneous variables (Smith, 2015).  Another technique to achieve parsimony is the backward stepwise method, which runs a regression model with all variables and removes those variables that do not contribute to the model significantly; alternatively, the model can start with one variable and add variables until correlation and explained variance are maximized in a forward stepwise method (Field, 2013; Huck, 2013).
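
A hedged sketch of both ideas follows, assuming pandas and statsmodels on made-up data: the correlation matrix flags nearly redundant predictors (|r| of 0.8 or higher), and a simple backward stepwise loop drops the least significant predictor until every remaining p-value falls below 0.05. The variable names and the 0.05 cutoff are illustrative assumptions, not requirements from the cited texts.

```python
# Correlation matrix check plus backward stepwise elimination (pandas/statsmodels assumed)
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["x4"] = 0.95 * df["x1"] + rng.normal(scale=0.1, size=n)    # nearly redundant with x1
df["y"] = 3 * df["x1"] - 2 * df["x2"] + rng.normal(size=n)    # x3 and x4 add little

print(df.drop(columns="y").corr().round(2))                   # flag pairs with |r| >= 0.8

predictors = ["x1", "x2", "x3", "x4"]
while predictors:
    model = sm.OLS(df["y"], sm.add_constant(df[predictors])).fit()
    pvals = model.pvalues.drop("const")
    worst = pvals.idxmax()
    if pvals[worst] < 0.05:                                   # every remaining predictor is significant
        break
    predictors.remove(worst)                                  # drop the weakest contributor and refit

print("retained predictors:", predictors)
print(model.summary().tables[1])
```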

Unfortunately, there is still a problem of overfitting when conducting a backward stepwise method, forward stepwise method, or correlation matrix in multivariate linear models.  That is because computers tend to remove, add, or consider variables systematically and mathematically, not based on human knowledge (Field, 2013; Huck, 2013). Thus, it is still important to have a human evaluate the computational output for logic, consistency, and reliability.  However, while the focus here is on reducing overfitting, underfitting should also be avoided.  Underfitting a regression model happens when the model leaves out key independent variables that can help predict the dependent variable (Field, 2013).

Hierarchical regression methods

When researchers build a multivariate regression model in stages, adding known independent variables first and newer independent variables afterward to avoid overfitting, the technique is called hierarchical regression (Austin, Goel, & van Walraven, 2001; Field, 2013; Huck, 2013).  The new and unknown independent variables can be entered through a stepwise algorithm, as mentioned above, or another step can be created in which suspected new variables that may contribute strongly to the predictability of the dependent variable are added next (Field, 2013).  Hierarchical regression methods allow the researcher to analyze the differing hierarchical levels by examining not only the correlations between the levels but also the intercepts and slopes, helping drive valid statistical inferences (Austin et al., 2001).
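
The sketch below, assuming statsmodels and invented predictors named known1/known2 and new1/new2, illustrates blockwise entry in this spirit: the known variables form step 1, the newer candidates form step 2, and the change in R² is tested with an F-test before the new block is retained.

```python
# Hierarchical (blockwise) entry with an R^2-change F-test (statsmodels assumed; data invented)
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({
    "known1": rng.normal(size=n),
    "known2": rng.normal(size=n),
    "new1": rng.normal(size=n),     # hypothetical newer predictor
    "new2": rng.normal(size=n),     # hypothetical newer predictor
})
df["y"] = 1.5 * df["known1"] + 0.8 * df["known2"] + 0.4 * df["new1"] + rng.normal(size=n)

# Step 1: known predictors only; Step 2: known predictors plus the new block
step1 = sm.OLS(df["y"], sm.add_constant(df[["known1", "known2"]])).fit()
step2 = sm.OLS(df["y"], sm.add_constant(df[["known1", "known2", "new1", "new2"]])).fit()

f_value, p_value, df_diff = step2.compare_f_test(step1)       # does the new block add predictive value?
print(f"Step 1 R^2 = {step1.rsquared:.3f}, Step 2 R^2 = {step2.rsquared:.3f}")
print(f"R^2 change = {step2.rsquared - step1.rsquared:.3f}, F = {f_value:.2f}, p = {p_value:.4f}")
```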

Vandekerckhove et al. (2014) listed these three hierarchical methods for model selection, where each method balances the goodness of fit against parsimony:

  • Akaike’s Information Criterion (AIC) considers how much the observed data influence the belief in one model over another, but it is unreliable with huge amounts of data
  • Bayesian Information Criterion (BIC) considers how much the observed data influence the belief in one model over another and can handle huge amounts of data, but it is known to underfit
  • Minimum Description Length (MDL) considers how much a model can compress the observed data by identifying regularity within the data values

Vandekerckhove et al. (2014) also stated that the model with the lowest AIC and/or BIC score would be the best one to choose, as sketched below.
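
As a brief illustration (assuming statsmodels; the candidate models and data are synthetic and purely for the example), AIC and BIC can be computed for competing specifications and the lowest score taken as the preferred balance of fit and parsimony:

```python
# Comparing candidate models by AIC and BIC (statsmodels assumed; data synthetic)
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 250
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(size=n)        # x3 and x4 are pure noise

candidates = {
    "y ~ x1": ["x1"],
    "y ~ x1 + x2": ["x1", "x2"],
    "y ~ x1 + x2 + x3 + x4": ["x1", "x2", "x3", "x4"],
}
for label, cols in candidates.items():
    fit = sm.OLS(df["y"], sm.add_constant(df[cols])).fit()
    print(f"{label:24s} AIC = {fit.aic:8.1f}   BIC = {fit.bic:8.1f}")
```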

In conclusion, under parsimony, if adding another variable does not improve the regression formula, then it should not be added to the assessment, in order to avoid overfitting (Field, 2013). General Least Squares Models run into overfitting because computers conduct their analysis systematically and mathematically and lack the human knowledge needed to keep removing unneeded variables from the equation.  Hierarchical regression methods can help minimize overfitting through the indirect calculation of a parsimony value (Vandekerckhove et al., 2014).

References

  • Austin, P. C., Goel, V., & van Walraven, C. (2001). An introduction to multilevel regression models. Canadian Journal of Public Health, 92(2), 150.
  • Duignan, B. (2015). Occam’s razor. Encyclopaedia Britannica. Retrieved from https://www.britannica.com/topic/Occams-razor
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
  • Huck, S. W. (2013) Reading Statistics and Research (6th ed.). Pearson Learning Solutions. [VitalSource Bookshelf Online].
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
  • Vandekerckhove, J., Matzke, D., & Wagenmakers, E. J. (2014). Model comparison and the principle of parsimony.