Adv Quant: Compelling Topics

Compelling topics summary/definitions

  • Supervised machine learning algorithms: models that need a training and a testing data set, and that validate their predictions against a predetermined (known) output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014).
  • Unsupervised machine learning algorithms: models that need a training and a testing data set but, unlike supervised learning, do not validate their output against a predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). Instead, unsupervised learning tries to find the natural relationships in the input data (Ahlemeyer-Stubbe & Coleman, 2014).
  • General Least Squares Model (GLM): the line of best fit used in linear regression modeling, along with its corresponding correlations (Smith, 2015). A linear regression model rests on five assumptions: additivity, linearity, independent errors, homoscedasticity, and normally distributed errors.
  • Overfitting: stuffing a regression model with so many variables that many of them contribute little weight toward predicting the dependent variable (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2014). Thus, to avoid the overfitting problem, parsimony is important in big data analytics.
  • Parsimony: describing a dependent variable with the fewest independent variables possible (Field, 2013; Huck, 2011; Smith, 2015). The concept of parsimony can be attributed to Occam’s Razor, which states that “plurality should never be posited without necessity” (Duignan, 2015). Vandekerckhove et al. (2014) describe parsimony as a way of removing the noise from the signal to create better predictive regression models.
  • Hierarchical Regression: a technique in which the researcher builds a multivariate regression model in stages, adding known independent variables first and newer independent variables afterward, in order to avoid overfitting (Austin, Goel, & van Walraven, 2001; Field, 2013; Huck, 2011).
  • Logistic Regression: multi-variable regression in which one or more continuous or categorical independent variables are used to predict a dichotomous/binary/categorical dependent variable (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).
  • Nearest Neighbor Methods: in K-nearest neighbor (e.g., K = 5), a data point is assigned to a group by having its 5 nearest neighbors vote on that data point; the method is particularly useful if the data is binary or categorical (Berson, Smith, & Thearling, 1999).
  • Classification Trees: aid in data abstraction and finding patterns in an intuitive way (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Connolly & Begg, 2014) and aid the decision-making process by mapping out all the paths, solutions, or options available to the decision maker.
  • Bayesian Analysis: can be reduced to a conditional probability that aims to take into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003).
  • Discriminant Analysis: determines how data should best be separated into several groups, based on the independent variables that create the largest separation in the prediction (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013).
  • Ensemble Models: can perform better than a single classifier, since they are created as a combination of classifiers that have a weight attached to them to properly classify new data points (Bauer & Kohavi, 1999; Dietterich, 2000), through techniques like Bagging and Boosting. Boosting procedures help reduce both bias and variance of the different methods, and bagging procedures reduce just the variance of the different methods (Bauer & Kohavi, 1999; Liaw & Wiener, 2002).
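
The overfitting, parsimony, and hierarchical regression definitions above can be illustrated with a short sketch in R. It uses the built-in mtcars data set purely as a stand-in (an assumption, not data discussed in this post), and the choice of predictors in each stage is hypothetical.

```r
# Hierarchical (staged) regression on the built-in mtcars data.
# Stage 1: one well-known predictor; stage 2: add a second; stage 3: add everything.
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ ., data = mtcars)
models <- list(m1 = m1, m2 = m2, m3 = m3)

# F-tests between the nested models: does each stage add explanatory power?
print(anova(m1, m2, m3))

# R^2 can only go up as predictors are added, which is why it rewards overfitting.
r2 <- sapply(models, function(m) summary(m)$r.squared)

# Adjusted R^2 and AIC penalise extra parameters, so they favour parsimony.
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
aic <- sapply(models, AIC)

print(round(rbind(r2 = r2, adj_r2 = adj_r2, aic = aic), 3))
```

Comparing the rows of the printed table is one way to judge whether the later stages earn their extra variables; other criteria (BIC, cross-validation) would serve the same purpose.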

 

References

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Austin, P. C., Goel, V., & van Walraven, C. (2001). An introduction to multilevel regression models. Canadian Journal of Public Health, 92(2), 150.
  • Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2), 105-139.
  • Berson, A., Smith, S., & Thearling, K. (1999). Building Data Mining Applications for CRM. McGraw-Hill. Retrieved from http://www.thearling.com/text/dmtechniques/dmtechniques.htm
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, 12th Edition. [VitalSource Bookshelf Online].
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • Dietterich, T. G. (2000). Ensemble methods in machine learning. International workshop on multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg.
  • Duignan, B. (2015). Occam’s razor. Encyclopaedia Britannica. Retrieved from https://www.britannica.com/topic/Occams-razor
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
  • Gall, M. D., Gall, J. P., & Borg, W. R. (2006). Educational Research: An Introduction, 8th Edition. [VitalSource Bookshelf Online].
  • Hubbard, D. W. (2010). How to measure anything: Finding the values of “intangibles” in business. (2nd ed.) New Jersey, John Wiley & Sons, Inc.
  • Huck, S. W. (2011). Reading Statistics and Research, 6th Edition. [VitalSource Bookshelf Online].
  • Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
  • Spiegelhalter, D. & Rice, K. (2009) Bayesian statistics. Retrieved from http://www.scholarpedia.org/article/Bayesian_statistics
  • Vandekerckhove, J., Matzke, D., & Wagenmakers, E. J. (2014). Model comparison and the principle of parsimony.
  • Yudkowsky, E.S. (2003). An intuitive explanation of Bayesian reasoning. Retrieved from http://yudkowsky.net/rational/bayes

Adv Quant: Ensemble Classifiers and RandomForests

Ensemble classifiers can perform better than a single classifier since they are created as a combination of classifiers, each with a weight attached, to properly classify new data points (Bauer & Kohavi, 1999; Dietterich, 2000).  The ensemble classifier can include methods such as:

  • Logistic Regression: multi-variable regression in which one or more continuous or categorical independent variables are used to predict a dichotomous/binary/categorical dependent variable (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).
  • Nearest Neighbor Methods: in K-nearest neighbor (e.g., K = 5), a data point is assigned to a group by having its 5 nearest neighbors vote on that data point; the method is particularly useful if the data is binary or categorical (Berson, Smith, & Thearling, 1999).
  • Classification Trees: aid in data abstraction and finding patterns in an intuitive way (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Connolly & Begg, 2014) and aid the decision-making process by mapping out all the paths, solutions, or options available to the decision maker.
  • Bayesian Analysis: can be reduced to a conditional probability that aims to take into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003).
  • Discriminant Analysis: determines how data should best be separated into several groups, based on the independent variables that create the largest separation in the prediction (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013).

As mentioned above, the ensemble classifier can create weights for each classifier to help improve the accuracy of the total “ensemble classifier result,” through boosting and bagging procedures.  Boosting procedures help reduce both bias and variance of the different methods, and bagging procedures reduce just the variance of the different methods (Bauer & Kohavi, 1999; Liaw & Wiener, 2002).

  • Boosting: strengthens weak classifying algorithms by running them serially, to force a reduction in the expected error (Bauer & Kohavi, 1999). The algorithm runs serially because each classifier’s vote is taken into account by the next classifier’s prediction (Liaw & Wiener, 2002).
  • Bagging (Bootstrap aggregating): builds classifiers from different uniform samples drawn with replacement from the training data set. These can be computed in parallel because no classifier depends on another classifier’s vote to run its own prediction (Bauer & Kohavi, 1999; Liaw & Wiener, 2002). This is also known as an averaging method, of which the random forest is an example (Ahlemeyer-Stubbe & Coleman, 2014).
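
A minimal sketch of bagging in base R may help make the description concrete. Everything here is an assumption for illustration: the data are simulated, logistic regression stands in for the base classifier, and 25 models is an arbitrary choice. The accuracy is measured on the training rows, so it says nothing about out-of-sample performance.

```r
set.seed(1)

# Simulated binary outcome driven by one noisy predictor (hypothetical data).
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
dat <- data.frame(x = x, y = y)

n_models <- 25  # odd number, so a majority vote cannot tie

# Each column holds one bootstrap model's predicted probabilities for every row.
# The models never look at each other's output, so this loop could run in parallel.
probs <- sapply(seq_len(n_models), function(b) {
  idx <- sample(n, replace = TRUE)  # uniform sample with replacement
  fit <- glm(y ~ x, data = dat[idx, ], family = binomial)
  predict(fit, newdata = dat, type = "response")
})

# Unweighted majority vote across the bootstrap models.
vote_share <- rowMeans(probs > 0.5)
bagged_pred <- as.integer(vote_share > 0.5)

accuracy <- mean(bagged_pred == dat$y)
print(accuracy)
```

Boosting would differ in that each model would be fitted after, and informed by, the errors of the previous one, which is why it cannot be parallelized in the same way.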

Random Forest

According to Ahlemeyer-Stubbe and Coleman (2014), random forests are multiple decision trees constructed by selecting multiple random samples from the same data set (either through resampling or disjoint sampling), and a variable that appears more frequently in the forest adds confidence that it has a real influence on the dependent variable.  Liaw and Wiener (2002) affirmed this, stating that a variable appearing frequently among many trees in the forest not only adds confidence in its influence but can also help determine its proximity to the root node.  Random forests add a new level of randomness to bagging algorithms and are robust against overfitting, which is a problem with some decision tree algorithms (Ahlemeyer-Stubbe & Coleman, 2014; Liaw & Wiener, 2002).

The use of random forests is most helpful when relationships between the variables are weak or when very little data is available (Ahlemeyer-Stubbe & Coleman, 2014).  It is also worth considering that the number of trees needed to achieve good performance increases as the number of variables under consideration increases (Liaw & Wiener, 2002). To learn how to run random forest algorithms in the statistical programming language R, see Liaw and Wiener (2002), who shared some of their coding syntax as well as observations on how to effectively meet the objectives.
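
As a rough illustration only, the sketch below fits a random forest with the randomForest package described by Liaw and Wiener (2002). It assumes that package is installed (the code skips the fit otherwise), uses the built-in iris data as a stand-in, and picks 500 trees arbitrarily; it is not the syntax from their article.

```r
set.seed(42)

rf_fit <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  # Fit a forest of 500 trees; importance = TRUE records variable importance.
  rf_fit <- randomForest::randomForest(
    Species ~ .,
    data = iris,
    ntree = 500,
    importance = TRUE
  )

  # Out-of-bag error estimate and confusion matrix.
  print(rf_fit)

  # Variables with higher importance influence the prediction more.
  print(randomForest::importance(rf_fit))
} else {
  message("Package 'randomForest' is not installed; skipping the example.")
}
```

Raising or lowering ntree is one way to explore the point above about the number of trees needed as the number of variables grows.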

References:

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2), 105-139.
  • Berson, A., Smith, S., & Thearling, K. (1999). Building Data Mining Applications for CRM. McGraw-Hill. Retrieved from http://www.thearling.com/text/dmtechniques/dmtechniques.htm
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, 12th Edition. [VitalSource Bookshelf Online].
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • Dietterich, T. G. (2000). Ensemble methods in machine learning. International workshop on multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg.
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].

Adv Quant: Bayesian analysis in R

Introduction

Bayes’ theorem is a conditional probability that takes into account prior knowledge but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015).  The formulation of Bayes’ theorem is p(θ|y) = p(θ) p(y|θ) / Σ [p(θ) p(y|θ)], where p(θ) are the prior probabilities and p(y|θ) are the likelihoods (Cowles, Kass, & O’Hagan, 2009).
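
To make the formula concrete, here is a small worked sketch in R. The prior and likelihood values are made up for illustration and are not taken from the flight data set.

```r
# Hypothetical priors p(theta) for two states of a flight.
prior <- c(on_time = 0.77, delayed = 0.23)

# Hypothetical likelihoods p(y | theta): probability of seeing a
# bad-weather flag (y) under each state.
likelihood <- c(on_time = 0.05, delayed = 0.40)

# Bayes' theorem: numerator for each state, divided by the sum over all states.
numerator <- prior * likelihood
posterior <- numerator / sum(numerator)

print(round(posterior, 4))
```

With these numbers, observing the weather flag moves most of the probability mass to the delayed state, even though a delay was the less likely state beforehand.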

The Delayed Airplanes Dataset consists of airplane flights from Washington D.C. into New York City.  The date range for this data is for the entire month of February 2016, and there are 702 cases to be studied.

Results

Figure 1: Histogram showcasing the density of flight delays that are 15 minutes or longer.

Figure 2: Shows summary data for the variables in this Bayesian Analysis before training and testing.

Figure 3: Bayesian Prediction of the flight delay data from Washington, D.C. to New York City, NY.

Figure 4: Bayesian prediction results versus the test data results, where false negatives are encircled in blue, while false positives are encircled in red.

Discussion

The histogram (Figure 1) showcases that almost three times as many flights from Washington, D.C. to New York City, NY depart on time as are delayed.  The summary data supports this (Figure 2).

The above summary (Figure 2) states that 77.813% of the flights were not delayed by 15 minutes or more, for the cases we do have data on. There is null data in the departure time, delayed 15 minutes or more, and weather delay variables.  To know the percentage of flights per day of the week, carrier, destination, etc., the prior probabilities need to be calculated, as in the code below.

About 77.2973% of the training data didn’t have a delay, but 22.7027% did have a delay of 15 minutes or more (from the tdelay variable).  These values are close to those in the summary above (Figure 2). Thus the training data could be trusted, even though a random sample wasn’t taken.  The reason for not taking a random sample is to be able to predict into the future, given that 60% of the data is already collected.
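
One way to express such a non-random, forward-looking 60/40 split is sketched below. It assumes the rows are already ordered by flight date and uses hypothetical variable names; only the 702-case count comes from this post.

```r
n_total <- 702                          # number of cases reported for this data set
n_train <- floor(0.6 * n_total)         # size of the training portion

# Chronological split: the first 60% of rows train the model,
# the remaining 40% play the role of "future" flights to predict.
train_idx <- seq_len(n_train)
test_idx  <- setdiff(seq_len(n_total), train_idx)

print(c(train = length(train_idx), test = length(test_idx)))
```

A random split would instead draw the training indices with sample(), which mixes early and late flights and so no longer mimics predicting into the future.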

Comparing both histograms (Figure 1 and Figure 3), the distribution of the first histogram is binomial.  However, the posterior distribution, the second histogram, is shaped like a positively skewed distribution.  This was an expected result described by Smith (2015), who states that the prior distribution has an effect on the posterior distribution.

The Bayesian prediction results tend to produce more false negatives than false positives when compared to the real test data, thus indicating more type II error than type I error.  In the code below, the share of test cases that are misclassified at a cutoff of 0.5 is 15.302%.

Code

#

## Locate the data, filter out the data, and pull it into R from the computer (R, n.d.b.)

#

setwd("C:/Users/XXX/Documents/R/dataSets")

airplaneData=read.csv("022016DC2NYC_1022370032_T_ONTIME.csv", header = TRUE, sep = ",")

#

##

### ———————————————————————————————————-

##  Data Source: http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time

##        Dependent:   Departure Delay Indicator, 15 minutes or more (Dep_Del15)

##        Independent: Departure airports of Baltimore-BWI, Dulles-IAD, and Reagan-DCA (Origin)

##        Independent: Arrival airports of Newark-EWR, Kennedy-JFK, and LaGuardia-LGA (Dest)

##        Independent: Carriers (Carrier)

##        Independent: Hours of departure (Dep_Time)

##        Independent: Weather conditions (Weather_Delay)

##        Independent: Monday = 1, Tuesday = 2, …Sunday = 7 (Day_Of_Week)

### ———————————————————————————————————-

##  bayes theory => p(theta|y)= p(theta)*P(y|theta)/(SUM(P(theta)*P(y|theta))) (Cowles, Kass, & O’Hagan, 2009)

### ———————————————————————————————————-

##

#

## Create a data.frame

delay = data.frame(airplaneData)

## Factoring and labeling the variables (Taddy, n.d.)

delay$DEP_TIME = factor(floor(delay$DEP_TIME/100))

delay$DAY_OF_WEEK = factor(delay$DAY_OF_WEEK, labels = c("M", "T", "W", "R", "F", "S", "U"))

delay$DEP_DEL15 = factor(delay$DEP_DEL15)

delay$WEATHER_DELAY= factor(ifelse(delay$WEATHER_DELAY>=1,1,0)) # (R, n.d.a.)

delay$CARRIER = factor(delay$CARRIER, levels = c("AA","B6","DL","EV","UA"))

levels(delay$CARRIER) = c("American", "JetBlue", "Delta", "ExpressJet", "UnitedAir")

## Quick look at the data

delayed15 = as.numeric(levels(delay$DEP_DEL15)[delay$DEP_DEL15])

hist(delayed15, freq=FALSE, main = "Histogram of Delays of 15 mins or longer", xlab = "time >= 15 mins (1) or time < 15 (0)")

summary(delay)

### Create the training and testing data (60/40%)

ntotal=length(delay$DAY_OF_WEEK)    # Total number of datapoints assigned dynamically

ntrain = sample(1:ntotal,floor(ntotal*(0.6))) # Row indices of the training cases (60% of the data)

ntest = ntotal-floor(ntotal*(0.6))       # The number of test cases (40% of the data)

trainingData = cbind(delay$DAY_OF_WEEK[ntrain], delay$CARRIER[ntrain],delay$ORIGIN[ntrain],delay$DEST[ntrain],delay$DEP_TIME[ntrain],delay$WEATHER_DELAY[ntrain],delayed15[ntrain])

testingData  = cbind(delay$DAY_OF_WEEK[-ntrain], delay$CARRIER[-ntrain],delay$ORIGIN[-ntrain],delay$DEST[-ntrain],delay$DEP_TIME[-ntrain],delay$WEATHER_DELAY[-ntrain],delayed15[-ntrain])

## Partitioning the training data by outcome: not delayed (0) versus delayed (1)

trainFirst= trainingData[trainingData[,7]<0.5,]

trainSecond= trainingData[trainingData[,7]>0.5,]

### Prior probabilities = p(theta) (Cowles, Kass, & O’Hagan, 2009)

## Dependent variable: time delayed >= 15

tdelay=table(delayed15[ntrain])/sum(table(delayed15[ntrain]))

### Prior probabilities between the partitioned training data

## Independent variable: Day of the week (% of flights that occurred on each day of the week)

tday1=table(trainFirst[,1])/sum(table(trainFirst[,1]))

tday2=table(trainSecond[,1])/sum(table(trainSecond[,1]))

## Independent variable: Carrier (% of flights that occurred with each carrier)

tcarrier1=table(trainFirst[,2])/sum(table(trainFirst[,2]))

tcarrier2=table(trainSecond[,2])/sum(table(trainSecond[,2]))

## Independent variable: Origin (% of flights that occurred from each originating airport)

tOrigin1=table(trainFirst[,3])/sum(table(trainFirst[,3]))

tOrigin2=table(trainSecond[,3])/sum(table(trainSecond[,3]))

## Independent variable: Destination (% of flights that occurred to each destination airport)

tdest1=table(trainFirst[,4])/sum(table(trainFirst[,4]))

tdest2=table(trainSecond[,4])/sum(table(trainSecond[,4]))

## Independent variable: Departure Time (% of flights that occurred at each hour of the day)

tTime1=table(trainFirst[,5])/sum(table(trainFirst[,5]))

tTime2=table(trainSecond[,5])/sum(table(trainSecond[,5]))

## Independent variable: Weather (% flights delayed because of adverse weather conditions)

twx1=table(trainFirst[,6])/sum(table(trainFirst[,6]))

twx2=table(trainSecond[,6])/sum(table(trainSecond[,6]))

### likelihoods = p(y|theta) (Cowles, Kass, & O’Hagan, 2009)

likelihood1=tday1[testingData[,1]]*tcarrier1[testingData[,2]]*tOrigin1[testingData[,3]]*tdest1[testingData[,4]]*tTime1[testingData[,5]]*twx1[testingData[,6]]

likelihood2=tday2[testingData[,1]]*tcarrier2[testingData[,2]]*tOrigin2[testingData[,3]]*tdest2[testingData[,4]]*tTime2[testingData[,5]]*twx2[testingData[,6]]

### Predictions using bayes theory = p(theta|y)= p(theta)*P(y|theta)/(SUM(P(theta)*P(y|theta))) (Cowles, Kass, & O’Hagan, 2009)

Bayes=(likelihood2*tdelay[2])/(likelihood2*tdelay[2]+likelihood1*tdelay[1])

hist(Bayes, freq=FALSE, main="Bayesian Analysis of flight delay data")

plot(delayed15[-ntrain]~Bayes, main="Bayes results versus actual results for flights delayed >= 15 mins", xlab="Bayes Analysis Prediction of which cases will be delayed", ylab="Actual results from test data showing delayed cases")

## Misclassification rate at a 0.5 cutoff: (false positives + false negatives) / number of test cases

densityMeasure = table(delayed15[-ntrain],floor(Bayes+0.5))

probabilityOfXlarger=(densityMeasure[1,2]+densityMeasure[2,1])/ntest

probabilityOfXlarger
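
## ---------------------------------------------------------------------------

The sketch below is a separate, self-contained illustration of how the false positives (type I errors) and false negatives (type II errors) discussed above can be counted. The actual and posterior vectors are hypothetical values, not output from the listing above.

```r
# Hypothetical actual outcomes (1 = delayed) and posterior probabilities.
actual    <- c(0, 0, 1, 1, 0, 1, 0, 0, 1, 0)
posterior <- c(0.10, 0.62, 0.35, 0.81, 0.20, 0.44, 0.05, 0.30, 0.90, 0.15)

# Classify a case as delayed when its posterior probability is 0.5 or larger.
predicted <- as.integer(posterior >= 0.5)

# Fixing the factor levels guarantees a 2 x 2 table even if one class
# is never predicted.
confusion <- table(actual    = factor(actual, levels = 0:1),
                   predicted = factor(predicted, levels = 0:1))
print(confusion)

false_pos  <- confusion["0", "1"]   # predicted delayed, actually on time (type I)
false_neg  <- confusion["1", "0"]   # predicted on time, actually delayed (type II)
error_rate <- (false_pos + false_neg) / length(actual)
print(error_rate)
```

Fixing the levels is a defensive choice: without it, indexing the table by position fails whenever every test case falls on the same side of the cutoff.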

References

Adv Quant: Use of Bayesian Analysis in research

Just using knowledge before data collection and the knowledge gained from data collection doesn’t tell the full story until they are combined, hence establishing the need for Bayesian analysis (Hubbard, 2010).  Bayes’ theorem is a conditional probability that takes into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015).  Bayesian analysis aids in avoiding overconfidence and underconfidence because it doesn’t ignore prior or new data (Hubbard, 2010).  There are many examples of how Bayesian analysis can be used in the context of social media data.  Below are just three of many:

  • With high precision, Bayesian analysis was able to distinguish spam Twitter accounts from legitimate users, based on their followers/following ratio information and their 100 most recent tweets (McCord & Chuah, 2011). McCord and Chuah (2011) were able to use Bayesian analysis to achieve 75% accuracy in detecting spam using only user-based features, and ~90% accuracy when using both user- and content-based features.
  • Boullé (2014) applied Bayesian analysis to 60,000 URLs from 100 websites. The goal was to predict the number of visits and messages on Twitter and Facebook after 48 hours, and Boullé (2014) was able to come close to the actual numbers, showcasing the robustness of the approach.
  • Zaman, Fox, and Bradlow (2014) were able to use Bayesian analysis to predict the popularity of tweets by measuring the final count of retweets a source tweet gets.

An in-depth exploration of Zaman, et al. (2014)

Goal:

The researchers aimed to predict how popular a tweet can become, using a Bayesian model to analyze the time path of retweets a tweet receives and the eventual number of retweets one week later.

  • They analyzed 52 tweets on different topics such as music, politics, etc.
    • They narrowed the scope to tweets with a maximum of 1,800 retweets per root tweet.

Defining the parameters:

  • Twitter = microblogging site
  • Tweets = microblogging content that is contained in up to 140 characters
  • Root tweets = original tweets
  • Root user = generator of the root tweet
  • End user = those who read the root tweet and retweeted it
  • Twitter followers = people who are following the content of a root user
  • Follower graph = resulting connections into a social graph from known twitter followers
  • Retweet = a twitter follower’s sharing of content from the user for their followers to read
  • Depth of 1 = how many end users retweeted a root tweet
  • Depth of 2 = how many end users retweeted a retweet of the root tweet

Exploration of the data:

From the 52 sampled root tweets, the researchers found that the tweets had anywhere between 21 and 1,260 retweets associated with them and that the last retweet could have occurred anywhere from a few hours to a few days after the root tweet’s generation.  The researchers calculated the median times from the last retweet, yielding values that ranged from 4 minutes to 3 hours.  The difference between the median times was not statistically significant enough to reject the null hypothesis concerning a difference in the median times.  As stated by the researchers, this gave potentially more weight to the value of the Bayesian model over purely descriptive/exploratory methods.

The researchers explored the depth of the retweets and found that 11,882 were a depth of 1, whereas 314 were a depth of 2 or more in those 52 root tweets, which suggested that root tweets get more retweets than retweeted tweets.  It was suggested by the researchers that the depth seemed to have occurred because of a large number of followers from the retweeter’s side.

The researchers noted that the number of retweets along the time path decays similarly to a log-normal distribution, which is what was used in the Bayesian analysis model.
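
A small simulation can show what a log-normal description of retweet timing looks like. This is only a sketch with made-up parameters; it is not the model or the data of Zaman et al. (2014).

```r
set.seed(7)

# Hypothetical parameters of the log-normal distribution (on the log scale).
true_meanlog <- 2
true_sdlog   <- 1

# Simulated minutes between a root tweet and each of 1000 retweets.
retweet_minutes <- rlnorm(1000, meanlog = true_meanlog, sdlog = true_sdlog)

# If the times are log-normal, their logarithms are normal, so the
# parameters can be estimated from the mean and sd of the logged times.
log_times   <- log(retweet_minutes)
est_meanlog <- mean(log_times)
est_sdlog   <- sd(log_times)

# The median of a log-normal distribution is exp(meanlog).
est_median <- exp(est_meanlog)

print(c(meanlog = est_meanlog, sdlog = est_sdlog, median_minutes = est_median))
hist(retweet_minutes, breaks = 50, main = "Simulated retweet times", xlab = "Minutes")
```

The histogram shows the long right tail that makes a log-normal shape plausible for this kind of data: most retweets arrive early, and a few arrive much later.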

Bayesian analysis results:

The researchers randomly partitioned their sample into a training set of 26 observations and a testing set of 26 observations, and varied the fraction of retweet observations used from 10% to 100% of the last retweet.  Their main results are plotted in boxplots, where the whiskers cover 90% of the posterior solution (Figure 10).

[Figure 10 from Zaman et al. (2014): boxplots of the prediction results]

The figure above is directly from Zaman, et al. (2014). The authors mentioned that as the observation fraction increased the absolute percent errors decreased.    For future work, the researchers suggested that their analysis could be parallelized to incorporate more data points, take into consideration the time of day the root tweet was posted, as well as understanding the content within the tweets and their retweet-ability because of it.

References

  • Boullé, M. (2014). Selective Naive Bayes Regressor with Variable Construction for Predictive Web Analytics.
  • Hubbard, D. W. (2010). How to measure anything: Finding the values of “intangibles” in business. (2nd ed.) New Jersey, John Wiley & Sons, Inc.
  • McCord, M., & Chuah, M. (2011). Spam detection on twitter using traditional classifiers. In International Conference on Autonomic and Trusted Computing (pp. 175-186). Springer Berlin Heidelberg.
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
  • Zaman, T., Fox, E. B., & Bradlow, E. T. (2014). A Bayesian approach for predicting the popularity of tweets. The Annals of Applied Statistics, 8(3), 1583-1611.

Adv Quant: Bayesian Analysis

Uncertainty in making decisions

Generalizing from something specific is, from a statistical standpoint, the problem of induction, and it can cause uncertainty in making decisions (Spiegelhalter & Rice, 2009). Uncertainty in making a decision can also arise from not knowing how to incorporate new data with old assumptions (Hubbard, 2010).

According to Hubbard (2010) conventional statistics assumes:

(1)    The researcher has no prior information about the range of possible values (which is never true) or,

(2)    The researcher does have prior knowledge of the distribution of the population, and it is never any of the messy ones (which is not true more often than not)

Thus, knowledge before data collection and the knowledge gained from data collection don’t tell the full story until they are combined, hence the need for Bayesian analysis (Hubbard, 2010).  Bayes’ theorem can be reduced to a conditional probability that aims to take into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003).  Bayesian analysis avoids the overconfidence and underconfidence that come from ignoring prior data or ignoring new data (Hubbard, 2010), through the implementation of the equation below:

 P(hypothesis|data) = P(data|hypothesis) × P(hypothesis) / P(data)                           (1)

Where P(hypothesis|data) is the posterior probability, P(hypothesis) is the prior probability of the hypothesis/distribution before the data is introduced, P(data) is the marginal probability of the data, and P(data|hypothesis) is the likelihood of observing the data if the hypothesis/distribution is true (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003).  This forces the researcher to think about the likelihood that different and new observations could impact a current hypothesis (Hubbard, 2010). Equation (1) shows that evidence is usually a result of two conditional probabilities, where the strongest evidence comes from a low probability that the new data could have led to X (Yudkowsky, 2003).  From these two conditional probabilities, the resultant value is approximately the average of the prior assumptions and the new data gained (Hubbard, 2010; Smith, 2015).  Smith (2015) describes this approximation in the following simplified relationship (equation 2):

 posterior ∝ likelihood × prior                                            (2)

Therefore, from equation (2), the type of prior assumption influences the resulting posterior. A Uniform, Constant, or Normal prior distribution results in a Normal posterior distribution, and a Beta or Binomial distribution results in a Beta posterior distribution (Smith, 2015).  To use Bayesian analysis, one must take into account the analysis’ assumptions.
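
The Beta case can be worked through in a few lines of R. The prior parameters and the observed counts below are hypothetical; the sketch relies on the standard conjugate result that a Beta(a, b) prior combined with binomial data (s successes, f failures) gives a Beta(a + s, b + f) posterior.

```r
# Hypothetical Beta(2, 2) prior: centred on 0.5 and fairly weak.
prior_a <- 2
prior_b <- 2

# Hypothetical new data: 7 successes and 3 failures.
successes <- 7
failures  <- 3

# Conjugate update: add the counts to the prior parameters.
post_a <- prior_a + successes
post_b <- prior_b + failures

prior_mean <- prior_a / (prior_a + prior_b)       # what was believed before
data_prop  <- successes / (successes + failures)  # what the data alone says
post_mean  <- post_a / (post_a + post_b)          # the combined belief

# 95% credible interval from the Beta posterior.
cred_int <- qbeta(c(0.025, 0.975), post_a, post_b)

print(c(prior = prior_mean, data = data_prop, posterior = post_mean))
print(cred_int)
```

The posterior mean lands between the prior mean and the observed proportion, which matches the description of the result as roughly an average of the prior assumptions and the new data.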

Basic Assumptions of Bayesian Analysis

Though these three assumptions are great to have for Bayesian analysis, it has been argued that they are quite unrealistic with real-life data, particularly unstructured text-based data (Lewis, 1998; Turhan & Bener, 2009):

  • Each of the new data samples is independent of the others and identically distributed (Lewis, 1998; Nigam & Ghani, 2000; Turhan & Bener, 2009)
  • Each attribute has equal importance (Turhan & Bener, 2009)
  • The new data is compatible with the target posterior (Nigam & Ghani, 2000; Smith, 2015).

Applications of Bayesian Analysis

There are typically three main situations where Bayesian Analysis is used (Spiegelhalter, & Rice, 2009):

  • Small data situations: The researcher has no choice but to include prior quantitative information, because of a lack of data, or lack of a distribution model.
  • Moderate size data situations: The researcher has multiple sources of data and can create a hierarchical model on the assumption of similar prior distributions.
  • Big data situations: There are huge joint probability models, with 1000s of data points or parameters, which can then be used to help make inferences about unknown aspects of the data.

Pros and Cons

Applying Bayesian analysis to data has its advantages and disadvantages.  The advantages and disadvantages of Bayesian analysis identified by SAS (n.d.) are:

Advantages

+    Allows for a combination of prior information with data, for stronger decision-making

+    No reliance on asymptotic approximation, thus the inferences are conditional on the data

+    Provides easily interpretable results.

Disadvantages

– Posteriors are heavily influenced by their priors.

– This method doesn’t help the researcher to select the proper prior, given how much influence it has on the posterior.

– Computationally expensive with large data sets.

The key takeaway from this discussion is that prior knowledge can heavily influence the posterior, which can easily be seen in equation (2).  That is because knowledge before data collection and the knowledge gained from data collection don’t tell the full story unless they are combined.
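
The influence of the prior can be sketched numerically. The example below assumes a Beta prior with binomial data, and all parameter values and counts are hypothetical. It compares a flat prior with a strongly opinionated one, first with little data and then with a lot.

```r
# Posterior mean of a Beta(a, b) prior after observing `successes` out of `n` trials.
posterior_mean <- function(a, b, successes, n) {
  (a + successes) / (a + b + n)
}

# Two hypothetical priors: flat, and strongly favouring a high success rate.
priors <- list(flat = c(1, 1), strong = c(20, 5))

# The same observed proportion (30%) with a small and a large sample.
small_n <- sapply(priors, function(p) posterior_mean(p[1], p[2], 3, 10))
large_n <- sapply(priors, function(p) posterior_mean(p[1], p[2], 300, 1000))

# How far apart the two priors leave the posterior in each case.
gap_small <- abs(small_n[["strong"]] - small_n[["flat"]])
gap_large <- abs(large_n[["strong"]] - large_n[["flat"]])

print(round(rbind(small_n = small_n, large_n = large_n), 4))
print(round(c(gap_small = gap_small, gap_large = gap_large), 4))
```

With only 10 observations the choice of prior changes the answer substantially, while with 1,000 observations the two posteriors nearly agree, which is why prior selection matters most in small data situations.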

Reference