Adv Topics: Extracting Knowledge from Big Data

The evolution of data into wisdom is described by the DIKW pyramid, where Data is just facts without any context; when facts are used to understand relationships, they generate Information (Almeyer-Stubbe & Coleman, 2014). When that information is used to understand patterns, it helps build Knowledge, and when that knowledge is used to understand principles, it builds Wisdom (Almeyer-Stubbe & Coleman, 2014; Bellinger, Castro, & Mills, n.d.). Building the understanding needed to jump from one level of the DIKW pyramid to the next comes from an appreciation of learning “why” (Bellinger et al., n.d.). Big data, a term first coined in a Gartner blog post, is data that has high volume, variety, and velocity; but without any interest in understanding that data, data scientists will lack context (Almeyer-Stubbe & Coleman, 2014; Bellinger et al., n.d.; Laney, 2001). Therefore, applying the DIKW pyramid can help turn that big data into extensive knowledge (Almeyer-Stubbe & Coleman, 2014; Bellinger et al., n.d.; Sakr, 2014). Extensive knowledge is derived from attaching meaning to big data, usually in the form of predictive analytics algorithms (Sakr, 2014).

Machine learning requires historical data and is part of the data analytics process, under data mining, used to understand hidden patterns or structures within the data (Almeyer-Stubbe & Coleman, 2014). Machine learning is easier to build and maintain than other classical data mining techniques (Wollan, Smith, & Zhou, 2010). Machine learning algorithms include clustering, classification, and association rules techniques, and the right algorithm from among these three families must be selected to meet the needs of the data (Services, 2015). Unsupervised machine learning techniques such as clustering are used when data scientists do not understand or have not classified the data beforehand, in order to uncover hidden structures within the data set (Brownlee, 2016; Services, 2015). Supervised machine learning involves model training and model testing to help understand which input variables feed into an output variable, and includes techniques such as classification and regression (Brownlee, 2016).
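To make that distinction concrete, here is a minimal sketch (my own illustration, not code from the cited texts) using Python's scikit-learn library on synthetic data: an unsupervised k-means clustering step that looks for hidden structure without labels, and a supervised classifier that is trained on one portion of the data and tested on a held-out portion.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

# Synthetic data: 200 records, 4 numeric variables, and a known label
# (the label is hidden from the unsupervised step).
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Unsupervised: k-means looks for hidden structure without using labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster sizes:", np.bincount(clusters))

# Supervised: train on one portion of the data, test on a held-out portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```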

An example of an open source Hadoop machine learning algorithm library is Apache Mahout, which can be found at http://mahout.apache.org (Lublinsky, Smith, & Yakubovich, 2013). A limitation of learning from historical data to predict the future is that it can “stifle innovation and imagination” (Almeyer-Stubbe & Coleman, 2014). Another limitation is that current algorithms may not run on distributed database systems, so some tailoring of the algorithms may be needed (Services, 2015). The future of machine learning involves its algorithms becoming more interactive with the end user, an approach known as active learning (Wollan, Smith, & Zhou, 2010).

Case Study: Machine learning, medical diagnosis, and biomedical engineering research – commentary (Foster, Koprowski, & Skufca, 2014)

The authors created a synthetic training data set to simulate a typical medical classification problem of healthy and ill people and assigned random numbers to 10 health variables. Given this design, the true classification accuracy should be 50%, no better than pure chance alone. The authors found that when classification machine learning algorithms are misapplied, they can produce false results; their model needed only 50 people to produce accuracy values comparable to pure chance alone. Thus, the authors of this paper were warning the medical field that misapplying classification techniques can lead to overfitting.
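The sketch below is not the authors' actual experiment, but it reproduces the kind of misapplication they warn about. The 50 people and 10 random health variables mirror the numbers in the summary above; the classifier choice and everything else are assumptions. Scoring a classifier on its own training data makes it look as if structure was found, while honest cross-validation on the same random data hovers around the 50% expected from chance.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 50 "patients", 10 random health variables, random healthy/ill labels:
# there is no real signal, so honest accuracy should hover around 50%.
X = rng.normal(size=(50, 10))
y = rng.integers(0, 2, size=50)

# Misapplication: score the classifier on the same data it was trained on.
clf = SVC(kernel="rbf").fit(X, y)
train_acc = clf.score(X, y)

# Proper use: cross-validation keeps training and testing data independent.
cv_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()

print(f"Accuracy on training data (misleading): {train_acc:.2f}")
print(f"Cross-validated accuracy (honest):      {cv_acc:.2f}")
```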

The authors then looked at feature selection for classifying Hashimoto’s disease from clinical ultrasound data on 250 people with the disease and 250 healthy people. Ten variables were selected to help classify these images, and a MATLAB machine learning algorithm was trained on 400 people (200 healthy and 200 ill) and then tested on 100 people (50 healthy and 50 ill). They were able to show that using only 3-4 of the variables produced better classification results, meaning those 3-4 variables carried most of the information gain. Such results can mislead practitioners, because a small data set can be generalized too broadly and because of the lack of independence between the training and testing data sets. The authors argued that larger data sets are needed to mitigate some of the issues that result from the misapplication of classifiers.

The authors have the following four recommendations when considering the use of supervised machine learning classification algorithms:

    1. Clearly state the purpose of the study and come from a place of understanding of that problem and its applications.
    2. Minimize the number of variables used in classifiers, for example by using pruning algorithms that only select variables meeting a certain level of information gain. This matters more with smaller data sets than with big data (see the sketch after this list).
    3. Understand that classifiers are sensitive and that results gained from one set of instances might require further adjustments to be implemented elsewhere.
    4. Classification algorithms and data mining are part of the experimental process, not the answer to all problems.
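As a hedged illustration of recommendation 2 (not code from the commentary), the sketch below scores a set of hypothetical clinical variables by mutual information, an information-gain-like criterion, keeps only the top three, and evaluates the resulting small decision tree with cross-validation so the selection itself stays honest. The data, variable counts, and thresholds are all assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)

# Hypothetical data set: 500 people, 10 clinical variables, of which only
# the first three actually relate to the (synthetic) diagnosis.
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Keep the 3 variables with the highest mutual information (an information-
# gain-like score), then fit a small, pruned decision tree.
model = make_pipeline(
    SelectKBest(score_func=mutual_info_classif, k=3),
    DecisionTreeClassifier(max_depth=3, random_state=0),
)

# Cross-validation keeps the selection honest on a data set this size.
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean().round(2))
```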


Quant: Understanding Variance

If a researcher were to look at a measure of job performance resulting from two different manufacturing processes and found that the mean performance of process A was 82.5 and the mean performance of process B was 78.5, they could not automatically assume that process A will consistently outperform process B. The researcher cannot come to a conclusion until an analysis of variance has been done on the data. There could be variance in performance within each process (within-group variance), for example because the statements of work required by process A and process B are uniquely different, and there could be variance between the groups of people conducting the statements of work (between-group variance). These two types of variance feed into the F-statistic, which lets the researcher state whether or not they can reject the null hypothesis that the two mean performances are the same.
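Here is a short worked example of that reasoning using SciPy's one-way ANOVA. Only the two means, 82.5 and 78.5, come from the scenario above; the spread and sample sizes are assumed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical performance scores; only the group means (82.5 and 78.5)
# come from the example above, the spread and sample sizes are assumed.
process_a = rng.normal(loc=82.5, scale=6.0, size=30)
process_b = rng.normal(loc=78.5, scale=6.0, size=30)

# One-way ANOVA compares between-group variance to within-group variance.
f_stat, p_value = stats.f_oneway(process_a, process_b)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

# Only if p falls below the chosen significance level (e.g., 0.05) can the
# researcher reject the null hypothesis that the two means are equal.
```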

Business Intelligence: Data Mining Success

For data mining success one must follow a data mining process. There are many processes out there, and here are two:

  • From Fayyad, Piatetsky-Shapiro, and Smyth (1996)
    1. Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Patterns -> Interpretation/Evaluation -> Knowledge
  • From Padhy, Mishra, and Panigrahi (2012)
    1. Business understanding -> Data understanding -> Data Preparation -> Modeling -> Evaluation -> Deployment

Success has as much to do with the events that lead up to the main event as it does with the main event itself. Thus, what is done to the data beforehand determines whether data mining can proceed successfully. Fayyad et al. (1996) explain that data mining is just one step of the knowledge discovery process, where data mining provides the algorithms/math that help reach the final goal. Looking at the two processes, we can see that they are slightly different yet essentially the same. Another key thing to note is that we can move back and forth (i.e., iterate) between the steps in these processes. Both suppose that data is being pulled from a knowledge database or data warehouse, where the data should be cleaned (uniformly represented, with missing data, noise, and errors handled) and accessible (with access paths to the data provided).
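As a rough sketch of how those pre-mining steps can be chained in practice (this is my own mapping onto a scikit-learn pipeline, not something prescribed by Fayyad et al. or Padhy et al.), assume the target data has already been selected and contains some missing values:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Hypothetical "selected" target data: 300 records, 5 variables, some missing.
X = rng.normal(size=(300, 5))
X[rng.random(X.shape) < 0.05] = np.nan           # simulate missing values
y = (np.nan_to_num(X[:, 0]) > 0).astype(int)

# Preprocessing -> transformation -> data mining, expressed as one pipeline.
kdd_steps = Pipeline([
    ("preprocess", SimpleImputer(strategy="mean")),   # handle missing data
    ("transform", StandardScaler()),                  # uniform representation
    ("mine", LogisticRegression()),                   # the mining algorithm
])

# Interpretation/evaluation on held-out data closes the loop.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
kdd_steps.fit(X_train, y_train)
print("Evaluation accuracy:", kdd_steps.score(X_test, y_test).round(2))
```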

Pros/Challenges

If the pre-processing stage or data preparation phase is removed, we will never be able to reduce the high dimensionality of the data sets (Fayyad et al., 1996). High dimensionality increases the size of the data and thus the processing time required, which may not be advantageous for a real-time data feed into the model derived from data mining. Also, with all this data, you run the chance that the model derived through the data mining process will pick up spurious patterns, which will not be easily generalizable or even understandable for descriptive purposes (Fayyad et al., 1996). Descriptive purposes mean data mining for the sake of understanding the data, whereas predictive purposes mean data mining for the sake of predicting the next result from a set of input variables (Fayyad et al., 1996; Padhy et al., 2012). Thus, to avoid this high-dimensionality problem, we must understand the problem, understand why we have the data we have and what data is needed, and reduce the dimensions to the bare essentials.
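One common way to reduce dimensionality during pre-processing is principal component analysis; the cited authors do not prescribe a specific technique, so the sketch below is only an assumed illustration of cutting a wide, redundant data set down to its essential dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

# Hypothetical wide data set: 1,000 records described by 50 variables,
# most of which are redundant combinations of a few underlying factors.
latent = rng.normal(size=(1000, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.1 * rng.normal(size=(1000, 50))

# Keep only enough components to explain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("Original dimensions:", X.shape[1])
print("Reduced dimensions: ", X_reduced.shape[1])
```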

Another challenge arises if we do not properly carry out the selection, data understanding, or data mining algorithm selection steps: overfitting. Fayyad et al. (1996) define selection as choosing the key data you need to feed into the model, and selecting the right data mining algorithm will influence the results. Understanding the problem will allow you to select the right data dimensions, as mentioned above, as well as the right data mining algorithm (Padhy et al., 2012). Overfitting is when a data mining algorithm not only derives the general patterns in the data but also describes the noise in it (Fayyad et al., 1996). Through the selection process, you can pick data with reduced noise to avoid an overfitting problem. Fayyad et al. (1996) also suggest solutions such as cross-validation, regularization, and other statistical analyses. Overfitting issues can further be mitigated by understanding what you are looking for before data mining, which aids the evaluation/interpretation process (Padhy et al., 2012).
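Here is a small sketch of the cross-validation and regularization remedies Fayyad et al. (1996) mention, on an assumed noisy data set (the classifier choice and the regularization strengths are my own illustration, not theirs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)

# Small, noisy data set: 60 records, 30 variables, only 2 of them informative.
X = rng.normal(size=(60, 30))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=2.0, size=60) > 0).astype(int)

# Weak regularization (large C) lets the model chase noise; stronger
# regularization (small C) keeps it closer to the general pattern.
for C in (100.0, 0.1):
    model = LogisticRegression(C=C, max_iter=1000)
    cv_acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"C = {C:>5}: cross-validated accuracy = {cv_acc:.2f}")
```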

Cons/Opportunities

Variety in big data that changes with time, while the same mined data model keeps being applied, will at some point become either outdated (no longer relevant) or invalid. This is the case in social media: if we try to read posts without focusing on one type of post, it would be hard to say that any one data pattern model derived from data mining is valid. Thus, previously defined patterns are no longer valid as data changes rapidly with respect to time (Fayyad et al., 1996). We would have to solve this by incrementally modifying, deleting, or augmenting the defined patterns in the data mining process, but because data can vary in real time, at the drop of a hat, this can be quite hard to do (Fayyad et al., 1996).
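One assumed illustration of incrementally modifying a model as the data drifts is online (incremental) learning; the cited authors do not name a specific method, so the sketch below simply uses scikit-learn's partial_fit interface to update a classifier batch by batch instead of retraining from scratch:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(13)
model = SGDClassifier()
classes = np.array([0, 1])

# Simulate a stream whose underlying pattern drifts over time; the model is
# updated incrementally on each new batch instead of being rebuilt.
for week, drift in enumerate([0.0, 0.5, 1.0, 1.5]):
    X = rng.normal(size=(200, 4)) + drift          # the data shifts each "week"
    y = (X[:, 0] - drift > 0).astype(int)          # so does the decision rule
    model.partial_fit(X, y, classes=classes)
    print(f"Week {week}: accuracy on this batch = {model.score(X, y):.2f}")
```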

Missing data and noisy data are very prevalent in meteorology; we cannot sample the entire atmosphere at every point at every time. We send up weather balloons 2-4 times a day at only a couple of points in a US state at a time, and we then try to feed that into a model for predictive purposes. As a result, we have many gaps in the data. What happens if the weather balloon is a dud and we get no data? Then we have missing data, and that is a problem with the data itself. How are we supposed to rely on a solution derived through data mining if the data is either missing or noisy? Fayyad et al. (1996) note that missing values arise from data “not designed with discovery in mind,” so we must apply statistical strategies to define what those values should be. One that meteorologists use is data interpolation. There are many types of interpolation, ranging from simple nearest-neighbor methods to complex Gaussian types.
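Below is a minimal sketch of gap-filling by interpolation, with hypothetical balloon temperature readings (the times and values are made up): nearest-neighbor and linear fills from SciPy, with the more complex Gaussian-type schemes following the same basic idea of estimating missing values from nearby observed ones.

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical sounding: temperatures (°C) reported every 3 hours, with two
# missing observations (a "dud" balloon) marked as NaN.
hours = np.array([0, 3, 6, 9, 12, 15, 18, 21])
temps = np.array([12.1, 13.4, np.nan, 17.8, 19.2, np.nan, 15.0, 13.2])

observed = ~np.isnan(temps)

# Two simple ways to fill the gaps from the observations we do have.
nearest = interp1d(hours[observed], temps[observed], kind="nearest")
linear = interp1d(hours[observed], temps[observed], kind="linear")

missing_hours = hours[~observed]
print("Nearest-neighbor fill:", nearest(missing_hours))
print("Linear fill:          ", linear(missing_hours))
```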
