Adv DBs: Unsupervised and Supervised Learning

Unsupervised and Supervised Learning:

Supervised learning is a type of machine learning that takes a given set of data points, we need to choose a function that gives users a classification or a value.  So, eventually, you will get data points that no longer defines a classification or a value, thus the machine now has to solve for that function. There are two main types of supervised learning: Classification (has a finite set, i.e. based on person’s chromosomes in a database, their biological gender is either male or female) and Regression (represents real numbers in the real space or n-dimensional real space).  In regression, you can have a 2-dimensional real space, with training data that gives you a regression formula with a Pearson’s correlation number r, given a new data point, can the machine use the regression formula with correlation r to predict where that data point will fall on in the 2-dimensional real space (Mathematicalmonk’s channel, 2011a).

Unsupervised learning aims to uncover homogenous subpopulations in databases (Connolly & Begg, 2015). In Unsupervised learning you are given data points (values, documents, strings, etc.) in n-dimensional real space, the machine will look for patterns through either clustering, density estimation, dimensional reduction, etc.  For clustering, one could take the data points and placing them in bins with common properties, sometimes unknown to the end-user due to the vast size of the data within the database.  With density estimation, the machine is fed a set of probability density functions to fit the data and it begins to estimates the density of that data set.  Finally, for dimensional reduction, the machine will find some lower dimensional space in which the data can be represented (Mathematicalmonk’s channel, 2011b).  With the dimensional reduction, it can destroy the structure that can be seen in the higher-order dimensions.

Applications suited to each method

  • Supervised: defining data transformations (Kelvin to Celsius, meters per second to miles per hour, classifying a biological male or female given the number of chromosomes, etc.), predicting weather (given the initial & boundary conditions, plug them into formulas that predict what will happen in the next time step).
  • Unsupervised: forecasting stock markets (through patterns identified in text mining news articles, or sentiment analysis), reducing demographical database data to common features that can easily describe why a certain population will fit a result over another (dimensional reduction), cloud classification dynamical weather models (weather models that use stochastic approximations, Monte Carlo simulations, or probability densities to generate cloud properties per grid point), finally real-time automated conversation translators (either spoken or closed captions).

Most important issues related to each method

Unsupervised machine learning is at the bedrock of big data analysis.  We could use training data (a set of predefined data that is representative of the real data in all its n-dimensions) to fine-tune the most unsupervised machine learning efforts to reduce error rates (Barak & Modarres, 2015). What I like most about unsupervised machine learning is its clustering and dimensional reduction capabilities, because it can quickly show me what is important about my big data set, without huge amounts of coding and testing on my end.

References:

Data Tools: WEKA

WEKA

The Java based, open sourced, and platform independent Waikato Environment for Knowledge Analysis (WEKA) tool, for data preprocessing, predictive data analytics, and facilitation interpretations and evaluation (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Miranda, n.d.; Xia & Gong, 2014).  It was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga 2015).  It is a java based machine learning algorithm for data mining tasks as well as text mining that could be used for predictive modeling, housing pre-processing, classification, regression, clustering, association rules, and visualization (WEKA, n.d). Also, WEKA contains classification, clustering, association rules, regression, and visualization capabilities, in particular, the C4.5 decision tree predictive data analytics algorithm (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011). Here WEKA is an open source data and text mining software tool, thus it is free to use. Therefore there are no costs associated with this software solution.

WEKA can be applied to big data (WEKA, n.d.) and SQL Databases (Patel & Donga, 2015). Subsequently, WEKA has been used in many research studies that are involved in big data analytics (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011; Parkavi & Sasikumar, 2016; Xia & Gong, 2014). For instance, Barak and Modarres (2015) used WEKA for decision tree analysis on predicting stock risks and returns.

The fact that it has been using in this many research studies is that the reliability and validity of the software are high and well established.  Even in a study comparing WEKA with 12 other data analytics tools, is one of two apps studied that have a classification, regression, and clustering algorithms (Gera & Goel, 2015).

A disadvantage of using this tool is its lack of supporting multi-relational data mining, but if one can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). The comprehensiveness of analysis algorithms for both data and text mining and pre-processing is its advantage. Another disadvantage of WEKA is that it cannot handle raw data directly, meaning the data had to be preprocessed before it is entered into the software package and analyzed (Hoonlor, 2011). WEKA cannot even import excel files, data in Excel have to be converted into CSV format to be usable within the system (Miranda, n.d.)

References:

  • Dogan, N., & Tanrikulu, Z. (2013). A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Information Technology and Management, 14(2), 105-124. doi:http://dx.doi.org/10.1007/s10799-012-0135-7
  • Gera, M., & Goel, S. (2015). Data Mining -Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Kumar, D., & Fet, D. (2011). Performance Analysis of Various Data Mining Algorithms: A Review. International Journal of Computer Applications, 32(6), 9–16.
  • Miranda, S. (n.d.). An Introduction to Social Analytics : Concepts and Methods.
  • Parkavi, S. & Sasikumar, S. (2016). Prediction of Commodities Market by Using Data Mining Technique. i-Manager’s Journal on Computer Science.
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).
  • WEKA (n.d.) WEKA 3: Data Mining Software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/
  • Xia, B. S., & Gong, P. (2014). Review of business intelligence through data analysis. Benchmarking, 21(2), 300–311. http://doi.org/http://dx.doi.org/10.1108/BIJ-08-2012-0051

Big Data Analytics: R

R is a powerful statistical tool that can aid in data mining.  Thus, it has huge relevance in the big data arena.  Focusing on my project, I have found that R has a text mining package [tm()].

Patal and Donga (2015) and Fayyad, Piatetsky-Shapiro, & Smyth, (1996) say that the main techniques in Data Mining are: anomaly detection (outlier/change/deviation detection), association rule learning (relationships between the variables), clustering (grouping data that are similar to another), classification (taking a known structure to new data), regressions (find a function to describe the data), and summarization (visualizations, reports, dashboards). Whereas, According to Ghosh, Roy, & Bandyopadhyay (2012), the main types of Text Mining techniques are: text categorization (assign text/documents with pre-defined categories), text-clustering (group similar text/documents together), concept mining (discovering concept/logic based ideas), Information retrieval (finding the relevant documents per the query), and information extraction (id key phrases and relationships within the text). Meanwhile, Agrawal and Batra (2013) add: summarization (compressed representation of the input), assessing document similarity (similarities between different documents), document retrieval (id and grabbing the most relevant documents), to the list of text mining techniques.

We use the “library(tm)” to aid in transforming text, stem words, build a term-document matrix, etc. mostly for preprocessing the data (RStudio pubs, n.d.). Based on RStudio pubs (n.d.) some text preprocessing steps and code are as follows:

  • To remove punctuation:

docs <- tm_map(docs, removePunctuation)

  • To remove special characters:

for(j in seq(docs))      {        docs[[j]] <- gsub(“/”, ” “, docs[[j]])        docs[[j]] <- gsub(“@”, ” “, docs[[j]])        docs[[j]] <- gsub(“\\|”, ” “, docs[[j]])     }

  • To remove numbers:

docs <- tm_map(docs, removeNumbers)

  • Convert to lowercase:

docs <- tm_map(docs, tolower)

  • Removing “stopwords”/common words

docs <- tm_map(docs, removeWords, stopwords(“english”))

  • Removing particular words

docs <- tm_map(docs, removeWords, c(“department”, “email”))

  • Combining words that should stay together

for (j in seq(docs)){docs[[j]] <- gsub(“qualitative research”, “QDA”, docs[[j]])docs[[j]] <- gsub(“qualitative studies”, “QDA”, docs[[j]])docs[[j]] <- gsub(“qualitative analysis”, “QDA”, docs[[j]])docs[[j]] <- gsub(“research methods”, “research_methods”, docs[[j]])}

  • Removing coming word endings

library(SnowballC)   docs <- tm_map(docs, stemDocument)

Text mining algorithms could consist of but are not limited to (Zhao, 2013):

  • Summarization:
    • Word clouds use “library (wordcloud)”
    • Word frequencies
  • Regressions
    • Term correlations use “library (ggplot2) use functions findAssocs”
    • Plot word frequencies Term correlations use “library (ggplot2)”
  • Classification models:
    • Decision Tree “library (party)” or “library (rpart)”
  • Association models:
    • Apriori use “library (arules)”
  • Clustering models:
    • K-mean clustering use “library (fpc)”
    • K-medoids clustering use “library(fpc)”
    • Hierarchical clustering use “library(cluster)”
    • Density-based clustering use “library (fpc)”

As we can see, there are current libraries, functions, etc. to help with data preprocessing, data mining, and data visualization when it comes to text mining with R and RStudio.

Resources: