Adv DBs: Unsupervised and Supervised Learning

Unsupervised and Supervised Learning:

Supervised learning is a type of machine learning in which, given a set of labeled data points, we choose a function that maps each point to a classification or a value. Eventually, data points arrive that carry no classification or value, and the machine must use that function to predict one. There are two main types of supervised learning: classification, where the output comes from a finite set (e.g., based on a person’s chromosomes in a database, their biological gender is either male or female), and regression, where the output is a real number or a point in n-dimensional real space. In regression, you can have a 2-dimensional real space with training data that yields a regression formula and a Pearson’s correlation coefficient r; given a new data point, the machine uses the regression formula to predict where that point will fall in the 2-dimensional real space (Mathematicalmonk’s channel, 2011a).
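As a minimal sketch of the regression case, assuming made-up training data (the names `train`, `fit`, and `new_point` are hypothetical and not from the cited source), the following R code fits a regression formula, reports Pearson’s r, and predicts the value for a new data point:

```r
# Hypothetical training data: y depends roughly linearly on x
set.seed(42)
train <- data.frame(x = 1:20)
train$y <- 2 * train$x + 1 + rnorm(20, sd = 1)

# Fit the regression function from the labeled training data
fit <- lm(y ~ x, data = train)

# Pearson's correlation r between predictor and response
r <- cor(train$x, train$y, method = "pearson")

# Predict the value for a new, unlabeled data point
new_point <- data.frame(x = 25)
prediction <- predict(fit, newdata = new_point)

print(coef(fit))
print(r)
print(prediction)
```

The same pattern applies to classification, except that the response is a label from a finite set rather than a real number.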

Unsupervised learning aims to uncover homogeneous subpopulations in databases (Connolly & Begg, 2015). In unsupervised learning, you are given data points (values, documents, strings, etc.) in n-dimensional real space, and the machine looks for patterns through clustering, density estimation, dimensional reduction, etc.  For clustering, the machine places the data points in bins with common properties, properties that are sometimes unknown to the end-user because of the vast size of the data within the database.  With density estimation, the machine is fed a set of probability density functions to fit to the data, and it estimates the density of that data set.  Finally, for dimensional reduction, the machine finds some lower-dimensional space in which the data can be represented (Mathematicalmonk’s channel, 2011b).  A drawback of dimensional reduction is that it can destroy structure that is visible in the higher-order dimensions.
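The three approaches can be sketched in base R. This is only an illustration on the built-in iris measurements with the species label dropped; it is not taken from the cited sources. Note that `density()` performs kernel density estimation, which is a different technique from fitting a supplied family of probability density functions.

```r
# Built-in iris measurements, with the species label deliberately dropped
X <- iris[, 1:4]

# Clustering: group the points into 3 bins by similarity
set.seed(1)
clusters <- kmeans(X, centers = 3, nstart = 10)

# Density estimation: kernel density estimate of one variable
dens <- density(X$Petal.Length)

# Dimensional reduction: project the 4-D data onto 2 principal components
pca <- prcomp(X, scale. = TRUE)
reduced <- pca$x[, 1:2]

print(table(clusters$cluster))
print(summary(pca)$importance[2, 1:2])  # proportion of variance kept per component
```

Whatever variance the two retained components do not capture is the structure that is lost in the reduction.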

Applications suited to each method

  • Supervised: defining data transformations (Kelvin to Celsius, meters per second to miles per hour, classifying a biological male or female given the number of chromosomes, etc.) and predicting weather (given the initial and boundary conditions, plug them into formulas that predict what will happen in the next time step).
  • Unsupervised: forecasting stock markets (through patterns identified by text mining news articles or by sentiment analysis), reducing demographic database data to common features that can easily describe why a certain population will fit one result over another (dimensional reduction), cloud classification in dynamical weather models (weather models that use stochastic approximations, Monte Carlo simulations, or probability densities to generate cloud properties per grid point), and real-time automated conversation translators (either spoken or closed captions).

Most important issues related to each method

Unsupervised machine learning is at the bedrock of big data analysis.  We could use training data (a set of predefined data that is representative of the real data in all its n dimensions) to fine-tune most unsupervised machine learning efforts and reduce error rates (Barak & Modarres, 2015). What I like most about unsupervised machine learning is its clustering and dimensional reduction capabilities, because it can quickly show me what is important about my big data set without huge amounts of coding and testing on my end.

References:

Adv Topics: Security Issues associated with Big Data

The scientific method helps give a framework for the data analytics lifecycle (Dietrich, 2013). Per Khan et al. (2014), the entire data lifecycle consists of the following eight stages:

  • Raw big data
  • Collection, cleaning, and integration of big data
  • Filtering and classification of data usually by some filtering criteria
  • Data analysis which includes tool selection, techniques, technology, and visualization
  • Storing data with consideration of CAP theory
  • Sharing and publishing data, while understanding ethical and legal requirements
  • Security and governance
  • Retrieval, reuse, and discovery to help in making data-driven decisions

Prajapati (2013) stated that the entire data lifecycle consists of the following five steps:

  • Identifying the problem
  • Designing data requirements
  • Pre-processing data
  • Data analysis
  • Data visualizing

It should be noted that Prajapati includes steps that first ask what, when, who, where, why, and how with regard to the problem to be solved; it doesn’t just dive into getting data. Combining the Prajapati (2013) and Khan et al. (2014) data lifecycles provides a better data lifecycle. However, there are two items to point out about the Khan et al. lifecycle: (a) the security phase is an abstract phase, and (b) security considerations are involved in the stages of storing data; sharing and publishing data; and retrieval, reuse, and discovery.

Over time the threat landscape has gotten worse, and thus big data security is a major issue. Khan et al. (2014) describe four aspects of data security: (a) privacy, (b) integrity, (c) availability, and (d) confidentiality. Minelli, Chambers, and Dhiraj (2013) stated that a challenge in data security is understanding who owns and has authority over the data and the data’s attributes: the generator of that data, or the organization collecting, processing, and analyzing the data. Carter, Farmer, and Siegel (2014) stated that access to data is important, because if competitors and substitutes to the service or product have access to the same data, then what advantage does that data provide the company? Richard and King (2014) argue that a binary notion of data privacy does not exist.  Data is never completely private/confidential nor completely divulged; it lies in between these two extremes.  Privacy laws should focus on the flow of personal information, with an emphasis on a type of privacy called confidentiality, where data is agreed to flow to a certain individual or group of individuals (Richard & King, 2014).

Carter et al. (2014) focused on data access, where access management makes data available to certain individuals, whereas Minelli et al. (2013) focused on data ownership. Richard and King (2014), however, were able to tie those two concepts into data privacy. Thus, these data security aspects are interrelated, and data ownership, availability, and privacy impact all stages of the lifecycle. The root causes of the security issues in big data are the use of dated techniques that are best practices but don’t lead to zero-day vulnerability action plans: a focus on prevention, a focus on perimeter access, and a focus on signatures (RSA, 2013). Specifically, certain attacks, like denial of service attacks, are a threat to and a root cause of data availability issues (Khan et al., 2014). Also, RSA (2013) stated that a sample of 257 security officials felt that the major challenges to security were the lack of staffing, large numbers of false positives that create too much noise, lack of security analysis skills, etc. Subsequently, data privacy issues arise from balancing compensation risks, maintaining privacy, and maintaining ownership of the data, similar to a cost-benefit analysis problem (Khan et al., 2014).

One way to solve security concerns when dealing with big data access, privacy, and ownership is to place a single entry point gateway between the data warehouse and the end-users (The Carology, 2013). The single entry point gateway is essentially middleware, which helps ensure data privacy and confidentiality by acting on behalf of an individual (Minelli et al., 2013). Therefore, this gateway should aid in threat detection, assist in recognizing too many requests to the data (which can cause a denial of service attack), provide an audit trail, and not require changes to the data warehouse (The Carology, 2013). Thus, the use of middleware can address data access, privacy, and ownership issues. RSA (2013) proposed using data analytics to solve security issues by automating detection and responses, which will be covered in detail in another post.

Resources:

  • Carter, K. B., Farmer, D., & Siegel, C. (2014). Actionable Intelligence: A Guide to Delivering Business Results with Big Data Fast! John Wiley & Sons P&T. VitalBook file.
  • Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Ali, W. K. M., Alam, M., Shiraz, M., & Gani, A. (2014). Big data: Survey, technologies, opportunities, and challenges. The Scientific World Journal, 2014. Retrieved from http://www.hindawi.com/journals/tswj/2014/712826/
  • Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.

Data Tools: WEKA

WEKA

The Waikato Environment for Knowledge Analysis (WEKA) is a Java-based, open-source, and platform-independent tool for data preprocessing, predictive data analytics, and facilitating interpretation and evaluation (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Miranda, n.d.; Xia & Gong, 2014).  It was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga, 2015).  It is a Java-based collection of machine learning algorithms for data mining tasks, as well as text mining, that can be used for predictive modeling, housing pre-processing, classification, regression, clustering, association rules, and visualization (WEKA, n.d.). In particular, WEKA includes the C4.5 decision tree predictive data analytics algorithm (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011). Because WEKA is an open-source data and text mining software tool, it is free to use, and there are no costs associated with this software solution.

WEKA can be applied to big data (WEKA, n.d.) and SQL Databases (Patel & Donga, 2015). Subsequently, WEKA has been used in many research studies that are involved in big data analytics (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011; Parkavi & Sasikumar, 2016; Xia & Gong, 2014). For instance, Barak and Modarres (2015) used WEKA for decision tree analysis on predicting stock risks and returns.

The fact that it has been used in this many research studies suggests that the reliability and validity of the software are high and well established.  Even in a study comparing WEKA with 12 other data analytics tools, it is one of only two apps studied that have classification, regression, and clustering algorithms (Gera & Goel, 2015).

A disadvantage of using this tool is its lack of support for multi-relational data mining, but if one can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). Its advantage is the comprehensiveness of its analysis algorithms for data mining, text mining, and pre-processing. Another disadvantage of WEKA is that it cannot handle raw data directly, meaning the data has to be preprocessed before it is entered into the software package and analyzed (Hoonlor, 2011). WEKA cannot even import Excel files; data in Excel has to be converted into CSV format to be usable within the system (Miranda, n.d.).
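As a small illustration of that conversion step, a table already loaded into R can be written out as CSV for import elsewhere. This is only a sketch: the built-in iris data and the file name `iris_for_weka.csv` are assumptions chosen for the example, not part of the cited sources.

```r
# Hypothetical example: write a data frame to CSV so it can be loaded into WEKA
csv_path <- file.path(tempdir(), "iris_for_weka.csv")
write.csv(iris, file = csv_path, row.names = FALSE)

# Read the file back to confirm it is well formed
check <- read.csv(csv_path, stringsAsFactors = TRUE)
print(dim(check))
```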

References:

  • Dogan, N., & Tanrikulu, Z. (2013). A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Information Technology and Management, 14(2), 105-124. doi:http://dx.doi.org/10.1007/s10799-012-0135-7
  • Gera, M., & Goel, S. (2015). Data Mining -Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Kumar, D., & Fet, D. (2011). Performance Analysis of Various Data Mining Algorithms: A Review. International Journal of Computer Applications, 32(6), 9–16.
  • Miranda, S. (n.d.). An Introduction to Social Analytics : Concepts and Methods.
  • Parkavi, S. & Sasikumar, S. (2016). Prediction of Commodities Market by Using Data Mining Technique. i-Manager’s Journal on Computer Science.
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).
  • WEKA (n.d.) WEKA 3: Data Mining Software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/
  • Xia, B. S., & Gong, P. (2014). Review of business intelligence through data analysis. Benchmarking, 21(2), 300–311. http://dx.doi.org/10.1108/BIJ-08-2012-0051

Adv Quant: Decision Trees in R

Classification, Regression, and Conditional Tree Growth Algorithms

The variables used for tree growth algorithms are the log of benign prostatic hyperplasia amount (lbph), log of prostate-specific antigen (lpsa), Gleason score (gleason), log of capsular penetration (lcp) and log of the cancer volume (lcavol) to understand and predict tumor spread (seminal vesicle invasion=svi).

Results

Figure 1: Visualization of cross-validation results, for the classification tree (left) and regression tree (right).

Figure 2: Classification tree (left), regression tree (center), and conditional tree (right).

Figure 3: Summarization of tree data: (a) classification tree, (b) regression tree, and (c) conditional tree.

Discussion

For the classification tree growth algorithm, the tree predicts seminal vesicle invasion, which helps show the tumor spread in this dataset. The cross-validation results show that there is only one split in the tree, with a cross-validated relative error (xerror) of 0.71429 for the first split (Figure 1 & Figure 3a) and a cross-validation standard deviation (xstd) of 0.16957 (Figure 3a).  The variable used to split the tree was the log of capsular penetration (Figure 2), with the split at a log of capsular penetration of <1.791.

Next, for the regression tree growth algorithm, there are three leaf nodes, because the algorithm split the data twice.  In this case, the relative error for the first split is 1.00931 with a standard deviation of 0.18969, and at the second split the relative error is 0.69007 with a standard deviation of 0.15773 (Figure 1 & Figure 3b).  The tree was split first at the log of capsular penetration at <1.791, and then at the log of prostate-specific antigen at <2.993 (Figure 2).  It is interesting that the first split occurred at the same value for these two different tree growth algorithms, but that the relative errors and standard deviations were different and that the regression tree created one more level.

Finally, the conditional tree growth algorithm produced a split at <1.749 of the log of capsular penetration at the 0.001 significance level and at <2.973 of the log of prostate-specific antigen, also at the 0.001 significance level (Figure 2 & Figure 3c).  The results are similar to the regression tree, with the same number of leaf nodes and similar values at which they are split, but more information is gained from the conditional tree growth algorithm than from the classification and regression tree growth algorithms.
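To show where the split counts, relative errors, and standard deviations discussed above come from, here is a minimal sketch that reads them from an rpart complexity table. It assumes the kyphosis example dataset bundled with rpart rather than the prostate data, so the numbers will differ from those in the figures.

```r
library(rpart)  # recommended package shipped with standard R installations

# kyphosis is a small example dataset bundled with rpart (not the prostate data)
set.seed(123)
tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Each row of the complexity table corresponds to a tree size:
# nsplit = number of splits, xerror = cross-validated relative error,
# xstd = standard error of xerror
cp_table <- tree$cptable
print(cp_table)

# A binary tree always has one more leaf node than it has splits
leaves <- sum(tree$frame$var == "<leaf>")
splits <- nrow(tree$frame) - leaves
print(c(splits = splits, leaves = leaves))
```

The last check is the reason a tree with three leaf nodes corresponds to two splits.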

Code

### ----------------------------------------------------------------------
## Use the prostate cancer dataset available in R, in which biopsy results are given for 97 men.
## Goal:  Predict tumor spread in this dataset of 97 men who had undergone a biopsy.
## The measures to be used for prediction are BPH=lbph, PSA=lpsa, Gleason Score=gleason, CP=lcp,
## and cancer volume=lcavol.
### ----------------------------------------------------------------------

install.packages("lasso2")
library(lasso2)
data("Prostate")

install.packages("rpart")
library(rpart)

## Grow a classification tree
classification = rpart(svi~lbph+lpsa+gleason+lcp+lcavol, data=Prostate, method="class")
printcp(classification) # display the results
plotcp(classification)  # visualize the cross-validation results
plot(classification, uniform = TRUE, main="Classification Tree for prostate cancer") # plot tree
text(classification, use.n = TRUE, all = TRUE, cex=.8)                               # create text on the tree

## Grow a regression tree
Regression = rpart(svi~lbph+lpsa+gleason+lcp+lcavol, data=Prostate, method="anova")
printcp(Regression) # display the results
plotcp(Regression)  # visualize the cross-validation results
plot(Regression, uniform = TRUE, main="Regression Tree for prostate cancer") # plot tree
text(Regression, use.n = TRUE, all = TRUE, cex=.8)                           # create text on the tree

install.packages("party")
library(party)

## Grow a conditional inference tree
conditional = ctree(svi~lbph+lpsa+gleason+lcp+lcavol, data=Prostate)
conditional # display the results
plot(conditional, main="Conditional inference tree for prostate cancer")

References

Adv Quant: K-means classification in R

The explanatory variables in the logistic regression are both the type of loan and the borrowing amount.

Figure 1: The summary output of the logistic regression based on the type of loan and the borrowing amount.

The logistic regression shows statistical significance at the 0.01 level for the variable amount and for the loan types used car and radio/television (Figure 1).  Thus, the regression equation comes out to be:

logit(default) = -0.9321 + 0.0001330(amount) - 1.56(purpose is for used car) - 0.6499(purpose is for radio/television)
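Because a logistic regression models the log-odds of the outcome, the equation has to be passed through the inverse logit to obtain a probability of default. The sketch below does this with the coefficients quoted above. The helper `default_prob` and the loan amount of 5000 are hypothetical, and the purpose levels that were not significant are left out, as they are in the equation.

```r
# Coefficients copied from the fitted equation in the text
b0       <- -0.9321
b_amount <-  0.0001330
b_used   <- -1.56
b_radio  <- -0.6499

# Hypothetical helper: predicted probability of default
default_prob <- function(amount, usedcar = 0, radio_tv = 0) {
  log_odds <- b0 + b_amount * amount + b_used * usedcar + b_radio * radio_tv
  plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
}

# Illustrative loan amount (assumed, not taken from the data)
print(default_prob(5000))               # baseline purpose
print(default_prob(5000, usedcar = 1))  # used car loan
```

The negative coefficients mean that, at the same amount, a used car or radio/television loan has a lower predicted probability of default than the baseline purpose.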

Figure 2: The comparative output of the logistic regression prediction versus actual results.

When comparing the predictions to the actual values (Figure 2), the mean and minimum values of the two are similar.  However, all other summary values are not. When the predicted values are rounded to the nearest whole number, the rate of correct predictions is 73%.

K-means classification, on the 3 continuous variables: duration, amount, and installment.

In K-means classification, the data is clustered by assigning each point to the cluster whose mean (center) is nearest in Euclidean distance (Ahlemeyer-Stubbe & Coleman, 2014).  In this exercise, there are two clusters. The cluster sizes are 825 (no default) and 175 (default), and the ratio of the between-cluster sum of squares to the total sum of squares is 69.78%.  The matrix of cluster centers is shown below (Figure 3).

Figure 3: K means center values, per variable
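To make the cluster sizes, centers, and between/total ratio concrete, here is a minimal sketch on synthetic data. The data frame `toy` is a made-up stand-in for duration, amount, and installment, not the German credit data, so its output will not match Figure 3.

```r
# Synthetic stand-in for duration, amount, and installment (not the credit data)
set.seed(7)
toy <- data.frame(
  duration    = c(rnorm(50, 12, 3),     rnorm(50, 48, 6)),
  amount      = c(rnorm(50, 2000, 400), rnorm(50, 9000, 1500)),
  installment = sample(1:4, 100, replace = TRUE)
)

# Two clusters; nstart reruns the algorithm from several random starts
km <- kmeans(toy, centers = 2, nstart = 10)

print(km$size)                    # points per cluster
print(km$centers)                 # mean of each variable per cluster
ratio <- km$betweenss / km$totss  # between-cluster share of the total sum of squares
print(ratio)

# amount has the largest scale, so it dominates the Euclidean distance;
# standardizing first gives each variable comparable weight
km_scaled <- kmeans(scale(toy), centers = 2, nstart = 10)
print(km_scaled$betweenss / km_scaled$totss)
```

Cluster labels from k-means are arbitrary, so mapping a cluster to an outcome such as default requires comparing the cluster assignments against the known outcome.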

Cross-validation with k = 5 for the nearest neighbor.

K-nearest neighbor (K = 5) classifies a data point into a group by having its 5 nearest neighbors vote on that data point, and it is particularly useful if the data is binary or categorical (Berson, Smith, & Thearling, 1999).  In this exercise, the percentage of correct classifications, comparing the predicted classes against the actual classes of the test data, is 69%.  However, logistic regression in this scenario produced a higher prediction rate of 73%; thus, for this exercise and this data set, logistic regression was more useful in predicting the default rate than the k-nearest neighbor algorithm at k=5.
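The voting idea can be sketched with leave-one-out cross-validation, which `knn.cv` in the class package provides. This is a different validation scheme from a 900/100 hold-out split, and the two-class data below is synthetic and purely illustrative.

```r
library(class)  # recommended package shipped with standard R installations

# Synthetic two-class data (hypothetical, for illustration only)
set.seed(11)
n <- 100
features <- rbind(
  matrix(rnorm(n, mean = 0), ncol = 2),  # 50 points around (0, 0)
  matrix(rnorm(n, mean = 3), ncol = 2)   # 50 points around (3, 3)
)
labels <- factor(rep(c("no_default", "default"), each = n / 2))

# Leave-one-out cross-validation: each point is classified by the
# majority vote of its 5 nearest neighbors among the remaining points
predicted <- knn.cv(train = features, cl = labels, k = 5)

percent_correct <- 100 * mean(predicted == labels)
print(table(actual = labels, predicted = predicted))
print(percent_correct)
```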

Code

## The German credit data contains attributes and outcomes on 1,000 loan applications.
## Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data
## Metadata file: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc

## Reading the data from source and displaying the first six entries.
credits=read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", header = F, sep = " ")
head(credits)

### ----------------------------------------------------------------------
## The two outcomes are success (defaulting on the loan) and failure (not defaulting).
## The explanatory variables in the logistic regression are both the type of loan and the borrowing amount.
### ----------------------------------------------------------------------

## Defining and re-leveling the variables (Taddy, n.d.)
default = credits$V21 - 1 # set default true when = 2
amount = credits$V5
purpose = factor(credits$V4, levels = c("A40","A41","A42","A43","A44","A45","A46","A48","A49","A410"))
levels(purpose) = c("newcar", "usedcar", "furniture/equip", "radio/TV", "apps", "repairs", "edu", "retraining", "biz", "other")

## Create a new data frame called "cred" with the 3 defined variables (Taddy, n.d.)
credits$default = default
credits$amount  = amount
credits$purpose = purpose
cred = credits[,c("default","amount","purpose")]
head(cred)
summary(cred)

## Create a design matrix, such that factor variables are turned into indicator variables
Xcred = model.matrix(default~., data=cred)[,-1]
Xcred[1:5,]

## Creating training and prediction datasets: Select 900 rows for estimation and 100 for testing
set.seed(1)
train = sample(1:1000,900)

## Defining which x and y values in the design matrix will be for training and for testing
xtrain = Xcred[train,]
xtest = Xcred[-train,]
ytrain = cred$default[train]
ytest = cred$default[-train]

## Logistic regression
datas=data.frame(default=ytrain,xtrain)
creditglm=glm(default~., family=binomial, data=datas)
summary(creditglm)

## Predicting default from the test data (Alice, 2015; UCLA: Statistical Consulting Group., 2007)
testdata=data.frame(default=ytest,xtest)
testdata[1:5,]
testingdata=testdata[,2:11] # removing the variable default from the data matrix
testingdata$defaultPrediction = predict(creditglm, newdata=testdata, type = "response")
results = data.frame(ytest,testingdata$defaultPrediction)
summary(results)
head(results,10)

## Percentage of correct predictions (must run after the predictions are created)
percentOfCorrect=100*(sum(ytest==round(testingdata$defaultPrediction))/100)
percentOfCorrect

### ----------------------------------------------------------------------
##  K-means classification, on the 3 continuous variables: duration, amount, and installment.
### ----------------------------------------------------------------------

install.packages("class")
library(class)

## Defining and re-leveling the variables (Taddy, n.d.)
default = credits$V21 - 1 # set default true when = 2
duration = credits$V2
amount = credits$V5
installment = credits$V8

## Create a new data frame called "creds" with the 4 defined variables (Taddy, n.d.)
credits$default = default
credits$amount  = amount
credits$installment = installment
credits$duration = duration
creds = credits[,c("duration","amount","installment","default")]
head(creds)
summary(creds)

## K means classification (R, n.d.a)
kmeansclass= cbind(creds$default,creds$duration,creds$amount,creds$installment)
kmeansresult= kmeans(kmeansclass,2)
kmeansresult$cluster                      # cluster assignment per row
kmeansresult$size                         # number of rows per cluster
kmeansresult$centers                      # cluster centers per variable
kmeansresult$betweenss/kmeansresult$totss # between-cluster / total sum of squares

### ----------------------------------------------------------------------
##  Cross-validation with k = 5 for the nearest neighbor.
### ----------------------------------------------------------------------

## Create a design matrix, such that factor variables are turned into indicator variables
Xcreds = model.matrix(default~., data=creds)[,-1]
Xcreds[1:5,]

## Creating training and prediction datasets: Select 900 rows for estimation and 100 for testing
set.seed(1)
train = sample(1:1000,900)

## Defining which x and y values in the design matrix will be for training and for testing
xtrain = Xcreds[train,]
xtest = Xcreds[-train,]
ytrain = creds$default[train]
ytest = creds$default[-train]

## K-nearest neighbor classification (R, n.d.b); only column 2 (amount) is used as the predictor
nearestFive=knn(train = xtrain[,2,drop=F],test=xtest[,2,drop=F],cl=ytrain,k=5)
knnresults=cbind(ytest+1,nearestFive) # 1 is added to ytest because cbind converts the factor nearestFive to its integer codes (1 and 2)
percentOfCorrect=100*(sum(ytest==nearestFive)/100)
percentOfCorrect

References

Business Intelligence: Data Mining

Data mining is just a subset of the knowledge discovery process (or concept flow of Business Intelligence), where data mining provides the algorithms/math that aid in developing actionable data-driven results (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). It should be noted that success has as much to do with the events that lead to the main event as it does with the main event.  To incorporate data mining processes into Business Intelligence, one must understand the business task/question behind the problem, properly process all the required data, analyze the data, evaluate and validate the data while analyzing the data, apply the results, and finally learn from the experience (Ahlemeyer-Stubbe & Coleman, 2014). Connolly and Begg (2014) stated that there are four operations of data mining: predictive modeling, database segmentation, link analysis, and deviation detection.  Fayyad et al. (1996) classify data mining operations by their outcomes: predictive and descriptive.

It is crucial to understand the business task/question behind the problem you are trying to solve.  The reason is that some types of business applications are associated with particular operations; for example, marketing strategies use database segmentation (Connolly & Begg, 2014).  However, any of the data mining operations can be implemented for any business application, and many business applications can use multiple operations.  For example, customer profiling can use database segmentation first and then predictive modeling (Connolly & Begg, 2014). Thinking outside of the box about which combination of operations and algorithms to use, rather than reusing previously used operations and algorithms, could generate even better results for meeting the business objectives (Minelli, Chambers, & Dhiraj, 2013).
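As a rough sketch of chaining two operations, the code below segments made-up customer data with k-means and then uses the segment as the explanatory variable in a logistic regression. The data frame `customers`, its columns, and the response rates are all hypothetical and serve only to illustrate the customer profiling idea.

```r
# Hypothetical customer data (names and values are made up for illustration)
set.seed(3)
customers <- data.frame(
  spend  = c(rnorm(60, 200, 30), rnorm(60, 800, 80)),
  visits = c(rnorm(60, 2, 0.5),  rnorm(60, 10, 2))
)
# Made-up outcome: high spenders are more likely to respond to an offer
customers$responded <- rbinom(120, 1, ifelse(customers$spend > 500, 0.8, 0.2))

# Step 1: database segmentation (descriptive) with k-means on standardized data
seg <- kmeans(scale(customers[, c("spend", "visits")]), centers = 2, nstart = 10)
customers$segment <- factor(seg$cluster)

# Step 2: predictive modeling, using the segment as an explanatory variable
model <- glm(responded ~ segment, family = binomial, data = customers)

# Predicted response probability for each segment
seg_levels <- data.frame(segment = factor(levels(customers$segment)))
seg_levels$p_respond <- predict(model, newdata = seg_levels, type = "response")
print(seg_levels)
```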

A consolidated list (Ahlemeyer-Stubbe & Coleman, 2014; Berson, Smith, & Thearling, 1999; Connolly & Begg, 2014; Fayyad et al., 1996) of the different types of data mining operations, algorithms, and purposes is given below.

  • Prediction – “What could happen?”
    • Classification – data is classified into different predefined classes
      • C4.5
      • Chi-Square Automatic Interaction Detection (CHAID)
      • Support Vector Machines
      • Decision Trees
      • Neural Networks (also called Neural Nets)
      • Naïve Bayes
      • Classification and Regression Trees (CART)
      • Bayesian Network
      • Rough Set Theory
      • AdaBoost
    • Regression (Value Prediction) – data is mapped to a prediction formula
      • Linear Regression
      • Logistic Regression
      • Nonlinear Regression
      • Multiple linear regression
      • Discriminant Analysis
      • Log-Linear Regression
      • Poisson Regression
    • Anomaly Detection (Deviation Detection) – identifies significant changes in the data
      • Statistics (outliers)
  • Descriptive – “What has happened?”
    • Clustering (database segmentation) – identifies a set of categories to describe the data
      • Nearest Neighbor
      • K-Nearest Neighbor
      • Expectation-Maximization (EM)
      • K-means
      • Principal Component Analysis
      • Kolmogorov-Smirnov Test
      • Kohonen Networks
      • Self-Organizing Maps
      • Quartile Range Test
      • Polar Ordination
      • Hierarchical Analysis
    • Association Rule Learning (Link Analysis) – builds a model that describes the data dependencies
      • Apriori
      • Sequential Pattern Analysis
      • Similar Time Sequence
      • PageRank
    • Summarization – smaller description of the data
      • Basic probability
      • Histograms
      • Summary Statistics (max, min, mean, median, mode, variance, ANOVA)
  • Prescriptive – “What should we do?” (an extension of predictive analytics)
    • Optimization
      • Decision Analysis
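One entry from the list above, anomaly detection through outlier statistics, can be sketched in base R. The vector `daily_sales` is made up for illustration, and the 1.5 × IQR rule applied by `boxplot.stats` is only one of several common outlier definitions.

```r
# Hypothetical measurements with one obviously unusual value
daily_sales <- c(10, 12, 11, 13, 12, 11, 95)

# boxplot.stats flags points beyond 1.5 times the interquartile range from the hinges
outliers <- boxplot.stats(daily_sales)$out
print(outliers)

# Summary statistics (the summarization operation) for the same data
print(summary(daily_sales))
```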

Finally, Ahlemeyer-Stubbe and Coleman (2014) stated that even though many versatile data mining software packages are available that can perform any of the abovementioned operations and algorithms, good data mining software should be deployable across different environments and include tools for data preparation and transformation.

References

Big Data Analytics: R

R is a powerful statistical tool that can aid in data mining.  Thus, it has huge relevance in the big data arena.  Focusing on my project, I have found that R has a text mining package (tm).

Patel and Donga (2015) and Fayyad, Piatetsky-Shapiro, and Smyth (1996) say that the main techniques in data mining are: anomaly detection (outlier/change/deviation detection), association rule learning (relationships between the variables), clustering (grouping data that are similar to one another), classification (applying a known structure to new data), regression (finding a function to describe the data), and summarization (visualizations, reports, dashboards). According to Ghosh, Roy, and Bandyopadhyay (2012), the main types of text mining techniques are: text categorization (assigning text/documents to pre-defined categories), text clustering (grouping similar text/documents together), concept mining (discovering concept/logic-based ideas), information retrieval (finding the relevant documents per the query), and information extraction (identifying key phrases and relationships within the text). Meanwhile, Agrawal and Batra (2013) add summarization (a compressed representation of the input), assessing document similarity (similarities between different documents), and document retrieval (identifying and grabbing the most relevant documents) to the list of text mining techniques.

We use the tm package, loaded with “library(tm)”, to aid in transforming text, stemming words, building a term-document matrix, etc., mostly for preprocessing the data (RStudio pubs, n.d.). Based on RStudio pubs (n.d.), some text preprocessing steps and code are as follows:

  • To remove punctuation:

docs <- tm_map(docs, removePunctuation)

  • To remove special characters:

for (j in seq(docs)) {
  docs[[j]] <- gsub("/", " ", docs[[j]])
  docs[[j]] <- gsub("@", " ", docs[[j]])
  docs[[j]] <- gsub("\\|", " ", docs[[j]])
}

  • To remove numbers:

docs <- tm_map(docs, removeNumbers)

  • Convert to lowercase (recent versions of tm expect base R functions to be wrapped in content_transformer):

docs <- tm_map(docs, content_transformer(tolower))

  • Removing “stopwords”/common words

docs <- tm_map(docs, removeWords, stopwords("english"))

  • Removing particular words

docs <- tm_map(docs, removeWords, c("department", "email"))

  • Combining words that should stay together

for (j in seq(docs)) {
  docs[[j]] <- gsub("qualitative research", "QDA", docs[[j]])
  docs[[j]] <- gsub("qualitative studies", "QDA", docs[[j]])
  docs[[j]] <- gsub("qualitative analysis", "QDA", docs[[j]])
  docs[[j]] <- gsub("research methods", "research_methods", docs[[j]])
}

  • Removing common word endings

library(SnowballC)
docs <- tm_map(docs, stemDocument)
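For readers without the tm package installed, most of the steps above can be approximated in base R. The following is only a sketch under that assumption: the two-document corpus is made up, stop word removal and stemming are omitted because they need a word list or stemmer, and the result is a simple term-by-document count table rather than a tm term-document matrix.

```r
# Hypothetical mini-corpus (not from the source)
docs <- c(
  "Qualitative research methods: email the department @ 9 today!",
  "Research methods in 2015 | qualitative analysis of texts"
)

clean <- tolower(docs)                                              # lowercase
clean <- gsub("[/@|]", " ", clean)                                  # special characters
clean <- gsub("[[:punct:]]", " ", clean)                            # punctuation
clean <- gsub("[[:digit:]]+", " ", clean)                           # numbers
clean <- gsub("\\b(department|email)\\b", " ", clean, perl = TRUE)  # particular words
clean <- gsub("\\s+", " ", trimws(clean))                           # collapse whitespace

# Tokenize and count term frequencies per document
tokens <- strsplit(clean, " ", fixed = TRUE)
term_freq <- table(
  term = unlist(tokens),
  doc  = rep(seq_along(tokens), lengths(tokens))
)
print(clean)
print(term_freq)
```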

Text mining algorithms could consist of but are not limited to (Zhao, 2013):

  • Summarization:
    • Word clouds use “library (wordcloud)”
    • Word frequencies
  • Regressions
    • Term correlations use the function findAssocs from “library (tm)”, plotted with “library (ggplot2)”
    • Plot word frequencies use “library (ggplot2)”
  • Classification models:
    • Decision Tree “library (party)” or “library (rpart)”
  • Association models:
    • Apriori use “library (arules)”
  • Clustering models:
    • K-means clustering use “library (fpc)”
    • K-medoids clustering use “library(fpc)”
    • Hierarchical clustering use “library(cluster)”
    • Density-based clustering use “library (fpc)”
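As a minimal sketch of the clustering models in this list, the code below groups documents by their term counts. It uses the base R functions `hclust` and `kmeans` instead of the fpc and cluster libraries named above, and the term-document counts are made up for illustration.

```r
# Hypothetical term counts (rows = documents, columns = terms)
tdm <- matrix(
  c(3, 2, 0, 0,
    2, 3, 0, 1,
    0, 0, 4, 2,
    0, 1, 3, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("loan", "credit", "cloud", "weather"))
)

# Hierarchical clustering on Euclidean distances between documents
hc <- hclust(dist(tdm), method = "ward.D2")
groups <- cutree(hc, k = 2)
print(groups)

# K-means on the same matrix, for comparison
set.seed(5)
km <- kmeans(tdm, centers = 2, nstart = 10)
print(km$cluster)
```

Both methods should place the two finance-heavy documents in one group and the two weather-heavy documents in the other.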

As we can see, there are current libraries, functions, etc. to help with data preprocessing, data mining, and data visualization when it comes to text mining with R and RStudio.

Resources: