Using R and Spark for health care

Case study on the use of R in the healthcare field, by Pereira and Noronha (2016):

R and RStudio have been used to analyze patient health and disease records stored in Electronic Medical Records (EMR) for fraud detection.  Anomaly detection proceeded in three steps.  First, mapping code filtered the data based on geo-location.  Second, reducer code aggregated the data based on the extreme values of cost claims per disease and calculated the difference between them.  Finally, a third piece of code analyzed the data that met a 60% cost fraud threshold.  The authors found that as the geo-location resolution increased, the number of anomalies detected increased.
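The three steps above can be sketched in base R.  This is only an illustration under stated assumptions, not Pereira and Noronha's code: the toy `claims` data frame, its column names, the helper `summarise_region`, and the reading of the 60% threshold as "the gap between the highest and lowest claim exceeds 60% of the highest claim" are all hypothetical.

```r
# Hypothetical toy claims data; column names are assumptions for illustration.
claims <- data.frame(
  region  = c("north", "north", "north", "south", "south", "south"),
  disease = c("flu",   "flu",   "flu",   "flu",   "flu",   "flu"),
  cost    = c(100,     120,     400,     90,      100,     110),
  stringsAsFactors = FALSE
)

# Step 1 ("map"): filter the records by geo-location.
by_region <- split(claims, claims$region)

# Step 2 ("reduce"): per disease, find the extreme cost claims and their difference.
summarise_region <- function(df) {
  lo <- tapply(df$cost, df$disease, min)
  hi <- tapply(df$cost, df$disease, max)
  data.frame(region = df$region[1], disease = names(hi),
             min_cost = as.numeric(lo), max_cost = as.numeric(hi),
             difference = as.numeric(hi - lo))
}
extremes <- do.call(rbind, lapply(by_region, summarise_region))

# Step 3: flag groups whose relative gap exceeds the assumed 60% threshold.
extremes$relative_gap <- extremes$difference / extremes$max_cost
flagged <- extremes[extremes$relative_gap > 0.60, ]
print(flagged)
```

With this toy data only the "north" group is flagged, because its gap of 300 is 75% of its largest claim.  Splitting on a finer `region` column would create more, smaller groups to test, which is one way to picture the resolution effect reported above.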

R and RStudio have also been used to apply big data analytics to predict diabetes, based on symptoms, from the Health Information System (HIS), which houses patient information.  To predict diabetes, the authors used a classification algorithm (a decision tree) with a 70%-30% training-test dataset split, and then plotted the false positive rate against the true positive rate.  This plot showed skill in predicting diabetes.
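A minimal sketch of that workflow follows, assuming the `rpart` package (a recommended package usually bundled with R) is available.  The simulated `patients` data and its `glucose` and `bmi` columns are hypothetical stand-ins, not the HIS data or the variables used by the authors.

```r
library(rpart)  # recommended package, usually installed alongside R

set.seed(42)
# Simulated stand-in for patient records; variables are hypothetical.
n <- 500
patients <- data.frame(
  glucose = rnorm(n, mean = 120, sd = 30),
  bmi     = rnorm(n, mean = 28,  sd = 5)
)
risk <- plogis((patients$glucose - 120) / 15 + (patients$bmi - 28) / 5)
patients$diabetes <- factor(ifelse(runif(n) < risk, "yes", "no"))

# 70%-30% training-test split.
train_rows <- sample(n, size = round(0.7 * n))
train <- patients[train_rows, ]
test  <- patients[-train_rows, ]

# Classification (decision) tree fitted on the training data only.
tree <- rpart(diabetes ~ glucose + bmi, data = train, method = "class")

# Predicted probability of "yes" on the held-out 30%.
prob_yes <- predict(tree, newdata = test, type = "prob")[, "yes"]

# ROC points: sweep a threshold and record false/true positive rates.
thresholds <- sort(unique(c(0, prob_yes, 1)), decreasing = TRUE)
actual_yes <- test$diabetes == "yes"
tpr <- sapply(thresholds, function(t) mean(prob_yes[actual_yes] >= t))
fpr <- sapply(thresholds, function(t) mean(prob_yes[!actual_yes] >= t))

plot(fpr, tpr, type = "b", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC curve (simulated data)")
abline(0, 1, lty = 2)  # a classifier with no skill follows this diagonal
```

A curve that bows above the dashed diagonal is what "showing skill" means here: at a given false positive rate, the tree finds more true positives than random guessing would.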

Case study on the use of Spark in the healthcare field, by Pita et al. (2015):

Data quality in healthcare is poor, particularly for data from the Brazilian Public Health System.  Spark was used in data processing to improve quality through deterministic and probabilistic record linking across multiple databases.  Record linking is a technique that uses common attributes across multiple databases to identify a 1-to-1 match.  Spark workflows were created to do record linking by (1) analyzing all the data in each database to find common attributes with high probabilities of linkage; (2) pre-processing the data so that it is transformed, anonymized, and cleaned into a single format, allowing all the attributes to be compared to each other for a 1-to-1 match; (3) record linking based on deterministic and probabilistic algorithms; and (4) statistical analysis to evaluate the accuracy.  Over 397M comparisons were made in 12 hours.  The authors concluded that accuracy depends on the size of the data: the bigger the data, the more accurate the record linking.
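The difference between deterministic and probabilistic linking in step (3) can be sketched in base R.  This is not the Spark workflow of Pita et al.; the two toy databases, the edit-distance similarity, the equal weights, and the 0.9 cutoff are all assumptions chosen only to make the idea concrete.

```r
# Hypothetical records from two databases; all names and columns are illustrative.
db_a <- data.frame(id_a = 1:3,
                   name = c("maria silva", "joao souza", "ana costa"),
                   birth = c("1980-01-02", "1975-06-30", "1990-11-15"),
                   stringsAsFactors = FALSE)
db_b <- data.frame(id_b = 101:103,
                   name = c("maria silva", "joao sousa", "pedro lima"),
                   birth = c("1980-01-02", "1975-06-30", "1988-03-09"),
                   stringsAsFactors = FALSE)

# Deterministic linkage: exact agreement on every common attribute.
deterministic <- merge(db_a, db_b, by = c("name", "birth"))

# Probabilistic-style linkage: score every candidate pair by similarity
# instead of requiring exact equality.
pairs <- expand.grid(a = seq_len(nrow(db_a)), b = seq_len(nrow(db_b)))
edit_dist <- adist(db_a$name, db_b$name)            # Levenshtein distances
max_len   <- outer(nchar(db_a$name), nchar(db_b$name), pmax)
name_sim  <- 1 - edit_dist / max_len                # 1 means identical names
pairs$name_sim   <- name_sim[cbind(pairs$a, pairs$b)]
pairs$birth_same <- db_a$birth[pairs$a] == db_b$birth[pairs$b]
pairs$score      <- 0.5 * pairs$name_sim + 0.5 * pairs$birth_same

# Keep the best-scoring candidate for each db_a record, and accept it
# only if it clears the assumed cutoff of 0.9.
best <- do.call(rbind, lapply(split(pairs, pairs$a),
                              function(p) p[which.max(p$score), ]))
linked <- best[best$score >= 0.9, ]
print(linked)
```

The deterministic merge links only "maria silva", while the scored approach also links "joao souza" to "joao sousa" despite the one-letter spelling difference.  Note that every pair of records is compared, which is why the number of comparisons grows so quickly with database size.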


  • Pereira, J. P., & Noronha, V. (2016). Anomalies Detection and Disease Prediction in Healthcare Systems using Big Data Analytics. Retrieved from
  • Pita, R., Pinto, C., Melo, P., Silva, M., Barreto, M., & Rasella, D. (2015). A Spark-based Workflow for Probabilistic Record Linkage of Healthcare Data. In EDBT/ICDT Workshops (pp. 17-26).

An Innovation that is possible 15-20 years from now

Innovation idea that is not possible today but will be in the next 15-20 years

Mobile technology is everywhere today, and its use is prolific among all the diverse populations in the U.S.; even segments of the population that do not own a computer own a smartphone (Kumar, 2015).  Electronic transactions carrying trillions of dollars, sensitive flight data, etc. take place all the time (Kumar, 2015; Safian, 2015).  Safian (2015) predicts that mobile voting will be one of the many things that will occur in the next 20 years.

Thirty-three states offer online voter registration, which allowed 6.5% of the electorate to register for 2014, up from 1.7% in 2010 (Election Assistance Commission [EAC], 2015; Jayakumar, 2015). About 19.2% of ballots in 2014 were rejected due to improper registration (EAC, 2015).  Eighty cities and towns in Canada have experimented with mobile voting since 2003, and Sweden, Latvia, and Switzerland have tested the idea (Gross, 2011).  Since 2005, Estonia has held a mobile voting period that lasts about seven days and is available to all citizens, with about 1/4 to 1/3 of votes cast online (Vabariigi Valimiskomisjon, 2016).

Mobile voting can help reduce the cost of elections, reduce the need for polling places, encourage and engage disenfranchised voters, reduce the time it takes to cast a vote, reduce the need to travel to a polling place, facilitate fast results, provide a more convenient way of collecting large amounts of data about the voting population and their turnout, and allow for easier voter registration (Jayakumar, 2015; Kumar, 2015). However, for mobile voting to become a key innovation in the next 15-20 years, its main goals must be addressed: security, accessibility, anonymity, convenience, and verifiability (Gross, 2011; Jayakumar, 2015; Kumar, 2015; Safian, 2015).

Forces that define the innovation that may facilitate or reduce its likelihood of success

Technological: Paper ballots provide anonymity and freedom from manipulation (Jayakumar, 2015), even though some ballots could be switched. Mobile voting devices currently have issues with security and verifiability (Jayakumar, 2015).  However, as previously discussed, other countries are working on providing democracy to all by allowing both paper and electronic ballots.  Mobile voting is also unlike typical transactional data from a bank, where a user can correct errors (Jayakumar, 2015).  Technology must take this into account, such that voting data is unalterable in transit from the mobile device to its destination (Jayakumar, 2015).  In 2014, Zimmerman and Kiniry showed that Alaska’s PDF ballots are insecure, which is evidence that the technology is not yet reliable enough to ensure a tamper-free election.

Ethical: Mobile voting can allow the lowest income workers who are afraid to take time off from work, single parents with no daycare options, and people without cars in remote rural areas to vote, and it can increase turnout during midterm and off-season elections, e.g. runoff elections (Jayakumar, 2015; Kumar, 2015). It is suggested that voter intimidation may also be resolved through mobile voting, as people can vote in the privacy of their own homes (Kumar, 2015).

Financial: Huge cost savings could be realized because, in 2014, 732K poll workers were hired for 114K polling locations, which amounts to about 6.4 people per polling location (EAC, 2015).


Adv Quant: Compelling Topics

Compelling topics summary/definitions

  • Supervised machine learning algorithms: models that need training and testing data sets and that validate the model against some predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014).
  • Unsupervised machine learning algorithms: models that need training and testing data sets but, unlike supervised learning, do not validate the model against some predetermined output value (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). Therefore, unsupervised learning tries to find the natural relationships in the input data (Ahlemeyer-Stubbe & Coleman, 2014).
  • General Least Squares Model (GLM): the line of best fit for linear regression modeling, along with its corresponding correlations (Smith, 2015). A linear regression model rests on five assumptions: additivity, linearity, independent errors, homoscedasticity, and normally distributed errors.
  • Overfitting: stuffing a regression model with so many variables that many of them contribute little weight toward predicting the dependent variable (Field, 2013; Vandekerckhove, Matzke, & Wagenmakers, 2014). Thus, to avoid the overfitting problem, the use of parsimony is important in big data analytics.
  • Parsimony: describing a dependent variable with as few independent variables as possible (Field, 2013; Huck, 2011; Smith, 2015). The concept of parsimony can be attributed to Occam’s Razor, which states “plurality ought never be posited without necessity” (Duignan, 2015).  Vandekerckhove et al. (2014) describe parsimony as a way of removing the noise from the signal to create better predictive regression models.
  • Hierarchical Regression: a technique in which the researcher builds a multivariate regression model in stages, adding known independent variables first and newer independent variables afterward, in order to avoid overfitting (Austin, Goel & van Walraven, 2001; Field, 2013; Huck, 2011).
  • Logistic Regression: multi-variable regression in which one or more independent variables, continuous or categorical, are used to predict a dichotomous/binary/categorical dependent variable (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).
  • Nearest Neighbor Methods: K-nearest neighbor (e.g., K = 5) assigns a data point to a group by having its 5 nearest neighbors vote on that data point, and it is particularly useful if the data is binary or categorical (Berson, Smith, & Thearling, 1999).
  • Classification Trees: aid in data abstraction and finding patterns in an intuitive way (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Connolly & Begg, 2014) and aid the decision-making process by mapping out all the paths, solutions, or options available to the decision maker to decide upon.
  • Bayesian Analysis: can be reduced to a conditional probability that aims to take into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003).
  • Discriminant Analysis: determines how data should best be separated into several groups, based on the independent variables that create the largest separation of the prediction (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013).
  • Ensemble Models: can perform better than a single classifier, since they are created as a combination of classifiers that have a weight attached to them to properly classify new data points (Bauer & Kohavi, 1999; Dietterich, 2000), through techniques like Bagging and Boosting. Boosting procedures help reduce both bias and variance of the different methods, and bagging procedures reduce just the variance of the different methods (Bauer & Kohavi, 1999; Liaw & Wiener, 2002).
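As one concrete instance of the definitions above, the K-nearest neighbor vote can be sketched in a few lines of base R.  The training points, their labels, and the helper name `knn_vote` are hypothetical, and Euclidean distance is an assumed choice of distance measure.

```r
# Hypothetical labelled points in two dimensions.
train_x <- rbind(c(1.0, 1.1), c(1.2, 0.9), c(0.8, 1.0), c(1.1, 1.3),
                 c(3.0, 3.2), c(3.1, 2.9), c(2.9, 3.0), c(3.3, 3.1))
train_y <- c("a", "a", "a", "a", "b", "b", "b", "b")

knn_vote <- function(new_point, x, y, k = 5) {
  # Euclidean distance from the new point to every training point.
  d <- sqrt(rowSums(sweep(x, 2, new_point)^2))
  nearest <- order(d)[seq_len(k)]          # indices of the k closest points
  votes <- table(y[nearest])               # each neighbor casts one vote
  names(votes)[which.max(votes)]           # the majority class wins
}

knn_vote(c(1.0, 1.0), train_x, train_y)    # returns "a" (four of five votes)
```

An odd K such as 5 avoids tied votes when there are only two classes, which is one reason it is a common default.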



  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Austin, P. C., Goel, V., & van Walraven, C. (2001). An introduction to multilevel regression models. Canadian Journal of Public Health, 92(2), 150.
  • Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2), 105-139.
  • Berson, A., Smith, S., & Thearling, K. (1999). Building Data Mining Applications for CRM. McGraw-Hill. Retrieved from
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, 12th Edition. [VitalSource Bookshelf Online].
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • Dietterich, T. G. (2000). Ensemble methods in machine learning. International workshop on multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg.
  • Duignan, B. (2015). Occam’s razor. Encyclopaedia Britannica. Retrieved from
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics, 4th Edition. [VitalSource Bookshelf Online].
  • Gall, M. D., Gall, J. P., & Borg, W. R. (2006). Educational Research: An Introduction, 8th Edition. [VitalSource Bookshelf Online].
  • Hubbard, D. W. (2010). How to measure anything: Finding the values of “intangibles” in business (2nd ed.). New Jersey: John Wiley & Sons, Inc.
  • Huck, S. W. (2011). Reading Statistics and Research, 6th Edition. [VitalSource Bookshelf Online].
  • Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from
  • Spiegelhalter, D. & Rice, K. (2009) Bayesian statistics. Retrieved from
  • Vandekerckhove, J., Matzke, D., & Wagenmakers, E. J. (2014). Model comparison and the principle of parsimony.
  • Yudkowsky, E.S. (2003). An intuitive explanation of Bayesian reasoning. Retrieved from

Adv Quant: Association Rules in R


Online radio keeps track of everything you play, and this information is used to recommend additional music to you. This large dataset, with ~300,000 records from ~15,000 users, was mined with arules in R to recommend new music to this community of radio listeners.



Figure 1. The output of the apriori command, which filtered data for the rules under a support of 0.01, a confidence of 0.5, and max length of 3.


Figure 2. The output of the apriori command for the first rule set, searching for only a subset of rules: (a) all rules with a lift greater than 5, (b) all rules with a confidence greater than 0.6, (c) all rules with a support greater than 0.02 and a confidence greater than 0.6, (d) all rules where Rihanna appears on the right-hand side, and (e) the top ten rules with the largest lift.


Figure 3. The output of the apriori command, which filtered data for the rules under a support of 0.001, a confidence of 0.5, and max length of 2.


Figure 4. The output of the apriori command for the second rule set, searching for only a subset of rules: (a) all rules with a lift greater than 5, (b) all rules with a confidence greater than 0.6, (c) all rules with a support greater than 0.02 and a confidence greater than 0.6, (d) all rules where Rihanna appears on the right-hand side, and (e) the top ten rules with the largest lift.


There are a total of 289,956 data points, with 15,001 unique users listening to 1,005 unique artists.  From this dataset, there is a total of 48 rules under a support of 0.01, a confidence of 0.5, and a max length of 3.  Inspecting the first five rules (Figure 1) shows each rule that meets these restrictions, along with its corresponding support, confidence, and lift.  There is also a total of 93 rules under a support of 0.001, a confidence of 0.5, and a max length of 2.  Inspecting the first five of these rules (Figure 3) likewise shows each rule that meets the restrictions, along with its corresponding support, confidence, and lift.

Apriori counts the transactions within the “playtrans” matrix.  According to Hahsler et al. (n.d.), the most used constraints for apriori are support and confidence, where the lower the confidence or support values, the more rules the algorithm will generate.  This relationship is illustrated by the two rule sets: with the higher support value, fewer rules were generated.  Essentially, support can be seen as the proportion (%) of transactions in the data set that contain that exact item, whereas confidence is the proportion (%) of transactions where the rule is correct (Hahsler et al., n.d.).  The effect of varying the support value can be seen in the number of subset rules for each rule set (Figure 2 & 4).  When the support level was reduced, there was an increase in the number of rules with Rihanna on the right-hand side (Figure 2d & 4d).  The same increase appeared across all the inspected subsets, even though the support, confidence, and lift thresholds used to create the subsets were the same for both rule sets.

Finally, the greater the lift value, the stronger the association rule (Hahsler et al., n.d.).  When relaxing the constraints, higher lift values could be observed (Figure 1-4).  This happens because weakening the constraints shows more rules, and some of these additional rules have higher lift values.  Comparing the top 10 lift values of both rule sets (Figure 2e and 4e), the top rule under the stricter constraints doesn’t appear in the top 10 lift values under the relaxed constraints.  With stricter constraints (Figure 2e), users that listen to “the pussycat dolls” have a higher chance of listening to “rihanna” than to any other artist.  With relaxed constraints (Figure 4e), users that listen to “madvillain” have a higher chance of listening to “mf doom” than to any other artist, and that association is stronger than the “the pussycat dolls”-“rihanna” rule.  Similar associations can be made from the data found in the figures (1-4).
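The three measures discussed above can be computed by hand on a toy example.  The sketch below uses base R only; the five listening histories and the helper names `support`, `confidence`, and `lift` are hypothetical and are not taken from the lastfm data or from the arules package.

```r
# Hypothetical listening histories: one element per user.
playlists <- list(
  u1 = c("rihanna", "the pussycat dolls"),
  u2 = c("the pussycat dolls", "rihanna"),
  u3 = c("rihanna"),
  u4 = c("mf doom", "madvillain"),
  u5 = c("madvillain", "mf doom")
)

# Support: proportion of transactions that contain every artist in `items`.
support <- function(items, transactions) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}

# Confidence of lhs => rhs: support(lhs and rhs) / support(lhs).
confidence <- function(lhs, rhs, transactions) {
  support(c(lhs, rhs), transactions) / support(lhs, transactions)
}

# Lift: confidence relative to how often rhs occurs regardless of lhs.
lift <- function(lhs, rhs, transactions) {
  confidence(lhs, rhs, transactions) / support(rhs, transactions)
}

support(c("the pussycat dolls", "rihanna"), playlists)   # 2 of 5 users
confidence("the pussycat dolls", "rihanna", playlists)   # 2 of 2 listeners
lift("the pussycat dolls", "rihanna", playlists)         # 1 / (3/5)
lift("madvillain", "mf doom", playlists)                 # 1 / (2/5)
```

Both rules have a confidence of 1 in this toy data, yet the second has the larger lift because “mf doom” is rarer overall than “rihanna”.  A lift of 1 would mean the two artists are listened to independently of each other.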



LastFM = read.csv("lastfm.csv", header = FALSE, sep = ",") ## (Celma, 2009)




## Variables: UserID = V1; ArtistID = V2; ArtistName = V3; PlayCount = V4


## Apriori info (Hahsler, Grun, Hornik, & Buchta, n.d.):
##   Constraints for apriori are known as support and confidence; the lower the confidence or support, the more rules.
##     * Support is the proportion (%) of transactions in the data set with that exact item.
##     * Confidence is the proportion (%) of transactions where the rule is correct.
##   The greater the lift, the stronger the association rule; lift is a deviation measure of the total rule
##   support from the support expected under independence.
##   Other constraints used
##     * Max length defines the maximum size of mined frequent item rules.








## arules package for association rules:
## computational environment for mining association rules and frequent item sets
library(arules)

## We need to manipulate the data a bit before using arules: we split the data in the vector
## x into groups defined in vector f. (Hahsler, Grun, Hornik, & Buchta, n.d.)
playlists = split(x=LastFM[,"V2"], f=LastFM$V1) # Split the artists (V2) into one list element per user (V1) (R, n.d.c.)
playlists = lapply(playlists, unique)           # Remove duplicate artists within each user's playlist (R, n.d.a.; R, n.d.b.)
playtrans = as(playlists, "transactions")       # Convert the list into an arules transactions object

## Create association rules with a support of 0.01 and confidence of 0.5, with a max length of 3

## which will show the support that listening to one artist gives to other artists; in other words,

## providing lift to an associated artist.

musicrules = apriori(playtrans, parameter=list(support=0.01, confidence=0.5, maxlen=3)) # filter the data for rules



## Choose any subset

inspect(subset(musicrules, subset=lift>5))                         # tell me all the rules with a lift > 5
inspect(subset(musicrules, subset=confidence>0.6))                 # tell me all the rules with a confidence greater than 0.6
inspect(subset(musicrules, subset=support>0.02 & confidence>0.6))  # tell me the rules with support > 0.02 and confidence > 0.6
inspect(subset(musicrules, subset=rhs%in%"rihanna"))               # tell me all the rules with rihanna on the right-hand side
inspect(head(musicrules, n=10, by="lift"))                         # tell me the top 10 rules with the largest lift

## Create association rules with a support of 0.001 and confidence of 0.5, with a max length of 2

artrules = apriori(playtrans, parameter=list(support=0.001, confidence=0.5, maxlen=2)) # filter the data for rules



 ## Choose any subset

inspect(subset(artrules, subset=lift>5))

inspect(subset(artrules, subset=confidence>0.6))

inspect(subset(artrules, subset=support>0.02& confidence >0.6))

inspect(subset(artrules, subset=rhs%in%"rihanna"))

inspect(head(artrules, n=10, by="lift"))

## Write down all the rules into a CSV file for co

write(musicrules, file="musicRulesFromApriori.csv", sep = ",", col.names = NA)

write(artrules, file="artistRulesFromApriori.csv", sep = ",", col.names = NA)