Modeling and analyzing big data in health care

Let’s consider applying the building-blocks approach to a healthcare problem: monitoring patient vital signs, similar to Chen et al. (2010).

  • The purpose that the new data will serve: Most hospitals measure the following vitals for triaging patients: blood pressure and flow, core temperature, ECG, and carbon dioxide concentration (Chen et al., 2010).
    1. Functions it should serve: gathering, storing, preprocessing, and processing the data. Chen et al. (2010) suggested that the system should also perform consistency checks and aggregate and integrate the data.
    2. Which parts of the data are needed to serve these functions: all
  • Tools needed: a distributed database system, a wireless network, parallel processing, a graphical user interface for healthcare providers to understand the data, servers, subject matter experts to set upper and lower limits, and machine-learning classification algorithms
  • Top-level plan: The data will be collected from the vital-sign sensors, streaming at various time intervals into a central hub that sends the data in packets over a wireless network to a server room. The server can divide the data across the various distributed systems accordingly. A parallel-processing program will access the data per patient per window of time to run the needed functions and classifications and provide triage warnings if the vitals hit any of the predetermined key performance indicators that require intervention by the subject matter experts. If a key performance indicator is triggered, the data is sent to the healthcare provider’s device via a graphical user interface (a minimal sketch of such a threshold check follows this list).
  • Pivoting is bound to happen; for example:
    1. The graphical user interface may not be healthcare-provider friendly
    2. Some sensors may need to raise a warning when they are going bad
    3. Subject matter experts may need to readjust the classification algorithm for better triaging
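
A minimal sketch of the KPI threshold check described in the top-level plan above, in Python. The vital-sign names, limits, and record format are hypothetical illustrations, not values from Chen et al. (2010).

```python
# Hypothetical upper/lower limits that the subject matter experts might set.
KPI_LIMITS = {
    "heart_rate_bpm": (40, 130),
    "core_temp_c": (35.0, 39.5),
    "systolic_bp_mmhg": (90, 180),
    "etco2_mmhg": (30, 50),
}

def check_vitals(reading):
    """Return (vital, value, limits) tuples for every KPI breach in a reading."""
    alerts = []
    for vital, (low, high) in KPI_LIMITS.items():
        value = reading.get(vital)
        if value is not None and not (low <= value <= high):
            alerts.append((vital, value, (low, high)))
    return alerts

# One packet of sensor data for one patient in one time window.
packet = {"patient_id": "P-001", "heart_rate_bpm": 142, "core_temp_c": 38.1}
for vital, value, limits in check_vitals(packet):
    # In the real system this would be pushed to the provider's GUI.
    print(f"ALERT {packet['patient_id']}: {vital}={value} outside {limits}")
```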

Thus, the above problem, as discussed by Chen et al. (2010), can be broken apart into its building-block components as addressed in Burkle et al. (2001).  These components help create a system to analyze this set of big healthcare data through analytics, via distributed systems and parallel processing, as addressed by Services (2015) and Mirtaheri et al. (2008).

Draw on a large body of data to form a prediction or variable comparisons within the premise of big data.

Fayyad, Piatetsky-Shapiro, and Smyth (1996) stated that data analytics can be divided into descriptive and predictive analytics. Vardarlier and Silahtaroglu (2016) agreed with Fayyad et al.’s (1996) division but added prescriptive analytics.  Which division one should choose depends on the goal, for example diagnosing illnesses with big data analytics.  Raghupathi and Raghupathi (2014) listed some common examples of big data in the healthcare field: personal medical records, radiology images, clinical trial data, 3D imaging, human genomic data, population genomic data, biometric sensor readings, x-ray films, scripts, and traditional paper files.  Consider, for instance, the use of big data analytics to understand the 23 pairs of chromosomes that are the building blocks for people. Healthcare professionals are using the big data generated from our genomic code to help predict which illnesses a person could get (Services, 2013). Thus, predictive analytics tools and algorithms like decision trees would be of some use.  Another use of predictive analytics and machine learning is diagnosing an eye disease like diabetic retinopathy from an image by using classification algorithms (Goldbloom, 2016).

Examine the unique domain of health informatics and explain how big data analytics contributes to the detection of fraud and the diagnosis of illness.

A process-mining framework for the detection of healthcare fraud and abuse, a case study (Yang & Hwang, 2006): Fraud exists in processing health insurance claims because there are many opportunities to commit it across multiple channels of communication: service providers, insurance agencies, and patients. Any one of these three parties can commit fraud, and the highest chance of fraud occurs where service providers can perform unnecessary procedures, putting patients at risk. This case study provided a framework for conducting automated fraud detection. The study collected data on 2,543 gynecology patients from 2001-2002 from one hospital, filtered out noisy data, identified activities based on medical expertise, and identified fraud in about 906 of the cases.

Summarize one case study in detail related to big data analytics as it relates to organizational processes and topical research.

A case study on the use of Spark in the healthcare field by Pita et al. (2015): Data quality in healthcare data is poor, particularly in the Brazilian Public Health System.  Spark was used to help in data processing to improve quality through deterministic and probabilistic record linkage across multiple databases.  Record linkage is a technique that uses common attributes across multiple databases to identify a 1-to-1 match.  Spark workflows were created to do record linkage by (1) analyzing all data in each database for common attributes with high probabilities of linkage; (2) preprocessing the data, where it is transformed, anonymized, and cleaned into a single format so that all the attributes can be compared to each other for a 1-to-1 match; (3) record linkage based on deterministic and probabilistic algorithms; and (4) statistical analysis to evaluate accuracy. Over 397M comparisons were made in 12 hours.  They concluded that accuracy depends on the size of the data: the bigger the data, the more accurate the record linkage.
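
To make the deterministic vs. probabilistic distinction concrete, here is a minimal (Py)Spark sketch in the spirit of that workflow. It is not Pita et al.’s (2015) actual code; the column names, similarity weights, and acceptance threshold are assumptions for illustration, and it assumes a local pyspark installation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("record-linkage-sketch").getOrCreate()

# Two toy "databases" sharing common attributes.
a = spark.createDataFrame(
    [("1", "maria silva", "1980-02-01", "salvador")],
    ["id_a", "name", "birth_date", "city"])
b = spark.createDataFrame(
    [("9", "maria silvaa", "1980-02-01", "salvador")],
    ["id_b", "name_b", "birth_date", "city_b"])

# Deterministic pass: exact match on the shared attributes.
deterministic = a.join(b, "birth_date").where(
    (F.col("name") == F.col("name_b")) & (F.col("city") == F.col("city_b")))
deterministic.show()

# Probabilistic pass: block on birth_date, then score a fuzzy name match.
pairs = a.join(b, "birth_date")
scored = pairs.withColumn(
    "name_sim",
    1 - F.levenshtein("name", "name_b")
        / F.greatest(F.length("name"), F.length("name_b"))
).withColumn(
    "score",
    0.7 * F.col("name_sim") + 0.3 * (F.col("city") == F.col("city_b")).cast("double"))

links = scored.filter(F.col("score") >= 0.8)   # accept as a likely 1-to-1 match
links.show()
```

In the real workflow, this scoring would run over the anonymized, standardized attributes produced by the preprocessing step.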

References

  • Burkle, T., Hain, T., Hossain, H., Dudeck, J., & Domann, E. (2001). Bioinformatics in medical practice: What is necessary for a hospital? Studies in Health Technology and Informatics, (2), 951-955.
  • Chen, B., Varkey, J. P., Pompili, D., Li, J. K., & Marsic, I. (2010). Patient vital signs monitoring using wireless body area networks. In Bioengineering Conference, Proceedings of the 2010 IEEE 36th Annual Northeast (pp. 1-2). IEEE.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37. Retrieved from: http://www.aaai.org/ojs/index.php/aimagazine/article/download/1230/1131/
  • Goldbloom, A. (2016). The jobs we’ll lose to machines – and the ones we won’t. TED Talks. Retrieved from https://www.youtube.com/watch?v=gWmRkYsLzB4
  • Mirtaheri, S. L., Khaneghah, E. M., Sharifi, M., & Azgomi, M. A. (2008). The influence of efficient message passing mechanisms on high performance distributed scientific computing. In Parallel and Distributed Processing with Applications, 2008. ISPA’08. International Symposium on (pp. 663-668). IEEE.
  • Pita, R., Pinto, C., Melo, P., Silva, M., Barreto, M., & Rasella, D. (2015). A Spark-based Workflow for Probabilistic Record Linkage of Healthcare Data. In EDBT/ICDT Workshops (pp. 17-26).
  • Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(3). Retrieved from http://hissjournal.biomedcentral.com/articles/10.1186/2047-2501-2-3
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].
  • Vardarlier, P., & Silahtaroglu, G. (2016). Gossip management at universities using big data warehouse model integrated with a decision support system. International Journal of Research in Business and Social Science, 5(1), 1–14. doi: http://doi.org/10.1108/17506200710779521
  • Yang, W. S., & Hwang, S. Y. (2006). A process-mining framework for the detection of healthcare fraud and abuse. Expert Systems with Applications, 31(1), 56-68.

Compelling topics

Hadoop, XML and Spark

Hadoop is predominantly known for its Hadoop Distributed File System (HDFS), in which the data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). MapReduce has two queries: one maps the input data into a new format and splits it across a group of computer nodes, while the second reduces the data in each node so that, when all the nodes are combined, it can provide the answer sought (Eini, 2010).

XML documents represent a whole data file, which contains markups, elements, and nodes (Lublinsky, Smith, & Yakubovich, 2013; Myer, 2005):

  • XML markups are tags that help describe where the data starts and ends as well as the data’s properties/attributes; they are encapsulated by < and >
  • XML elements are data values, encapsulated by an opening <tag> and a closing </tag>
  • XML nodes are part of the hierarchical structure of a document that contains a data element and its tags

Unfortunately, the syntax and tags are redundant, which can consume huge numbers of bytes and slow down processing speeds (Hiroshi, 2007).
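
As a concrete illustration of these three pieces, here is a small, hypothetical XML record parsed with Python’s built-in ElementTree; the tags and values are made up for this example.

```python
import xml.etree.ElementTree as ET

doc = """
<patient id="P-001">
    <vitals taken="2016-05-01T08:00">
        <heartRate unit="bpm">88</heartRate>
        <coreTemp unit="C">37.2</coreTemp>
    </vitals>
</patient>
"""

root = ET.fromstring(doc)              # the root node, <patient>
print(root.tag, root.attrib)           # markup name and its attributes
for node in root.iter():               # walk every node in the hierarchy
    print(node.tag, node.attrib, (node.text or "").strip())

# Note how many of the bytes are the repeated opening/closing tags rather
# than the data values themselves -- the redundancy criticized above.
```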

Five questions must be asked before designing an XML data document (Font, 2010):

  1. Will this document be part of a solution?
  2. Will this document have design standards that must be followed?
  3. What part may change over time?
  4. To what extent is human readability or machine readability important?
  5. Will there be a massive amount of data? Does file size matter?

All XML data documents should be versioned, and key stakeholders should be involved in the XML data design process (Font, 2010).  XML is a machine- and human-readable data format (Smith, 2012). With the goal of using XML for MapReduce, we have to assume that we need to map and reduce huge files (Eini, 2010; Smith, 2012). Unfortunately, XML doesn’t include sync markers in its data format, and therefore MapReduce doesn’t support XML natively (Smith, 2012). However, Smith (2012) and Rohit (2013) used the XmlInputFormat class from Mahout to read XML input data into HBase. Smith (2012) stated that Mahout’s code needs to know the exact sequence of XML start and end tags that will be searched for, and that elements with attributes are hard for Mahout’s XML library to detect and parse.
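
The record-splitting idea behind XmlInputFormat can be sketched in plain Python: scan for an exact start tag and end tag, and hand each matched span to a mapper. This is only an illustration of the behavior Smith (2012) describes, not Mahout’s actual Java implementation, and it shows why an element with attributes (e.g. <patient id="1">) would slip past an exact start-tag match.

```python
def split_xml_records(text, start_tag, end_tag):
    """Split an XML stream into records delimited by exact start/end tags."""
    records, pos = [], 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break
        end = text.find(end_tag, start)
        if end == -1:
            break
        records.append(text[start:end + len(end_tag)])
        pos = end + len(end_tag)
    return records

xml_stream = ("<patients>"
              "<patient><id>1</id></patient>"
              "<patient><id>2</id></patient>"
              "</patients>")
for record in split_xml_records(xml_stream, "<patient>", "</patient>"):
    print(record)   # each record would become one mapper's input
```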

Apache Spark started from a working group inside and outside of UC Berkeley, in search of an open-sourced batch-processing model like MapReduce that supports multi-pass algorithms (Zaharia et al., 2012). Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets and 3x-5x for relatively large datasets, but Spark is more memory intensive, and its speed advantage disappears as available memory approaches zero with really large datasets (Gu & Li, 2013).  The Apache Spark website boasts that Spark can run programs 100x faster than Hadoop’s MapReduce in memory (Spark, n.d.). Spark outperforms Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013). Also, Spark runs 10x faster than Hadoop on disk (Spark, n.d.). Gu and Li (2013) recommend that if speed to the solution is not an issue but memory is, then Spark shouldn’t be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized.

Data visualization

Big data can be defined as any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014).  What is considered big data can change over time: what counted as big data in 2002 is not considered big data in 2016, due to advancements in technology (Fox & Do, 2013).  Then there is data-in-motion, which can be defined as a part of data velocity that deals with the speed of data coming in from multiple sources as well as the speed of data traveling between systems (Katal, Wazid, & Goudar, 2013). Essentially, data-in-motion can encompass data streaming, data transfer, or real-time data. However, there are challenges and issues that have to be addressed when conducting real-time analysis on data streams (Katal et al., 2013; Tsinoremas et al., n.d.).

For data-driven decisions, it is not enough to analyze the relevant data; one must also select relevant visualizations of that data to enable those decisions (eInfochips, n.d.). There are many ways to visualize data so that it highlights key facts succinctly and with style: tables and rankings, bar charts, line graphs, pie charts, stacked bar charts, tree maps, choropleth maps, cartograms, pinpoint maps, or proportional symbol maps (CHCF, 2014).  These plots, charts, maps, and graphs can be animated, static, or interactive, and can be delivered as standalone images, dashboards, scorecards, or infographics (CHCF, 2014; eInfochips, n.d.).
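
As a tiny illustration of matching chart type to question, the sketch below (hypothetical numbers, matplotlib assumed available) uses a bar chart for a categorical comparison and a line graph for a trend over time.

```python
import matplotlib.pyplot as plt

wards = ["Cardiology", "Oncology", "Pediatrics"]
readmissions = [42, 31, 18]                    # hypothetical monthly counts
months = ["Jan", "Feb", "Mar", "Apr"]
avg_wait_min = [34, 29, 31, 26]                # hypothetical averages

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.bar(wards, readmissions)                   # categorical comparison -> bar chart
ax1.set_title("Readmissions by ward")
ax2.plot(months, avg_wait_min, marker="o")     # trend over time -> line graph
ax2.set_title("Average ER wait (minutes)")
plt.tight_layout()
plt.show()
```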

Artificial Intelligence (AI)

Artificial Intelligence (AI) is an embedded technology, based on the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). AI can provide tremendous value since it builds thousands of models and correlations automatically in one week, which used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015).  Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the “why” behind them, thus making the results hard to interpret (Power, 2015).

“Machines can excel at frequent high-volume tasks. Humans can tackle novel situations,” said Anthony Goldbloom. Thus, the fundamental question that decision makers need to ask is how much of the decision reduces to frequent high-volume tasks and how much reduces to novel situations (Goldbloom, 2016).  If the ratio is skewed toward high-volume tasks, then AI could be a candidate to replace decision makers; if the ratio is evenly split, then AI could augment and assist decision makers; and if the ratio is skewed toward novel situations, then AI wouldn’t help decision makers.  These novel situations are equivalent to our tough challenges today (McAfee, 2013).  Finally, Meetoo (2016) warned that it doesn’t matter how intelligent or strategic a job may be: if there is enough data on that job to create accurate rules, it can be automated as well, because machine learning can run millions of simulations against itself to generate huge volumes of data to learn from.

 

Resources:

Data Tools: Hadoop Basic Components & Architecture

Big Data

Big data can be defined as any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014).  What is considered big data can change over time: what counted as big data in 2002 is not considered big data in 2016, due to advancements in technology (Fox & Do, 2013).  However, given that big data today is too big to be processed by just one processor, parallel processing allows data analytics to be conducted more efficiently through platforms like Hadoop (Hortonworks, 2013; IBM, n.d.).

Hadoop: Basic Components and Architecture

Hadoop’s service is part of the cloud, delivered as Platform as a Service (PaaS).  For PaaS, the end users manage the applications and data, whereas the provider (here, Hadoop) administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001).

Hadoop is predominantly known for its Hadoop Distributed File System (HDFS), in which the data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). Data is broken up into small blocks, like Legos, that are distributed across a distributed database system spanning multiple servers (IBM, n.d.).  Just like Legos, the results can be assembled back together at the end.  This feature of HDFS allows Hadoop to manage big data through parallel processing and analysis (Gray et al., 2005; Hortonworks, 2013; IBM, n.d.).  Multiple data types are supported through HDFS (IBM, n.d.). Hadoop’s MapReduce function can be broken down into two queries.

Parallel processing is key for Hadoop, because it makes quick work of a big data set: rather than having one processor do all the work, Hadoop splits up the task among many processors. The first of MapReduce’s two queries, the mapping procedure, splits the data into the Lego pieces and places them across a group of computer nodes in the HDFS (Eini, 2010; IBM, n.d.; Hortonworks, 2013; Sathupadi, 2010). The second MapReduce query applies algorithms to reduce the data in each of the computer nodes so as to answer the question that was asked of the data; at the end of the parallel-processing procedures, the reduced data gets combined and further reduced to provide the final answer (Eini, 2010; IBM, n.d.; Hortonworks, 2013; Minelli et al., 2013; Sathupadi, 2010). In other words, data is partitioned, sorted, and grouped to provide keys and values as output (Hortonworks, 2013; Rathbone, 2013; Sathupadi, 2010). Therefore, IBM’s (n.d.) MapReduce functions use HDFS to house the data, and MapReduce runs its procedures on the server on which the data is stored.  Data is stored in memory, not in cache, allowing for continuous service (Gu & Li, 2013; Zaharia et al., 2012).
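
The map/shuffle/reduce flow described above can be simulated locally in a few lines of Python. This is not Hadoop’s Java API; it is only a sketch of the idea, using per-patient heart-rate averages as a made-up example.

```python
from collections import defaultdict

records = [("P-001", 88), ("P-002", 72), ("P-001", 94), ("P-002", 70)]  # (patient, heart rate)

def mapper(record):
    patient_id, heart_rate = record
    yield patient_id, heart_rate                              # emit (key, value) pairs

def reducer(patient_id, heart_rates):
    return patient_id, sum(heart_rates) / len(heart_rates)    # combine one key's values

# Shuffle/sort: group all mapped values by key, as the framework would.
groups = defaultdict(list)
for record in records:                 # in Hadoop, mappers run in parallel, one per data block
    for key, value in mapper(record):
        groups[key].append(value)

results = [reducer(key, values) for key, values in sorted(groups.items())]
print(results)    # [('P-001', 91.0), ('P-002', 71.0)]
```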

Given the Lego-block feature of HDFS, which enables the MapReduce functions, these blocks contain subsets of data that are small enough to be easily duplicated (for disaster-recovery purposes) on two or more different servers (IBM, n.d.).  This partitioning of the data into Lego-like blocks allows big iterative tasks to be done quite easily and efficiently on big data sets (Gu & Li, 2013).

When to use Hadoop

Gu and Li (2013) recommend that if speed to the solution is not an issue but memory is, then Spark shouldn’t be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized. Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets and 3x-5x for relatively large datasets, but Spark is more memory intensive, and its speed advantage disappears as available memory approaches zero with really large datasets (Gu & Li, 2013).  Also, Hadoop fails at providing a real-time response (Greer, Rodriguez-Martinez, & Seguel, 2010).  Therefore, Hadoop should be used for big data that isn’t streaming in real time and that has a ton of iterative processing/analytical tasks.

Preparation of Big Data for Hadoop

Collecting the raw and unaltered real-world data is usually the first step of any data or text mining study (Corrales et al., 2015; Gera & Goel, 2015; He et al., 2013; Hoonlor, 2011; Nassirtoussi et al., 2014). Next, the data must be preprocessed, because raw text data files are unsuitable for predictive data analytics tools like Hadoop (Hoonlor, 2011). Barak and Modarres (2015) and Nassirtoussi et al. (2014) both stated that in data and text mining, data preprocessing has the most significant impact on the research results.  Wayner (2013) and Lublinsky, Smith, and Yakubovich (2013) enumerated the following tools, used to preprocess data prior to data analysis with Hadoop, as part of the core components of the ecosystem:

  • Ambari: Graphical User Interface for setting up clusters with common components. Essentially a simple management tool.
  • Avro: a serialization system that compiles all the data together into an XML or JSON output to be shared with others.
  • BigTop: tool that provides testing of sub-projects within Hadoop.
  • Clouds: Allows the end-user to spin up multiple nodes to process the data without necessarily owning the infrastructure, essentially pay as you go model
  • Flume: Gathers all data and places it into HDFS. Essentially an enterprise data integration tool.
  • GIS tools: allows end-users to work with big data stored as geographic maps under GIS (Geographic Information Systems) formats.
  • HBase: helps search and share a big tabular data set; unfortunately, full ACID support is not available. Essentially a NoSQL database.
  • HDFS: Storage of big data in multiple distributed systems into data blocks. Essentially a Distributed reliable data storage.
  • Hive: a SQL-type language that files and pulls out the data that is needed from HBase. Essentially a high-level abstraction tool.
  • Lucene: indexes large blocks of unstructured text-based data and allows for dynamic clustering and the ability to read XML.
  • Mahout: allows Hadoop to use classification, filtering, k-means, Dirichlet, parallel pattern, and Bayesian classification algorithms, similar to Hadoop’s MapReduce. Essentially a data analytics library.
  • NoSQL: Uses NoSQL data stores for data that is not typically stored in HBase or HDFS.
  • Oozie: manages the workflow of a job by allowing the user to break the job into simple steps in a flowchart fashion. Essentially a workflow manager.
  • Pig: stores and maps data in processing nodes for Hadoop to find and process. Essentially a high-level abstraction tool.
  • Spark: uses Hadoop infrastructure to store data in the cache to allow for faster processing time
  • SQL on Hadoop: ad-hoc query the data stored in Hadoop servers using SQL
  • Sqoop: transfers data from SQL databases into Hadoop. Essentially an enterprise data integration tool.
  • Whirr: a library that allows users to run Hadoop clusters on Amazon EC2, Rackspace, etc.
  • ZooKeeper: maintains order and synchronization throughout the parallel processing cluster. Essentially a coordinator of processes.

According to Lublinsky et al. (2013), there are always new datasets, data formats, and data preprocessing and processing tools being added to Hadoop.  Thus, the list provided above is not comprehensive, but rather one to begin from.

References

  • Barak, S., & Modarres, M. (2015). Developing an approach to evaluate stocks by forecasting effective features with data mining methods. Expert Systems with Applications, 42(3), 1325–1339. http://doi.org/10.1016/j.eswa.2014.09.026
  • Corrales, D. C., Ledezma, A., & Corrales, J. C. (2015). A Conceptual Framework for Data Quality in Knowledge Discovery Tasks (FDQ-KDT): A Proposal. Journal of Computers, 10(6), 396-405. Doi: 10.17706/jcp.10.6.396-405.
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Gera, M., & Goel, S. (2015). Data Mining-Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Greer, M., Rodriguez-Martinez, M., & Seguel, J. (2010). Open Source Cloud Computing Tools: A Case Study with a Weather Application. Florida: IEEE Open Source Cloud Computing.
  • Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
  • Gray, J., Liu, D. T., Nieto-Santisteban, M., Szalay, A., DeWitt, D. J., & Heber, G. (2005). Scientific data management in the coming decade. ACM SIGMOD Record, 34(4), 34-41.
  • Gu, L., & Li, H. (2013). Memory or time: Performance evaluation for iterative operation on Hadoop and Spark. In High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on (pp. 721-727). IEEE.
  • Eini, O. (2010). Map/Reduce- a visual explanation. Retrieved from https://ayende.com/blog/4435/map-reduce-a-visual-explanation
  • He, W., Zha, S., & Li, L. (2013). Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management, 33, 464–472. http://doi.org/10.1016/j.ijinfomgt.2013.01.001
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Hortonworks (2013). Introduction to MapReduce. Retrieved from https://www.youtube.com/watch?v=ht3dNvdNDzI
  • IBM (n.d.) What is the Hadoop Distributed File System (HDFS)? Retrieved from https://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/
  • Lau, W. (2001). A Comprehensive Introduction to Cloud Computing. Retrieved from https://www.simple-talk.com/cloud/development/a-comprehensive-introduction-to-cloud-computing/
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox, VitalBook file.
  • Minelli, M., Chambers, M., Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses (1st). VitalSource Bookshelf Online.
  • Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2014). Text mining for market prediction: a systematic review. Expert Systems with Applications, 41(16), 7653–7670. http://doi.org/10.1016/j.eswa.2014.06.009
  • Rathbone, M. (2013). A beginners guide to Hadoop. Retrieved from http://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html
  • Sathupadi, K. (2010) Map Reduce: A really simple introduction. Retrieved from http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/

 

Data Tools: Hadoop Vs Spark

 

Apache Spark

Apache Spark started from a working group inside and outside of UC Berkeley, in search of an open-sourced batch-processing model like MapReduce that supports multi-pass algorithms (Zaharia et al., 2012). Spark applications can be written in Java, Scala, Python, and R, and Spark interfaces with SQL, which increases its ease of use (Spark, n.d.; Zaharia et al., 2012).

Essentially, Spark is a high-performance computing cluster framework, but it doesn’t have its own distributed file system and thus uses the Hadoop Distributed File System (HDFS, HBase) for input and output (Gu & Li, 2013).  Not only can it access data from HDFS and HBase, it can also access data from Cassandra, Hive, Tachyon, and any other Hadoop data source (Spark, n.d.).  However, Spark uses its own data structure, the Resilient Distributed Dataset (RDD), which caches data and is read-only, to improve its processing time as long as there is enough memory for it on all the nodes of a cluster (Gu & Li, 2013; Zaharia et al., 2012). Spark tries to avoid reloading data from disk, which is why it stores its initial and intermediate results in the nodes’ cache (Gu & Li, 2013).
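
A minimal PySpark sketch of that caching behavior, assuming a local pyspark installation; the data and numbers are illustrative only.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-cache-sketch")

readings = sc.parallelize(range(1_000_000))           # stand-in for HDFS input
cleaned = readings.map(lambda x: x * 0.1).filter(lambda x: x > 5.0)
cleaned.cache()                                       # keep the RDD in memory across passes

# Iterative use: both actions reuse the cached partitions instead of
# recomputing them from the input; if a node is lost, the lineage graph
# rebuilds only the missing partitions.
print(cleaned.count())
print(cleaned.sum())

sc.stop()
```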

Machines in the cluster can be rebuilt if lost, making RDDs fault-tolerant without requiring replication (Gu & Li, 2013; Zaharia et al., 2012).  Each RDD is tracked in a lineage graph, and the operations are rerun if data becomes lost, thereby reconstructing the data even if all the nodes running Spark were to fail (Zaharia et al., 2012).

Hadoop

Hadoop is a Java-based system that allows manipulation and calculations to be done by calling the MapReduce function on its HDFS system (Hortonworks, 2013; IBM, n.d.).

In HDFS, big data is broken up into smaller blocks across different locations; no matter the type or amount of data, each of these blocks can still be located and aggregated, like a set of Legos, throughout a distributed database system (IBM, n.d.; Minelli, Chambers, & Dhiraj, 2013). Data blocks are distributed across multiple servers.  This block system provides an easy way to scale the company’s data needs up or down and allows MapReduce to do its tasks on the smaller sets of the data for faster processing (IBM, n.d.). IBM (n.d.) boasts that the data blocks in HDFS are small enough that they can be easily duplicated (for disaster-recovery purposes) on two different servers (or more, depending on your data needs), offering fault tolerance as well. Therefore, IBM’s (n.d.) MapReduce functions use HDFS to run their procedures on the server on which the data is stored, where data is stored in memory, not in cache, allowing for continuous service.

MapReduce contains two job types that work in parallel on distributed systems: (1) mappers, which create and process transactions on the system by mapping/aggregating data by key values, and (2) reducers, which know what those key values are and take all the values stored in a map to reduce the data to what is relevant (Hortonworks, 2013; Sathupadi, 2010). Reducers can work on different keys, and when huge amounts of data are entered into MapReduce, the mapper maps the data, which is then shuffled and sorted before it is reduced (Hortonworks, 2013).  Once the data is reduced, the researcher gets the output that they sought.

Significant Differences between Hadoop and Apache Spark              

Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets and 3x-5x for relatively large datasets, but Spark is more memory intensive, and its speed advantage disappears as available memory approaches zero with really large datasets (Gu & Li, 2013).  The Apache Spark website boasts that Spark can run programs 100x faster than Hadoop’s MapReduce in memory (Spark, n.d.). Spark outperforms Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013). Also, Spark runs 10x faster than Hadoop on disk (Spark, n.d.).

Gu and Li (2013) recommend that if speed to the solution is not an issue but memory is, then Spark shouldn’t be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized.

References

  • Gu, L., & Li, H. (2013). Memory or time: Performance evaluation for iterative operation on Hadoop and Spark. In High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on (pp. 721-727). IEEE.

Using R and Spark for health care

A case study on the use of R in the healthcare field by Pereira and Noronha (2016):

R and RStudio have been used to look at patient health and disease records located in Electronic Medical Records (EMRs) for fraud detection.  Anomaly detection revolves around a mapping code that filters data based on geo-locations; second, a reducer code that aggregates the data based on extreme values of cost claims per disease and calculates the difference; and finally, code that flags the data that meets a 60% cost-fraud threshold. It was found that as the geo-location resolution increased, the number of anomalies detected increased.
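
The study itself used R, but the three steps can be approximated in Python/pandas as below; the column names, data, and exact "difference" formula are assumptions for illustration, not the authors' code.

```python
import pandas as pd

claims = pd.DataFrame({
    "geo": ["north", "north", "north", "north", "south"],
    "disease": ["diabetes", "diabetes", "diabetes", "asthma", "diabetes"],
    "claim_cost": [120.0, 130.0, 900.0, 200.0, 150.0],
})

# "Mapper" step: filter claims to one geo-location of interest.
region = claims[claims["geo"] == "north"]

# "Reducer" step: per disease, take the extreme (max) and typical (median)
# claim cost and compute the relative difference.
agg = region.groupby("disease")["claim_cost"].agg(["max", "median"])
agg["rel_diff"] = (agg["max"] - agg["median"]) / agg["max"]

# Threshold step: flag diseases whose extreme cost exceeds the median by 60%+.
flagged = agg[agg["rel_diff"] >= 0.60]
print(flagged)
```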

R and RStudio have also been used with big data analytics to predict diabetes, based on symptoms, from the Health Information System (HIS), which houses patient information. For predicting diabetes, the authors used a classification algorithm (a decision tree) with a 70%-30% training-test dataset split, and then plotted the false positive rate versus the true positive rate (an ROC curve).  This plot showed skill in predicting diabetes.
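
A Python analogue of that setup (the study used R), using scikit-learn and synthetic data purely for illustration: a decision tree, a 70%-30% split, and a false-positive-rate vs. true-positive-rate (ROC) plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                      # stand-in symptom features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)          # the 70%-30% split

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, probs):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")           # no-skill baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```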

A case study on the use of Spark in the healthcare field by Pita et al. (2015):

Data quality in healthcare data is poor, particularly in the Brazilian Public Health System.  Spark was used to help in data processing to improve quality through deterministic and probabilistic record linkage across multiple databases.  Record linkage is a technique that uses common attributes across multiple databases to identify a 1-to-1 match.  Spark workflows were created to do record linkage by (1) analyzing all data in each database for common attributes with high probabilities of linkage; (2) preprocessing the data, where it is transformed, anonymized, and cleaned into a single format so that all the attributes can be compared to each other for a 1-to-1 match; (3) record linkage based on deterministic and probabilistic algorithms; and (4) statistical analysis to evaluate accuracy. Over 397M comparisons were made in 12 hours.  They concluded that accuracy depends on the size of the data: the bigger the data, the more accurate the record linkage.

References

  • Pereira, J. P., & Noronha, V. (2016). Anomalies Detection and Disease Prediction in Healthcare Systems using Big Data Analytics. Retrieved from http://www.aijet.in/v3/1608001.pdf
  • Pita, R., Pinto, C., Melo, P., Silva, M., Barreto, M., & Rasella, D. (2015). A Spark-based Workflow for Probabilistic Record Linkage of Healthcare Data. In EDBT/ICDT Workshops (pp. 17-26).