Modeling and analyzing big data in health care

Let’s apply the building-blocks approach for healthcare systems to a concrete problem: monitoring patient vital signs, similar to Chen et al. (2010).

  • The purpose the new data will serve: Most hospitals measure the following vitals for triaging patients: blood pressure and flow, core temperature, ECG, and carbon dioxide concentration (Chen et al. 2010).
    1. Functions the system should serve: gathering, storing, preprocessing, and processing the data. Chen et al. (2010) suggested that the system should also perform consistency checks and aggregate and integrate the data.
    2. Which parts of the data are needed to serve these functions: all
  • Tools needed: a distributed database system, a wireless network, parallel processing, a graphical user interface for healthcare providers to understand the data, servers, subject matter experts to set upper and lower limits, and machine-learning classification algorithms
  • Top-level plan: The data will be collected from the vital-sign sensors, streaming at various time intervals into a central hub that sends the data in packets over a wireless network to a server room. The server can divide the data across various distributed systems accordingly. A parallel-processing program will access the data per patient per window of time to perform the needed functions and classifications and provide triage warnings if the vitals hit any of the predetermined key performance indicators that require intervention, as set by the subject matter experts. If a key performance indicator is triggered, the data is sent to the healthcare provider’s device via a graphical user interface (a minimal sketch of this threshold check follows the list below).
  • Pivoting is bound to happen; for example:
    1. The graphical user interface may not be healthcare-provider friendly
    2. Some sensors may need to raise a warning when they begin to fail
    3. Subject matter experts may need to readjust the classification algorithm for better triaging
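
Below is a minimal sketch of the threshold-based triage check described in the top-level plan. The vital-sign limits and field names are illustrative placeholders only; in a real system the subject matter experts would set the upper and lower limits, and the alert would be pushed to the provider’s device.

```python
# Minimal sketch of a threshold-based triage check on one packet of vitals.
# The limits and field names below are illustrative placeholders, not clinical values.

VITAL_LIMITS = {
    "heart_rate_bpm": (40, 130),
    "systolic_bp_mmHg": (90, 180),
    "core_temp_c": (35.0, 39.5),
    "etco2_mmHg": (30, 50),
}

def check_vitals(reading: dict) -> list[str]:
    """Return triage warnings for any vital outside its predetermined limits."""
    warnings = []
    for vital, (low, high) in VITAL_LIMITS.items():
        value = reading.get(vital)
        if value is None:
            continue  # sensor dropout; could also raise a sensor-health warning
        if value < low or value > high:
            warnings.append(f"{vital}={value} outside [{low}, {high}]")
    return warnings

# Example: one packet of sensor data for a patient
reading = {"heart_rate_bpm": 142, "systolic_bp_mmHg": 118, "core_temp_c": 38.1}
alerts = check_vitals(reading)
if alerts:
    print("Notify provider:", alerts)  # would be displayed in the provider's GUI
```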

Thus, the above problem, as discussed by Chen et al. (2010), can be broken apart into its building-block components as addressed in Burkle et al. (2001). These components help create a system to analyze this set of big healthcare data through analytics, via distributed systems and parallel processing, as addressed by Services (2015) and Mirtaheri et al. (2008).

Draw on a large body of data to form a prediction or variable comparisons within the premise of big data.

Fayyad, Piatetsky-Shapiro, and Smyth (1996) stated that data analytics can be divided into descriptive and predictive analytics. Vardarlier and Silahtaroglu (2016) agreed with Fayyad et al.’s (1996) division but added prescriptive analytics. Which division one chooses should depend on the goal, for example diagnosing illnesses with big data analytics. Raghupathi and Raghupathi (2014) listed some common examples of big data in the healthcare field: personal medical records, radiology images, clinical trial data, 3D imaging, human genomic data, population genomic data, biometric sensor readings, x-ray films, scripts, and traditional paper files. One use of big data analytics is understanding the 23 pairs of chromosomes that are the building blocks of a person. Healthcare professionals are using the big data generated from our genomic code to help predict which illnesses a person could get (Services, 2013). For that goal, predictive analytics tools and algorithms like decision trees would be of use. Predictive analytics and machine learning can also be applied to diagnosing an eye disease like diabetic retinopathy from an image, using classification algorithms (Goldbloom, 2016).
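
As a small illustration of the predictive-analytics idea above, the sketch below trains a decision tree on synthetic data; the features and labels are fabricated purely for illustration and are not real genomic or retinal data.

```python
# Minimal predictive-analytics sketch: a decision tree on synthetic patient data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                    # 500 patients, 5 numeric risk factors
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)    # synthetic "will develop illness" label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```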

Examine the unique domain of health informatics and explain how big data analytics contributes to the detection of fraud and the diagnosis of illness.

A process mining framework for the detection of healthcare fraud and abuse, case study (Yang & Hwang, 2006): Fraud exists in the processing of health insurance claims because the multiple channels of communication involved (service providers, insurance agencies, and patients) create more opportunities to commit it. Any of these three parties can commit fraud, and the highest risk arises when service providers perform unnecessary procedures, putting patients at risk. This case study provided a framework for conducting automated fraud detection. The study collected data on 2,543 gynecology patients from a hospital over 2001-2002, filtered out noisy data, identified activities based on medical expertise, and identified fraud in about 906 cases.
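
The sketch below is not Yang and Hwang’s process-mining framework; it is a much simpler, illustrative example of automated screening in the same spirit: flag providers whose billed procedures per patient are statistical outliers. The claims data, column names, and threshold are all made up.

```python
# Illustrative fraud-screening sketch (NOT the process-mining framework from the study):
# flag providers whose average procedures billed per patient are outliers.
import pandas as pd

claims = pd.DataFrame({
    "provider": ["A", "A", "B", "B", "B", "C", "C"],
    "patient":  ["p1", "p2", "p3", "p4", "p5", "p6", "p7"],
    "procedures_billed": [2, 3, 2, 3, 2, 9, 11],
})

per_provider = claims.groupby("provider")["procedures_billed"].mean()
z_scores = (per_provider - per_provider.mean()) / per_provider.std()
flagged = z_scores[z_scores.abs() > 1.0].index.tolist()  # threshold chosen for illustration
print("Providers to review for possibly unnecessary procedures:", flagged)
```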

Summarize one case study in detail related to big data analytics as it relates to organizational processes and topical research.

A case study on the use of Spark in the healthcare field by Pita et al. (2015): Data quality in healthcare data is poor, in particular that of the Brazilian Public Health System. Spark was used to help in data processing to improve quality through deterministic and probabilistic record linkage across multiple databases. Record linkage is a technique that uses common attributes across multiple databases to identify a one-to-one match. Spark workflows were created to do record linkage by (1) analyzing all data in each database and the common attributes with high probabilities of linkage; (2) pre-processing, where the data is transformed, anonymized, and cleaned into a single format so that all the attributes can be compared for a one-to-one match; (3) record linkage based on deterministic and probabilistic algorithms; and (4) statistical analysis to evaluate the accuracy. Over 397M comparisons were made in 12 hours. They concluded that accuracy depends on the size of the data: the bigger the data, the more accurate the record linkage.
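
The sketch below shows only the deterministic step of record linkage in PySpark, not Pita et al.’s actual workflow; the attribute names (name, birth_date, mother_name) and the sample rows are assumed for illustration.

```python
# Minimal sketch of deterministic record linkage in PySpark (illustrative only).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("record-linkage-sketch").getOrCreate()

db_a = spark.createDataFrame(
    [("MARIA SILVA", "1980-05-01", "ANA SILVA"),
     ("JOAO SOUZA", "1975-11-23", "RITA SOUZA")],
    ["name", "birth_date", "mother_name"])
db_b = spark.createDataFrame(
    [("maria silva", "1980-05-01", "ana silva"),
     ("carlos lima", "1990-02-14", "vera lima")],
    ["name", "birth_date", "mother_name"])

# Pre-processing: transform the common attributes into a single format.
def normalize(df):
    return df.select(*[F.upper(F.trim(F.col(c))).alias(c) for c in df.columns])

# Deterministic linkage: exact match on all common attributes (a one-to-one match).
links = normalize(db_a).join(normalize(db_b),
                             on=["name", "birth_date", "mother_name"],
                             how="inner")
links.show()
```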

References

  • Burkle, T., Hain, T., Hossain, H., Dudeck, J., & Domann, E. (2001). Bioinformatics in medical practice: What is necessary for a hospital? Studies in Health Technology and Informatics, (2), 951-955.
  • Chen, B., Varkey, J. P., Pompili, D., Li, J. K., & Marsic, I. (2010). Patient vital signs monitoring using wireless body area networks. In Bioengineering Conference, Proceedings of the 2010 IEEE 36th Annual Northeast (pp. 1-2). IEEE.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37. Retrieved from: http://www.aaai.org/ojs/index.php/aimagazine/article/download/1230/1131/
  • Goldbloom, A. (2016). The jobs we’ll lose to machines – and the ones we won’t. TED Talks. Retrieved from https://www.youtube.com/watch?v=gWmRkYsLzB4
  • Mirtaheri, S. L., Khaneghah, E. M., Sharifi, M., & Azgomi, M. A. (2008). The influence of efficient message passing mechanisms on high performance distributed scientific computing. In Parallel and Distributed Processing with Applications, 2008. ISPA’08. International Symposium on (pp. 663-668). IEEE.
  • Pita, R., Pinto, C., Melo, P., Silva, M., Barreto, M., & Rasella, D. (2015). A Spark-based Workflow for Probabilistic Record Linkage of Healthcare Data. In EDBT/ICDT Workshops (pp. 17-26).
  • Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: Promise and potential. Health Information Science and Systems, 2(3). Retrieved from http://hissjournal.biomedcentral.com/articles/10.1186/2047-2501-2-3
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].
  • Vardarlier, P., & Silahtaroglu, G. (2016). Gossip management at universities using big data warehouse model integrated with a decision support system. International Journal of Research in Business and Social Science, 5(1), 1–14. doi: http://doi.org/10.1108/17506200710779521
  • Yang, W. S., & Hwang, S. Y. (2006). A process-mining framework for the detection of healthcare fraud and abuse. Expert Systems with Applications, 31(1), 56-68.

Higgs Boson: Case study on a famous prediction that came true

Definitions:

  • Forecasting (business context): relies on empirical relationships created from observations, theory, and consistent patterns, which can carry assumptions and limitations that are either known or unknown, to give the future state of a certain event (Seeman, 2002). For instance, forecasting income from a simple income statement could help provide key data on how a company is operating, but the assumptions and limitations of this method can wipe out a business (Garrett, 2013).
  • Predictions (business context): a more general term; a statement about the future state of a certain event that can be based on empirical relationships, strategic foresight, or even scenario planning (Seeman, 2002; Ogilvy, 2015).
  • Scenarios: alternate futures that change with time as supportive and challenging forces unfold, usually containing enough data, such as the likelihood of success or failure, the story of the landscape, innovative opportunities, challenges to be faced, signals, etc. (Ogilvy, 2015; Wade, 2012).

Case Study: A famous prediction that came true

The Higgs boson helps explain the origin of mass in the universe (World Science Festival, 2013). Mass is the resistance of an object to being pushed and pulled by other objects or forces in the universe, and an object’s mass comes from its constituent particles (Greene, 2013; PBS Space-Time, 2015; World Science Festival, 2013). The question is where the mass of these particles, which give an object its mass, comes from. The universe is filled with an invisible Higgs field in which these particles are swimming; they experience a form of resistance when they speed up or slow down, and this resistance in the Higgs field is the mass of the particles (Greene, 2013; World Science Festival, 2013). Certain particles have mass (electrons) and others don’t (photons), because only certain particles interact with the invisible Higgs field (PBS Space-Time, 2015). Scientists use the Large Hadron Collider to speed up particles so that when they collide in just the right way (a roughly 1-in-1,000,000,000 chance), the collision can clump a bit of the Higgs field into a Higgs particle that lasts for about 10⁻²² seconds (Greene, 2013; PBS Space-Time, 2015; World Science Festival, 2013). Therefore, finding the Higgs particle is a direct link to proving the existence of the Higgs field (PBS Space-Time, 2015).

The importance of proving this prediction correct (World Science Festival, 2013):

  • Understanding where mass comes from
  • The Higgs particle is a new form of particle that doesn’t spin
  • Shows that mathematics leads the way to discovering something about our reality

This was a prediction that waited almost 50 years to be confirmed through observation; it has its roots in the scientific and mathematical foundations of quantum physics and was put forward by Higgs in 1964 (Greene, 2013; PBS Space-Time, 2015; World Science Festival, 2013).

Supporting Forces for the prediction:

  • Technological: the development, over the course of roughly 50 years, of technology capable of testing the mathematics helped facilitate the confirmation of this prediction (Greene, 2013; World Science Festival, 2013). The actual technology used is the ATLAS detector attached to the Large Hadron Collider (Greene, 2013).
  • Financial: through international collaboration among thousands of scientists and over a dozen countries, the financial capital was amassed to build the roughly $10 billion Large Hadron Collider.

References:

Business Intelligence: Data Mining

Data mining is just a subset of the knowledge discovery process (or the concept flow of Business Intelligence), where data mining provides the algorithms/math that aid in developing actionable, data-driven results (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). It should be noted that success has as much to do with the events that lead up to the main event as with the main event itself. To incorporate data mining processes into Business Intelligence, one must understand the business task/question behind the problem, properly process all the required data, analyze the data, evaluate and validate the data while analyzing it, apply the results, and finally learn from the experience (Ahlemeyer-Stubbe & Coleman, 2014). Connolly and Begg (2014) stated that there are four operations of data mining: predictive modeling, database segmentation, link analysis, and deviation detection. Fayyad et al. (1996) classify data mining operations by their outcomes: predictive and descriptive.

It is crucial to understand the business task/question behind the problem you are trying to solve, because some types of business applications are associated with particular operations; for example, marketing strategies use database segmentation (Connolly & Begg, 2014). However, any of the data mining operations can be implemented for any business application, and many business applications can use multiple operations. Customer profiling, for instance, can use database segmentation first and then predictive modeling (Connolly & Begg, 2014); a minimal sketch of this two-step combination appears below. Thinking outside the box about which combination of operations and algorithms to use, rather than defaulting to previously used ones, can generate even better results (Minelli, Chambers, & Dhiraj, 2013).
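
The sketch below illustrates the two-step combination just described: database segmentation (k-means clustering) followed by predictive modeling. The customer data and the target variable are synthetic and chosen only for illustration.

```python
# Minimal sketch: database segmentation (k-means) followed by predictive modeling.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
customers = rng.normal(size=(300, 4))     # e.g., spend, frequency, recency, tenure
bought_again = (customers[:, 0] + customers[:, 1] > 0).astype(int)  # synthetic target

# Step 1: database segmentation - group customers into profiles.
segments = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(customers)

# Step 2: predictive modeling - use the discovered segment as an extra feature.
features = np.column_stack([customers, segments])
model = LogisticRegression().fit(features, bought_again)
print("Training accuracy:", model.score(features, bought_again))
```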

A consolidated list (Ahlemeyer-Stubbe & Coleman, 2014; Berson, Smith, & Thearling, 1999; Connolly & Begg, 2014; Fayyad et al., 1996) of the different types of data mining operations, algorithms, and purposes is given below, followed by a short illustrative code sketch.

  • Prediction – “What could happen?”
    • Classification – data is classified into different predefined classes
      • C4.5
      • Chi-Square Automatic Interaction Detection (CHAID)
      • Support Vector Machines
      • Decision Trees
      • Neural Networks (also called Neural Nets)
      • Naïve Bayes
      • Classification and Regression Trees (CART)
      • Bayesian Network
      • Rough Set Theory
      • AdaBoost
    • Regression (Value Prediction) – data is mapped to a prediction formula
      • Linear Regression
      • Logistic Regression
      • Nonlinear Regression
      • Multiple linear regression
      • Discriminant Analysis
      • Log-Linear Regression
      • Poisson Regression
    • Anomaly Detection (Deviation Detection) – identifies significant changes in the data
      • Statistics (outliers)
  • Descriptive – “What has happened?”
    • Clustering (database segmentation) – identifies a set of categories to describe the data
      • Nearest Neighbor
      • K-Nearest Neighbor
      • Expectation-Maximization (EM)
      • K-means
      • Principal Component Analysis
      • Kolmogorov-Smirnov Test
      • Kohonen Networks
      • Self-Organizing Maps
      • Quartile Range Test
      • Polar Ordination
      • Hierarchical Analysis
    • Association Rule Learning (Link Analysis) – builds a model that describes the data dependencies
      • Apriori
      • Sequential Pattern Analysis
      • Similar Time Sequence
      • PageRank
    • Summarization – smaller description of the data
      • Basic probability
      • Histograms
      • Summary Statistics (max, min, mean, median, mode, variance, ANOVA)
  • Prescriptive – “What should we do?” (an extension of predictive analytics)
    • Optimization
      • Decision Analysis
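
As one concrete illustration from the list above, here is a minimal, self-contained sketch of association rule learning (link analysis). It only enumerates frequent item pairs rather than running the full Apriori algorithm, and the transactions and thresholds are made up for illustration.

```python
# Minimal association-rule sketch: support and confidence over made-up transactions.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

items = sorted(set().union(*transactions))
min_support, min_confidence = 0.4, 0.6

# Frequent pairs only (a small slice of what full Apriori would enumerate).
for a, b in combinations(items, 2):
    pair_support = support({a, b})
    if pair_support >= min_support:
        confidence = pair_support / support({a})
        if confidence >= min_confidence:
            print(f"{a} -> {b}: support={pair_support:.2f}, confidence={confidence:.2f}")
```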

Finally, Ahlemeyer-Stubbe and Coleman (2014) stated that even though there is a ton of versatile data mining software available that can perform any of the above operations and algorithms, good data mining software should be deployable across different environments and include tools for data preparation and transformation.

References