Modeling and analyzing big data in health care

Let’s consider applying the building block system to a healthcare problem: monitoring patient vital signs, similar to Chen et al. (2010).

  • The purpose that the new data will serve: Most hospitals measure the following vitals for triaging patients: blood pressure and flow, core temperature, ECG, and carbon dioxide concentration (Chen et al., 2010).
    1. Functions it should serve: gathering, storing, preprocessing, and processing the data. Chen et al. (2010) suggested that the system should also perform consistency checks and aggregate and integrate the data.
    2. Which parts of the data are needed to serve these functions: all of them.
  • Tools needed: a distributed database system, a wireless network, parallel processing, a graphical user interface for healthcare providers to understand the data, servers, subject matter experts to set upper and lower limits, and classification algorithms that use machine learning.
  • Top-level plan: The data will be collected from the vital sign sensors, streaming at various time intervals into a central hub that sends the data in packets over a wireless network to a server room. The server can divide the data across the various distributed systems accordingly. A parallel processing program will access the data per patient per window of time to conduct the needed functions and classifications and provide triage warnings if the vitals hit any of the predetermined key performance indicators that require intervention by the subject matter experts.  If a key performance indicator is triggered, the data is sent to the healthcare provider’s device via a graphical user interface.
  • Pivoting is bound to happen; for example:
    1. Graphical user interface is not healthcare provider friendly
    2. Some of the sensors need to be able to throw a warning if they are going bad
    3. Subject matter experts may need to readjust the classification algorithm for better triaging

Thus, the above problem, as discussed by Chen et al. (2010), can be broken apart into its building block components as addressed in Burkle et al. (2001).  These components help to create a system to analyze this set of big health care data through analytics, via distributed systems and parallel processing, as addressed by Services (2015) and Mirtaheri et al. (2008).

Draw on a large body of data to form a prediction or variable comparisons within the premise of big data.

Fayyad, Piatetsky-Shapiro, and Smyth (1996) divided data analytics into descriptive and predictive analytics. Vardarlier and Silahtaroglu (2016) agreed with Fayyad et al.’s (1996) division but added prescriptive analytics.  Which theory/division one chooses should depend on the goal of diagnosing illnesses with big data analytics.  Raghupathi and Raghupathi (2014) stated some common examples of big data in the healthcare field: personal medical records, radiology images, clinical trial data, 3D imaging, human genomic data, population genomic data, biometric sensor readings, x-ray films, scripts, and traditional paper files.  Big data analytics can be used, for example, to understand the 23 pairs of chromosomes that are the building blocks of a person. Healthcare professionals are using the big data generated from our genomic code to help predict which illnesses a person could get (Services, 2013). Thus, using predictive analytics tools and algorithms like decision trees would be of some use.  Another use of predictive analytics and machine learning is diagnosing an eye disease like diabetic retinopathy from an image by using classification algorithms (Goldbloom, 2016).

Examine the unique domain of health informatics and explain how big data analytics contributes to the detection of fraud and the diagnosis of illness.

A process mining framework for the detection of healthcare fraud and abuse, a case study (Yang & Hwang, 2006): Fraud exists in processing health insurance claims because there are many channels of communication, and thus more opportunities to commit it: service providers, insurance agencies, and patients. Any one of these three parties can commit fraud, and the highest risk occurs where service providers perform unnecessary procedures, putting patients at risk. This case study provided a framework for conducting automated fraud detection. The study collected data on 2,543 gynecology patients from 2001-2002 from a hospital, filtered out noisy data, identified activities based on medical expertise, and identified fraud in about 906 of them.

Summarize one case study in detail related to big data analytics as it relates to organizational processes and topical research.

A case study on the use of Spark in the healthcare field by Pita et al. (2015): Data quality in healthcare data is poor, particularly in the Brazilian Public Health System.  Spark was used to help in data processing to improve quality through deterministic and probabilistic record linkage across multiple databases.  Record linkage is a technique that uses common attributes across multiple databases to identify a 1-to-1 match.  Spark workflows were created to do record linkage by (1) analyzing all data in each database and the common attributes with high probabilities of linkage; (2) pre-processing, where data is transformed, anonymized, and cleaned into a single format so that all the attributes can be compared to each other for a 1-to-1 match; (3) record linkage based on deterministic and probabilistic algorithms; and (4) statistical analysis to evaluate the accuracy. Over 397M comparisons were made in 12 hours.  They concluded that accuracy depends on the size of the data: the bigger the data, the greater the accuracy of the record linkage.
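
To make the linkage step concrete, below is a minimal sketch of the deterministic part of steps (2)-(3) using PySpark; it is not Pita et al.’s (2015) actual workflow, and the column names and sample rows are illustrative assumptions.

```python
# Minimal sketch of deterministic record linkage with Spark (not Pita et al.'s
# actual workflow). Column names and sample rows are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("record-linkage-sketch").getOrCreate()

db_a = spark.createDataFrame(
    [("Maria  Silva", "1980-01-02", "Salvador"),
     ("Joao Souza", "1975-07-30", "Recife")],
    ["name", "birth_date", "municipality"])
db_b = spark.createDataFrame(
    [("MARIA SILVA", "1980-01-02", "salvador"),
     ("Ana Lima", "1990-03-15", "Recife")],
    ["name", "birth_date", "municipality"])

def preprocess(df):
    """Pre-processing step: trim, collapse whitespace, and lowercase the linkage
    keys so attributes from different databases share a single format."""
    return (df
            .withColumn("name_key",
                        F.lower(F.regexp_replace(F.trim(F.col("name")), r"\s+", " ")))
            .withColumn("muni_key", F.lower(F.trim(F.col("municipality")))))

# Deterministic linkage: a 1-to-1 match requires all common attributes to agree.
links = (preprocess(db_a).alias("a")
         .join(preprocess(db_b).alias("b"),
               on=["name_key", "birth_date", "muni_key"],
               how="inner"))
links.select(F.col("a.name").alias("name_a"),
             F.col("b.name").alias("name_b"),
             "birth_date").show()
```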

References

  • Burkle, T., Hain, T., Hossain, H., Dudeck, J., & Domann, E. (2001). Bioinformatics in medical practice: What is necessary for a hospital?. Studies in health technology and informatics, (2), 951-955.
  • Chen, B., Varkey, J. P., Pompili, D., Li, J. K., & Marsic, I. (2010). Patient vital signs monitoring using wireless body area networks. In Bioengineering Conference, Proceedings of the 2010 IEEE 36th Annual Northeast (pp. 1-2). IEEE.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37. Retrieved from: http://www.aaai.org/ojs/index.php/aimagazine/article/download/1230/1131/
  • Goldbloom, A. (2016). The jobs we’ll lose to machines – and the ones we won’t. TED Talks. Retrieved from https://www.youtube.com/watch?v=gWmRkYsLzB4
  • Mirtaheri, S. L., Khaneghah, E. M., Sharifi, M., & Azgomi, M. A. (2008). The influence of efficient message passing mechanisms on high performance distributed scientific computing. In Parallel and Distributed Processing with Applications, 2008. ISPA’08. International Symposium on (pp. 663-668). IEEE.
  • Pita, R., Pinto, C., Melo, P., Silva, M., Barreto, M., & Rasella, D. (2015). A Spark-based Workflow for Probabilistic Record Linkage of Healthcare Data. In EDBT/ICDT Workshops (pp. 17-26).
  • Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(3). Retrieved from http://hissjournal.biomedcentral.com/articles/10.1186/2047-2501-2-3
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].
  • Vardarlier, P., & Silahtaroglu, G. (2016). Gossip management at universities using big data warehouse model integrated with a decision support system. International Journal of Research in Business and Social Science, 5(1), 1–14. doi: http://doi.org/10.1108/17506200710779521
  • Yang, W. S., & Hwang, S. Y. (2006). A process-mining framework for the detection of healthcare fraud and abuse. Expert Systems with Applications, 31(1), 56-68.

International health care data laws

The International Health Regulations (IHR) have governed the way health is dealt with internationally since 1969 and were updated in 2005 (Georgetown Law, n.d.; World Health Organization [WHO], 2005). Article 45 of the IHR deals with the treatment of personal data (WHO, 2005):

  • Personally identifiable data and information that has been collected or received shall be kept confidential and processed anonymously.
  • Data can be disclosed for purposes that are vital for public health. However, the data transferred must be adequate, accurate, relevant, up-to-date, and not excessive, and it must be processed fairly and lawfully.
  • Bad or incompatible data must be either corrected or deleted.
  • Personal data must not be kept any longer than necessary.
  • WHO will provide a patient’s data to that patient upon request in a timely fashion and allow patients to correct their data.

The European Union has the Directive on Data Protection of 1998 (DDP), and Canada has the Personal Information Protection and Electronic Documents Act of 2000 (PIPEDA); both are similar to the U.S. HIPAA regulations set forth by the U.S. Department of Health and Human Services (Guiliano, 2014). Eventually, in 2012, the EU proposed what became the Data Protection Regulation (DPR) of 2016 (Hordern, 2015; Justice, n.d.).

Under the EU’s DDP (Guiliano, 2014):

  • Transferring data to any non-EU entity that doesn’t meet EU data protection standards is outlawed.
  • The government must give consent before sensitive data is gathered, and only for certain situations.
  • Only data that is needed at the time and has an explicit and reasonable purpose may be collected.
  • Patients should be allowed to correct errors in personal data, and if the data is outdated or useless, it must be discarded.
  • People with access to this data must have been properly trained.

Under the EU’s DPR (Hordern, 2015; Justice, n.d.):

  • People can allow their data to be used for future scientific research where the purpose is still unknown, as long as the research is conducted by “recognized ethical …”
  • Processing already-collected data for scientific studies is legal without the need to obtain additional consent
  • Health data may be used without the consent of the individual for public health purposes
  • Health data cannot be used by employers, insurance companies, and banks
  • If data is being or will be used for future research, it can be retained longer than current regulations allow

Under Canada’s PIPEDA (Guiliano, 2014):

  • Patients should know the business justification for using their personal and medical data.
  • Patients can review their data and have errors corrected.
  • Organizations must request from their patients the right to use their data for each situation, except in criminal cases or emergencies.
  • Organizations cannot collect patient and medical data that is not needed for the current situation unless they ask their patients for permission and tell them how it will be used and who will use it.

Other international laws and regulations regarding big data from Australia, Brazil, China, France, Germany, India, Israel, Japan, South Africa, and the United Kingdom are summarized in the International and Comparative Study on Big Data (van der Sloot & van Schendel, 2016).  When it comes to transferring U.S.-collected and processed data internationally, the U.S. holds all U.S.-regulated entities liable to all U.S. data regulations (Jolly, 2016).  Some states in the U.S. further restrict the export of personal data to international entities (Jolly, 2016).  Thus, any data exported to or imported from other countries must comply with the regulations of the country (or state) of origin and those of the country (or state) to which it is exported.

In the United Kingdom, a legal case on health care data was presented and ruled upon.  The case dealt with whether de-identified data on primary care physicians’ prescribing habits breached confidentiality laws because of the lack of consent (Knoppers, 2000).  The consent had to cover both commercial and public-interest purposes; the lack of both types of consent meant that the data was misused.  In the Supreme Court of Canada, consent was not collected properly, which violated the expectation of privacy between the patients and a private healthcare provider (Knoppers, 2000).  All of these laws and regulations, across international and domestic views of data usage, consent, and expectation of privacy, are trying to protect people from the misuse of healthcare data.

References

Data auditing for health care

Data auditing is the assessment of the quality and fitness for purpose of data via key metrics and properties of the data (Techopedia, n.d.).  Data auditing processes and procedures are the business’ way of assessing and controlling their data quality (Eichhorn, 2014). Conducting data audits allows a business to fully realize the value of their data and provides higher fidelity to their data analytics results (Jones, Ross, Ruusalepp, & Dobreva, 2009). Data auditing is needed because the data could contain human error or be subject to IT data compliance regulations like HIPAA, SOX, etc. (Eichhorn, 2014). Health care data audits can help detect unauthorized access to confidential patient data, reduce the risk of unauthorized access to data, and help detect defects, threats, and intrusion attempts (Walsh & Miaolis, 2014).

Data auditors can perform a data audit by considering the following aspects of a dataset (Jones et al., 2009):

  • Data by origin: observation, computed, experiments
  • Data by data type: text, images, audio, video, databases, etc.
  • Data by characteristics: value, condition, location

A condensed data audit process for research was proposed by Shamoo (1989); a sketch of the final comparison step follows the list:

  • Select published, claimed, or random data from a figure, table, or data source
  • Evaluate if all the formulas and equations are correct and used correctly
  • Convert all the data into numerical values
  • Re-derive the original data using the formulas and equations
  • Segregate the various parameters and values to identify the sources of the original data
  • If the re-derived data matches the data selected in the first step, then the audit turned up no quality issues; if not, a cause analysis needs to be conducted to understand where the data quality faulted
  • Formulate a report based on the results of the audit
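
A toy sketch of the comparison step, re-deriving a claimed summary value from the source numbers and flagging discrepancies; the data, formula, and tolerance are hypothetical.

```python
# Toy sketch of Shamoo's (1989) comparison step: re-derive a claimed summary
# statistic from the underlying numbers and flag discrepancies for cause analysis.
# The data, formula (a simple mean), and tolerance are hypothetical.
claimed_mean = 98.7                              # value as published in a table or figure
raw_values = [98.1, 99.0, 98.6, 98.9, 98.9]      # source data converted to numerical values

rederived_mean = sum(raw_values) / len(raw_values)
tolerance = 0.05                                 # allowable rounding difference

if abs(rederived_mean - claimed_mean) <= tolerance:
    print(f"OK: re-derived {rederived_mean:.2f} matches claimed {claimed_mean}")
else:
    print(f"FLAG: re-derived {rederived_mean:.2f} differs from claimed "
          f"{claimed_mean}; conduct a cause analysis")
```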

Jones et al. (2009) provided a four-stage process with a detailed swim lane diagram.


For some organizations, creating a log file for all data transactions can aid in improving data integrity (Eichhorn, 2014).  The creation of the log file must be scalable and separate from the system under audit (Eichhorn, 2015).  Log files can be created for one system or many. Meanwhile, all the log files should be centralized in one location, and the log data must be abstracted into a common and universal format for easy searching (Eichhorn, 2015). Regardless of the technique, HIPAA sections 164.308-164.312 discuss information system activity reviews and audits in the health care system (Walsh & Miaolis, 2014).
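
As a rough sketch of what abstracting transactions into a common, searchable log format might look like (the JSON-lines schema, field names, and file location are assumptions, not a prescribed HIPAA format):

```python
# Rough sketch of a centralized audit log in a common JSON-lines format,
# kept separate from the system under audit. The schema and paths are
# illustrative assumptions, not a HIPAA-mandated format.
import json
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("audit_logs/data_transactions.jsonl")  # assumed central log location

def log_transaction(user_id: str, action: str, record_id: str, system: str) -> None:
    """Append one data-transaction event in a common, searchable format."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "system": system,        # which source system produced the event
        "user_id": user_id,      # who touched the data
        "action": action,        # e.g. "read", "update", "delete"
        "record_id": record_id,  # which record was touched
    }
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    with AUDIT_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# Example: record that a clinician viewed a lab result in an EHR system
log_transaction("clinician-042", "read", "patient-123/lab-9", system="ehr")
```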

Walsh and Miaolis (2014) identified key activities for establishing a data auditing protocol in a healthcare system under HIPAA:

  • Determine the activities that will be tracked or audited: creating a process flow or swim lane diagram like the one above, involve key data stakeholders, and evaluate which audit tools will be used.
  • Select the tools that will be deployed for auditing and system activity reviews: one that can detect unauthorized access to data, ability to drill down into the data, collect audit logs, and present the findings in a report or dashboard.
  • Develop and employ the information system activity review/audit policy: determine the frequency of the audits and what events would trigger other audits.
  • Develop appropriate standard operating procedures: for presenting the results, dealing with the fallout of what the audit reveals, and conducting efficient audit follow-up.

References

Data privacy and governance in health care

Lawyers define privacy as (Richard & King, 2014):

  1. Invasions into protected spaces, relationships, or decisions
  2. Collection of information
  3. Use of information
  4. Disclosure of information

Given the body of knowledge of technology and data analytics, data collection and analysis may give off the appearance of a “Big Brother” state (Li, 2010). The Privacy Act of 1974 prevents the U.S. government from collecting its citizens’ data and storing it in databases, but it does not extend to companies (Brookshear & Brylow, 2014).  Confidentiality does exist for health records via the Health Insurance Portability and Accountability Act (HIPAA) of 1996, and for financial records through the Fair Credit Act, which also allows people to correct erroneous information in their credit records (Richard & King, 2014). The Electronic Communications Privacy Act of 1986 limits wiretapping of communications by the government, but it does not extend to companies (Brookshear & Brylow, 2014). The Video Privacy Protection Act of 1988 protects people’s videotape rental records (Richard & King, 2014). Finally, in 2009 the HITECH Act strengthened the enforcement of HIPAA (Pallardy, 2015). Some people see the risk of the loss of privacy via technology and data analytics, while others embrace it due to the benefits they perceive they would gain from disclosing this information (Wade, 2012).  All of these privacy protection laws are outdated and do not extend to the rampant use, collection, and mining of data made possible by the technology of the 21st century.

However, Richard and King (2014) describe that a binary notion of data privacy does not exist.  Data is never completely private/confidential nor completely divulged; it lies between these two extremes.  Privacy laws should focus on the flow of personal information, with an emphasis on a type of privacy called confidentiality, where data is agreed to flow to a certain individual or group of individuals (Richard & King, 2014).  Thus, from a future legal perspective, data privacy should focus on creating rules for how data should flow and be used, and on the concept of confidentiality between people and groups.  Right now, the only thing preventing companies from abusing personal privacy is the negative public outcry that would affect their bottom line (Brookshear & Brylow, 2014).

Healthcare Industry

In the healthcare industry, patients and healthcare providers are concerned about data breaches in which personal confidential information could be accessed; if a breach did occur, 54% of patients would be willing to switch from their current provider (Pallardy, 2015).

In healthcare, if data gets migrated into a public cloud rather than a community cloud specific to healthcare, the data privacy enters a legal limbo.  According to Brookshear and Brylow (2014), cloud computing data privacy and security become an issue because, in a public cloud, healthcare organizations do not own the infrastructure that houses the data.  HIPAA regulations provide patient privacy standards that the healthcare industry must follow.  HIPAA covers a patient’s right to privacy by requiring permission for how their personally identifiable information is used in medical records, personal health information, health plans, healthcare clearinghouses, and healthcare transactions (HHS, n.d.b.).  The Department of Health & Human Services collects complaints that deal directly with violations of the HIPAA regulations (HHS, n.d.a.).  Brown (2014) outlines the cost of each violation based on the type of violation, whether it involved willful neglect, and how many identical violations have occurred, with penalty costs ranging from $10-50K per incident. Industry best practices on how to avoid HIPAA violations include (Pallardy, 2015):

  • De-identify personal data (a minimal sketch follows this list): names, birth dates, death dates, treatment dates, admission dates, discharge dates, telephone numbers, contact information, addresses, social security numbers, medical record numbers, photographs, finger and voice prints, etc.
  • Install technical controls: anti-malware, data loss prevention, two-factor authentication, patch management, disc encryption, and logging and monitoring software
  • Install certain security controls: Security and compliance oversight committee, formal security assessment process, security incident response plan, ongoing user awareness and training, information classification system, security policies
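
A minimal sketch of the de-identification practice above, dropping direct identifiers from a patient record before it is shared; the identifier list is illustrative and not a complete HIPAA Safe Harbor implementation.

```python
# Minimal de-identification sketch: strip direct identifiers from a record
# before sharing. The identifier list is illustrative only and is not a
# complete HIPAA Safe Harbor implementation.
DIRECT_IDENTIFIERS = {
    "name", "birth_date", "death_date", "admission_date", "discharge_date",
    "telephone", "address", "ssn", "medical_record_number", "photo",
}

def de_identify(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

patient = {
    "name": "Jane Doe",
    "birth_date": "1970-05-01",
    "ssn": "000-00-0000",
    "diagnosis_code": "E11.9",   # clinical fields are retained
    "lab_result": 6.8,
}
print(de_identify(patient))      # {'diagnosis_code': 'E11.9', 'lab_result': 6.8}
```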

References

Data brokers for health care

Data brokers are tasked with collecting data from people, building a particular type of profile on each person, and selling it to companies (Angwin, 2014; Beckett, 2014; Tsesis, 2014). The data broker’s main mission is to collect data and break down the barriers of geographic location, cognitive or cultural gaps, different professions, or parties that don’t trust each other (Long, Cunningham, & Braithwaite, 2013). The danger of collecting this data is that it can raise the incidence of discrimination based on race or income, directly or indirectly (Beckett, 2014).  Some prominent data brokers are Acxiom, Google, and LexisNexis (Tsesis, 2014).

When it comes to companies, it is unknown which data is being collected from people and sold to other companies (Beckett, 2014).  According to Tsesis (2014), the Fourth Amendment of the Constitution does not apply here, due to the nature of how the data is collected and correlated through data mining by third-party entities.  So what kind of data do they have?  Current data brokers have data obtained from the companies people shop at or apply to for credit and loans, such as: names, addresses, contact info, demographics, occupation, education level, parents’ names, children’s names, gender of the person’s children, hobbies, purchases, salary, and other data that is unknown (Beckett, 2014).

Sensitive data, like medical records and doctor-patient conversations, is protected from these commercial data brokers (Beckett, 2014). This is due to HIPAA (the Health Insurance Portability and Accountability Act of 1996), which helps de-identify patient data. However, there are always ways around this.  Companies and health insurers are buying online search data, allergy data, and dieting data and correlating it with other data to build a health profile of a person (Beckett, 2014). There are benefits to having in-house data brokers in hospitals where data is stored in silos (Long et al., 2013):

  • Brokers can bring specialized subject matter expertise to connect distributed data for improving patient care and improve healthcare service efficiency.
  • Brokers can help reduce redundant data held in silos.
  • Brokers can increase access to heterogeneous knowledge through gathering and increasing tacit knowledge. This type of knowledge is derived from different groups or networks and thus is a source of new information.
  • Brokers’ efforts can help generate innovations.

However, data collected and correlated by data brokers could still be completely wrong, as has been shown with credit score information from the three big credit agencies (Angwin, 2014; Beckett, 2014). If the data is wholly or partially wrong, the wrong conclusions can be drawn about a person, and using it for discrimination only compounds the problem.  Long et al. (2013) concluded that brokers, even in the field of healthcare, are an expensive endeavor and, as the primary gatekeepers of data, could be overwhelmed. Angwin (2014) compiled a list of 212 commercial data brokers, of which about 92 allow for opting out.  Tsesis (2014) stated that many scholars in the field of privacy are advocating for a person’s right to opt out of data collection they did not consent to. This suggests that under current laws and regulations, data collection and correlation to a person’s profile is currently unavoidable, yet there is a chance that some of that data is wrong to begin with.

References

Fraud detection in the health care industry using analytics

Fraud is deception, and fraud detection is sorely needed because, even as fraud detection algorithms improve, the rate of fraud is increasing (Minelli, Chambers, & Dhiraj, 2013). Hadoop and the HFlame distribution can be used to help identify fraudulent data in industries like banking in near-real-time (Lublinsky, Smith, & Yakubovich, 2013).

Data mining has allowed for fraud detection via multi-attribute monitoring, which tries to find hidden anomalies by identifying hidden patterns through the use of class description and class discrimination (Brookshear & Brylow, 2014; Minelli et al., 2013). Class description identifies patterns that define a group of data, and class discrimination identifies patterns that divide groups of data (Brookshear & Brylow, 2014). As data flows in, it is monitored through validity checks and detection rules that assign it a score, such that if the validity and detection score surpasses a threshold, that data point is flagged as potentially suspicious (Minelli et al., 2013).
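
A stripped-down sketch of that scoring idea follows; the rules, weights, and threshold are made up for illustration and are not from Minelli et al. (2013).

```python
# Stripped-down sketch of scoring incoming claims against validity/detection
# rules and flagging those whose total score passes a threshold. The rules,
# weights, and threshold are illustrative assumptions, not published values.
def score_claim(claim: dict) -> float:
    score = 0.0
    if claim["billed_amount"] > 10_000:          # unusually large bill
        score += 0.4
    if claim["procedures_per_visit"] > 5:        # many procedures in one visit
        score += 0.3
    if claim["provider_claims_today"] > 50:      # provider billing volume spike
        score += 0.3
    return score

THRESHOLD = 0.6   # scores above this are flagged as potentially suspicious

claims = [
    {"id": 1, "billed_amount": 900, "procedures_per_visit": 1, "provider_claims_today": 12},
    {"id": 2, "billed_amount": 15_000, "procedures_per_visit": 7, "provider_claims_today": 60},
]
flagged = [c["id"] for c in claims if score_claim(c) > THRESHOLD]
print("Potentially suspicious claims:", flagged)   # -> [2]
```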

This is a form of outlier data mining analysis, where data that doesn’t fit any of the groups of data that have been described and discriminated can be used to identify fraudulent data (Brookshear & Brylow, 2014; Connolly & Begg, 2014). Minelli et al. (2013) stated that using historical data to build up the validity checks and detection rules, combined with real-time data, can help identify outliers in near-real time. But what about predicting fraud?  In the future, companies will be using Hadoop’s machine learning capability paired with its fraud detection algorithms to provide predictive modeling of fraud events (Lublinsky, Smith, & Yakubovich, 2013).

A process mining framework for the detection of healthcare fraud and abuse case study (Yang & Hwang, 2006)

Fraud exists in processing health insurance claims because there are many channels of communication, and thus more opportunities to commit it: service providers, insurance agencies, and patients. Any one of these three parties can commit fraud, and the highest risk occurs where service providers perform unnecessary procedures, putting patients at risk. This case study provided a framework for conducting automated fraud detection. The study collected data on 2,543 gynecology patients from 2001-2002 from a hospital, filtered out noisy data, identified activities based on medical expertise, and identified fraud in about 906 of them.

Before data mining and machine learning, the process relied heavily on medical professionals with subject matter expertise to detect fraud, which was costly across multiple resources.  Machine learning is also not subject to the manual errors that are common with humans.  Machine learning algorithms for fraud detection rely on clinical pathways, which are defined as the right people giving the right care services in the right order, with the aim of reducing waste and implementing best practices.  Any abnormal deviation from a pathway can be flagged by the machine learning algorithm.
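
As a simple illustration of the pathway idea, the sketch below flags visits whose activities fall outside a hypothetical expected sequence; the pathway steps are assumptions, not Yang and Hwang’s (2006) data.

```python
# Simple illustration of flagging deviations from a clinical pathway: the
# expected ordered sequence of care activities below is hypothetical, not
# taken from Yang and Hwang's (2006) study.
EXPECTED_PATHWAY = ["consultation", "lab_test", "diagnosis", "treatment", "follow_up"]

def deviates(observed: list[str]) -> bool:
    """Flag a visit whose activities are not an in-order subsequence of the pathway."""
    remaining = iter(EXPECTED_PATHWAY)
    return not all(step in remaining for step in observed)

print(deviates(["consultation", "diagnosis", "treatment"]))              # False: follows the order
print(deviates(["treatment", "consultation", "lab_test", "treatment"]))  # True: out of order
```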

References

  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, (12th). Pearson Learning Solutions. VitalBook file.
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th). Pearson Learning Solutions. VitalBook file.
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox. VitalBook file.
  • Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.
  • Yang, W. S., & Hwang, S. Y. (2006). A process-mining framework for the detection of healthcare fraud and abuse. Expert Systems with Applications, 31(1), 56-68.

Building block system of health care data analytics

The building block system of healthcare big data analytics involves a few steps (Burkle et al., 2001):

  • What is the purpose that the new data will and should serve
    • How many functions should it support
    • Mark which parts of that new data are needed for each function
  • Identify the tools needed to support the purpose of that new data
  • Create a top-level architecture plan view
  • Build based on the plan, but leave room to pivot when needed
    • Modifications occur to allow the final vision to be achieved given the conditions at the time of building the architecture.
    • Other modifications come from closer inspection of certain components in the architecture.

With big data analytics in healthcare, parallel programming and distributed programming are part of the solution to consider in building a network cluster (Mirtaheri, Khaneghah, Sharifi, & Azgomi, 2008; Services, 2015). Distributed programming allows for connecting multiple computer resources distributed across different locations (Mirtaheri et al., 2008). Programming that maximizes the connections for processing or accessing data that is distributed across the computational resources is considered parallel programming (Mirtaheri et al., 2008; Sandén, 2011).  Burkle et al. (2001) explained how they used the building block design for a DNA network cluster to build a system that helps classify/predict which pathogen’s genome is being analyzed and understand which parts of the genome, found in many pathogens, may be immune to certain treatments:

  • Tracing data from the sequencer (*.esd file)
    • Base caller
  • “Raw” sequence (*.scf file)
    • Edit, assemble, and export the genome assembly for research
  • “Clean” sequence
    • External references are called in and output from reference databases
  • Genus and species are identified
  • Completed results are called and output into an attributed local database

The process flow above for sequencing genomic data is part of the top-level plan that was modified as time went by, illustrating step four of the building blocks process.

Now, let’s consider applying the building block system to a healthcare problem: monitoring patient vital signs, similar to Chen et al. (2010).

  • The purpose that the new data will serve: Most hospitals measure the following vitals for triaging patients: blood pressure and flow, core temperature, ECG, and carbon dioxide concentration (Chen et al., 2010).
    1. Functions it should serve: gathering, storing, preprocessing, and processing the data. Chen et al. (2010) suggested that the system should also perform consistency checks and aggregate and integrate the data.
    2. Which parts of the data are needed to serve these functions: all of them.
  • Tools needed: a distributed database system, a wireless network, parallel processing, a graphical user interface for healthcare providers to understand the data, servers, subject matter experts to set upper and lower limits, and classification algorithms that use machine learning.
  • Top-level plan: The data will be collected from the vital sign sensors, streaming at various time intervals into a central hub that sends the data in packets over a wireless network to a server room. The server can divide the data across the various distributed systems accordingly. A parallel processing program will access the data per patient per window of time to conduct the needed functions and classifications and provide triage warnings if the vitals hit any of the predetermined key performance indicators that require intervention by the subject matter experts.  If a key performance indicator is triggered, the data is sent to the healthcare provider’s device via a graphical user interface (a minimal sketch of this KPI check appears at the end of this example).
  • Pivoting is bound to happen; for example:
    1. Graphical user interface is not healthcare provider friendly
    2. Some of the sensors need to be able to throw a warning if they are going bad
    3. Subject matter experts may need to readjust the classification algorithm for better triaging

Thus, the above problem, as discussed by Chen et al. (2010), can be broken apart into its building block components as addressed in Burkle et al. (2001).  These components help to create a system to analyze this set of big health care data through analytics, via distributed systems and parallel processing, as addressed by Services (2015) and Mirtaheri et al. (2008).
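
A minimal sketch of the KPI check from the top-level plan above; the vital-sign limits are placeholders that subject matter experts would set, not clinical guidance.

```python
# Sketch of the top-level plan's KPI check: compare a window of vital-sign
# readings for one patient against SME-defined upper/lower limits and raise a
# triage warning when any limit is crossed. The limits are placeholders, not
# clinical guidance.
KPI_LIMITS = {                      # (lower, upper) limits per vital sign
    "core_temp_c": (35.0, 38.5),
    "systolic_bp": (90, 180),
    "co2_mmHg": (35, 45),
}

def triage_warnings(window: dict[str, list[float]]) -> list[str]:
    """Return a warning for every vital in the patient's time window that
    breaches its predetermined key performance indicator."""
    warnings = []
    for vital, readings in window.items():
        low, high = KPI_LIMITS[vital]
        for value in readings:
            if not (low <= value <= high):
                warnings.append(f"{vital}={value} outside [{low}, {high}]")
                break
    return warnings

patient_window = {
    "core_temp_c": [36.8, 37.1, 39.2],   # last reading breaches the upper limit
    "systolic_bp": [118, 122, 125],
    "co2_mmHg": [38, 40, 41],
}
alerts = triage_warnings(patient_window)
if alerts:
    print("Send to provider GUI:", alerts)
```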

References

  • Burkle, T., Hain, T., Hossain, H., Dudeck, J., & Domann, E. (2001). Bioinformatics in medical practice: what is necessary for a hospital?. Studies in health technology and informatics, (2), 951-955.
  • Chen, B., Varkey, J. P., Pompili, D., Li, J. K., & Marsic, I. (2010). Patient vital signs monitoring using wireless body area networks. In Bioengineering Conference, Proceedings of the 2010 IEEE 36th Annual Northeast (pp. 1-2). IEEE.
  • Mirtaheri, S. L., Khaneghah, E. M., Sharifi, M., & Azgomi, M. A. (2008). The influence of efficient message passing mechanisms on high performance distributed scientific computing. In Parallel and Distributed Processing with Applications, 2008. ISPA’08. International Symposium on (pp. 663-668). IEEE.
  • Sandén, B. I. (2011). The Design of Multithreaded Software: The Entity-Life Modeling Approach, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].

Column-oriented NoSQL databases

NoSQL (Not only Structured Query Language) databases store data in non-relational form, e.g., graphical, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). Column-oriented databases are well suited for sparse datasets, ones with many null values, and when columns do have data, the related columns are grouped together (Services, 2015).  Grouping demographic data like age, income, gender, marital status, sexual orientation, etc. is a great example of a use case for this type of NoSQL database. Cassandra, a column-oriented NoSQL database, focuses on availability and partition tolerance; as an AP system, it can achieve consistency if data can be replicated and verified (Hurst, 2010).

Cassandra’s performance has been evaluated against other NoSQL databases like MongoDB and Riak for health care data analytics (Weider, Kollipara, Penmetsa, & Elliadka, 2013).  In this study, the NoSQL database demands for health care data were two-fold:

  • Read/write efficiency of medical test results for a patient X (Availability)
  • All medical professionals should see the same information on patient X (Consistency)

A NoSQL graph database did not fit the above demands and thus wasn’t part of this study.

The architecture of this project consisted of nine partition nodes, arranged three by three to mimic three data centers serving 100 global health facilities, where data is generated at a rate of 1 TB per month and must be kept for 99 years.

The dataset used in this project was a synthetic dataset of 1M patients with 10M lab reports, averaging seven lab reports per person but randomly distributed from 0-20 lab reports per person.

In meeting both of these demands, Cassandra had a significantly higher throughput than the other two NoSQL databases. Cassandra’s EACH_QUORUM write and LOCAL_QUORUM read options are part of its datacenter-aware design, which provided the strong throughput results across the three synthetic datacenters. Testing consistency by using Cassandra’s ONE option for its writes and reads, with either eventual consistency (consistency achieved more slowly) or strong consistency (consistency achieved more quickly), showed that throughput increases with the eventual setting. The choice of which rate to use rests with the healthcare stakeholders.
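
For reference, here is a minimal sketch of how those consistency options look with the DataStax Python driver; the cluster address, keyspace, and table are assumptions and do not reproduce Weider et al.’s (2013) test setup.

```python
# Minimal sketch of choosing Cassandra consistency levels with the DataStax
# Python driver. The contact point, keyspace, and table are assumptions and
# do not reproduce the study's configuration.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # assumed contact point
session = cluster.connect("healthcare")   # assumed keyspace

# Datacenter-aware options from the study: EACH_QUORUM writes, LOCAL_QUORUM reads.
write = SimpleStatement(
    "INSERT INTO lab_reports (patient_id, report_id, result) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.EACH_QUORUM)
session.execute(write, ("patient-123", "lab-9", "normal"))

read = SimpleStatement(
    "SELECT result FROM lab_reports WHERE patient_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
rows = session.execute(read, ("patient-123",))

# ConsistencyLevel.ONE trades stronger consistency for higher throughput
# (eventual consistency), as in the study's throughput comparison.
fast_read = SimpleStatement(
    "SELECT result FROM lab_reports WHERE patient_id = %s",
    consistency_level=ConsistencyLevel.ONE)
```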

The authors concluded that for their system and their requirements, Cassandra had the highest throughput regardless of the consistency level (Weider et al., 2013).  They also suggested that each of these tests should be adjusted based on the requirements of key stakeholders in the healthcare profession and that a small variation in the data model could change the results seen here.

To conclude this post, NoSQL databases provide huge advantages to data analytics over traditional relational database management systems. However, the NoSQL database must fit the needs of the stakeholders, and quantitative tests must be thoroughly designed to assess which NoSQL database will meet those needs.

References

  • Hurst, N. (2010). Visual guide to NoSQL systems. Retrieved from http://blog.nahurst.com/visual-guide-to-nosql-systems
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].
  • Weider, D. Y., Kollipara, M., Penmetsa, R., & Elliadka, S. (2013, October). A distributed storage solution for cloud based e-Healthcare Information System. In e-Health Networking, Applications & Services (Healthcom), 2013 IEEE 15th International Conference on (pp. 476-480). IEEE.

Graphical NoSQL Databases

There are many complicated connections between patients, their providers, their diagnoses, etc., and representing this relationship data graphically is one of the main strengths of a NoSQL graph database (Park, Shankar, Park, & Ghosh, 2014). NoSQL (Not only Structured Query Language) databases store data in non-relational form, e.g., graphical, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). Graph NoSQL databases are used to draw networks showing the relationships between items in a graph format that has been optimized for easy searching and editing (Services, 2015). Each item is considered a node, and adding more nodes or relationships while traversing through them is simpler in a graph database than in a traditional database (Sadalage & Fowler, 2012). Some sample graph databases are Neo4j, Pregel, etc. (Park et al., 2014).

Case Study: Graph Databases for large-scale healthcare systems: A framework for efficient data management and data services (Park et al., 2014)

Driver for data analytics needs: Finding areas for cost savings through anomaly detection algorithms, because there are currently many individual tables and non-normalized data replicated multiple times, which causes bottlenecks.

Problem: Understanding and establishing relationships between self-referrals and shared providers, which allows for the use of a collaborative filter.

System Needs: Data management needs an error-tolerant and non-redundant database system, while data services need data retrieval, analytics queries, statistical data extraction, and mining algorithms.

NoSQL Database used: the Neo4j graph NoSQL database with Cypher queries, chosen to keep the data normalized and reduce the number of individual tables of data thanks to its advanced yet simple query capabilities.

Methodology: The 3EG (3NF Equivalent Graph) transformation algorithm was used to convert traditional relational database data into graph database data on realistic synthetic healthcare data.  The synthetic healthcare data consists of zip codes, disease diagnoses, available procedures, beneficiaries, claims, and providers. When flattened, the data covers 1M beneficiaries and 100K providers, but in graph format that same data has 51M nodes and 257M relationships.

Queries run on the NoSQL database (one is sketched after the list):

  • Shared providers between two beneficiaries
  • Shared providers between two beneficiaries through either actual visits or by referrals
  • List of shared diseases between two beneficiaries through their claim records
  • Any link between two beneficiaries → helps to direct further investigations/queries
  • Shared beneficiaries between two providers
  • Self-referred beneficiaries for a given provider
  • Similar claims based on diagnoses codes
  • Patient wants to switch to a new provider based on a referral by another provider
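
To give a flavor of such queries, here is a sketch of the first case (shared providers between two beneficiaries) using the Neo4j Python driver; the node labels, relationship type, and connection details are assumed, since the paper’s exact schema is not reproduced here.

```python
# Sketch of the "shared providers between two beneficiaries" query via the
# Neo4j Python driver. Node labels (Beneficiary, Provider), the VISITED
# relationship, and connection details are assumptions, not the schema used
# by Park et al. (2014).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

SHARED_PROVIDERS = """
MATCH (b1:Beneficiary {id: $b1})-[:VISITED]->(p:Provider)<-[:VISITED]-(b2:Beneficiary {id: $b2})
RETURN p.id AS provider_id
"""

def shared_providers(b1: str, b2: str) -> list[str]:
    """Return the providers visited by both beneficiaries."""
    with driver.session() as session:
        result = session.run(SHARED_PROVIDERS, b1=b1, b2=b2)
        return [record["provider_id"] for record in result]

print(shared_providers("BEN-001", "BEN-002"))
driver.close()
```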

Using 50 random queries for each of the 8 cases above: the first three cases ran faster as MySQL queries, but by less than 0.0X seconds, whereas for the last 5 cases NoSQL was faster, by 0.5-40 seconds.  As the data size grew, so did MySQL’s processing time for the last five cases.

Conclusions: The authors were able to show that for the more advanced cases, MySQL takes more time than NoSQL. Thus, for big data analytics, NoSQL graph databases can help store dynamic relationship data and process more complex queries in fewer lines of code and faster than MySQL.  This style of storing data allows the end-user in the healthcare field to ask more complex questions and get the answers promptly.

References

  • Park, Y., Shankar, M., Park, B. H., & Ghosh, J. (2014). Graph databases for large-scale healthcare systems: A framework for efficient data management and data services. In Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on (pp. 12-19). IEEE.
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].

Diagnosis of illness via big data

Big data is defined by high volume, high variety/complexity, and high velocity, known as the 3Vs (Services, 2015). Machine learning and artificial intelligence do well at analyzing patterns in frequent and voluminous data at faster speeds than humans, but they fail to recognize patterns in infrequent and small amounts of data (Goldbloom, 2016). Thus, it is vital to understand where data analytic theories and techniques apply to big data rather than to novel situations in healthcare.

Fayyad, Piatetsky-Shapiro, and Smyth (1996) divided data analytics into descriptive and predictive analytics. Vardarlier and Silahtaroglu (2016) agreed with Fayyad et al.’s (1996) division but added prescriptive analytics.  Thus, these three divisions of big data analytics are:

  • Descriptive analytics explains “What happened?”
  • Predictive analytics explains “What will happen?”
  • Prescriptive analytics explains “Why will it happen?”

(Fayyad et al., 1996; Vardarlier & Silahtaroglu, 2016). Which theory/division one chooses should depend on the goal of diagnosing illnesses with big data analytics.  Raghupathi and Raghupathi (2014) stated some common examples of big data in the healthcare field: personal medical records, radiology images, clinical trial data, 3D imaging, human genomic data, population genomic data, biometric sensor readings, x-ray films, scripts, and traditional paper files.

Big data analytics can be used, for example, to understand the 23 pairs of chromosomes that are the building blocks of a person. Healthcare professionals are using the big data generated from our genomic code to help predict which illnesses a person could get (Services, 2013). Thus, using predictive analytics tools and algorithms like decision trees would be of some use.  Another use of predictive analytics and machine learning is diagnosing an eye disease like diabetic retinopathy from an image by using classification algorithms (Goldbloom, 2016).
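
As a toy sketch of that predictive-analytics idea, the example below trains a decision tree classifier on made-up feature data; it is not a real genomic or retinopathy model.

```python
# Toy sketch of predictive analytics with a decision tree classifier. The
# features and labels are made up; this is not a real genomic or diabetic
# retinopathy model.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Hypothetical features (e.g. encoded biomarker measurements) and labels
# (1 = likely to develop the illness, 0 = not).
X = [[0.1, 2.3], [0.4, 1.1], [1.9, 0.2], [2.1, 0.4], [0.2, 2.0], [1.8, 0.3]]
y = [0, 0, 1, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

print("Predicted risk labels:", model.predict(X_test))
print("Held-out accuracy:", model.score(X_test, y_test))
```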

The study of epigenetics, which concerns what parts of the genetic code are turned on versus turned off, can help explain why certain illnesses are more probable to occur in the future (What is epigenetics, n.d.).  Thus, prescriptive analytics could be of some use in the study of epigenetics. Currently, clinical trials commonly use descriptive analytics to calculate the true positives, false positives, true negatives, and false negatives of a drug treatment versus a placebo.  Thus, the goal of diagnosing illnesses and the problem at hand should help define which theories and techniques of big data analytics to use. The use of different data analytics techniques and theories based on the problem and data can change how healthcare jobs look 30 years from today (Goldbloom, 2016; McAfee, 2013).

References