Big Data Analytics: Privacy & HIPAA

Since its inception 25 years ago, the Human Genome Project has led to the sequencing of many human genomes, each about 3 billion (3B) base pairs long (Green, Watson, & Collins, 2015).  This project gave rise to a new program, the Ethical, Legal and Social Implications (ELSI) project.  ELSI received 5% of the National Institutes of Health budget to study the ethical implications of this data, opening up a new field of study (Green et al., 2015; O’Driscoll, Daugelaite, & Sleator, 2013).  Data sharing must occur to leverage the benefits of the genome project and others like it.  Poldrack and Gorgolewski (2014) stated that sharing data advances the field in several ways: maximizing the contribution of research subjects, enabling responses to new questions, enabling the generation of new questions, enhancing the reproducibility of research results (especially when the data and the software used are combined), providing a test bed for new big data analysis methods, improving research practices (development of a standard of ethics), reducing the cost of doing the science (what is feasible for one scientist to do), and protecting valuable scientific resources (by indirectly creating a redundant backup for disaster recovery).  Sharing genomic data can present ethical challenges, yet it allows multiple countries and disciplines to come together, analyze data sets, and arrive at new insights (Green et al., 2015).

Richards and King (2014) state that, concerning privacy, we must think in terms of the flow of personal information.  Privacy cannot be treated as a binary in which data is either private or public; it lies on a spectrum.  Richards and King (2014) argue that data exchanged between two people carries a certain expectation of privacy and can remain confidential, but there is never a case where data is absolutely private or absolutely public.  Not everyone in the world would know or care about every single data point, nor will any data point be kept permanently secret once its source has uttered it out loud.  Thus, Richards and King (2014) stated that transparency can help prevent abuse of the data flow.  That is why McEwen, Boyer, and Sun (2013) discussed possible options for consent: open consent (your data can be used for any future research project), broad consent (the various ways the data could be used are described, but the permission is not universal), or opt-out consent (participants can say what their data should not be used for).
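The three consent options described above could be encoded as a simple data-use check.  What follows is a minimal Python sketch under stated assumptions, not a scheme taken from McEwen et al. (2013): the names ConsentModel, Consent, and may_use, as well as the purpose labels, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ConsentModel(Enum):
    OPEN = "open"        # data may be used for any future research project
    BROAD = "broad"      # data may be used only for the purposes described
    OPT_OUT = "opt-out"  # data may be used except for purposes the participant excluded


@dataclass(frozen=True)
class Consent:
    model: ConsentModel
    # For BROAD: the allowed purposes. For OPT_OUT: the excluded purposes.
    purposes: frozenset = frozenset()


def may_use(consent: Consent, purpose: str) -> bool:
    """Return True if the (hypothetical) consent record permits this purpose."""
    if consent.model is ConsentModel.OPEN:
        return True
    if consent.model is ConsentModel.BROAD:
        return purpose in consent.purposes
    return purpose not in consent.purposes  # OPT_OUT


# Hypothetical participants and purposes, for illustration only.
open_c = Consent(ConsentModel.OPEN)
broad_c = Consent(ConsentModel.BROAD, frozenset({"cancer research"}))
opt_out_c = Consent(ConsentModel.OPT_OUT, frozenset({"ancestry studies"}))

for name, c in [("open", open_c), ("broad", broad_c), ("opt-out", opt_out_c)]:
    print(name, may_use(c, "cancer research"), may_use(c, "ancestry studies"))
```

A real consent system would also need to record who was told what and when, which is where the transparency argument of Richards and King (2014) comes in; the sketch only shows how the three options differ in what they permit.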

Attempts are being made, through the enactment of the Genetic Information Nondiscrimination Act (GINA), to protect identifying data, for fear that it could be used to discriminate against a person with a certain type of genomic indicator (McEwen et al., 2013).  Institutional Review Boards and the Common Rule, together with the Office for Human Research Protections (OHRP), provide guidance on the flow of de-identified information.  De-identified information can be shared and is valid under current Health Insurance Portability and Accountability Act of 1996 (HIPAA) rules (McEwen et al., 2013).  However, fear of losing control over the data flow comes from advances in decryption and de-anonymisation techniques (O’Driscoll et al., 2013; McEwen et al., 2013).

Data must be seen and recognized as part of a person’s identity, which can be defined as the “ability of individuals to define who they are” (Richards & King, 2014). Thus, the concern raised by O’Driscoll et al. (2013) about the ability to protect medical data, given big data and the changing conceptual, definitional, and legal landscape of privacy, is valid.  Because of HIPAA, cloud computing is currently on a watch list. Cloud computing can provide a lot of opportunity for cost savings. However, according to O’Driscoll et al. (2013), Amazon cloud computing is not HIPAA compliant, hybrid clouds could become HIPAA compliant, and commercial cloud options like GenomeQuest and DNANexus are HIPAA compliant.

However, ethical issues extend beyond privacy and compliance.  McEwen et al. (2013) note that data has been collected for 25 years.  What if data from 20 years ago reveals that a participant is at risk of an adverse health condition that could be prevented?  What duty do researchers today owe to that participant?  How many years back should that duty extend?

Other ethical issues to consider: when it comes to data sharing, how should researchers who collected the data, but did not analyze it, be positively incentivized?  One way is to make them co-authors of any publication based on their data, but that is incompatible with standards of authorship (Poldrack & Gorgolewski, 2014).

 

Resources:

  • Green, E. D., Watson, J. D., & Collins, F. S. (2015). Twenty-five years of big biology. Nature, 526.
  • McEwen, J. E., Boyer, J. T., & Sun, K. Y. (2013). Evolving approaches to the ethical management of genomic data. Trends in Genetics, 29(6), 375-382.
  • Poldrack, R. A., & Gorgolewski, K. J. (2014). Making big data open: data sharing in neuroimaging. Nature Neuroscience, 17(11), 1510-1517.
  • O’Driscoll, A., Daugelaite, J., & Sleator, R. D. (2013). ‘Big data,’ Hadoop and cloud computing in genomics. Journal of Biomedical Informatics, 46(5), 774-781.
  • Richards, N. M., & King, J. H. (2014). Big data ethics. Wake Forest L. Rev., 49, 393.

 

Big Data Analytics: Health Care Industry

The Human Genome Project, launched 25 years ago, took 13 years to sequence its first human genome of about 3 billion (3B) base pairs (Green, Watson, & Collins, 2015).  These 3B base pairs take up about 100 GB uncompressed, and by 2011, 13 quadrillion bases had been sequenced (O’Driscoll, Daugelaite, & Sleator, 2013).  With advances in technology and software as a service, the cost of sequencing a human genome was cut drastically, from $1M to $1K in 2012 (Green et al., 2015; O’Driscoll et al., 2013).  It is now so cheap that 23andMe and other companies have formed a consumer-driven genetic testing industry (McEwen, Boyer, & Sun, 2013).  At the beginning of this project, researchers were wondering what insights sequencing could bring to the understanding of disease; now there is an explosion of research studying millions of other genomes, from biological pathways, cancerous tumors, microbiomes, etc. (Green et al., 2015; O’Driscoll et al., 2013).  Storing 1M genomes will exceed 1 exabyte (O’Driscoll et al., 2013).  Based on the definitions of Volume (sizes on the order of 1 EB), Variety (different types of genomes), and Velocity (processing huge amounts of genomic data), we can classify the whole genomic effort in the health care industry as big data.
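The storage figures quoted above can be checked with a little arithmetic.  The following is a minimal Python sketch, assuming decimal units (1 GB = 10^9 bytes, 1 EB = 10^18 bytes); the variable names are illustrative only.

```python
# Decimal (SI) byte units; this choice is an assumption of the sketch.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18

genome_bytes = 100 * GB   # about 100 GB per uncompressed genome, as quoted above
n_genomes = 1_000_000     # the 1M genomes from the storage estimate

# Total storage if every genome took exactly 100 GB.
total_bytes = genome_bytes * n_genomes
print("1M genomes at 100 GB each:", total_bytes // PB, "PB")

# Per-genome size that would be needed to reach 1 EB across 1M genomes.
implied_bytes = EB // n_genomes
print("Per-genome size implied by 1 EB:", implied_bytes // TB, "TB")
```

At 100 GB per genome, 1M genomes come to about 100 PB, so exceeding 1 EB implies roughly 1 TB stored per genome.  One possible reading, which is an assumption here and not something stated in the sources as summarized above, is that the larger figure counts raw sequencing output and associated files and not only the finished 3B base pair sequence.  Either way, the volume is well within big data territory.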

This project has paved the way for other projects, such as the sharing and analysis of MRI data from 511 participants (exceeding 18 TB) (Poldrack & Gorgolewski, 2014).  Green et al. (2015) have stated that the genome project has led to huge innovation in tangential fields not directly related to biology, like chemistry, physics, robotics, computer science, etc.  It was due to this type of research that capillary-based DNA sequencing instruments were invented for sequencing genomes (Green et al., 2015).  The Ethical, Legal and Social Implications project received 5% of the National Institutes of Health budget to study the ethical implications of this data, opening up a new field of study (Green et al., 2015; O’Driscoll et al., 2013).  O’Driscoll et al. (2013) suggested that solutions like Hadoop’s MapReduce would greatly advance this field.  However, they argue that intensive Java knowledge is currently needed, which can be a bottleneck for the biologist.  Luckily, this field is driving the need for a graphical user interface, which will allow scientists to conduct research without learning to program.  O’Driscoll et al. (2013) also state that the biggest drawback of Hadoop’s MapReduce function is that it reduces data line by line, whereas genomic data needs to be reduced in groups.  With time, this project should improve Hadoop’s service offering for fields outside of biomedical research.
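To make the line-versus-group distinction more concrete, here is a minimal sketch in plain Python that imitates the map, shuffle, and reduce phases.  It does not use Hadoop or its API, and the record format (a chromosome label, a tab, and a sequence fragment) is a hypothetical one chosen for illustration.  The mapper sees one line at a time, while the reducer only becomes useful once the values that belong to the same group have been collected under one key.

```python
from collections import defaultdict

# Hypothetical input: one record per line, "<chromosome>\t<sequence fragment>".
records = [
    "chr1\tACGT",
    "chr2\tGG",
    "chr1\tTTA",
]


def map_record(line):
    """Map phase: process a single line and emit a (key, value) pair."""
    chrom, seq = line.split("\t")
    return chrom, len(seq)


def shuffle(pairs):
    """Shuffle phase: collect all values that share a key into one group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce phase: combine the whole group of values for one key."""
    return key, sum(values)


pairs = [map_record(line) for line in records]
result = dict(reduce_group(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'chr1': 7, 'chr2': 2}
```

In this toy case the grouping key is obvious.  The difficulty described above arises when the unit of analysis spans many lines and the grouping has to be designed explicitly, which is part of why programming knowledge becomes a bottleneck.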

In the medical field, advances in cancer diagnosis and treatment will now be possible due to this project (Green et al., 2015).  Green et al. (2015) also predict that a maturation of microbiome science and routine use of stem-cell therapies could result from it.  These predictions are not far from becoming reality and are the foundation of predictive and preventative medicine.  They are close enough that McEwen et al. (2013) have asked what the ethical obligations are toward people who submitted their genomic data 25 years ago, when researchers later find data that could help those participants take preventative measures against adverse health conditions.  The question arises mostly because clinical versions of this data are starting to become available from companies like 23andMe. So far, this information has yielded genealogy data and a few predictive medical measures (to a certain confidence interval).  Predictive and preventative medical advances are still largely in the research phase (McEwen et al., 2013).  Finally, genomics research will pave the way for metagenomics, which is the study of microbiome data from as many as possible of the estimated 4-6 × 10^30 bacterial cells (O’Driscoll et al., 2013).

From this discussion, there is no doubt that genomic data falls under the classification of big data.  The analysis of this data has yielded advances in the medical field and in other tangential fields.  Future work to expand predictive and preventative medicine is still needed; for now, it is only in research studies that participants can learn about genomic indicators that may predispose them to certain types of adverse health conditions.

Resources:

  • Green, E. D., Watson, J. D., & Collins, F. S. (2015). Twenty-five years of big biology. Nature, 526.
  • McEwen, J. E., Boyer, J. T., & Sun, K. Y. (2013). Evolving approaches to the ethical management of genomic data. Trends in Genetics, 29(6), 375-382.
  • O’Driscoll, A., Daugelaite, J., & Sleator, R. D. (2013). ‘Big data,’ Hadoop and cloud computing in genomics. Journal of Biomedical Informatics, 46(5), 774-781.
  • Poldrack, R. A., & Gorgolewski, K. J. (2014). Making big data open: data sharing in neuroimaging. Nature Neuroscience, 17(11), 1510-1517.