Pros and Cons of Hadoop MapReduce

The following are some of the advantages and disadvantages of using MapReduce (Lublinsky et al., 2013; Sakr, 2014):

Advantages 

  • Hadoop is ideal because it is a highly scalable platform that is cost-effective for many businesses.
  • It supports huge computations, particularly through parallel execution.
  • It isolates low-level concerns such as fault tolerance, scheduling, and data distribution from the application code.
  • It supports parallelism in program execution.
  • It makes fault tolerance easier to achieve.
  • It provides a highly scalable, redundant array of independent nodes.
  • It runs on cheap, possibly unreliable commodity hardware.
  • Aggregation under the mapper function can exploit multiple different techniques.
  • No explicit reads or writes of intermediate data are required, so the input data is preserved.
  • There is no need to hand-code serialization or de-serialization of data in memory or during processing.
  • It scales with the size of the data and the resources needed to process it.
  • The sequential program is isolated from data distribution, scheduling, and fault tolerance (illustrated in the sketch after this list).
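
To make that last point concrete, here is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce Java API. The programmer supplies only the sequential map and reduce logic; splitting the input, scheduling tasks across nodes, and recovering from failures are handled entirely by the framework. The class names and command-line path arguments follow the usual WordCount convention and are not taken from the cited sources.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  // Nothing here knows which node it runs on or how the data was split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts the framework has already grouped by word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner line is the "aggregation under the mapper function" bullet in action: the same reduce logic runs locally on each mapper's output, shrinking the data shuffled across the network.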

Disadvantages 

  • The framework is not ideal for real-time data processing. During the map phase, the process can create too many keys, which consumes sorting time.
  • Most of the MapReduce outputs must be merged.
  • MapReduce cannot use natural indices.
  • In a repartition join, all the records for a particular join key from the input relations must be buffered.
  • Users of the MapReduce framework often rely on textual formats that are inefficient.
  • There is a huge waste of CPU resources, network bandwidth, and I/O, since data must be reloaded and reprocessed at every iteration.
  • The common MapReduce framework doesn't support applications designed for iterative data analysis (see the driver sketch after this list).
  • Detecting that a fixed point has been reached, a typical termination condition for iterative algorithms, may itself require an extra MapReduce job, which incurs overhead.
  • The MapReduce framework doesn't allow building one task from multiple data sets.
  • Too many mapper functions can create infrastructure overhead, which increases resource use and thus cost.
  • Too few mapper functions can create huge workloads for certain types of computational nodes.
  • Too many reducers can produce too many outputs, and too few reducers can produce too few.
  • It's a different programming paradigm that most programmers are not familiar with.
  • The available parallelism is underutilized for smaller data sets.
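
To make the iteration overhead concrete, below is a hypothetical driver sketch using the Hadoop Java API. Each pass must be submitted as a brand-new job that re-reads its input from HDFS and re-writes its output, and fixed-point detection relies on the driver inspecting a user-defined counter after every job. The identity Mapper/Reducer classes and the counter names are illustrative placeholders, not part of the cited sources.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);

    for (int i = 0; ; i++) {
      Path output = new Path(args[1] + "/iter-" + i);

      // Every pass is a brand-new job: the data is reloaded from HDFS and
      // the results rewritten, which is the wasted CPU, network bandwidth,
      // and I/O noted in the list above.
      Job job = Job.getInstance(conf, "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      // Identity classes keep this sketch self-contained and runnable; a
      // real iterative job would plug in its update logic here and
      // increment the CHANGED_RECORDS counter whenever a value changes.
      job.setMapperClass(Mapper.class);
      job.setReducerClass(Reducer.class);
      job.setOutputKeyClass(LongWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);
      if (!job.waitForCompletion(true)) System.exit(1);

      // Fixed-point detection is extra bookkeeping in itself: the driver
      // must read a counter after each job to decide whether to launch yet
      // another one. (With the identity classes above the counter stays 0,
      // so this demo stops after one pass.)
      long changed = job.getCounters()
          .findCounter("convergence", "CHANGED_RECORDS").getValue();
      if (changed == 0) break;

      input = output; // the next pass must re-read this pass's output
    }
  }
}
```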

Resources

  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
  • Sakr, S. (2014). Large Scale and Big Data (1st ed.). Vitalbook file.

Healthcare as a Service (HaaS) – cloud solution

Health Cloud: Healthcare as a Service (HaaS) case study (John & Shenoy, 2014):

The goal of this study is to provide a framework for building Health Cloud, a healthcare system that addresses some of the issues currently faced in the healthcare data analytics field, especially that paper images and records remain confined to a single healthcare provider's facility until they are faxed, scanned, or mailed. Health Cloud is able to store and index medical data, process images, generate reports, chart and analyze trends, and secure the data with identification and access control. The image processing capabilities of Health Cloud enable better diagnosis of a patient's medical condition. The image processing structure is built in C++; data requests and report output are exchanged in Binary JSON (BSON) or text formats. Finally, the system allows images to be framed, visualized, panned, zoomed, and annotated.
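
The paper's processing code is written in C++, but the BSON exchange format it describes is language-neutral. As a rough illustration (not the authors' code), the sketch below uses the org.bson Java library to serialize a hypothetical report document into BSON; all field names are invented for the example.

```java
import org.bson.BsonBinaryWriter;
import org.bson.Document;
import org.bson.codecs.DocumentCodec;
import org.bson.codecs.EncoderContext;
import org.bson.io.BasicOutputBuffer;

public class ReportEncoder {

  // Encode a hypothetical imaging report as BSON bytes suitable for a
  // request/response payload. Every field name here is illustrative only.
  public static byte[] encodeReport() {
    Document report = new Document("patientId", "P-0001")
        .append("modality", "MRI")
        .append("annotation", "lesion measured at 4.2 mm")
        .append("zoomLevel", 2.5);

    // Serialize the document into the compact, length-prefixed BSON format.
    BasicOutputBuffer buffer = new BasicOutputBuffer();
    try (BsonBinaryWriter writer = new BsonBinaryWriter(buffer)) {
      new DocumentCodec().encode(writer, report,
          EncoderContext.builder().build());
    }
    return buffer.toByteArray();
  }

  public static void main(String[] args) {
    System.out.println("BSON payload size: " + encodeReport().length + " bytes");
  }
}
```

BSON keeps the traversability of JSON while adding length prefixes and native binary types, which is one reason it suits exchanging image data alongside text fields.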

Issues related to health care data on the cloud (John & Shenoy, 2014):

  1. The volume of MRI data has doubled in a decade, and CT data has increased by 50%, raising the number of images primary providers request for their patients to improve and inform patient care. Thus, there is a need for hyper-scale cloud features.
  2. The Health Insurance Portability and Accountability Act (HIPAA) requires data to be stored for six years after a patient has been discharged, further increasing the volume of data. Consequently, there is another need for hyper-scale cloud features.
  3. Healthcare data should be shareable from anywhere and at any time per the Health Information Technology for Economic and Clinical Health Act (HITECH) and the American Recovery and Reinvestment Act (ARRA), which aim to reduce duplication of data and improve data quality and access. HIPAA has created security regulations on data backup, recovery, and access. Hence, there is a need for a community cloud provider familiar with HIPAA and other regulations.
  4. Each hospital system is developed in silos or purchased from different suppliers; thus, shared data may not be in a format easily received by the other system. A common architecture and data model must therefore be developed, which can be accomplished under a community cloud.
  5. Seamless access to the data stored in the cloud must be created across various mobile platforms. Thus, a cloud-provided option such as Software as a Service may best fit this requirement.
  6. Healthcare workflows are better managed in cloud-based solutions than in paper-based ones.
  7. Cloud capabilities can be used for processing data, depending on what is purchased from which supplier.

Pros and Cons of healthcare data on the public or private cloud:

On-site private clouds can have limited storage space, and the data may not be in a format that is easily transferable to other on-site private clouds (Bhokare et al., 2016). Upgrade, maintenance, and infrastructure costs fall 100% on the healthcare provider. Although these clouds are expensive, they offer the most control over the data and more control over the specialization of reports.

Public clouds distribute the cost of upgrades, maintenance, and infrastructure across all the parties requesting the servers (Connolly & Begg, 2014). However, the servers may not be specialized 100% to all regulatory and legal specifications, or they could carry additional regulatory and legal specifications that are not advantageous to the healthcare cloud system. Also, data stored on public clouds sits alongside other companies' data, which can leave healthcare providers feeling vulnerable about their data's security within the public cloud (Sumana & Biswal, 2016).

The solution should be a private or public community cloud. A community cloud is an environment shared exclusively by a set of companies that have similar characteristics: compliance, security, jurisdiction, and so on (Connolly & Begg, 2014). Thus, the infrastructure of all of these servers and grids meets industry standards and best practices, with the cost of the infrastructure shared and maintained by the community. Certain community services would be optimized for HIPAA, HITECH, ARRA, etc., with little overhead to the individual IT teams that make up the overall community (John & Shenoy, 2014).

References

  • Bhokare, P., Bhagwat, P., Bhise, P., Lalwani, V., & Mahajan, M. R. (2016). Private cloud using GlusterFS and Docker. International Journal of Engineering Science, 5016.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). Pearson Learning Solutions. [Bookshelf Online].
  • John, N., & Shenoy, S. (2014). Health cloud - Healthcare as a service (HaaS). In Advances in Computing, Communications and Informatics (ICACCI), 2014 International Conference on (pp. 1963-1966). IEEE.
  • Sumana, P., & Biswal, B. K. (2016). Secure privacy protected data sharing between groups in public cloud. International Journal of Engineering Science, 3285.