These are some of the advantages and disadvantages of using MapReduce (Lublinsky et al., 2013; Sakr, 2014):
- Hadoop is a highly scalable platform that is cost-effective for many businesses.
- It supports very large computations executed in parallel.
- It isolates the application from low-level concerns such as fault tolerance, scheduling, and data distribution.
- It makes fault tolerance easier to achieve.
- It runs on a highly scalable, redundant array of independent nodes.
- It runs on cheap, potentially unreliable commodity hardware.
- Mapper-side aggregation (e.g., a combiner) can exploit several different optimization techniques.
- Intermediate reads and writes do not touch the original input, so the input data is preserved.
- There is no need to serialize or de-serialize code, whether in memory or during processing.
- It scales with the size of the data and with the resources needed to process it.
- It isolates the sequential user program from data distribution, scheduling, and fault tolerance.
MapReduce also has notable disadvantages:
- It is not well suited to processing data in real time.
- The map phase can generate too many keys, which makes the sort phase time-consuming.
- Most MapReduce outputs must be merged in an additional step.
- MapReduce cannot exploit natural indices in the data.
- In a repartition join, all records for a given join key from both input relations must be buffered.
- Users of the MapReduce framework often rely on textual data formats, which are inefficient.
- CPU resources, network bandwidth, and I/O are wasted because data must be reloaded and reprocessed at every iteration.
- The standard MapReduce framework does not support applications designed for iterative data analysis.
- Detecting that a fixed point has been reached (a common termination condition) may itself require an extra MapReduce job, which incurs additional overhead.
- The MapReduce framework does not support building a single task from multiple data sets.
- Too many mapper tasks create infrastructure overhead, which increases resource usage and therefore cost.
- Too few mapper tasks create huge workloads for certain computational nodes.
- Too many reducers yield too many output files, while too few reducers yield too few, overly large ones.
- It is a different programming paradigm with which most programmers are unfamiliar.
- The available parallelism is underutilized for smaller data sets.
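To make the map/shuffle/reduce flow and mapper-side aggregation concrete, here is a minimal single-process sketch of a MapReduce-style word count in plain Python. The function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative, not part of any framework; in a real cluster each mapper and reducer would run on a separate node.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Mapper: count words locally before emitting, i.e. a
    # combiner-style mapper-side aggregation that shrinks
    # the data sent through the shuffle.
    local = defaultdict(int)
    for word in document.split():
        local[word] += 1
    return list(local.items())

def shuffle(mapped):
    # Shuffle: group all (key, value) pairs by key across mappers.
    pairs = sorted((kv for part in mapped for kv in part), key=itemgetter(0))
    return {k: [v for _, v in grp] for k, grp in groupby(pairs, key=itemgetter(0))}

def reduce_phase(grouped):
    # Reducer: sum the partial counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big compute", "big data"]
mapped = [map_phase(d) for d in docs]   # runs in parallel in a real cluster
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 3, 'compute': 1, 'data': 2}
```

Note how the input documents are never modified: mappers only read them and emit new key-value pairs, which is the data-preservation property listed above.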
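The buffering cost of a repartition join can also be sketched briefly. In this simplified, single-process illustration (the function and variable names are hypothetical), every record for a join key from both relations must sit in memory before any joined row can be emitted, which is the drawback noted above:

```python
from collections import defaultdict

def repartition_join(left, right):
    # After repartitioning by join key, a "reducer" must buffer
    # every record for its keys from BOTH inputs before it can
    # emit the cross product.
    buffered = defaultdict(lambda: ([], []))
    for key, row in left:
        buffered[key][0].append(row)
    for key, row in right:
        buffered[key][1].append(row)
    out = []
    for key, (lrows, rrows) in buffered.items():
        for l in lrows:
            for r in rrows:
                out.append((key, l, r))
    return out

users  = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "mug")]
joined = repartition_join(users, orders)
print(joined)  # [(1, 'alice', 'book'), (1, 'alice', 'pen')]
```

If the records for one key do not fit in a reducer's memory, the join fails or spills to disk, which is why this is listed as a limitation.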
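The iteration-related drawbacks can be illustrated with a toy fixed-point computation. This is a schematic sketch, not framework code: `load_input` stands in for re-reading static input from distributed storage on every iteration, and `has_converged` stands in for the extra job needed just to detect termination.

```python
def load_input():
    # In MapReduce, every iteration launches a new job that re-reads
    # this (unchanged) input from storage -- the wasted I/O and
    # bandwidth described above.
    return [1.0, 2.0, 3.0, 4.0]

def iteration(data, x):
    # One "job": move the estimate halfway toward the mean of the data.
    return 0.5 * (x + sum(data) / len(data))

def has_converged(prev, curr, eps=1e-6):
    # Fixed-point detection: in MapReduce this comparison is itself
    # an extra job over the outputs of two iterations.
    return abs(curr - prev) < eps

x, jobs = 0.0, 0
while True:
    data = load_input()   # reloaded every single iteration
    new_x = iteration(data, x)
    jobs += 2             # one compute job + one convergence-check job
    if has_converged(x, new_x):
        break
    x = new_x
print(round(new_x, 4), jobs)
```

The static input is reloaded dozens of times and the job count is doubled by convergence checks; iterative frameworks built on top of (or instead of) MapReduce exist precisely to avoid this overhead.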
- Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
- Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.