Adv Topics: MapReduce and Hadoop

Hadoop allows for data processing through MapReduce and it also allows for data storage (Lublinsky et al., 2014). MapReduce is an analytical engine and pattern that takes advantage of distributed systems while keeping the processes and data in one machine (Sadalage & Fowler, 2012). MapReduce thus contains two functions that work in parallel on distributed systems (Hortonworks, 2013; Sadalage & Fowler, 2012; Sakr, 2014; Sathupadi, 2010):

    1. Mappers functions create and process transactions on the system by mapping and aggregating data by key values. Mappers can read only one data record at a time.
    2. Reducers functions know what that key values are and will take all those values stored in a map to reduce the data to what is relevant. Reducers help summarize the data into a single output. This helps deal with the amount of data moving between multiple computational nodes.

Lublinsky, Smith, and Yakubovich, (2014), stated that an intermediate component of MapReduce is known as the shuffle and sort, where the data from the mapping function outputs are moved and presented to the reducer function.

Thus, MapReduce is a framework that uses parallel sequential algorithms that capitalize on cloud architecture, which became popular under the open source Hadoop project, as its main executable analytic engine (Lublinsky et al., 2014; Sadalage & Fowler, 2012; Sakr, 2014). Essentially, a sequential algorithm is a computer program that runs on a sequence of commands, and a parallel algorithm runs a set of sequential commands over separate computational cores (Brookshear & Brylow, 2014; Sakr, 2014). Thus, a parallel sequential algorithm runs a full sequential program over multiple but separate cores (Sakr, 2014). Another feature of MapReduce is that a reduced output can become another’s map function (Sadalage & Fowler, 2012). Subsequently, the advantages and disadvantages of using MapReduce are (Lusblinksy et al., 2014; Sakr, 2014):

+ aggregation techniques under the mapper function can exploit multiple different techniques

+ no read or write of intermediate data, thus preserving the input data

+ no need to serialize or de-serialize code in either memory or processing

+ it is scalable based on the size of data and resources needed for processing the data

+ isolation of the sequential program from data distribution, scheduling, and fault tolerance

– too many mapper functions can create an infrastructure overhead, which increases resources and thus cost

– too few mapper functions can create huge workloads for certain types of computational nodes

– too many reducers can provide too many outputs, and too little reducers can provide too little outputs

 – it’s a different programming paradigm that most programmers are not familiar with

 – the use of available parallelism will be underutilized for smaller data sets

Given that Hadoop is predominately known for popularizing MapReduce tasks, it is also known for its Hadoop Distributed File System (HDFS) where the data is distributed across multiple systems (Rathbone, 2013). Hadoop’s service is part of the cloud (as Platform as a Service = PaaS).  For PaaS, the end users manage the applications and data, whereas the provider (Hadoop), administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001). Data is broken up into small blocks, like Legos, such that they are distributed across a distributed database system and across multiple servers and can be processed across all these servers, e.g. Hadoop Cluster (IBM, n.d.).

A common example of a parallel sequential program is dynamical weather forecasting models. In dynamical weather forecasting models, there is a set of defined geodynamic, thermodynamic, and physical sequential algorithms define and evolve the main seven variables of weathers across time. For each time step, the forecasting models run these sequential algorithms over each grid point, which can represent a finite geospatial region. Each of these geospatial regions is split amongst multiple computational scores. This example expands in complexity when data has to travel between different finite geospatial regions through the boundaries, which is an example of data parallelism (Sakr, 2014). MapReduce uses the concept of data parallelism to help map and reduce data. Therefore, weather models could be considered as a loose form of MapReduce algorithm.

Resources: