Data Tools: Hadoop vs. Spark


Apache Spark

Apache Spark started with a working group inside and outside of UC Berkeley that was searching for an open-source framework able to handle the multi-pass, iterative algorithms that MapReduce's single-pass batch processing model serves poorly (Zaharia et al., 2012). Spark applications can be written in Java, Scala, Python, or R, and Spark also provides a SQL interface, which increases ease of use (Spark, n.d.; Zaharia et al., 2012).

Essentially, Spark is a high-performance cluster computing framework, but it does not have its own distributed file system and therefore uses the Hadoop Distributed File System (HDFS) or HBase for input and output (Gu & Li, 2013). Beyond HDFS and HBase, it can also access data from Cassandra, Hive, Tachyon, and any other Hadoop data source (Spark, n.d.). Spark's own data structure is the Resilient Distributed Dataset (RDD), a read-only collection whose data is cached in memory to improve processing time, provided there is enough memory across all the nodes of the cluster (Gu & Li, 2013; Zaharia et al., 2012). To avoid reloading data from disk, Spark keeps both initial and intermediate results in each node's memory (Gu & Li, 2013).
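
As a rough illustration of this caching behavior, here is a minimal PySpark sketch (the input path "data.txt", the application name, and the "ERROR" filter are placeholder assumptions, not details from the sources above); it caches an RDD so that repeated actions reuse the in-memory data instead of rereading the file:

    # Minimal PySpark sketch: cache an RDD in memory so repeated actions reuse it.
    # The input path and filter condition are placeholders for illustration only.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-cache-sketch")

    # In a real cluster this could be an HDFS path such as hdfs://namenode/logs.txt
    lines = sc.textFile("data.txt")

    errors = lines.filter(lambda line: "ERROR" in line)
    errors.cache()             # keep this RDD's partitions in memory once computed

    print(errors.count())      # first action: reads the file and fills the cache
    print(errors.take(5))      # later actions reuse the cached partitions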

RDDs on machines that are lost can be rebuilt, which makes them fault-tolerant without requiring replication (Gu & Li, 2013; Zaharia et al., 2012). Each RDD tracks the lineage graph of operations that produced it and reruns those operations on lost data, reconstructing it even if all of the nodes running Spark were to fail (Zaharia et al., 2012).
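
Continuing the placeholder sketch above (same SparkContext and lines RDD), the lineage that Spark records for an RDD can be printed directly; this recorded chain of transformations is what Spark replays to rebuild any lost partitions:

    # Continuing the sketch above: each transformation extends the RDD's lineage.
    words = lines.flatMap(lambda line: line.split())
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    # Print the recorded lineage (textFile -> flatMap -> map -> reduceByKey);
    # Spark reruns these steps for lost partitions rather than keeping replicas.
    print(counts.toDebugString())

    sc.stop()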

Hadoop

Hadoop is a Java-based system that allows manipulation and calculation of data by running MapReduce functions over its Hadoop Distributed File System (HDFS) (Hortonworks, 2013; IBM, n.d.).

In HDFS, big data is broken up into smaller blocks stored across different locations; no matter the type or amount of data, each of these blocks can still be located and aggregated, like a set of Legos, throughout the distributed system (IBM, n.d.; Minelli, Chambers, & Dhiraj, 2013). Data blocks are distributed across multiple servers. This block system provides an easy way to scale a company's data needs up or down and allows MapReduce to run its tasks on smaller sets of the data for faster processing (IBM, n.d.). IBM (n.d.) notes that the blocks in HDFS are small enough to be easily duplicated on two or more servers (depending on data needs) for disaster recovery, which also provides fault tolerance. MapReduce functions therefore use HDFS to run their procedures on the servers where the data is stored, with the data held in storage rather than in a cache, allowing for continuous service (IBM, n.d.).
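
As a conceptual toy only (this is plain Python, not the Hadoop API, and the block size, replica count, and server names are invented for illustration), the sketch below splits raw data into fixed-size blocks and assigns each block to more than one server, mirroring the block-and-replica idea described above:

    # Toy illustration of HDFS-style block splitting and replica placement.
    # Not Hadoop code: block size, replica count, and server names are invented.
    BLOCK_SIZE = 8                      # bytes per block (HDFS uses far larger blocks)
    REPLICAS = 2                        # copies of each block kept on different servers
    SERVERS = ["server-a", "server-b", "server-c"]

    def split_into_blocks(data, block_size=BLOCK_SIZE):
        """Break the raw data into fixed-size blocks, preserving order."""
        return [data[i:i + block_size] for i in range(0, len(data), block_size)]

    def place_replicas(blocks, servers=SERVERS, replicas=REPLICAS):
        """Assign each block to `replicas` distinct servers, round-robin."""
        return {i: [servers[(i + r) % len(servers)] for r in range(replicas)]
                for i in range(len(blocks))}

    blocks = split_into_blocks(b"raw input data for the cluster")
    for block_id, hosts in place_replicas(blocks).items():
        print(block_id, hosts)          # each block lives on two servers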

MapReduce comprises two job types that work in parallel on distributed systems: (1) Mappers, which create and process transactions on the system by mapping and aggregating the data by key, and (2) Reducers, which know what that key is, take all the values stored against it in the map, and reduce the data to what is relevant (Hortonworks, 2013; Sathupadi, 2010). Reducers can work on different keys; when huge amounts of data enter MapReduce, the Mappers map the data, which is then shuffled and sorted before it is reduced (Hortonworks, 2013). Once the data is reduced, the researcher gets the output they sought.
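
The map / shuffle-and-sort / reduce flow can be sketched in-process with the classic word-count example; this is illustrative only, since a real Hadoop job would be written against the MapReduce API or as Hadoop Streaming scripts and run across the cluster:

    # Toy word-count sketch of the MapReduce pattern: map, shuffle/sort, reduce.
    from itertools import groupby
    from operator import itemgetter

    def mapper(line):
        # Map phase: emit one (key, value) pair per word, keyed by the word.
        for word in line.split():
            yield (word.lower(), 1)

    def reducer(key, values):
        # Reduce phase: collapse all values that share a key into one result.
        return (key, sum(values))

    lines = ["Spark is fast", "Hadoop is scalable", "Spark is in memory"]

    pairs = [pair for line in lines for pair in mapper(line)]      # map
    pairs.sort(key=itemgetter(0))                                  # shuffle & sort
    results = [reducer(key, [v for _, v in group])                 # reduce
               for key, group in groupby(pairs, key=itemgetter(0))]
    print(results)   # e.g. [('fast', 1), ('hadoop', 1), ..., ('is', 3), ...]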

Significant Differences between Hadoop and Apache Spark

On iterative operations, Spark is faster than Hadoop by roughly 25x-40x for very small datasets and 3x-5x for relatively large ones, but Spark is more memory intensive, and its speed advantage disappears once available memory runs out on very large datasets (Gu & Li, 2013). On its website, Apache Spark claims it can run programs up to 100x faster than Hadoop's MapReduce when working in memory, and about 10x faster when working from disk (Spark, n.d.). Spark has also been shown to outperform Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013).

Gu and Li (2013) recommend that if speed to a solution is not an issue but memory is, Spark should not be prioritized over Hadoop; however, if speed to a solution is critical and the job is iterative, Spark should be prioritized.

References

  • Gu, L., & Li, H. (2013). Memory or time: Performance evaluation for iterative operation on Hadoop and Spark. In 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC/EUC) (pp. 721-727). IEEE.