Big data can be defined as any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014). What is considered to be big data can change with respect to time. What is considered as big data in 2002 is not considered big data in 2016 due to advancements made in technology over time (Fox & Do, 2013). However, given that big data today is too big to be processed just by using one processor, the use of parallel processing allows for data analytics to be conducted through platforms like Hadoop more efficiently (Hortonworks, 2013; IBM, n.d.).
Hadoop: Basic Components and Architecture
Hadoop’s service is part of cloud (as Platform as a Service = PaaS). For PaaS, the end users manage the applications and data, whereas the provider (Hadoop), administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001).
Hadoop is predominately known for its Hadoop Distributed File System (HDFS) where the data is distributed across multiple systems and its code for running MapReduce tasks (Rathbone, 2013). Data is broken up into small blocks, like Legos, such that they are distributed across a distributed database system and across multiple servers (IBM, n.d.). Just like Legos, the end the results can be assembled back. This feature of HDFS allows for Hadoop to manage big data through parallel processing and analysis (Gary et al., 2005, Hortonworks, 2013; IBM, n.d.). Multiple data types are supported through the HFDS (IBM, n.d.) For Hadoop’s MapReduce function, it can be broken down into two queries.
Parallel processing is key for Hadoop, because it allows for making quick work on a big data set, because rather than having one processor doing all the work, Hadoop splits up the task amongst many processors. One of MapReduce’s main two queries is that it splits the data into the Lego pieces and places them across a group of computer nodes in the HDFS called the mapping procedure (Eini, 2010; IBM, n.d; Hortonworks, 2013; Sathupadi, 2010). The second MapReduce query applied algorithms to reduce the data in each of the computer nodes equally to answer the question that was asked of the data; such that at the end of the parallel processing procedures, the reduced data gets combined and further reduced to provide the final answer (Eini, 2010; IBM, n.d; Hortonworks, 2013; Minelli et al., 2013; Sathupadi, 2010). In other words, data is partitioned, sorted and grouped to provide a key and value as an output (Hortonworks, 2013; Rathbone, 2013; Sathupadi, 2010). Therefore, IBM’s (n.d.) MapReduce functions use the HFDS to house the data and MapReduce runs its procedures on the server in which the data is stored. Data is stored in a memory, not in cache and allow for continuous service (Gu & Li, 2013; Zaharia et al., 2012).
Given the Lego blocks feature in the HDFS, which allows for MapReduce functions, these blocks can contain a subset of data, which are small enough that they can be easily duplicated (for disaster recovery purposes) in two or more different servers (IBM, n.d.). This partitioning of the data into data Lego blocks allows for big iterative tasks to be done quite easily and efficiently for big data sets (Gu & Li, 2013).
When to use Hadoop
Gu and Li (2013), recommend that if speed to the solution is not an issue, but memory is, then Spark shouldn’t be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative Spark should be prioritized. Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets, 3x-5x for relatively large datasets, but Spark is more memory intensive, and speed advantage disappears when available memory goes down to zero with really large datasets (Gu & Li, 2013). Also, Hadoop fails in providing a real-time response (Greer, Rodriguez-Martinez, & Seguel, 2010). Therefore, for big data that isn’t streaming real-time data and has a ton of iterative processing/analytical tasks Hadoop should be used.
Preparation of Big Data for Hadoop
Collecting the raw and unaltered real world data is usually the first step of any data or text mining study (Coralles et al., 2015; Gera & Goel, 2015; He et al., 2013; Hoonlor, 2011; Nassirtoussi et al., 2014). Next, the data must be preprocessed, because raw text data files are unsuitable for predictive data analytics tools like Hadoop (Hoonlor, 2011). Barak and Modarres (2015) and Nassirtoussi et al. (2014), all stated that in both data and text mining, data preprocessing has the most significant impact on the research results. Wayner (2013) and Lublinksy, Smith, and Yakubovich (2013), enumerated the following tools used to preprocess data prior to data analysis with Hadoop as part of the core components of the ecosystem:
- Ambari: Graphical User Interface for setting up clusters with common components. Essentially a simple management tool.
- Avro: serialization systems that compiles all the data together into a XML or JSON output to be shared with others.
- BigTop: tool that provides testing of sub-projects within Hadoop.
- Clouds: Allows the end-user to spin up multiple nodes to process the data without necessarily owning the infrastructure, essentially pay as you go model
- Flume: Gathers all data and places it into HDFS. Essentially an enterprise data integration tool.
- GIS tools: allows end-users to work with big data stored as geographic maps under GIS (Geographic Information Systems) formats.
- HBase: helps search and share a big tabular data set, unfortunate full ACID is not available. Essentially a NoSQL Database.
- HDFS: Storage of big data in multiple distributed systems into data blocks. Essentially a Distributed reliable data storage.
- Hive: SQL type language that files and pulls out data that is needed from HBase. Essentially a high-level abstraction tool.
- Lucene: indexes large blocks of unstructured text based data and allows for dynamic clustering and ability to read XML
- Mahout: Allows for Hadoop to use classification, filtering, k-means, Dirichelet, parallel pattern, and Bayesian classification similar to Hadoops MapReduce. Essentially a data analytics library.
- NoSQL: Uses NoSQL data stores for data that is not typically stored in HBase or HDFS.
- Oozie: manages the workflow of a job by allowing the user to break the job into simple steps in a flowchart fashion. Essentially a workflow manager.
- Pig: stores and maps data in processing nodes for Hadoop to find and process. Essentially a high-level abstraction tool.
- Spark: uses Hadoop infrastructure to store data in the cache to allow for faster processing time
- SQL on Hadoop: ad-hoc query the data stored in Hadoop servers using SQL
- Sqoop: stores data in SQL databases into Hadoop. Essentially an enterprise data integration tool.
- Whirr: Library that allows to run Hadoop clusters on Amazon EC2, Rackspace, etc.
- ZooKeeper: maintains order and synchronization throughout the parallel processing cluster. Essentially a coordinator of processes.
According to Lublinksy et al. (2013), there are always new datasets, data formats, and data preprocessing and processing tools being added to Hadoop. Thus the list provided above is not a comprehensive list, but rather one to begin off from.
- Barak, S., & Modarres, M. (2015). Developing an approach to evaluate stocks by forecasting effective features with data mining methods. Expert Systems with Applications, 42(3), 1325–1339. http://doi.org/10.1016/j.eswa.2014.09.026
- Corrales, D. C., Ledezma, A., & Corrales, J. C. (2015). A Conceptual Framework for Data Quality in Knowledge Discovery Tasks (FDQ-KDT): A Proposal. Journal of Computers, V10(6), 396-405. Doi: 10.17706/jcp.10.6.396-405.
- Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
- Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
- Gera, M., & Goel, S. (2015). Data Mining-Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
- Greer, M., Rodriguez-Martinez, M., & Seguel, J. (2010). Open Source Cloud Computing Tools: A Case Study with a Weather Application.Florida: IEEE Open Source Cloud Computing.
- Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
- Gray, J., Liu, D. T., Nieto-Santisteban, M., Szalay, A., DeWitt, D. J., & Heber, G. (2005). Scientific data management in the coming decade. ACM SIGMOD Record, 34(4), 34-41.
- Gu, L., & Li, H. (2013). Memory or time: Performance evaluation for iterative operation on hadoop and spark. InHigh Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on (pp. 721-727). IEEE.
- Eini, O. (2010). Map/Reduce- a visual explanation. Retrieved from https://ayende.com/blog/4435/map-reduce-a-visual-explanation
- He, W., Zha, S., & Li, L. (2013). Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management, 33, 464–472. http://doi.org/10.1016/j.ijinfomgt.2013.01.001
- Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
- Hortonworks (2013). Introduction to MapReduce. Retrieved from https://www.youtube.com/watch?v=ht3dNvdNDzI
- IBM (n.d.) What is the Hadoop Distributed File System (HDFS)? Retrieved from https://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/
- Lau, W. (2001). A Comprehensive Introduction to Cloud Computing. Retrieved from https://www.simple-talk.com/cloud/development/a-comprehensive-introduction-to-cloud-computing/
- Lublinsky, B., Smith, K., Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox, VitalBook file.
- Minelli, M., Chambers, M., Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses (1st). VitalSource Bookshelf Online.
- Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2014). Text mining for market prediction: a systematic review. Expert Systems with Applications, 41(16), 7653–7670. http://doi.org/10.1016/j.eswa.2014.06.009
- Rathbone, M. (2013). A beginners guide to Hadoop. Retrieved from http://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html
- Sathupadi, K. (2010) Map Reduce: A really simple introduction. Retrieved from http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/
- Wayner, P. (2013). 18 essential Hadoop tools for crunching big data. InfoWorld. Retrieved from http://www.infoworld.com/article/2606340/hadoop/131105-18-essential-Hadoop-tools-for-crunching-big-data.html
- Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., Mccauley, M., … & Stoica, I. (2012). Fast and interactive analytics over Hadoop data with Spark. USENIX Login, 37(4), 45-51.