According to Gray et al. (2005), traditional data management relies on arrays and tables to analyze objects, which can range from financial data, galaxies, proteins, events, and spectra to 2D weather data. When it comes to N-dimensional arrays, however, there is an “impedance mismatch” between the data and the database. Big data can be N-dimensional and can also vary across time, e.g., text data (Gray et al., 2005). Traditional data management relied heavily on relational databases (where data is treated like linked systems), in which queries read and write the data held within these databases. From personal experience, scientists have not used relational databases; instead, file formats like NetCDF (which allows for metadata storage and N-dimensional data storage) meet their needs. Gray et al. (2005) also gathered that scientists choose not to use databases because: “They do not offer good visualization/plotting tools”, “I can handle my data volumes with my programming language”, “They do not support our data types (arrays, spatial, text, etc.)”, “They do not support our access patterns (spatial, temporal, etc.)”, “… they were too slow”, and “…once we loaded our data we could no longer manipulate the data using our standard application program”. These are some of the reasons why traditional data management has failed in the sciences, and some of the reasons why it is likely to fail in analyzing big data.
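The “impedance mismatch” can be made concrete with a small sketch. The grid, the table name `temperature`, and the helper `array_to_rows` are all hypothetical and chosen only for illustration; SQLite stands in for a generic relational database. The point is that a time slice is a single indexing step on the array, but a filtered scan plus manual reshaping once the same values live in rows.

```python
import sqlite3

# Hypothetical 3-D grid: temperature[time][lat][lon], 2 x 2 x 3 values.
temperature = [
    [[10.0, 11.0, 12.0], [13.0, 14.0, 15.0]],
    [[20.0, 21.0, 22.0], [23.0, 24.0, 25.0]],
]

def array_to_rows(grid):
    """Flatten a 3-D nested list into (t, lat, lon, value) tuples."""
    return [
        (t, i, j, value)
        for t, plane in enumerate(grid)
        for i, row in enumerate(plane)
        for j, value in enumerate(row)
    ]

# Relational storage needs one row per cell, with the indices as columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE temperature (t INTEGER, lat INTEGER, lon INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO temperature VALUES (?, ?, ?, ?)", array_to_rows(temperature)
)

# Array access: one indexing expression returns the whole time slice.
slice_from_array = temperature[1]

# Relational access: the same slice is a filtered scan that has to be
# reshaped back into a 2 x 3 grid by hand.
cursor = conn.execute(
    "SELECT lat, lon, value FROM temperature WHERE t = 1 ORDER BY lat, lon"
)
slice_from_table = [[None] * 3 for _ in range(2)]
for lat, lon, value in cursor:
    slice_from_table[lat][lon] = value
```

Both routes return the same values, but the relational one carries three index columns per value and loses the array's shape, which is one way to read the mismatch described above.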
From relational databases, we can move on to distributed databases, where data of any type or amount is stored and disseminated across different locations but can still be accessed (Minelli, Chambers, & Dhiraj, 2013). These systems grew in popularity because they need little setup. In traditional databases, you need to create a schema or entity relationship diagram (ERD) first. In distributed systems like the Hadoop Distributed File System (HDFS), you don’t have to. HDFS can support many different data types, even those that are unknown or yet to be classified, and it can store very large volumes of data. Thus, Hadoop’s technology for managing big data allows for parallel processing, which in turn allows for parallel searching, metadata management, parallel analysis (with MapReduce), the establishment of workflow system analysis, etc. (Gray et al., 2005; Hortonworks, 2013; IBM, n.d.).
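The difference between declaring a schema up front and storing data without one can be sketched as follows. This is a minimal illustration, not how HDFS itself is implemented: the in-memory list `raw_store` stands in for files on a distributed file system, and the names `readings` and `read_temperatures` are hypothetical.

```python
import json
import sqlite3

# Schema first: the table layout must be declared before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT NOT NULL, temp_c REAL NOT NULL)")
conn.execute("INSERT INTO readings VALUES (?, ?)", ("A1", 21.5))

# No schema: raw records are stored untouched, whatever their shape.
raw_store = [
    '{"station": "A1", "temp_c": 21.5}',
    '{"station": "B2", "temp_c": 19.0, "humidity": 0.4}',
    "free-form note with no structure at all",
]

def read_temperatures(lines):
    """Apply a structure only at read time, skipping records that do not fit."""
    result = {}
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured record: kept in storage, ignored by this query
        if isinstance(record, dict) and "temp_c" in record:
            result[record["station"]] = record["temp_c"]
    return result

temperatures = read_temperatures(raw_store)
```

The relational table would reject the third record outright, while the raw store keeps it for a later query that knows how to interpret it.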
Given the massive amounts of data in Big Data that need to be processed, manipulated, and calculated upon, parallel processing and programming use the benefits of distributed systems to get the job done (Minelli et al., 2013). Hadoop, which is Java based, allows manipulation and calculations to be done by calling on MapReduce, which pulls on the data distributed across its servers, maps key items/objects, and reduces the data to answer the query at hand. Parallel processing makes quick work of a big data set because, rather than having one processor do all the work, you split the task up among many processors. This is the largest benefit of parallel processing. Another advantage is that when one processor/node goes out, another node can pick up the task from its last saved state (which can slow down the calculation, but only by a bit). Hadoop assumes that such failures happen all the time, and that more than just the processor/node may stop working, so it keeps backup copies of its data (IBM, n.d.). Another processor/node can then continue the work on the copied data, which enhances data availability and, in the end, gets the task done when you need it.
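The map and reduce steps described above can be sketched as a word count. This is a minimal single-machine illustration and not Hadoop's Java API: the function names are hypothetical, each string in `chunks` stands in for a block of data held on a different node, and a thread pool stands in for the cluster (in Python it shows the structure of the computation rather than a real speedup).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(item):
    """Reduce: collapse the values for one key into a single count."""
    key, values = item
    return key, sum(values)

def word_count(chunks, workers=4):
    # Each chunk is mapped independently, so the work can be split
    # across workers; the reduce step runs once per distinct key.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
        reduced = list(pool.map(reduce_phase, shuffle(mapped).items()))
    return dict(reduced)

chunks = ["big data is big", "data is distributed", "big clusters process data"]
counts = word_count(chunks)
```

Because no map call depends on another, a failed chunk could simply be rerun on a different worker against a copy of the same data, which is the recovery behavior the paragraph above describes.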
Minelli et al. (2013) stated that traditional relational database systems can depend on hardware architecture. Hadoop, however, can be offered as part of the cloud, as a Platform as a Service (PaaS). With PaaS, we manage the applications and data, whereas the provider of the Hadoop service administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001).
Resources:
- Gray, J., Liu, D. T., Nieto-Santisteban, M., Szalay, A., DeWitt, D. J., & Heber, G. (2005). Scientific data management in the coming decade. ACM SIGMOD Record, 34(4), 34-41.
- Hortonworks (2013). Introduction to MapReduce. Retrieved from https://www.youtube.com/watch?v=ht3dNvdNDzI
- IBM (n.d.). What is the Hadoop Distributed File System (HDFS)? Retrieved from https://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/
- Lau, W. (2001). A Comprehensive Introduction to Cloud Computing. Retrieved from https://www.simple-talk.com/cloud/development/a-comprehensive-introduction-to-cloud-computing/
- Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses (1st ed.). [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781118870815/