Dr. Skylar (Sky) Hernandez

Innovator, Data Analyst, and Diversity Champion, with passions for Weather and Investing

Big Data Analytics: Hadoop & Parallel Processing

According to Gray et al. (2005), traditional data management relies on arrays and tables to analyze objects, which can range from financial data, galaxies, proteins, events, and spectra to 2D weather data. When it comes to N-dimensional arrays, however, there is an “impedance mismatch” between the data and the database. Big data can be N-dimensional and can also vary across time, e.g. text data (Gray et al., 2005). Traditional data management relied heavily on relational databases (where data was treated like linked systems), in which queries read and write the data held in those databases. From personal experience, scientists have tended not to use relational databases; instead, file formats like NetCDF (which allows for meta-data storage and N-dimensional data storage) meet their needs. Gray et al. (2005) also found that scientists choose not to use databases because: “They do not offer good visualization/plotting tools”, “I can handle my data volumes with my programming language”, “They do not support our data types (arrays, spatial, text, etc.)”, “They do not support our access patterns (spatial, temporal, etc.)”, “… they were too slow”, and “…once we loaded our data we could no longer manipulate the data using our standard application program”. These are some of the reasons why traditional data management has failed in the sciences, and some of the reasons why it will fail in analyzing big data.
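To make the “impedance mismatch” a little more concrete, here is a minimal sketch in plain Python. The 3-D grid (time, latitude, longitude), its values, and the helper `to_rows` are all made up for illustration; they are not taken from Gray et al. (2005) or from NetCDF. The sketch only shows that a slice that is a single index lookup on an array becomes a filter over every row once the same data is flattened into a table.

```python
from itertools import product

# Hypothetical 3-D field: 2 time steps x 3 latitudes x 4 longitudes.
shape = (2, 3, 4)
grid = [[[t * 100 + y * 10 + x for x in range(shape[2])]
         for y in range(shape[1])]
        for t in range(shape[0])]

def to_rows(grid, shape):
    """Flatten the array into relational-style rows (t, lat, lon, value)."""
    return [(t, y, x, grid[t][y][x])
            for t, y, x in product(*(range(n) for n in shape))]

rows = to_rows(grid, shape)

# Array access: one latitude row at a fixed time is a direct index.
array_slice = grid[1][2]

# Table access: the same slice needs a filter over every row,
# plus a sort to restore the longitude order.
table_slice = [v for (t, y, x, v) in sorted(rows) if t == 1 and y == 2]
```

Both slices hold the same values, but the table version has lost the array's shape and must rebuild it on every query, which is one way to read the access-pattern complaints quoted above.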

From relational databases, we can move on to distributed databases, where data of any type or amount is stored and disseminated across different locations but can still be accessed (Minelli, Chambers, & Dhiraj, 2013). These systems grew popular because they need little setup. In traditional databases, you need to create a schema or entity relationship diagram (ERD) first. In distributed systems like the Hadoop Distributed File System (HDFS), you don’t have to. HDFS can store many different data types, even those that are unknown or yet to be classified, and it can store very large volumes of data. Thus, Hadoop’s technology for managing big data allows for parallel processing, which in turn allows for parallel searching, metadata management, parallel analysis (with MapReduce), the establishment of workflow system analysis, etc. (Gray et al., 2005; Hortonworks, 2013; IBM, n.d.).
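As a rough illustration of data being “stored and disseminated across different locations” while remaining accessible, the sketch below splits a byte string into blocks and assigns each block to more than one node. It is a toy model, not the HDFS API: the node names, the round-robin placement, the block size, and the functions `place_blocks` and `read_all` are all assumptions made for this example. Note that no schema is declared; the data is treated as raw bytes.

```python
def place_blocks(data, block_size, nodes, replicas=2):
    """Split data into blocks and assign each block to `replicas`
    distinct nodes, chosen round-robin (requires replicas <= len(nodes))."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for idx, block in enumerate(blocks):
        holders = [nodes[(idx + r) % len(nodes)] for r in range(replicas)]
        placement[idx] = (block, holders)
    return placement

def read_all(placement, down=frozenset()):
    """Reassemble the data, reading each block from any node still up."""
    parts = []
    for idx in sorted(placement):
        block, holders = placement[idx]
        if all(node in down for node in holders):
            raise IOError(f"block {idx} unavailable")
        parts.append(block)
    return b"".join(parts)

# Hypothetical three-node cluster holding schema-less raw bytes.
nodes = ["node-a", "node-b", "node-c"]
original = b"2D weather, spectra, text"
placement = place_blocks(original, block_size=8, nodes=nodes)

# Losing one node does not lose any data, since every block has a second copy.
recovered = read_all(placement, down={"node-b"})
```

With two copies of each block, the read survives the loss of any single node; losing two nodes at once can make a block unavailable, which is why the number of copies is a tuning decision in real systems.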

Given the massive amounts of data in Big Data that need to be processed, manipulated, and calculated upon, parallel processing and programming use the benefits of distributed systems to get the job done (Minelli et al., 2013). Hadoop, which is Java-based, allows manipulation and calculations to be done by calling on MapReduce, which pulls the data distributed across its servers, maps key items/objects, and reduces the data to answer the query at hand. Parallel processing makes quick work of a big data set because, rather than having one processor do all the work, you split the task among many processors. This is the largest benefit of parallel processing. Another advantage is that when one processor/node goes out, another node can pick up the task from the last saved safe object (which can slow down the calculation, but only by a bit). Hadoop expects such failures to happen all the time, so it keeps backup copies of its data (IBM, n.d.). If the failed node also held the data, another processor/node can continue the work on the copied data, which enhances data availability and, in the end, gets the task done.
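The map, group, and reduce steps described above can be sketched in a few lines of plain Python. This is not Hadoop's Java API; it is a minimal word-count sketch of the pattern, with a thread pool standing in for the cluster's nodes. The function names (`map_chunk`, `shuffle`, `reduce_group`, `word_count`) and the sample input are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_chunk(lines):
    """Map step: emit (word, 1) pairs for one chunk of the input."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_group(item):
    """Reduce step: collapse one key's values into a single count."""
    key, values = item
    return key, sum(values)

def word_count(lines, workers=3):
    # Split the input so each worker handles its own chunk.
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
        reduced = dict(pool.map(reduce_group, shuffle(mapped).items()))
    return reduced

lines = ["big data big analytics", "data nodes", "big nodes"]
counts = word_count(lines)
```

Each worker only sees its own chunk during the map step, so the chunks can be processed independently; the shuffle is the only point where results from different workers have to meet.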

Minelli et al. (2013) stated that traditional relational database systems can depend on hardware architecture. Hadoop, however, can be offered as part of the cloud, as a Platform as a Service (PaaS). Under PaaS, we manage the applications and data, whereas the provider of the Hadoop service administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001).

Resources:

  • Gray, J., Liu, D. T., Nieto-Santisteban, M., Szalay, A., DeWitt, D. J., & Heber, G. (2005). Scientific data management in the coming decade. ACM SIGMOD Record, 34(4), 34-41.
  • Hortonworks (2013). Introduction to MapReduce. Retrieved from https://www.youtube.com/watch?v=ht3dNvdNDzI
  • IBM (n.d.). What is the Hadoop Distributed File System (HDFS)? Retrieved from https://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/
  • Lau, W. (2001). A Comprehensive Introduction to Cloud Computing. Retrieved from https://www.simple-talk.com/cloud/development/a-comprehensive-introduction-to-cloud-computing/
  • Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses (1st ed.). [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781118870815/

 

Author: SkyHernandez
Posted on: August 13, 2016 (updated July 29, 2019)
Categories: Introduction to Big Data Analytics
Tags: big data, big data analytics, cloud, data management, distributed databases, entity relationship diagram, ERD, hadoop, Hadoop Distributed File System, HFDS, MapReduce, meta-data storage, N-dimensional, nodes, PaaS, parallel processing, Platform as a Service, relational database, safe objects, traditional data management, workflow systems
