Adv Topics: Distributed Programming

Distributed programming can be divided into the following two models:

  • Shared memory distributed programming: Serialized programs run on multiple threads, and all of the threads have access to the underlying data stored in shared memory (Sakr, 2014). The threads must be synchronized to ensure that reads and writes are not performed on the same segment of the shared data at the same time. Sandén (2011) and Sakr (2014) stated that this can be achieved with semaphores (which signal to other threads that data is being written or posted and that they should wait to use the data until a condition is met), locks (data can be locked or unlocked for reading and writing), and barriers (threads cannot proceed to the next step until everything preceding it has completed). A famous example of this style of parallel programming is the use of MapReduce on data stored in the Hadoop Distributed File System (HDFS) (Lublinsky, Smith, & Yakubovich, 2013; Sakr, 2014). The data is stored in HDFS, and both the mapper and reducer functions can access it there.
  • Message passing distributed programming: Data is stored in one location, and a master thread distributes chunks of the data to sub-tasks and threads so that the overall data is processed in parallel (Sakr, 2014). Explicit send and receive messages provide synchronized communication (Lublinsky et al., 2013; Sakr, 2014). At the end of the run, the data is merged by the master thread (Sakr, 2014). A famous example of this style of parallel programming is the Message Passing Interface (MPI), which many weather models, such as the Weather Research and Forecasting (WRF) model, use (Sakr, 2014; WRF, n.d.). The initial weather conditions are stored in one location, chunked into small pieces, and spread across the threads; the results are eventually joined to produce one cohesive forecast.
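
The difference between the two models can be sketched with a toy task: summing a list of numbers. This is only an illustration, not how Hadoop or MPI is implemented. The function names (shared_memory_sum, message_passing_sum) are hypothetical, and in the message passing half, Python threads connected by queues stand in for the separate processes or nodes that a real MPI program would use.

```python
import queue
import threading


def shared_memory_sum(values, n_threads=4):
    """Shared memory style: every thread updates one shared total under a lock."""
    total = {"value": 0}        # shared state visible to all threads
    lock = threading.Lock()     # guards reads and writes of the shared total

    def worker(chunk):
        partial = sum(chunk)    # work on a private slice first
        with lock:              # only one thread updates the total at a time
            total["value"] += partial

    chunks = [values[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # barrier-like step: wait for every worker
    return total["value"]


def message_passing_sum(values, n_workers=4):
    """Message passing style: the master sends chunks and receives partial sums."""
    inboxes = [queue.Queue() for _ in range(n_workers)]  # master -> worker
    results = queue.Queue()                              # worker -> master

    def worker(rank):
        chunk = inboxes[rank].get()         # explicit receive
        results.put((rank, sum(chunk)))     # explicit send

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for rank in range(n_workers):
        inboxes[rank].put(values[rank::n_workers])       # master scatters chunks
    partials = [results.get() for _ in range(n_workers)]  # master gathers results
    for t in threads:
        t.join()
    return sum(partial for _, partial in partials)       # master merges


if __name__ == "__main__":
    data = list(range(1000))
    print(shared_memory_sum(data), message_passing_sum(data))
```

In the first function the workers coordinate through a lock on common data; in the second they never touch common data and coordinate only through the messages they exchange with the master.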

However, there are six challenges to the distributed programming model: Heterogeneity, Scalability, Communications, Synchronization, Fault-tolerance, and Scheduling (Sakr, 2014). These six challenges are interrelated, so an increase in complexity in one of them can increase the complexity of one or more of the others. As a result, both shared memory and message passing distributed programming are insufficient when processing large-scale data in a cloud computing environment. This post will focus on two of these six:

  • Scalability is a challenge because the distributed processing system must remain effective as the number of users, the amount of data, and the requests for resources increase (Sakr, 2014). Using Hadoop and HDFS in the cloud mitigates scalability issues by providing a free, open-source way of managing such an explosion of data and demand on resources. However, storage costs on the cloud will also increase, even though they are usually 10% of the cost of normal information technology infrastructure (Minelli, Chambers, & Dhiraj, 2013). As the scale of resources increases, so can the resources needed to deal with communication and synchronization (Sakr, 2014).
  • Synchronization is a critical challenge because multiple threads must be able to share data without corrupting it or causing inconsistencies (Sandén, 2011; Sakr, 2014). Lublinsky et al. (2013) stated that MapReduce requires proper synchronization between the mapper and reducer functions to work. Improper synchronization can lead to issues in fault tolerance. Thus, efficient synchronization between read and write operations is vital and is within the control of the programmers (Sakr, 2014). The challenge arises when scalability is introduced: synchronization methods must be applied without degrading performance, causing deadlocks (where two tasks block each other while waiting for access to the same data), creating load-balancing issues, or wasting computational resources (Lublinsky et al., 2013; Sandén, 2011; Sakr, 2014).
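
One common way to avoid the deadlock described above is to have every task acquire its locks in the same fixed order. The sketch below is a minimal illustration of that idea and is not taken from the cited sources. The Account class and the transfer and run_transfers functions are hypothetical names, and the code assumes the two accounts passed to transfer are distinct.

```python
import threading


class Account:
    """Hypothetical shared record protected by its own lock."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    # Acquire both locks in one global order (alphabetical by name here).
    # If one thread locked src then dst while another locked dst then src,
    # each could hold one lock and wait forever for the other.
    # Assumes src and dst are different accounts.
    first, second = sorted((src, dst), key=lambda account: account.name)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def run_transfers(n=1000):
    a = Account("a", 1000)
    b = Account("b", 1000)

    def a_to_b():
        for _ in range(n):
            transfer(a, b, 1)

    def b_to_a():
        for _ in range(n):
            transfer(b, a, 1)

    t1 = threading.Thread(target=a_to_b)
    t2 = threading.Thread(target=b_to_a)
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return a.balance, b.balance


if __name__ == "__main__":
    print(run_transfers())
```

The two threads move money in opposite directions, which is the pattern that typically produces a deadlock when each side locks "its own" account first. The ordering rule removes the deadlock, but every transfer still serializes on the two locks, which is the performance cost the bullet above warns about.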

Resources

  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. VitalBook file.
  • Minelli, M., Chambers, M., & Dhiraj, M. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.
  • Sandén, B. I. (2011). Design of Multithreaded Software: The Entity-Life Modeling Approach. Wiley-Blackwell. VitalBook file.
  • Sakr, S. (2014). Large Scale and Big Data (1st ed.). VitalBook file.

Data Tools: Case Study on Hadoop’s effectiveness

Case Study: Open source Cloud Computing Tools: A case study with a weather application

Focus on: Hadoop V0.20, a Platform as a Service cloud solution with parallel processing capabilities

Cluster size: 6 nodes, with Hadoop, Eucalyptus, and Django-Python cloud interfaces installed

Variables: Historical average temperature, rainfall, humidity, and weather conditions per latitude and longitude across time, mapped on top of a Google Maps user interface

Data Source: Yahoo! Weather Page

Results/Benefits to the Industry: The Hadoop platform was evaluated against ten different criteria and compared to Eucalyptus and Django-Python on a scale of 0-3, where 0 “indicates [a] lack of adequate feature support” and 3 “indicates that the particular tool provides [an] adequate feature to fulfill the criterion.”

Table 1: The criterion matrix and numerical scores have been adopted from the results of Greer, Rodriguez-Martinez, and Seguel (2010).

Criterion | Description | Score
Management Tools | Tools to deploy, configure, and maintain the system | 0
Development Tools | Tools to build new applications or features | 3
Node Extensibility | Ability to add new nodes without re-initialization | 3
Use of Standards | Use of TCP/IP, SSH, etc. | 3
Security | Built-in security as opposed to use of 3rd party patches | 3
Reliability | Resilience to failures | 3
Learning Curve | Time to learn technology | 2
Scalability | Capacity to grow without degrading performance | Not assessed
Cost of Ownership | Investments needed for usage | 2
Support | Availability of 3rd party support | 3
Total | | 22
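
The total can be checked with a few lines of arithmetic. The sketch below copies the Hadoop scores from Table 1; treating Scalability as unscored (None) is an assumption based on the table giving it no number, and the helper names total_score and max_possible are hypothetical.

```python
# Hadoop scores as listed in Table 1; None marks a criterion with no score.
hadoop_scores = {
    "Management Tools": 0,
    "Development Tools": 3,
    "Node Extensibility": 3,
    "Use of Standards": 3,
    "Security": 3,
    "Reliability": 3,
    "Learning Curve": 2,
    "Scalability": None,  # assumption: left out of the total
    "Cost of Ownership": 2,
    "Support": 3,
}


def total_score(scores):
    """Sum only the criteria that actually received a score."""
    return sum(s for s in scores.values() if s is not None)


def max_possible(scores, top=3):
    """Best achievable total over the scored criteria on a 0 to `top` scale."""
    return top * sum(1 for s in scores.values() if s is not None)


if __name__ == "__main__":
    print(total_score(hadoop_scores), "out of", max_possible(hadoop_scores))
```

Under that assumption, the 22 points come from nine scored criteria, so the total is 22 out of a possible 27 rather than out of 30.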

Eucalyptus scored 18 and Django-Python scored 20, making Hadoop the better solution for this case study. The study mentioned that:

  • Management tools: configuration was done by hand with XML and text, not through a graphical user interface
  • Development tools: an Eclipse plug-in aids in debugging Hadoop applications
  • Node extensibility: Hadoop can accept new nodes with no interruption in service
  • Use of standards: uses TCP/IP, SSH, SQL, JDK 1.6 (Java Standard), Python V2.6, and Apache tools
  • Security: password-protected user accounts and encryption
  • Reliability: fault tolerance is present, and the user is shielded from the effects of failures
  • Learning curve: it is not intuitive and required some experimentation after practicing with online tutorials
  • Scalability: not assessed due to the limits of the study (6 nodes is not enough)
  • Cost of ownership: to be effective, Hadoop needs a cluster, even if it is built from cheap machines
  • Support: third-party support is available for Hadoop

The authors discuss how Hadoop fails to provide a real-time response and suggest that the batch code should send email notifications at the start of the job, at key points of the iteration, or at the end of the job when the output is ready. Hadoop is slower than the other two solutions that were evaluated, but its fault tolerance features make up for it. For set-up and configuration, Hadoop is simple to use.
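
The notification idea can be sketched as a thin wrapper around a batch job. This is not the authors' code. The function names (build_notification, run_batch_job), the addresses, and the notify_every parameter are all hypothetical, and the job steps are plain Python callables standing in for real Hadoop stages. The send argument is any callable that accepts an email message, so the example can run without a mail server.

```python
from email.message import EmailMessage


def build_notification(job_name, stage, recipient,
                       sender="hadoop-jobs@example.com"):
    """Build (but do not send) a status email for one stage of a batch job."""
    msg = EmailMessage()
    msg["Subject"] = f"[{job_name}] {stage}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"Batch job '{job_name}' reached stage: {stage}.")
    return msg


def run_batch_job(job_name, steps, recipient, send, notify_every=2):
    """Run the steps in order, notifying at the start, at key points, and at the end."""
    send(build_notification(job_name, "started", recipient))
    results = []
    for i, step in enumerate(steps, start=1):
        results.append(step())
        # Key-point notification every `notify_every` steps, except the last
        # step, which is covered by the final message below.
        if i % notify_every == 0 and i < len(steps):
            stage = f"completed step {i} of {len(steps)}"
            send(build_notification(job_name, stage, recipient))
    send(build_notification(job_name, "finished, output ready", recipient))
    return results


if __name__ == "__main__":
    outbox = []  # stands in for a mail server
    steps = [lambda: "map", lambda: "shuffle", lambda: "reduce", lambda: "write"]
    run_batch_job("weather-averages", steps, "analyst@example.com", outbox.append)
    for message in outbox:
        print(message["Subject"])
```

In a real deployment, send could be the send_message method of an smtplib.SMTP connection; here it simply appends each message to a list so the flow of notifications can be inspected.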

Was Hadoop used to its fullest extent?

In my opinion, and in the opinion of the authors, Hadoop was not fully used: the authors stated that they could not scale their research because the study was limited to a 6-node cluster. Hadoop is built to ingest and process big data sets from various sources and formats to help deliver data-driven insights, and the scalability features that serve this purpose were not adequately addressed in this study.

Resources

  • Greer, M., Rodriguez-Martinez, M., & Seguel, J. (2010). Open Source Cloud Computing Tools: A Case Study with a Weather Application. Florida: IEEE Open Source Cloud Computing.