Adv Topics: CAP Theory and NoSQL Databases

Brewer (2000) and Gilbert and Lynch (2012) concluded that for a distributed shared-data system you could only have at most two of the three properties: consistency, availability, partition-tolerance (CAP theory). Gilbert and Lynch (2012) describes these three as akin to the safety of the data, live data, and reliability of the data. Thus, systems that are giving up

  • consistency creates a system that needs expirations, conflict resolution, and optimistic locking (Brewer, 2000). A lack of consistency means that there is a chance that the data or processes may not return the right response to a request (Gilbert & Lynch, 2012).
  • availability creates a system that needs pessimistic locking and making some partitions unavailable (Brewer, 2000). A lack of availability means that there is a chance that a request may not get a response (Gilbert & Lynch, 2012).
  • Partition-tolerance creates a system that needs a 2-phase commit and cache validation profiles (Brewer, 2000). A lack of partition-tolerance means that there is a chance that messages between servers, tasks, threads, can be lost forever and never are committed (Gilbert & Lynch, 2012).

Therefore, in a NoSQL distributed database systems (DDBS), it means that partition-tolerance should exist, and therefore administrators should then select between consistency and availability (Gilbert & Lynch, 2012; Sakr, 2014). However, if the administrators focus on availability they can try to achieve weak consistency, or if the administrators focus on consistency, they are planning on having a strong consistency system. An availability focus is having access to the data even during downtimes (Sakr, 2014). However, providing high levels of availability can cost money. Per the web application Uptime.is:

Availability Level Monthly downtime Yearly downtime
99.9% 43m 49.7s 8h 45m 75.0s
99.99% 4m 23.0s 52m 35.7s
99.999% 26.3s 5m 15.6s
99.9999% 2.6s 31.6s

To achieve high levels of availability means having a set of fail-safe systems to build for fault tolerance.

From the previous paragraph, there is both strong and weak consistency. Strong consistency ensures that all copies of the data are updated in real-time, whereas weak consistency means that eventually all the copies of the data will be updated (Connolly and Begg, 2014; Sakr, 2014). Thus, there is a resource cost to have stronger consistency over weaker consistency due to how fast the data needs to be updated (Gilbert & Lynch, 2012). Consequently, this is where the savings come from when handling for overhead in a NoSQL DDBS.

Finally, the table below illustrates some of the NoSQL databases that are either an AP or CP system (Hurst, 2010).

Availability & Partition Tolerance

NoSQL systems

Consistency & Partition Tolerance

NoSQL systems

Dynamo, Voldemort, Tokyo Cabinet, KAI, Riak, CouchDB, SimpleDB, Cassandra Big Table, MongoDB, Terrastore, Hypertable, Hbase, Scalaris, Berkley DB, MemcacheDB, Redis

 Resources

  • Brewer, E. (2000). Towards robust distributed systems. Proceedings of 19th Annual ACM Symposium Principles of Distributed Computing (PODC00). 7–10.
  • Connolly, T., & Begg, B. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Gilbert, S., and Lynch N. A. (2012). Perspectives on the CAP Theorem. Computer 45(2), 30–36. doi: 10.1109/MC.2011.389

 

Adv Topics: Distributed Programing

Distributed programming can be divided into the following two models:

  • Shared memory distributed programming: Is where serialized programs run on multiple threads, where all the threads have access to the underlying data that is stored in shared memory (Sakr, 2014). Each thread should be synchronized as to ensure that read and write functions aren’t being done on the same segment of the shared data at the same time. Sandén (2011) and Sakr, (2014) stated that this could be achieved via semaphores (signals other threads that data is being written/posted and other threads should wait to use the data until a condition is met), locks (data can be locked or unlocked from reading and writing), and barriers (threads cannot run on this next step until everything preceding it is completed). A famous example of this style of parallel programming is the use of MapReduce on data stored in the Hadoop Distributed File System (HDFS) (Lublinsky, Smith, & Yakubovich, 2013; Sakr, 2014). The HDFS is where the data is stored, and the mapper and reducers functions can access the data stored in the HDFS.
  • Message passing distributed programming: Is where data is stored in one location, and a master thread helps spread chunks of the data onto sub-tasks and threads to process the overall data in parallel (Sakr, 2014).       There are explicitly direct send and receive messages that have synchronized communications (Lublinsky et al., 2013; Sakr, 2014).   At the end of the runs, data is the merged together by the master thread (Sakr, 2014). A famous example of this style of parallel programming is Message Passing Interface (MPI), such that many weather models like the Weather Research and Forecasting (WRF) model benefits use this form of distributed programming (Sakr, 2014; WRF, n.d.). The initial weather conditions are stored in one location and are chucked into small pieces and spread across the threads, which are then eventually joined in the end to produce one cohesive forecast.

However, there are six challenges to distributed programming model: Heterogeneity, Scalability, Communications, Synchronization, Fault-tolerance, and Scheduling (Sakr, 2014). Each of these six challenges is interrelated. Thus, an increase in complexity in one of these challenges can increase the level of complexity of one or more of the other ones. Therefore, both the shared memory and message passing distributed programming are insufficient when processing the large-scale data in cloud computing environment. This post will focus on two of these six:

  • Scalability issues exist when an increase in the number of users, the amount of data, and request for resources and the distributed processing system can still be effective (Sakr, 2014). Using Hadoop and HDFS in the cloud allows for a mitigation of the scalability issues by providing a free open-source way of managing such an explosion of data and demand on resources. But, the storage costs on the cloud will also increase, even though it is usually 10% of the cost than normal information technology infrastructure (Minelli, Chambers, & Dhiraj, 2013). As the scale of resources increase, it can also increase a number of resources needed for a deal with communication and synchronization (Sakr, 2014).
  • Synchronization is a critical challenge that must be addressed because multiple threads should be able to share data without corrupting the data or cause inconsistencies (Sandén, 2011; Sakr, 2014). Lublinsky et al. (2013), stated that MapReduce requires proper synchronization between the mapper and reducer functions to work. Improper synchronization can lead to issues in fault tolerance. Thus, efficient synchronization between reading and write operations are vital and are within the control of the programmers (Sakr, 2014). The challenge comes when scalability issues are introduced and applying synchronization methods without degrading performances, causing deadlocks where two tasks want access to the same data, load balancing issues, or wasteful use of computational resources (Lublinsky et al., 2013; Sandén, 2011; Sakr, 2014).

Resources

  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
  • Minelli, M., Chambers, M., & Dhiraj, M. (2013) Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.
  • Sandén, B. I. (2011) Design of Multithreaded Software: The Entity-Life Modeling Approach. Wiley-Blackwell. VitalBook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.