Adv DB: Key-value DBs

NoSQL and Key-value databases

A recap from my last post: “Not Only SQL” databases, best known as NoSQL, include aggregate-oriented databases such as key-value, document, and column-family stores (Sadalage & Fowler, 2012). Aggregates are related sets of data that we would like to treat as a unit (MUSE, 2015c). Relationships between units/aggregates are captured in the relational mapping (Sadalage & Fowler, 2012). A key-value database maps each aggregate to a key; the aggregate is stored as the value under that key.

Consider a bank account: my social security number may be used as a key to bring up all my accounts, namely my checking, my 2 savings, and my mortgage loan.  The aggregate is my account, but savings, checking, and a mortgage loan behave differently and can live in different databases distributed across different physical locations.
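To make the bank example concrete, here is a minimal sketch of a key-value lookup in Python. It assumes a plain in-memory dictionary stands in for the database; the `put` and `get` helpers, the key `"123-45-6789"`, and every account figure are made up for illustration and do not come from any real product.

```python
# Hypothetical in-memory stand-in for a key-value database.
# One key maps to one aggregate (here, everything the bank holds for me).
store = {}

def put(key, aggregate):
    """Store the whole aggregate under a single key."""
    store[key] = aggregate

def get(key):
    """Fetch the whole aggregate by key, or None if the key is unknown."""
    return store.get(key)

# The key and all balances below are invented for this example.
put("123-45-6789", {
    "checking": [{"id": "chk-1", "balance": 1200.00}],
    "savings": [
        {"id": "sav-1", "balance": 5000.00},
        {"id": "sav-2", "balance": 750.00},
    ],
    "mortgage": [{"id": "mtg-1", "balance": -180000.00}],
})

accounts = get("123-45-6789")
for account_type, items in accounts.items():
    print(account_type, len(items))
```

The store never looks inside the value; it only knows the key. In a real deployment each account type could live on a different node, and the aggregate would hold references rather than the full records.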

These NoSQL databases can be schemaless, meaning data can be stored without any predefined schema.  NoSQL is best for application-specific databases, not as a substitute for all relational databases (MUSE, 2015b).  NoSQL databases can also have an implicit schema, where the data definition lives in the application that reads and writes the data rather than in the database itself.
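A small sketch of what an implicit schema can look like in practice, assuming Python and a dictionary standing in for the schemaless store. The `Customer` class, the `load_customer` helper, and the sample records are all hypothetical; the point is only that the application, not the database, decides which fields matter.

```python
from dataclasses import dataclass

@dataclass
class Customer:
    """The application's view of a record: this is the implicit schema."""
    name: str
    email: str = ""

def load_customer(raw: dict) -> Customer:
    # The store accepted whatever was written; the application picks the
    # fields it understands and supplies defaults for the ones missing.
    return Customer(name=raw["name"], email=raw.get("email", ""))

# Two records with different shapes, both accepted by a schemaless store.
records = {
    "cust:1": {"name": "Ada", "email": "ada@example.com"},
    "cust:2": {"name": "Grace", "phone": "555-0100"},
}

customers = {key: load_customer(raw) for key, raw in records.items()}
print(customers["cust:2"])
```

The trade-off is that a change to `Customer` is effectively a schema change, only it is enforced in code instead of by the database.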

MapReduce & Materialized views

According to Hortonworks (2013), the MapReduce process at a high level is: Input -> Map -> Shuffle and Sort -> Reduce -> Output.

Jobs:  Mappers process a data set stored in a distributed system and emit the wanted data as key-value pairs (a map/aggregate under a certain key).  Reducers know what the keys are; each takes all the values that mappers on different nodes of the cluster emitted under the same key and reduces them to the data that is relevant (MUSE, 2015a; Hortonworks, 2013). Different reducers can work on different keys.

Benefit: MapReduce knows where the data is placed, so it brings the tasks/computations to the data (to the node in the distributed system where the data is located).  Without MapReduce, tasks/computations take place after moving data from one place to another, which can eat up computational resources (Hortonworks, 2013).  From this, we know that the data is stored across a cluster of multiple processors, and what MapReduce tries to do is map the data (generate new data sets and store them as key-value pairs) and reduce the data (combine the output of one or more maps into a smaller set of key-value pairs) (MUSE, 2015a).

Other advantages:  Map and reduce functions can work independently, while the grouper (which groups key-values by key) and the master (which divides the work amongst the nodes in a cluster) coordinate all the actions, and the whole process can run really fast (Sathupadi, 2010).  However, depending on the task division, the workload of the mapping and reducing functions can vary greatly amongst the nodes in a cluster.  Work within a phase does not have to happen in sequential order, and a node can sometimes be a mapper and/or a grouper at any one time of the transaction request.

A great example of a MapReduce request is to look at all CTU graduate students and sum up their current outstanding school loans per degree level.  The final output from our example would be the Doctoral Students Current Outstanding School Loan Amount and the Master Students Current Outstanding School Loan Amount.  If I ran this in Hadoop, I could use 50 nodes to process this request.  The data that gets thrown out in the mapper phase would be the Undergraduate Students.  Doctoral students get one key and Master students get another, and each key is the same on every node, so that the sum of all current outstanding school loan amounts gets processed under the correct group.
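The student loan request can be sketched in plain Python. This is a single-process illustration of the Map -> Shuffle -> Reduce flow, not Hadoop code: the `mapper`, `shuffle`, `reducer`, and `map_reduce` names are my own, and the student records and loan amounts are invented.

```python
from collections import defaultdict

# Invented sample data; on a real cluster this would be split across nodes.
students = [
    {"level": "Doctoral", "loan": 40000},
    {"level": "Master", "loan": 25000},
    {"level": "Undergraduate", "loan": 15000},
    {"level": "Doctoral", "loan": 10000},
    {"level": "Master", "loan": 5000},
]

def mapper(record):
    """Emit (degree level, loan) for graduate students; drop everyone else."""
    if record["level"] in ("Doctoral", "Master"):
        yield record["level"], record["loan"]

def shuffle(pairs):
    """Group the emitted values by key, as the grouper would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)

def map_reduce(records):
    pairs = (pair for record in records for pair in mapper(record))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())

totals = map_reduce(students)
print(totals)  # {'Doctoral': 50000, 'Master': 30000}
```

Because each call to `mapper` looks at one record and each call to `reducer` looks at one key, both steps could be spread over many nodes without changing the result.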

Resources

Adv DB: NoSQL DB

Emergence

Relational databases will persist due to ACID, ERDs, concurrency control, transaction management, and SQL capabilities.  It also helps that major software can easily integrate with these databases.  But the reason so many new approaches keep popping up is the impedance mismatch: a resource cost paid on computational systems whenever data is pulled and pushed between in-memory structures and databases.  This resource cost can compound fast with big amounts of data.  Industry wants and needs to use parallel computing with clusters to store, retrieve, and manipulate big amounts of data.  Data can also be aggregated into units of related data, and strict data consistency can be relaxed in real-life applications, since their transactions can actually be divided into multiple phases (MUSE, 2015a).

Think of a bank transaction: not all the transactions you make at the same time get processed at the same time.  They may show up on your mobile device (mobile database), yet not be committed until a few hours or days later.  In my case, the bank withdraws my mortgage payment from my checking account on the first of every month but applies it to the loan on the second.  For those 24 hours, my payment is pending.

The aforementioned ideas have created a movement to support “Not Only SQL” databases, best known as NoSQL, a name derived from the Twitter hashtag #NoSQL.  NoSQL includes aggregate-oriented databases such as key-value, document, and column-family stores, as well as aggregate-ignorant databases such as graph databases (Sadalage & Fowler, 2012). These can be schemaless databases, where data can be stored without any predefined schema.  NoSQL is best for application-specific databases, not as a substitute for all relational databases (MUSE, 2015b).

Originally meant for open-source, distributed, nonrelational databases like Voldemort, Dynomite, CouchDB, MongoDB, and Cassandra, the term has expanded in its definition and in the applications/platforms it can take on.  CQL comes from Cassandra and was written to act like SQL in most cases but to act differently when needed (Sadalage & Fowler, 2012), hence the “No” in NoSQL.

Suitable Applications

According to Cassandra Planet (n.d.), NoSQL is best for large data sets (big data, complex data, and data mining):

  • Graph: where data relationships are graphical and interconnected like a web (ex: Neo4j & Titan)
  • Key-Value: data is stored and indexed by a key (ex: Cassandra, DynamoDB, Azure Table Storage, Riak, & BerkeleyDB)
  • Column Store: stores tables as columns rather than rows (ex: Hbase, BigTable, & HyperTable)
  • Document: can store more complex data, with each document having a key (ex: MongoDB & CouchDB).

System Platform

In relational databases there is a resource cost, but as industry wants to deal with big amounts of data, we can gravitate towards NoSQL.  To process all that data, we may need to use parallel computing with clusters to store, retrieve, and manipulate it.

 References: