Adv DB: Key-value DBs

NoSQL and Key-value databases

A recap from my last post: “Not Only SQL” databases, better known as NoSQL, include aggregate databases like key-value, document, and column-family (Sadalage & Fowler, 2012). Aggregates are related sets of data that we would like to treat as a unit (MUSE, 2015c). Relationships between units/aggregates are captured in the relational mapping (Sadalage & Fowler, 2012). A key-value database maps an entire aggregate to a key; the aggregate’s data is embedded as the value in a key-value pair.

Consider a bank account: my social security number may be used as the key to bring up all my accounts: my checking, my two savings, and my mortgage loan.  The aggregate is my account, but savings, checking, and a mortgage loan behave differently and can exist in different databases distributed across different physical locations.
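To make that concrete, here is a minimal sketch that uses a plain Python dict to stand in for a key-value store; the SSN key and the account data are made up for illustration.

```python
import json

# A plain dict standing in for a key-value store (e.g., Redis, Riak).
# The store only understands get/put on keys; the value is an opaque blob.
kv_store = {}

# The whole aggregate (all of one customer's accounts) is serialized
# and stored under a single key -- here, a made-up SSN.
aggregate = {
    "checking": {"balance": 1200.00},
    "savings": [{"balance": 5000.00}, {"balance": 750.00}],
    "mortgage": {"principal": 210000.00},
}
kv_store["123-45-6789"] = json.dumps(aggregate)

# Retrieval is a single lookup by key; the database cannot query
# inside the value -- the application deserializes and navigates it.
accounts = json.loads(kv_store["123-45-6789"])
print(accounts["savings"][0]["balance"])  # 5000.0
```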

These NoSQL databases can be schemaless, meaning data can be stored without any predefined schema.  NoSQL is best for application-specific databases, not as a substitute for all relational databases (MUSE, 2015b).  NoSQL databases can also have an implicit schema, where the data definition lives in the application code that reads and writes the data, rather than in the database itself.
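A quick sketch of what “schemaless with an implicit schema” means in practice; the keys and field names are invented for illustration.

```python
# Schemaless: the store happily accepts values with different shapes.
kv_store = {
    "cust:1": {"name": "Ada", "balance": 100.0},
    "cust:2": {"name": "Linus", "balance": 55.0, "overdraft_limit": 500.0},
}

# Implicit schema: the application code *assumes* a "balance" field exists.
# The schema lives here, not in the database -- the reader and writer must
# change together, or reads start failing at runtime.
total = sum(record["balance"] for record in kv_store.values())
print(total)  # 155.0
```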

MapReduce & Materialized views

According to Hortonworks (2013), MapReduce’s process at a high level is: Input -> Map -> Shuffle and Sort -> Reduce -> Output.
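Here is a minimal, single-process sketch of that pipeline in Python, just to show the shape of the data at each stage; a real MapReduce engine runs the map and reduce phases in parallel across a cluster.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    # Map: each input record becomes zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]

    # Shuffle and sort: group all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: collapse each key's values into the final output.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}
```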

Jobs:  Mappers process a data set filed away in a distributed system and emit the wanted data as key-value pairs on a map/aggregate with a certain key.  Reducers know what the keys are, and gather all the values stored under the same key, which may sit on different nodes of the cluster (per the distributed system), reducing them down to the relevant result (MUSE, 2015a; Hortonworks, 2013). Different reducers can work on different keys.

Benefit: MapReduce knows where the data is placed, so it runs the tasks/computations on the node in the distributed system where the data is located.  Without MapReduce, tasks/computations take place only after moving data from one place to another, which can eat up computational resources (Hortonworks, 2013).  From this, we know that the data is stored in a cluster of multiple processors, and what MapReduce does is map the data (generate new data sets and store them in a key-value database) and reduce the data (collapse the values from one or more maps into a smaller set of key-value pairs) (MUSE, 2015a).

Other advantages:  The map and reduce functions can work independently, while the grouper (which groups key-value pairs by key) and the Master (which divides the work among the nodes in a cluster) coordinate all the actions, so the whole job can run very fast (Sathupadi, 2010).  However, depending on how the task is divided, the amount of mapping and reducing work can vary greatly among the nodes in a cluster.  Nothing has to happen in sequential order, and a node can be a mapper and/or a grouper at any point during a transaction request.

A great example of a MapReduce request is to look at all CTU graduate students and sum their current outstanding school loans per degree level.  The final output from our example would be the Doctoral students’ total outstanding school loan amount and the Master’s students’ total outstanding school loan amount.  If I ran this in Hadoop, I could use 50 nodes to process this transaction request.  The data that gets thrown out in the mapper phase would be the undergraduate students.  Doctoral students get one key and Master’s students get another, and those keys are the same on every node, so the sums of all current outstanding school loan amounts get processed under the correct group.
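Using the pipeline sketch from above, the student-loan example might look like this; the student records and field names are invented for illustration.

```python
students = [
    {"name": "Amy", "level": "Doctoral",      "loan": 40000},
    {"name": "Bob", "level": "Masters",       "loan": 15000},
    {"name": "Cho", "level": "Undergraduate", "loan": 9000},
    {"name": "Dee", "level": "Masters",       "loan": 22000},
]

def loan_mapper(student):
    # Filter: undergraduate records are dropped here, in the map phase.
    if student["level"] in ("Doctoral", "Masters"):
        yield (student["level"], student["loan"])

def loan_reducer(level, loans):
    # Combine: all loan amounts that shuffled to the same key are summed.
    return sum(loans)

print(run_mapreduce(students, loan_mapper, loan_reducer))
# {'Doctoral': 40000, 'Masters': 37000}
```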


Adv DB: Document DBs

Main concepts

Data models are how we see, interact with, and transform our data in a system like a database (MUSE, 2015). To a developer, a data model is an ERD, whereas a metamodel describes how a database organizes data; here there are four key metamodels: key-value, document, column-family, and graph databases (although graph databases are not aggregate-oriented) (Sadalage & Fowler, 2012).

In relational data models, tuples are sets of values (divided and stored information) that cannot be nested or placed within one another, so all operations must be thought of as reading or writing tuples.  Aggregate data models (key-value, column-family, and document) let us work with more complex structures than plain tuples (Sadalage & Fowler, 2012). Aggregates are related sets of data that we would like to treat as a unit (MUSE, 2015). Relationships between units/aggregates are captured in relational mapping, and a relational or graph database has no idea that the aggregate exists; such databases are known as “aggregate-ignorant” (Sadalage & Fowler, 2012).
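A short sketch of the difference; both shapes represent the same customer, and the names and fields are made up.

```python
# Relational view: flat tuples, no nesting -- related rows live in
# separate tables and are joined back together at query time.
customer_row = ("c42", "Ada Lovelace")
order_rows = [("o1", "c42", 19.99), ("o2", "c42", 5.00)]

# Aggregate view: the customer and her orders travel as one nested unit.
customer_aggregate = {
    "id": "c42",
    "name": "Ada Lovelace",
    "orders": [
        {"id": "o1", "total": 19.99},
        {"id": "o2", "total": 5.00},
    ],
}
```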

Let’s consider UPS.  For transactions on amazon.com or ebay.com, a distributor needs to know only the shipping address, but paypal.com or your bank cares about the billing address in order to credit your account.  UPS must collect both.  Thus, in their relational model, UPS may have an ERD with two entities: Billing Address and Shipping Address.  Naturally, we can group these into one unit (aggregate) called Address, with an indicator/key stating which address is which.  Then I can query that key for shipping addresses.
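A minimal sketch of that Address aggregate and a query against the type indicator; the structure and values are invented for illustration.

```python
customer = {
    "name": "J. Smith",
    "addresses": [
        {"type": "shipping", "street": "1 Elm St", "city": "Denver"},
        {"type": "billing",  "street": "9 Oak Ave", "city": "Boulder"},
    ],
}

# Query the indicator/key inside the aggregate for shipping addresses.
shipping = [a for a in customer["addresses"] if a["type"] == "shipping"]
print(shipping)  # [{'type': 'shipping', 'street': '1 Elm St', 'city': 'Denver'}]
```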

Finally, atomic operations are supported on a single aggregate at a time; ACID is not guaranteed for transactions that span multiple aggregates (Sadalage & Fowler, 2012).

Document Databases

A document database is able to look into the structure of a unit because we interact with it through queries, and a query can return a subset/part of the aggregate (Sadalage & Fowler, 2012). You can think of this as retrieving a chapter or a section of a document (MUSE, 2015).  A document database can be limited by size restrictions, and also by what can be placed in a document (structure and type).  People can blur the line between document and key-value databases by placing an ID field in each document, but for the most part, you will query a document database rather than look up a key or ID (Sadalage & Fowler, 2012).
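As a sketch of querying inside the aggregate, here is what that might look like with MongoDB through the pymongo driver; the server address, database, collection, and field names are all assumptions for illustration.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local server
students = client["school"]["students"]            # hypothetical db/collection

# Unlike a key-value lookup, the query reaches *inside* each document...
doctoral = students.find({"level": "Doctoral"})

# ...and a projection returns only part of the aggregate (name and loans).
names_and_loans = students.find({"level": "Doctoral"},
                                {"name": 1, "loans": 1, "_id": 0})
for doc in names_and_loans:
    print(doc)
```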

Pros and Cons of the Aggregate Data Model

Knowing the aggregate boundaries makes manipulation of data, replication, and sharding easier, because each unit can be moved or copied as a whole; without them, we would have to search every piece of data individually.  Thus aggregates can help in some cases and not in others.  Also, there is no correct or right place for aggregate boundaries to exist, which can add to the complexity of understanding a data model.  Aggregates are great if we want to run our transactions on as few nodes as possible in a cluster, since dealing with whole units is easier on a cluster (Sadalage & Fowler, 2012).  They are not great for mapping out the relationships between units of different formats (MUSE, 2015).
