Adv DB: Document DBs

Main concepts

Data models are how we see, interact with, and transform our data in a system like a database (MUSE, 2015). To a developer, a data model is typically an ERD, whereas a metamodel describes how a database organizes its data, which happens in four key ways: key-value, document, column-family, and graph (although graph databases are not aggregate-oriented) (Sadalage & Fowler, 2012).

In relational data models, a tuple is a set of values (divided and stored information) that cannot be nested or placed inside another tuple, so every operation must be thought of as reading or writing tuples.  Aggregate data models (key-value, column-family, and document) let us work with more complex structures than flat tuples (Sadalage & Fowler, 2012). An aggregate is a related set of data that we would like to treat as a unit (MUSE, 2015). In a relational database, relationships between pieces of data are captured through keys and relational mapping, but nothing marks which related pieces form an aggregate. Relational and graph databases therefore have no idea that the aggregate exists, which makes them “aggregate-ignorant” (Sadalage & Fowler, 2012).
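A minimal Python sketch of the difference, using made-up order data (the field names and values are illustrative assumptions, not taken from the cited sources): the relational view keeps flat tuples that refer to each other by ID, while the aggregate view nests the related data into one unit.

```python
# Relational view: flat tuples, no nesting; rows refer to each other by ID.
orders = [(1, "alice")]                        # (order_id, customer)
order_lines = [(1, "book", 2), (1, "pen", 5)]  # (order_id, item, quantity)

# Aggregate view: the order and its lines travel together as one unit.
order_aggregate = {
    "order_id": 1,
    "customer": "alice",
    "lines": [
        {"item": "book", "quantity": 2},
        {"item": "pen", "quantity": 5},
    ],
}


def assemble_order(order_id):
    """Rebuild the aggregate from the flat tuples, as a join would."""
    customer = next(c for (oid, c) in orders if oid == order_id)
    lines = [
        {"item": item, "quantity": qty}
        for (oid, item, qty) in order_lines
        if oid == order_id
    ]
    return {"order_id": order_id, "customer": customer, "lines": lines}
```

The relational store only knows about the two sets of tuples; the grouping into "one order" exists only in the code that assembles it, which is what aggregate-ignorant means in practice.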

Let’s consider UPS.  For a transaction on amazon.com or ebay.com, a distributor only needs to know the shipping address, while paypal.com or your bank cares about the billing address in order to credit your account.  UPS must collect both.  In a relational model, UPS might therefore have an ERD with two entities: Billing Address and Shipping Address.  Naturally, we can group these into one unit (aggregate) called Address, with an indicator/key that states which address is which.  I can then query on that key to get the shipping addresses.
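The Address aggregate could be sketched roughly like this in Python. The customer record, the "kind" indicator field, and the helper function are all hypothetical names chosen for illustration, not an actual UPS schema.

```python
# Hypothetical customer aggregate: both addresses live in one "addresses"
# list, and the "kind" field is the indicator that tells them apart.
customer = {
    "customer_id": "c-100",
    "addresses": [
        {"kind": "billing", "street": "1 Main St", "city": "Springfield"},
        {"kind": "shipping", "street": "9 Dock Rd", "city": "Shelbyville"},
    ],
}


def addresses_of_kind(aggregate, kind):
    """Return every address in the aggregate whose indicator matches."""
    return [a for a in aggregate["addresses"] if a["kind"] == kind]


shipping = addresses_of_kind(customer, "shipping")
```

A distributor would ask for the "shipping" kind and a bank for the "billing" kind, while the data itself stays together as one unit.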

Finally, atomic operations are supported on only a single aggregate at a time; ACID guarantees are not provided for transactions that span multiple aggregates (Sadalage & Fowler, 2012).
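To make the single-aggregate guarantee concrete, here is a toy in-memory store in Python. It is only a sketch under simplifying assumptions (one process, one lock, invented class and function names), not how any particular NoSQL product implements atomicity.

```python
import copy
import threading


class AggregateStore:
    """Toy store: each update to ONE aggregate is all-or-nothing."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return copy.deepcopy(self._data.get(key))

    def update(self, key, change):
        """Apply `change` to a copy and swap it in only if it succeeds."""
        with self._lock:
            draft = copy.deepcopy(self._data.get(key, {}))
            change(draft)            # may raise; the stored value is untouched
            self._data[key] = draft


def add_line(order):
    order.setdefault("lines", []).append({"item": "book", "quantity": 2})
    order["total_items"] = sum(line["quantity"] for line in order["lines"])


def broken_change(order):
    order["lines"] = []              # partial edit that must not be kept
    raise ValueError("simulated failure halfway through")


store = AggregateStore()
store.update("order:1", add_line)
try:
    store.update("order:1", broken_change)
except ValueError:
    pass  # the failed update leaves order:1 exactly as it was
```

Two `update` calls on two different keys remain two separate operations here; nothing makes them succeed or fail together, which mirrors the lack of cross-aggregate ACID transactions.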

Document Databases

A document database is able to look into the structure of an aggregate, so a query can return a subset/part of the aggregate rather than the whole thing (Sadalage & Fowler, 2012). You can think of this as retrieving a chapter or a section of a document (MUSE, 2015).  The trade-off is that the database can limit what is placed in a document, both in size and in structure and type.  People can blur the line between this and key-value databases by adding an ID field, but for the most part you will query a document database rather than look up a key or ID (Sadalage & Fowler, 2012).
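The contrast between a key lookup and a document query can be sketched in plain Python. The sample documents and the `get_by_key` and `find` helpers are hypothetical stand-ins for illustration, not the API of MongoDB, CouchDB, or any other product.

```python
documents = {
    "order:1": {"customer": "alice", "status": "shipped",
                "lines": [{"item": "book", "quantity": 2}]},
    "order:2": {"customer": "bob", "status": "pending",
                "lines": [{"item": "pen", "quantity": 5}]},
    "order:3": {"customer": "alice", "status": "pending",
                "lines": [{"item": "lamp", "quantity": 1}]},
}


def get_by_key(key):
    """Key-value style: the value is opaque, so we fetch it whole by key."""
    return documents[key]


def find(criteria, fields):
    """Document style: match on fields inside each document and
    return only the requested part of every match."""
    results = []
    for doc in documents.values():
        if all(doc.get(k) == v for k, v in criteria.items()):
            results.append({f: doc[f] for f in fields if f in doc})
    return results


pending = find({"status": "pending"}, fields=["customer"])
```

`get_by_key` needs the key up front and always returns the full aggregate, while `find` selects by content and hands back just a section of each matching document.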

Pros and Cons of Aggregate Data model

Aggregate ignorance makes it easy to look at and manipulate the data in different ways, because no single grouping is built into the model.  Replication and sharding, on the other hand, are easier with aggregates: the aggregate tells the database which data is used together and should live on the same node, so we do not have to search every node for the pieces.  Thus aggregates help in some cases and not in others.  Also, there is no single correct way to draw aggregate boundaries, which can add to the complexity of understanding a data model.  Aggregates are great if we want to run our transactions on as few nodes as possible in a cluster, since dealing with whole units is easier on a cluster (Sadalage & Fowler, 2012).  They are not great for mapping out the relationships between units of different formats (MUSE, 2015).
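One simple way to picture why aggregates suit clusters is hash-based sharding on the aggregate key. The Python sketch below assumes a made-up three-node cluster and a plain modulo placement rule; real systems typically use more elaborate schemes such as consistent hashing.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members


def node_for(aggregate_key):
    """Pick the node that owns a whole aggregate, based only on its key."""
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def place(aggregate_keys):
    """Group aggregate keys by the node that would store them."""
    placement = {node: [] for node in NODES}
    for key in aggregate_keys:
        placement[node_for(key)].append(key)
    return placement


placement = place(["order:1", "order:2", "order:3", "order:4"])
```

Because the whole aggregate is stored under one key, reading or writing an order touches exactly one node. Data that spans several aggregates may land on different nodes, which is where the model becomes less convenient.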

References:

Adv DB: NoSQL DB

Emergence

Relational databases will persist thanks to ACID, ERDs, concurrency control, transaction management, and SQL capabilities.  It also helps that major software can easily integrate with these databases.  The reason so many alternatives keep popping up is the impedance mismatch: the resource cost a system pays each time data is pushed and pulled between in-memory structures and the database.  This cost compounds quickly with large amounts of data.  Industry wants and needs to use parallel computing with clusters to store, retrieve, and manipulate large amounts of data.  Data can also be aggregated into units of similar items, and strict data consistency can be relaxed in real-life applications, since their transactions can actually be divided into multiple phases (MUSE, 2015a).

Think of a bank transaction.  Not all the transactions you make at the same time get processed at the same time: they may show up on your mobile device (a mobile database) right away, yet not be committed until a few hours or days later.  In my case, the bank withdraws my mortgage payment from my checking account on the first of every month but applies it to the loan on the second.  For those 24 hours, my payment is pending.

These ideas created a movement to support “Not Only SQL” databases, best known as NoSQL, a name derived from the Twitter hashtag #NoSQL.  NoSQL includes aggregate databases like key-value, document, and column-family, as well as aggregate-ignorant databases like graph databases (Sadalage & Fowler, 2012). These can be schemaless databases, where data can be stored without any predefined schema.  NoSQL is best for application-specific databases, not as a substitute for all relational databases (MUSE, 2015b).

The term was originally meant for open-source, distributed, nonrelational databases like Voldemort, Dynomite, CouchDB, MongoDB, and Cassandra, but it has expanded in its definition and in the applications/platforms it can take on.  CQL, Cassandra’s query language, was written to act like SQL in most cases but to act differently when needed (Sadalage & Fowler, 2012), hence the “No” (Not Only) in NoSQL.

Suitable Applications

According to Cassandra Planet (n.d.), NoSQL is best for large data sets (big data, complex data, and data mining):

  • Graph: where data relationships are graphical and interconnected like a web (ex: Neo4j & Titan)
  • Key-Value: data is stored and indexed by a key (ex: Cassandra, DynamoDB, Azure Table Storage, Riak, & BerkeleyDB)
  • Column Store: stores tables as columns rather than rows (ex: HBase, BigTable, & HyperTable)
  • Document: can store more complex data, with each document having a key (ex: MongoDB & CouchDB).
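To compare the four categories side by side, here is one small made-up record (a user who follows another user) shaped the way each kind of store might hold it. This is a rough Python sketch with plain dictionaries; the layouts are simplified assumptions and do not reproduce the storage format of any product named in the list.

```python
import json

# Key-value: the store only sees an opaque value behind each key.
key_value = {"user:1": '{"name": "alice", "follows": ["user:2"]}'}

# Document: the store can see inside the value and query its fields.
document = {"user:1": {"name": "alice", "follows": ["user:2"]}}

# Column store (column-family layout): row key -> column family -> columns.
column_family = {
    "user:1": {
        "profile": {"name": "alice"},
        "social": {"follows:user:2": True},
    }
}

# Graph: nodes plus explicit edges that describe relationships.
graph = {
    "nodes": {"user:1": {"name": "alice"}, "user:2": {"name": "bob"}},
    "edges": [("user:1", "FOLLOWS", "user:2")],
}


def follows(graph_db, user):
    """Walk the edges to find whom a user follows."""
    return [dst for (src, label, dst) in graph_db["edges"]
            if src == user and label == "FOLLOWS"]


# The key-value payload must be decoded by the application before use.
decoded = json.loads(key_value["user:1"])
```

The same facts are present in all four shapes; what changes is how much of the structure the database itself can see and query.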

System Platform

Relational databases carry a resource cost, and as industry wants to deal with large amounts of data, we can gravitate towards NoSQL.  To process all that data, we may need to use parallel computing with clusters to store, retrieve, and manipulate it.

References: