Adv DB: NoSQL DB

Emergence

Relational Databases will persist due to ACID, ERDs, concurrency control, transaction management, and SQL capabilities.  It doesn’t help that major software can easily integrate with these databases.  But, the reason why so many new ways keep popping up is due to impedance resource costs on computational systems, when data is pulled and pushed from in-memory to databases.  This resource cost can compound fast with big amounts of data.  Industry wants and needs to use parallel computing with clusters to store, retrieve, and manipulate big amounts of data.  Data could also be aggregated into units of similarities, and data consistency can be thrown out the window, in real-life applications since they can actually be divided into multiple phases (MUSE, 2015a).

Think of a bank transaction, not all transactions you do at the same time get processed at the same time, and they may show up on your mobile device (mobile database), they may not be committed until a few hours or days later.  The bank will in my case withdraw my mortgage payment from my checking on the first, but apply it on the second of every month into the loan.  But, for 24 hours my payment is pending.

Thanks to the aforementioned ideas have created a movement to support “Not Only SQL” databases, best known as NoSQL, which was derived from a twitter hashtag #NoSQL.  NoSQL contains Aggregate databases like key-value, document, and column friendly, as well as aggregate ignorant databases like the graph (Sadalage & Fowler, 2012). These can be schemaless databases, where data can be stored without any predefined schema.  NoSQL is best for application-specific databases, not to substitute all relational databases (MUSE, 2015b).

 Originally meant for open-sourced, distributed, nonrelational databases like Voldemort, Dynomite, CouchDB, MongoDB, Cassandra, it expanded in its definition and what applications/platforms it can take on.  CQL is from Cassandra and was written to act like SQL in most cases, but also act differently when needed (Sadalage & Fowler, 2012), hence the No in NoSQL.

Suitable Applications

According to Cassandra Planet (n.d.), NoSQL is best for large data sets (big data, complex data, and data mining):

  • Graph: where data relationships are graphical and interconnected like a web (ex: Neo4j & Titan)
  • Key-Value: data is stored and index by a key (ex: Cassandra, DynamoDB, Azure Table Storage, Riak, & BerkeleyDB)
  • Column Store: stores tables as columns rather than rows (ex: Hbase, BigTable, & HyperTable)
  • Document: can store more complex data, with each document having a key (ex: MongoDB & CouchDB).

System Platform

In Relational databases, there is a resource cost, but in as industry wants to deal with big amounts of data, we can gravitate towards NoSQL.  To process all that data we may need to use parallel computing with clusters to store, retrieve, and manipulate big amounts of data.

 References:

Adv Topics: Graph dataset partitioning in Cloud Computing

When dealing with cloud environments, the network management systems are key to processing big data, because some computational components within the network can have bandwidth unevenness (Conolly & Begg, 2014; Sakr, 2014). Chen et al. (2016) stated that often the cloud infrastructure is designed as interconnected computational resources in a tree structure, where they are first group in pods, which are then connected to other pods This unevenness can be defined as differing network performance between computational components within and across different racks from each other (Lublinsky, Smith, & Yakubovich, 2013). This unevenness grows as the cloud gets upgrade and the computational components grow to become increasingly heterogeneous with time (Chen et al., 2016). Thus, network management systems, which is part of resource pooling in cloud environments, helps with data load balancing, route connections, and diagnose problems between the “computational node ó link ó computational node ó link ó computational node …;” where each node is connected to multiple databases through multiple links (Conolly & Begg, 2014).

Network bandwidth unevenness becomes an issue when trying to process graph and network data, because (a) the complexity of the data cannot be stored in typical relational database systems; (b) the data is complex to process; and (c) scalability issues (Chen et al., 2016). Sakr (2014) stated there are two ways to scale data: vertical scaling where researchers are finding the few computational components that can handle huge data loads or horizontal scaling where researchers are spreading the data across multiple computational components. Chen et al. (2016) and Sakr (2014) recommended these two large-scale graph dataset partitioning methods for cloud computing systems:

  • Machine Graph: Models the graphical data by treating each computational node as a graphical vertex, such that the interconnectivity between each computational node would represent the connection of that graphical data element to others graphical data elements. A benefit of using this design is that the computational nodes representing this graphical data doesn’t need the knowledge of the topology of the network and can be built by utilizing the network bandwidth between computational nodes.
  • Partition Sketch: A graphical data set is partitioned out to represent multiple tree-structured data, and each computational node address a single level of the graphical tree data. From the perspective of the partitioned tree; the root node holds the input graph, non-leaf nodes hold partition iteration data, and leaf nodes hold the partitioned graph data points. This is built iteratively, with the goal of reducing the number of cross leaf-nodes between partitions. Design wise, each partition sketch is locally optimized, addresses monotonicity (cross leaf-nodes between partitions and non-leaf node depth), and addresses the proximity of parent nodes by defining common parent data between leaf nodes and non-leaf nodes. For instance, data that has a low common parent data should be stored together in high bandwidth computational nodes, whereas leaf-node data should be stored in low bandwidth computational nodes. However, the limitation of such a design is that the partition size must be carefully selected to fit monotonicity, but at the same time not have too many nodes as to run into input and output errors when trying to retrieve the data. Thus, more processing nodes should be used to decrease the number of cross leaf-nodes with low bandwidth, but not too many to cause memory issues.

Resources

  • Chen, R. Weng, X., He, B., Choi, B. and Yang, M. (2016). Network performance aware graph partitioning for large graph processing systems in the cloud. Retrieved from http://www.comp.nus.edu.sg/~hebs/pub/rishannetwork_crc14.pdf
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. Pearson Learning Solutions. VitalBook file.
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox, VitalBook file.
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.

Graphical NoSQL Databases

There is a lot of complicated connections between patients, their provider, their diagnoses, etc., and graphically representing this relationship data is one of the main highlights of using a NoSQL graph database (Park, Shankar, Park, & Ghosh, 2014). NoSQL (Not only Structured Query Language) databases are databases that are used to store data in non-relational databases i.e. graphical, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). Graph NoSQL databases are used drawing networks by showing the relationship between items in a graphical format that has been optimized for easy searching and editing (Services, 2015). Each item is considered a node and adding more nodes or relationships while traversing through them is made simpler through a graph database rather than a traditional database (Sadalage & Fowler, 2012). Some sample graph databases consist of Neo4j Pregel, etc. (Park et al., 2014).

Case Study: Graph Databases for large-scale healthcare systems: A framework for efficient data management and data services (Park et al., 2014)

Driver for data analytics needs: Finding areas for cost savings through anomaly detection algorithms, because currently there are a bunch of individual tables and non-normalized data that are replicated multiple times which is causing bottlenecks.

Problem: Understanding and establishing relationships between self-referrals and shared providers, which allows for the use of a collaborative filter.

System Needs: Data management needs error-tolerant and non-redundant database system, while data services need data retrieval, analytics queries, statistical data extraction and mining algorithms.

NoSQL Database used: Neo4J graph NoSQL Database using Cypher query to keep the data normalized and reduce the number of individual tables of data due to the advanced yet simple query capabilities

Methodology: Using the 3EG: 3NF Equivalent Graph Transformation algorithm to convert traditional relational database data into graph database data on realistic synthetic healthcare data.  The synthetic healthcare data consists of zip-codes, diagnosis of disease, available procedures, beneficiary, claim, and providers. The data when flattened can showcase 1 M beneficiaries to 100 K providers, but in a graphical format, that same data will have 51 M nodes and 257 M relationships.

Queries Ran on the NoSQL Database:

  • Shared providers between two beneficiaries
  • Shared providers between two beneficiaries through either actual visits or by referrals
  • List of shared diseases between two beneficiaries through their claim records
  • Any link between two beneficiaries à helps to direct further investigations/queries
  • Shared beneficiaries between two providers
  • Self-referred beneficiaries for a given provider
  • Similar claims based on diagnoses codes
  • Patient wants to switch to a new provider based on a referral by another provider

Using 50 random queries for each of the 8 cases above: the time it took to run the first three cases was faster in a MySQL query, but by less than 0.0X seconds, whereas the last 5 cases the NoSQL was faster ranging from 0.5-40 seconds.  As data size grew so did the processing time for the last five cases on MySQL grew.

Conclusions: The authors were able to show that with more highly advanced cases, MySQL takes more time than NoSQL. Thus, for big data analytics, NoSQL graph databases can help store dynamic relationship data as well as process more complex queries using fewer lines of code and faster than MySQL queries.  This style of storing data allows the end-user in the healthcare field to ask more complex questions and get those answers promptly.

References

  • Park, Y., Shankar, M., Park, B. H., & Ghosh, J. (2014). Graph databases for large-scale healthcare systems: A framework for efficient data management and data services. In Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on (pp. 12-19). IEEE.
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].