NoSQL (Not only Structured Query Language) databases are databases that are used to store data in non-relational databases i.e. graphical, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). Column-oriented databases are perfect for sparse datasets, ones with many null values and when columns do have data the related columns are grouped together (Services, 2015). Grouping demographic data like age, income, gender, marital status, sexual orientation, etc. are a great example for using this NoSQL database. Cassandra, which is a column-oriented NoSQL database focuses on availability and partition tolerance, this means that as an AP system it can achieve consistency if data can be replicated and verified (Hurst, 2010).
Cassandra has been assessed for performance evaluation against other NoSQL databases like MongoDB and Raik for health care data analytics (Weider, Kollipara, Penmetsa, & Elliadka, 2013). In this study, NoSQL database demands for health care data were two-fold:
- Read/write efficiency of medical test results for a patient X (Availability)
- All medical professionals should see the same information on patient X (Consistency)
A NoSQL graph database did not have the fit to use for the above demands, thus wasn’t part of this study.
The architecture of this project: nine partition nodes, where three by three nodes were used to mimic three data centers that would be used by 100 global health facilities, where data is generated at a rate of 1TB per month and must be kept for 99 years.
The dataset used in this project: a synthetic dataset that has 1M patients with 10M lab reports, averaging at seven lab reports per person, but randomly distributed of from 0-20 lab reports per person.
In meeting both of these two demands, Cassandra had a significantly higher throughput value than the other two NoSQL databases. Cassandra’s EACH_QUORUM write and LOCAL_QUORUM read options are part of their datacenter aware system, providing the great throughputs results, using the three synthetic datacenters. Testing consistency, by using Cassandra’s ONE for its write and read options at an eventual rate (slower consistency) or strong rate (faster consistency), shows that throughput increases with the eventual system. The choice to use either rate rests with the healthcare stakeholders.
The authors concluded that for their system and their requirements Cassandra had the highest throughput regardless of the level of consistency rates (Weider et al., 2013). They also suggested that each of these tests should be adjusted based on the requirements from key stakeholders in the healthcare profession and that a small variation in the data model could change the results seen here.
In conclusion of this post, NoSQL databases provide huge advantages to data analytics over traditional relational database management systems. But, NoSQL databases must fit the needs of the stakeholders, and quantitative tests must be thoroughly designed to assess which NoSQL database will meet those needs.
References
- Hurst, N. (2010). Visual guide to NoSQL systems. Retrieved from http://blog.nahurst.com/visual-guide-to-nosql-systems
- Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
- Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].
- Weider, D. Y., Kollipara, M., Penmetsa, R., & Elliadka, S. (2013, October). A distributed storage solution for cloud based e-Healthcare Information System. In e-Health Networking, Applications & Services (Healthcom), 2013 IEEE 15th International Conference on (pp. 476-480). IEEE.