Adv DB: Conducting data migration to NoSQL databases

Relational databases schema design (primarily ERDs) are all about creating models, then translating it a schema to which is normalized, but one must be an oracle to anticipate a holistic, end-to-end design, or else suffer when making changes to the database (Scherzinger et al, 2013).  Relational databases are poor at data replication, horizontal scalability, and high availability rates (Schram & Anderson, 2012).  Thus, waterfall approaches to database design are no longer advantageous, and like software development databases can be designed with an agile mentality.  Especially as data store requirements are always evolving. Databases that adopt a “Schema-less” (where data can be stored without any predefined schema) or an “Implicit Schema” (where the data definition van be taken from a database from an application in order to place the data into the database) in “Not Only SQL” (NoSQL) can allow for agile development on a release cycle that can vary from yearly, monthly, weekly, or daily, which is completely dependent on the developers’ iteration cycle (Sadalage & Fowler, 2012).  Taking a look at a blogging agile development lifecycle (below) can show how great schema-less or implicit schemas in NoSQL database development can become, as well as the technical debt that is created, which can cause migration issues down the line.

Blogging

We start a blogging site called “blog.me” and we are in an agile environment, which means iterative improvements and each iteration produces a releasable product (even if we decide not to make a release or update at the end of the iteration).  As a programming team, they have decided that the minimum viable product will consist of the fields, title, and content for the blogger and comments from other people.  This is a similar example proposed by Scherzinger et al in 2013, as they try to explain how implicit schemas work.  In the second iteration, the programming team for “blog.me” has discovered an abuse on the commenting section of the blog.  People have been “trolling” the blog, thus to mitigate this, they implemented a sign-in process with a username and password that is taken from Facebook, which allows for liking a post as well.  Rather than having bloggers to recreate their content, the programmers make the implementation of this update for current and future posts. In a third iteration, the programming teams to institute a uniformed nomenclature to some of their fields.  Rather than changing all the posts from the first two iterations, the programmers decide to enforce these changes moving forward.

Now, one can see how useful a schema-less development (provided by NoSQL) can become.   There is no downtime to how the site interacts and adds no additional burden to the end-users when an update occurs. But, we now have to worry about migrating these three data classes (or as Scherzinger et al calls it technical debt), but what if a commenter goes and comments in a post made in iteration one or two after iteration three has been implemented, we may then have four to five different data classes.  These developers love to develop code and add new features rather than maintain code, which is why this form of developing a database is great, but as we can see technical debt can pile on quickly.  Our goal is to manage a schema of this data, yet have the flexibility of a schema-less database system.

Types of Migration

The migration of data in and out of a data store is usually enabled through a replication scheme (Shirazi et al, 2012) conducted through an application.  There are two primary types of data migration per Scherzinger et al (2013): eager and lazy.  Eager migration means we migrate all the data in a batched fashion, one-by-one retrieval from the data store, transform it and write it back into the data store.  As data becomes larger, eager migration can become resource-intensive and could be a wasted effort. Wasted efforts can come from stale data.  Thus, the lazy approach is considered as a viable option.  Transformations are conducted when a piece of data is touched, so only live and hot data (relevant data) is updated.  Even though this approach saves on resources, if an entity becomes corrupted, there may be no way to retrieve it.  In order to do the migration, an application needs to create an “implicit-schema” on the “schema-less” data.

NoSQL and its multiple flavors

NoSQL databases can deal with aggregate data (relationships between units of data that can be relationally mapped), using key-value, document, and column friendly databases (Scherzinger et al, 2013, Sadalage & Fowler, 2012, Schram & Anderson, 2012).  There also exist graphical databases (Sadalage & Fowler, 2012).  Key-value databases deal with storing data with a unique key and value, while document databases store documents or their parts in a value. (Scherzinger et al, 2013). People can blur the line between this and key-value databases by placing an ID field, but for the most part, you will query a document database rather than look up a key or ID (Sadalage & Fowler, 2012). Whereas column friendly databases store the information in transposed table structures (as columns rather than rows).  Graph databases can show relationships with huge datasets that are highly interconnected, and the complexity of the data is emphasized in this database rather than the size of data (Shirazi et al, 2012).  A further example of a graphical database is shown in the health section in the following pages.  Migrations between the multiple flavors of NoSQL databases allow for one to exploit the strengths and mitigate the weakness between the types when it comes to analyzing the large data quickly.

Data Migration Considerations and Steps

Since data migration uses replication schemes from an application, one must consider how complex writing a SQL query would be if this were a relational database scheme (Shirazi et al, 2012).  This has implications on how complex transforming data or migrating it would be under NoSQL databases, especially when big data is introduced into the equation.  Thus, the pattern of database design must be taken into account when migrating data between relational databases to NoSQL database, or between different NoSQL database types (or even provider). Also, each of these database types treats NULL values differently, some NoSQL databases don’t even waste the storage space and ignore NULL values, some systems have them as in relational databases, and some systems allow for it, but don’t query for it (Scherzinger et al, 2013).  Scherzinger et al (2013) suggest that when migrating data, data models (data stored in the databases that belong to a object or a group, which can have several properties) query models (data that can be inserted, transformed and deleted based on a key-value, or some other kind identification), and freedom from schema (the global structure of the data that can or cannot be fixed in advance) must be taken into account. Whereas, Schram & Anderson in 2012, stated that data models are key when making design changes (migrations) between database systems. Since in NoSQL data is “schema-less” there may not be any global structure, but applications (such as web user-interfaces) built on top of the data-stores can display an implicit structure, and from that, we can list a few steps to consider when migrating data (Tran et al, 2011):

  • Installation and configuration
    1. Set up development tools and environment
    2. Install and set up environments
    3. Install third-party tools
  • Code modification
    1. Set up database connections
    2. Database operation query (if using a NoSQL database)
    3. Any required modifications for compatibility issues
  • Migration
    1. Prepare the database for migration
    2. Migrate the local database to the NoSQL database (the schema-less part)
    3. Prepare system for migration
    4. Migrate the application (the implicit-schema part)
  • Test (how to ensure the data stored in the databases matched with the “Implicit Schema” embedded in the applications when the “Implicit Schema” has experienced a change)
    1. Test if the local system works with a database in NoSQL
    2. Test if the system works with databases in NoSQL
    3. Write test cases and test for functionality of the application in NoSQL

When doing code modification (step 2) from a relational database to a NoSQL database the more changes will be required, and JOIN operations may not be fully supported.  Thus, additional code may be required in order to maintain the serviceability of the application, pre-migration, during migration and post-migration (Tran et al, 2011).  Considering ITIL Service Transition standards, the best time to do a migration or update is in windows of minimum usage by end-users, while still maintaining agreed-upon minimum SLA standards.  As stated in Schram & Anderson (2012) they didn’t want their service to break while they were migrating their data from a relational database to a NoSQL column friendly database.  Other issues, like compatibility between the systems housing the databases or even database types, can also add complexity to migration.  When migrating (step 3) SQL scripts need to be transformed as well, to align with the new database structure, environment, etc. (Tran et al, 2011). Third-party apps can help to a degree with this.  If the planning phase was conducted correctly this phase should be relatively smooth.  Tran et al (2011) stated that there are at least 8 features that drive the cost of migration: (1) Project team’s capability, (2) Application/Database complexity, (3) Existing knowledge and experience, (4) Selecting the correct database and database management system, (5) Compatibility issues, database features, and (8) Connection issues during migration.

Health

A database was created from 7.2M medical reports, in order to understand human diseases, called HealthTable.  The authors in Shirazi et al in 2012, decided to convert a column store into a graph database of Health Infoscape (Table 1 to Figure 1).  Each cause/symptom stems from disease (Dx), yet the power of graph databases as aforementioned are shown, thus facilitating data analysis, even though column friendly databases provide an easier way to maintain the 7.2M data records.

Table 1. HealthTable in Hbase per Shirazi et al (2012).

Row key Info Prevalence Causes
D1 Name Category Female Male Total Cause1 Cause2 Cause3
Heartburn Digestive system 9.4% 9% 9.2% D2    
1 1 1 1 1 2    
D2 Chest Pain Circulatory System 6.8% 6.8% 6.8%      
3 3 3 3 3      
D4 Dizziness Nervous System 4% 2.8% 3.5%      
5 5 5 5 5      

health graph

Figure 1. HeathGraph Bases on HealthTable

Conclusions

From these two use cases (Heath and Blogging) is that data migration can be quite complicated.  Schema-less databases allow for a more agile approach to developing, whereas the alternative is best for the waterfall.  However, with waterfall development slowly on the decay, one must also migrate to other forms of development.  Though applications/databases can migrate from relational databases to NoSQL and thus require a lot of coding because of compatibility issues, applications/databases can also migrate between different types of NoSQL databases.  Each database structure has its strengths and weakness, and migrating data between these databases can provide opportunities for knowledge discovery from the data that is contained within them.  Migrating between database systems and NoSQL types should be conducted if it fulfills many of the requirements and promises to reduce the cost of maintenance (Schram & Anderson, 2012).

References

  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781323137376/
  • Scherzinger, S., Klettke, M., & Störl, U. (2013). Managing schema evolution in NoSQL data stores. arXiv preprint arXiv:1308.0514.
  • Schram, A., & Anderson, K. M. (2012). MySQL to NoSQL: data modeling challenges in supporting scalability. In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity (pp. 191-202). ACM.
  • Shirazi, M. N., Kuan, H. C., & Dolatabadi, H. (2012, June). Design Patterns to Enable Data Portability between Clouds’ Databases. In Computational Science and Its Applications (ICCSA), 2012 12th International Conference on (pp. 117-120). IEEE.
  • Tran, V., Keung, J., Liu, A., & Fekete, A. (2011, May). Application migration to cloud: a taxonomy of critical factors. In Proceedings of the 2nd international workshop on software engineering for cloud computing (pp. 22-28). ACM.

Adv DB: Data Services in the Cloud Service Platform

Rimal et al (2009), states that Cloud Computing Systems are a disruptive service that has gained momentum. What makes it disruptive is that it has similar properties of prior technology, while adding new features and capabilities (like big data processing) in a very cost-effective way.   It has become part of the XaaS (where X can be infrastructure, hardware, software, etc.) as a Service.  According to Connolly & Begg (2014), Data as a Service (DaaS) and Database as a Service (DBaaS) are considered as cloud-based solutions. DaaS doesn’t use SQL interfaces, but it does enable corporations to access data to analyze value streams that they own or those they can easily expand into. DBaaS must be continuously monitored and improved on, because they usually serve multiple organizations, with the added benefit of providing charge-back functions per organization (Connolly & Begg, 2014) or a pay-for-use model (Rimal et al, 2009).  However, one must pick a solution that best serves their business’/organization’s needs.

Benefits of the service

Connolly & Begg (2014), stated that there are benefits to cloud computing such as Cost-reduction due to lower CapEx, ability to scale up or down based on data demands, needs for higher security making data stored here more secure than in-house solutions, 24/7 reliability can be provided, faster development time because time is not take away from building the DBMS from scratch, finally sensitivity/load testing can be made readily available and cheaply because of lower CapEX.  Rimal et al (2009), stated that the benefits came from lower costs, improved flexibility and agility, and scalability.  You can set up systems as quickly as you wish, under a pay-as-you-go model and is great for disaster recovery efforts (as long as the disaster affected your systems, not theirs).   Other benefits could be seen from an article looking at the Data-as-a-Service in the health field: a low cost to implementation and maintainability of databases, defragmentation of data, exchange of patient data across the heady provider organization, and provide a mode for standardization of data types, forms, and frequency to capture data.  From a health-care perspective, it could lead to supporting research, strategic planning of medical decisions, improve data quality, reduce cost, reduce resource scarcity issues from an IT perspective, and finally provide better patient care (AbuKhousa et al, 2012).

Problems that can be removed because of the service

Unfortunately, there are two sides to a coin.  Given that on a cloud service there exists network dependency, such that if the supplier has a power outage, the consumer will not have access to their data.  Other network dependencies can occur like peak service hours, where service tends to be degraded compared to if the company used the supplier during its off-peak hours.  Quite a few organizations use Amazon EC2 (Rimal et al, 2009) as their Cloud DBMS, which if that system is hacked or security is breached the problem is bigger than if it were carried out to only one company. There are system dependencies, like in the case of disaster recovery (AbuKhousa et al, 2012), organizations are as strong as their weakest link when it comes to a disaster, if the point of failure in the service, and there are no other mitigation plans, that organization may have a hard time recuperating their losses.  Also, placing data into these services, you lose control over the data (lose control over availability, reliability, maintainability, integrity, confidentiality, intervenability, and isolation) (Connolly & Begg, 2014).  Rimal et al (2009) stated clear examples of outages that existed in Services like Microsoft (down 22 hours in 2008), or Google Gmail (2.5 hours in 2009), etc.  All of these lack of control points is perhaps one of the main reasons why certain government agencies have had a hard time adopting a commercialized cloud service provider, however they are attempting to create internal clouds.

Architectural view of the system

The overall cloud architecture is the layered system that serves as a single point of contact and uses software applications over the web, using an infrastructure, which draws on resources from necessary hardware to complete a task (Rimal et al, 2009).  Adopted from Rimal et al (2009) the figure below is what these authors describe as the layered cloud architecture.

Software-as-a-Service (SaaS)

Platform-as-a-Service (PaaS)

Developers implementing cloud applications

Infrastructure-as-a-Service (IaaS)

[(Virtualizations, Storage Networks) as-a-Service]

Hardware-as-a-Service (HaaS)

Figure 1. A layered cloud architecture.

Cloud DBMS can be private (an internal provider where the data is held within the organization), public (an external provider manages resources dynamically across their system and through multiple organizations supplying them data), and hybrid (consists of multiple internal and external providers) as defined in Rimal et al (2009).

A problem with DBaaS is the fact that databases between multiple organizations are stored by the same service provider.  Where data can be what separates one organization from its competitors, they must consider the following architecture: Separate Servers, Shared Server but different database processes, Shared databases but separate databases, Shared databases but separate schema, or Shared databases but shared schema (Connolly & Begg, 2014).

Let’s take two competitors: Walmart & Target.  Two supergiant organizations that have trillions of dollars and inventory flowing in and out of their systems monthly.  Let’s also assume three database tables with automated primary keys and foreign keys that connect the data together: Product, Purchasing, and Inventory.  Another set of assumptions: (1) their data can be stored by the same DBaaS provider, (2) their DBaaS provider is Amazon.

While Target and Walmart may use the same supplier for their DBaaS, they can select one of those five architectural solutions to avoid their valuable data to be seen.  If Walmart and Target purchased separate servers, their data can be safe.  They could also go this route if they want to store their huge data and/or have many users of their data. Now if we narrow down our assumptions to the children’s toy section (due to breaking up the datasets to manageable chunks), both Target and Walmart can store their data on a shared server but on separate database processes, they would not have any shared resources like memory or disk, just the virtual environment.  If Walmart and Target went on a shared database server but separate databases, would allow for better resource management between each organization.  If Walmart and Target decided to use the same database (which is unlikely) but hold separate schema, data must have strong access management systems in place.  This may be elected between CVS and Walgreen’s Pharmacy databases, where patient data can be vital, but not impossible to switch from one schema to another, however, this interaction is highly unlikely.   The final structure, highly unlikely for most corporations is sharing databases and schemas.  This final architectural structure is best used for hospitals sharing patient data (AbuKhousa et al, 2012), but, HIPPA must be observed still under this final architecture mode.  Though this is the desired state for some hospitals, it may take years to get to a full system.  Security is a big issue here and will take a ton of developmental hours, but in the long run, it is the cheapest solution available (Connolly & Begg, 2014).

Requirements on services that must be met

Looking at Amazon’s cloud database offering, it should be easy to set up, easy to operate, and easy to scale.  The system should enhance availability and reliability to its live databases compared to in house solutions. There should be software in the databases to back up the database, for recovery at any time. Security patches and maintenance of the underlying hardware and software of the cloud DBMS should be reduced significantly since that is not the burden that should be placed onto the organization. The goal of the Cloud DBMS should be to remove development costs away from managing the DBMS to focus on applications and the data to be stored in the system.  They should also provide infrastructure and hardware as a service to reduce overhead costs in managing these systems.  Amazon’s Relational Database Service can use MySQL, Oracle, SQL Server, or PostgreSQL databases.  Amazon’s DynamoDB is a NoSQL database service.  Amazon’s Redshift costs less than $1K to store a TB of data per year (Varia & Mathew, 2013).

Issues to be resolved during implementation

Rimal et al in 2010, stated some interesting things to consider during a before implementing a Cloud DBMS.  What are the Service-Level Agreements (SLAs) of the supplier Cloud DBMS?  The Cloud DBMS may be up and running 24×7, but if they experience a power outage, what is their SLAs to the organization, as not to impact the organization at all?  What is their backup/replication scheme?  What is their discovery (assists in reusability) schema? What is their load balancing (trying to avoid bottlenecks), especially since most suppliers cater to more than one organization? What does their resource management plan look like? Most cloud DBMS have several copies of the data spread across several servers, so how sure is the vender to ensure no data loss?  What types of security are provided? What is their encryption and decryption strength for the data held within its servers?  How private will the organization’s data be, if hosted on the same server or same database but separate schema?   What are their authorization and Authentication safeguards?  Looking at Varia & Mathew (2013) explain all the cloud DBMS services that Amazon provides, these questions are definitely things that should be addressed for each of their solutions.  Thus, when analyzing a supplier for a cloud DBMS, having technical requirements that meet the Business Goals & Objects (BG&Os) is great to help guide the organization pick the right supplier and solution, given issues that need to be resolved.  Other issues identified came from Chaudhuri (2012): data privacy through access control, auditing, and statistical privacy; allow for data exploration to enable deeper analytics; data enrichment with web 2.0 and social media; query optimization; scalable data platforms; manageability; and performance isolation for multiple organizations occupying the same server.

Data migration strategy & Security Enforcement

When migrating data between organizations into a Cloud DBMS, the taxonomy of the data must be preserved.  Along with taxonomy, one must consider that no data is lost in the transfer, that data is still available to the end-user, before, during and after the migration, and that the transfer is done in a cost-efficient way (Rimal et al 2010).  Furthermore, data migration should be done seamlessly and efficiently as if one were to move between suppliers of services, such that a supplier doesn’t get too entangled into the organization that it is the only solution the organization can see itself as using.  Finally, what type of data do you want to migrate over, mission-critical data may be too precious on certain types of Cloud DBMS, but may be great for a virtualized disaster recovery system?  What type of data to migrate over depends on the needs of a cloud system, to begin with, and what services does the organization want to pay-as-they-go now and in the near future. The type of data to migrate may also depend on the security features provided by the supplier.

Organizational information is a vital resource to any organization, and access to it and maintaining it proprietary is key. If not enforced data like employee social security numbers can be compromised, or credit card numbers of past consumers.  Rimal et al (2009) compared the security considerations in the current cloud systems at the time:

  • Google: uses 128 Bit or higher server authentication
  • GigaSpaces: SSH tunneling
  • Microsoft Azure: Token services
  • OpenNebula: Firewall and virtual private network tunnel

In 2010, Rimal et al, further expanded the security considerations by the suggestion that organizations should look into: authentication and authorization protocols (access management), privacy and federated data, encryption and decryption schemes, etc.

References

  • AbuKhousa, E., Mohamed, N., & Al-Jaroodi, J. (2012). e-Health cloud: opportunities and challenges. Future Internet, 4(3), 621-645.
  • Chaudhuri, S. (2012, May). What next?: a half-dozen data management research goals for big data and the cloud. In Proceedings of the 31st symposium on Principles of Database Systems (pp. 1-4). ACM.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf version]. Retrieved from http://online.vitalsource.com/books/9781323135761/epubcfi/6/210
  • Rimal, B. P., Choi, E., & Lumb, I. (2009, August). A taxonomy and survey of cloud computing systems. In INC, IMS and IDC, 2009. NCM’09. Fifth International Joint Conference on (pp. 44-51). Ieee.
  • Rimal, B. P., Choi, E., & Lumb, I. (2010). A taxonomy, survey, and issues of cloud computing ecosystems. In Cloud Computing (pp. 21-46). Springer London.
  • Varia, J., & Mathew, S. (2013). Overview of amazon web services. Jan-2014.