Adv DB: Conducting data migration to NoSQL databases

Relational database schema design (primarily through ERDs) is all about creating models and then translating them into a normalized schema, but one must be an oracle to anticipate a holistic, end-to-end design up front, or else suffer when making changes to the database later (Scherzinger et al., 2013).  Relational databases are also poor at data replication, horizontal scalability, and high availability (Schram & Anderson, 2012).  Thus, waterfall approaches to database design are no longer advantageous; like software, databases can be designed with an agile mentality, especially as data-store requirements are always evolving. Databases that adopt a “schema-less” model (where data can be stored without any predefined schema) or an “implicit schema” (where the data definition lives in the application that writes data into the database) under the “Not Only SQL” (NoSQL) umbrella allow for agile development on a release cycle that can vary from yearly to monthly, weekly, or daily, depending entirely on the developers’ iteration cycle (Sadalage & Fowler, 2012).  Walking through an agile blogging development lifecycle (below) shows how useful schema-less or implicit schemas in NoSQL database development can become, as well as the technical debt that is created, which can cause migration issues down the line.

Blogging

We start a blogging site called “blog.me” in an agile environment, which means iterative improvements where each iteration produces a releasable product (even if we decide not to make a release or update at the end of the iteration).  The programming team has decided that the minimum viable product will consist of a title and content field for the blogger, plus comments from other people.  This is similar to the example proposed by Scherzinger et al. (2013) to explain how implicit schemas work.  In the second iteration, the programming team for “blog.me” discovers abuse in the commenting section of the blog.  People have been “trolling” the blog, so to mitigate this, the team implements a sign-in process with a username and password taken from Facebook, which also allows for liking a post.  Rather than having bloggers recreate their content, the programmers apply this update only to current and future posts. In a third iteration, the programming team institutes a uniform nomenclature for some of its fields.  Rather than changing all the posts from the first two iterations, the programmers decide to enforce these changes only going forward.

Now one can see how useful schema-less development (provided by NoSQL) can be.   There is no downtime in how the site operates and no additional burden on end-users when an update occurs. But we now have to worry about migrating these three data classes (the stale, older structural variants being what Scherzinger et al. call technical debt), and if a commenter comments on a post made in iteration one or two after iteration three has been implemented, we may end up with four or five different data classes.  Developers love to develop code and add new features rather than maintain code, which is why this style of database development is attractive, but as we can see, technical debt can pile up quickly.  Our goal is to manage a schema for this data while keeping the flexibility of a schema-less database system.
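To make the technical debt concrete, here is a minimal sketch, assuming a document store with JSON-like records and hypothetical field names, of how posts from different iterations can coexist in the same collection under different implicit schemas:

```python
# Hypothetical documents from the "blog.me" example; field names are illustrative only.
post_iteration_1 = {
    "title": "My first post",
    "content": "Hello world",
    "comments": ["Nice post!"],
}

post_iteration_2 = {
    "title": "Signing in now required",
    "content": "No more trolls",
    # New fields introduced in iteration 2: commenters are identified and posts can be liked.
    "comments": [{"user": "alice_fb", "text": "Great idea"}],
    "likes": 3,
}

post_iteration_3 = {
    # Iteration 3 renames fields to a uniform nomenclature, but only for new posts.
    "postTitle": "Uniform field names",
    "postContent": "Renames apply going forward only",
    "postComments": [{"user": "bob_fb", "text": "Old posts keep old names"}],
    "likes": 1,
}

# All three shapes live side by side in the same collection -- the application code
# (the implicit schema) must handle every variant it might read back.
posts = [post_iteration_1, post_iteration_2, post_iteration_3]
```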

Types of Migration

The migration of data in and out of a data store is usually enabled through a replication scheme (Shirazi et al., 2012) conducted through an application.  There are two primary types of data migration per Scherzinger et al. (2013): eager and lazy.  Eager migration migrates all the data in a batched fashion: each entity is retrieved from the data store one by one, transformed, and written back.  As data grows larger, eager migration becomes resource-intensive and can be a wasted effort, since much of the migrated data may be stale and never read again.  Thus, the lazy approach is a viable alternative: a transformation is conducted only when a piece of data is touched, so only live and hot (relevant) data is updated.  Even though this approach saves resources, if an entity becomes corrupted before it is ever touched, there may be no way to retrieve it.  In order to do either migration, an application needs to impose an “implicit schema” on the “schema-less” data.
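As a rough illustration (not taken from the cited papers), the difference between the two approaches can be sketched in Python against a hypothetical in-memory store; `transform()` stands in for whatever schema change the new application version needs:

```python
# Hypothetical in-memory "data store": key -> document (dict). In practice this would be
# a NoSQL client (document or key-value database), not a Python dict.
store = {
    "post:1": {"title": "Old post", "content": "..."},
    "post:2": {"title": "Another old post", "content": "..."},
}

def transform(doc):
    """Example transformation: rename fields to the iteration-3 nomenclature."""
    return {
        "postTitle": doc.get("postTitle", doc.get("title")),
        "postContent": doc.get("postContent", doc.get("content")),
    }

def eager_migration(store):
    """Eager (batch) migration: read every entity, transform it, and write it back now."""
    for key in list(store):
        store[key] = transform(store[key])

def lazy_read(store, key):
    """Lazy migration: transform an entity only when the application touches it."""
    doc = transform(store[key])
    store[key] = doc          # write the migrated version back to the store
    return doc

# Eager: everything is migrated up front, touched or not.
# eager_migration(store)
# Lazy: only the entities the application actually reads get migrated.
# current = lazy_read(store, "post:1")
```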

NoSQL and its multiple flavors

NoSQL databases deal with aggregates (units of related data that are stored and retrieved together) using key-value, document, and column-family (“column friendly”) databases (Scherzinger et al., 2013; Sadalage & Fowler, 2012; Schram & Anderson, 2012).  There are also graph databases (Sadalage & Fowler, 2012).  Key-value databases store data as a unique key paired with a value, while document databases store documents, or parts of documents, as the value (Scherzinger et al., 2013). One can blur the line between these and key-value databases by adding an ID field, but for the most part a document database is queried on its contents rather than looked up by key or ID (Sadalage & Fowler, 2012). Column-family databases, by contrast, store information in transposed table structures (as columns rather than rows).  Graph databases can show relationships in huge datasets that are highly interconnected; the complexity of the data, rather than its size, is what this type of database emphasizes (Shirazi et al., 2012).  A further example of a graph database is shown in the health section below.  Migrating between the multiple flavors of NoSQL databases allows one to exploit the strengths and mitigate the weaknesses of each type when it comes to analyzing large data quickly.
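To make the distinctions more concrete, here is a hedged sketch (illustrative structures only, not any particular product’s API) of how the same blog post might be represented in each flavor:

```python
# The same blog post represented in the different aggregate-oriented flavors
# (illustrative only; real stores expose their own client APIs).

# Key-value: the value is opaque to the database; the application interprets it.
key_value = ("post:1", '{"title": "My first post", "content": "Hello world"}')

# Document: the value is a structured document the database can query into.
document = {"_id": "post:1", "title": "My first post", "content": "Hello world",
            "comments": [{"user": "alice_fb", "text": "Nice post!"}]}

# Column-family: data grouped by row key, then by column family and column.
column_family = {
    "post:1": {
        "info": {"title": "My first post", "content": "Hello world"},
        "engagement": {"likes": "3"},
    }
}

# Graph: nodes and labeled edges emphasize relationships rather than size.
graph_edges = [("post:1", "commented_on_by", "alice_fb"),
               ("post:1", "liked_by", "bob_fb")]
```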

Data Migration Considerations and Steps

Since data migration uses replication schemes driven from an application, one must first consider how complex writing the equivalent SQL query would be if this were a relational database scheme (Shirazi et al., 2012).  This has implications for how complex transforming or migrating the data will be under NoSQL databases, especially when big data is introduced into the equation.  Thus, the pattern of database design must be taken into account when migrating data between a relational database and a NoSQL database, or between different NoSQL database types (or even providers). Also, each of these database types treats NULL values differently: some NoSQL databases do not even waste the storage space and simply omit NULL values, some represent them as relational databases do, and some allow them but cannot query for them (Scherzinger et al., 2013).  Scherzinger et al. (2013) suggest that when migrating data, the data model (data stored in the database that belongs to an object or group, which can have several properties), the query model (how data can be inserted, transformed, and deleted based on a key-value or some other kind of identification), and freedom from schema (whether the global structure of the data is or is not fixed in advance) must all be taken into account. Similarly, Schram and Anderson (2012) state that data models are key when making design changes (migrations) between database systems. Since NoSQL data is “schema-less,” there may not be any global structure, but applications (such as web user interfaces) built on top of the data stores impose an implicit structure, and from that we can list a few steps to consider when migrating data (Tran et al., 2011):

  • Installation and configuration
    1. Set up development tools and environment
    2. Install and set up environments
    3. Install third-party tools
  • Code modification
    1. Set up database connections
    2. Modify database operation queries (if using a NoSQL database)
    3. Any required modifications for compatibility issues
  • Migration
    1. Prepare the database for migration
    2. Migrate the local database to the NoSQL database (the schema-less part)
    3. Prepare system for migration
    4. Migrate the application (the implicit-schema part)
  • Test (how to ensure the data stored in the databases matches the “Implicit Schema” embedded in the applications when the “Implicit Schema” has experienced a change)
    1. Test if the local system works with a database in NoSQL
    2. Test if the system works with databases in NoSQL
    3. Write test cases and test for functionality of the application in NoSQL

When doing code modification (step 2), moving from a relational database to a NoSQL database requires the most changes, and JOIN operations may not be fully supported.  Thus, additional code may be required in order to maintain the serviceability of the application pre-migration, during migration, and post-migration (Tran et al., 2011).  Considering ITIL Service Transition standards, the best time to do a migration or update is in windows of minimum usage by end-users, while still maintaining agreed-upon minimum SLA standards.  Schram and Anderson (2012) state that they did not want their service to break while they were migrating their data from a relational database to a NoSQL column-family database.  Other issues, like compatibility between the systems housing the databases or even between database types, can also add complexity to a migration.  When migrating (step 3), SQL scripts need to be transformed as well, to align with the new database structure, environment, etc. (Tran et al., 2011); third-party tools can help with this to a degree.  If the planning phase was conducted correctly, this phase should be relatively smooth.  Tran et al. (2011) state that there are at least eight factors that drive the cost of migration, including the project team’s capability, application/database complexity, existing knowledge and experience, selecting the correct database and database management system, compatibility issues, database features, and connection issues during migration.
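As a hedged illustration of the extra code a missing JOIN can require (not from Tran et al.; the table and field names are hypothetical), a relational one-to-many relationship between posts and comments can be denormalized into embedded documents during migration:

```python
# Hypothetical relational rows that would normally be combined with a JOIN.
posts = [{"post_id": 1, "title": "My first post"}]
comments = [
    {"comment_id": 10, "post_id": 1, "user": "alice_fb", "text": "Nice post!"},
    {"comment_id": 11, "post_id": 1, "user": "bob_fb", "text": "Agreed."},
]

def denormalize(posts, comments):
    """Embed each post's comments inside the post document, replacing the JOIN."""
    by_post = {}
    for c in comments:
        by_post.setdefault(c["post_id"], []).append(
            {"user": c["user"], "text": c["text"]}
        )
    return [
        {"_id": f"post:{p['post_id']}",
         "title": p["title"],
         "comments": by_post.get(p["post_id"], [])}
        for p in posts
    ]

documents = denormalize(posts, comments)
# Each document now carries everything the application used to fetch with
# "SELECT ... FROM posts JOIN comments ON ...", at the cost of duplicated writes.
```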

Health

A database called HealthTable was created from 7.2 million medical reports in order to understand human diseases.  Shirazi et al. (2012) convert this column store of Health Infoscape data into a graph database (Table 1 to Figure 1).  Each cause/symptom stems from a disease (Dx), and the aforementioned power of graph databases is shown here: the graph facilitates analysis of the relationships between conditions, even though the column-family store provides an easier way to maintain the 7.2 million data records.  A minimal sketch of this table-to-graph conversion follows the figure.

Table 1. HealthTable in HBase, per Shirazi et al. (2012).

| Row key | Info: Name | Info: Category | Prevalence: Female | Prevalence: Male | Prevalence: Total | Causes: Cause1 | Causes: Cause2 | Causes: Cause3 |
|---------|------------|--------------------|------|------|------|----|---|---|
| D1      | Heartburn  | Digestive system   | 9.4% | 9%   | 9.2% | D2 |   |   |
|         | 1          | 1                  | 1    | 1    | 1    | 2  |   |   |
| D2      | Chest Pain | Circulatory System | 6.8% | 6.8% | 6.8% |    |   |   |
|         | 3          | 3                  | 3    | 3    | 3    |    |   |   |
| D4      | Dizziness  | Nervous System     | 4%   | 2.8% | 3.5% |    |   |   |
|         | 5          | 5                  | 5    | 5    | 5    |    |   |   |

Figure 1. HealthGraph, based on HealthTable.
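As a minimal, hypothetical sketch (not the authors’ code), the table-to-graph conversion amounts to turning each row into a node and each cause reference into an edge:

```python
# HealthTable-style rows (simplified from Table 1); keys and fields are illustrative.
health_table = {
    "D1": {"name": "Heartburn", "category": "Digestive system", "causes": ["D2"]},
    "D2": {"name": "Chest Pain", "category": "Circulatory System", "causes": []},
    "D4": {"name": "Dizziness", "category": "Nervous System", "causes": []},
}

def to_graph(table):
    """Turn each row into a node and each cause reference into a directed edge."""
    nodes = {key: row["name"] for key, row in table.items()}
    edges = [(key, cause) for key, row in table.items() for cause in row["causes"]]
    return nodes, edges

nodes, edges = to_graph(health_table)
# nodes -> {'D1': 'Heartburn', 'D2': 'Chest Pain', 'D4': 'Dizziness'}
# edges -> [('D1', 'D2')]  i.e., Heartburn is linked to Chest Pain as a cause/symptom.
```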

Conclusions

What these two use cases (Health and Blogging) show is that data migration can be quite complicated.  Schema-less databases allow for a more agile approach to development, whereas the alternative is better suited to waterfall.  However, with waterfall development slowly decaying, one must also migrate to other forms of development.  Though migrating applications/databases from relational databases to NoSQL can require a lot of coding because of compatibility issues, applications/databases can also migrate between different types of NoSQL databases.  Each database structure has its strengths and weaknesses, and migrating data between these databases can provide opportunities for knowledge discovery from the data contained within them.  Migration between database systems and NoSQL types should be conducted if it fulfills many of the requirements and promises to reduce the cost of maintenance (Schram & Anderson, 2012).

References

  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781323137376/
  • Scherzinger, S., Klettke, M., & Störl, U. (2013). Managing schema evolution in NoSQL data stores. arXiv preprint arXiv:1308.0514.
  • Schram, A., & Anderson, K. M. (2012). MySQL to NoSQL: data modeling challenges in supporting scalability. In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity (pp. 191-202). ACM.
  • Shirazi, M. N., Kuan, H. C., & Dolatabadi, H. (2012, June). Design Patterns to Enable Data Portability between Clouds’ Databases. In Computational Science and Its Applications (ICCSA), 2012 12th International Conference on (pp. 117-120). IEEE.
  • Tran, V., Keung, J., Liu, A., & Fekete, A. (2011, May). Application migration to cloud: a taxonomy of critical factors. In Proceedings of the 2nd international workshop on software engineering for cloud computing (pp. 22-28). ACM.

Data Tools: Hadoop and how to install it

What is Hadoop

Hadoop’s Distributed File System (HDFS) is where big data is broken up into smaller blocks (IBM, n.d.), which can be aggregated like a set of Legos throughout a distributed database system. Data blocks are distributed across multiple servers.  This block system provides an easy way to scale the company’s data needs up or down, and it allows MapReduce to do its tasks on smaller sets of the data for faster processing (IBM, n.d.). Blocks are small enough that they can be easily duplicated (for disaster-recovery purposes) on two different servers (or more, depending on the data needs).
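As a rough sketch of the block-and-replica idea (the block size, replication factor, and node names below are toy values, far smaller than HDFS’s real defaults):

```python
# Illustrative only: HDFS uses large blocks (tens of megabytes or more) and a
# configurable replication factor; tiny numbers keep this example readable.
BLOCK_SIZE = 10          # characters per block in this toy example
REPLICATION_FACTOR = 2   # each block is stored on this many nodes
NODES = ["node-1", "node-2", "node-3"]

data = "a big dataset that will not fit on a single machine"

# Split the data into fixed-size blocks, like HDFS splitting a large file.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Assign each block to REPLICATION_FACTOR different nodes (round-robin placement).
placement = {}
for idx, block in enumerate(blocks):
    placement[idx] = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION_FACTOR)]

# If one node fails, every block it held still exists on another node,
# which is what lets processing continue after a failure.
print(placement)
```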

HDFS can support many different data types, even those that are unknown or yet to be classified, and it can store vast amounts of data.  Thus, Hadoop’s approach to managing big data allows for parallel processing, which in turn enables parallel searching, metadata management, parallel analysis (with MapReduce), the establishment of workflow system analysis, etc. (Gary et al., 2005; Hortonworks, 2013; IBM, n.d.).

Given the massive amounts of data in big data that need to be processed, manipulated, and calculated upon, parallel processing and programming harness the benefits of distributed systems to get the job done (Minelli et al., 2013).  Hadoop, which is Java-based, allows manipulation and calculations to be done by calling on MapReduce, which pulls on the data distributed across its servers, maps key items/objects, and reduces the data to answer the query at hand (Hortonworks, 2013; Sathupadi, 2010).
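Hadoop runs MapReduce jobs in Java (or through its Streaming interface) across a cluster; purely as an illustration of the map–shuffle–reduce flow described above, here is a tiny single-process Python sketch using the customary word-count example (not taken from the cited sources):

```python
# A tiny, self-contained simulation of the MapReduce flow Hadoop runs at scale
# (map -> shuffle/sort by key -> reduce); word count is the customary example.
from itertools import groupby
from operator import itemgetter

documents = ["big data needs parallel processing",
             "hadoop splits big data into blocks"]

# Map: emit (key, value) pairs -- here, (word, 1) for every word seen.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort: bring all values that share the same key together.
mapped.sort(key=itemgetter(0))

# Reduce: combine each key's values into a single result -- here, a total count.
word_counts = {word: sum(v for _, v in pairs)
               for word, pairs in groupby(mapped, key=itemgetter(0))}

print(word_counts)   # e.g., {'big': 2, 'data': 2, 'hadoop': 1, ...}
```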

Parallel processing makes quick work of a big data set because, rather than having one processor do all the work, Hadoop splits the task among many processors. This is the largest benefit of Hadoop.  Another advantage of parallel processing is that when one processor/node goes down, another node can pick up from where that task was last safely saved (which can slow down the calculation, but only by a bit).  Hadoop assumes node failures happen all the time, so nodes keep backup copies of their data as part of their fail-safe (IBM, n.d.).  This is done so that another processor/node can continue the work on the copied data, which enhances data availability and, in the end, gets the task done.

Minelli et al. (2013) state that traditional relational database systems can depend on the underlying hardware architecture.  Hadoop’s service, however, can be offered in the cloud as Platform as a Service (PaaS): under PaaS, we manage the applications and data, whereas the provider administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001).  The next section discusses how to install Hadoop and how to set up Eclipse to access map/reduce servers.

Installation steps

  • Go to the Hadoop Main Page < http://hadoop.apache.org/ > and scroll down to the getting started section, and click “Download Hadoop from the release page.” (Birajdar, 2015)
  • In the Apache Hadoop Releases < http://hadoop.apache.org/releases.html > Select the link for the “source” code for Hadoop 2.7.3, and then select the first mirror: “http://apache.mirrors.ionfish.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3-src.tar.gz” (Birajdar, 2015)
  • Open the Hadoop-2.7.3 tarball file with a compression file reader like WinRAR archiver < http://www.rarlab.com/download.htm >. Then drag the file into the Local Disk (C:). (Birajdar, 2015)
  • Once the file has been completely transferred to the Local Disk drive, close the tarball file, and open up the hadoop-2.7.3-src folder. (Birajdar, 2015)
  • Download the Hadoop 0.18.0 tarball file < https://archive.apache.org/dist/hadoop/core/hadoop-0.18.0/ > and place a copy of the “Hadoop-vm-appliance-0-18-0” folder into the Java “jdk1.8.0_101” folder. (Birajdar, 2015; Gnsaheb, 2013)
  • Download the Hadoop VM file < http://ydn.zenfs.com/site/hadoop/hadoop-vm-appliance-0-18-0_v1.zip >, unzip it, and place it inside the Hadoop src folder. (Birajdar, 2015)
  • Open up VMware Workstation 12, and open a virtual machine “Hadoop-appliance-0.18.0.vmx” and select play virtual machine. (Birajdar, 2015)
  • Login: Hadoop-user and password: Hadoop. (Birajdar, 2015; Gnsaheb, 2013)
  • Once in the virtual machine, type “./start-hadoop” and hit enter. (Birajdar, 2015; Gnsaheb, 2013)
    1. To test MapReduce on the VM, run: bin/hadoop jar hadoop-0.18.0-examples.jar pi 10 100000000
      1. You should get a “job finished in X seconds” message.
      2. You should get an “estimated value of PI is Y” message.
  • To bind the MapReduce plugin to Eclipse (Gnsaheb, 2013)
    1. Go into the JDK folder, under Hadoop-0.18.0 > contrib > eclipse-plugin, copy “Hadoop-0.18.0-eclipse-plugin” and place it into the Eclipse Neon 1 plugin folder “eclipse\plugins”
    2. Open Eclipse, then click the open perspective button > Other > Map/Reduce
    3. In Eclipse, click on Window > Show View > Other > MapReduce Tools > Map/Reduce Locations
    4. Add a server: in the Map/Reduce Locations window, click on the elephant icon
      1. Location name: your choice
      2. Map/Reduce master host: the IP address shown after you log in via the VM
      3. Map/Reduce master port: 9001
      4. DFS master port: 9000
      5. Username: Hadoop-user
    5. Go to the advanced parameters tab > mapred.system.dir > edit to /Hadoop/mapred/system

Issues experienced in the installation process (how challenges were investigated and solved)

No single source had the entire solution (Birajdar, 2015; Gnsaheb, 2013; Korolev, 2008).  It took a combination of all three sources to get the same output that each of them described.  Once the solution was determined to be correct, and the correct versions of the files were located, they were captured in the instruction set above.  Whenever a person runs into a problem in computer science, google.com is their friend.  The links above will become outdated with time, and methods will change.  Each person’s computer system is different from the personal computer system used here, which is reflected in this instruction manual.  This manual should help others google the right terms, in the right order, to get Hadoop installed correctly on their systems.  The process takes about 3-5 hours to complete, given the time it takes to download and install the right files and to set everything up correctly.

Resources