Mobile & Distributed Database Management Systems

A transaction is a set of operations or transformations that carries a database (or relational dataset) from one consistent state to another. Once completed and validated as successful, the end result is saved to the database (Panda et al, 2011). Both ACID and CAP (each discussed in further detail below) are known as integrity properties for these transactions (Mapanga & Kadebu, 2013).

Mobile Databases

Mobile devices have become prevalent and vital for many transactions in which the end-user is unable to access a wired connection. Without a wired connection, the device must retrieve and save transaction information either over a wireless connection or in disconnected mode (Panda et al, 2011). One problem with a mobile user accessing and creating transactions against databases is that bandwidth in a wireless network is not constant: with enough bandwidth, data flows to the end-user rapidly; with little, it slows to a crawl. There are a few transaction models that can be used efficiently for mobile database transactions: the Report and Co-transactional model; the Kangaroo transaction model; the Two-tiered transaction model; the Multi-database transaction model; the Pro-motion transaction model; and the Toggle transaction model. This is by no means an exhaustive list of transaction models for mobile databases.

According to Panda et al (2011), in the Report and Co-transactional model, transactions are completed bottom-up in a nested format, such that a transaction is split between child and parent transactions. Each child transaction, once completed, feeds its result up the chain until it reaches the parent, but nothing is committed until the parent transaction completes. Thus, a transaction can occur on the mobile device but not be fully implemented until it reaches the parent database. In the Kangaroo transaction model, a mobile transaction manager collects and accepts transactions from the end-user and forwards (hops) each transaction request to the database server. Transactions in this model are made by proxy on the mobile device, and when the mobile device moves from one location to the next, a new transaction manager is assigned to produce a new proxy transaction. The Two-tiered transaction model is inspired by data replication schemes: there is a master copy of the data along with multiple replicas. The replicas live on the mobile device but can make changes to the master copy if the connection to the wireless network is strong enough. If the connection is not strong enough, the changes are made to the replicas, show as committed there, and are still made visible to other transactions.
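
As a rough illustration of the two-tiered idea, here is a minimal Java sketch (all class and method names are hypothetical, not taken from Panda et al) of a replica that commits tentatively while disconnected and replays its updates against the master copy once a strong connection returns:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the two-tiered model: updates commit tentatively on the
// replica while disconnected, then are replayed against the master copy
// when a strong connection returns. All names here are illustrative.
public class TwoTieredReplica {
    private final List<String> tentativeUpdates = new ArrayList<>();
    private boolean stronglyConnected = false;

    public void write(String update) {
        if (stronglyConnected) {
            applyToMaster(update);        // change the master copy directly
        } else {
            tentativeUpdates.add(update); // tentative commit, visible locally
        }
    }

    public void onStrongConnection() {
        stronglyConnected = true;
        for (String update : tentativeUpdates) {
            applyToMaster(update);        // replay tentative transactions
        }
        tentativeUpdates.clear();
    }

    private void applyToMaster(String update) {
        System.out.println("committed on master: " + update);
    }

    public static void main(String[] args) {
        TwoTieredReplica replica = new TwoTieredReplica();
        replica.write("seat 12A -> reserved"); // held as tentative while offline
        replica.onStrongConnection();          // now reaches the master copy
    }
}
```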

The Multi-database transaction model uses asynchronous schemes that allow a mobile user to disconnect and still have the transaction coordinated. To use this scheme, five queues are set up: input, allocate, active, suspend, and output, and nothing is committed until a transaction has passed through all five queues (see the sketch below). Pro-motion transactions come from nested transaction models, where some sub-transactions are completed on fixed hosts and others on mobile hosts. When a mobile user disconnects from the fixed host, a command is triggered so that the transaction is now completed on the mobile host, though carrying out this triggered command is resource-intensive. Finally, the Toggle transaction model relies on software over a pre-determined network and can operate across several database systems; changes made to the master (global) database can be presented to different mobile systems, so concurrency is managed for all transactions across all databases (Panda et al, 2011).
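
The five-queue scheme lends itself to a pipeline. Below is a minimal, hypothetical Java sketch (the queue names come from Panda et al; everything else is illustrative) in which a transaction advances one stage per step and commits only after leaving the output queue:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Toy sketch of the multi-database model's five queues: input, allocate,
// active, suspend, output. Nothing commits until a transaction has passed
// through every stage.
public class FiveQueuePipeline {
    private final List<Queue<String>> stages = List.of(
        new ArrayDeque<String>(), // input
        new ArrayDeque<String>(), // allocate
        new ArrayDeque<String>(), // active
        new ArrayDeque<String>(), // suspend
        new ArrayDeque<String>()  // output
    );

    public void submit(String txn) { stages.get(0).add(txn); }

    // Advance one transaction per stage; commit only from the output queue.
    public void step() {
        for (int i = stages.size() - 1; i >= 0; i--) {
            String txn = stages.get(i).poll();
            if (txn == null) continue;
            if (i == stages.size() - 1) {
                System.out.println("committed: " + txn);
            } else {
                stages.get(i + 1).add(txn);
            }
        }
    }

    public static void main(String[] args) {
        FiveQueuePipeline pipeline = new FiveQueuePipeline();
        pipeline.submit("T1");
        for (int i = 0; i < 5; i++) pipeline.step(); // T1 commits on step 5
    }
}
```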

At a cursory glance these models seem similar, but they vary strongly in how they implement the ACID properties in their transactions (see Table 1 in the next section).

ACID Properties and their flaws

Jim Gray introduced the idea of ACID transactions in the 1970s. They provide four guarantees: Atomicity (all-or-nothing transactions), Consistency (correct data transactions), Isolation (each transaction is independent of others), and Durability (transactions survive failures) (Mapanga & Kadebu, 2013; Khachana et al, 2011; Connolly & Begg, 2015). ACID is used to assure reliability in a database system when a transaction changes the state of the data. This approach is perfect for small, centralized or distributed relational databases, but with the demands of mobile transactions, big data, and NoSQL, ACID can be a bit constricting. The web consists of independent services that are connected relationally but hard to maintain (Khachana et al, 2011). An example is booking a flight for a CTU Doctoral Symposium. One purchases a flight but may also need a related service, such as ground transportation to and from the hotel; the flight database is completely separate from the ground transportation system, yet sites like Kayak.com connect these databases and provide a friendly user interface for their customers (Kayak.com has its own mobile app as well). Taking this example further, we can see how ACID, perfect for centralized databases, may not be the best fit for web-based services. Another case to consider is mobile database transactions: because of their connectivity issues and recovery plans, the aforementioned models cover only some of the ACID properties (Panda et al, 2011). Through the lens of ACID, this is the flaw of mobile databases.
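
To make atomicity concrete, here is a minimal JDBC sketch. It assumes an in-memory H2 database on the classpath, and the connection URL and table names are placeholders: either both bookings commit together, or the rollback undoes any partial work.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class AtomicBooking {
    public static void main(String[] args) throws SQLException {
        // Placeholder URL: an in-memory H2 database, assumed on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo")) {
            try (Statement st = conn.createStatement()) {
                st.executeUpdate("CREATE TABLE flights (booking_id INT)");
                st.executeUpdate("CREATE TABLE shuttles (booking_id INT)");
            }
            conn.setAutoCommit(false); // take manual control of the transaction
            try (Statement st = conn.createStatement()) {
                st.executeUpdate("INSERT INTO flights VALUES (1)");
                st.executeUpdate("INSERT INTO shuttles VALUES (1)");
                conn.commit();   // Atomicity: both inserts become durable together
            } catch (SQLException e) {
                conn.rollback(); // all-or-nothing: undo any partial work
                throw e;
            }
        }
    }
}
```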

Table 1

Mobile Distributed Database Management Systems Transaction Models vs ACID.

| Model | Atomicity | Consistency | Isolation | Durability |
| --- | --- | --- | --- | --- |
| Report & Co-transaction model | Yes | Yes | Yes | Yes |
| Kangaroo transaction model | Maybe | No | No | No |
| Two-tiered transaction model | No | No | No | No |
| Multi-database transaction model | No | No | No | No |
| Pro-motion model | Yes | Yes | Yes | Yes |
| Toggle transaction model | Yes | Yes | Yes | Yes |

Note: A subset of the information found in Panda et al (2011) dealing with mobile database system transaction models and how they use or do not use the ACID properties.

CAP Properties and their trade-offs

CAP stands for Consistency (as in ACID: all data transactions are correct and all users see the same data), Availability (users always have access to the data), and Partition Tolerance (the system keeps working even when the database is split over many servers and some cannot communicate, so no single point of failure exists). It was formulated in 2000 by Eric Brewer (Mapanga & Kadebu, 2013; Abadi, 2012; Connolly & Begg, 2015). These three properties are needed for distributed database management systems, and CAP is seen as a less strict alternative to Jim Gray's ACID properties. Unfortunately, a distributed database system can only guarantee two of the three properties at once, giving CA, CP, or AP systems.

CP systems have a reputation for not being available all the time, but this is a misconception: Availability in a CP system is given up (or out-prioritized) only when Partition Tolerance is needed. Likewise, Availability in a CA system can be lost if a partition occurs (Mapanga & Kadebu, 2013). Though a system can only be best at two of the three, that does not mean the third property cannot be present; the restriction applies only to what is prioritized. In a CA system, ACID can be guaranteed alongside Availability (Abadi, 2012). Partitions vary across distributed database management systems due to WANs, hardware, network configuration parameters, levels of redundancy, etc. (Abadi, 2012). Partitions are rare compared to other failure events, but they must be considered. Still, the question remains for all database administrators:

Which of the three CAP properties should be prioritized above all others? This matters particularly for a distributed database management system in which partitions must be considered. Abadi (2012) answers this question: for mission-critical data and applications, availability during partitions should not be sacrificed, so consistency must fall for a while.

Amazon's Dynamo, Facebook's Cassandra, Yahoo's PNUTS, LinkedIn's Voldemort, and Riak (a Dynamo-inspired system) are all examples of distributed database systems that can be accessed from a mobile device (Abadi, 2012).

However, according to Abadi (2012), latency (closely related to Availability) is critical to all of these systems, so much so that a 100 ms delay can significantly reduce an end-user's retention and future repeat transactions. Thus, availability during partitions is key not only for mission-critical systems but for e-commerce as well.

Unfortunately, this tradeoff between Consistency and Availability arises from data replication and depends on how replication is done.

According to Abadi (2012), there are three ways to do data replication: data updates are sent to all replicas at the same time (enforcing high consistency); data updates are sent to an agreed-upon location first, through synchronous or asynchronous schemes (availability depends on the scheme); or data updates are sent to an arbitrary location first, again through synchronous or asynchronous schemes (availability depends on the scheme). PNUTS sends data updates to an agreed-upon location first through asynchronous schemes, which improves Availability at the cost of Consistency, whereas Dynamo, Cassandra, and Riak send data updates to an agreed-upon location first through a combination of synchronous and asynchronous schemes (Abadi, 2012).

These three systems propagate data synchronously to a small subset of servers and asynchronously to the rest, which can cause inconsistencies. All of this is done to reduce delay for the end-user.
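
A minimal sketch of that hybrid scheme in Java, with a hypothetical Replica interface rather than any real client library, might look like this: the write blocks on a small synchronous subset and fires-and-forgets to the rest.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch of hybrid replication: the update is pushed synchronously to a
// small subset of replicas (the write waits for them) and asynchronously
// to the rest (the write returns before they confirm). Replica is an
// illustrative placeholder, not a real client library.
public class HybridReplicator {
    interface Replica { void apply(String update); }

    private final List<Replica> syncSet;  // small subset, consistent on return
    private final List<Replica> asyncSet; // remaining replicas, may lag

    HybridReplicator(List<Replica> syncSet, List<Replica> asyncSet) {
        this.syncSet = syncSet;
        this.asyncSet = asyncSet;
    }

    public void write(String update) {
        for (Replica r : syncSet) {
            r.apply(update); // blocks: these replicas see the update immediately
        }
        for (Replica r : asyncSet) {
            // fire-and-forget: these replicas lag behind, producing the
            // temporary inconsistencies described above
            CompletableFuture.runAsync(() -> r.apply(update));
        }
    }
}
```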

Returning to the Kayak.com example from the previous section, consistency in the web environment should be relaxed (Khachana et al, 2011). Expanding on that example, if seven users access the services at the same time, each can specify which of these properties should be relaxed and which should not. One user may order a flight, hotel, and car and insist that nothing is booked until all three services commit. Another may be content with whatever car is available for ground transportation, as long as they get the flight times and price they want. Relaxation can cause inconsistencies, lost information, or misleading inputs to decision analysis, but systems must be adaptable (Khachana et al, 2011). They must take into account the wireless signal, the mode of transferring and committing data, and the load-balancing of incoming requests (who gets a contested plane seat when only one is left at that price?).

At the end of the day, when it comes to CAP, Availability is king: it will attract business or drive it away, so C or P must give to cater to the customer. If I were designing this system, I would run an AP system but conduct the partitioning work when the load on the database system is small (off-peak hours), giving the illusion of a CA system, because the consistency degradation would be seen by fewer people. Off-peak hours don't exist for global companies, mobile web services, or websites, but there are times of the year when transaction volume is lower than on normal days, and scheduling around those days is key. For the mobile transaction side, I would select the Pro-motion transaction model, which helps comply with the ACID properties: make updates locally on the mobile device when services are down, and queue the remaining transactions in order, to be committed once wireless service is restored or a stronger signal is found.
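
A minimal Java sketch of that queue-and-flush idea, with an illustrative commit hook standing in for a real service call:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of the approach proposed above: queue updates locally while the
// device is disconnected, then commit them in submission order once a
// usable signal returns. The commit hook is an illustrative placeholder.
public class OfflineWriteQueue {
    private final Queue<String> pending = new ArrayDeque<>();
    private boolean online = false;

    public void update(String txn) {
        if (online) commit(txn);
        else pending.add(txn);      // hold locally until service is restored
    }

    public void onSignalRestored() {
        online = true;
        while (!pending.isEmpty()) {
            commit(pending.poll()); // drain the queue in order
        }
    }

    private void commit(String txn) {
        System.out.println("committed upstream: " + txn);
    }

    public static void main(String[] args) {
        OfflineWriteQueue queue = new OfflineWriteQueue();
        queue.update("book flight");   // queued while offline
        queue.update("book shuttle");  // queued while offline
        queue.onSignalRestored();      // both commit, in order
    }
}
```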

Resources

  • Abadi, D. J. (2012). Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. Computer, 45(2), 37-42.
  • Connolly, Thomas & Begg, Carolyn (2015). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. Pearson Education, Inc., publishing as Addison-Wesley, Upper Saddle River, New Jersey.
  • Khachana, R. T., James, A., & Iqbal, R. (2011). Relaxation of ACID properties in AuTrA, The adaptive user-defined transaction relaxing approach. Future Generation Computer Systems, 27(1), 58-66.
  • Mapanga, I., & Kadebu, P. (2013). Database Management Systems: A NoSQL Analysis. International Journal of Modern Communication Technologies & Research (IJMCTR), 1, 12-18.
  • Panda, P. K., Swain, S., & Pattnaik, P. K. (2011). Review of some transaction models used in mobile databases. International Journal of Instrumentation, Control & Automation (IJICA), 1(1), 99-104.

Adv DBs: A possible future project?

Below is a possible future research paper on a database-related subject.

Title: Using MapReduce to aid in clinical test utilization patterns in medicine

The motivation:

Efficient processing and analysis of clinical data could lead to better clinical testing of patients, and MapReduce offers an integrated solution for the medical field that saves resources when moving data in and out of storage.

The problem statement (symptom and root cause)

The rates of Sexually Transmitted Infections (STIs) are increasing at an alarming pace. Could loading the test utilization patterns of the Roper Saint Francis Clinical Network (in the South) into Hadoop with MapReduce reveal patterns in the current STI population and predict areas where an outbreak may be imminent?

The hypothesis statement (propose a solution and address the root cause)

H0: Data mining in Hadoop with MapReduce will not be able to identify any meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

H1: Data mining in Hadoop with MapReduce can identify a meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

The research questions

Could this study of STI outbreak rates be generalized to other disease outbreak rates?

Is this application of data mining in Hadoop with MapReduce the correct way to analyze the data?

The professional significance statement (new contribution to the body of knowledge)

According to Mohammed et al (2014), identifying where an outbreak of any disease (including STIs) will occur via clinical test utilization patterns has yet to be done, and they state that Hadoop with MapReduce is a great tool for clinical work because it has already been adopted in similar fields of medicine, such as bioinformatics.
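
As a starting point, a job of the following shape could count test orders per clinic location. This is a hedged sketch against the standard Hadoop MapReduce Java API; the CSV layout (location in the second column) is a hypothetical assumption, not Roper Saint Francis's actual format.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: count STI test orders per clinic location from CSV utilization
// records, as a first step toward spotting geographic outbreak patterns.
public class TestsPerLocation {
    public static class LocationMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            ctx.write(new Text(fields[1]), ONE); // emit (location, 1)
        }
    }

    public static class CountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text location, Iterable<IntWritable> counts,
                              Context ctx) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(location, new IntWritable(sum)); // total tests per location
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tests-per-location");
        job.setJarByClass(TestsPerLocation.class);
        job.setMapperClass(LocationMapper.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```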

Resources

  • Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce programming framework to clinical big data analysis: Current landscape and future trends. BioData Mining, 7. doi:10.1186/1756-0381-7-22
  • Pokorny, J. (2011). NoSQL databases: A step to database scalability in web environment. In iiWAS '11: Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services (pp. 278-283).

Adv DBs: Data warehouses and OLAP

Data warehouses allow people with decision power to quickly locate adequate data in one location; that data spans multiple functional departments and is well integrated, supporting the reports and in-depth analysis needed to make effective decisions (MUSE, 2015a). Data can be stored in n-dimensional data cubes that can be dissected, filtered, and rolled up through a dynamic application called Online Analytical Processing (OLAP). OLAP can be its own system or part of a data warehouse, and when combined with data mining tools it forms a decision support system (DSS) for uncovering hidden relationships within the data (MUSE, 2015b). A DSS needs both a place to store data and a way to analyze it. The data warehouse doesn't answer the "Why?" questions, only the "How?", "What?", "When?", and "Where?", and that is where OLAP helps. We want to extract as much decision-making knowledge as possible from these systems, which is why a DSS needs both, plus the aforementioned data mining tools, to answer all questions rather than a subset.

Data Warehouses

Discovering knowledge from archives of data drawn from multiple sources, in a consolidated and integrated way, is what the data warehouse does best. Data warehouses are subject-oriented (organized around subjects such as customers, products, and sales rather than around applications such as invoicing), integrated (data from different sources across the enterprise, perhaps in different formats), time-variant (data varies with respect to time), and nonvolatile (new data is appended rather than replacing the old). Sources feeding the warehouse can include mainframes, proprietary file systems, servers, internal workstations, external website data, etc., and the result supports analysis and knowledge discovery for effective data-based decision-making. Summarized data, which is updated automatically as new data enters the warehouse, mainly helps improve query speeds; detailed data is kept online only when it supports or supplements the summarized data, which keeps the warehouse technically light. So where is the rest of the detailed data? Offline (Connolly & Begg, 2015). Looking into the architecture of this system:

The operational data store (ODS) holds the data staged for the data warehouse. From staging, the load manager performs all extraction and loading of data into the warehouse, while the warehouse manager performs all transformation of that data. The query manager handles all queries against the data warehouse. Metadata (definitions of the stored data and its units) is used by all of these processes and lives in the warehouse as well (Connolly & Begg, 2015).
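
To make the division of labor concrete, here is a toy Java sketch with entirely hypothetical record shapes: the load manager's extract/load step leaves the data untouched, and the warehouse manager's transform step normalizes it into the warehouse's canonical form.

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy sketch of the staging flow just described. All names are illustrative.
public class StagingFlow {
    record RawRow(String product, String amount) {}          // as held in the ODS
    record WarehouseRow(String product, double amountUsd) {} // as queried later

    // Load manager: extraction and loading only, no reshaping of the data.
    static List<RawRow> extract(List<String> odsLines) {
        return odsLines.stream()
                .map(line -> line.split(";"))
                .map(f -> new RawRow(f[0], f[1]))
                .collect(Collectors.toList());
    }

    // Warehouse manager: transformation into the warehouse's canonical form.
    static List<WarehouseRow> transform(List<RawRow> rows) {
        return rows.stream()
                .map(r -> new WarehouseRow(r.product(),
                        Double.parseDouble(r.amount().replace("$", ""))))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        var staged = extract(List.of("widget;$19.99", "gadget;$5.00"));
        transform(staged).forEach(System.out::println);
    }
}
```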

OLAP

What sets OLAP apart from other query systems is that it answers the "Why?" questions using an n-dimensional, aggregated, dynamic view of the data. OLAP is more complex than statistical analysis of aggregated data; it is slice-and-dice with time series and data modeling. OLAP servers come in four main flavors: Multidimensional OLAP (MOLAP: multidimensional databases where data is stored according to predicted usage), Relational OLAP (ROLAP: supports relational DBMS products through a metadata layer), Hybrid OLAP (HOLAP: some data in ROLAP and some in MOLAP), and Desktop OLAP (DOLAP: usually for small file extracts, with data stored in client files/systems such as a laptop or desktop).

DSS, OLAP, and Data Warehouse Application

Consider a car insurance claims DSS. Insurance companies can use such a system to analyze people's driving patterns, what damage can or cannot result from a given accident, and why someone might claim false damages to fix their car or cash out. Their systems can define the who, what, when, where, why, and how of each accident against all other accidents processed (slicing and dicing by state, type of accident, vehicle types involved, etc.) to help resolve whether a claim is legitimate.
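
An in-memory illustration of that slice-and-dice in Java, over hypothetical claims data (a real OLAP server would answer the same roll-up against a cube rather than a stream):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: roll up total payout per (state, accident type) cell of the cube.
public class ClaimsRollup {
    record Claim(String state, String accidentType, double payout) {}

    public static void main(String[] args) {
        List<Claim> claims = List.of(
                new Claim("SC", "rear-end", 1200.0),
                new Claim("SC", "rear-end", 900.0),
                new Claim("GA", "hail", 3100.0));

        // Group by state, then by accident type, summing the payouts.
        Map<String, Map<String, Double>> cube = claims.stream()
                .collect(Collectors.groupingBy(Claim::state,
                        Collectors.groupingBy(Claim::accidentType,
                                Collectors.summingDouble(Claim::payout))));

        System.out.println(cube); // {GA={hail=3100.0}, SC={rear-end=2100.0}}
    }
}
```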

Data Tools: WEKA

WEKA

The Waikato Environment for Knowledge Analysis (WEKA) is a Java-based, open-source, platform-independent tool for data preprocessing, predictive data analytics, and the facilitation of interpretation and evaluation (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Miranda, n.d.; Xia & Gong, 2014). It was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga, 2015). It is a Java-based collection of machine learning algorithms for data mining and text mining tasks, covering predictive modeling, preprocessing, classification, regression, clustering, association rules, and visualization (WEKA, n.d.), and it includes in particular the C4.5 decision tree predictive analytics algorithm (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011). WEKA is open-source data and text mining software, so it is free to use and there are no costs associated with this solution.
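
As a minimal example of driving WEKA from Java, the following sketch (with a placeholder ARFF file name) builds J48, WEKA's implementation of the C4.5 decision tree mentioned above, and evaluates it with 10-fold cross-validation:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaTreeDemo {
    public static void main(String[] args) throws Exception {
        // "mydata.arff" is a placeholder for any ARFF dataset whose last
        // attribute is the class label.
        Instances data = new DataSource("mydata.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // last column = class

        J48 tree = new J48();                 // C4.5-style decision tree learner
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        tree.buildClassifier(data);           // final model on the full dataset
        System.out.println(tree);             // prints the induced tree
    }
}
```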

WEKA can be applied to big data (WEKA, n.d.) and SQL databases (Patel & Donga, 2015). Consequently, WEKA has been used in many research studies involving big data analytics (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011; Parkavi & Sasikumar, 2016; Xia & Gong, 2014). For instance, Barak and Modarres (2015) used WEKA for decision tree analysis to predict stock risks and returns.

That it has been used in so many research studies indicates that the reliability and validity of the software are high and well established. Indeed, in a study comparing WEKA with 12 other data analytics tools, it was one of only two applications studied that offer classification, regression, and clustering algorithms (Gera & Goel, 2015).

A disadvantage of this tool is its lack of support for multi-relational data mining, though if one can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). Its advantage is the comprehensiveness of its analysis algorithms for both data and text mining, as well as preprocessing. Another disadvantage is that WEKA cannot handle raw data directly, meaning data must be preprocessed before it is entered into the software package and analyzed (Hoonlor, 2011). WEKA cannot even import Excel files; data in Excel must be converted to CSV format to be usable within the system (Miranda, n.d.).
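
For that last step, WEKA's own converter classes can perform the CSV-to-ARFF conversion programmatically. A minimal sketch with placeholder file names:

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

// Sketch of the preprocessing step mentioned above: converting a CSV export
// (e.g., saved from Excel) into ARFF using WEKA's converter classes.
public class CsvToArff {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("export.csv"));   // CSV saved from Excel
        Instances data = loader.getDataSet();

        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("export.arff"));     // WEKA-native format
        saver.writeBatch();
    }
}
```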

References:

  • Dogan, N., & Tanrikulu, Z. (2013). A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Information Technology and Management, 14(2), 105-124. doi:10.1007/s10799-012-0135-7
  • Gera, M., & Goel, S. (2015). Data Mining - Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22-29.
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Kumar, D., & Fet, D. (2011). Performance Analysis of Various Data Mining Algorithms: A Review. International Journal of Computer Applications, 32(6), 9–16.
  • Miranda, S. (n.d.). An Introduction to Social Analytics: Concepts and Methods.
  • Parkavi, S. & Sasikumar, S. (2016). Prediction of Commodities Market by Using Data Mining Technique. i-Manager’s Journal on Computer Science.
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).
  • WEKA (n.d.). WEKA 3: Data Mining Software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/
  • Xia, B. S., & Gong, P. (2014). Review of business intelligence through data analysis. Benchmarking, 21(2), 300-311. doi:10.1108/BIJ-08-2012-0051