Parallel Programming: Resource Guard

A quick note:

“In the resource-guard-thread pattern, resource-guard threads represent resources.  Such threads are arranged in a virtual assembly line and connected by queues implemented as safe objects” (Sanden, 2011).

By the definition above, the search and insertion threads hold exclusive access to the data they need to perform subdivision and legalization, and they reach that data through an insertion point rather than a queue. Thus, this is a resource-user-thread pattern.

“As long as each resource user has exclusive access to no more than one resource at a time, the designer can usually choose between a solution with resource-guard threads and one with resource threads.  In this sense, the two patterns are dual.” (Sanden, 2011)

A dual solution would look like this: the search and insertion threads would return an index to a safe object, which would house all the data.  The data can then be placed on a queue in order to proceed with step two, which is subdivision and legalization.

Parallel Programming: Safe objects and shared objects

Shared objects that force mutual exclusion on the threads that call them are “safe objects”.  The mutual exclusion can be relaxed for operations that don’t change any data, such as a read of the data in the “safe object” (Sanden, 2011).  In the examples for this course, we have dealt with Java “safe objects”, whose operations are declared synchronized.

  1. A safe object in a jukebox represents the CD player. Customer threads call an operation to queue up to play a song.
    • Input into Song Queue: data can be added by multiple people on multiple devices, but there is only one set of CDs, and only one song from one CD can play at a time.  Data is stored in an array.
    • Change Song order in the Queue: The Song Queue can be prioritized based on predefined parameters. For example, the DJ can have ultimate priority to adjust the order and make their own requests, while customers have a lower priority.  If there is a tiered pay structure, a higher priority can be placed on a Song in the Song Queue for those willing to pay more. This means that the data stored in the array can be rearranged depending on the thread’s priority.
    • Remove Song from Queue: after the song is done playing, the song’s name is removed from the front of the Song Queue (position 0 of the array). This forces the remaining array values to shift up by one.
    • Read Song Queue: though it does not need to be mutually exclusive, this operation is still needed in order to find the next song to play.  It shouldn’t change any data in the array; it only reads the song in position 0 of the array.
  2. In a different design, the safe object in a jukebox represents a queue of song requests. Customer threads call an operation to add a song request to the queue. A CD thread calls a different operation to retrieve the next request.
    • All of the operations required for the song queue in the previous example, or a subset of them, could be applied to this example.  An example of a sufficient subset would be {Input into Song Queue, Remove Song from Queue, Read Song Queue}
    • Locate the next CD request: Based on the data in the Song Queue, locate the CD containing the next Song to be played.
    • Play Song on CD: One song from one CD can be played at any time.
    • Transition Song on CD: As one song ends, fade out the noise exponentially in the last 10 seconds and begin the next song on the Song Queue by increasing the song volume exponentially in the first 5 seconds to normal volume.
    • Put away the CD from the last song played: place the CD back into its predetermined location for future use. Once completed, this operation calls the Locate the next CD request operation.
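
The song-queue operations listed above can be sketched as a safe object. This is a minimal sketch in Python rather than Java, so a `threading.Lock` stands in for Java's `synchronized`; the class name `SongQueue`, its method names, and the song names are hypothetical and chosen only for illustration.

```python
import threading


class SongQueue:
    """Hypothetical safe object: every operation runs under one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._songs = []  # position 0 holds the next song to play

    def add_request(self, song):
        # Input into Song Queue: customers append under mutual exclusion.
        with self._lock:
            self._songs.append(song)

    def move_to_front(self, song):
        # Change Song order: for example, the DJ promotes a request.
        with self._lock:
            self._songs.remove(song)
            self._songs.insert(0, song)

    def next_song(self):
        # Remove Song from Queue: take position 0; the rest shift up by one.
        with self._lock:
            return self._songs.pop(0) if self._songs else None

    def peek(self):
        # Read Song Queue: changes no data. The lock is still taken here to
        # keep the sketch simple, although a read could be relaxed.
        with self._lock:
            return self._songs[0] if self._songs else None

    def __len__(self):
        with self._lock:
            return len(self._songs)


def customer(queue, name, count):
    # Each customer thread queues up `count` requests.
    for i in range(count):
        queue.add_request(f"{name}-song-{i}")


queue = SongQueue()
threads = [threading.Thread(target=customer, args=(queue, f"c{n}", 100))
           for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
queue.move_to_front("c3-song-99")
```

Because every method takes the same lock, no request is lost even though four customer threads add songs concurrently.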

Parallel Programming: Synchronized Objects

Sanden (2011) shows how to use synchronized objects in Java, which are “safe” objects protected by locks through their synchronized methods.  In Java, we can create threads in two ways: (1) extend the class Thread or (2) implement the interface Runnable.  The latter defines the code of a thread in the method void run(), and the thread completes its execution when it reaches the end of that method (which is analogous to a subroutine in FORTRAN).  With the former, you use the constructors public Thread() and public Thread(Runnable runObject) along with methods like public void start().
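
The two ways of creating a thread can be illustrated with a short sketch. It is written in Python rather than Java: subclassing `threading.Thread` and overriding `run()` mirrors option (1), and handing a callable to a thread through `target` is a rough analogue of option (2). The names `Worker`, `task`, and `results` are hypothetical.

```python
import threading

results = []
results_lock = threading.Lock()  # protects the shared list


class Worker(threading.Thread):
    """Analogue of option (1): extend the thread class and override run()."""

    def run(self):
        with results_lock:
            results.append("subclass")


def task():
    """Analogue of option (2): separate runnable code handed to a thread."""
    with results_lock:
        results.append("runnable")


t1 = Worker()
t2 = threading.Thread(target=task)
for t in (t1, t2):
    t.start()  # start() creates the thread, which then executes run()
for t in (t1, t2):
    t.join()   # each thread completes when its run() returns
```

In both cases the thread is started with `start()`; calling `run()` directly would execute the code in the calling thread instead.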

Additional Examples:

MapReduce

According to Hortonworks (2013), the MapReduce process at a high level is: Input -> Map -> Shuffle and Sort -> Reduce -> Output.

Tasks:  Mappers create and process transactions on a data set stored in a distributed system and place the wanted data in a map/aggregate under a certain key.  Reducers know what the key values are, and take all the values that the mappers stored under the same key on different nodes of a cluster (per the distributed system) to reduce the data to what is relevant (Hortonworks, 2013). Different reducers can work on different keys.

Example: A good example of a MapReduce request is to look at all CTU graduate students and sum up their current outstanding school loans per degree level.  Thus, the final output from our example would be:

  • Doctoral Students Current Outstanding School Loan Amount
  • Master’s Students Current Outstanding School Loan Amount

Now let’s assume that this ran in Hadoop, which can do MapReduce.   Also, let’s assume that I could use 50 nodes (threads) to process this request.  The bad data that gets thrown out in the mapper phase would be the Undergraduate Students, given that they do not match the initial search criteria.  The safe data will be the records associated with Doctoral and Master’s Students.  So, during the mapping phase, the threads will assign Doctoral Students to one key and Master’s Students to another key.  Each node (thread) will use the same keys for their respective students, thus the keys are the same in all nodes (threads).  The reducer uses these keys, and the safe objects stored under them, to sum up all of the current outstanding school loan amounts under the correct group.  Thus, once all nodes (threads) have finished the reducer part, we will have our two amounts:

  • Doctoral Students Current Outstanding School Loan
  • Master’s Students Current Outstanding School Loan
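
The flow Input -> Map -> Shuffle and Sort -> Reduce -> Output for this example can be sketched in a few lines. This is a single-process illustration in plain Python, not Hadoop code; the student records and loan amounts are made up, and the function names `mapper`, `shuffle`, and `reducer` are hypothetical.

```python
from collections import defaultdict

# Made-up input records: (student_id, degree_level, outstanding_loan)
students = [
    ("s1", "Doctoral", 40000),
    ("s2", "Master's", 25000),
    ("s3", "Undergraduate", 18000),
    ("s4", "Doctoral", 10000),
    ("s5", "Master's", 5000),
]


def mapper(record):
    # Map: keep only graduate students and emit (key, value) pairs.
    _, level, loan = record
    if level in ("Doctoral", "Master's"):
        yield level, loan


def shuffle(pairs):
    # Shuffle and sort: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    # Reduce: collapse the values of one key into a single total.
    return key, sum(values)


mapped = [pair for record in students for pair in mapper(record)]
totals = dict(reducer(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster the mappers and reducers would run on different nodes; here they run one after another so that the data flow is easy to follow.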

Complexity could be added if we only wanted to look at graduate students who are currently active or non-active service members.  The request could be further complicated by grouping on gender, profession, or diversity signifiers; we could even map to the students’ current industry.

Parallel Programming: Threads

A thread is a unit (or sequence of code) that can be executed by a scheduler, essentially a task (Sanden, 2011). A single thread (task) has one program counter and a sequence of code. Multi-threading occurs when multiple threads, each with its own program counter, share common code. Thus, a multi-threaded program has many sequences of code that can be assigned to different processors to run in parallel (simultaneously) to speed up a task. Another way to multi-thread is to execute the same code on different processors with different inputs. If data is shared between the threads, there is a need for a “safe” object with synchronization, where only one thread at a time can access the data stored in the “safe” object. It is through these “safe” objects that a thread can communicate with another thread.
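
As a minimal sketch of the second style (the same code run by several threads on different inputs), the Python example below uses `queue.Queue`, which does its own locking, as the “safe” object through which the threads hand results back. The function name `square_all` and the input ranges are hypothetical.

```python
import queue
import threading


def square_all(numbers, out):
    # Every thread runs this same code; only the input differs.
    for n in numbers:
        out.put(n * n)  # queue.Queue synchronizes access internally


out = queue.Queue()  # the shared "safe" object
chunks = [range(0, 5), range(5, 10)]
threads = [threading.Thread(target=square_all, args=(chunk, out))
           for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All producers have finished, so the queue can be drained safely.
squares = []
while not out.empty():
    squares.append(out.get())
squares.sort()
```

The threads never touch each other's data directly; the queue is the only point of communication.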

An additional example that may help illustrate the material: 

Suppose we would like to know the average of the sum of all the credits and the average of the sum of all the debits made in personal checking accounts in December at SunTrust Bank. Using MapReduce techniques with multiple threads, we can go through the entire database system to find accounts and timestamped transactions, map out all the data, and reduce it to the two numbers our query needs to return.

Adv DB: Data Warehouse & Data Mining

Data warehouses allow people with decision power to quickly locate adequate data in one location that spans multiple functional departments and is well integrated, so that they can produce reports and in-depth analyses to make effective decisions (MUSE, 2015a). The data warehouse by itself doesn’t answer the Who, What, Where, When, Why, and How; that is where data mining can help.  A data warehouse, when combined with data mining tools, can create a decision support system (DSS), which can be used to uncover hidden relationships within the data (MUSE, 2015b). A DSS needs both a place to store data and a way to sort through it in order to make sense of the data and provide meaningful insights to the decision-maker.  Data that is to be used for meaningful insights must be prepared/transformed (and checked for quality) while in the data warehouse, and this preparation must be completed before the data is used in a data mining tool.  Also, results from the data mining tool can be placed back into the data warehouse so that they can be seen by all end-users and reused by others.

Data Warehouse & Data Mining

A data warehouse is a centralized collection of consistent, subject-oriented, integrated, special variant and/or temporally variant, nonvolatile data that enables decision-makers to make desirable business decisions based on their gathered insights and predictions from the data about the near future (Tryfona et al, 1999). Ballou & Tayi (1999) stated that a key feature of a data warehouse is its use for decision making, not for operational purposes.  Nevertheless, data warehouses don’t answer the questions Who, What, When, Where, Why, and How; a data warehouse is just a data repository (MUSE, 2015b). This is consistent with what Tryfona et al (1999) stated: there is little distinction between how data is modeled in a data warehouse and in a database. Databases, though, can be and are used in operational situations, which weakens the argument of Tryfona et al (1999), because, as Ballou & Tayi (1999) pointed out, operational data usually focuses heavily on current data, whereas decision-makers look at historical data across time intervals to make temporal comparisons.

Databases and data warehouses cannot make decisions on their own, but they are the platform on which data is stored centrally so that the right decision analysis techniques can be applied to the data to extract meaning from it. The right decision analysis technique comes from data mining, which helps find meaningful, once-hidden patterns in the data (in this case stored in the data warehouse).  Data mining can look into past and current data to make predictions about the future (Silltow, 2006).   This is nothing new; statisticians have been using these techniques manually for years to help discover knowledge from data. Discovering knowledge from centrally stored data, which can come from multiple sources in a business or other data creation systems that can be linked together, is what a warehouse does best (Connolly & Begg, 2015). Data warehouses also enable reuse: using the same data in new ways to discover insights about a subject beyond the original purpose for collecting that data (Ballou & Tayi, 1999).  Data warehouses can support low-level organizational decisions as well as high-level (enterprise-wide) decisions.  Data that aids decision making can be fed into a data warehouse from mainframes, proprietary file systems, servers, internal workstations, external websites, etc.  Storing some data offline or online helps mainly to improve querying speeds. Summarized data, which is updated automatically as new data enters the warehouse, can help improve query speeds, while detailed data can be stored online if it helps support/supplement the summarized data (Connolly & Begg, 2015).

Failure in the implementation of a data warehouse can stem from poor data quality. Data quality should be built into the data warehouse’s planning, implementation, and maintenance phases.  Ballou & Tayi (1999) warned that even though the reuse of data stored in a data warehouse is a key driver for companies to adopt one, data quality must be preserved.  Data quality encompasses the following attributes: accuracy, completeness, consistency, timeliness, interpretability, believability, value-added, and accessibility.  Most people generating data are familiar with its error rates, margins of error, deficiencies, and idiosyncrasies, but when the data is rolled up into a data warehouse (and these are not communicated properly), people outside of the data-generating organization will not know them, and their final decisions could be prone to errors.  One must also consider that different uses need different levels of data quality within a data warehouse, such as the levels needed for relevant decision making, project design, future needs, etc.  We must ask our data providers what is unsatisfactory about the data they currently feed into the data warehouse, and to what quantifiable level (Ballou & Tayi, 1999).  As the old adage goes, “Garbage In – Garbage Out”.

So, what can cause data quality issues?  Let’s take a real estate company, REMAX, which has a data warehouse. The data for sales isn’t consistent, because differing stakeholders have different definitions of what a sale/price is.  The mortgage company may say that a sale is the closing price of the house, whereas REMAX may say it is the negotiated list price of the house, the broker may say it is the final settlement price of the house after the home inspection, and the insurance company may say it is the price of the building materials in the house plus 65-70 thousand dollars for internal possessions.  REMAX may want all of this data in order to provide the best service to their customers and a realistic view of what goes on, monetarily, in purchasing a house, but REMAX must know these definitions ahead of time, as they input the data into their data warehouse.  This could be valuable information for home buyers when they are deciding which of two or three properties they would like to own.  There could also be syntactic inconsistencies between all these sources of data, like $60K, $60,000, $60,000.00, 60K, $60000, etc.
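
Syntactic inconsistencies like the ones listed above can be resolved with a normalization step before the data enters the warehouse. The following is a minimal sketch that handles only the five spellings mentioned in the text; the function name `normalize_price` and the accepted pattern are assumptions, and a production cleaner would need to cover more formats and currencies.

```python
import re
from decimal import Decimal

# Optional "$", digits with optional thousands separators and cents,
# and an optional "K" suffix meaning thousands.
_PRICE = re.compile(r"^\$?\s*([\d,]+(?:\.\d+)?)\s*([kK]?)$")


def normalize_price(text):
    """Return the price in dollars as a Decimal, or raise ValueError."""
    match = _PRICE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized price format: {text!r}")
    amount = Decimal(match.group(1).replace(",", ""))
    if match.group(2):  # a trailing K multiplies by one thousand
        amount *= 1000
    return amount


raw = ["$60K", "$60,000", "$60,000.00", "60K", "$60000"]
normalized = {normalize_price(price) for price in raw}
```

All five spellings collapse to the same value, so `normalized` contains a single element. Semantic inconsistencies (closing price versus list price) cannot be fixed this way; they need agreed definitions.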

Another way the implementation of a data warehouse could fail, according to Ballou & Tayi (1999), is by not including appropriate data (in other words, data availability).   Even though critical data can exist as soft data (uncertain data), text-based data, and external sources of data, this set of data could be ignored altogether.  They add that this type of data, so long as it can support the organization in any “meaningful way”, should be added to the centralized data warehouse.  One must weigh the high cost of acquiring data that may turn out to be useless against the fact that it is relatively easy (inexpensive) to delete rarely used data once it is in the system.  There is also an opportunity cost to adding irrelevant data: we could have used our resources to improve the timeliness of the current data (or provide real-time data) or to eliminate null values in a different data set that is already in the system.

To solve the issue of data quality, decision-makers and data warehouse managers must think systematically about what data is required, why it is required, and how it should be collected and used (Ballou & Tayi, 1999).  This can be done when a data warehouse manager asks the end-users what decisions the data warehouse will support.  From that information, one can decipher what these stakeholders require through MoSCoW: What is a “Must have”? What is a “Should have”? What is a “Could have”? What is a “Wish to have”? In the REMAX case, they should list the final asking price before the inspection (as they do) as a “Must have”, typical closing costs for a house in that price range, as provided by the mortgage company, as a “Should have”, the average house insurance costs as a “Could have”, etc. Ballou & Tayi (1999) said that other factors can affect data quality enhancement projects, such as the current quality, required quality, anticipated quality, priority of the organizational activity (as aforementioned with MoSCoW), cost of data quality enhancements (and their aforementioned tradeoffs/opportunity costs), and their value added to the data warehouse.  Data quality is needed in order to use data mining tools, and many papers that use data mining or text mining describe a preprocessing step that must occur before full analysis can begin: Nassirtoussi et al (2015), Kim et al (2014), Barak & Modarres (2015), etc.

According to Silltow (2006), data mining tools can be grouped into three types: traditional (complex algorithms and techniques that find hidden patterns in the data and highlight trends), dashboard (data changes are shown on a screen, which is mostly used to monitor information), and text-mining (complex algorithms and techniques that find hidden patterns in text data, even to the point of figuring out the sentiment in a string of words, and which can include video and audio data).  Data mining techniques range from artificial neural networks (prediction models that use training data to learn and then make forecasts), as in Nassirtoussi et al (2015) and Kim et al (2014); to decision trees (a set of defined if-then statements, also known as rules, whose results are easier to understand), as in Barak & Modarres (2015); to nearest neighbor (which uses similar past data to make predictions about the future), etc.
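
Of the techniques named above, nearest neighbor is the simplest to show in code. The sketch below is a bare one-nearest-neighbor predictor in plain Python, not the method of any of the cited papers; the observations and labels are made up, and the function name `nearest_neighbor` is hypothetical.

```python
import math


def nearest_neighbor(history, query):
    """Predict the outcome of `query` from the most similar past observation."""
    closest = min(history, key=lambda item: math.dist(item[0], query))
    return closest[1]


# Made-up past observations: (features, outcome)
history = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((5.0, 5.5), "high"),
    ((6.0, 5.0), "high"),
]

prediction = nearest_neighbor(history, (5.4, 4.9))
```

Similarity here is plain Euclidean distance; real tools usually scale the features first and vote among several neighbors rather than one.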

Finally, another aspect of data quality is the output from data mining tools, especially since we can plug that output back into the data warehouse for future reuse.  Data mining tools are just that: automatic algorithms used to discover knowledge.  These tools lack the human intuition needed to distinguish between a relevant and an irrelevant correlation.  For instance, a hospital data warehouse may hold summer data showing a large increase in ice cream consumption (which could lead to obesity) alongside an increase in the number of pool/beach drownings, and a tool may conclude that ice cream consumption leads to drownings, rather than recognizing that both occur in the summer and that neither necessarily causes the other.  This is why Silltow (2006) suggests that all results provided by these tools be quality checked after use, so as not to give out false or irrelevant insights that are preposterous when analyzed by a human.
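
The ice cream example can be made concrete with a small calculation. The sketch below uses made-up monthly numbers (not data from any source) in which both series depend only on temperature; the Pearson correlation between them is still high, which is exactly the kind of result that needs a human quality check. The function name `pearson` and all values are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up monthly temperatures, January to December.
temperature = [2, 4, 9, 14, 19, 24, 27, 26, 21, 15, 8, 3]

# Both series are generated from temperature alone; neither uses the other.
ice_cream_sales = [t * 10 + 50 for t in temperature]
drownings = [t // 3 + 1 for t in temperature]

r = pearson(ice_cream_sales, drownings)
```

The coefficient `r` comes out close to 1 even though, by construction, ice cream sales have no effect on drownings: the season drives both.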

Conclusion

Data warehouses allow people with decision power to locate adequate data quickly to make effective decisions. The data that is planned, entered, and maintained should be of acceptable quality.  Poor quality in the data may drive poor quality decisions.  The best way to improve data quality is to consider the eight aforementioned attributes of data quality when asking stakeholders what data, from a systemic point of view, would be useful in a data warehouse.  Deciding what data should be included can be very hard for decision-makers at that moment, though they may have a general idea of what decisions they need to make soon.  Data collection and quality must be weighed against their costs and their significance.

References

  • Ballou, D. P., & Tayi, G. K. (1999). Enhancing data quality in data warehouse environments. Communications of the ACM, 42(1), 73-78.
  • Barak, S., & Modarres, M. (2015). Developing an approach to evaluate stocks by forecasting effective features with data mining methods. Expert Systems with Applications, 42(3), 1325–1339. http://doi.org/10.1016/j.eswa.2014.09.026
  • Connolly, T. & Begg, C. (2015).  Database Systems:  A Practical Approach to Design, Implementation, and Management, Sixth Edition.  Boston:  Pearson.
  • Kim, Y., Jeong, S. R., & Ghani, I. (2014). Text opinion mining to analyze news for stock market prediction. Int. J. Advance. Soft Comput. Appl, 6(1).
  • My Unique Student Experience (2015a). Data Warehousing Concepts and Design. Retrieved from: https://class.ctuonline.edu/_layouts/MUSEViewer/Asset.aspx?MID=1819502&aid=1819506
  • My Unique Student Experience (2015b). Online Analytical Processing. Retrieved from: https://class.ctuonline.edu/_layouts/MUSEViewer/Asset.aspx?MID=1819502&aid=1819509
  • Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2015). Text mining of news-headlines for FOREX market prediction: A Multi-layer Dimension Reduction Algorithm with semantics and sentiment. Expert Systems with Applications, 42(1), 306-324.
  • Silltow, J. (2006). Data mining 101: Tools and techniques.  Retrieved from: https://iaonline.theiia.org/data-mining-101-tools-and-techniques
  • Tryfona, N., Busborg, F., & Borch Christiansen, J. G. (1999, November). starER: a conceptual model for data warehouse design. In Proceedings of the 2nd ACM international workshop on Data warehousing and OLAP (pp. 3-8). ACM.

Adv DB: Conducting data migration to NoSQL databases

Relational database schema design (primarily ERDs) is all about creating models and then translating them into a normalized schema, but one must be an oracle to anticipate a holistic, end-to-end design, or else suffer when making changes to the database (Scherzinger et al, 2013).  Relational databases are also poor at data replication, horizontal scalability, and high availability (Schram & Anderson, 2012).  Thus, waterfall approaches to database design are no longer advantageous, and, like software development, databases can be designed with an agile mentality, especially as data store requirements are always evolving. “Not Only SQL” (NoSQL) databases that adopt a “Schema-less” approach (where data can be stored without any predefined schema) or an “Implicit Schema” approach (where the data definition is taken from the application that places the data into the database) allow for agile development on a release cycle that can be yearly, monthly, weekly, or daily, depending entirely on the developers’ iteration cycle (Sadalage & Fowler, 2012).  Taking a look at an agile development lifecycle for a blogging site (below) shows how useful schema-less or implicit schemas in NoSQL database development can be, as well as the technical debt that is created, which can cause migration issues down the line.

Blogging

We start a blogging site called “blog.me” in an agile environment, which means iterative improvements where each iteration produces a releasable product (even if we decide not to make a release or update at the end of the iteration).  The programming team has decided that the minimum viable product will consist of the fields title and content for the blogger, and comments from other people.  This is similar to the example proposed by Scherzinger et al (2013) to explain how implicit schemas work.  In the second iteration, the programming team for “blog.me” discovers abuse in the commenting section of the blog.  People have been “trolling” the blog, so to mitigate this, the team implements a sign-in process with a username and password taken from Facebook, which also allows for liking a post.  Rather than having bloggers recreate their content, the programmers implement this update for current and future posts. In the third iteration, the programming team decides to institute a uniform nomenclature for some of their fields.  Rather than changing all the posts from the first two iterations, the programmers decide to enforce these changes moving forward.

Now, one can see how useful schema-less development (provided by NoSQL) can be.   There is no downtime in how the site operates, and no additional burden is placed on the end-users when an update occurs. But we now have to worry about migrating these three data classes (what Scherzinger et al call technical debt). And if a commenter comments on a post made in iteration one or two after iteration three has been implemented, we may then have four to five different data classes.  Developers love to develop code and add new features rather than maintain code, which is why this form of database development is appealing, but, as we can see, technical debt can pile up quickly.  Our goal is to manage a schema for this data, yet keep the flexibility of a schema-less database system.

Types of Migration

The migration of data in and out of a data store is usually enabled through a replication scheme (Shirazi et al, 2012) conducted through an application.  There are two primary types of data migration per Scherzinger et al (2013): eager and lazy.  Eager migration means we migrate all the data in a batch: each entity is retrieved from the data store one by one, transformed, and written back into the data store.  As the data grows larger, eager migration can become resource-intensive, and effort can be wasted on stale data.  Thus, the lazy approach is considered a viable option.  Transformations are conducted only when a piece of data is touched, so only live and hot data (relevant data) is updated.  Even though this approach saves on resources, if an entity becomes corrupted, there may be no way to retrieve it.  In order to do the migration, an application needs to create an “implicit-schema” on the “schema-less” data.
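
The difference between eager and lazy migration can be sketched with the blog posts from the previous section. This is a minimal in-memory illustration, not the approach of Scherzinger et al: a Python dict stands in for the data store, and the `version` field, the `likes` default, and the rename of `content` to `body` are hypothetical details invented for the example.

```python
CURRENT_VERSION = 3


def upgrade(post):
    """Bring one stored entity up to the current implicit schema."""
    post = dict(post)  # work on a copy
    version = post.get("version", 1)
    if version < 2:
        post.setdefault("likes", 0)         # iteration 2 added likes
    if version < 3 and "content" in post:
        post["body"] = post.pop("content")  # iteration 3 renamed a field
    post["version"] = CURRENT_VERSION
    return post


def migrate_eagerly(store):
    # Eager: retrieve, transform, and write back every entity, stale or not.
    for key in list(store):
        store[key] = upgrade(store[key])


def read_lazily(store, key):
    # Lazy: transform only the entity that is being touched right now.
    store[key] = upgrade(store[key])
    return store[key]


store = {
    "post1": {"title": "Hello", "content": "First post"},  # iteration 1
    "post2": {"title": "Again", "content": "Second",
              "likes": 4, "version": 2},                   # iteration 2
}
lazy_store = {key: dict(value) for key, value in store.items()}

hot = read_lazily(lazy_store, "post1")  # only post1 is upgraded
migrate_eagerly(store)                  # every post is upgraded
```

After the lazy read, `post2` in `lazy_store` is still in its iteration 2 form, which is the saving (and the lingering technical debt) the text describes.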

NoSQL and its multiple flavors

NoSQL databases can deal with aggregate data (relationships between units of data that can be relationally mapped) using key-value, document, and column-family databases (Scherzinger et al, 2013; Sadalage & Fowler, 2012; Schram & Anderson, 2012).  There also exist graph databases (Sadalage & Fowler, 2012).  Key-value databases store data as a unique key and a value, while document databases store documents or their parts in a value (Scherzinger et al, 2013). People can blur the line between document and key-value databases by adding an ID field, but for the most part, you will query a document database rather than look up a key or ID (Sadalage & Fowler, 2012). Column-family databases, in contrast, store the information in transposed table structures (as columns rather than rows).  Graph databases can show relationships in huge datasets that are highly interconnected; the complexity of the data is emphasized in this kind of database rather than the size of the data (Shirazi et al, 2012).  A further example of a graph database is shown in the Health section below.  Migrating between the multiple flavors of NoSQL databases allows one to exploit the strengths and mitigate the weaknesses of each type when it comes to analyzing large data quickly.
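
To make the differences between the aggregate-oriented flavors more tangible, the sketch below lays out the same two blog posts in three ways using plain Python structures. It does not use any real NoSQL product or API; the keys, field names, and the `info` column family are hypothetical.

```python
import json

# Key-value: the store sees an opaque value, so access is by key only.
key_value_store = {
    "post:1": json.dumps({"title": "Hello", "author": "ann"}),
}

# Document: the store understands the structure, so fields can be queried.
document_store = [
    {"_id": 1, "title": "Hello", "author": "ann"},
    {"_id": 2, "title": "Again", "author": "bob"},
]

# Column-family (column-oriented): row key -> column family -> column -> value.
column_family_store = {
    "post:1": {"info": {"title": "Hello", "author": "ann"}},
    "post:2": {"info": {"title": "Again", "author": "bob"}},
}

# Lookup by key, then decode the value in the application.
by_key = json.loads(key_value_store["post:1"])["title"]

# Query on a field inside the documents.
by_query = [doc["title"] for doc in document_store if doc["author"] == "bob"]

# Read one column across all rows.
authors_column = [row["info"]["author"] for row in column_family_store.values()]
```

The data is identical in all three layouts; what changes is which access pattern the layout makes cheap.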

Data Migration Considerations and Steps

Since data migration uses replication schemes from an application, one must consider how complex writing a SQL query would be if this were a relational database schema (Shirazi et al, 2012).  This has implications for how complex transforming or migrating data would be under NoSQL databases, especially when big data is introduced into the equation.  Thus, the pattern of database design must be taken into account when migrating data from relational databases to NoSQL databases, or between different NoSQL database types (or even providers). Also, each of these database types treats NULL values differently: some NoSQL databases ignore NULL values and don’t waste storage space on them, some systems keep them as relational databases do, and some systems allow them but don’t query for them (Scherzinger et al, 2013).  Scherzinger et al (2013) suggest that when migrating data, three things must be taken into account: data models (data stored in the databases that belongs to an object or a group, which can have several properties), query models (how data can be inserted, transformed, and deleted based on a key-value or some other kind of identification), and freedom from schema (whether or not the global structure of the data can be fixed in advance). Schram & Anderson (2012), meanwhile, stated that data models are key when making design changes (migrations) between database systems. Since data in NoSQL is “schema-less”, there may not be any global structure, but applications (such as web user-interfaces) built on top of the data stores can display an implicit structure, and from that, we can list a few steps to consider when migrating data (Tran et al, 2011):

  • Installation and configuration
    1. Set up development tools and environment
    2. Install and set up environments
    3. Install third-party tools
  • Code modification
    1. Set up database connections
    2. Database operation query (if using a NoSQL database)
    3. Any required modifications for compatibility issues
  • Migration
    1. Prepare the database for migration
    2. Migrate the local database to the NoSQL database (the schema-less part)
    3. Prepare system for migration
    4. Migrate the application (the implicit-schema part)
  • Test (how to ensure the data stored in the databases matched with the “Implicit Schema” embedded in the applications when the “Implicit Schema” has experienced a change)
    1. Test if the local system works with a database in NoSQL
    2. Test if the system works with databases in NoSQL
    3. Write test cases and test for functionality of the application in NoSQL

When doing code modification (step 2), the move from a relational database to a NoSQL database will require more changes, and JOIN operations may not be fully supported.  Thus, additional code may be required in order to maintain the serviceability of the application pre-migration, during migration, and post-migration (Tran et al, 2011).  Considering ITIL Service Transition standards, the best time to do a migration or update is in windows of minimum usage by end-users, while still maintaining agreed-upon minimum SLA standards.  As stated in Schram & Anderson (2012), they didn’t want their service to break while they were migrating their data from a relational database to a NoSQL column-family database.  Other issues, like compatibility between the systems housing the databases or even between database types, can also add complexity to a migration.  When migrating (step 3), SQL scripts need to be transformed as well, to align with the new database structure, environment, etc. (Tran et al, 2011). Third-party apps can help with this to a degree.  If the planning phase was conducted correctly, this phase should be relatively smooth.  Tran et al (2011) stated that there are at least 8 features that drive the cost of migration: (1) Project team’s capability, (2) Application/Database complexity, (3) Existing knowledge and experience, (4) Selecting the correct database and database management system, (5) Compatibility issues, database features, and (8) Connection issues during migration.

Health

A database called HealthTable was created from 7.2M medical reports in order to understand human diseases.  Shirazi et al (2012) decided to convert this column store into a graph database of the Health Infoscape (Table 1 to Figure 1).  Each cause/symptom stems from a disease (Dx). The conversion shows the aforementioned power of graph databases in facilitating data analysis, even though column-family databases provide an easier way to maintain the 7.2M data records.

Table 1. HealthTable in HBase per Shirazi et al (2012).

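Row key | Info: Name | Info: Category     | Prevalence: Female | Prevalence: Male | Prevalence: Total | Causes: Cause1 | Causes: Cause2 | Causes: Cause3
D1      | Heartburn  | Digestive system   | 9.4%               | 9%               | 9.2%              | D2             |                |
        | 1          | 1                  | 1                  | 1                | 1                 | 2              |                |
D2      | Chest Pain | Circulatory System | 6.8%               | 6.8%             | 6.8%              |                |                |
        | 3          | 3                  | 3                  | 3                | 3                 |                |                |
D4      | Dizziness  | Nervous System     | 4%                 | 2.8%             | 3.5%              |                |                |
        | 5          | 5                  | 5                  | 5                | 5                 |                |                |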

Figure 1. HealthGraph based on HealthTable
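
The conversion from the column layout of Table 1 to a graph can be sketched as follows. This is a minimal illustration, not the design pattern of Shirazi et al: the rows are transcribed from Table 1 under the assumption that the only filled cause cell links D1 to D2, and the names `health_table`, `to_graph`, and `neighbors` are hypothetical.

```python
# Rows transcribed from Table 1 (assumed reading: only D1 lists a cause, D2).
health_table = {
    "D1": {"name": "Heartburn", "category": "Digestive system", "causes": ["D2"]},
    "D2": {"name": "Chest Pain", "category": "Circulatory System", "causes": []},
    "D4": {"name": "Dizziness", "category": "Nervous System", "causes": []},
}


def to_graph(table):
    """Turn each row into a node and each cause cell into an edge."""
    nodes = {key: row["name"] for key, row in table.items()}
    edges = [(key, cause) for key, row in table.items() for cause in row["causes"]]
    return nodes, edges


def neighbors(edges, node):
    # Nodes linked to `node` in either direction.
    linked = {b for a, b in edges if a == node} | {a for a, b in edges if b == node}
    return sorted(linked)


nodes, edges = to_graph(health_table)
linked_to_d2 = neighbors(edges, "D2")
```

In the table, finding everything related to D2 means scanning the cause columns of every row; in the graph, the relationship is stored as an edge and can be followed directly.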

Conclusions

These two use cases (Health and Blogging) show that data migration can be quite complicated.  Schema-less databases allow for a more agile approach to development, whereas relational databases are best suited to waterfall.  However, with waterfall development slowly in decline, one must also migrate to other forms of development.  Applications/databases can migrate from relational databases to NoSQL, which requires a lot of coding because of compatibility issues, and they can also migrate between different types of NoSQL databases.  Each database structure has its strengths and weaknesses, and migrating data between these databases can provide opportunities for knowledge discovery from the data contained within them.  Migrating between database systems and NoSQL types should be conducted if it fulfills many of the requirements and promises to reduce the cost of maintenance (Schram & Anderson, 2012).

References

  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781323137376/
  • Scherzinger, S., Klettke, M., & Störl, U. (2013). Managing schema evolution in NoSQL data stores. arXiv preprint arXiv:1308.0514.
  • Schram, A., & Anderson, K. M. (2012). MySQL to NoSQL: data modeling challenges in supporting scalability. In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity (pp. 191-202). ACM.
  • Shirazi, M. N., Kuan, H. C., & Dolatabadi, H. (2012, June). Design Patterns to Enable Data Portability between Clouds’ Databases. In Computational Science and Its Applications (ICCSA), 2012 12th International Conference on (pp. 117-120). IEEE.
  • Tran, V., Keung, J., Liu, A., & Fekete, A. (2011, May). Application migration to cloud: a taxonomy of critical factors. In Proceedings of the 2nd international workshop on software engineering for cloud computing (pp. 22-28). ACM.

Adv DB: Data Services in the Cloud Service Platform

Rimal et al (2009) state that cloud computing systems are a disruptive service that has gained momentum. What makes it disruptive is that it has properties similar to prior technology, while adding new features and capabilities (like big data processing) in a very cost-effective way. It has become part of XaaS, or “X as a Service” (where X can be infrastructure, hardware, software, etc.). According to Connolly & Begg (2014), Data as a Service (DaaS) and Database as a Service (DBaaS) are considered cloud-based solutions. DaaS doesn’t use SQL interfaces, but it does enable corporations to access data to analyze value streams that they own or those they can easily expand into. DBaaS must be continuously monitored and improved on, because it usually serves multiple organizations, with the added benefit of providing charge-back functions per organization (Connolly & Begg, 2014) or a pay-for-use model (Rimal et al, 2009). However, one must pick a solution that best serves their business’/organization’s needs.

Benefits of the service

Connolly & Begg (2014) stated that there are benefits to cloud computing, such as: cost reduction due to lower CapEx; the ability to scale up or down based on data demands; the need for higher security, making data stored here more secure than in-house solutions; 24/7 reliability; faster development time, because time is not taken away to build the DBMS from scratch; and finally sensitivity/load testing that can be made readily available and cheaply because of lower CapEx. Rimal et al (2009) stated that the benefits came from lower costs, improved flexibility and agility, and scalability. You can set up systems as quickly as you wish under a pay-as-you-go model, which is great for disaster recovery efforts (as long as the disaster affected your systems, not theirs). Other benefits could be seen from an article looking at Data-as-a-Service in the health field: a low cost of implementation and maintenance of databases, defragmentation of data, exchange of patient data across health provider organizations, and a mode for standardizing the data types, forms, and frequency of data capture. From a health-care perspective, it could support research and strategic planning of medical decisions, improve data quality, reduce cost, reduce resource scarcity issues from an IT perspective, and finally provide better patient care (AbuKhousa et al, 2012).

Problems that can be removed because of the service

Unfortunately, there are two sides to a coin. A cloud service brings network dependency, such that if the supplier has a power outage, the consumer will not have access to their data. Other network dependencies can occur, like peak service hours, where service tends to be degraded compared to the supplier’s off-peak hours. Quite a few organizations use Amazon EC2 (Rimal et al, 2009) as their Cloud DBMS; if that system is hacked or its security is breached, the problem is bigger than if the attack were carried out against only one company. There are system dependencies, as in the case of disaster recovery (AbuKhousa et al, 2012): organizations are only as strong as their weakest link when it comes to a disaster, so if the point of failure is in the service and there are no other mitigation plans, that organization may have a hard time recuperating its losses. Also, by placing data into these services, you lose control over the data (lose control over availability, reliability, maintainability, integrity, confidentiality, intervenability, and isolation) (Connolly & Begg, 2014). Rimal et al (2009) gave clear examples of outages in services like Microsoft (down 22 hours in 2008) or Google Gmail (2.5 hours in 2009), etc. This lack of control is perhaps one of the main reasons why certain government agencies have had a hard time adopting a commercial cloud service provider; however, they are attempting to create internal clouds.
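To put the outage figures cited above in perspective, they can be turned into a yearly availability percentage. The sketch below is a rough back-of-the-envelope calculation and assumes that each cited outage was the only downtime that service had in that year, which the source does not state.

```python
def availability(downtime_hours, period_hours):
    """Fraction of the period during which the service was reachable."""
    return 1 - downtime_hours / period_hours


# Assumption: the cited outage was the only downtime in that year.
HOURS_2008 = 366 * 24  # 2008 was a leap year
HOURS_2009 = 365 * 24

microsoft_2008 = availability(22, HOURS_2008)
gmail_2009 = availability(2.5, HOURS_2009)

print(f"Microsoft 2008: {microsoft_2008:.4%}")
print(f"Gmail 2009:     {gmail_2009:.4%}")
```

Under that assumption the two services come out at roughly 99.75% and 99.97% availability, which is the kind of number an organization would compare against the SLA it needs.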

Architectural view of the system

The overall cloud architecture is a layered system that serves as a single point of contact and uses software applications over the web, using an infrastructure which draws on resources from the necessary hardware to complete a task (Rimal et al, 2009). Adapted from Rimal et al (2009), the figure below is what these authors describe as the layered cloud architecture.

Software-as-a-Service (SaaS)

Platform-as-a-Service (PaaS)

Developers implementing cloud applications

Infrastructure-as-a-Service (IaaS)

[(Virtualizations, Storage Networks) as-a-Service]

Hardware-as-a-Service (HaaS)

Figure 1. A layered cloud architecture.

Cloud DBMS can be private (an internal provider where the data is held within the organization), public (an external provider manages resources dynamically across their system and through multiple organizations supplying them data), and hybrid (consists of multiple internal and external providers) as defined in Rimal et al (2009).

A problem with DBaaS is the fact that the databases of multiple organizations are stored by the same service provider. Since data can be what separates one organization from its competitors, they must consider the following architectures: separate servers; shared server but separate database processes; shared database server but separate databases; shared database but separate schema; or shared database and shared schema (Connolly & Begg, 2014).

Let’s take two competitors: Walmart & Target, two supergiant organizations that have billions of dollars and inventory flowing in and out of their systems monthly. Let’s also assume three database tables with automated primary keys and foreign keys that connect the data together: Product, Purchasing, and Inventory. Another set of assumptions: (1) their data can be stored by the same DBaaS provider, (2) their DBaaS provider is Amazon.

While Target and Walmart may use the same supplier for their DBaaS, they can select one of those five architectural solutions to keep their valuable data from being seen. If Walmart and Target purchased separate servers, their data would be safe. They could also go this route if they want to store huge amounts of data and/or have many users of their data. Now if we narrow down our assumptions to the children’s toy section (breaking up the datasets into manageable chunks), both Target and Walmart can store their data on a shared server but in separate database processes; they would not have any shared resources like memory or disk, just the virtual environment. If Walmart and Target went on a shared database server but separate databases, this would allow for better resource management between the organizations. If Walmart and Target decided to use the same database (which is unlikely) but hold separate schemas, strong access management systems must be in place. This may be elected between CVS and Walgreens pharmacy databases, where patient data is vital and it is not impossible to switch from one schema to another; however, this interaction is highly unlikely. The final structure, highly unlikely for most corporations, is sharing databases and schemas. This final architectural structure is best used for hospitals sharing patient data (AbuKhousa et al, 2012), but HIPAA must still be observed under this final architecture mode. Though this is the desired state for some hospitals, it may take years to get to a full system. Security is a big issue here and will take a ton of development hours, but in the long run, it is the cheapest solution available (Connolly & Begg, 2014).
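The last option, a shared database with a shared schema, is the one where isolation depends entirely on the queries and access controls. A minimal sketch of that arrangement, using an in-memory SQLite database purely for illustration: the `tenant_id` column, the sample rows, and the `products_for` helper are hypothetical and are not taken from Connolly & Begg (2014).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One Product table shared by every organization; tenant_id is a
# hypothetical column recording which organization owns each row.
conn.execute(
    "CREATE TABLE product ("
    " product_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " tenant_id TEXT NOT NULL,"
    " name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO product (tenant_id, name) VALUES (?, ?)",
    [("walmart", "Toy truck"), ("target", "Board game"), ("walmart", "Doll")],
)


def products_for(tenant_id):
    """Return only the rows owned by one tenant, in insertion order."""
    rows = conn.execute(
        "SELECT name FROM product WHERE tenant_id = ? ORDER BY product_id",
        (tenant_id,),
    )
    return [name for (name,) in rows]
```

Every query has to carry the tenant filter, so a single forgotten WHERE clause would expose a competitor’s rows. That is why this layout demands the most security effort even though it shares the most resources.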

Requirements on services that must be met

Looking at Amazon’s cloud database offering, it should be easy to set up, easy to operate, and easy to scale. The system should enhance the availability and reliability of its live databases compared to in-house solutions. There should be software to back up the database, for recovery at any time. The organization’s effort on security patches and maintenance of the underlying hardware and software of the cloud DBMS should be reduced significantly, since that burden should not be placed onto the organization. The goal of the Cloud DBMS should be to shift development costs away from managing the DBMS so the organization can focus on applications and the data to be stored in the system. They should also provide infrastructure and hardware as a service to reduce overhead costs in managing these systems. Amazon’s Relational Database Service can use MySQL, Oracle, SQL Server, or PostgreSQL databases. Amazon’s DynamoDB is a NoSQL database service. Amazon’s Redshift costs less than $1K to store a TB of data per year (Varia & Mathew, 2013).

Issues to be resolved during implementation

Rimal et al (2010) raised some interesting things to consider before implementing a Cloud DBMS. What are the Service-Level Agreements (SLAs) of the Cloud DBMS supplier? The Cloud DBMS may be up and running 24×7, but if they experience a power outage, what are their SLAs to the organization, so as not to impact the organization at all? What is their backup/replication scheme? What is their discovery (assists in reusability) schema? What is their load balancing approach (to avoid bottlenecks), especially since most suppliers cater to more than one organization? What does their resource management plan look like? Most cloud DBMSs have several copies of the data spread across several servers, so how does the vendor ensure no data loss? What types of security are provided? What is their encryption and decryption strength for the data held within their servers? How private will the organization’s data be, if hosted on the same server or same database but separate schema? What are their authorization and authentication safeguards? Given that Varia & Mathew (2013) explain all the cloud DBMS services that Amazon provides, these questions should be addressed for each of their solutions. Thus, when analyzing a supplier for a cloud DBMS, having technical requirements that meet the Business Goals & Objectives (BG&Os) helps guide the organization to pick the right supplier and solution, given the issues that need to be resolved. Other issues identified came from Chaudhuri (2012): data privacy through access control, auditing, and statistical privacy; allow for data exploration to enable deeper analytics; data enrichment with web 2.0 and social media; query optimization; scalable data platforms; manageability; and performance isolation for multiple organizations occupying the same server.

Data migration strategy & Security Enforcement

When migrating data between organizations into a Cloud DBMS, the taxonomy of the data must be preserved. Along with taxonomy, one must ensure that no data is lost in the transfer, that data is still available to the end-user before, during, and after the migration, and that the transfer is done in a cost-efficient way (Rimal et al, 2010). Furthermore, data migration should be as seamless and efficient as if one were moving between suppliers of services, so that a supplier doesn’t get so entangled in the organization that it is the only solution the organization can see itself using. Finally, what type of data do you want to migrate over? Mission-critical data may be too precious for certain types of Cloud DBMS, but may be great for a virtualized disaster recovery system. What type of data to migrate depends on why the cloud system is needed to begin with, and which services the organization wants to pay for as it goes, now and in the near future. The type of data to migrate may also depend on the security features provided by the supplier.
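One common way to check that no data is lost in the transfer is to compare a row count and a checksum of the source against the target. The sketch below is only an illustration of that idea and is not drawn from Rimal et al; the function names and the sample rows are hypothetical, and it assumes rows can be represented as JSON-serializable dictionaries.

```python
import hashlib
import json


def table_fingerprint(rows):
    """Return (row count, order-independent SHA-256 digest) for dict rows."""
    # Serialize each row with sorted keys so equal rows give equal strings,
    # then sort the strings so row order does not affect the digest.
    serialized = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256("\n".join(serialized).encode("utf-8")).hexdigest()
    return len(rows), digest


def migration_is_lossless(source_rows, target_rows):
    """True when both sides hold the same rows, regardless of order."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)


# Hypothetical data: one target is merely reordered, the other lost a row.
source = [{"id": 1, "item": "A"}, {"id": 2, "item": "B"}]
target_ok = [{"item": "B", "id": 2}, {"id": 1, "item": "A"}]
target_bad = [{"id": 1, "item": "A"}]
```

Here `migration_is_lossless(source, target_ok)` holds while `migration_is_lossless(source, target_bad)` does not. A real migration would run such a check per table or per batch rather than loading everything into memory.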

Organizational information is a vital resource to any organization, and controlling access to it and keeping it proprietary is key. If this is not enforced, data like employee social security numbers or credit card numbers of past consumers can be compromised. Rimal et al (2009) compared the security considerations in the cloud systems of the time:

  • Google: uses 128 Bit or higher server authentication
  • GigaSpaces: SSH tunneling
  • Microsoft Azure: Token services
  • OpenNebula: Firewall and virtual private network tunnel

In 2010, Rimal et al further expanded the security considerations by suggesting that organizations should look into: authentication and authorization protocols (access management), privacy and federated data, encryption and decryption schemes, etc.

References

  • AbuKhousa, E., Mohamed, N., & Al-Jaroodi, J. (2012). e-Health cloud: opportunities and challenges. Future Internet, 4(3), 621-645.
  • Chaudhuri, S. (2012, May). What next?: a half-dozen data management research goals for big data and the cloud. In Proceedings of the 31st symposium on Principles of Database Systems (pp. 1-4). ACM.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf version]. Retrieved from http://online.vitalsource.com/books/9781323135761/epubcfi/6/210
  • Rimal, B. P., Choi, E., & Lumb, I. (2009, August). A taxonomy and survey of cloud computing systems. In INC, IMS and IDC, 2009. NCM’09. Fifth International Joint Conference on (pp. 44-51). Ieee.
  • Rimal, B. P., Choi, E., & Lumb, I. (2010). A taxonomy, survey, and issues of cloud computing ecosystems. In Cloud Computing (pp. 21-46). Springer London.
  • Varia, J., & Mathew, S. (2013). Overview of amazon web services. Jan-2014.

Sample Literature Analysis

Article: Le, J. K, Tissington, P. A., & Budhwar, P. (2010). To move or not to move – a question of family? International Journal of Human Resource Management, 21(1), 17–45. CYBRARY – Business Source Premier

Issue or problem

The primary issue that Le et al (2010) wanted to study was the effect that family has on work and work has on family. They decided to do this research on relocation, as it is the most prevalent, direct, and invasive aspect of work impinging on the lives of employees and their family members. They break down the effect it has on immediate family members compared to extended family members, with a further breakdown of the immediate family into the spouse and the children. The authors focused primarily on 62 military personnel (from the UK’s Royal Air Force), because relocation in these situations is less of a choice, and they relocate so many times that this is the extreme case scenario.

Stated purpose

After the economic downturn of the late 2000s, advancements in technology, higher than normal unemployment, and globalization, work-related relocation has been on the rise for the past decade (Le et al, 2010). Work relocation directly impacts the family; thus, how family and work interact with each other through a single point of commonality (the employee) is why this aspect was studied.

A lot of studies have looked into the negative effects of relocation on the family or on the employee, and many of them measure this as a one-way relationship. Le et al (2010) try to study the bidirectional positive and negative impact of relocation on work and family. They want to use exploratory qualitative research to help find variables or themes (to be used in future quantitative studies). The researchers also wanted to use this exploratory study to make suggestions on how to mitigate the negative side effects of relocation on the employee’s family and vice versa.

Theoretical (concept or construct) focus or topic

Relocation impacts could be defined loosely as:

Relocation effects on the employee ~ F (- marital status, – number of kids, – spousal employment, spousal support, marital status) * G (adjustment time, willingness)

It should be noted that the function above doesn’t contain weights, just the positive or negative effect of each variable. Weights could add more meaning to this equation. This equation was derived from Le et al’s (2010) survey of the literature. These are some of the main factors that were addressed or brought up during a qualitative exploratory study of 62 military personnel.

The concepts or constructs defined

Spillover and facilitation were defined as the key to this analysis. Spillover is defined as aspects of work that affect the family, whereas facilitation is defined as doing one thing for work that positively impacts another thing for work. These are needed in order for the study to take place. If spillover doesn’t happen, then how does family impact work and work impact the family through the employee? If relocation doesn’t positively impact the career of the employee, then why would the employee undergo it? So, there has to be a perceived or actualized benefit to relocation before the employee moves or decides to leave their employer altogether to avoid the relocation.

Research approach

The authors took an exploratory qualitative approach for their study.  Their main reason for this approach was to explore themes in relocation affecting the family, and the family affecting the relocation.  Their hopes were to identify themes for a future study that could measure the relative strengths of these themes through a quantitative approach. They also state that quantitative tools are insufficient at this part of the exploratory phase, whereas qualitative work has a particular advantage to it.  You need to know which themes to study on your sample before you can devise the appropriate measurement instrument and analysis tool.  Though this can be accomplished through an extensive analysis of the literature, the authors did state that the bidirectional relationship of family and work with respect to relocation is the gap in the current knowledge.

Interviews lasting between 30 minutes and 2 hours (averaging 1 hour) with 62 military personnel allowed the authors to collect these themes. Another aspect of qualitative research that was used is the three measures for validity. Face validity (summarizing responses and getting confirmation back from the interviewees), confirmation (asking clarifying questions), and peer examination (independent peers evaluating and commenting on the questions and findings) are used in this qualitative study, which is what makes this study appropriate for their purpose.

Conclusions of the study

Le et al (2010) stated that, in the role of a family member, employees face issues like guilt that arises from a lack of fulfilling family commitments and needs during relocation, and pride due to advantages manifesting in the family because of relocation.

For the spouse of the employee, they face issues like: work-related issues (reduced earning potential, unemployment, and hire-ability), psychosocial impact (anger, depression, etc.), and social impacts (loss of social network, community, and friends).

For the children of the employees, they face issues like: school-related impacts (they may fall behind or speed ahead, depending on where they were relocated to, and this is diminished if they were placed in boarding school), psychological impact (mirrors that of the spouse), and social impact (hard time making friends, but strengthens internal family bonds).

Finally for the extended family, though it can be hard to establish a connection, some found it amazing that they got to visit a new place to see the relocated family from time to time.

However, there can be a devastating impact on the family unit. Separation (divorce) can occur if there is no focus on family but only on one’s career, and if relocation fatigue (due to multiple relocations in a span of a few years) sets in.

With all of this work impacting the family, the family can impact the work. The researchers found that the family can try to influence when and how they move. This effect is amplified when the employee involves the family in the decision. Doing this will increase buy-in from all members and make the family happier in the end. The family can defer or accelerate the relocation depending on their own plans. But if the company pushes the relocation, the family could exert pressure on the employee as well, to a point where the employee will leave the organization (or think of leaving, or be aware of the option) because they would prefer to keep their family intact.

Recommendations for future research

This study involved military families, who relocate every 2 to 3 years, more often than most families around the world. In most of these cases, rejecting relocation is not a wise facilitating option. For these reasons, this is an extreme case of employee relocation, as Le et al (2010) noted. Thus, future studies can apply this approach to generic global and national level companies. Finally, now that they have identified themes, we can measure their strength/magnitude and the correlations of each theme with relocation effects on family and family effects on relocation.

 

Internal and External Validity

In quantitative research, a study is valid if one can draw meaning and inferences from the results based on the methodology employed. The three ways to look at validity are (1) content (do we measure what we wanted), (2) predictive (do we match similar results, can we predict something), and (3) construct (are these hypothetical or real concepts). This is not to be confused with reliability and consistency. Creswell (2013) warns that if we modify an instrument or combine it with others, its validity and reliability could change, and in order to use it we must reestablish its validity and reliability. There are several threats to validity that exist, either internal (history, maturation, regression, selection, mortality, diffusion of treatment, compensatory/resentful demoralization, compensatory rivalry, testing, and instrumentation) or external (interaction of selection and treatment, interaction of setting and treatment, and interaction of history and treatment).

Sample Validity Considerations: The validity issues and their mitigation plans

Internal Validity Issues:

Hurricane intensities and tracks can vary annually or even decadally. As time passes during this study of the 2016 and 2017 Atlantic Ocean Basin hurricane seasons, it may run into regression issues. These regression issues threaten the validity of the study because certain types of weather components may not be the only factors that can increase/decrease hurricane forecasting skill from the average. To mitigate regression issues, the study could eliminate the storms with an extreme departure from the average forecast skill, reducing their effect on the final results. Naturally, the extreme departures from the average forecast skill will, with time, slightly impact the mean, but their results are still too valuable to dismiss. Finding out which weather components impact these extreme departures from the average forecast skill is what drives this project. Thus, their removal doesn’t seem to fit in this study and defeats the purpose of knowledge discovery.
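Rather than deleting the extreme cases, they can be flagged and set aside for separate examination. The sketch below shows one simple way to do that with a standard-deviation cutoff. The forecast-skill scores are invented purely for illustration, and the cutoff of 1.5 standard deviations is an arbitrary assumption, not a value from this study.

```python
from statistics import mean, stdev

# Hypothetical forecast-skill scores, invented for illustration only.
skill = [0.52, 0.48, 0.50, 0.51, 0.49, 0.95, 0.05]


def split_extremes(values, z_cutoff=1.5):
    """Separate values lying more than z_cutoff standard deviations from the mean."""
    centre, spread = mean(values), stdev(values)
    typical = [v for v in values if abs(v - centre) <= z_cutoff * spread]
    extreme = [v for v in values if abs(v - centre) > z_cutoff * spread]
    return typical, extreme


typical, extreme = split_extremes(skill)
print("typical:", typical)
print("extreme:", extreme)
```

With these made-up scores the two values far from the rest end up in `extreme`. Keeping them in their own list preserves exactly the storms whose departures from average skill the project wants to explain.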

External Validity Issues: 

The Eastern Pacific, Central Pacific, and Atlantic Ocean Basin have the same underlying dynamics that can create, intensify, and influence the path of tropical cyclones. However, these three basins still behave differently; thus, there is an interaction of setting and treatment threat to the validity of this study’s results. Results garnered in this study will not allow me to generalize beyond the Atlantic Ocean Basin. The only way to mitigate this threat to validity is to suggest that future research be conducted on each basin separately.

Resources

Exploring Mixed Methods

Explanatory Sequential (QUAN -> qual)

According to Creswell (2013), this mixed-methods style uses qualitative methods to do a deep dive into the quantitative results that have been previously gathered (often to understand the data with respect to the culture). The key defining feature here is that quantitative data is collected before the qualitative data and that the quantitative data drives the results from the qualitative. Thus, the emphasis is given to the quantitative results in order to explore and make sense of qualitative results. It is used to probe quantitative results by explaining them via qualitative results. Essentially, it uses qualitative results to enhance your quantitative results.

Exploratory Sequential (QUAL -> quan)

According to Creswell (2013), this mixed-methods style uses quantitative methods to confirm the qualitative results that have been previously gathered (often to understand the culture behind the data). The key defining feature here is that qualitative data is collected before the quantitative data and that the qualitative data drives the results from the quantitative. Thus, the emphasis is given to the qualitative results in order to explore and make sense of quantitative results. It is used to probe qualitative results by explaining them via quantitative results. Essentially, it uses quantitative results to enhance your qualitative results.

Which method would you most likely use? If your methodological fit suggests that you use a mixed-methods research project, does your world view color your choice?

Resources