Data-in-motion is the real-time streaming of data from a broad spectrum of technologies, including the transmission of data between systems. Data that is stored on a database system or cloud system is considered data-at-rest, and data that is being processed and analyzed is considered data-in-use (Katal, Wazid, & Goudar, 2013; Kishore & Sharma, 2016; Ovum, 2016; Ramachandran & Chang, 2016). Social media data and social network analysis data can be considered data-in-motion, and processing that type of data can be quite problematic. Data-in-motion has to be processed iteratively until a termination condition is reached, and that condition can be reached between iterations (Sakr, 2014). Data-at-rest is generally considered easier to analyze; however, this type of data can also be problematic. Even if data-at-rest does not change or evolve, a sufficiently large dataset still requires iterative processes to analyze.
An example of an iterative process, as suggested by Lublinsky, Smith, and Yakubovich (2013), is using an approximation algorithm to solve a system of linear equations with hundreds of equations and variables. The system can be written in matrix form as Ax = b, where A is a matrix of coefficients, b is a vector of output values, and x is a vector of variables. If the data is too large, a direct linear algebraic solution becomes impractical, so the solution is instead approximated by iteratively minimizing the following quadratic form:
$$f(x) = \frac{1}{2} x^{T} A x - x^{T} b$$
where the superscript “T” denotes the transpose of a vector or matrix. Each iteration of minimizing this function yields a better approximation of the solution vector. However, Sakr (2014) stated that MapReduce does not directly support iterative data processing and analysis. Thus, a workaround is needed to handle iterative programs for situations such as data-in-motion or streaming data.
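To make the iteration concrete, the following is a minimal single-machine sketch of approximating the solution of Ax = b by repeatedly lowering f(x). It uses steepest descent, which is an assumption chosen for brevity and not necessarily the algorithm used by Lublinsky et al.; it also assumes that A is symmetric and positive definite. The function names and the small 2 x 2 system are hypothetical.

```python
def mat_vec(A, x):
    """Multiply matrix A (a list of rows) by vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


def quadratic_form(A, b, x):
    """Evaluate f(x) = 1/2 x^T A x - x^T b."""
    return 0.5 * dot(x, mat_vec(A, x)) - dot(x, b)


def solve_iteratively(A, b, tol=1e-10, max_iter=1000):
    """Steepest descent on f(x); assumes A is symmetric positive definite."""
    x = [0.0] * len(b)
    history = [quadratic_form(A, b, x)]
    for _ in range(max_iter):
        # The residual b - Ax is the negative gradient of f at x.
        r = [b_i - ax_i for b_i, ax_i in zip(b, mat_vec(A, x))]
        rr = dot(r, r)
        if rr ** 0.5 < tol:  # termination condition
            break
        alpha = rr / dot(r, mat_vec(A, r))  # exact step length along r
        x = [x_i + alpha * r_i for x_i, r_i in zip(x, r)]
        history.append(quadratic_form(A, b, x))
    return x, history


# Hypothetical system: 4x + y = 1 and x + 3y = 2.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, history = solve_iteratively(A, b)
print(x)
```

Each pass needs the full matrix A again, which is exactly the access pattern that plain MapReduce handles poorly when every pass is a separate job.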
Root causes and technical steps to address them
To deal with datasets that require iterative processes to analyze the data, programmers need to create and arrange multiple MapReduce functions in a loop (Sakr, 2014). This workaround increases the processing time of the serialized program because the data has to be reloaded and reprocessed on every pass: intermediate data is not read or written between iterations, a design meant to preserve the input data (Lublinsky et al., 2013; Sakr, 2014). The root cause is that, in its simplest form, MapReduce can consist of many mappers and one reducer, which is a performance bottleneck. In this simplified model of the MapReduce analytical engine, one cannot reduce across keys, only one key at a time (Sadalage & Fowler, 2012). Thus, the algorithm is not built for iterations. HaLoop is an iterative solution built on top of the Hadoop infrastructure; it has a loop control module and a task scheduler, and it caches invariant data from a previous iteration for use in future iterations (Sakr, 2014). This solution can be applied to static, unchanging data.
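The loop workaround can be sketched in plain Python as follows. This is a single-process simulation under stated assumptions, not Hadoop or HaLoop code: the hypothetical `map_phase` and `reduce_phase` functions stand in for one MapReduce job, the driver loop reruns that job until the output stops changing, and the `graph` dictionary plays the role of the invariant data that stays in memory across iterations. The example task (hop counts from a source node) is chosen only for illustration.

```python
from collections import defaultdict


def map_phase(distances, graph):
    """Emit (node, candidate_distance) pairs for known nodes and their neighbours."""
    for node, dist in distances.items():
        yield node, dist
        for neighbour in graph.get(node, []):
            yield neighbour, dist + 1


def reduce_phase(pairs):
    """Group values by key and keep the smallest distance, one key at a time."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: min(values) for key, values in grouped.items()}


def run_until_fixpoint(graph, source, max_iterations=50):
    """Driver loop: rerun the map/reduce pair until the result stops changing."""
    distances = {source: 0}
    for iteration in range(1, max_iterations + 1):
        new_distances = reduce_phase(map_phase(distances, graph))
        if new_distances == distances:  # termination condition
            return distances, iteration
        distances = new_distances
    return distances, max_iterations


# Hypothetical invariant input: it never changes between iterations.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
distances, iterations = run_until_fixpoint(graph, "a")
print(distances, iterations)
```

In a chain of ordinary MapReduce jobs, the equivalent of `graph` would be reloaded from storage on every pass. Keeping it resident, as this sketch does, is loosely analogous to the caching that Sakr (2014) attributes to HaLoop.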
There are also disadvantages to using MapReduce on these types of data: too many mapper functions can create infrastructure overhead, and too many reducers can produce too many outputs (Lublinsky et al., 2013; Sakr, 2014). Thus, there has to be a basic implementation plan for effective data placement over the Hadoop cluster to ensure that data is properly load balanced across the Hadoop servers (Sakr, 2014). Sometimes there is a need to run two separate MapReduce functions, one that prepares the data by distributing it evenly across the servers and one that iteratively goes through the data (Lublinsky et al., 2013). CoHadoop allows data files and their related files to be stored on the same server, and it provides a means of load balancing and fault tolerance by creating a file-level locator property (Sakr, 2014). Sakr describes the locator property as a means of keeping track of where the data is stored via a unique identification number for each file in the system. This solution can also be applied to static, unchanging data.
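The idea of locator-based placement can be illustrated with a small sketch. It is not CoHadoop's API; the `place_files` function, the file names, and the sizes are hypothetical, and the sketch assumes that related files are tagged with the same locator value and that a new locator is assigned to the least-loaded server.

```python
def place_files(files, servers):
    """Place files so that files sharing a locator land on the same server.

    files: list of (name, locator, size) tuples (hypothetical format).
    """
    load = {server: 0 for server in servers}
    locator_table = {}  # locator -> server chosen for that locator
    placement = {}      # file name -> server
    for name, locator, size in files:
        if locator not in locator_table:
            # New locator: pick the currently least-loaded server.
            locator_table[locator] = min(servers, key=lambda s: load[s])
        server = locator_table[locator]
        placement[name] = server
        load[server] += size
    return placement, load


# Hypothetical files; each data file and its related file share a locator.
files = [
    ("orders_2016.log", 1, 40),
    ("orders_2016.ref", 1, 10),
    ("clicks_2016.log", 2, 30),
    ("clicks_2016.ref", 2, 5),
    ("users.dat", 3, 20),
]
servers = ["node1", "node2", "node3"]
placement, load = place_files(files, servers)
print(placement)
print(load)
```

Co-locating related files means a job that joins them can read both locally, while the least-loaded choice for new locators is one simple way to keep the servers reasonably balanced.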
For data-in-motion and streaming data, the data has to be processed iteratively until a termination condition is reached. MapReduce Online takes an approach in which data is pipelined between the mapper and reducer functions, so that processing and analysis can begin as soon as the mappers produce their outputs (Sakr, 2014). This makes it possible to run the MapReduce functions iteratively and to provide relatively live reduced data outputs. Sakr further explains that, in this approach, the reducer contacts every mapper when the scheduler starts it, and the data is temporarily stored on the pipeline, which is an in-memory buffer.
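The pipelining idea can be sketched in a single process with Python generators. This is an illustration under stated assumptions, not MapReduce Online code: the hypothetical `mapper` yields pairs one at a time, and the hypothetical `pipelined_reduce` collects them in a small in-memory buffer and emits a running snapshot each time the buffer fills, rather than waiting for the mapper to finish.

```python
from collections import Counter, deque


def mapper(lines):
    """Emit (word, 1) pairs lazily, one pair at a time."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def pipelined_reduce(pairs, snapshot_every=3):
    """Consume pairs as they arrive and yield running totals."""
    buffer = deque()    # in-memory buffer between mapper and reducer
    totals = Counter()
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= snapshot_every:
            while buffer:
                key, value = buffer.popleft()
                totals[key] += value
            yield dict(totals)  # early, approximate output
    if buffer:  # drain whatever is left once the stream ends
        while buffer:
            key, value = buffer.popleft()
            totals[key] += value
        yield dict(totals)


# Hypothetical input stream.
stream = ["big data in motion", "data at rest", "data in use"]
snapshots = list(pipelined_reduce(mapper(stream)))
for snapshot in snapshots:
    print(snapshot)
```

Each snapshot is a partial answer that improves as more mapper output arrives, which is the sense in which the reduced outputs are relatively live.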
Resources:
- Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
- Kishore, N. & Sharma, S. (2016). Secure data migration from enterprise to cloud storage – analytical survey. BIJIT-BVICAM’s International Journal of Information Technology. Retrieved from http://bvicam.ac.in/bijit/downloads/pdf/issue15/09.pdf
- Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
- Ovum (2016). 2017 Trends to watch: Big Data. Retrieved from http://info.ovum.com/uploads/files/2017_Trends_to_Watch_Big_Data.pdf
- Ramachandran, M. & Chang, V. (2016). Toward validating cloud service providers using business process modeling and simulation. Retrieved from http://eprints.soton.ac.uk/390478/1/cloud_security_bpmn1%20paper%20_accepted.pdf
- Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence (1st ed.). Vitalbook file.
- Sakr, S. (2014). Large Scale and Big Data (1st ed.). Vitalbook file.