Data usually gets update on a regular basis. Connolly and Begg (2014) defined that data can be updated incrementally, only small sections of the data, or can be updated completely. An example of data that can be updated incrementally are webpages, computer codes, stale data, data-at-rest, bodies of knowledge, etc. Whereas, some examples of data that can be updated completely are: weather data, space weather data, social media data, data-in-motion, dynamic data, etc. Both sets of data provide their own unique challenges when it comes to data processing. On average, analyzing web data, new to old data can range from 10-1000x (Sakr, 2014). Thus, the focus of this discussion is on incremental data update and how to process data in between two data processing runs.
Incoop is an extension of Hadoop to allow for processing incremental changes on big data, by splitting the main computation to its sub-computation, logging in data updates in a memoization server, while checking the inputs of the input data to each sub-computation (Bhatotia et al., 2011; Sakr, 2014). These sub-computations are usually mappers and reducers (Sakr, 2014). Incremental mappers check against the memoization servers, and if the data has already been processed and unchanged it will not reprocess the data, and a similar process for incremental reducers that check for changed mapper outputs (Bhatotia et al., 2011).
Subsequently, MapReduce is an analytical engine and pattern that takes advantage of distributed systems while keeping the processes and data in one machine (Sadalage & Fowler, 2012). There are a few key principles to using the MapReduce framework and Hadoop efficiently to improve incremental computation:
- Data partitioning: The MapReduce framework aids in partitioning the data into similar size sets into Hadoop Distributed File System, aka HDFS (Lublinsky, Smith, & Yakubovich, 2013). Thus, MapReduce can support smaller sets of data stored in HDFS. This is part of the scalability of the cluster.
- Fault tolerance and durability: Given that data can be partitioned to tiny chunks across thousands of computations nodes and run in parallel sometimes these nodes can fail (Sakr, 2014). The MapReduce framework replicates the data in the background and can launch backup jobs if a node fails (Lublinsky et al., 2013; Sakr, 2014). Thus, failure doesn’t disrupt the data processing. However it does increase the number of processors needed (Connolly & Begg, 2014).
- Parallelization: The partitioned input data are considered as independent sets of data, such that the mapper functions can process the data in a parallel environment (Lublinsky et al., 2013; Sakr, 2014). This principle allows for the sub-functions within the mapper and reducer function to handle smaller data. It allows a data analyst to focus on the main problem rather than low-level parallel coding abstraction, like multithreading, file allocation, memory management, etc. (Sakr, 2014). Serialization does not allow for small incremental updates for large data (Connolly & Begg, 2014).
- Data reuse: There is no need to read or write of intermediate data, thus preserving the input data to enable the data to be reused because it is unchanged (Lublinsky et al., 2013; Sakr, 2014).
- Self-Adjusting Computation: Used for incremental computation, which only allows mappers and reducers only work on the smaller size sets of data that are impacted by the change (Sakr, 2014).
Both Bhatotia et al. (2011) and Sakr (2014), suggested an Inc-HDFS which is also an extension of the HDFS, for partitioning data based on content and removal of data duplication. There is the limitation of this approach where the number of files may be grouped in too many content bins or too little content bins and thus may not be evenly be spaced out (Bhatotia et al., 2011). Thus, invoking: too many mapper functions can create an infrastructure overhead, which increases resources and thus cost, or too few mapper functions can create huge workloads for certain types of computational nodes, or too many reducers can provide too many outputs, and too little reducers can provide too little outputs (Lublinsky et al., 2013; Sakr, 2014). A constraint must be added on both ends of the spectrum to allow for evenly distributed data sets (Bhatotia et al., 2011).
Subsequently, the Hadoop out-of-the-box product scheduler doesn’t account for memoization server data, therefore is not built for incremental analysis. Thus, Incoop has a memoization-aware scheduler, that schedules the sub-computations based on the affinity of the task and allows for efficient use of previously (Bhatotia et al., 2011; Sakr, 2014). The scheduler can run tasks on computational nodes that are either faster or locally to where the data is stored (Bhatotia et al., 2011). Using this type of scheduler, the scheduler should place priority on whether it is faster to conduct a data movement and run the task at a faster computational node or reduce data movement, and processes the data locally, while still making effective use of unchanged and processed data.
In the end, a practical application of this technique when it comes to analyzing web data would be to first partition the web data by its content. Scientific content can go in one context partition, corporate financial content into another context partition, etc. under the Inc-HDFS framework. These partitions are capped in size to allow for proper load balance. Incoop will then run the MapReduce function to process the data distributively through using parallel processes. If the data gets updated, like a new corporate update to the SEC 10K data, it will be recorded by the memoization server. This will allow for Incoop to be able to process that incremental change from the corporate financial content partition because it was using the memoization-aware scheduler, and reprocess the data through mapping and reducing function on just this small and partitioned dataset. Therefore, making effective use of unchanged and processed data.
- Bhatotia, P. Wieder, A., Rogrigues, R. Acar, U. A., & Pasquini, R. (2011). Incoop. MapReduce for incremental computation. Retrieved from https://www.cl.cam.ac.uk/~ey204/teaching/ACS/R202_2012_2013/papers/S4_Programming/papers/bhatotia_SOCC_2011.pdf
- Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th ed.). Vitalbook file.
- Lublinsky, B., Smith, K. T., Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
- Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, (1st ed.). Vitalbook file.
- Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.ok