Adv Topics: MapReduce and Resource Description Framework

Web 3.0, the state of the web as of 2017, centers on the semantic web, which is driven by data integration through the use of metadata (Patel, 2013). This version of the web supports a worldwide database of static HTML documents, dynamically rendered data, the next HTML standard (HTML5), and links between documents, with the hope of creating an interconnected, interrelated, and openly accessible body of world data in which tagged micro-content is easily discoverable through search engines (Connolly & Begg, 2014; Patel, 2013). HTML5 can handle multimedia and graphical content and introduces new tags such as <section />, <article />, <nav />, and <header />, which are well suited to semantic content (Connolly & Begg, 2014). Also, end users are beginning to build dynamic web applications for others to interact with (Patel, 2013). Key technologies include the Extensible Markup Language (XML), a tag-based metalanguage, and Resource Description Framework (RDF) URI triples (Connolly & Begg, 2014; Patel, 2013; UK Web Design Company, n.d.).

The Resource Description Framework (RDF) is based on URI triples <subject, predicate, object>, which help describe data properties and classes uniquely (Connolly & Begg, 2014; Patel, 2013). RDF namespaces are usually declared at the top of the data set with an @prefix directive (Connolly & Begg, 2014). For instance:

@prefix s: <https://skyhernandez.com/> .

<https://skyhernandez.com> s:Author <https://skyhernandez.wordpress.com/about/> .
<https://skyhernandez.wordpress.com/about/> s:Name "Dr. Skylar Hernandez" .
<https://skyhernandez.wordpress.com/about/> s:e-mail "dr.sky.hernandez@gmail.com" .

The syntax and tags can be redundant, which consumes large amounts of storage and slows down processing (Hiroshi, 2007; Sakr, 2014). However, RDF has been used for knowledge management systems and big data modeling because it helps define groups of data and the relationships between related data resources (Brickley & Guha, 2014). Thus, these relationships can be drawn graphically (Figure 1).


Figure 1: RDF graph adapted from Schatzle, Przyjaciel-Sablocki, Dorner, Hornung and Lausen (2012).

RDF, when combined with MapReduce, can be used to store data efficiently; for example, all triples that share a predicate can be stored in one location, and each predicate has its own storage location (Schatzle, Przyjaciel-Sablocki, Hornung, & Lausen, 2011). Thus, it can improve storage and partitioning in a distributed database system.
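A minimal sketch of this predicate-based (vertical) partitioning idea, assuming plain Python stand-ins for the mapper, shuffle, and reducer; the function names and the tab-separated input format are illustrative, not taken from the cited paper:

from collections import defaultdict

# Illustrative vertical partitioning of RDF triples by predicate: the mapper
# keys each triple by its predicate, the shuffle groups pairs by that key,
# and the reducer yields one partition ("storage location") per predicate.

def map_triple(line):
    subject, predicate, obj = line.strip().split("\t")
    yield predicate, (subject, obj)

def reduce_predicate(predicate, pairs):
    return predicate, list(pairs)

triples = [
    "https://skyhernandez.com\ts:Author\thttps://skyhernandez.wordpress.com/about/",
    "https://skyhernandez.wordpress.com/about/\ts:Name\tDr. Skylar Hernandez",
    "https://skyhernandez.wordpress.com/about/\ts:e-mail\tdr.sky.hernandez@gmail.com",
]

shuffle = defaultdict(list)                      # simulated shuffle-and-sort
for line in triples:
    for key, value in map_triple(line):
        shuffle[key].append(value)

partitions = dict(reduce_predicate(k, v) for k, v in shuffle.items())
print(sorted(partitions))                        # one partition per predicate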

SPARQL (SPARQL Protocol and RDF Query Language) also searches via the triple convention <subject, predicate, object>, and it can include variables that help narrow the search down to particular data items (Connolly & Begg, 2014; Sakr, 2014). A sample query that finds a name and e-mail address from the website above would be:

PREFIX s: <https://skyhernandez.com/>

SELECT ?name ?email

FROM <https://skyhernandez.wordpress.com/about.rdf>

WHERE { ?x s:Name ?name . ?x s:e-mail ?email . }

The application of RDF data query processing with the MapReduce framework is still a new concept, and multiple distinct solutions have been applied. The following sections will cover two of those solutions.

Solution 1: Pig-SPARQL (Schatzle et al., 2011)

The focus of the solution: Pig-SPARQL provides a way to run SPARQL queries on a MapReduce framework on an Apache Hadoop computing cluster. SPARQL queries are translated into Pig Latin programs, which are then executed by the MapReduce framework. Each SPARQL query is parsed into a syntax tree and then an algebra tree, which an optimizer rewrites bottom-up before the tree is translated into a Pig Latin program and executed as a series of MapReduce jobs (see Figure 2).


Figure 2: Modular Translation Process adapted from Schatzle et al. (2011).

Technical changes made to MapReduce Framework: RDF queries are evaluated against a basic graph pattern, which produces a multiset of solution mappings that serves as input to Pig Latin. Thus, the basic SPARQL query data passes through the following statements (a sketch of the join step appears after the list):

    1. LOAD = concatenation of the set of triple patterns
    2. FILTER/FOREACH = removes data items in the solution mappings that do not meet the filter criteria
    3. JOIN/FOREACH = merges compatible solution mappings
    4. UNION = combines all multiset solutions into one solution mapping
    5. GRAPH = the SPARQL query data set is graphed

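As an illustration of the JOIN/FOREACH step, two solution mappings are compatible when they agree on every variable they share, and only compatible mappings are merged. The following is a plain Python sketch of that semantics, not the Pig Latin that Pig-SPARQL actually generates:

# Two SPARQL solution mappings are compatible when they agree on every
# shared variable; the join merges only compatible mappings.

def compatible(m1, m2):
    return all(m2[var] == value for var, value in m1.items() if var in m2)

def join(left_mappings, right_mappings):
    for m1 in left_mappings:
        for m2 in right_mappings:
            if compatible(m1, m2):
                merged = dict(m1)
                merged.update(m2)
                yield merged

# Solutions of two triple patterns: ?x s:Name ?name  and  ?x s:e-mail ?email
left = [{"x": "about/", "name": "Dr. Skylar Hernandez"}]
right = [{"x": "about/", "email": "dr.sky.hernandez@gmail.com"}]
print(list(join(left, right)))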

Rationale for Technical Changes: The translation reduces the input/output between mappers and reducers in MapReduce, thus optimizing the process. FILTER/FOREACH aids in producing smaller intermediate results early in the process. Identifying and eliminating redundant data through the JOIN/FOREACH step also aids the optimization. UNION combines all the tables of data into a single multi-joined data set.

Pros and Cons:

+   The optimization produces an approximate 70% improvement in execution time

+   It is a translation framework (also known as a preprocessor), so there is no need to change the underlying code or framework

+   Uses distributed database systems and parallel processing

+   The solution is approximately linearly scalable

  • Performs unnecessary join computations and data shuffling
  • Optimization results vary per RDF query

Solution 2: MAPSIN Join (Schatzle et al., 2012)

The focus of the solution: the Map-Side Index Nested Loop Join (MAPSIN join) combines HBase with MapReduce to retain reducer-side joins while also utilizing map-side joins. Map-side joins include the following two steps:

  1. Preprocessing step = prior to mapping, the datasets are sorted and equally partitioned based on a join key.
  2. Precondition step = if the sorted, partitioned datasets meet this precondition, the map job can proceed as a parallel merge join over the presorted data, and no data shuffling across the network is needed.

Technical changes made to MapReduce Framework: data joins are conducted based on RDF triples; thus, RDF data items are joined only when their solution mappings are compatible. This type of join should be done on one machine, during one map phase (Figure 3).


Figure 3: MAPSIN join base case for a triple pattern query, adapted from Schatzle et al. (2012).

Therefore, during the initial phase of the MAPSIN join, (1 + 2) the code retrieves the RDF triples on the local machine and joins them if they are compatible. (3) Subsequently, the map function is applied to the compatible data. To add more triple pattern joins, the above steps are repeated, joining a new triple pattern with the previously joined triple patterns; in other words, the process resembles concatenation.
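A rough Python sketch of this map-side idea, with a local dictionary standing in for the HBase index lookup; the table layout, function names, and variable names are illustrative rather than taken from Schatzle et al. (2012):

# Map-side index nested loop join (rough sketch): for each partial solution
# mapping, the mapper probes a local index (a dict standing in for an HBase
# table) for matching triples instead of shuffling data to a reducer.

local_index = {  # predicate -> list of (subject, object) pairs held locally
    "s:e-mail": [("about/", "dr.sky.hernandez@gmail.com")],
}

def mapsin_map(mapping, predicate, subject_var, object_var):
    for subject, obj in local_index.get(predicate, []):
        if mapping.get(subject_var, subject) == subject:   # compatibility check
            merged = dict(mapping)
            merged[subject_var] = subject
            merged[object_var] = obj
            yield merged

partial = {"x": "about/", "name": "Dr. Skylar Hernandez"}
print(list(mapsin_map(partial, "s:e-mail", "x", "email")))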

Rationale for Technical Changes: using only reducer-side joins means that data is transferred across the network in full, even if it is never joined, which increases input/output data flow. To avoid this data throughput cost, HBase is used before and during the map phase, since MapReduce itself does not store data; it only processes it, without read and write capabilities. The combination of HBase and MapReduce is what makes up the MAPSIN join strategy.

Pros and Cons:

+   The solution does not make changes to the underlying framework

+   The solution drives down the cost of data input and output and reduces data shuffling.

+   Solution is approximately linearly scalable

+   In some cases the solution outperforms PigSPARQL, a reducer-side-join-only approach, by roughly 10x

  • Map-side-only joins are hard to cascade for large datasets because of data locality issues, which drives the need for reducer-side joins as well.

Conclusions

Of these two distinct solutions for applying RDF query processing to the MapReduce framework, the MAPSIN solution is the stronger one because it outperforms PigSPARQL, makes full use of HBase along with both map-side and reducer-side joins, and efficiently reduces data throughput costs. However, it must be remembered that this is a nascent field, and the best solution will depend on project scope, constraints, resources, and data.

Resources:

  • Brickley, D., & Guha, R. V. (eds). (2014). RDF Schema 1.1. W3C. Retrieved from http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th ed.). Pearson Learning Solutions. VitalBook file.

Adv Topics: MapReduce and Incremental Computation

Data usually gets updated on a regular basis. Connolly and Begg (2014) noted that data can be updated incrementally, where only small sections of the data change, or it can be updated completely. Examples of data that can be updated incrementally are webpages, computer code, stale data, data-at-rest, bodies of knowledge, etc., whereas examples of data that are updated completely include weather data, space weather data, social media data, data-in-motion, dynamic data, etc. Both kinds of data present their own challenges for data processing. On average, when analyzing web data, the ratio of new to old data can range from 10-1000x (Sakr, 2014). Thus, the focus of this discussion is on incremental data updates and how to process data between two data processing runs.

Incoop is an extension of Hadoop that allows incremental changes to big data to be processed by splitting the main computation into sub-computations, recording data updates in a memoization server, and checking the inputs to each sub-computation (Bhatotia et al., 2011; Sakr, 2014). These sub-computations are usually mappers and reducers (Sakr, 2014). Incremental mappers check against the memoization server, and if the data has already been processed and is unchanged, they do not reprocess it; incremental reducers follow a similar process, checking for changed mapper outputs (Bhatotia et al., 2011).
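A toy Python sketch of that memoization check, with a plain dictionary standing in for Incoop's memoization server; the fingerprinting scheme and function names are assumptions for illustration, not Incoop's actual API:

import hashlib

# Toy memoization of map tasks: a split is fingerprinted by its content, and
# it is reprocessed only if that fingerprint has not been seen before.

memo_server = {}   # content fingerprint -> cached mapper output

def incremental_map(split, mapper):
    key = hashlib.sha256(split.encode("utf-8")).hexdigest()
    if key in memo_server:                  # unchanged input: reuse the result
        return memo_server[key]
    result = list(mapper(split))            # new or changed input: recompute
    memo_server[key] = result
    return result

def word_count_mapper(text):
    for word in text.split():
        yield word.lower(), 1

print(incremental_map("Hadoop Hadoop Incoop", word_count_mapper))  # computed
print(incremental_map("Hadoop Hadoop Incoop", word_count_mapper))  # reused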

MapReduce itself is an analytical engine and pattern that takes advantage of distributed systems while keeping the processing and the data on the same machine (Sadalage & Fowler, 2012). There are a few key principles for using the MapReduce framework and Hadoop efficiently to improve incremental computation:

  • Data partitioning: The MapReduce framework helps partition the data into similarly sized sets stored in the Hadoop Distributed File System (HDFS) (Lublinsky, Smith, & Yakubovich, 2013). Thus, MapReduce can work on smaller sets of data stored in HDFS, which is part of the cluster's scalability.
  • Fault tolerance and durability: Given that data can be partitioned into small chunks across thousands of computation nodes and processed in parallel, some of these nodes will occasionally fail (Sakr, 2014). The MapReduce framework replicates the data in the background and can launch backup jobs if a node fails (Lublinsky et al., 2013; Sakr, 2014). Thus, failure does not disrupt the data processing; however, it does increase the number of processors needed (Connolly & Begg, 2014).
  • Parallelization: The partitioned input data are treated as independent sets, so the mapper functions can process them in parallel (Lublinsky et al., 2013; Sakr, 2014). This principle allows the sub-functions within the mapper and reducer functions to handle smaller amounts of data, and it lets a data analyst focus on the main problem rather than on low-level parallel coding abstractions such as multithreading, file allocation, and memory management (Sakr, 2014). Serial processing, by contrast, does not allow small incremental updates to large data sets (Connolly & Begg, 2014).
  • Data reuse: There is no need to read or write intermediate data, which preserves the input data and enables it to be reused when it is unchanged (Lublinsky et al., 2013; Sakr, 2014).
  • Self-adjusting computation: Used for incremental computation, so that mappers and reducers work only on the smaller sets of data that are affected by a change (Sakr, 2014).

Both Bhatotia et al. (2011) and Sakr (2014) described Inc-HDFS, an extension of HDFS that partitions data based on content and removes duplicate data. A limitation of this approach is that files may be grouped into too many or too few content bins and thus may not be evenly spread out (Bhatotia et al., 2011). This can invoke too many mapper functions, which creates infrastructure overhead and increases resources and therefore cost; too few mapper functions, which creates huge workloads for certain computation nodes; too many reducers, which produce too many outputs; or too few reducers, which produce too few outputs (Lublinsky et al., 2013; Sakr, 2014). A constraint must therefore be added on both ends of the spectrum to allow for evenly distributed data sets (Bhatotia et al., 2011).
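A simplified Python sketch of content-based partitioning with minimum and maximum size constraints; the rolling-fingerprint rule and thresholds are illustrative, not Inc-HDFS internals:

# Simplified content-defined chunking: a chunk boundary is placed where a
# rolling fingerprint of the bytes matches a pattern, but only between a
# minimum and a maximum chunk size, so partitions stay roughly even.

MIN_CHUNK, MAX_CHUNK, MASK = 64, 256, 0x1F   # illustrative thresholds

def content_chunks(data):
    chunks, start, fingerprint = [], 0, 0
    for i, byte in enumerate(data):
        fingerprint = ((fingerprint << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        at_boundary = (fingerprint & MASK) == 0 and size >= MIN_CHUNK
        if at_boundary or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, fingerprint = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

print([len(c) for c in content_chunks(bytes(range(256)) * 4)])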

Subsequently, the out-of-the-box Hadoop scheduler does not account for memoization server data and therefore is not built for incremental analysis. Thus, Incoop has a memoization-aware scheduler that schedules sub-computations based on task affinity and allows efficient reuse of previously computed results (Bhatotia et al., 2011; Sakr, 2014). The scheduler can run tasks on computation nodes that are either faster or local to where the data is stored (Bhatotia et al., 2011). With this type of scheduler, the priority is deciding whether it is faster to move the data and run the task on a faster computation node or to reduce data movement and process the data locally, while still making effective use of unchanged, already processed data.
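As a purely illustrative sketch of that trade-off (none of the function or field names below come from Incoop), a memoization-aware scheduler might compare the estimated cost of shipping data to a faster node against processing it locally where the data and memoized results already live:

# Hypothetical scheduling trade-off: run the task where the data (and any
# memoized result) already lives, unless shipping the data to a faster node
# is still cheaper overall. The cost model and field names are made up.

def pick_node(task, nodes):
    def estimated_time(node):
        move = 0.0 if node["has_data"] else task["data_mb"] / node["net_mb_s"]
        compute = 0.0 if node["has_memo"] else task["work"] / node["speed"]
        return move + compute
    return min(nodes, key=estimated_time)

task = {"data_mb": 512.0, "work": 100.0}
nodes = [
    {"name": "local", "speed": 1.0, "net_mb_s": 100.0, "has_data": True, "has_memo": True},
    {"name": "faster", "speed": 4.0, "net_mb_s": 100.0, "has_data": False, "has_memo": False},
]
print(pick_node(task, nodes)["name"])    # 'local': reuse beats raw speed here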

In the end, a practical application of this technique to analyzing web data would be to first partition the web data by its content under the Inc-HDFS framework: scientific content in one context partition, corporate financial content in another, and so on. These partitions are capped in size to allow proper load balancing. Incoop then runs the MapReduce functions to process the data in a distributed fashion using parallel processes. If the data gets updated, such as a new corporate filing added to the SEC 10-K data, the update is recorded by the memoization server. This allows Incoop, using the memoization-aware scheduler, to process that incremental change from the corporate financial content partition and to reprocess only this small, partitioned dataset through the map and reduce functions, thereby making effective use of unchanged, already processed data.

Resources:

  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, (1st ed.). Vitalbook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.

Adv Topics: MapReduce and Iterative Computation

Data-in-motion is the real-time streaming of data from a broad spectrum of technologies, and it also encompasses data transmission between systems; data stored in a database or cloud system is considered data-at-rest, and data that is being processed and analyzed is considered data-in-use (Katal, Wazid, & Goudar, 2013; Kishore & Sharma, 2016; Ovum, 2016; Ramachandran & Chang, 2016). Social media data and social network analysis data can be considered data-in-motion, and processing that type of data can be quite problematic. Data-in-motion has to be processed iteratively until a certain termination condition is reached, and that condition can be reached between iterations (Sakr, 2014). Data-at-rest is probably easier to analyze; however, this type of data can also be problematic: even if the data does not change or evolve, its large size requires iterative processes to analyze it.

An example of an iterative process, as suggested by Lublinsky, Smith, and Yakubovich (2013), is solving a linear system by approximation algorithms over hundreds of equations and variables. The data can be stored in matrix form as Ax = b, where A is a matrix of coefficients, b is a vector of output values, and x is a vector of variables. If the data is too large, a direct linear algebraic solution is impractical, so an iterative approximation is used instead, minimizing the quadratic form:

f(x) = ½ xᵀAx − xᵀb

where the superscript "T" denotes the transpose of a vector or matrix. Each iteration yields a better approximation of the solution vector x. However, Sakr (2014) stated that MapReduce does not directly support iterative data processing and analysis, so a workaround is needed to handle iterative programs for situations like data-in-motion or even streaming data.
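A hedged NumPy sketch of one such iterative scheme, using gradient descent on f(x) and assuming A is symmetric positive definite; the step size and tolerance are illustrative choices:

import numpy as np

# Iteratively minimize f(x) = 1/2 x^T A x - x^T b. The gradient is A x - b,
# so each step nudges x toward the solution of A x = b, and the loop stops
# when a termination condition (a small gradient) is reached.

def iterative_solve(A, b, step=0.1, tol=1e-8, max_iter=10000):
    x = np.zeros_like(b)
    for _ in range(max_iter):
        gradient = A @ x - b
        if np.linalg.norm(gradient) < tol:   # termination condition reached
            break
        x = x - step * gradient
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(iterative_solve(A, b))                 # approximately solves A x = b

Under plain MapReduce, each pass of such a loop would have to be expressed as its own job, which is the limitation discussed in the next section.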

 Root causes and technical steps to address them

To deal with datasets that require iterative processing, programmers must create and chain multiple MapReduce jobs in a loop (Sakr, 2014). This workaround increases processing time because data has to be reloaded and reprocessed on each pass; there is no reading or writing of intermediate data, a design choice meant to preserve the input data (Lublinsky et al., 2013; Sakr, 2014). The root cause is that, in its simplest form, MapReduce can consist of many mappers and one reducer, which is a performance bottleneck; this simplified model also means one cannot reduce across keys, only one key at a time (Sadalage & Fowler, 2012). Thus, the algorithm is not built for iteration. HaLoop is an iteration solution built on top of the Hadoop infrastructure with a loop control module and a task scheduler that caches invariant data from a previous iteration and reuses it in future iterations (Sakr, 2014). This solution can be applied to static, unchanged data.
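A schematic Python driver loop in the spirit of HaLoop's loop control: the invariant data is loaded and cached once and reused by every iteration, while only the loop-variant state is recomputed until a termination condition holds. This is an illustration of the idea, not HaLoop's interface, and the toy PageRank-style update is an assumption:

# The loop-invariant data (e.g., a static link structure) is cached once and
# reused by every iteration; only the loop-variant state is recomputed.

def load_invariant_data():
    return {"a": ["b"], "b": ["a", "c"], "c": ["a"]}

def iterate(state, links):
    # one MapReduce-like pass: a toy PageRank-style update over the cache
    new_state = {node: 0.15 for node in links}
    for node, rank in state.items():
        share = 0.85 * rank / len(links[node])
        for neighbor in links[node]:
            new_state[neighbor] += share
    return new_state

cache = load_invariant_data()                 # cached invariant data
state = {node: 1.0 for node in cache}
for _ in range(100):                          # loop control
    next_state = iterate(state, cache)
    if max(abs(next_state[n] - state[n]) for n in state) < 1e-6:
        break                                 # termination condition reached
    state = next_state
print(state)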

There are also disadvantages to using MapReduce on these types of data: too many mapper functions can create infrastructure overhead, and too many reducers can produce too many outputs (Lublinsky et al., 2013; Sakr, 2014). Thus, there has to be a basic plan for effective data placement across the Hadoop cluster to ensure a proper load balance across the Hadoop servers (Sakr, 2014). Sometimes two separate MapReduce jobs are needed: one to prepare the data by distributing it evenly across the servers and one that iterates through the data (Lublinsky et al., 2013). CoHadoop allows data files and related files to be stored on the same server and provides load balancing and fault tolerance by creating a file-level locator property (Sakr, 2014). Sakr describes the locator property as a means of keeping track of where the data is stored via a unique identification number for each file in the system. This solution can also be applied to static, unchanged data.

For data-in-motion and streaming data, the data has to be processed iteratively until a certain termination condition is reached. MapReduce Online takes an approach in which data is pipelined between the mapper and reducer functions, allowing processing and analysis to begin as soon as the mappers produce their outputs (Sakr, 2014). This makes it possible to run the MapReduce functions iteratively and provide relatively live reduced data outputs. Sakr further explains that in this approach the reducer contacts every mapper when the scheduler starts, and the data is temporarily stored in the pipeline, which is an in-memory buffer.
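A rough sketch of that pipelining idea, using a Python queue as the in-memory buffer between a mapper thread and a reducer thread; this is purely illustrative, since MapReduce Online actually pipelines between Hadoop tasks over the network:

import queue
import threading
from collections import defaultdict

# Mapper output is pushed into an in-memory buffer as soon as it is produced,
# and the reducer consumes it immediately, keeping running totals instead of
# waiting for the map phase to finish.

buffer = queue.Queue()                       # stands in for the pipeline buffer

def mapper(stream):
    for record in stream:
        for word in record.split():
            buffer.put((word.lower(), 1))    # emit as soon as it is produced
    buffer.put(None)                         # end-of-stream marker

def reducer():
    counts = defaultdict(int)
    while True:
        item = buffer.get()
        if item is None:
            break
        key, value = item
        counts[key] += value                 # running, "relatively live" totals
    print(dict(counts))

stream = ["streaming data in motion", "data in motion"]
threads = [threading.Thread(target=mapper, args=(stream,)),
           threading.Thread(target=reducer)]
for t in threads:
    t.start()
for t in threads:
    t.join()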

Resources:

  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
  • Kishore, N. & Sharma, S. (2016). Secure data migration from enterprise to cloud storage – analytical survey. BIJIT-BVICAM’s International Journal of Information Technology. Retrieved from http://bvicam.ac.in/bijit/downloads/pdf/issue15/09.pdf
  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
  • Ovum (2016). 2017 Trends to watch: Big Data. Retrieved from http://info.ovum.com/uploads/files/2017_Trends_to_Watch_Big_Data.pdf
  • Ramachandran, M. & Chang, V. (2016). Toward validating cloud service providers using business process modeling and simulation. Retrieved from http://eprints.soton.ac.uk/390478/1/cloud_security_bpmn1%20paper%20_accepted.pdf
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, (1st ed.). Vitalbook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.

Adv Topics: MapReduce and Hadoop

Hadoop allows for data processing through MapReduce, and it also allows for data storage (Lublinsky et al., 2013). MapReduce is an analytical engine and pattern that takes advantage of distributed systems while keeping the processing and the data on the same machine (Sadalage & Fowler, 2012). MapReduce thus consists of two functions that work in parallel on distributed systems (Hortonworks, 2013; Sadalage & Fowler, 2012; Sakr, 2014; Sathupadi, 2010):

    1. Mapper functions create and process transactions on the system by mapping and aggregating data by key values. Mappers read only one data record at a time.
    2. Reducer functions know what the key values are and take all the values stored in a map to reduce the data to what is relevant. Reducers help summarize the data into a single output, which helps manage the amount of data moving between multiple computation nodes.

Lublinsky, Smith, and Yakubovich (2013) stated that an intermediate component of MapReduce, known as shuffle and sort, moves the mapper outputs and presents them to the reducer function.
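The classic word-count example makes these two roles concrete. Below is a minimal pure-Python sketch in which a dictionary simulates the shuffle-and-sort step:

from collections import defaultdict

# Minimal word count: the mapper reads one record at a time and emits
# (key, value) pairs; shuffle-and-sort groups the pairs by key; the reducer
# summarizes each key's values into a single output.

def mapper(record):
    for word in record.split():
        yield word.lower(), 1

def reducer(key, values):
    return key, sum(values)

records = ["Hadoop runs MapReduce", "MapReduce maps and reduces"]

groups = defaultdict(list)                   # simulated shuffle and sort
for record in records:                       # mappers: one record at a time
    for key, value in mapper(record):
        groups[key].append(value)

print(dict(reducer(k, v) for k, v in sorted(groups.items())))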

Thus, MapReduce is a framework that uses parallel sequential algorithms capitalizing on cloud architecture, and it became popular as the main executable analytic engine of the open source Hadoop project (Lublinsky et al., 2013; Sadalage & Fowler, 2012; Sakr, 2014). Essentially, a sequential algorithm is a computer program that runs a sequence of commands, and a parallel algorithm runs sets of sequential commands over separate computational cores (Brookshear & Brylow, 2014; Sakr, 2014); thus, a parallel sequential algorithm runs a full sequential program over multiple separate cores (Sakr, 2014). Another feature of MapReduce is that one job's reduced output can become another job's map input (Sadalage & Fowler, 2012). Subsequently, the advantages and disadvantages of using MapReduce are (Lublinsky et al., 2013; Sakr, 2014):

+   the mapper function can exploit multiple different aggregation techniques

+ no read or write of intermediate data, thus preserving the input data

+ no need to serialize or de-serialize code in either memory or processing

+ it is scalable based on the size of data and resources needed for processing the data

+ isolation of the sequential program from data distribution, scheduling, and fault tolerance

– too many mapper functions can create an infrastructure overhead, which increases resources and thus cost

– too few mapper functions can create huge workloads for certain types of computational nodes

– too many reducers can produce too many outputs, and too few reducers produce too few outputs

 – it’s a different programming paradigm that most programmers are not familiar with

 – the available parallelism is underutilized for smaller data sets

Although Hadoop is predominantly known for popularizing MapReduce tasks, it is also known for its Hadoop Distributed File System (HDFS), in which the data is distributed across multiple systems (Rathbone, 2013). Hadoop's service can be offered in the cloud as a Platform as a Service (PaaS): the end users manage the applications and data, whereas the provider administers the runtime, middleware, operating system, virtualization, servers, storage, and networking (Lau, 2001). Data is broken up into small blocks, like Legos, so that the blocks are distributed across multiple servers in a distributed database system and can be processed across all those servers, e.g., a Hadoop cluster (IBM, n.d.).

A common example of a parallel sequential program is a dynamical weather forecasting model, in which a set of defined geodynamic, thermodynamic, and physical sequential algorithms evolves the seven main weather variables across time. For each time step, the forecasting model runs these sequential algorithms over each grid point, which represents a finite geospatial region, and these geospatial regions are split among multiple computational cores. The example grows in complexity when data has to travel between neighboring geospatial regions through their boundaries, which is an example of data parallelism (Sakr, 2014). MapReduce uses the concept of data parallelism to help map and reduce data, so weather models could be considered a loose form of MapReduce algorithm.

Resources: