Web 3.0, the state of the web as of 2017, encompasses the Semantic Web, which is driven by data integration through the use of metadata (Patel, 2013). This version of the web supports a worldwide database of static HTML documents, dynamically rendered data, the next HTML standard (HTML5), and links between documents, with the aim of creating interconnected, interrelated, and openly accessible world data in which tagged micro-content can be easily discovered through search engines (Connolly & Begg, 2014; Patel, 2013). HTML5 can handle multimedia and graphical content and introduces new tags such as <section>, <article>, <nav>, and <header>, which are well suited to semantic content (Connolly & Begg, 2014). End-users are also beginning to build dynamic web applications for others to interact with (Patel, 2013). Key technologies include the Extensible Markup Language (XML), a tag-based metalanguage, and the Resource Description Framework (RDF), which is built on URI triples (Connolly & Begg, 2014; Patel, 2013; UK Web Design Company, n.d.).
The Resource Description Framework (RDF) is based on URI triples of the form <subject, predicate, object>, which describe data properties and classes uniquely (Connolly & Begg, 2014; Patel, 2013). In RDF's Turtle notation, namespace prefixes are usually declared at the top of the data set with a @prefix directive (Connolly & Begg, 2014). For instance:
@prefix s: <https://skyhernandez.com/> .
<https://skyhernandez.com> s:Author <https://skyhernandez.wordpress.com/about/> .
<https://skyhernandez.wordpress.com/about/> s:Name "Dr. Skylar Hernandez" .
<https://skyhernandez.wordpress.com/about/> s:e-mail "dr.sky.hernandez@gmail.com" .
The syntax and tags can be redundant, which can consume large amounts of storage and slow down processing (Hiroshi, 2007; Sakr, 2014). However, RDF has been used for knowledge management systems and big-data modeling because it helps define groups of data and the relationships among related data resources (Brickley & Guha, 2014). These relationships can also be drawn as a graph (Figure 1).
Figure 1: RDF graph adapted from Schatzle, Przyjaciel-Sablocki, Dorner, Hornung and Lausen (2012).
RDF, when combined with MapReduce, can be stored efficiently by partitioning on the predicate: all triples that share a predicate are stored in one location, and each predicate has its own storage location (Schatzle, Przyjaciel-Sablocki, Hornung & Lausen, 2011). It can therefore improve storage and partitioning in a distributed database system, as the sketch below illustrates.
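To make the predicate-based layout concrete, here is a minimal Python sketch (a toy simulation, not code from the cited papers): the map step keys each triple by its predicate, and a simulated shuffle then groups all triples sharing a predicate into their own partition.

    from collections import defaultdict

    # Toy triple store of (subject, predicate, object) tuples.
    triples = [
        ("https://skyhernandez.com", "Author", "https://skyhernandez.wordpress.com/about/"),
        ("https://skyhernandez.wordpress.com/about/", "Name", "Dr. Skylar Hernandez"),
        ("https://skyhernandez.wordpress.com/about/", "e-mail", "dr.sky.hernandez@gmail.com"),
    ]

    def map_phase(data):
        """Map step: emit (predicate, (subject, object)) pairs so the
        shuffle groups every triple sharing a predicate together."""
        for s, p, o in data:
            yield p, (s, o)

    def shuffle(pairs):
        """Simulated MapReduce shuffle: one bucket per key (predicate)."""
        partitions = defaultdict(list)
        for key, value in pairs:
            partitions[key].append(value)
        return partitions

    # Each predicate ends up with its own storage location.
    for predicate, rows in shuffle(map_phase(triples)).items():
        print(predicate, rows)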
SPARQL (SPARQL Protocol and RDF Query Language) also searches via the <subject, predicate, object> triple convention, and it can include variables that help narrow the search to particular data items (Connolly & Begg, 2014; Sakr, 2014). A sample query to find a name and e-mail address from the website above would be:
PREFIX s: <https://skyhernandez.com/>
SELECT ?name ?email
FROM <https://skyhernandez.wordpress.com/about.rdf>
WHERE { ?x s:Name ?name .
        ?x s:e-mail ?email . }
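Run against the example data above, this query binds ?x to <https://skyhernandez.wordpress.com/about/>, ?name to "Dr. Skylar Hernandez", and ?email to "dr.sky.hernandez@gmail.com", because both triple patterns are matched by solution mappings that share the same subject.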
The application of RDF query processing to the MapReduce framework is still a new concept, and multiple distinct solutions have been proposed. The following sections cover two of those solutions.
Solution 1: PigSPARQL (Schatzle et al., 2011)
The focus of the solution: PigSPARQL provides a way to run SPARQL queries on the MapReduce framework of an Apache Hadoop computing cluster. SPARQL queries are translated into Pig Latin programs, which are then executed by the MapReduce framework. Each SPARQL query is parsed into a syntax tree and then into an algebra tree; an optimizer reviews the algebra tree bottom-up before the tree is translated into a Pig Latin program and executed as MapReduce jobs (see Figure 2).
Figure 2: Modular Translation Process adapted from Schatzle et al. (2011).
Technical changes made to the MapReduce framework: RDF queries go through basic graph pattern matching, in which a multiset of solution mappings is created to serve as input to Pig Latin. The basic SPARQL query data thus passes through the following statements (a simplified sketch of the resulting pipeline follows the rationale below):
- LOAD = concatenates the set of triple patterns
- FILTER/FOREACH = removes data items from the solution mappings that do not meet the criteria
- JOIN/FOREACH = merges compatible solution mappings
- UNION = combines all multisets of solutions into one solution mapping
- GRAPH = graphs the SPARQL query data set
Rationale for the technical changes: The translation reduces the input/output between mappers and reducers in MapReduce, thereby optimizing the process. FILTER/FOREACH helps produce intermediate results early in the process, and identifying and eliminating redundant data during JOIN/FOREACH aids the optimization further. UNION combines all the tables of data into a single multi-joined data set.
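As an illustration only (this is not the PigSPARQL translator itself, and the function names match and join are invented for the sketch), the following Python code mimics what the generated pipeline does for the two-pattern query above: each triple pattern filters the loaded data into solution mappings, and compatible mappings are then merged by a join on the shared variable ?x.

    # Loaded RDF data, reusing the earlier example.
    triples = [
        ("https://skyhernandez.wordpress.com/about/", "Name", "Dr. Skylar Hernandez"),
        ("https://skyhernandez.wordpress.com/about/", "e-mail", "dr.sky.hernandez@gmail.com"),
    ]

    def match(data, predicate, var_s, var_o):
        """FILTER/FOREACH: keep triples with the wanted predicate and
        project them into solution mappings (variable -> value dicts)."""
        return [{var_s: s, var_o: o} for s, p, o in data if p == predicate]

    def join(left, right):
        """JOIN/FOREACH: merge solution mappings that agree on all shared variables."""
        merged = []
        for l in left:
            for r in right:
                shared = set(l) & set(r)
                if all(l[v] == r[v] for v in shared):
                    merged.append({**l, **r})
        return merged

    names = match(triples, "Name", "?x", "?name")      # first triple pattern
    emails = match(triples, "e-mail", "?x", "?email")  # second triple pattern
    print(join(names, emails))  # one mapping binding ?x, ?name, and ?email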
Pros and Cons:
+ The optimization produces approximately a 70% improvement in execution time
+ It is a translation framework (also known as a preprocessor), so no changes to the underlying code or framework are needed
+ Uses distributed database systems and parallel processing
+ The solution is approximately linearly scalable
- Performs unnecessary join computations and data shuffling
- Optimization results vary per RDF query
Solution 2: MAPSIN Join (Schatzle et al., 2012)
The focus of the solution: The Map-Side Index Nested Loop Join (MAPSIN join) combines HBase with MapReduce to retain reducer-side joins while also utilizing map-side joins. Map-side joins involve the following two steps (a sketch follows the list):
- Preprocessing step = before the map phase, the data sets are sorted and equally partitioned on the join key
- Precondition step = if the partitioned data sets can be joined without being shuffled over the network, the map job can proceed as a parallel merge join on the presorted data, and no data shuffle is needed
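Under the assumption of unique join keys per input (an assumption of this sketch, not of the paper), the following Python code shows why the precondition removes the shuffle: both partitions arrive presorted on the join key, so a single linear pass merges them on the mapper.

    def merge_join(left, right):
        """Map-side merge join over two partitions that are presorted
        and co-partitioned on the join key; no shuffle is required."""
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            lk, lv = left[i]
            rk, rv = right[j]
            if lk < rk:
                i += 1
            elif lk > rk:
                j += 1
            else:
                out.append((lk, lv, rv))  # keys match: emit the joined row
                i += 1                    # safe because keys are unique per input
                j += 1
        return out

    # Presorted (key, value) partitions, as the preprocessing step guarantees.
    left = [("s1", "name A"), ("s2", "name B")]
    right = [("s1", "email a"), ("s3", "email c")]
    print(merge_join(left, right))  # [('s1', 'name A', 'email a')]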
Technical changes made to the MapReduce framework: Data joins are conducted on RDF triples; for RDF data items to be joined on their triples, their solution mappings must be compatible. This type of join should be done on one machine, during one map phase (Figure 3).
Figure 3: MAPSIN join base case for a triple pattern query, adapted from Schatzle et al. (2012).
Therefore, during the initial phase of the MAPSIN join, (1+2) the code searches all RDF triples on the local machine and joins those that are compatible; (3) subsequently, the map function is applied to the compatible data. To join additional triple patterns, the steps above are repeated with a new triple pattern against the previously joined triple patterns, in a manner similar to concatenation.
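The following Python sketch illustrates that base case under stated assumptions: a plain dictionary stands in for the HBase table, a dictionary lookup stands in for its point-access "get", and the function name mapsin_map is invented for the sketch. For each local binding of the first triple pattern, the mapper looks up the second pattern in the index and merges compatible solution mappings, all within one map phase and with no shuffle.

    # A plain dict standing in for an HBase table indexed by subject;
    # each row maps predicates to objects and is fetched by point lookup.
    index = {
        "https://skyhernandez.wordpress.com/about/": {
            "Name": "Dr. Skylar Hernandez",
            "e-mail": "dr.sky.hernandez@gmail.com",
        },
    }

    def mapsin_map(bindings, predicate, var_o):
        """One map phase: for each local solution mapping, look up the
        second triple pattern in the index and join on the shared ?x.
        No shuffle is needed because the lookup happens on the mapper."""
        for b in bindings:
            row = index.get(b["?x"], {})  # the simulated HBase "get"
            if predicate in row:
                yield {**b, var_o: row[predicate]}

    # Bindings produced by the first triple pattern (?x Name ?name).
    first = [{"?x": "https://skyhernandez.wordpress.com/about/",
              "?name": "Dr. Skylar Hernandez"}]
    print(list(mapsin_map(first, "e-mail", "?email")))  # adds the ?email binding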
Rationale for the technical changes: Using only reducer-side joins means that data is always transferred over the network, even if it is never joined, which increases input/output data flow. To avoid this data-throughput cost, HBase is used before and during the map phase, because MapReduce has no storage capability of its own: it processes data but does not provide a persistent read and write layer. It is the combination of HBase and MapReduce that makes up the MAPSIN join strategy.
Pros and Cons:
+ The solution does not make changes to the underlying framework
+ The solution drives down the cost of data input and output and reduces data shuffling
+ The solution is approximately linearly scalable
+ The solution outperforms the PigSPARQL approach, which relies on reducer-side joins, by up to 10x in some cases
- Map-side-only joins are hard to cascade for large data sets due to the data locality issue, which drives the need for reducer-side joins as well
Conclusions
Of these two distinct solutions for applying RDF query processing to the MapReduce framework, the MAPSIN solution is the better one because it outperforms PigSPARQL, fully utilizes HBase along with both map-side and reducer-side joins, and efficiently reduces data-throughput costs. It must be remembered, however, that this is a nascent field, and the ideal solution will depend on project scope, constraints, resources, and data.
Resources:
- Brewton, J., Yuan, X., & Akowuah, F. (2012). XML in health information systems. Retrieved from http://www.worldcomp-proceedings.com/proc/p2012/BIC4653.pdf
- Brickley, D., & Guha, R. V. (eds). (2014). RDF Schema 1.1. W3C. Retrieved from http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
- Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th ed.). Pearson Learning Solutions. VitalBook file.
- Hiroshi (2007). Advantages & disadvantages of XML. Retrieved from http://www.techmynd.com/advantages-disadvantages-of-xml/
- Patel, K. (2013). Incremental journey for World Wide Web: Introduced with Web 1.0 to recent Web 5.0 – A survey paper. International Journal of Advanced Research in Computer Science and Software Engineering, 3(10), 410–417.
- Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.
- Schatzle, A., Przyjaciel-Sablocki, M., Dorner, C., Hornung, T., & Lausen, G. (2012). Cascading map-side joins over HBase for scalable join processing. Retrieved from http://ceur-ws.org/Vol-943/SSWS_HPCSW2012_paper5.pdf
- Schatzle, A., Przyjaciel-Sablocki, M., Hornung, T., & Lausen, G. (2011). PigSPARQL: Mapping SPARQL to Pig Latin. Retrieved from http://www.csd.uoc.gr/~hy561/papers/storageaccess/largescale/PigSPARQL-%20Mapping%20SPARQL%20to%20Pig%20Latin.pdf
- EMC Education Services. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons P&T. VitalBook file.
- UK Web Design Company (n.d.). XML Advantages & disadvantages. Retrieved from http://www.theukwebdesigncompany.com/articles/xml-advantages-disadvantages.php