Data Tools: XML & Hadoop

Hadoop is predominantly known for two things: its Hadoop Distributed File System (HDFS), where data is distributed across multiple systems, and its code for running MapReduce tasks (Rathbone, 2013). MapReduce has two queries: the first maps the input data into a final format and splits it across a group of computer nodes, while the second reduces the data in each node so that combining the results from all the nodes provides the answer sought (Eini, 2010). In other words, data is partitioned, sorted, and grouped to provide a key and a value as output (Rathbone, 2013). As more data gets added in real time (data in motion), MapReduce can do the recalculations more cheaply than before, and the data scientist doesn’t have to touch the data (Eini, 2010; Roy, 2014). Roy (2014) suggested the example of Intensive Care Unit (ICU) sensor data, which comes into a database multiple times per second and can be used to help avoid false-positive alarms that could lead to overworked hospital staff. However, Hadoop is best used for non-real-time tasks with a huge demand for processing power (Rathbone, 2013). The challenge for Hadoop is identifying the right moment at which an actionable item arises and then acting on that item (Roy, 2014).
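To make the map and reduce steps described above more concrete, here is a minimal in-memory sketch in Python. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the sample sensor readings are made up for illustration. It only mimics the flow: map each record to a key and value, group and sort by key, then reduce each group to one value.

```python
from collections import defaultdict


def map_phase(records):
    """Map: turn each raw 'sensor_id, reading' line into a (key, value) pair."""
    for line in records:
        sensor_id, reading = line.split(",")
        yield sensor_id.strip(), float(reading)


def shuffle_phase(pairs):
    """Shuffle: group values by key and sort the keys (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into one value (here, the mean)."""
    for key, values in grouped:
        yield key, sum(values) / len(values)


def run_job(records):
    """Chain the three phases and return the final key -> value output."""
    return dict(reduce_phase(shuffle_phase(map_phase(records))))


# Hypothetical sample input, one reading per line.
sample = ["hr_monitor, 72", "spo2, 98", "hr_monitor, 76", "spo2, 96"]
result = run_job(sample)
print(result)
```

In a real cluster the map and reduce functions would run on different nodes and the grouping would happen over the network; here everything runs in one process just to show the shape of the computation.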

Does XML have any impact on MapReduce application design?

XML is a machine- and human-readable data format (Smith, 2012). If the goal is to use XML with MapReduce, we have to assume that we will be mapping and reducing huge files (Eini, 2010; Smith, 2012). Unfortunately, XML doesn’t include sync markers in its data format, and therefore MapReduce doesn’t support XML out of the box (Smith, 2012). Several coders have posted workarounds that allow for XML processing in Hadoop (Atom, 2010; Krishna, 2014; Rohit, 2013; Smith, 2012). Smith (2012) and Rohit (2013) used the XmlInputFormat class from Mahout to read XML input data and load it into HBase. So the path the data scientist chooses determines how much work is needed to be able to use MapReduce: code the reading, mapping, and reducing of XML data from scratch, or reuse libraries from other code that are compatible with Hadoop. Smith (2012) stated that Mahout’s code needs to know the exact sequence of XML start and end tags to search for, and that elements with attributes are hard for Mahout’s XML library to detect and parse. Depending on the complexity of the XML document, Smith’s (2012) statement may mean that more complex XML input code is needed. Therefore, a well-designed XML document could make this process a bit easier, but the more complex the data stored in it, the harder it becomes to write code that runs MapReduce on XML data. Finally, Smith (2012) recommended a preprocessing step that converts the XML data so that each record is treated as a single line, which the libraries native to MapReduce can then handle.
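The two ideas in the paragraph above, matching an exact start and end tag and flattening each record to one line, can be sketched in plain Python. This is not Mahout’s XmlInputFormat or any Hadoop code; `extract_records`, `to_single_line`, and the sample document are hypothetical and exist only to illustrate why literal tag matching struggles with attributes.

```python
import re


def extract_records(text, start_tag, end_tag):
    """Yield each substring from a literal start_tag through the next literal end_tag."""
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            return
        end = text.find(end_tag, start + len(start_tag))
        if end == -1:
            return
        end += len(end_tag)
        yield text[start:end]
        pos = end


def to_single_line(record):
    """Collapse all whitespace runs, then drop the spaces left between adjacent tags."""
    return re.sub(r">\s+<", "><", " ".join(record.split()))


# Hypothetical sample: the second <entry> carries an attribute.
sample_xml = """<log>
  <entry>
    <value>A</value>
  </entry>
  <entry id="2">
    <value>B</value>
  </entry>
</log>"""

lines = [to_single_line(r) for r in extract_records(sample_xml, "<entry>", "</entry>")]
print(lines)
```

Only the first entry is found: the literal string `<entry>` never matches `<entry id="2">`, so the second record is silently skipped. Each record that is found comes out as one line, which is the shape that line-oriented input handling expects.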

References
