Compelling topics

Hadoop, XML and Spark

Hadoop is predominantly known for its Hadoop Distributed File System (HDFS), where data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). MapReduce has two steps: one that maps the input data into an intermediate format and splits it across a group of compute nodes, and a second that reduces the data on each node so that, when all the nodes are combined, the answer sought is produced (Eini, 2010).

XML documents represent a whole data file, which contains markups, elements, and nodes (Lublinsky, Smith, & Yakubovich, 2013; Myer, 2005):

  • XML markups are tags that help describe the start and end points of the data as well as the data properties/attributes, and are enclosed by < and >
  • XML elements are data values, encapsulated by an opening <tag> and a closing </tag>
  • XML nodes are part of the hierarchical structure of a document that contains a data element and its tags

Unfortunately, the syntax and tags are redundant, which can consume large amounts of storage and slow down processing speeds (Hiroshi, 2007).

Five questions must be asked before designing an XML data document (Font, 2010):

  1. Will this document be part of a solution?
  2. Will this document have design standards that must be followed?
  3. What part may change over time?
  4. To what extent is human readability or machine readability important?
  5. Will there be a massive amount of data? Does file size matter?

All XML data documents should be versioned, and key stakeholders should be involved in the XML data design process (Font, 2010). XML is a machine- and human-readable data format (Smith, 2012). With a goal of using XML for MapReduce, we need to assume that we need to map and reduce huge files (Eini, 2010; Smith, 2012). Unfortunately, XML doesn't include sync markers in the data format, and therefore MapReduce doesn't support XML natively (Smith, 2012). However, Smith (2012) and Rohit (2013) used the XmlInputFormat class from Mahout to read XML input data into HBase. Smith (2012) stated that Mahout's code needs to know the exact sequence of XML start and end tags to search for, and that elements with attributes are hard for Mahout's XML library to detect and parse.

Apache Spark started as a working group inside and outside of UC Berkeley, in search of an open-source batch-processing engine that, unlike MapReduce, handles multi-pass algorithms well (Zaharia et al., 2012). Spark is faster than Hadoop on iterative operations by 25x-40x for very small datasets and 3x-5x for relatively large datasets, but Spark is more memory intensive, and its speed advantage disappears as available memory approaches zero with very large datasets (Gu & Li, 2013). On its website, Apache Spark boasts that it can run programs 100x faster than Hadoop's MapReduce in memory (Spark, n.d.). Spark outperforms Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013). Also, Spark runs 10x faster than Hadoop on disk (Spark, n.d.). Gu and Li (2013) recommend that if speed to the solution is not an issue but memory is, then Spark shouldn't be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized.
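
A rough sketch of why caching matters for iterative jobs, written against Spark's Java API (the HDFS path, the toy update rule, and the iteration count are illustrative): the dataset is read once, kept in memory with cache(), and then re-scanned on every pass without the per-iteration disk I/O that a chain of MapReduce jobs would pay.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("iterative-sketch");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load once and cache in memory; later passes reuse the in-memory copy
    // instead of re-reading HDFS the way chained MapReduce jobs would.
    JavaRDD<Double> values = sc.textFile("hdfs:///data/measurements.txt")
                               .map(Double::parseDouble)
                               .cache();
    long n = values.count();   // first action materializes the cache

    double estimate = 0.0;
    for (int i = 0; i < 10; i++) {
      final double current = estimate;
      // Each pass re-scans the cached RDD in memory.
      double meanResidual = values.map(v -> v - current).reduce(Double::sum) / n;
      estimate = current + 0.5 * meanResidual;   // toy iterative update
    }
    System.out.println("final estimate: " + estimate);
    sc.close();
  }
}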

Data visualization

Big data can be defined as any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014). What counts as big data changes over time: what was considered big data in 2002 is not considered big data in 2016, due to the advances made in technology (Fox & Do, 2013). Then there is data-in-motion, which can be defined as part of data velocity and deals with the speed at which data comes in from multiple sources as well as the speed at which data travels between systems (Katal, Wazid, & Goudar, 2013). Essentially, data-in-motion can encompass data streaming, data transfer, or real-time data. However, there are challenges and issues that have to be addressed when conducting real-time analysis on data streams (Katal et al., 2013; Tsinoremas et al., n.d.).

For data-driven decisions, it is not enough to analyze the relevant data; relevant visualizations of that data must also be selected to enable those decisions (eInfochips, n.d.). There are many ways to visualize data so that key facts are highlighted succinctly and in an appropriate style: tables and rankings, bar charts, line graphs, pie charts, stacked bar charts, tree maps, choropleth maps, cartograms, pinpoint maps, or proportional symbol maps (CHCF, 2014). These plots, charts, maps, and graphs can be animated, static, or interactive visualizations, and they can be delivered as a standalone image, dashboard, scorecard, or infographic (CHCF, 2014; eInfochips, n.d.).

Artificial Intelligence (AI)

Artificial Intelligence (AI) is an embedded technology, built on top of the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). AI can provide tremendous value, since it builds thousands of models and correlations automatically in one week, which used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015). Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the “Why” behind them, thus making the results hard to interpret (Power, 2015).

“Machines can excel at frequent high-volume tasks. Humans can tackle novel situations,” said Anthony Goldbloom (2016). Thus, the fundamental question that decision makers need to ask is how much of the decision reduces to frequent, high-volume tasks and how much reduces to novel situations (Goldbloom, 2016). Therefore, if the ratio is skewed toward high-volume tasks, then AI could be a candidate to replace decision makers; if the ratio is evenly split, then AI could augment and assist decision makers; and if the ratio is skewed toward novel situations, then AI wouldn't help decision makers. These novel situations are equivalent to today's tough challenges (McAfee, 2013). Finally, Meetoo (2016) warned that it doesn't matter how intelligent or strategic a job may be: if there is enough data on that job to create accurate rules, it can be automated as well, because machine learning can run millions of simulations against itself to generate huge volumes of data to learn from.

 


Data Tools: XML & Hadoop

Hadoop is predominantly known for its Hadoop Distributed File System (HDFS), where data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). MapReduce has two steps: one that maps the input data into an intermediate format and splits it across a group of compute nodes, and a second that reduces the data on each node so that, when all the nodes are combined, the answer sought is produced (Eini, 2010). In other words, data is partitioned, sorted, and grouped to provide a key and a value as an output (Rathbone, 2013). As more data gets added in real time (data in motion), MapReduce can do the recalculations more cheaply than before, and the data scientist doesn't have to touch the data (Eini, 2010; Roy, 2014). Roy (2014) suggested an example of using Intensive Care Unit (ICU) sensor data, which comes into a database multiple times per second, to help avoid the false positive alarms that can overwork hospital staff. However, Hadoop is best used for non-real-time tasks with a huge demand for processing power (Rathbone, 2013). The issue for Hadoop is identifying the exact moment an actionable item is needed and acting on that item (Roy, 2014).
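
To make the two steps concrete, below is a minimal word-count job written against the standard Hadoop MapReduce Java API. It is a generic illustration of the map and reduce steps described above, not Roy's (2014) ICU example, and the input and output paths are assumed to be supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: split each input line into (word, 1) pairs across the cluster.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts for each word on the node that owns that key.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}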

Does XML have any impact on MapReduce application design?

XML is a machine- and human-readable data format (Smith, 2012). With a goal of using XML for MapReduce, we need to assume that we need to map and reduce huge files (Eini, 2010; Smith, 2012). Unfortunately, XML doesn't include sync markers in the data format, and therefore MapReduce doesn't support XML natively (Smith, 2012). There are posts out there by coders who use workarounds to allow for XML processing in Hadoop (Atom, 2010; Krishna, 2014; Rohit, 2013; Smith, 2012). Smith (2012) and Rohit (2013) used the XmlInputFormat class from Mahout to read XML input data into HBase. So, the path the data scientist chooses determines how much work is needed to be able to use MapReduce: code a new way of reading, mapping, and reducing XML data from scratch, or use libraries from other code that is compatible with Hadoop. Smith (2012) stated that Mahout's code needs to know the exact sequence of XML start and end tags to search for, and that elements with attributes are hard for Mahout's XML library to detect and parse. Depending on the complexity of the XML document, Smith's (2012) statement may mean that more complex XML input code is needed. Therefore, a well-designed XML document can make this process a bit easier, but the complexity of the data stored in it will make the task of writing MapReduce code for XML data harder. Finally, Smith (2012) recommended a preprocessing step that converts the XML data so that each record can be treated as a single line by input formats native to MapReduce.
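
As a hedged sketch of the workaround Smith (2012) and Rohit (2013) describe, the driver below tells Mahout's XmlInputFormat which start and end tags delimit one record, so each map() call receives a single XML fragment as its value. The <record></record> tag pair and the class names are illustrative, and the package holding XmlInputFormat has moved between Mahout releases, so the import should be checked against the Mahout jar actually on the cluster.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Assumed location from older Mahout releases; adjust to match your Mahout jar.
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class XmlIngestDriver {

  // Each value is one XML fragment between the configured start and end tags;
  // a real job would parse it (DOM, StAX, etc.) and emit the fields of interest.
  public static class XmlFragmentMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String fragment = value.toString();
      context.write(new Text("record"), new Text(fragment));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The exact start/end tag sequence Smith (2012) says Mahout must know;
    // <record> is an illustrative element name, not one from his post.
    conf.set("xmlinput.start", "<record>");
    conf.set("xmlinput.end", "</record>");

    Job job = Job.getInstance(conf, "xml ingest sketch");
    job.setJarByClass(XmlIngestDriver.class);
    job.setInputFormatClass(XmlInputFormat.class);
    job.setMapperClass(XmlFragmentMapper.class);
    job.setNumReduceTasks(0);                 // pass fragments straight through
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}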


Data Tools: XML Design

Good XML Design Documentation for improved performance

Five questions must be asked before designing an XML data document (Font, 2010):

  1. Will this document be part of a solution?
  2. Will this document have design standards that must be followed?
  3. What part may change over time?
  4. To what extent is human readability or machine readability important?
  5. Will there be a massive amount of data? Does file size matter?

All XML data documents should be versioned, and key stakeholders should be involved in the XML data design process (Font, 2010).

A few rules (not a comprehensive list) on making a good XML design:

  1. Be consistent with your design and design for extensions by multiple people (Google, 2008; Font, 2010).
  2. Reuse existing XML formats (Google, 2008).
  3. Tag each unit of information, and maintain a minimal amount of text that can be processed as a whole (Harold, 2003).
    a. An element has a start tag and an end tag only, <menu></menu>, whereas an attribute describes an element inside of a tag, <menu-item portion-size="500" portion-units="g"></menu-item>; therefore, an attribute provides some properties to the element (Harold, 2003; Ogbuji, 2004).
    b. The principle of core content: know when to use an element versus an attribute. Use elements when the information is an essential part of the material, and use an attribute if the information is peripheral or incidental to the main message; essentially, "Data goes in elements, metadata in attributes" (Ogbuji, 2004; Oracle, n.d.). Elements must be in a namespace, and attributes shouldn't be in a namespace (Google, 2008).
  4. Avoid implicit structures, which occur through the addition of white space (Harold, 2003).
    a. This is seen most easily with names, where white space separates the first name, middle name, and last name. Ogbuji (2004b) suggested using well-established elements such as <firstname/>; <othername/>; <surname/>; <forename/>; <rolename/>; <namelink/>; <genname/>; and <addname/> to address the eccentricities of a person's name across cultures.
    b. Post office addresses pose the same issue, so Ogbuji (2004b) suggested these established elements: <street/>; <postcode/>; <pob/>; <city/>; <state/>; <country/>; <otheraddr/>; <phone/>; <fax/>; and <email/>.
  5. Use a standard and accepted element reference guide, such as the DocBook Element Reference (Walsh & Muellner, 2006), or something similar, and stick with that convention.
    a. Use published standard abbreviations for constructing names (Google, 2008; Walsh & Muellner, 2006).
  6. Avoid using hyphens "-" in your naming convention (Font, 2010).
  7. Avoid the use of boolean values (Google, 2008).
  8. Keep the document structure readable (the principle of readability); do not make it too troublesome to process or read (Harold, 2003; Ogbuji, 2004). For example, use elements for readability and understandability by humans, and attributes for machine digestion (Ogbuji, 2004).
  9. Comments should not be used to contain data, but rather to-dos (Google, 2008).

Example of an XML Document (W3 Schools, n.d.)

<?xml version="1.0" encoding="UTF-8"?>
 
 <shiporder orderid="889923"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="shiporder.xsd">
   <orderperson>John Smith</orderperson>
   <shipto>
     <name>Ola Nordmann</name>
     <address>Langgt 23</address>
     <city>4000 Stavanger</city>
     <country>Norway</country>
   </shipto>
   <item>
     <title>Empire Burlesque</title>
     <note>Special Edition</note>
     <quantity>1</quantity>
     <price>10.90</price>
   </item>
   <item>
     <title>Hide your heart</title>
     <quantity>1</quantity>
     <price>9.90</price>
   </item>
 </shiporder>

Analysis of XML design document from the user’s perspective for improved performance

It would be best for the <shipto/> information to contain only the address, rather than the two major datasets of name and address together, which represents designing for extensibility as in Rule 1. The tags <note/> and <price/> should be attributes of <title> per Rules 3a and 3b. Rule 4a was not followed for <name>Ola Nordmann</name>. Quantity is not an attribute of <item> and thus should be a child element of <item>, per Rule 4b. Tags like <name/>, <item/>, <quantity/>, and <price/> do not follow a naming convention as stated in Rule 5, but they could come from a naming convention internal to this company, so this one is hard to evaluate without more information. Rules 6-9 were followed in this example.
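
To make the critique concrete, below is a hedged rework of the W3Schools order that applies the rules as read above. The schema reference is dropped for brevity; the <recipient/> wrapper is an illustrative choice for pulling the name out of <shipto/>; and elements such as <firstname/>, <surname/>, <street/>, and <postcode/> follow Ogbuji's (2004b) suggestions rather than any published shiporder standard.

 <?xml version="1.0" encoding="UTF-8"?>
 <shiporder orderid="889923">
   <orderperson>
     <firstname>John</firstname>
     <surname>Smith</surname>
   </orderperson>
   <recipient>
     <firstname>Ola</firstname>
     <surname>Nordmann</surname>
   </recipient>
   <shipto>
     <street>Langgt 23</street>
     <postcode>4000</postcode>
     <city>Stavanger</city>
     <country>Norway</country>
   </shipto>
   <item>
     <title note="Special Edition" price="10.90">Empire Burlesque</title>
     <quantity>1</quantity>
   </item>
   <item>
     <title price="9.90">Hide your heart</title>
     <quantity>1</quantity>
   </item>
 </shiporder>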


Data Tools: XML

What is XML and how is it used to represent data

XML, also known as the eXtensible Markup Language, is a standardized way of allowing objects or data items to be referred to and identified by type(s) in a flexible, hierarchical approach (Brookshear & Brylow, 2014; Lublinsky, Smith, & Yakubovich, 2013; McNurlin, Sprague, & Bui, 2008; Sadalage & Fowler, 2012). XML refers to and identifies objects by type when it assigns tags to certain parts of the data, defining the data (McNurlin et al., 2008). JSON provides similar functionality to XML, but XML's schema and query capabilities are better than JSON's (Sadalage & Fowler, 2012). XML focuses more on semantics than appearance, which allows for searches that understand the contents of the data being considered; therefore, it is considered a standard for producing markup languages like HTML (Brookshear & Brylow, 2014). Finally, XML uses object principles, which tell you what data is needed to perform a function and what output it can give you, just not how it will do it (McNurlin et al., 2008).

XML documents contain descriptions of a service or function, how to request the service or function, the data it needs to perform the work, and the results the service or function will deliver (McNurlin et al., 2008). Also, relational databases have taken on XML as a structuring mechanism by adopting the XML document as a column type, which allows for the creation of XML querying languages (Sadalage & Fowler, 2012). Therefore, XML essentially helps define the data and gives it meaning that computers can manipulate and work on, transforming data from a human-readable format into a computer-readable one; this can help with currency conversion, credit card processing applications, etc. (McNurlin et al., 2008).

XML can handle a high volume of data and can represent all varieties of data (structured, unstructured, and semi-structured) in an active or live (streaming) fashion, such as CAD data used to design buildings, multimedia and music, product bar code scans, photographs of damaged property held by insurance companies, etc. (Brookshear & Brylow, 2014; Lublinsky et al., 2013; McNurlin et al., 2008; Sadalage & Fowler, 2012). By definition, big data is any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013). Therefore, XML can represent big data quite nicely.

Use of XML to represent data in various forms

XML documents represent a whole data file, which contains markups, elements, and nodes (Lublinsky et al., 2013; Myer, 2005):

  • XML markups are tags that help describe the start and end points of the data as well as the data properties/attributes, and are enclosed by < and >
  • XML elements are data values, encapsulated by an opening <tag> and a closing </tag>
  • XML nodes are part of the hierarchical structure of a document that contains a data element and its tags

Data can consist of text and numbers, like a telephone number (123)-456-7890, which can be represented in XML as <phone country="U.S.">1234567890</phone>. The country attribute adds hierarchy to the object, defining it further as a U.S. phone number (Myer, 2005). Given its hierarchical nature, the root data element helps sort all of the data below it, similar to a hierarchical data tree (Lublinsky et al., 2013; Myer, 2005).

Elements are described by tags, which add context to the data and turn it into information; adding hierarchical structure to these informational elements describes their natural relationships, which aids in transforming information into knowledge (Myer, 2005). This can help in analyzing big data sets.

Myer (2005) provided the following example of XML syntax, which also showcases the XML representation of data in a hierarchical structure:

 <Actor type="superstar">
   <name>Harrison Ford</name>
   <gender>male</gender>
   <age>50</age>
 </Actor>

Given this simple structure, it can easily be created by humans or by code, and can thus be produced in a document. This is great; however, to derive the value that Myer (2005) was talking about from the XML-formatted data, it will need to be ingested into Hadoop for analysis (Agrawal, 2014). Finally, XML can deal with numerical data such as integers, reals, floats, longs, doubles, NaN, INF, -INF, probabilities, percentages, string data, plain arrays of values (numerical and string arrays), sparse arrays, matrices, sparse matrices, etc. (Data Mining Group, n.d.), thus addressing the aforementioned variety of data types at high volumes.
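
As a small sketch of the "created by humans or by code" point, the JDK's built-in JAXP DOM classes can read Myer's actor fragment back into a tree before any Hadoop ingestion step; "actor.xml" is an illustrative file name holding the snippet above.

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ActorReader {
  public static void main(String[] args) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse("actor.xml");   // illustrative file holding the fragment

    // The root element carries the "type" attribute (metadata), while the child
    // elements carry the data values, matching the element/attribute split above.
    Element actor = doc.getDocumentElement();
    String type = actor.getAttribute("type");
    String name = actor.getElementsByTagName("name").item(0).getTextContent();
    String age  = actor.getElementsByTagName("age").item(0).getTextContent();

    System.out.println(name + " (" + type + "), age " + age);
  }
}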

Resources

  • Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/
  • Brookshear, G., & Brylow D. (2014). Computer Science: An Overview (12th ed.). Pearson Learning Solutions. VitalBook file.
  • Data Mining Group (n.d.). PMML 4.1 – General structure. Retrieved from http://dmg.org/pmml/v4-1/GeneralStructure.html
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox. VitalBook file.
  • McNurlin, B., Sprague, R., & Bui, T. (2008). Information Systems Management (8th ed.). Pearson Learning Solutions. VitalBook file.
  • Myer, T. (2005). A really, really, really good introduction to xml. Retrieved from https://www.sitepoint.com/really-good-introduction-xml/
  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Pearson Learning Solutions. VitalBook file.