Advanced Topics: The Architecture of the Internet

Introduction

Kelly (2007) stated that the internet handles 100 billion clicks per day and holds 55 trillion links, and it has become pervasive to humankind. In 2012, 2.27 billion people used the internet, and globally about 1.7 billion people are actively engaged with it (Li, 2010; Sakr, 2014). An actively engaged internet user is one who uses social media to watch videos, share status updates, and curate content, with the goal of actively engaging other users (Li, 2010). Cloud-based services rely on the internet to provide services like distributed processing, large-scale data storage, distributed transactions, and data manipulation (Brookshear & Brylow, 2014; Sakr, 2014). Thus, the internet plays a vital role in enabling big data analysis and social engagement, given its humble beginning in the 1960s as a way to share research projects between multiple research institutions (Brookshear & Brylow, 2014).

Internet Architecture

The internet has evolved into a socio-technical system. This evolution has come about in five distinct stages:

  • Web 1.0: Created by Tim Berners-Lee in the late 1980s, it was originally defined as a way of connecting static, read-only information hosted across multiple computational components, primarily for companies (Patel, 2013). The internet was a network of computational networks, where communication throughout the internet is governed by the TCP/IP protocol (Brookshear & Brylow, 2014). The web's architecture relies on three components: the Uniform Resource Identifier (URI), the HyperText Transfer Protocol (HTTP), and the HyperText Markup Language (HTML) (Jacobs & Walsh, 2004).
    • A URI is a unique address, agreed upon by convention across the internet community and common across all corners of the web, made up of characters and numerical values that allow an end-user to locate and retrieve information (Brookshear & Brylow, 2014; Jacobs & Walsh, 2004; Patel, 2013). This unique addressing convention allows the internet to store massive amounts of information without URI collision, which occurs when two URIs hold the same value and can confound search engines (Jacobs & Walsh, 2004). An example of a URI could be www.skyhernandez.com.
    • For a computer to locate the information this URI points to, which is hosted on a web server, a web browser like Google Chrome, Microsoft Edge, or Firefox uses the HTTP protocol to retrieve that information and display it (Brookshear & Brylow, 2014). The web browser sends an HTTP GET request for www.skyhernandez.com over TCP/IP port 80, and as long as the browser is granted access to the information stored at the URI, the web server returns an HTTP response carrying the sought-after information (Jacobs & Walsh, 2004). In a sense, the HTTP protocol plays an important role in information access management (a minimal request sketch appears after this list).
    • Once the web server's HTTP response arrives with the information stored at www.skyhernandez.com, the web browser must convert that information and display it on the computer screen. HTML is a <tag>-based notational code that helps a browser read the information sent by the web server and display it in a simple or rich format (Brookshear & Brylow, 2014; Patel, 2013). The <html /> tags are used with the HTTP protocol to identify the content type and encoding style for displaying the web server's information in the browser (Jacobs & Walsh, 2004). The <head /> and <title /> tags carry metadata about the document, whereas the <body /> tag contains the information of the URI, and <a /> tags link to other relevant URIs.
  • Web 2.0: Changed the state of the internet from a read-only state to a read/write state and grew communities that hold common interests (Patel, 2013). This version of the web led to more social interaction, giving people and content importance on the web, through the introduction of social media tools delivered as web applications (Li, 2010; Patel, 2013; Sakr, 2014). Web applications can include event-driven and object-oriented programming designed to handle concurrent activities for multiple users, and they have a graphical user interface (Connolly & Begg, 2014; Sandén, 2011). Key technologies include:
    • Weblogs (blogs), video logs (vlogs), and audio logs (podcasts) are all content in various styles published in descending chronological order, which can be tagged with keywords for categorization and made available when people need them (Li, 2010; Patel, 2013). These logs can be used for fact-finding because content is stored chronologically (Connolly & Begg, 2014).
    • Really Simple Syndication (RSS) is a web and data feed format that summarizes data by producing an XML file in an open standard format (Patel, 2013; Services, 2015). It is regularly used for data collection (Services, 2015).
    • Wikis are editable and expandable by those who have access to the data, and information can be restored or rolled back (Patel, 2013). Wiki editors ensure data quality and encourage participation from the community in providing meaningful content to the community (Li, 2010).
  • Web 3.0: This is the state of the web as of 2017. It involves the semantic web, which is driven by data integration through the use of metadata (Patel, 2013). This version of the web supports a worldwide database with static HTML documents, dynamically rendered data, the next HTML standard (HTML5), and links between documents, with the hope of creating interconnected, interrelated, and openly accessible world data such that tagged micro-content can be easily discovered through search engines (Connolly & Begg, 2014; Patel, 2013). HTML5 can handle multimedia and graphical content and introduces new tags like <section />, <article />, <nav />, and <header />, which are great for semantic content (Connolly & Begg, 2014). Also, end-users are beginning to build dynamic web applications for others to interact with (Patel, 2013). Key technologies include:
    • Extensible Markup Language (XML) is a tag-based metalanguage (Patel, 2013). Authors are not limited to tags defined by other people and can create tags at their own pace rather than waiting for a standards body to approve a tag structure (UK Web Design Company, n.d.).
    • Resource Description Framework (RDF) is based on URI triples <subject, predicate, object>, which help describe data properties and classes (Connolly & Begg, 2014; Patel, 2013). An RDF data set usually declares a @prefix at the top (Connolly & Begg, 2014). For instance,

@prefix s: <https://skyhernandez.com/> .
<https://skyhernandez.com/> s:Author <https://skyhernandez.wordpress.com/about/> .
<https://skyhernandez.wordpress.com/about/> s:Name "Dr. Skylar Hernandez" .
<https://skyhernandez.wordpress.com/about/> s:e-mail "dr.sky.hernandez@gmail.com" .

  • Web 4.0: This is considered the symbiotic web, where data interactions occur between humans and smart devices, the internet of things (Atzori, 2010; Patel, 2013). These smart devices can be wired to the internet or connected via wireless sensors through enhanced communication protocols (Atzori, 2010). These smart devices would have read and write concurrency with humans, and the largest potential of Web 4.0 is for these smart devices to analyze data online and begin migrating the online world into the real world (Patel, 2013). Besides interacting with the internet and the real world, internet-of-things smart devices would be able to interact with each other (Atzori, 2010). Sakr (2014) stated that this web ecosystem is built on four key items:
    • Data devices where data is gathered from multiple sources that generate the data
    • Data collectors are devices or people that collect data
    • Data aggregation from the IoT, people, RFIDs, etc.
    • Data users and data buyers are people that derive value out of the data
  • Web 5.0: Previous iterations of the web do not perceive people's emotions, but one day the web could be able to understand a person's emotional state (Patel, 2013). Kelly (2007) predicted that in 5,000 days the internet would become one machine and all other devices would be windows into this machine. In 2007, Kelly stated that this one machine, "the internet," has the processing capability of one human brain, but in 5,000 days it will have the processing capability of all of humanity.
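
To make the Web 1.0 request/response flow described above concrete, the following minimal sketch issues an HTTP GET over port 80 using only Python's standard library. The host name reuses the URI example from above; the header values and error handling are illustrative assumptions, not part of the cited sources.

import http.client

def fetch_homepage(host="www.skyhernandez.com"):
    # Open a TCP connection to port 80, the conventional HTTP port.
    conn = http.client.HTTPConnection(host, 80, timeout=10)
    try:
        # The browser-equivalent step: send an HTTP GET request for "/".
        conn.request("GET", "/", headers={"Host": host, "User-Agent": "demo-client"})
        # The web server replies with an HTTP response: status line, headers, and body.
        response = conn.getresponse()
        print(response.status, response.reason)   # e.g., 200 OK, or a 4xx/5xx error
        return response.read().decode("utf-8", errors="replace")  # HTML for the browser to render
    finally:
        conn.close()

if __name__ == "__main__":
    print(fetch_homepage()[:200])   # show the start of the returned HTML document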

Performance Bottlenecks & Root Causes

There is a trend in "performance bottlenecks" when accessing large-scale web data as web technology evolves. A bottleneck occurs when the flow of information is slowed down or stopped entirely, to the point that it can cause a bad end-user experience (TechTarget, 2007; Thomas, 2012). Performance bottlenecks can cause an application to perform below expectations (Thomas, 2012). As the internet has evolved, new performance bottlenecks have begun to appear:

  • Web 1.0: When two URIs hold the same value, they can confound search engines; hence the move from IPv4 to IPv6 (Jacobs & Walsh, 2004). Transfer protocols rely on network devices: network interface cards, firewalls, cables, tight security, load balancers, routers, etc., all of which affect the flow of information (bandwidth) (Jacobs & Walsh, 2004; Thomas, 2012). Finally, HTTP information is pulled from the web server, so low-capacity computers, broken links, tight security, and poor configurations can result in HTTP errors (4xx, 5xx), large numbers of open connections, memory leaks, lengthy queues, extensive data scans, database deadlocks, etc. (Thomas, 2012).
  • Web 2.0: Database demands on the web application grow because there are more write and read transactions (Sakr, 2014), in addition to all the performance bottlenecks from Web 1.0.
  • Web 3.0: Searching for information in the data is tough and time-consuming without a computer processing application (UK Web Design Company, n.d.). Data is tied to the logic and language like HTML without a readily made browser to simply explore the data and therefore may require HTML or other software to process the data (Brewton, Yuan, & Akowuah, 2012; UK Web Design Company, n.d.). Syntax and tags are redundant, which can consume huge amounts of bytes, and slow down processing speeds (Hiroshi, 2007).
  • Web 4.0: IoT creates performance bottlenecks, and it also has its own issues if it is left on its own and not tied to anything else (Jaffe, 2014; Newman, 2016): (a) the devices cannot deal with the massive amounts of data generated and collected; and (b) the devices cannot learn from the data they generate and collect. Finally, there are also potential infrastructure performance bottlenecks for IoT (Atzori, 2010): (a) the huge number of internet-oriented devices will take up the last few IPv4 addresses; (b) things-oriented and internet-oriented devices could spend time in sleep mode, which is not typical for current devices using current IP networks; (c) IoT devices connecting to the internet produce smaller packets of data at higher frequency than current devices; (d) each device would have to use a common interface and standard protocols shared with other devices, which can easily flood the network and increase the complexity of the middleware software layer design; and (e) IoT devices are vastly heterogeneous objects, where each device has its own function and its own way of communicating.
  • Web 5.0: Given the assumptions about what this version of the web would become, a possible performance bottleneck would be the number of resources consumed to keep the web operating.

Overall, Kelly (2007) stated that there are 100 billion clicks per day, 55 trillion links, 2 million emails per second, 1 million IM messages, and so on; big data will impact the performance of the web. Big data will primarily impact the web server: because of the increasing size of information and the potential lack of bandwidth, big data can slow down throughput performance (Thomas, 2012).

High-level strategies to mitigate

The web should evolve with time to keep up with the demands and needs of society, and it is predicted to change significantly. What the web evolves into is yet to be seen, but it will be quite unique. However, with multiple heterogeneous types of devices (IoT) trying to connect to the web, a standardized protocol and interface to the web should be adopted (Atzori, 2010). A move from IPv4 to IPv6 must happen to accommodate the massive number of IoT devices that are expected to generate data and connect to the web, and to seize this opportunity to gain more data.

Given Kelly's (2007) centralized view of the web as one system, the information and data are actually stored in a distributed fashion, and better algorithms are needed to connect relevant data and information more efficiently. Data storage is cheap, but at the rate of data creation by IoT and other sources, processing speeds through parallel and distributed processing must increase to take advantage of this explosion of big data onto the web. A gap has been created by the fact that data collection is outpacing the speed of data processing and analysis. Given this gap, a data scientist should prioritize which subset of the data is valuable enough to analyze to solve certain problems. This is a useful workaround; however, it sometimes ignores the full data set generated by a system. It is not enough to analyze only the parts of the data deemed valuable, because another part of the data may reveal more insight (the whole is greater than the sum of its parts argument).

NoSQL database types and extensible markup languages have enhanced how information and data are related to each other, but standardization and perhaps an automated 1:1 mapping of the same data being represented in different NoSQL databases may be needed to gain further insights faster. Web application developers should also be more resource conscious, trying to get more computational results for fewer resources.

Resources:

  • Atzori, L., Iera, A., & Morabito, G. (2010). The Internet of Things: A survey. Computer Networks, 54(15), 2787–2805.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Sandén, B. I. (2011). Design of Multithreaded Software: The Entity-Life Modeling Approach. Wiley-Blackwell. VitalBook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.
  • Services, EMC E. (2015) Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons P&T. VitalBook file.

Document store NoSQL databases

NoSQL (Not only Structured Query Language) databases are databases used to store data in non-relational form, i.e., graph, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). NoSQL databases have benefits in that they provide a data model for applications that require less code and less debugging, run on clusters, handle large-scale data, and evolve with time (Sadalage & Fowler, 2012). Document store NoSQL databases use a key/value pair in which the key identifies the document and the value is the document itself, which could be in JSON, BSON, or XML (Sadalage & Fowler, 2012; Services, 2015). These document files are hierarchical trees (Sadalage & Fowler, 2012).

Because parts of a document can be updated in real time, this type of NoSQL database allows for easy creation and storage of dynamic data like website page views, unique views, or new metrics (Sadalage & Fowler, 2012). To help speed up searches of a document store NoSQL database, such as content spread across multiple web pages or stored log files, indexes can be created (Services, 2015). These indexes can be built on attributes, such as "state," "city," or "zip code," and each of these attributes may hold the same, different, or null values across documents in the NoSQL database, all of which is allowed (Sadalage & Fowler, 2012).

If you want to insert, update, or delete data (also known as a transaction) in a NoSQL database, it will either succeed or fail; it won't have the ability of traditional databases to either commit or roll back (Sadalage & Fowler, 2012). According to the CAP theorem (Consistency, Availability, and Partition tolerance), only two of the three properties can be guaranteed at once, and document store databases primarily focus on availability by replicating data across different nodes (Hurst, 2010; Sadalage & Fowler, 2012). Some key players in the document store database realm are CouchDB, MongoDB, OrientDB, RavenDB, and Terrastore (Sadalage & Fowler, 2012). This discussion will focus on CouchDB and MongoDB, which are open source and have scalability features (CouchDB, n.d.; MongoDB, n.d.; Sadalage & Fowler, 2012).

CouchDB is an Apache project available for Windows, Linux, and Mac OS X, and it is also:

  • AP database system (Hurst, 2010)
  • AP systems can achieve consistency if data can be replicated and verified (Hurst, 2010)
  • Globally distributed server cluster to allow for accessing data and implementing projects anywhere through a data replication protocol (CouchDB, n.d.)
  • Data can be stored on a single or clustered server, via locally on the company’s servers, virtual machines, Raspberry Pi servers, or on a cloud provider (CouchDB, n.d.)
  • Allows for offline end user experience (CouchDB, n.d.)
  • Can use MapReduce for deriving insights from the data (CouchDB, n.d.)
  • Uses the HTTP protocol and JSON data (CouchDB, n.d.) (see the sketch after this list)
  • Only allowing for appending data helps create a crash-resistant data structure (CouchDB, n.d.)
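
A minimal sketch of CouchDB's HTTP/JSON interface from Python, assuming a CouchDB instance is running locally at http://localhost:5984 and that the requests library is installed; the database name, document ID, and credentials are hypothetical and should be adjusted to your installation.

import requests

BASE = "http://localhost:5984"
AUTH = ("admin", "password")   # hypothetical credentials; adjust to your setup

# CouchDB creates a database with a plain HTTP PUT.
requests.put(f"{BASE}/pageviews", auth=AUTH)

# Append a JSON document; CouchDB stores each document under its own ID.
doc = {"url": "www.skyhernandez.com", "views": 1024, "unique_views": 512}
requests.put(f"{BASE}/pageviews/homepage-2017-01-01", json=doc, auth=AUTH)

# Read the document back with an HTTP GET; the response body is JSON.
response = requests.get(f"{BASE}/pageviews/homepage-2017-01-01", auth=AUTH)
print(response.status_code, response.json())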

MongoDB is available for Windows, Linux, Mac OS X, Solaris, etc., and it is:

  • CP database system (Hurst, 2010)
  • CP systems have issues keeping data available across all nodes through their replication system (Hurst, 2010; Sadalage & Fowler, 2012)
  • Used by companies like Expedia, Forbes, Bosch, AstraZeneca, MetLife, Facebook, Urban Outfitters, Sprinklr, The Guardian, Comcast, etc., such that 33% of the Fortune 100 are using it (MongoDB, n.d.)
  • Has an expressive query language and secondary indexes out of the box to help access and understand the data stored within it, which makes it easier to use and requires fewer lines of code (MongoDB, n.d.; Sadalage & Fowler, 2012) (see the sketch after this list)
  • Allows for a flexible data model that evolves with time as the data stored in it evolves (MongoDB, n.d.)
  • Allows for integration of silo, internet of things, mobile, catalog data to help provide real-time analytics (MongoDB, n.d.)
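
To ground the points above, here is a minimal sketch of MongoDB's document model, secondary indexes, and query language using the pymongo driver; it assumes a MongoDB server on localhost:27017, and the database, collection, and field names are made up for illustration.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
views = client["analytics"]["page_views"]

# Documents in the same collection may carry different attributes (flexible data model).
views.insert_one({"url": "www.skyhernandez.com", "city": "Denver", "state": "CO", "views": 42})
views.insert_one({"url": "www.skyhernandez.com/about", "views": 7})   # no city or state at all

# A secondary index on "state" speeds up lookups on that attribute.
views.create_index("state")

# Expressive query: every document from Colorado with more than 10 views.
for doc in views.find({"state": "CO", "views": {"$gt": 10}}):
    print(doc["url"], doc["views"])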

References

  • CouchDB (n.d.). CouchDB, relax. Apache. Retrieved from http://couchdb.apache.org/
  • Hurst, N. (2010). Visual guide to NoSQL systems. Retrieved from http://blog.nahurst.com/visual-guide-to-nosql-systems
  • MongoDB (n.d.). MongoDB, for giant ideas. Retrieved from https://www.mongodb.com/
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].

Compelling topics

Hadoop, XML and Spark

Hadoop is predominately known for its Hadoop Distributed File System (HDFS), where the data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). MapReduce has two phases: one that maps the input data into an intermediate format and splits it across a group of computer nodes, and a second that reduces the data in each node so that, when all the nodes are combined, it can provide the answer sought (Eini, 2010).
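
As a concrete illustration of the two phases, the sketch below is a word count written in the style of a Hadoop Streaming job in Python: the mapper emits intermediate (key, value) pairs and the reducer combines the values for each key. The word-count task and the sample lines are made up for illustration, and the sketch runs standalone here rather than on a cluster.

from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: turn each input record into intermediate (word, 1) pairs.
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Reduce phase: pairs arrive grouped by key (Hadoop sorts between the phases).
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["the web is a network of networks", "the web evolves"]
    for word, total in reducer(mapper(sample)):
        print(word, total)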

XML documents represent a whole data file, which contains markups, elements, and nodes (Lublinsky, Smith, & Yakubovich, 2013; Myer, 2005):

  • XML markups are tags that help describe the data start and end points as well as the data properties/attributes, which are encapsulated by < and >
  • XML elements are data values, encapsulated by an opening <tag> and a closing </tag>
  • XML nodes are part of the hierarchical structure of a document that contains a data element and its tags

Unfortunately, the syntax and tags are redundant, which can consume huge amounts of bytes and slow down processing speeds (Hiroshi, 2007).

Five questions must be asked before designing an XML data document (Font, 2010):

  1. Will this document be part of a solution?
  2. Will this document have design standards that must be followed?
  3. What part may change over time?
  4. To what extent is human readability or machine readability important?
  5. Will there be a massive amount of data? Does file size matter?

All XML data documents should be versioned, and key stakeholders should be involved in the XML data design process (Font, 2010). XML is a machine- and human-readable data format (Smith, 2012). With the goal of using XML for MapReduce, we need to assume that we need to map and reduce huge files (Eini, 2010; Smith, 2012). Unfortunately, XML doesn't include sync markers in its data format, and therefore MapReduce doesn't support XML natively (Smith, 2012). However, Smith (2012) and Rohit (2013) used the XmlInputFormat class from Mahout to bring XML input data into HBase. Smith (2012) stated that Mahout's code needs to know the exact sequence of XML start and end tags that will be searched for, and that elements with attributes are hard for Mahout's XML library to detect and parse.

Apache Spark started from a working group inside and outside of UC Berkeley, in search of an open-source batch processing model for multi-pass algorithms beyond MapReduce (Zaharia et al., 2012). Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets and 3x-5x for relatively large datasets, but Spark is more memory intensive, and the speed advantage disappears as available memory goes down to zero with really large datasets (Gu & Li, 2013). Apache Spark, on its website, boasts that it can run programs 100x faster than Hadoop's MapReduce in memory (Spark, n.d.). Spark outperforms Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013). Also, Spark runs 10x faster than Hadoop on disk (Spark, n.d.). Gu and Li (2013) recommend that if speed to the solution is not an issue but memory is, then Spark shouldn't be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized.
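
To show why keeping data in memory matters for iterative jobs, here is a minimal PySpark sketch that caches a dataset and makes several passes over it; it assumes pyspark is installed and running in local mode, and the threshold computation is an arbitrary stand-in for a real iterative algorithm.

from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-demo")

# cache() keeps the dataset in memory, so each later pass avoids re-reading it.
points = sc.parallelize(range(1, 1_000_001)).map(float).cache()
total = points.count()

for step in range(1, 6):   # an iterative job: repeated passes over the same data
    above = points.filter(lambda x: x > step * 100_000).count()
    print(f"pass {step}: {above} of {total} points above the threshold")

sc.stop()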

Data visualization

Big data can be defined as any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014). What is considered big data can change with respect to time: what was considered big data in 2002 is not considered big data in 2016, due to advancements made in technology over time (Fox & Do, 2013). Then there is data-in-motion, which can be defined as a part of data velocity that deals with the speed of data coming in from multiple sources as well as the speed of data traveling between systems (Katal, Wazid, & Goudar, 2013). Essentially, data-in-motion can encompass data streaming, data transfer, or real-time data. However, there are challenges and issues that have to be addressed when conducting real-time analysis on data streams (Katal et al., 2013; Tsinoremas et al., n.d.).

It is not enough to analyze the relevant data for data-driven decisions; relevant visualizations of that data must also be selected to enable those data-driven decisions (eInfochips, n.d.). There are many ways to visualize data to highlight key facts stylishly and succinctly: tables and rankings, bar charts, line graphs, pie charts, stacked bar charts, tree maps, choropleth maps, cartograms, pinpoint maps, or proportional symbol maps (CHCF, 2014). These plots, charts, maps, and graphs can be part of animated, static, or interactive visualizations, and they can be standalone images, dashboards, scorecards, or infographics (CHCF, 2014; eInfochips, n.d.).
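
As a small illustration of matching the chart type to the message, the sketch below draws a bar chart for a ranking and a line graph for a trend with matplotlib; the regions, years, and values are invented for the example.

import matplotlib.pyplot as plt

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# A ranking across categories reads naturally as a bar chart.
left.bar(["North", "South", "East", "West"], [42, 31, 25, 18])
left.set_title("Page views by region (ranking)")

# Change over time reads naturally as a line graph.
right.plot([2013, 2014, 2015, 2016, 2017], [1.1, 1.6, 2.3, 3.1, 4.0])
right.set_title("Traffic growth over time (trend)")
right.set_ylabel("Billions of requests")

plt.tight_layout()
plt.show()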

Artificial Intelligence (AI)

Artificial Intelligence (AI) is an embedded technology, based on the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). AI can provide tremendous value since it builds thousands of models and correlations automatically in one week, which used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015). Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the "why" behind them, thus making the results hard to interpret (Power, 2015).

"Machines can excel at frequent high-volume tasks. Humans can tackle novel situations," said Anthony Goldbloom (2016). Thus, the fundamental question decision makers need to ask is how much of a decision reduces to frequent high-volume tasks and how much reduces to novel situations (Goldbloom, 2016). Therefore, if the ratio is skewed toward high-volume tasks, then AI could be a candidate to replace decision makers; if the ratio is evenly split, then AI could augment and assist decision makers; and if the ratio is skewed toward novel situations, then AI wouldn't help decision makers. These novel situations are equivalent to our tough challenges today (McAfee, 2013). Finally, Meetoo (2016) warned that it doesn't matter how intelligent or strategic a job may be; if there is enough data on that job to create accurate rules, it can be automated as well, because machine learning can run millions of simulations against itself to generate huge volumes of data to learn from.



Data Tools: Artificial Intelligence

Big data Analytics and Artificial Intelligence

Artificial Intelligence (AI) is an embedded technology, based on the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). Previously, AI wasn't able to come into existence because the computational power provided today wasn't available (Cringely, 2013). AI can make use of data hidden in "dark wells" and silos, where the end-user had no idea that the data even existed to begin with (Power, 2015). The goal of AI is to use huge amounts of data to draw out a set of rules through machine learning that will effectively replace experts in a certain field (Cringely, 2013; Power, 2015). Cringely (2013) stated that in some situations big data can eliminate the need for theory and that AI can aid in analyzing big data where theory is either lacking or impossible to define.

AI can provide tremendous value since it builds thousands of models and correlations automatically in one week, which used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015). One thing that slowed down the progression of AI in the past was the creation of human-readable computer languages like XML or SQL, which are not intuitive for computers to read (Cringely, 2013). Fortunately, AI can easily use structured data and can now use unstructured data, thanks to everyone who tags all this unstructured data either in comments or on the data point itself, speeding up the computational time (Cringely, 2013; Power, 2015). Dewey (2013) hypothesized that not only will AI be able to analyze big data at speeds faster than any human can, but that the AI system can also begin to improve its own search algorithms, in a phenomenon called an intelligence explosion. An intelligence explosion is when an AI system begins to analyze itself to improve itself in an iterative process, to the point where there is exponential growth in improvement (Dewey, 2013).

Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the "why" behind them, thus making the results hard to interpret (Power, 2015). It would take many scientists analyzing the same big data to fully understand how the connections were made in the AI system, which is no longer feasible (Cringely, 2013). It is as if data scientists are trying to read the mind of the AI system, and they currently cannot even read a human's mind. However, the results of AI are becoming accurate, with AI identifying cats in photographs after 72 hours of machine learning and after a cat is tagged in only a few photographs (Cringely, 2013). AI could be applied to any field of study, like finance, social science, science, engineering, etc., or even play against champions on the Jeopardy game show (Cyranoski, 2015; Cringely, 2013; Dewey, 2013; Power, 2015).

Example of artificial intelligence use in big data analysis: Genomics

The goal of using AI on genomic data is to help analyze physiological traits and lifestyle choices to provide a dedicated and personalized health plan to treat and eventually prevent disease (Cyranoski, 2015; Power, 2015). This is done by feeding the AI systems huge amounts of genomic data, which is considered big data by today's standards (Cyranoski, 2015). Systems like IBM's Watson (an AI system) could provide treatment options based on the results gained from analyzing thousands or even millions of genomic data sets (Power, 2015). This is done by analyzing all of this data and allowing machine learning techniques to devise algorithms based on the input data (Cringely, 2013; Cyranoski, 2015; Power, 2015). As of 2015, there are about 100,000 individual genomes in the system, and even with this huge amount of data, it is still not enough to provide the personalized health plan that is currently being envisioned from a person's genomic data (Cyranoski, 2015). Eventually, millions of individuals will need to be added into the AI system, and not just their genomic data, but also proteomics, metabolomics, lipidomics, etc.


Data Tools: XML & Hadoop

Hadoop is predominately known for its Hadoop Distributed File System (HDFS), where the data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). MapReduce has two phases: one that maps the input data into an intermediate format and splits it across a group of computer nodes, and a second that reduces the data in each node so that, when all the nodes are combined, it can provide the answer sought (Eini, 2010). In other words, data is partitioned, sorted, and grouped to provide a key and value as an output (Rathbone, 2013). As more data gets added in real time (data in motion), MapReduce can do the recalculations more cheaply than before, and the data scientist doesn't have to touch the data (Eini, 2010; Roy, 2014). Roy (2014) suggested an example of using Intensive Care Unit (ICU) sensor data, which comes into a database multiple times per second, to help avoid the false positive alarms that could overwork hospital staffers. However, Hadoop is best used for non-real-time tasks with a huge demand for processing power (Rathbone, 2013). The issue for Hadoop is identifying the correct instant at which an actionable item is needed and acting on that item (Roy, 2014).

Does XML have any impact on MapReduce application design?

XML is a machine- and human-readable data format (Smith, 2012). With the goal of using XML for MapReduce, we need to assume that we need to map and reduce huge files (Eini, 2010; Smith, 2012). Unfortunately, XML doesn't include sync markers in its data format, and therefore MapReduce doesn't support XML natively (Smith, 2012). There are posts out there by coders who use workarounds to allow for XML processing in Hadoop (Atom, 2010; Krishna, 2014; Rohit, 2013; Smith, 2012). Smith (2012) and Rohit (2013) used the XmlInputFormat class from Mahout to bring XML input data into HBase. So, the path the data scientist chooses determines how much work is needed to be able to use MapReduce: code a new way of reading, mapping, and reducing XML data from scratch, or use libraries from other code bases that are compatible with Hadoop. Smith (2012) stated that Mahout's code needs to know the exact sequence of XML start and end tags that will be searched for, and that elements with attributes are hard for Mahout's XML library to detect and parse. Depending on the complexity of the XML document, Smith's (2012) statement may mean that more complex XML input code is needed. Therefore, a well-designed XML document could make this process a bit easier, but the complexity of the data stored in it will make the task of creating code for using MapReduce on XML data harder. Finally, Smith (2012) recommended a preprocessing step that converts the XML data so that each record can be treated as a single line by libraries native to MapReduce.
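
A minimal sketch of that preprocessing idea: stream through a large XML file and write each repeated record onto its own line, so that Hadoop's default line-oriented input format (and a streaming mapper) can consume it. The element name "record" and the file names are assumptions made for illustration.

import xml.etree.ElementTree as ET

def xml_to_lines(xml_path, out_path, record_tag="record"):
    # iterparse streams the document, so a huge file never sits fully in memory.
    with open(out_path, "w", encoding="utf-8") as out:
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == record_tag:
                # Serialize the record without newlines: one record per output line.
                line = ET.tostring(elem, encoding="unicode").replace("\n", " ")
                out.write(line.strip() + "\n")
                elem.clear()   # free the parsed element to keep memory bounded

if __name__ == "__main__":
    xml_to_lines("input.xml", "records.txt")   # hypothetical file names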


Data Tools: XML Design

Good XML Design Documentation for improved performance

Five questions must be asked before designing an XML data document (Font, 2010):

  1. Will this document be part of a solution?
  2. Will this document have design standards that must be followed?
  3. What part may change over time?
  4. To what extent is human readability or machine readability important?
  5. Will there be a massive amount of data? Does file size matter?

All XML data documents should be versioned, and key stakeholders should be involved in the XML data design process (Font, 2010).

A few rules (not a comprehensive list) on making a good XML design:

  1. Be consistent with your design and design for extensions by multiple people (Google, 2008; Font, 2010).
  2. Reuse existing XML formats (Google, 2008)
  3. Tag each unit of information, maintain a minimal amount of text that can be processed as a whole (Harold, 2003)
    1. An element has a start tag and an end tag only <menu></menu>, but an attribute describes an element inside of a tag <menu-item portion-size="500" portion-units="g"></menu-item>; therefore, an attribute provides some properties to the element (Harold, 2003; Ogbuji, 2004).
    2. The principle of core content: know when to use an element versus an attribute: use elements when the information is an essential part of the material, and use an attribute if the information is peripheral or incidental to the main message. Essentially, "data goes in elements, metadata in attributes" (Ogbuji, 2004; Oracle, n.d.).
      1. Elements must be in a namespace, and attributes shouldn't be in a namespace (Google, 2008)
  4. Avoid implicit structures, which occur through the addition of white space (Harold, 2003)
    1. This can be seen easily with names, where white spaces are placed between the first name, middle name, and last name. Ogbuji (2004b) suggested using well-established elements like <firstname/>, <othername/>, <surname/>, <forename/>, <rolename/>, <namelink/>, <genname/>, and <addname/> to address the eccentricities of a person's name from various cultures.
    2. Post office addresses pose the same issue, so Ogbuji (2004b) suggested these established elements: <street/>, <postcode/>, <pob/>, <city/>, <state/>, <country/>, <otheraddr/>, <phone/>, <fax/>, and <email/>.
  5. Use a standard and accepted element reference guide like the DocBook Element Reference (Walsh & Muellner, 2006), or something similar, and stick with that convention.
    1. Use published standard abbreviations for constructing names (Google, 2008; Walsh & Muellner, 2006)
  6. Avoid using hyphens "-" in your naming convention (Font, 2010)
  7. Avoid the use of boolean values (Google, 2008)
  8. Keep the document structure readable (the principle of readability); do not make it too troublesome to process or read (Harold, 2003; Ogbuji, 2004). For example, use elements for readability and understandability by humans, and attributes for machine digestion (Ogbuji, 2004).
  9. Comments should not be used to contain data, but rather to-dos (Google, 2008)

Example of an XML Document (W3 Schools, n.d.)

<?xml version="1.0" encoding="UTF-8"?>
 
 <shiporder orderid="889923"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="shiporder.xsd">
   <orderperson>John Smith</orderperson>
   <shipto>
     <name>Ola Nordmann</name>
     <address>Langgt 23</address>
     <city>4000 Stavanger</city>
     <country>Norway</country>
   </shipto>
   <item>
     <title>Empire Burlesque</title>
     <note>Special Edition</note>
     <quantity>1</quantity>
     <price>10.90</price>
   </item>
   <item>
     <title>Hide your heart</title>
     <quantity>1</quantity>
     <price>9.90</price>
   </item>
 </shiporder>

Analysis of XML design document from the user’s perspective for improved performance

It would be best for <shipto/> to contain only the address, rather than the two major data sets of name and address, which represents designing for extensibility as in Rule 1. The tags <note/> and <price/> should be attributes of <title> per Rules 3a and 3b. Rule 4a was not followed for <name>Ola Nordmann</name>. Quantity is not an attribute of <item> and thus should be a child element of <item> per Rule 4b. Tags like <name/>, <item/>, <quantity/>, and <price/> do not follow a naming convention as stated in Rule 5, but they could come from a naming convention internal to this company, so this one is hard to evaluate without more information. Rules 6-9 were kept in this example.


Data Tools: Use of XML

XML advantages

+ Writing your own markup language, so you are not limited to the tags defined by other people (UK Web Design Company, n.d.)

+ Creating your tags at your own pace rather than waiting for a standards body to approve the tag structure (UK Web Design Company, n.d.)

+ Allows for a specific industry or person to design and create their set of tags that meet their unique problem, context, and needs (Brewton, Yuan, & Akowuah, 2012; UK Web Design Company, n.d.)

+ It is both a human- and machine-readable format (Hiroshi, 2007)

+ Used for data storage and processing both online and offline (Hiroshi, 2007)

+ Platform independent with forward and backward capability (Brewton et al., 2012; Hiroshi, 2007)

XML disadvantages

– Searching for information in the data is tough and time-consuming without a computer processing application (UK Web Design Company, n.d.)

– Data is tied to the logic and language similar to HTML without a readily made browser to simply explore the data and therefore may require HTML or other software to process the data (Brewton et al., 2012; UK Web Design Company, n.d.)

– Syntax and tags are redundant, which can consume huge amounts of bytes, and slow down processing speeds (Hiroshi, 2007)

– Limited to relational models and object-oriented graphs (Hiroshi, 2007)

– Tags are chosen by their creator; thus, there is no standard set of tags that must be used (Brewton et al., 2012)

XML use in Healthcare Industry

Thanks to the American National Standards Institute, Health Level 7 (HL7) was created with standards for health care XML, and it is now in use by 90% of all large hospitals (Brewton et al., 2012; Institute of Medicine, 2004). The Institute of Medicine (2004) stated that health care data can consist of: allergies, immunizations, social histories, histories, vital signs, physical examinations, physician's and nurse's notes, laboratory tests, diagnostic tests, radiology tests, diagnoses, medications, procedures, clinical documentation, clinical measures for specific clinical conditions, patient instructions, dispositions, health maintenance schedules, etc. More complex data sets, like images, sounds, and other types of multimedia, are yet to be included (Brewton et al., 2012). Also, terminologies within the data elements are not a systemized nomenclature, and the standard does not support web protocols for more advanced communication of health data (Institute of Medicine, 2004). HL7 V3 should resolve a lot of these issues and should also account for a wide variety of health care scenarios (Brewton et al., 2012).

XML use in Astronomy

The Flexible Image Transport System (FITS), currently used by NASA/Goddard Space Flight Center, holds images, spectra, tables, and sky atlas data, and it has been in use for 30 years (NASA, 2016; Pence et al., 2010). The newest version adds a definition of time coordinates, support for long-string keywords, multiple keywords, checksum keywords, and image and table compression standards (NASA, 2016); support for mandatory keywords existed previously (Pence et al., 2010). Besides the differences in data entities, and therefore in the tags needed to describe the data, between the XML used for health care and that used for astronomy, the use of XML for a much longer period has allowed for a more robust solution that has evolved with technology. It is also widely used, as it is endorsed by the International Astronomical Union (NASA, 2016; Pence et al., 2010). Based on the maturity of FITS, due to its creation in the late 1970s, and the fact that it is still in use, heavily endorsed, and a standard still in use today, the health care industry could learn something from this system. The only problem with FITS is that it removes some of the benefits of XML, which include the flexibility to create your own tags, due to the heavy standardization and the standardization body.


Data Tools: XML

What is XML and how is it used to represent data

XML, also known as the eXtensible Markup Language, is a standardized way of allowing objects or data items to be referred to and identified by type(s) in a flexible, hierarchical approach (Brookshear & Brylow, 2014; Lublinsky, Smith, & Yakubovich, 2013; McNurlin, Sprague, & Bui, 2008; Sadalage & Fowler, 2012). XML refers to and identifies objects by type when it assigns tags to certain parts of the data, defining the data (McNurlin et al., 2008). JSON provides similar functionality to XML, but XML's schema and query capabilities are better than JSON's (Sadalage & Fowler, 2012). XML focuses more on semantics than appearance, which allows for searches that understand the contents of the data being considered; therefore, it is considered a standard for producing markup languages like HTML (Brookshear & Brylow, 2014). Finally, XML uses object principles, which tell you what data is needed to perform a function and what output it can give you, just not how it will do it (McNurlin et al., 2008).

XML documents contain descriptions of a service or function, how to request the service or function, the data it needs to perform its work, and the results the service or function will deliver (McNurlin et al., 2008). Also, relational databases have taken on XML as a structuring mechanism by adopting the XML document as a column type, allowing for the creation of XML querying languages (Sadalage & Fowler, 2012). Therefore, XML essentially helps in defining the data and giving it meaning that computers can manipulate and work on, transforming data from a human-readable format into a computer-readable one, and it can help with currency conversion, credit card processing applications, etc. (McNurlin et al., 2008).

XML can handle a high volume of data and can represent all varieties of data, structured, unstructured, and semi-structured, in an active or live (streaming) fashion, such as CAD data used to design buildings, multimedia and music, product bar code scans, photographs of damaged property held by insurance companies, etc. (Brookshear & Brylow, 2014; Lublinsky et al., 2013; McNurlin et al., 2008; Sadalage & Fowler, 2012). By definition, big data is any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013). Therefore, XML can represent big data quite nicely.

Use of XML to represent data in various forms

XML documents represent a whole data file, which contains markups, elements, and nodes (Lublinsky et al., 2013; Myer, 2005):

  • XML markups are tags that help describe the data start and end points as well as the data properties/attributes, which are encapsulated by < and >
  • XML elements are data values, encapsulated by an opening <tag> and a closing </tag>
  • XML nodes are part of the hierarchical structure of a document that contains a data element and its tags

Data can be comprised of text and numbers, like a telephone number (123)-456-7890, which can be represented in XML as <phone country="U.S.">1234567890</phone>. The country attribute adds hierarchy to the object, defining it further as a U.S. phone number (Myer, 2005). Given its hierarchical nature, the root data element helps sort all of the data below it, similar to a hierarchical data tree (Lublinsky et al., 2013; Myer, 2005).

Elements can be described with tags, which add context to the data and turn it into information; adding hierarchical structure to these informational elements describes their natural relationships, which aids in transforming information into knowledge (Myer, 2005). This can help in analyzing big data sets.

Myer (2005) provided the following example of XML syntax, which also showcases the XML representation of data in a hierarchical structure:

<Actor type="superstar">
    <name>Harrison Ford</name>
    <gender>male</gender>
    <age>50</age>
</Actor>

Given this simple structure, it can easily be created by humans or by code and thus produced in a document. This is great; however, to derive the value that Myer (2005) was talking about from the XML-formatted data, it will need to be ingested for analysis into a system like Hadoop (Agrawal, 2014). Finally, XML can deal with numerical data such as integers, reals, floats, longs, doubles, NaN, INF, -INF, probabilities, percentages, string data, plain arrays of values (numerical and string arrays), sparse arrays, matrices, sparse matrices, etc. (Data Mining Group, n.d.), thus addressing the aforementioned varieties of data at high volumes.
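
To show how such a document can be ingested by code, here is a small sketch that parses the corrected Actor markup with Python's standard library; the inline string stands in for a file that would normally be read from disk or HDFS.

import xml.etree.ElementTree as ET

doc = """
<Actor type="superstar">
    <name>Harrison Ford</name>
    <gender>male</gender>
    <age>50</age>
</Actor>
"""

actor = ET.fromstring(doc)
print(actor.get("type"))            # the attribute -> "superstar"
for child in actor:                 # child nodes in document order
    print(child.tag, "=", child.text.strip())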

Resources

  • Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/
  • Brookshear, G., & Brylow D. (2014). Computer Science: An Overview (12th ed.). Pearson Learning Solutions. VitalBook file.
  • Data Mining Group (n.d.). PMML 4.1 – General structure. Retrieved from http://dmg.org/pmml/v4-1/GeneralStructure.html
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox. VitalBook file.
  • McNurlin, B., Sprague, R., & Bui, T. (2008). Information Systems Management (8th ed.). Pearson Learning Solutions. VitalBook file.
  • Myer, T. (2005). A really, really, really good introduction to xml. Retrieved from https://www.sitepoint.com/really-good-introduction-xml/
  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Pearson Learning Solutions. VitalBook file.