Data Tools: Data-In-Motion

Definition of terms

Data-in-motion: a part of data velocity, which deals with the speed of data coming in from multiple sources as well as the speed of data traveling between systems (Katal, Wazid, & Goudar, 2013). Essentially, data-in-motion can encompass data streaming, data transfer, or real-time data. However, there are challenges and issues that have to be addressed when conducting real-time analysis on data streams (Katal et al., 2013; Tsinoremas et al., n.d.).

Data complexity: the joining, cleaning, and transformation of data from multiple systems in order to find highly correlated relationships (Katal et al., 2013). Complexity increases as the velocity of incoming or transferred data increases (Katal et al., 2013; Tsinoremas et al., n.d.).
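To make the data-in-motion idea concrete, below is a minimal Python sketch (not from the cited sources) of a simulated sensor stream in which a rolling statistic is computed as each reading arrives, rather than after the data has come to rest in storage; the values and window size are invented.

from collections import deque

def sensor_stream():
    # Stand-in for readings arriving continuously from an instrument.
    for value in [142, 138, 150, 133, 147, 129]:
        yield value

window = deque(maxlen=3)  # keep only the most recent readings "in motion"
for reading in sensor_stream():
    window.append(reading)
    print(f"latest={reading}, rolling mean={sum(window) / len(window):.1f}")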

Data-in-motion analytics performed in case study (Blount et al., 2010)

Artemis was designed, built, and deployed in 2009 through a coalition of the University of Ontario Institute of Technology, SickKids, the Department of Pediatrics, and the University of Toronto, to read in data from multiple sensors in neonatal intensive care units (NICUs).  The goal is for Artemis to read in data from multiple physiological instruments, such as the electrocardiogram (ECG), heart rate, blood oxygen saturation, and respiratory state, and to find key patterns and relationships in these data streams (data-in-motion) in order to provide the best care for infants in the NICU.  To make Artemis a success, the coalition had to analyze huge amounts of data from a large group of patients.  Artemis had to interface with multiple medical devices, be scalable so that more devices could be added, and store raw physiological data while de-identifying that data per U.S. and Canadian health privacy laws.  From these multiple medical devices, new rules could be created through unsupervised machine learning techniques, and medically/clinically derived rules could be applied through supervised machine learning techniques.  The Artemis system has to read in the data in real time; sort, join, clean, and transform it; evaluate it against certain rules; and decide whether to send an alert to medical staff about one of the NICU patients, all while de-identifying the data and storing it in a database for future analysis and tests.

In the test phase, five infants were enrolled, and in the deployed state, 19 infants were enrolled in the study. The study also had to take into account that the cables from the sensors and the equipment used to collect the streaming data must not get in the way of the medical/clinical staff when they need to attend to an infant. In some cases during deployment, some of the sensors were not attached, so the information management teams had to work with the medical/clinical staff to train the models on fewer data streams when not all of the ideal sensors needed to send alerts for certain situations were available.  Therefore, this system provides a way for medical/clinical staff to have constant, real-time data on NICU patients from multiple sensors and allows the machine to alert them when certain markers and key performance indicators are met.
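As a rough illustration of the pipeline described in this case study (and not the actual Artemis implementation), the Python sketch below reads one sensor record, evaluates a clinically derived rule, raises an alert when the rule fires, and de-identifies the record before storing it; the field names, threshold, and hashing scheme are assumptions.

import hashlib

archive = []  # stand-in for the database of de-identified physiological data

def spo2_rule(reading):
    # Example of a clinically derived rule: flag low blood oxygen saturation.
    return reading["spo2"] < 85

def de_identify(reading):
    record = dict(reading)
    # Replace the patient identifier with a one-way hash before storage.
    record["patient_id"] = hashlib.sha256(reading["patient_id"].encode()).hexdigest()[:12]
    return record

def process(reading):
    if spo2_rule(reading):
        print(f"ALERT: low SpO2 at bed {reading['bed']}")
    archive.append(de_identify(reading))

process({"patient_id": "MRN-00123", "bed": "NICU-7", "spo2": 82, "hr": 165})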

Importance of applying data analytics to data-in-motion

It is easy to see that analyzing infant NICU data is important, and especially important to apply analytics to the data streams of the key medical sensors attached to each infant in the NICU.  What is not always easy to see is how important all of the data really is: the real-life deployment showed that not all of the medical sensors are always in use, which can leave the model with too little information to be of use to medical/clinical staff (Blount et al., 2010).

Also, the use of data streams in a university setting offers a different perspective that could be applied to the NICU case study above.  At the University of Miami, data is triaged into a four-tiered system (Tsinoremas et al., n.d.):

  • High-speed storage – for data that is currently being processed, where data-in-motion is at its highest (300 TB of space at about $2,000/TB)
  • Mid-range-speed storage – for data that is currently being looked at (about $600-$700/TB)
  • Deep storage – long-term storage for data that is looked at every so often, but not regularly, usually older data (about $300/TB)
  • Archived storage – data stored offline; well suited for data at rest

This tiered system could be applied to Artemis to prioritize which medical devices' data should be processed first when resources are limited.  It could also be applied differently, such that there is a window of data that is currently available, e.g., a one-hour record of NICU stats kept locally, with longer records still accessible but not stored in the vital processing spaces.  Data windows were discussed by Blount et al. (2010), and depending on the situation, the windows could be adjusted to provide the best care for the infants.
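A minimal sketch of that windowing idea, assuming a one-hour local window and two storage tiers (the tier names, times, and values are invented): recent readings stay in a fast local buffer, and older readings are evicted to a deeper, cheaper tier while remaining accessible.

import time
from collections import deque

WINDOW_SECONDS = 60 * 60      # roughly one hour of NICU stats kept locally
hot_buffer = deque()          # fast, limited "high-speed" tier
deep_storage = []             # slower, cheaper tier for older readings

def ingest(timestamp, reading):
    hot_buffer.append((timestamp, reading))
    # Evict anything older than the window into the deeper tier.
    while hot_buffer and hot_buffer[0][0] < timestamp - WINDOW_SECONDS:
        deep_storage.append(hot_buffer.popleft())

now = time.time()
ingest(now - 2 * 60 * 60, {"hr": 140})   # an old reading, evicted on the next call
ingest(now, {"hr": 150})                 # a current reading stays local
print(len(hot_buffer), "reading(s) in the hot tier,", len(deep_storage), "archived")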

Also, the quality of the sensor data must be taken into account.  If more data is needed or preferred to make informed decisions about infant patients in the NICU (Blount et al., 2010), then the focus should be on collecting and analyzing high-quality data and the right types of data.  This would lead the designers of Artemis and the medical/clinical staff to think deeply about which data are relevant and how much data is enough to make the decisions needed to tend to the infants (Katal et al., 2013).

Resources

  • Blount, M., Ebling, M. R., Eklund, J. M., James, A. G., McGregor, C., Percival, N., … & Sow, D. (2010). Real-time analysis for intensive care: development and deployment of the Artemis analytic system. IEEE Engineering in Medicine and Biology Magazine, 29(2), 110-118.
  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
  • Tsinoremas, N. F., Zysman, J., Mader, C., Kirtma, B., & Blaire, J. (n.d.). Data in motion: A new paradigm in research data lifecycle management. Center for Computational Science: University of Miami.

Data Tools: XML & Hadoop

Hadoop is predominantly known for its Hadoop Distributed File System (HDFS), where data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). MapReduce has two phases: the first maps the input data into an intermediate format and splits it across a group of compute nodes, while the second reduces the data on each node so that, when the nodes' results are combined, the answer sought is produced (Eini, 2010). In other words, data is partitioned, sorted, and grouped to provide a key and a value as output (Rathbone, 2013). As more data is added in real time (data-in-motion), MapReduce can do the recalculations more cheaply than before, and the data scientist doesn't have to touch the data (Eini, 2010; Roy, 2014). Roy (2014) suggested the example of Intensive Care Unit (ICU) sensor data, which arrives in a database multiple times per second, being used to help avoid false-positive alarms that could overwork hospital staff.  However, Hadoop is best used for non-real-time tasks with a huge demand for processing power (Rathbone, 2013). The issue for Hadoop is identifying the exact instance in which an actionable item is needed and then acting on that item (Roy, 2014).
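To make the map/partition/sort/group/reduce flow concrete, here is a minimal local Python simulation of the MapReduce pattern (not actual Hadoop code), echoing Roy's (2014) ICU sensor example; the patient IDs and heart-rate values are invented.

from itertools import groupby
from operator import itemgetter

records = [("icu-01", 142), ("icu-02", 133), ("icu-01", 150), ("icu-02", 129)]

# Map phase: emit (key, value) pairs from each input record.
mapped = [(patient_id, heart_rate) for patient_id, heart_rate in records]

# Shuffle/sort phase: Hadoop sorts and groups the intermediate pairs by key.
mapped.sort(key=itemgetter(0))

# Reduce phase: aggregate each key's values (here, a mean heart rate per patient).
for patient_id, group in groupby(mapped, key=itemgetter(0)):
    values = [value for _key, value in group]
    print(patient_id, sum(values) / len(values))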

Does XML have any impact on MapReduce application design?

XML is a machine- and human-readable data format (Smith, 2012). With the goal of using XML for MapReduce, we have to assume that we need to map and reduce huge XML files (Eini, 2010; Smith, 2012). Unfortunately, XML doesn't include sync markers in its data format, and therefore MapReduce doesn't support XML natively (Smith, 2012). There are posts by coders who use workarounds to allow for XML processing in Hadoop (Atom, 2010; Krishna, 2014; Rohit, 2013; Smith, 2012).  Smith (2012) and Rohit (2013) used the XmlInputFormat class from Mahout to read XML input data into HBase.  So, the path the data scientist chooses determines how much work is needed to use MapReduce: code the reading, mapping, and reducing of XML data from scratch, or use libraries from other code bases that are compatible with Hadoop.  Smith (2012) stated that Mahout's code needs to know the exact sequence of XML start and end tags that will be searched for, and that elements with attributes are hard for Mahout's XML library to detect and parse. Depending on the complexity of the XML document, this may mean more complex XML input code is needed.  Therefore, a well-designed XML document can make this process a bit easier, but the more complex the data stored in it, the harder the task of writing MapReduce code for that XML data becomes.  Finally, Smith (2012) recommended a preprocessing step that converts each XML record into a single line of text so it can be treated as one record by the input formats native to MapReduce.
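A minimal sketch of that preprocessing idea (an illustration under assumptions, not Smith's actual code): stream a large XML file and write each record element onto a single line so a line-oriented MapReduce input format can split the file cleanly; the element name "record" and the file names are hypothetical.

import xml.etree.ElementTree as ET

def flatten_xml(in_path, out_path, record_tag="record"):
    with open(out_path, "w", encoding="utf-8") as out:
        # iterparse streams the document, so huge files are never fully loaded into memory.
        for _event, elem in ET.iterparse(in_path, events=("end",)):
            if elem.tag == record_tag:
                line = ET.tostring(elem, encoding="unicode")
                out.write(" ".join(line.split()) + "\n")  # collapse internal newlines
                elem.clear()  # free memory for records already written

flatten_xml("big_input.xml", "records_one_per_line.txt")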

References

Data Tools: XML Design

Good XML Design Documentation for improved performance

Five questions must be asked before designing an XML data document (Font, 2010):

  1. Will this document be part of a solution?
  2. Will this document have design standards that must be followed?
  3. What part may change over time?
  4. To what extent is human readability or machine readability important?
  5. Will there be a massive amount of data? Does file size matter?

All XML data documents should be versioned, and key stakeholders should be involved in the XML data design process (Font, 2010).

A few rules (not a comprehensive list) for making a good XML design:

  1. Be consistent with your design, and design for extension by multiple people (Google, 2008; Font, 2010).
  2. Reuse existing XML formats (Google, 2008).
  3. Tag each unit of information, maintaining a minimal amount of text that can be processed as a whole (Harold, 2003).
    a. An element has only a start tag and an end tag, e.g. <menu></menu>, whereas an attribute describes an element inside its tag, e.g. <menu-item portion-size="500" portion-units="g"></menu-item>; thus an attribute provides properties of the element (Harold, 2003; Ogbuji, 2004).
    b. The principle of core content: know when to use an element versus an attribute. Use elements when the information is an essential part of the material, and use an attribute if the information is peripheral or incidental to the main message. Essentially, "data goes in elements, metadata in attributes" (Ogbuji, 2004; Oracle, n.d.).
      • Elements must be in a namespace, and attributes shouldn't be in a namespace (Google, 2008).
  4. Avoid implicit structures, which arise from the addition of white space (Harold, 2003).
    a. This is seen easily with names, where white space separates the first, middle, and last names. Ogbuji (2004b) suggested using well-established elements such as <firstname/>, <othername/>, <surname/>, <forename/>, <rolename/>, <namelink/>, <genname/>, and <addname/> to address the eccentricities of people's names across cultures.
    b. Postal addresses pose the same issue, so Ogbuji (2004b) suggested these established elements: <street/>, <postcode/>, <pob/>, <city/>, <state/>, <country/>, <otheraddr/>, <phone/>, <fax/>, and <email/>.
  5. Use a standard and accepted element reference guide, such as the DocBook Element Reference (Walsh & Muellner, 2006), or something similar, and stick with that convention.
    a. Use published standard abbreviations for constructing names (Google, 2008; Walsh & Muellner, 2006).
  6. Avoid using hyphens "-" in your naming convention (Font, 2010).
  7. Avoid the use of boolean values (Google, 2008).
  8. Keep the document structure readable (the principle of readability); do not make it too troublesome to process or read (Harold, 2003; Ogbuji, 2004). For example, use elements for readability and understandability by humans, and attributes for machine digestion (Ogbuji, 2004).
  9. Comments should not be used to contain data, but rather to-dos (Google, 2008).

Example of an XML Document (W3 Schools, n.d.)

<?xml version="1.0" encoding="UTF-8"?>

<shiporder orderid="889923"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="shiporder.xsd">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>4000 Stavanger</city>
    <country>Norway</country>
  </shipto>
  <item>
    <title>Empire Burlesque</title>
    <note>Special Edition</note>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
  <item>
    <title>Hide your heart</title>
    <quantity>1</quantity>
    <price>9.90</price>
  </item>
</shiporder>

Analysis of XML design document from the user’s perspective for improved performance

It would be better for the <shipto/> element to contain only the address information rather than mixing two major data sets (the name and the address), which speaks to designing for extension as in Rule 1.  The <note/> and <price/> tags should be attributes of <title/> per Rules 3a and 3b. Rule 4a was not followed for <name>Ola Nordmann</name>.  Quantity is not an attribute of <item/> and thus should be a child element of <item/> per Rule 4b. Tags like <name/>, <item/>, <quantity/>, and <price/> do not follow a published naming convention as stated in Rule 5, but they could come from a naming convention internal to this company, so this one is hard to evaluate without much more information. Rules 6-9 were followed in this example.
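To see how these element/attribute choices surface for a machine consumer, here is a minimal parsing sketch using Python's standard library; it assumes the document above is saved as shiporder.xml.

import xml.etree.ElementTree as ET

root = ET.parse("shiporder.xml").getroot()

# Attribute access: metadata such as the order id lives on the element's tag.
print("Order:", root.get("orderid"))

# Element access: core data such as names, quantities, and prices live in child elements.
print("Ship to:", root.findtext("shipto/name"))
for item in root.findall("item"):
    title = item.findtext("title")
    quantity = int(item.findtext("quantity"))
    price = float(item.findtext("price"))
    print(f"{title}: {quantity} x {price}")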

References                                          

Data tools: Analysis of big data involving text mining

Definitions

Big data – any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014).

Text mining – a process that involves discovering implicit knowledge from unstructured textual data (Gera & Goel, 2015; Hashimi & Hafez, 2015; Nassirtoussi, Aghabozorgi, Wah, & Ngo, 2015).

Case study: Basole, Seuss, and Rouse (2013). IT innovation adoption by enterprises: Knowledge discovery through text analytics.

The goal of this study was to use text mining techniques on 472 quality peer-reviewed articles that spanned 30 years of knowledge (1977-2008).  The selection criteria were that the articles focused on the adoption of IT innovation; focused on the enterprise, organization, or firm; used rigorous research methods; and were publishable in leading journals.  The reason to go through all this analysis was to prove the usefulness of text analytics for literature reviews.  In 2016, most literature reviews contain recent literature from the last five years, and in certain fields it may not be enough to focus on just the last five years.  Extending the literature search beyond this five-year period requires a great deal of attention and manual labor, which makes an already time-consuming literature review even more so. So, the authors' question is whether it is possible to use text mining to conduct a more thorough review of the body of knowledge, one that extends beyond the typical five years on any subject matter.  They argue that this tedious task could benefit from automation.  However, the result should be thought of as a first pass through the literature; treating it as a first pass allows for the generation of new research questions and ideas, which drives future analysis.

In the end, the study concluded that cost and complexity were two of the most frequent determinants of IT innovation adoption from the perspective of an IT department.  Other determinants for IT departments were the complexity, capability, and relative advantage of the innovation.  However, when going up one level of abstraction to the enterprise/organizational level, the perceived benefits and usefulness were the main determinants of IT innovation, and ease of use of the technology was a big deal for the organization.  When comparing IT innovation with cost, there was a negative correlation between the two, while IT innovation had a positive correlation with organization size and top management support.

How was big data analytics learned, taught, and used in the case study?

The research approach for this study was: (1) document identification and extraction, (2) document classification and coding, (3) document analysis and knowledge discovery (key terms, co-occurrence), and (4) research gap identification.

Analysis of the data consisted of classifying the articles into four time periods (bins): 1977-1979, 1980-1989, 1990-1999, and 2000-2008, and using a classification scheme based on existing taxonomies (case study, content analysis, field experiment, field study, frameworks and conceptual models, interview, laboratory experiment, literature analysis, mathematical model, qualitative research, secondary data, speculation/commentary, and survey).  Data were also classified by functional discipline (information systems and computer science, decision science, management and organization sciences, economics, and innovation) and, finally, by IT innovation type (software, hardware, networking infrastructure, and the tool's IT term catalog). This study used a tool called Northernlight (http://georgiatech.northernlight.com/).

The hope of this study was to use the bag-of-words technique and the proximity of words to other words (or their equivalents) to help extract meaning from a large set of text-based documents.  The bag-of-words technique is known for counting and identifying key terms and phrases, which helps uncover themes.  The simplest way of thinking about the bag-of-words technique is as word frequencies in a document.
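A minimal bag-of-words sketch (standard library only) that counts word frequencies per document; the two sample sentences are invented stand-ins for article abstracts.

from collections import Counter
import re

docs = [
    "Cost and complexity drive IT innovation adoption in the enterprise.",
    "Perceived usefulness and ease of use drive adoption at the organizational level.",
]

def bag_of_words(text):
    # Lowercase and split on non-letter characters to get rough word tokens.
    return Counter(re.findall(r"[a-z]+", text.lower()))

for doc in docs:
    print(bag_of_words(doc).most_common(3))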

However, understanding the meaning behind the themes means studying the context in which the words are located and relating them to other themes, also called co-occurrence of terms.  The best way of doing this meaning extraction is to measure the strength of, and distance between, the themes.  Finally, the researchers can set minimums and maximums that tune the meaning-extraction algorithm to garner insights into IT innovation while reducing the overall noise in the final results. The researchers set the following rules for co-occurrences between themes (a small windowed co-occurrence sketch follows the list):

  • There are approximately 40 words per sentence
  • There are approximately 150 words per paragraph
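Below is a minimal sketch of windowed co-occurrence counting, using the study's approximation of roughly 40 words per sentence as the window size; the tokenization and the sample sentence are invented for illustration.

from collections import Counter
import re

SENTENCE_WINDOW = 40  # ~40 words per sentence (Basole et al., 2013)

def cooccurrences(text, window=SENTENCE_WINDOW):
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, term in enumerate(tokens):
        # Pair each term with the terms that follow it inside the window.
        for other in tokens[i + 1:i + window]:
            if other != term:
                counts[tuple(sorted((term, other)))] += 1
    return counts

sample = "Cost and complexity are frequent determinants of IT innovation adoption."
print(cooccurrences(sample).most_common(5))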

How could this implementation of big data have been improved upon?

Goldbloom (2016) stated that big data techniques (machine learning) work best on large tasks that require classification and break down when the task is too small and specialized, making it better suited to human analysis alone.  This study only looked at 427 articles: is that big enough for such an analysis, or should the search go back through more years than the 30 covered (Basole et al., 2013)?  What is considered big data in 2013 (the time of this study) may not be big data in 2023 (Fox & Do, 2013).

Mei and Zhai (2005) observed how terms and term frequencies evolved over time and graphed them by year, rather than binning the data into four groups as in Basole et al. (2013).  This case study could have shown how cost and complexity in IT innovation changed over time.  Graphing the results in the manner of Mei and Zhai (2005) and Yoon and Song (2014) would also allow for an analysis of whether each IT innovation theme is in an introduction, growth, maturity, or decline phase.
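A minimal sketch, on invented data, of the per-year alternative to binning: tally how often a theme's terms appear in each year's documents, producing a series that could then be plotted over time in the manner of Mei and Zhai (2005).

from collections import defaultdict
import re

corpus = {
    1998: ["Adoption cost is a barrier for the firm."],
    2005: ["Complexity and cost shape enterprise adoption."],
    2008: ["Perceived usefulness drives adoption."],
}

theme_terms = {"cost", "complexity"}
yearly_counts = defaultdict(int)

for year, documents in corpus.items():
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        yearly_counts[year] += sum(1 for token in tokens if token in theme_terms)

for year in sorted(yearly_counts):
    print(year, yearly_counts[year])  # a per-year frequency series, ready to plot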

References

  • Basole, R. C., Seuss, C. D., & Rouse, W. B. (2013). IT innovation adoption by enterprises: Knowledge discovery through text analytics. Decision Support Systems, 54, 1044-1054. Retrieved from http://www.sciencedirect.com.ctu.idm.oclc.org/science/article/pii/S0167923612002849
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Gera, M., & Goel, S. (2015). Data Mining-Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Goldbloom, A. (2016). The jobs we’ll lose to machines –and the ones we won’t. TED. Retrieved from http://www.ted.com/talks/anthony_goldbloom_the_jobs_we_ll_lose_to_machines_and_the_ones_we_won_t
  • Hashimi, H., & Hafez, A. (2015). Selection criteria for text mining approaches. Computers in Human Behavior, 51, 729–733. http://doi.org/10.1016/j.chb.2014.10.062
  • Mei, Q., & Zhai, C. (2005). Discovering evolutionary theme patterns from text: an exploration of temporal text mining. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 198–207. http://doi.org/10.1145/1081870.1081895
  • Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2015). Text-mining of news-headlines for FOREX market prediction: a multi-layer dimension reduction algorithm with semantics and sentiment. Expert Systems with Applications, 42(1), 306-324.
  • Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
  • Yoon, B., & Song, B. (2014). A systematic approach of partner selection for open innovation. Industrial Management & Data Systems, 114(7), 1068.