Adv Topics: The Internet of Things and Web 4.0

The IoT is the explosion of device and sensor data, which is growing the amount of structured data exponentially and creating tremendous opportunities (Jaffe, 2014; Power, 2015). Both Atzori (2010) and Patel (2013) classified Web 4.0 as the symbiotic web, where data interactions occur between humans and smart devices, the internet of things (IoT). These smart devices can be wired to the internet or connected via wireless sensors through enhanced communication protocols (Atzori, 2010). Thus, these smart devices would have read and write concurrency with humans, and the largest potential of Web 4.0 lies in having these smart devices analyze data online and begin to migrate the online world into the real world (Patel, 2013). Besides interacting with the internet and the real world, IoT smart devices would be able to interact with each other (Atzori, 2010). Sakr (2014) stated that this web ecosystem is built on four key items:

  • Data devices where data is gathered from multiple sources that generate the data
  • Data collectors are devices or people that collect data
  • Data aggregation from the IoT, people, Radio Frequency Identification tags, etc.
  • Data users and data buyers are people that derive value out of the data

Some of the areas that could benefit from IoT are: assisted living, e-health, enhanced learning, government, retail, financial, automation, industrial manufacturing, logistics, business/process management, and intelligent transport (Sakr, 2014; Atzori, 2010). Atzori (2010) suggests that there are three different definitions or visions of IoT use, which are based on the device’s orientation:

  • Things-oriented devices, which are designed for status and traceability of objects via RFID or similar technology
  • Internet-oriented devices, which are designed for a light internet protocol where the device is addressable and reachable via the internet
  • Semantic-oriented devices, which aid in reasoning over the data that they generate by exploiting models

An IoT device can fall under one, two, or all three of these definitions or visions of IoT use.

Performance Bottlenecks for IoT

As of 2016, IoT has two main issues if it is left on its own and not tied to anything else (Jaffe, 2014; Newman, 2016):

  • The devices cannot deal with the massive amounts of data generated and collected
  • The devices cannot learn from the data they generate and receive

Thus, artificial intelligence (AI) should be able to store and mine all the data that is gathered from a wide range of sensors to give it meaning and value (Canton, 2016; Jaffe, 2014). AI would bring out the potential of IoT by quickly and naturally collecting, analyzing, organizing, and feeding valuable data to key stakeholders, transforming the field from the standard IoT into the Internet of Learning-Things (IoLT) (Jaffe, 2014; Newman, 2016). However, this would mean a change in the infrastructure of the web to handle IoLT or IoT. Atzori (2010) listed some of the potential performance bottlenecks for IoT on a network level:

  • The vast number of internet-oriented devices will take up the last few IPv4 addresses, so there is a need to move to IPv6 to support all the devices that will come online soon. This is just one version of the indexing problem.
  • Things-oriented and internet-oriented devices could spend some time in sleep mode, which is not typical for current devices using the existing IP networks.
  • When connected to the internet, IoT devices produce smaller packets of data at a higher frequency than current devices.
  • Each device would have to use the same common interface and standard protocols as other devices, which can quickly flood the network and increase the complexity of middleware software layer design.
  • IoT consists of vastly heterogeneous objects, where each device has its own function and its own way of communicating. There is a need to create a level of abstraction to homogenize data transfer and data access through a standard process.
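The last bottleneck, the need for a level of abstraction over heterogeneous devices, can be illustrated with a minimal Python sketch. The two device message formats, the `Reading` record, and the adapter names are all hypothetical and chosen only for illustration; they are not taken from Atzori (2010).

```python
import json
from dataclasses import dataclass


@dataclass
class Reading:
    """One normalized measurement, whatever device produced it."""
    device_id: str
    metric: str
    value: float
    unit: str


def from_json_sensor(raw: str) -> Reading:
    # Hypothetical device A sends JSON such as {"id": "a1", "temp_c": 21.5}
    msg = json.loads(raw)
    return Reading(msg["id"], "temperature", float(msg["temp_c"]), "C")


def from_csv_sensor(raw: str) -> Reading:
    # Hypothetical device B sends one CSV line: id,metric,value,unit
    device_id, metric, value, unit = raw.strip().split(",")
    return Reading(device_id, metric, float(value), unit)


# The abstraction layer: one adapter per device family, one common output type
ADAPTERS = {"json": from_json_sensor, "csv": from_csv_sensor}


def normalize(kind: str, raw: str) -> Reading:
    return ADAPTERS[kind](raw)


readings = [
    normalize("json", '{"id": "a1", "temp_c": 21.5}'),
    normalize("csv", "b7,humidity,40,%"),
]
```

Downstream code only ever sees `Reading` objects, so supporting a new device family would mean adding one adapter rather than changing every consumer.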

Proposed solutions would be to use NoSQL (Not only Structured Query Language) databases to help with the collection, storage, and analysis of IoT data that is heterogeneous, lacks a common interface with standard protocols, and comes in various sizes. This can solve one aspect of the indexing problem of IoT. NoSQL databases store data in non-relational form, e.g., graph, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015).

  • Document stores use a key/value pair that could store data in JSON, BSON, or XML
  • Graph databases use network diagrams to show the relationships between items in a graphical format
  • Column-oriented databases are perfect for sparse datasets, where data is grouped together in columns rather than rows
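To make the three data models concrete, here is a rough sketch that mimics each of them with plain Python structures. It does not use any real NoSQL product; the keys, node names, and column names are made up for illustration.

```python
# Document store: a key maps to a self-describing, JSON-like document
documents = {
    "sensor:42": {
        "type": "thermostat",
        "location": {"floor": 2, "room": "lab"},
        "readings": [21.5, 21.7],
    },
}

# Graph: items are nodes, and labeled edges record how they relate
edges = [
    ("sensor:42", "INSTALLED_IN", "room:lab"),
    ("room:lab", "PART_OF", "building:1"),
]


def neighbors(node):
    """Return (relationship, target) pairs for edges leaving a node."""
    return [(label, dst) for src, label, dst in edges if src == node]


# Column-oriented: values are grouped per column, and null cells are not stored,
# which is why this layout suits sparse data sets
columns = {
    "age": {"p1": 34, "p3": 51},
    "income": {"p1": 52000},
}


def row(row_id):
    """Rebuild one row from whichever columns hold a value for it."""
    return {col: cells[row_id] for col, cells in columns.items() if row_id in cells}
```

For example, `row("p3")` returns only the `age` column, because no income was ever stored for that row.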

Retail currently uses things-oriented RFID for inventory tracking and, when tags are installed on shopping carts, for tracking in-store foot traffic to understand what customers want (Mitchell, n.d.). Mitchell (n.d.) also suggested that video cameras and mobile-device Wi-Fi traffic could help identify whether a customer wanted an item or a group of items by finding hotspots of dwell time, so that store managers can optimize store layouts to improve flow and increase revenue. However, these retailers must account for the added data sources and have the supporting infrastructure in place to avoid performance bottlenecks if they are to reap the rewards of using IoT for data-driven decisions.
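As a simple illustration of finding dwell-time hotspots, the sketch below totals the time shoppers spend in each store zone and ranks the zones. The visit records and zone names are invented; a real system would derive them from RFID, camera, or Wi-Fi data.

```python
from collections import defaultdict

# Hypothetical observations: (shopper_id, zone, second_entered, second_left)
visits = [
    ("c1", "dairy", 0, 40),
    ("c1", "snacks", 40, 160),
    ("c2", "snacks", 10, 100),
    ("c3", "dairy", 5, 25),
]


def dwell_by_zone(visits):
    """Total seconds spent in each zone across all shoppers."""
    totals = defaultdict(int)
    for _shopper, zone, entered, left in visits:
        totals[zone] += left - entered
    return dict(totals)


def hotspots(visits, top_n=1):
    """Zones ranked by total dwell time, longest first."""
    totals = dwell_by_zone(visits)
    return sorted(totals, key=totals.get, reverse=True)[:top_n]
```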

Resources:

  • Atzori, L., Iera, A., & Morabito, G. (2010). The Internet of things: A survey. Computer Networks, 54(15), 2787–2805.

Adv Topics: The architecture of the Internet

Introduction

Kelly (2007) stated that there are 100 billion clicks per day and 55 trillion links. The internet is pervasive in human life. In 2012, 2.27 billion people used the internet; however, globally 1.7 billion people are actively engaged with the internet (Li, 2010; Sakr, 2014). An actively engaged user of the internet would be using social media, where they can watch videos, share status updates, and curate content, with a goal of actively engaging other users (Li, 2010). Cloud-based services rely on the internet to provide services like distributed processing, large-scale data storage, support for distributed transactions, data manipulation, etc. (Brookshear & Brylow, 2014; Sakr, 2014). Thus, the internet plays a vital role in enabling big data analysis and social engagement, despite its humble beginnings as a way of sharing research projects between multiple research institutions in the 1960s (Brookshear & Brylow, 2014).

Internet Architecture

The internet has evolved into a socio-technical system. This evolution has come about in five distinct stages:

  • Web 1.0: Created by Tim Berners-Lee in the 1980s, where it was originally defined as a way of connecting static read-only information hosted across multiple computational components, primarily for companies (Patel, 2013). The internet is a network of computational networks, and communication across it is governed by the TCP/IP protocol (Brookshear & Brylow, 2014). The web’s architecture relies on three components: Uniform Resource Identifier (URI), Hypertext Transfer Protocol (HTTP), and HyperText Markup Language (HTML) (Jacobs & Walsh, 2004).
    • A URI is a unique address that is common across all corners of the web and follows a convention agreed upon by the internet community. It is made up of characters and numerical values that allow an end-user of the internet to locate and retrieve information (Brookshear & Brylow, 2014; Jacobs & Walsh, 2004; Patel, 2013). This unique addressing convention allows the internet to store massive amounts of information without URI collision, which is when two URIs hold the same value and which can confound search engines (Jacobs & Walsh, 2004). An example of a URI could be www.skyhernandez.com.
    • For a computer to locate this URI’s information, which is hosted on a web server, a web browser like Google Chrome, Microsoft Edge, or Firefox uses HTTP to retrieve the information to be displayed by the browser (Brookshear & Brylow, 2014). The web browser would send an HTTP GET request for www.skyhernandez.com via TCP port 80, and as long as the browser is given access to the information stored at the URI, the web server returns an HTTP response that gives the web browser the sought-after information (Jacobs & Walsh, 2004). In a sense, HTTP plays an important role in information access management.
    • Once the web server has sent the HTTP response containing the information stored at www.skyhernandez.com, the web browser must convert the information and display it on the computer screen. HTML is a <tag> based notational code that helps a browser read the information sent by the web server and display it in a simple or rich data format (Brookshear & Brylow, 2014; Patel, 2013). The <html /> tag marks the document as HTML, while the content type and character encoding, which HTTP reports in its headers, tell the browser how to display the information from the web server (Jacobs & Walsh, 2004). The <head /> and <title /> tags are used as metadata about the document, whereas the <body /> tag contains the information of the URI, and <a /> tags help you link to other relevant URIs easily.
  • Web 2.0: Changed the state of the internet from a read-only state to a read/write state and grew communities that hold a common interest (Patel, 2013). This version of the web led to more social interaction, giving people and content importance on the web, due to the introduction of social media tools through web applications (Li, 2010; Patel, 2013; Sakr, 2014). Web applications can include event-driven and object-oriented programming, are designed to handle concurrent activities for multiple users, and have a graphical user interface (Connolly & Begg, 2014; Sandén, 2011). Key technologies include:
    • Weblogs (blogs), video logs (vlogs), and audio logs (podcasts) are all content in various styles that is published chronologically in descending time order, can be tagged with keywords for categorization, and is available when people need it (Li, 2010; Patel, 2013). These logs can be used for fact-finding when content is stored chronologically (Connolly & Begg, 2014).
    • Really Simple Syndication (RSS) is a web and data feed format that summarizes data by producing an XML file in an open standard format (Patel, 2013; Services, 2015). This is regularly used for data collection (Services, 2015).
    • Wikis are editable and expandable by those who have access to the data, and information can be restored or rolled back (Patel, 2013). Wiki editors ensure data quality and encourage participation from the community in providing meaningful content to the community (Li, 2010).
  • Web 3.0: This is the state of the web as of 2017. It involves the semantic web, which is driven by data integration through the use of metadata (Patel, 2013). This version of the web supports a worldwide database with static HTML documents, dynamically rendered data, the next HTML standard (HTML5), and links between documents, with hopes of creating interconnected, interrelated, and openly accessible world data such that tagged micro-content can be easily discovered through search engines (Connolly & Begg, 2014; Patel, 2013). HTML5 can handle multimedia and graphical content and introduces new tags like <section />, <article />, <nav />, and <header />, which are great for semantic content (Connolly & Begg, 2014). Also, end-users are beginning to build dynamic web applications for others to interact with (Patel, 2013). Key technologies include:
    • Extensible Markup Language (XML) is a tag-based metalanguage (Patel, 2013). Its tags are not limited to the tags defined by other people and can be created at the pace of the author rather than waiting for a standards body to approve a tag structure (UK Web Design Company, n.d.).
    • Resource Description Framework (RDF) is based on URI triples <subject, predicate, object>, which help describe data properties and classes (Connolly & Begg, 2014; Patel, 2013). RDF is usually represented at the top of the data set with a @prefix (Connolly & Begg, 2014). For instance,

@prefix: <https://skyhernandez.com> s:Author <https://skyhernandez.wordpress.com/about/>.
<https://skyhernandez.wordpress.com/about/> s:Name "Dr. Skylar Hernandez".
<https://skyhernandez.wordpress.com/about/> s:e-mail "dr.sky.hernandez@gmail.com".
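The <subject, predicate, object> structure of the example above can be mimicked with plain Python tuples. This is only a sketch of the data model, not an RDF parser or library; the `objects` helper is a hypothetical name.

```python
ABOUT = "https://skyhernandez.wordpress.com/about/"

# Each triple is (subject, predicate, object), mirroring the example above
triples = [
    (ABOUT, "s:Name", "Dr. Skylar Hernandez"),
    (ABOUT, "s:e-mail", "dr.sky.hernandez@gmail.com"),
]


def objects(subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

Querying is then a matter of matching on any of the three positions, which is what makes tagged data discoverable.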

  • Web 4.0: It is considered the symbiotic web, where data interactions occur between humans and smart devices, the internet of things (Atzori, 2010; Patel, 2013). These smart devices can be wired to the internet or connected via wireless sensors through enhanced communication protocols (Atzori, 2010). Thus, these smart devices would have read and write concurrency with humans, and the largest potential of Web 4.0 lies in having these smart devices analyze data online and begin to migrate the online world into the real world (Patel, 2013). Besides interacting with the internet and the real world, IoT smart devices would be able to interact with each other (Atzori, 2010). Sakr (2014) stated that this web ecosystem is built on four key items:
    • Data devices where data is gathered from multiple sources that generate the data
    • Data collectors are devices or people that collect data
    • Data aggregation from the IoT, people, RFIDs, etc.
    • Data users and data buyers are people that derive value out of the data
  • Web 5.0: Previous iterations of the web do not perceive people’s emotions, but one day the web could be able to understand a person’s emotions (Patel, 2013). Kelly (2007) predicted that in 5,000 days the internet would become one machine and all other devices would be a window into this machine. In 2007, Kelly stated that this one machine, “the internet,” has the processing capability of one human brain, but in 5,000 days it will have the processing capability of all of humanity.
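The three Web 1.0 components described above (URI, HTTP, and HTML) can be tied together in a short sketch that uses only the Python standard library and makes no network call. The request text shows what a browser might send for a URI, and the HTML string is a made-up page standing in for what a server might return; the function and class names are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def build_get_request(uri: str) -> str:
    """Compose the text of a minimal HTTP/1.1 GET request for a URI."""
    parts = urlsplit(uri)
    path = parts.path or "/"
    return f"GET {path} HTTP/1.1\r\nHost: {parts.netloc}\r\nConnection: close\r\n\r\n"


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every <a href> found in a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Made-up page body, used here instead of a real server reply
sample_html = (
    "<html><head><title>Example</title></head>"
    '<body><a href="https://example.org/about">About</a></body></html>'
)

request = build_get_request("http://www.skyhernandez.com")
parser = TitleAndLinks()
parser.feed(sample_html)
```

After `feed`, `parser.title` holds the document metadata from the head and `parser.links` holds the URIs that the page links to.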

Performance Bottlenecks & Root Causes

As web technology evolves, there is a trend of “performance bottlenecks” in accessing large-scale web data. A bottleneck occurs when the flow of information is slowed down or stopped entirely, such that it can cause a bad end-user experience (TechTarget, 2007; Thomas, 2012). Performance bottlenecks can cause an application to perform poorly relative to expectations (Thomas, 2012). As the internet evolved, new performance bottlenecks began to appear:

  • Web 1.0: When two URIs hold the same value, it can confound search engines, hence the move from IPv4 to IPv6 (Jacobs & Walsh, 2004). Transfer protocols rely on network devices (network interface cards, firewalls, cables, load balancers, routers, etc.) and on security settings, all of which affect the flow of information (bandwidth) (Jacobs & Walsh, 2004; Thomas, 2012). Finally, HTTP information is pulled from the web server, so low-capacity computers, broken links, tight security, and poor configurations can result in HTTP errors (4xx, 5xx), lots of open connections, memory leaks, lengthy queues, extensive data scans, database deadlock, etc. (Thomas, 2012).
  • Web 2.0: Database demands on the web application grow because there are more read and write transactions (Sakr, 2014). These come on top of all the performance bottlenecks from Web 1.0.
  • Web 3.0: Searching for information in the data is tough and time-consuming without a computer processing application (UK Web Design Company, n.d.). Data is tied to logic and to a language like HTML, without a ready-made browser to simply explore the data, and therefore may require HTML or other software to process the data (Brewton, Yuan, & Akowuah, 2012; UK Web Design Company, n.d.). Syntax and tags are redundant, which can consume huge amounts of bytes and slow down processing speeds (Hiroshi, 2007).
  • Web 4.0: IoT creates performance bottlenecks, and it also has its own issues if it is left on its own and not tied to anything else (Jaffe, 2014; Newman, 2016): (a) the devices cannot deal with the massive amounts of data generated and collected; and (b) the devices cannot learn from the data they generate and collect. Finally, there are also potential infrastructure performance bottlenecks for IoT (Atzori, 2010): (a) the huge number of internet-oriented devices that will take up the last few IPv4 addresses; (b) things-oriented and internet-oriented devices could spend some time in sleep mode, which is not typical for current devices using the current IP networks; (c) IoT devices, when connected to the internet, produce smaller packets of data at a higher frequency than current devices; (d) each device would have to use the same common interface and standard protocols as other devices, which can easily flood the network and increase the complexity of middleware software layer design; and (e) IoT consists of vastly heterogeneous objects, where each device has its own function and its own way of communicating.
  • Web 5.0: Given the assumptions about what this version of the web would become, a possible performance bottleneck would be the amount of resources consumed to keep the web operating.

Overall, Kelly (2007) stated that there are 100 billion clicks per day, 55 trillion links, 2 million emails per second, 1 million IM messages, etc. At this scale, big data will impact the performance of the web. Big data will primarily impact the web server: because of the increasing size of information and a potential lack of bandwidth, big data can slow down throughput performance (Thomas, 2012).

High-level strategies to mitigate

The web should evolve with time to keep up with the demands and needs of society, and it is predicted to change significantly. What the web evolves into is yet to be seen, but it will be quite unique. However, with multiple heterogeneous types of devices (IoT) trying to connect to the web, a standardized protocol and interface to the web should be adopted (Atzori, 2010). A move from IPv4 to IPv6 must happen to accommodate the massive number of IoT devices that are expected to generate data and connect to the web, and thus to seize this opportunity to gain more data.
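The scale of the difference between the two address spaces can be checked with Python's standard `ipaddress` module: IPv4 addresses are 32 bits wide and IPv6 addresses are 128 bits wide. The variable names are illustrative.

```python
import ipaddress

# Size of each complete address space
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses  # 2**32
ipv6_total = ipaddress.ip_network("::/0").num_addresses       # 2**128

# How many IPv6 addresses exist for every single IPv4 address
ratio = ipv6_total // ipv4_total  # 2**96

print(f"IPv4 addresses: {ipv4_total:,}")
print(f"IPv6 addresses: {ipv6_total:.3e}")
```

With roughly 4.3 billion IPv4 addresses in total, the IPv4 space cannot give every sensor its own address, which is the indexing problem that IPv6 is meant to relieve.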

Given Kelly’s (2007) centralized view of the web system, the information and data are stored distributively, and better algorithms are needed to connect relevant data and information more efficiently. Data storage is cheap, but at the rate of data creation by IoT and other sources, processing speeds through parallel and distributed processing must increase to take advantage of this explosion of big data on the web. The gap arises because data collection is outpacing the speed of data processing and analysis. Given this gap, a data scientist should prioritize which subset of the data is valuable enough to analyze to solve certain problems. This is a useful workaround; however, it at times ignores the full data set that the system generated. It is not enough to analyze only what is deemed to be the valuable parts of the data, because another part of the data may reveal more insight (the whole is greater than the sum of its parts argument).
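The idea of processing distributed data in parallel and then combining partial results can be sketched as follows. This is a toy example under stated assumptions: the partitions are small in-memory lists standing in for data stored on different nodes, and a thread pool stands in for a cluster. For CPU-bound work in Python, a process pool or a distributed framework would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(partition):
    """Per-partition work: return (count, sum) so results can be merged."""
    return len(partition), sum(partition)


def parallel_mean(partitions, workers=4):
    """Compute a global mean from per-partition partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize, partitions))
    count = sum(n for n, _ in partials)
    total = sum(s for _, s in partials)
    return total / count


# Hypothetical partitions of one data set
partitions = [[1, 2, 3], [4, 5], [6]]
```

Because each worker returns a count and a sum rather than a local mean, the combined result uses the full data set and not just a subset of it.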

NoSQL database types and extensible markup languages have enhanced how information and data are related to each other, but standardization and perhaps an automated 1:1 mapping of the same data being represented in different NoSQL databases may be needed to gain further insights faster. Web application developers should also be more resource conscious, trying to get more computational results for fewer resources.

Resources:

  • Atzori, L., Iera, A., & Morabito, G. (2010). The Internet of things: A survey. Computer Networks, 54(15), 2787–2805.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Sandén, B. I. (2011). Design of Multithreaded Software: The Entity-Life Modeling Approach. Wiley-Blackwell. VitalBook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons P&T. VitalBook file.

Compelling topics on analytics of big data

  • Big data is defined as high volume, high variety/complexity, and high velocity, which is known as the 3Vs (Services, 2015).
  • The goal and objectives of the problem should help define which theories and techniques of big data analytics to use. Fayyad, Piatetsky-Shapiro, and Smyth (1996) divided data analytics into descriptive and predictive analytics. Vardarlier and Silahtaroglu (2016) agreed with Fayyad et al.’s (1996) division but added prescriptive analytics. Thus, these three divisions of big data analytics are:
    • Descriptive analytics explains “What happened?”
    • Predictive analytics explains “What will happen?”
    • Prescriptive analytics explains “Why will it happen?”
  • The scientific method helps give a framework for the data analytics lifecycle (Dietrich, 2013; Services, 2015). According to Dietrich (2013), it is a cyclical life cycle that has iterative parts in each of its six steps: discovery; pre-processing data; model planning; model building; communicating results; and operationalizing.
  • Data-in-motion is the real-time streaming of data from a broad spectrum of technologies, which also encompasses the data transmission between systems (Katal, Wazid, & Goudar, 2013; Kishore & Sharma, 2016; Ovum, 2016; Ramachandran & Chang, 2016). Data that is stored on a database system or cloud system is considered as data-at-rest and data that is being processed and analyzed is considered as data-in-use (Ramachandran & Chang, 2016).  The analysis of real-time streaming data in a timely fashion is also known as stream reasoning and implementing solutions for stream reasoning revolve around high throughput systems and storage space with low latency (Della Valle et al., 2016).
  • Data brokers are tasked with collecting data from people, building a particular type of profile on each person, and selling it to companies (Angwin, 2014; Beckett, 2014; Tsesis, 2014). The data brokers’ main mission is to collect data and break down the barriers of geographic location, cognitive or cultural gaps, different professions, or parties that don’t trust each other (Long, Cunningham, & Braithwaite, 2013). The danger is that collecting this data from people can raise the incidence of discrimination based on race or income, directly or indirectly (Beckett, 2014).
  • Data auditing is assessing the quality and fit for the purpose of data via key metrics and properties of the data (Techopedia, n.d.). Data auditing processes and procedures are the business’ way of assessing and controlling their data quality (Eichhorn, 2014).
  • If following an agile development process, the key stakeholders should be involved in all phases of the lifecycle. The key stakeholders are the business user, project sponsor, project manager, business intelligence analyst, database administrator, data engineer, and data scientist (Services, 2015).
  • Lawyers define privacy in terms of (Richards & King, 2014): invasions into protected spaces, relationships, or decisions; the collection of information; the use of information; and the disclosure of information.
  • Richards and King (2014) describe that a binary notion of data privacy does not exist. Data is never completely private/confidential nor completely divulged; it lies in between these two extremes. Privacy laws should focus on the flow of personal information, where an emphasis should be placed on a type of privacy called confidentiality, where data is agreed to flow to a certain individual or group of individuals (Richards & King, 2014).
  • Fraud is deception; fraud detection is needed because, even as fraud detection algorithms improve, the rate of fraud is increasing (Minelli, Chambers, & Dhiraj, 2013). Data mining has allowed for fraud detection via multi-attribute monitoring, where it tries to find hidden anomalies by identifying hidden patterns through the use of class description and class discrimination (Brookshear & Brylow, 2014; Minelli et al., 2013).
  • High-performance computing is where there is either a cluster or grid of servers or virtual machines that are connected by a network for a distributed storage and workflow (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli et al., 2013).
  • Parallel computing environments draw on the distributed storage and workflow on the cluster and grid of servers or virtual machines for processing big data (Bhokare et al., 2016; Minelli et al., 2013).
  • NoSQL (Not only Structured Query Language) databases store data in non-relational form, e.g., graph, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). NoSQL databases have benefits, as they provide a data model for applications that requires less code and less debugging, runs on clusters, handles large-scale data, and evolves with time (Sadalage & Fowler, 2012).
    • Document store NoSQL databases use a key/value pair in which the value is the file (document) itself, and it could be in JSON, BSON, or XML (Sadalage & Fowler, 2012; Services, 2015). These document files are hierarchical trees (Sadalage & Fowler, 2012). Some sample document databases are MongoDB and CouchDB.
    • Graph NoSQL databases are used for drawing networks by showing the relationships between items in a graphical format that has been optimized for easy searching and editing (Services, 2015). Each item is considered a node, and adding more nodes or relationships while traversing through them is made simpler through a graph database rather than a traditional database (Sadalage & Fowler, 2012). Some sample graph databases are Neo4j, Pregel, etc. (Park et al., 2014).
    • Column-oriented databases are perfect for sparse datasets, ones with many null values, and when columns do have data, the related columns are grouped together (Services, 2015). Grouping demographic data like age, income, gender, marital status, sexual orientation, etc. is a great example of a use for this NoSQL database. Cassandra is an example of a column-oriented database.
  • Public cloud environments are where a supplier to a company provides a cluster or grid of servers through the internet like Spark AWS, EC2 (Connolly & Begg, 2014; Minelli et al. 2013).
  • A community cloud environment is a cloud that is shared exclusively by a set of companies that share similar characteristics, compliance, security, jurisdiction, etc. (Connolly & Begg, 2014).
  • Private cloud environments have a similar infrastructure to a public cloud, but the infrastructure holds the data of one company exclusively, and its services are shared across the different business units of that one company (Connolly & Begg, 2014; Minelli et al., 2013).
  • Hybrid clouds are two or more cloud structures that have either a private, community or public aspect to them (Connolly & Begg, 2014).
  • Cloud computing allows the company to purchase the services it needs, without having to purchase the infrastructure to support the services it thinks it might need. This allows for hyper-scaling computing in a distributed environment, also known as hyper-scale cloud computing, where the volume and demand of data explode exponentially yet can still be accommodated cost-efficiently in a public, community, private, or hybrid cloud (Mainstay, 2016; Minelli et al., 2013).
  • The building block system of big data analytics involves a few steps (Burkle et al., 2001):
    • What is the purpose that the new data will and should serve
      • How many functions should it support
      • Marking which parts of that new data are needed for each function
    • Identify the tool needed to support the purpose of that new data
    • Create a top level architecture plan view
    • Building based on the plan but leaving room to pivot when needed
      • Modifications occur to allow for the final vision to be achieved given the conditions at the time of building the architecture.
      • Other modifications come under a closer inspection of certain components in the architecture
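To make the data auditing item above more concrete, here is a small sketch of two quality metrics an audit might compute: completeness of a field and duplicated key values. The records, field names, and choice of metrics are hypothetical; an actual audit would use whatever metrics the business has defined.

```python
# Hypothetical records with one missing age, one missing email, and a repeated id
records = [
    {"id": 1, "age": 34, "email": "a@example.org"},
    {"id": 2, "age": None, "email": "b@example.org"},
    {"id": 2, "age": 51, "email": None},
]


def completeness(records, field):
    """Fraction of records in which the field has a non-null value."""
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)


def duplicate_keys(records, key):
    """Key values that appear in more than one record."""
    seen, dupes = set(), set()
    for r in records:
        value = r[key]
        if value in seen:
            dupes.add(value)
        else:
            seen.add(value)
    return dupes
```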

 

References

  • Angwin, J. (2014). Privacy tools: Opting out from data brokers. Pro Publica. Retrieved from https://www.propublica.org/article/privacy-tools-opting-out-from-data-brokers
  • Beckett, L. (2014). Everything we know about what data brokers know about you. Pro Publica. Retrieved from https://www.propublica.org/article/everything-we-know-about-what-data-brokers-know-about-you
  • Bhokare, P., Bhagwat, P., Bhise, P., Lalwani, V., & Mahajan, M. R. (2016). Private Cloud using GlusterFS and Docker. International Journal of Engineering Science, 5016.
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, (12th). Pearson Learning Solutions. VitalBook file.
  • Burkle, T., Hain, T., Hossain, H., Dudeck, J., & Domann, E. (2001). Bioinformatics in medical practice: what is necessary for a hospital?. Studies in health technology and informatics, (2), 951-955.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th). Pearson Learning Solutions. [Bookshelf Online].
  • Della Valle, E., Dell’Aglio, D., & Margara, A. (2016). Tutorial: Taming velocity and variety simultaneous big data and stream reasoning. Retrieved from https://pdfs.semanticscholar.org/1fdf/4d05ebb51193088afc7b63cf002f01325a90.pdf
  • Dietrich, D. (2013). The genesis of EMC’s data analytics lifecycle. Retrieved from https://infocus.emc.com/david_dietrich/the-genesis-of-emcs-data-analytics-lifecycle/
  • Eichhorn, G. (2014). Why exactly is data auditing important? Retrieved from http://www.realisedatasystems.com/why-exactly-is-data-auditing-important/
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37. Retrieved from: http://www.aaai.org/ojs/index.php/aimagazine/article/download/1230/1131/
  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
  • Kishore, N. & Sharma, S. (2016). Secure data migration from enterprise to cloud storage – analytical survey. BIJIT-BVICAM’s International Journal of Information Technology. Retrieved from http://bvicam.ac.in/bijit/downloads/pdf/issue15/09.pdf
  • Long, J. C., Cunningham, F. C., & Braithwaite, J. (2013). Bridges, brokers and boundary spanners in collaborative networks: a systematic review. BMC Health Services Research, 13(1), 158.
  • Mainstay (2016). An economic study of the hyper-scale data center. Mainstay, LLC, Castle Rock, CO, USA. Retrieved from http://cloudpages.ericsson.com/transforming-the-economics-of-data-center
  • Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. [Bookshelf Online].
  • Ovum (2016). 2017 Trends to watch: Big Data. Retrieved from http://info.ovum.com/uploads/files/2017_Trends_to_Watch_Big_Data.pdf
  • Park, Y., Shankar, M., Park, B. H., & Ghosh, J. (2014, March). Graph databases for large-scale healthcare systems: A framework for efficient data management and data services. In Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on (pp. 12-19). IEEE.
  • Ramachandran, M. & Chang, V. (2016). Toward validating cloud service providers using business process modeling and simulation. Retrieved from http://eprints.soton.ac.uk/390478/1/cloud_security_bpmn1%20paper%20_accepted.pdf
  • Richards, N. M., & King, J. H. (2014). Big Data Ethics. Wake Forest Law Review, 49, 393–432.
  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, (1st). [Bookshelf Online].
  • Techopedia (n.d.). Data audit. Retrieved from https://www.techopedia.com/definition/28032/data-audit
  • Tsesis, A. (2014). The right to erasure: Privacy, data brokers, and the indefinite retention of data. Wake Forest L. Rev., 49, 433.
  • Vardarlier, P., & Silahtaroglu, G. (2016). Gossip management at universities using big data warehouse model integrated with a decision support system. International Journal of Research in Business and Social Science, 5(1), 1–14. Doi: http://doi.org/10.1108/17506200710779521

Modeling and analyzing big data in health care

Let’s apply the building blocks system for healthcare systems to a concrete healthcare problem: monitoring patient vital signs, similar to Chen et al. (2010).

  • The purpose that the new data will serve: Most hospitals measure the following vitals for triaging patients: blood pressure and flow, core temperature, ECG, and carbon dioxide concentration (Chen et al., 2010).
    1. Functions it should serve: gathering, storing, preprocessing, and processing the data. Chen et al. (2010) suggested that the system should also perform a consistency check and aggregate and integrate the data.
    2. Which parts of the data are needed to serve these functions: all
  • Tools needed: a distributed database system, a wireless network, parallel processing, a graphical user interface that helps healthcare providers understand the data, servers, subject matter experts to set upper and lower limits, and classification algorithms that use machine learning
  • Top level plan: The data will be collected from the vital sign sensors, streaming at various time intervals into a central hub that sends the data in packets over a wireless network to a server room. The servers can divide the data among various distributed systems accordingly. A parallel processing program will access the data per patient per window of time to run the needed functions and classifications, and will raise triage warnings if the vitals hit any of the predetermined key performance indicators, set by the subject matter experts, that require intervention. If a key performance indicator is triggered, the data is sent to the healthcare provider’s device and shown via a graphical user interface.
  • Pivoting is bound to happen; for example:
    1. Graphical user interface is not healthcare provider friendly
    2. Some of the sensors need to be able to throw a warning if they are going bad
    3. Subject matter experts may need to readjust the classification algorithm for better triaging
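
The per-patient, per-window check in the top level plan can be sketched in a few lines of Python. This is only a minimal, single-process illustration under stated assumptions, not the design of Chen et al. (2010): the names Reading, LIMITS, SAMPLE_READINGS, and triage_warnings are hypothetical, the limits are placeholder numbers rather than clinical guidance, and the sample readings are synthetic. In practice the subject matter experts would supply the limits, and each (patient, vital, window) group could be handed to a separate parallel worker because the groups are independent.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Reading:
    """One sensor sample (hypothetical record layout)."""
    patient_id: str
    vital: str
    timestamp: float  # seconds since monitoring started
    value: float


# Placeholder (lower, upper) limits for illustration only, NOT clinical guidance.
LIMITS = {
    "core_temp_c": (35.0, 38.5),
    "systolic_bp_mmhg": (90.0, 180.0),
}

# Synthetic sample data for illustration.
SAMPLE_READINGS = [
    Reading("p1", "core_temp_c", 0.0, 39.0),
    Reading("p1", "core_temp_c", 30.0, 39.5),
    Reading("p1", "core_temp_c", 70.0, 37.0),
    Reading("p2", "systolic_bp_mmhg", 10.0, 120.0),
]


def triage_warnings(readings, window_s=60.0, limits=LIMITS):
    """Average each vital per patient per time window and flag
    the windows whose average falls outside the configured limits."""
    buckets = {}
    for r in readings:
        if r.vital not in limits:
            continue  # no limits defined for this vital, so nothing to check
        key = (r.patient_id, r.vital, int(r.timestamp // window_s))
        buckets.setdefault(key, []).append(r.value)

    warnings = []
    for (patient_id, vital, window), values in sorted(buckets.items()):
        low, high = limits[vital]
        avg = mean(values)
        if avg < low or avg > high:
            warnings.append((patient_id, vital, window, avg))
    return warnings
```

Calling triage_warnings(SAMPLE_READINGS) flags only the first one-minute window of patient p1, whose average core temperature exceeds the placeholder upper limit; each returned tuple is what would be forwarded to the healthcare provider’s device.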

Thus, the above problem, as discussed by Chen et al. (2010), can be broken apart into its building block components as addressed in Burkle et al. (2001). These components help to create a system to analyze this set of big health care data through analytics, via distributed systems and parallel processing as addressed by Services (2015) and Mirtaheri et al. (2008).

Draw on a large body of data to form a prediction or variable comparisons within the premise of big data.

Fayyad, Piatetsky-Shapiro, and Smyth (1996) divided data analytics into descriptive and predictive analytics. Vardarlier and Silahtaroglu (2016) agreed with this division but added prescriptive analytics. Which division to apply depends on the goal, for example diagnosing illnesses with big data analytics. Raghupathi and Raghupathi (2014) listed some common examples of big data in the healthcare field: personal medical records, radiology images, clinical trial data, 3D imaging, human genomic data, population genomic data, biometric sensor readings, x-ray films, scripts, and traditional paper files. One application is using big data analytics to understand the 23 pairs of chromosomes that are the building blocks for people: healthcare professionals are using the big data generated from our genomic code to help predict which illnesses a person could get (Services, 2015). For this kind of prediction, predictive analytics tools and algorithms like decision trees would be of some use. Predictive analytics and machine learning can also be applied to diagnosing an eye disease like diabetic retinopathy from an image by using classification algorithms (Goldbloom, 2016).

Examine the unique domain of health informatics and explain how big data analytics contributes to the detection of fraud and the diagnosis of illness.

A process-mining framework for the detection of healthcare fraud and abuse, case study (Yang & Hwang, 2006): Fraud exists in the processing of health insurance claims because the many channels of communication among service providers, insurance agencies, and patients create opportunities to commit it. Any one of these three parties can commit fraud, and the highest chance of fraud occurs where service providers perform unnecessary procedures, putting patients at risk. This case study provided a framework for conducting automated fraud detection. The study collected data on 2543 gynecology patients from a hospital from 2001-2002, filtered out noisy data, identified activities based on medical expertise, and identified fraud in about 906 cases.

Summarize one case study in detail related to big data analytics as it relates to organizational processes and topical research.

A case study on the use of Spark in the healthcare field (Pita et al., 2015): Data quality in healthcare data is poor, in particular that of the Brazilian Public Health System. Spark was used for data processing to improve quality through deterministic and probabilistic record linking across multiple databases. Record linking is a technique that uses common attributes across multiple databases to identify a 1-to-1 match. Spark workflows were created to do record linking by (1) analyzing all data in each database to find common attributes with high probabilities of linkage; (2) pre-processing the data, where it is transformed, anonymized, and cleaned into a single format so that all the attributes can be compared to each other for a 1-to-1 match; (3) record linking based on deterministic and probabilistic algorithms; and (4) statistical analysis to evaluate the accuracy. Over 397M comparisons were made in 12 hours. They concluded that accuracy depends on the size of the data: the bigger the data, the more accurate the record linking.
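
To make steps (2) and (3) more concrete, here is a small plain-Python sketch of record linking with a deterministic pass followed by a probabilistic pass. It is an illustration under assumptions and not the Spark workflow from the case study: the function names (normalize, bigrams, dice, link), the field names, the 0.85 threshold, and the sample records are all hypothetical, and the similarity score used here (a Dice coefficient over character bigrams, averaged across fields) is a stand-in chosen for brevity. Anonymization is left out entirely.

```python
def normalize(record):
    """Pre-processing: lower-case and collapse whitespace so both sources share one format."""
    return {k: " ".join(str(v).lower().split()) for k, v in record.items()}


def bigrams(text):
    """Set of adjacent character pairs in the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice(a, b):
    """Dice coefficient over character bigrams (1.0 means identical bigram sets)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def link(left, right, fields, threshold=0.85):
    """Return 1-to-1 links as (left_index, right_index, method, score)."""
    left = [normalize(r) for r in left]
    right = [normalize(r) for r in right]
    links, used_right = [], set()

    # Pass 1, deterministic: every compared field must match exactly.
    index = {}
    for j, r in enumerate(right):
        index.setdefault(tuple(r[f] for f in fields), []).append(j)
    for i, l in enumerate(left):
        for j in index.get(tuple(l[f] for f in fields), []):
            if j not in used_right:
                used_right.add(j)
                links.append((i, j, "deterministic", 1.0))
                break
    matched_left = {i for i, *_ in links}

    # Pass 2, probabilistic: best average field similarity at or above the threshold.
    for i, l in enumerate(left):
        if i in matched_left:
            continue
        best_j, best_score = None, threshold
        for j, r in enumerate(right):
            if j in used_right:
                continue
            score = sum(dice(l[f], r[f]) for f in fields) / len(fields)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used_right.add(best_j)
            links.append((i, best_j, "probabilistic", best_score))
    return links


# Synthetic records for illustration only.
DB_A = [
    {"name": "Maria Silva", "birth": "1980-02-01", "city": "Salvador"},
    {"name": "Joao Souza", "birth": "1975-07-30", "city": "Recife"},
    {"name": "Ana Lima", "birth": "1990-01-01", "city": "Natal"},
]
DB_B = [
    {"name": "MARIA  SILVA ", "birth": "1980-02-01", "city": "salvador"},
    {"name": "Joao Sousa", "birth": "1975-07-30", "city": "Recife"},
    {"name": "Pedro Costa", "birth": "1962-11-12", "city": "Manaus"},
]
```

With these synthetic records, link(DB_A, DB_B, ["name", "birth", "city"]) links the first pair deterministically once formatting differences are cleaned away, links the second pair probabilistically despite the one-letter spelling difference in the surname, and leaves the third record unlinked. The nested loop in the probabilistic pass compares every remaining pair, which is the part of the workload a distributed engine such as Spark would spread across workers.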

References

  • Burkle, T., Hain, T., Hossain, H., Dudeck, J., & Domann, E. (2001). Bioinformatics in medical practice: What is necessary for a hospital?. Studies in health technology and informatics, (2), 951-955.
  • Chen, B., Varkey, J. P., Pompili, D., Li, J. K., & Marsic, I. (2010). Patient vital signs monitoring using wireless body area networks. In Bioengineering Conference, Proceedings of the 2010 IEEE 36th Annual Northeast (pp. 1-2). IEEE.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37. Retrieved from: http://www.aaai.org/ojs/index.php/aimagazine/article/download/1230/1131/
  • Goldbloom, A. (2016). The jobs we’ll lose to machines – and the ones we won’t. TED Talks. Retrieved from https://www.youtube.com/watch?v=gWmRkYsLzB4
  • Mirtaheri, S. L., Khaneghah, E. M., Sharifi, M., & Azgomi, M. A. (2008). The influence of efficient message passing mechanisms on high performance distributed scientific computing. In Parallel and Distributed Processing with Applications, 2008. ISPA’08. International Symposium on (pp. 663-668). IEEE.
  • Pita, R., Pinto, C., Melo, P., Silva, M., Barreto, M., & Rasella, D. (2015). A Spark-based Workflow for Probabilistic Record Linkage of Healthcare Data. In EDBT/ICDT Workshops (pp. 17-26).
  • Raghupathi, W., & Raghupathi, V. (2014). Big Data Analytics in healthcare: promise and potential. Health Information Science and Systems, 2(3). Retrieved from http://hissjournal.biomedcentral.com/articles/10.1186/2047-2501-2-3
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].
  • Vardarlier, P., & Silahtaroglu, G. (2016). Gossip management at universities using big data warehouse model integrated with a decision support system. International Journal of Research in Business and Social Science, 5(1), 1–14. Doi: http://doi.org/10.1108/17506200710779521
  • Yang, W. S., & Hwang, S. Y. (2006). A process-mining framework for the detection of healthcare fraud and abuse. Expert Systems with Applications, 31(1), 56-68.