Compelling topics on analytics of big data

  • Big data is defined as high volume, high variety/complexity, and high velocity, which is known as the 3Vs (Services, 2015).
  • Depending on the goal and objectives of the problem, that should help define which theories and techniques of big data analytics to use. Fayyad, Piatetsky-Shapiro, and Smyth (1996) defined that data analytics can be divided into descriptive and predictive analytics. Vardarlier and Silaharoglu (2016) agreed with Fayyad et al. (1996) division but added prescriptive analytics. Thus, these three divisions of big data analytics are:
    • Descriptive analytics explains “What happened?”
    • Predictive analytics explains “What will happen?”
    • Prescriptive analytics explains “Why will it happen?”
  • The scientific method helps give a framework for the data analytics lifecycle (Dietrich, 2013; Services, 2015). According to Dietrich (2013), it is a cyclical life cycle that has iterative parts in each of its six steps: discovery; pre-processing data; model planning; model building; communicate results, and
  • Data-in-motion is the real-time streaming of data from a broad spectrum of technologies, which also encompasses the data transmission between systems (Katal, Wazid, & Goudar, 2013; Kishore & Sharma, 2016; Ovum, 2016; Ramachandran & Chang, 2016). Data that is stored on a database system or cloud system is considered as data-at-rest and data that is being processed and analyzed is considered as data-in-use (Ramachandran & Chang, 2016).  The analysis of real-time streaming data in a timely fashion is also known as stream reasoning and implementing solutions for stream reasoning revolve around high throughput systems and storage space with low latency (Della Valle et al., 2016).
  • Data brokers are tasked collecting data from people, building a particular type of profile on that person, and selling it to companies (Angwin, 2014; Beckett, 2014; Tsesis, 2014). The data brokers main mission is to collect data and drop down the barriers of geographic location, cognitive or cultural gaps, different professions, or parties that don’t trust each other (Long, Cunningham, & Braithwaite, 2013). The danger of collecting this data from people can raise the incidents of discrimination based on race or income directly or indirectly (Beckett, 2014).
  • Data auditing is assessing the quality and fit for the purpose of data via key metrics and properties of the data (Techopedia, n.d.). Data auditing processes and procedures are the business’ way of assessing and controlling their data quality (Eichhorn, 2014).
  • If following an agile development processes the key stakeholders should be involved in all the lifecycles. That is because the key stakeholders are known as business user, project sponsor, project manager, business intelligence analyst, database administers, data engineer, and data scientist (Services, 2015).
  • Lawyers define privacy as (Richard & King, 2014): invasions into protecting spaces, relationships or decisions, a collection of information, use of information, and disclosure of information.
  • Richard and King (2014), describe that a binary notion of data privacy does not Data is never completely private/confidential nor completely divulged, but data lies in-between these two extremes.  Privacy laws should focus on the flow of personal information, where an emphasis should be placed on a type of privacy called confidentiality, where data is agreed to flow to a certain individual or group of individuals (Richard & King, 2014).
  • Fraud is deception; fraud detection is needed because as fraud detection algorithms are improving, the rate of fraud is increasing (Minelli, Chambers, &, Dhiraj, 2013). Data mining has allowed for fraud detection via multi-attribute monitoring, where it tries to find hidden anomalies by identifying hidden patterns through the use of class description and class discrimination (Brookshear & Brylow, 2014; Minellli et al., 2013).
  • High-performance computing is where there is either a cluster or grid of servers or virtual machines that are connected by a network for a distributed storage and workflow (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli et al., 2013).
  • Parallel computing environments draw on the distributed storage and workflow on the cluster and grid of servers or virtual machines for processing big data (Bhokare et al., 2016; Minelli et al., 2013).
  • NoSQL (Not only Structured Query Language) databases are databases that are used to store data in non-relational databases i.e. graphical, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). NoSQL databases have benefits as they provide a data model for applications that require a little code, less debugging, run on clusters, handle large scale data and evolve with time (Sadalage & Fowler, 2012).
    • Document store NoSQL databases, use a key/value pair that is the file/file itself, and it could be in JSON, BSON, or XML (Sadalage & Fowler, 2012; Services, 2015). These document files are hierarchical trees (Sadalage & Fowler, 2012). Some sample document databases consist of MongoDB and CouchDB.
    • Graph NoSQL databases are used drawing networks by showing the relationship between items in a graphical format that has been optimized for easy searching and editing (Services, 2015). Each item is considered a node and adding more nodes or relationships while traversing through them is made simpler through a graph database rather than a traditional database (Sadalage & Fowler, 2012). Some sample graph databases consist of Neo4j Pregel, etc. (Park et al., 2014).
    • Column-oriented databases are perfect for sparse datasets, ones with many null values and when columns do have data the related columns are grouped together (Services, 2015). Grouping demographic data like age, income, gender, marital status, sexual orientation, etc. are a great example for using this NoSQL database. Cassandra is an example of a column-oriented database.
  • Public cloud environments are where a supplier to a company provides a cluster or grid of servers through the internet like Spark AWS, EC2 (Connolly & Begg, 2014; Minelli et al. 2013).
  • A community cloud environment is a cloud that is shared exclusively by a set of companies that share the similar characteristics, compliance, security, jurisdiction, etc. (Connolly & Begg, 2014).
  • Private cloud environments have a similar infrastructure to a public cloud, but the infrastructure only holds the data one company exclusively, and its services are shared across the different business units of that one company (Connolly & Begg, 2014; Minelli et al., 2013).
  • Hybrid clouds are two or more cloud structures that have either a private, community or public aspect to them (Connolly & Begg, 2014).
  • Cloud computing allows for the company to purchase the services it needs, without having to purchase the infrastructure to support the services it might think it will need. This allows for hyper-scaling computing in a distributed environment, also known as hyper-scale cloud computing, where the volume and demand of data explode exponentially yet still be accommodated in public, community, private, or hybrid cloud in a cost efficiently (Mainstay, 2016; Minelli et al., 2013).
  • Building block system of big data analytics involves a few steps Burkle et al. (2001):
    • What is the purpose that the new data will and should serve
      • How many functions should it support
      • Marking which parts of that new data is needed for each function
    • Identify the tool needed to support the purpose of that new data
    • Create a top level architecture plan view
    • Building based on the plan but leaving room to pivot when needed
      • Modifications occur to allow for the final vision to be achieved given the conditions at the time of building the architecture.
      • Other modifications come under a closer inspection of certain components in the architecture

 

References

  • Angwin, J. (2014). Privacy tools: Opting out from data brokers. Pro Publica. Retrieved from https://www.propublica.org/article/privacy-tools-opting-out-from-data-brokers
  • Beckett, L. (2014). Everything we know about what data brokers know about you. Pro Publica. Retrieved from https://www.propublica.org/article/everything-we-know-about-what-data-brokers-know-about-you
  • Bhokare, P., Bhagwat, P., Bhise, P., Lalwani, V., & Mahajan, M. R. (2016). Private Cloud using GlusterFS and Docker.International Journal of Engineering Science5016.
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, (12th). Pearson Learning Solutions. VitalBook file.
  • Burkle, T., Hain, T., Hossain, H., Dudeck, J., & Domann, E. (2001). Bioinformatics in medical practice: what is necessary for a hospital?. Studies in health technology and informatics, (2), 951-955.
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th). Pearson Learning Solutions. [Bookshelf Online].
  • Della Valle, E., Dell’Aglio, D., & Margara, A. (2016). Tutorial: Taming velocity and variety simultaneous big data and stream reasoning. Retrieved from https://pdfs.semanticscholar.org/1fdf/4d05ebb51193088afc7b63cf002f01325a90.pdf
  • Dietrich, D. (2013). The genesis of EMC’s data analytics lifecycle. Retrieved from https://infocus.emc.com/david_dietrich/the-genesis-of-emcs-data-analytics-lifecycle/
  • Eichhorn, G. (2014). Why exactly is data auditing important? Retrieved from http://www.realisedatasystems.com/why-exactly-is-data-auditing-important/
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37. Retrieved from: http://www.aaai.org/ojs/index.php/aimagazine/article/download/1230/1131/
  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: issues, challenges, tools and good practices. InContemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
  • Kishore, N. & Sharma, S. (2016). Secure data migration from enterprise to cloud storage – analytical survey. BIJIT-BVICAM’s Internal Journal of Information Technology. Retrieved from http://bvicam.ac.in/bijit/downloads/pdf/issue15/09.pdf
  • Long, J. C., Cunningham, F. C., & Braithwaite, J. (2013). Bridges, brokers and boundary spanners in collaborative networks: a systematic review.BMC health services research13(1), 158.
  • (2016). An economic study of the hyper-scale data center. Mainstay, LLC, Castle Rock, CO, the USA, Retrieved from http://cloudpages.ericsson.com/ transforming-the-economics-of-data-center
  • Minelli, M., Chambers, M., &, Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. [Bookshelf Online].
  • Ovum (2016). 2017 Trends to watch: Big Data. Retrieved from http://info.ovum.com/uploads/files/2017_Trends_to_Watch_Big_Data.pdf
  • Park, Y., Shankar, M., Park, B. H., & Ghosh, J. (2014, March). Graph databases for large-scale healthcare systems: A framework for efficient data management and data services. In Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on (pp. 12-19). IEEE.
  • Ramachandran, M. & Chang, V. (2016). Toward validating cloud service providers using business process modeling and simulation. Retrieved from http://eprints.soton.ac.uk/390478/1/cloud_security_bpmn1%20paper%20_accepted.pdf
  • Richards, N. M., & King, J. H. (2014). Big Data Ethics. Wake Forest Law Review, 49, 393–432.
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, (1st). [Bookshelf Online].
  • Technopedia (n.d.). Data audit. Retrieved from https://www.techopedia.com/definition/28032/data-audit
  • Tsesis, A. (2014). The right to erasure: Privacy, data brokers, and the indefinite retention of data.Wake Forest L. Rev.49, 433.
  • Vardarlier, P., & Silahtaroglu, G. (2016). Gossip management at universities using big data warehouse model integrated with a decision support system. International Journal of Research in Business and Social Science, 5(1), 1–14. Doi: http://doi.org/10.1108/ 17506200710779521

Modeling and analyzing big data in health care

Let’s consider using the building blocks system for healthcare systems, on a healthcare problem that wants to monitor patient vital signs similar to Chen et al. (2010).

  • The purpose that the new data will serve: Most hospitals measure the following vitals for triaging patients: blood pressure and flow, core temperature, ECG, carbon dioxide concentration (Chen et al. 2010).
    1. Functions should it serve: gathering, storing, preprocessing, and processing the data. Chen et al. (2010) suggested that they should also perform a consistency check, aggregating and integrate the data.
    2. Which parts of the data are needed to serve these functions: all
  • Tools needed: distributed database system, wireless network, parallel processing, graphical user interface for healthcare providers to understand the data, servers, subject matter experts to create upper limits and lower limits, classification algorithms that used machine learning
  • Top level plan: The data will be collected from the vital sign sensors, streaming at various time intervals into a central hub that sends the data in packets over a wireless network into a server room. The server can divide the data into various distributed systems accordingly. A parallel processing program will be able to access the data per patient per window of time to conduct the needed functions and classifications to be able to provide triage warnings if the vitals hit any of the predetermined key performance indicators that require intervention by the subject matter experts.  If a key performance indicator is sparked, send data to the healthcare provider’s device via a graphical user interface.
  • Pivoting is bound to happen; the following can happen:
    1. Graphical user interface is not healthcare provider friendly
    2. Some of the sensors need to be able to throw a warning if they are going bad
    3. Subject matter experts may need to readjust the classification algorithm for better triaging

Thus, the above problem as discussed by Chen et al. (2010), could be broken apart to its building block components as addressed in Burkle et al. (2011).  These components help to create a system to analyze this set of big health care data through analytics, via distributed systems and parallel processing as addressed by Services (2015) and Mirtaheri et al. (2008).

Draw on a large body of data to form a prediction or variable comparisons within the premise of big data.

Fayyad, Piatetsky-Shapiro, and Smyth (1996) defined that data analytics can be divided into descriptive and predictive analytics. Vardarlier and Silaharoglu (2016) agreed with Fayyad et al. (1996) division but added prescriptive analytics.  Depending on the goal of diagnosing illnesses with the use of big data analytics should depend on the theory/division one should choose.  Raghupathi & Raghupathi (2014), stated some common examples of big data in the healthcare field to be: personal medical records, radiology images, clinical trial data, 3D imaging, human genomic data, population genomic data, biometric sensor reading, x-ray films, scripts, and traditional paper files.  Thus, the use of big data analytics to understand the 23 pairs of chromosomes that are the building blocks for people. Healthcare professionals are using the big data generated from our genomic code to help predict which illnesses a person could get (Services, 2013). Thus, using predictive analytics tools and algorithms like decision trees would be of some use.  Another use of predictive analytics and machine learning can be applied to diagnosing an eye disease like diabetic retinopathy from an image by using classification algorithms (Goldbloom, 2016).

Examine the unique domain of health informatics and explain how big data analytics contributes to the detection of fraud and the diagnosis of illness.

A process mining framework for the detection of healthcare fraud and abuse case study (Yang & Hwang, 2006): Fraud exists in processing health insurance claims because there are more opportunities to commit fraud because there are more channels of communication: service providers, insurance agencies, and patients. Any one of these three people can commit fraud, and the highest chance of fraud occurs where service providers can do unnecessary procedures putting patients at risk. Thus this case study provided the framework on how to conduct automated fraud detection. The study collected data from 2543 gynecology patients from 2001-2002 from a hospital, where they filtered out noisy data, identified activities based on medical expertise, identified fraud in about 906.

Summarize one case study in detail related to big data analytics as it relates to organizational processes and topical research.

The use of Spark about the healthcare field case study by Pita et al. (2015): Data quality in healthcare data is poor and in particular that of the Brazilian Public Health System.  Spark was used to help in data processing to improve quality through deterministic and probabilistic record linking within multiple databases.  Record linking is a technique that uses common attributes across multiple databases and identifies a 1-to-1 match.  Spark workflows were created to help do record linking by (1) analyzing all data in each database and common attributes with high probabilities of linkage; (2) pre-processing data where data is transformed, anonymization, and cleaned to a single format so that all the attributes can be compared to each other for a 1-to-1 match; (3) record linking based on deterministic and probabilistic algorithms; and (4) statistical analysis to evaluate the accuracy. Over 397M comparisons were made in 12 hours.  They concluded that accuracy depends on the size of the data, where the bigger the data, the more accuracy in record linking.

References

  • Burkle, T., Hain, T., Hossain, H., Dudeck, J., & Domann, E. (2001). Bioinformatics in medical practice: What is necessary for a hospital?. Studies in health technology and informatics, (2), 951-955.
  • Chen, B., Varkey, J. P., Pompili, D., Li, J. K., & Marsic, I. (2010). Patient vital signs monitoring using wireless body area networks. In Bioengineering Conference, Proceedings of the 2010 IEEE 36th Annual Northeast (pp. 1-2). IEEE.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37. Retrieved from: http://www.aaai.org/ojs/index.php/aimagazine/article/download/1230/1131/
  • Goldbloom, A. (2016). The jobs we’ll lose to machines – and the ones we won’t. TED Talks. Retrieved from https://www.youtube.com/watch?v=gWmRkYsLzB4
  • Mirtaheri, S. L., Khaneghah, E. M., Sharifi, M., & Azgomi, M. A. (2008). The influence of efficient message passing mechanisms on high performance distributed scientific computing. In Parallel and Distributed Processing with Applications, 2008. ISPA’08. International Symposium on (pp. 663-668). IEEE.
  • Pita, R., Pinto, C., Melo, P., Silva, M., Barreto, M., & Rasella, D. (2015). A Spark-based Workflow for Probabilistic Record Linkage of Healthcare Data. In EDBT/ICDT Workshops (pp. 17-26).
  • Raghupathi, W. Raghupathi, V. (2014). Big Data Analytics in healthcare: promise and potential. Heath Information Science and Systems. 2(3). Retrieved from http://hissjournal.biomedcentral.com/articles/10.1186/2047-2501-2-3
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].
  • Vardarlier, P., & Silahtaroglu, G. (2016). Gossip management at universities using big data warehouse model integrated with a decision support system. International Journal of Research in Business and Social Science, 5(1), 1–14. Doi: http://doi.org/10.1108/ 17506200710779521
  • Yang, W. S., & Hwang, S. Y. (2006). A process-mining framework for the detection of healthcare fraud and abuse.Expert Systems with Applications31(1), 56-68.

Cloud computing and big data

High-performance computing is where there is either a cluster and grid of servers or virtual machines that are connected by a network for a distributed storage and workflow (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli, Chamber, & Dhiraj, 2013). Parallel computing environments draw on the distributed storage and workflow on the cluster and grid of servers or virtual machines for processing big data (Bhokare et al., 2016; Minelli et al., 2013). NoSQL databases have benefits as they provide a data model for applications that require a little code, less debugging, run on clusters, handle large scale data, stored across distributed systems, use parallel processing, and evolve with time (Sadalage & Fowler, 2012).  Cloud technology is the integration of data storage across a distributed set of servers or virtual machines through either traditional relational database systems or NoSQL database systems while allowing for data preprocessing and processing through parallel processing (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli et al., 2013; Sadalage & Folwer, 2012).

Clouds can come in different flavors depending on how much the organization and supplier want to manage: Infrastructure as a Service, Platform as a Service, and Software as a Service (Connolly & Begg, 2014).  Thus, this makes the enterprise IT act as a broker across the various cloud options.  Also, analyzing exactly how and where data are stored to ensure it complies with various national and international data rules and regulations while preserving data privacy exist with the type of cloud use: public, community, private and hybrid clouds (Minelli et al. 2013; Conolloy & Begg, 2014).

Public cloud environments are where a supplier to a company provides a cluster or grid of servers through the internet like Spark AWS, EC2 (Connolly & Begg, 2014; Minelli et al. 2013).  Cloud computing can be thought of as a set of building blocks.  The company can grow or shrink a number of servers and services when needed dynamically, which allows the company to request the right amount of services for their data collection, storage, preprocessing, and processing needs (Bhokare et al., 2016; Minelli et al., 2013; Sadalage & Fowler, 2012).  This allows for the company to purchase the services it needs, without having to purchase the infrastructure to support the services it might think it will need. This allows for hyper-scaling computing in a distributed environment, also known as hyper-scale cloud computing, where the volume and demand of data explode exponentially yet still be accommodated in public, community, private, or hybrid cloud in a cost efficiently (Mainstay, 2016; Minelli et al., 2013).

Data storage and sharing are a key component of using enterprise public clouds (Sumana & Biswal, 2016).  However, it should be noted that the data is stored in the public cloud is stored on the same servers as probably the company’s competitors, so data security is an issue. Sumana and Biswal (2016) proposed that a key aggregate cryptosystem to be used, where the enterprise holds the master key for all its enterprise files, whereas going a deep layer users can have other data encrypted to send within the enterprise, without needing to know the enterprise file key. This proposed solution for data security in a public cloud allows for end-user registration, end-user revocation, file generation and deletion, and file access and traceability.

A community cloud environment is a cloud that is shared exclusively by a set of companies that share the similar characteristics, compliance, security, jurisdiction, etc. (Connolly & Begg, 2014). Thus, the infrastructure of all of these servers and grids meet industry standards and best practices, with the shared cost of the infrastructure is maintained by the community.

Private cloud environments have a similar infrastructure to a public cloud, but the infrastructure only holds the data one company exclusively, and its services are shared across the different business units of that one company (Connolly & Begg, 2014; Minelli et al., 2013). An organization may have all the components already to build a cloud through various on-premise computing resources and thus tend to build a cloud system using open source code on their internal infrastructure; this is called an on-premise private cloud (Bhokare et al., 2016). The benefit of the private cloud is full control of your data, and the cost of the servers are spread across all the business units, but the infrastructure costs (initial, upgrades, and maintenance costs) are in the company.

Hybrid clouds are two or more cloud structures that have either a private, community or public aspect to them (Connolly & Begg, 2014).  This allows for some data to be retained in the house if need be, and reducing the size of capital expenditure for the internal cloud infrastructure, while other data is stored externally where the cost of the infrastructure is not directly felt by the organization.

References

  • Bhokare, P., Bhagwat, P., Bhise, P., Lalwani, V., & Mahajan, M. R. (2016). Private Cloud using GlusterFS and Docker.International Journal of Engineering Science5016.
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th). Pearson Learning Solutions. [Bookshelf Online].
  • (2016). An economic study of the hyper-scale data center. Mainstay, LLC, Castle Rock, CO, the USA, Retrieved from http://cloudpages.ericsson.com/ transforming-the-economics-of-data-center
  • Minelli, M., Chambers, M., &, Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. [Bookshelf Online].
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, [Bookshelf Online].
  • Sumana, P., & Biswal, B. K. (2016). Secure Privacy Protected Data Sharing Between Groups in Public Cloud.International Journal of Engineering Science3285.

Data Tools: Artificial Intelligence

Big data Analytics and Artificial Intelligence

Artificial Intelligence (AI) is an embedded technology, based off of the current infrastructure (i.e. supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). Though previously, AI wasn’t able to come into existence without the proper computational power that is provided today (Cringely, 2013).  AI can make use of data hidden in “dark wells” and silos, where the end-user had no idea that the data even existed, to begin with (Power, 2015).  The goal of AI is to use huge amounts of data to draw out a set of rules through machine learning that will effectively replace experts in a certain field (Cringely, 2013; Power, 2015). Cringely (2013) stated that in some situations big data can eliminate the need for theory and that AI can aid in analyzing big data where theory is either lacking or impossible to define.

AI can provide tremendous value since it builds thousands of models and correlations automatically in one week, which use to take a few quantitative data scientist years to do (Dewey, 2013; Power, 2015).  The thing that has slowed down the progression of AI in the past was the creation of human readable computer languages like XML or SQL, which is not intuitive for computers to read (Cringely, 2013).  Fortunately, AI can easily use structured data and now use unstructured data thanks to everyone who tags all these unstructured data either in comments or on the data point itself, speeding up the computational time (Cringely, 2013; Power, 2015).  Dewey (2013), hypothesized that not only will AI be able to analyze big data at speeds faster than any human can, but that the AI system can also begin to improve its search algorithms in phenomena called intelligence explosion.  Intelligence explosion is when an AI system begins to analyze itself to improve itself in an iterative process to a point where there is an exponential growth in improvement (Dewey, 2013).

Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the “Why” behind it, thus making it hard to interpret the results (Power, 2015).  It would take many scientists to analyze the same big data and analyze it all, to fully understand how the connections were made in the AI system, which is no longer feasible (Cringely, 2013).  It is as if data scientist is trying to read the mind of the AI system, and they currently cannot read a human’s mind. However, the results of AI are becoming accurate, with AI identifying cats in photographs in 72 hours of machine learning and after a cat is tagged in a few photographs (Cringely, 2013). AI could be applied to any field of study like finance, social science, science, engineering, etc. or even play against champions on the Jeopardy game show (Cyranoski, 2015; Cringely, 2013; Dewey, 2013; Power, 2015).

Example of artificial intelligence use in big data analysis: Genomics

The goal of AI use on genomic data is to help analyze physiological traits and lifestyle choices to provide a dedicated and personalized health plan to treat and eventually prevent disease (Cyranoski, 2015; Power, 2015).  This is done by feeding the AI systems with huge amounts of genomic data, which is considered big data by today’s standards (Cyranoski, 2015). Systems like IBM’s Watson (an AI system) could provide treatment options based on the results gained from analyzing thousands or even millions of genomic data (Power, 2015).  This is done by analyzing all this data and allowing machine learning techniques to devise algorithms based on the input data (Cringely, 2013; Cyranoski, 2015; Power, 2015).  As of 2015, there is about 100,000 individual genomic data in the system, and even with this huge amounts of data, it is still not enough to provide the personalized health plan that is currently being envisioned based on a person’s genomic data (Cyranoski, 2015).  Eventually, millions of individuals will need to be added into the AI system, and not just genomic data, but also proteomics, metabolomics, lipidomics, etc.

Resources:

Data tools: Analysis of big data involving text mining

Definitions

Big data – any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do 2013, Podesta, Pritzker, Moniz, Holdren, & Zients, 2014).

Text mining – a process that involves discovering implicit knowledge from unstructured textual data (Gera & Goel, 2015; Hashimi & Hafez, 2015; Nassirtoussi Aghabozorgi, Wah, & Ngo, 2015).

Case study: Basole, Seuss, and Rouse (2013). IT innovation adoption by enterprises: Knowledge discovery through text analytics.

The goal of this study was to use text mining techniques on 472 quality peer reviewed articles that spanned 30 years of knowledge (1977-2008).  The selection criteria for the articles were based on articles focused on the adoption of IT innovation; focused on the enterprise, organization, or firm; rigorous research methods; and publishable leading journals.  The reason to go through all this analysis is to prove the usefulness of text analytics for literature reviews.  In 2016, most literature reviews contain recent literature from the last five years, and in certain fields, it may not just be useful to focus on the last five years.  Extending the literature search beyond this 5-year period, requires a ton of attention and manual labor, which makes the already literature an even more time-consuming endeavor than before. So, the author’s question is to see if it is possible to use text mining to conduct a more thorough review of the body of knowledge that expands beyond just the typical five years on any subject matter.  They argue that the time it takes to conduct this tedious task could benefit from automation.  However, this should be thought of as a first pass through the literature review. Thinking of this regarding a first pass allows for the generation of new research questions and a generation of ideas, which drives more future analysis.In the end, the study was able to conclude that cost and complexity were two of the most frequent determinants of IT innovation adoption from the perspective of an IT department.  Other determinants for IT departments were the complexity, capability, and relative advantage of the innovation.  However, when going up one level of extraction to the enterprise/organizational level, the perceived benefits and usefulness were the main determinants of IT innovation.  Ease of use of the technology was a big deal for the organization.  When comparing, IT innovation with costs there was a negative correlation between the two, while IT innovation has a positive correlation to organization size and top management support.

How was big data analytics learned, taught, and used in the case study?

The research approach for this study was: (1) Document Identification and extraction, (2) document classification and coding, (3) document analysis and knowledge discovery (key terms, co-occurrence), and (4) research gap identification.

Analysis of the data consisted of classifying the data into four time periods (bins): 1988-1979; 1980-1989; 1990-1999; and 2000-2008 and use of a classification scheme based on existing taxonomies (case study, content analysis, field experiment, field study, frameworks and conceptual model, interview, laboratory experiment, literature analysis, mathematical model, qualitative research, secondary data, speculation/commentary, and survey).  Data was also classified by their functional discipline (Information systems and computer science, decision science, management and organization sciences, economics, and innovation) and finally by IT innovation (software, hardware, networking infrastructure, and the tool’s IT term catalog). This study used a tool called Northernlight (http://georgiatech.northernlight.com/).

The hopes of this study are to use the bag-of-words technique and word proximity to other words (or their equivalents) to help extract meaning from a large set of text-based documents.  Bag-of-words technique is known for counting and identifying key terms and phrases, which help uncover themes.  The simplest way of thinking of the bag-of-words technique is word frequencies in a document.

However, understanding the meaning behind the themes means studying the context in which the words are located in, and relating them amongst other themes, also called co-occurrence of terms.  The best way of doing this meaning extraction is to measure the strength/distance between the themes.  Finally, the researcher in this study can set minimums, maximums that can enhance the meaning extraction algorithm to garner insights into IT innovation, while reducing the overall noise in the final results. The researchers set the following rules for co-occurrences between themes:

  • There are approximately 40 words per sentence
  • There are approximately 150 words per paragraph

How could this implementation of big data have been improved upon?

Goldbloom (2016) stated that using big data techniques (machine learning) is best on big data that requires classifying and it breaks down when the task is too small and specialized, therefore prime for only human analysis.  This study only looked at 427 articles, is this considered big enough for analysis, or should the analysis go back through multiple years beyond just the 30 years (Basole et al., 2013).  What is considered big data in 2013 (the time of this study), may not be big data in 2023 (Fox & Do, 2013).

Mei & Zhai (2005), observed how terms and term frequencies evolved over time and graphed it by year, rather than binning the data into four different groups as in Basole et al. (2013).  This case study could have shown how cost and complexity in IT innovation changed over time.  Graphing the results similar to Mei & Zhai (2005) and Yoon and Song (2014) would also allow for an analysis of IT innovation themes and if each of these themes is in an Introduction, Growth, Majority, or Decline mode.

 Reference

  • Basole, R. C., Seuss, C. D., & Rouse, W. B. (2013). IT innovation adoption by enterprises: Knowledge discovery through text analytics. Decision Support Systems, 54, 1044-1054. Retrieved from http://www.sciencedirect.com.ctu.idm.oclc.org/science/article/pii/S0167923612002849
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Gera, M., & Goel, S. (2015). Data Mining-Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Goldbloom, A. (2016). The jobs we’ll lose to machines –and the ones we won’t. TED. Retrieved from http://www.ted.com/talks/anthony_goldbloom_the_jobs_we_ll_lose_to_machines_and_the_ones_we_won_t
  • Hashimi, H., & Hafez, A. (2015). Selection criteria for text mining approaches. Computers in Human Behavior, 51, 729–733. http://doi.org/10.1016/j.chb.2014.10.062
  • Mei, Q., & Zhai, C. (2005). Discovering evolutionary theme patterns from text: an exploration of temporal text mining. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 198–207. http://doi.org/10.1145/1081870.1081895
  • Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2015). Text-mining of news-headlines for FOREX market prediction: a multi-layer dimension reduction algorithm with semantics and sentiment. Expert Systems with Applications42(1), 306-324.
  • Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
  • Yoon, B., & Song, B. (2014). A systematic approach of partner selection for open innovation. Industrial Management & Data Systems, 114(7), 1068.

Data Tools: WEKA

WEKA

The Java based, open sourced, and platform independent Waikato Environment for Knowledge Analysis (WEKA) tool, for data preprocessing, predictive data analytics, and facilitation interpretations and evaluation (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Miranda, n.d.; Xia & Gong, 2014).  It was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga 2015).  It is a java based machine learning algorithm for data mining tasks as well as text mining that could be used for predictive modeling, housing pre-processing, classification, regression, clustering, association rules, and visualization (WEKA, n.d). Also, WEKA contains classification, clustering, association rules, regression, and visualization capabilities, in particular, the C4.5 decision tree predictive data analytics algorithm (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011). Here WEKA is an open source data and text mining software tool, thus it is free to use. Therefore there are no costs associated with this software solution.

WEKA can be applied to big data (WEKA, n.d.) and SQL Databases (Patel & Donga, 2015). Subsequently, WEKA has been used in many research studies that are involved in big data analytics (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011; Parkavi & Sasikumar, 2016; Xia & Gong, 2014). For instance, Barak and Modarres (2015) used WEKA for decision tree analysis on predicting stock risks and returns.

The fact that it has been using in this many research studies is that the reliability and validity of the software are high and well established.  Even in a study comparing WEKA with 12 other data analytics tools, is one of two apps studied that have a classification, regression, and clustering algorithms (Gera & Goel, 2015).

A disadvantage of using this tool is its lack of supporting multi-relational data mining, but if one can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). The comprehensiveness of analysis algorithms for both data and text mining and pre-processing is its advantage. Another disadvantage of WEKA is that it cannot handle raw data directly, meaning the data had to be preprocessed before it is entered into the software package and analyzed (Hoonlor, 2011). WEKA cannot even import excel files, data in Excel have to be converted into CSV format to be usable within the system (Miranda, n.d.)

References:

  • Dogan, N., & Tanrikulu, Z. (2013). A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Information Technology and Management, 14(2), 105-124. doi:http://dx.doi.org/10.1007/s10799-012-0135-7
  • Gera, M., & Goel, S. (2015). Data Mining -Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Kumar, D., & Fet, D. (2011). Performance Analysis of Various Data Mining Algorithms: A Review. International Journal of Computer Applications, 32(6), 9–16.
  • Miranda, S. (n.d.). An Introduction to Social Analytics : Concepts and Methods.
  • Parkavi, S. & Sasikumar, S. (2016). Prediction of Commodities Market by Using Data Mining Technique. i-Manager’s Journal on Computer Science.
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).
  • WEKA (n.d.) WEKA 3: Data Mining Software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/
  • Xia, B. S., & Gong, P. (2014). Review of business intelligence through data analysis. Benchmarking, 21(2), 300–311. http://doi.org/http://dx.doi.org/10.1108/BIJ-08-2012-0051

Data Tools: Hadoop and how to install it

What is Hadoop

Hadoop’s Distributed File System (HFDS) is where big data is broken up into smaller blocks (IBM, n.d.), which can be aggregated like a set of Legos throughout a distributed database system. Data blocks are distributed across multiple servers.  This block system provides an easy way to scale up or down the data needs of the company and allows for MapReduce to do it tasks on the smaller sets of the data for faster processing (IBM, n.d). Blocks are small enough that they can be easily duplicated (for disaster recovery purposes) in two different servers (or more, depending on the data needs).

HFDS can support many different data types, even those that are unknown or yet to be classified and it can store a bunch of data.  Thus, Hadoop’s technology to manage big data allows for parallel processing, which can allow for parallel searching, metadata management, parallel analysis (with MapReduce), the establishment of workflow system analysis, etc. (Gary et al., 2005, Hortonworks, 2013, & IBM, n.d.).

Given the massive amounts of data in Big Data that needs to get processed, manipulated, and calculated upon, parallel processing and programming are there to use the benefits of distributed systems to get the job done (Minelli et al., 2013).  Hadoop, which is Java based allows for manipulation and calculations to be done by calling on MapReduce, which pulls on the data which is distributed on its servers, to map key items/objects, and reduces the data to the query at hand (Hortonworks, 2013 & Sathupadi, 2010).

Parallel processing allows making quick work on a big data set, because rather than having one processor doing all the work, Hadoop splits up the task amongst many processors. This is the largest benefit of Hadoop, which allows for parallel processing.  Another advantage of parallel processing is when one processor/node goes out; another node can pick up from where that task last saved safe object task (which can slow down the calculation but by just a bit).  Hadoop knows that this happens all the time with their nodes, so the processor/node create backups of their data as part of their fail safe (IBM, n.d).  This is done so that another processor/node can continue its work on the copied data, which enhances data availability, which in the end gets the task you need to be done now.

Minelli et al. (2013) stated that traditional relational database systems could depend on hardware architecture.  However, Hadoop’s service is part of cloud (as Platform as a Service = PaaS).  For PaaS, we manage the applications, and data, whereas the provider (Hadoop), administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001).  The next section discusses how to install Hadoop and how to set up Eclipse to access map/reduce servers.

Installation steps

  • Go to the Hadoop Main Page < http://hadoop.apache.org/ > and scroll down to the getting started section, and click “Download Hadoop from the release page.” (Birajdar, 2015)
  • In the Apache Hadoop Releases < http://hadoop.apache.org/releases.html > Select the link for the “source” code for Hadoop 2.7.3, and then select the first mirror: “http://apache.mirrors.ionfish.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3-src.tar.gz” (Birajdar, 2015)
  • Open the Hadoop-2.7.3 tarball file with a compression file reader like WinRAR archiver < http://www.rarlab.com/download.htm >. Then drag the file into the Local Disk (C:). (Birajdar, 2015)
  • Once the file has been completely transferred to the Local Disk drive, close the tarball file, and open up the hadoop-2.7.3-src folder. (Birajdar, 2015)
  • Download Hadoop 0.18.0 tarball file < https://archive.apache.org/dist/hadoop/core/hadoop-0.18.0/ > and place the copy the “Hadoop-vm-appliance-0-18-0” folder into the Java “jdk1.8.0_101” folder. (Birajdar, 2015; Gnsaheb, 2013)
  • Download Hadoop VM file < http://ydn.zenfs.com/site/hadoop/hadoop-vm-appliance-0-18-0_v1.zip >, unzip it and place it inside the Hadoop src file. (Birajdar, 2015)
  • Open up VMware Workstation 12, and open a virtual machine “Hadoop-appliance-0.18.0.vmx” and select play virtual machine. (Birajdar, 2015)
  • Login: Hadoop-user and password: Hadoop. (Birajdar, 2015; Gnsaheb, 2013)
  • Once in the virtual machine, type “./start-hadoop” and hit enter. (Birajdar, 2015; Gnsaheb, 2013)
    1. To test MapReduce on the VM: bin/Hadoop jar Hadoop-0.18.0-examples.jar pi 10 100000000
      1. You should get a “job finished in X seconds.”
      2. You should get an “estimated value of PI is Y.”
  • To bind MapReduce plugin to eclipse (Gnsaheb, 2013)
    1. Go into the JDK folder, under Hadoop-0.18.0 > contrib> eclipse-plugin > “Hadoop-0.18.0-eclipse-plugin” and place it into the eclipse neon 1 plugin folder “eclipse\plugins”
    2. Open eclipse, then open perspective button> other> map/reduce.
    3. In Eclipse, click on Windows> Show View > other > MapReduce Tools > Map/Reduce location
    4. Adding a server. On the Map/Reduce Location window, click on the elephant
      1. Location name: your choice
      2. Map/Reduce master host: IP address achieved after you log in via the VM
  • Map/Reduce Master Port: 9001
  1. DFS Master Port: 9000
  2. Username: Hadoop-user
  1. Go to the advance parameter tab > mapred.system.dir > edit to /Hadoop/mapped/system

Issues experienced in the installation processes (Discussion of any challenges and explain how it was investigated and solved)

Not one source has the entire solution Birajdar, 2015; Gnsaheb, 2013; Korolev, 2008).  It took a combination of all three sources, to get the same output that each of them has described.  Once the solution was determined to be correct, and the correct versions of the files were located, they were expressed in the instruction set above.  Whenever a person runs into a problem with computer science, google.com is their friend.  The links above will become outdated with time, and methods will change.  Each person’s computer system is different than those from my personal computer system, which is reflected in this instruction manual.  This instruction manual should help others google the right terms and in the right order to get Hadoop installed correctly onto their system.  This process takes about 3-5 hours to install correctly, with the long time it takes to download and install the right files, and with the time to set up everything correctly.

Resources

Adv Quant: Statistical Features of R

Ward and Barker (2013) traced back definition of Volume, Velocity, and Variety from Gartner.  Now, a predominately widely accepted definition for big data is any set of data that has high velocity, volume, and variety (Davenport & Dyche, 2013; Fox & Do 2013, Kaur & Rani, 2015. Mao, Xu, Wu, Li, Li, & Lu, 2015; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014; Richards & King, 2014; Sagiroglu & Sinanc, 2013; Zikopoulous and Eaton, 2012). Davenport et al. (2012), stated that IT companies define big data as “more insightful data analysis”, but if used properly companies can gain a competitive edge.  Data scientists from companies like Google, Facebook, and LinkedIn, use R for their finance and data analytics (Revolution Analytics, n.d.). According to Minelli, Chambers and Dhiraj (2013) R has 2 million end-users and is used in industries like health, finance, etc.

Why is R so popular and have that many users?  It could be that R is a free open-source software that works on multiple platforms (Unix, Windows, Mac), and has an extensive statistical library to help conduct basic statistical data analysis, to multivariate analysis, scaling up to big data analytics (Hothorn, 2016; Leisch & Gruen, 2016; Schumacker, 2014 & 2016; Templ, 2016; Theussl & Borchers, 2016; Wild, 2015).  Given the open-sourced nature of the R software, many libraries are being built and shared with the greater community, and the Comprehensive R Archive Network (CRAN), has a ton of these programs as part of R Packages (Schumacker, 2014).  Other advantages of R, is the customizable statistical analysis, control over the analytical processes, extensive documentation, and references (Schumacker, 2016).  R Packages allow for everyday data analytics, visually aesthetic data visualizations, faster results than legacy statistical software that the end-user can control, drawing upon the talents of leading data scientists (Revolution Analytics, n.d.).  R programming features include dealing with a whole suite of data types, (scalars, vectors, matrices, arrays, and data frames), as well as impetrating and exporting data into multiple other commercially available statistical/data software (SPSS, SAS, Excel, etc.) (Schumacker, 2014 & 2016).  All the features of R related to big data analytics, statistical, and programming features are listed in Table 1 (below).  Given all the R Packages listed below and the importing and exporting features to other big data statistical software illustrates how useful R is for analyzing big datasets of various types (Schumacker, 2014, 2016).

Finally, R is the most dominant analytics tool for Big Data Analytics (Minelli et al., 2013).  Big data analytics is at the border of computing science, data mining, and statistics, it is natural to see multiple R Packages and libraries listed within CRAN that are freely available to use.  Within the field of big data analytics, some (but not all) of common sets of techniques that have R Packages are machine learning, cluster analysis, finite mixture models, and natural language processing. Given the extensive libraries through R Packages and extensive documentation, R is well suited for Big Data.

Table 1: Big Data Analytics, Statistical, and Programmable features of R

R Programming Features (Schumacker, 2014) Input, Process, Output, R Packages
Variables in R (Schumacker, 2014) number, character, logical
Data Types in R (Schumacker, 2014) scalars, arrays, vectors, matrices, list, data frames
Flow control: Loops (Schumacker, 2014) Loops (for, if, while, else, …)

Boolean Operators (and, not, or)

Visualizations (Schumacker, 2014) pie charts, bar charts, histogram, stem-and-leaf plots, scatter plots, box-whiskers plot, surface plots, contour plots, geographic maps, colors, plus others from the many R Packages
Statistical Analysis (Schumacker, 2014) Central tendency, dispersion, correlation test, linear Regression, multiple regression, logistic regression, log-linear regression, analysis of variance, probability, confidence intervals, plus others from the many R Packages
Distributions: population, sampling, and statistical (Schumacker, 2014) Binomial, Uniform, Exponential, Normal, Hypothesis testing, chi-square, z-test, t-test, f-test, plus others from the many R Packages
Multivariate Statistical Analysis (Schumacker, 2016) MANOVA, MANCOVA, factor analysis, principle components analysis, structural equation modeling, multidimensional scaling, discriminant analysis, canonical correlation, multiple group multivariate statistical analysis, plus others from the many R Packages
Big Data Analytics: Cluster Analysis (Leisch & Gruen, 2016)

 

hierarchical clustering, partitioning clustering, model-based clustering, K-means clustering, fuzzy clustering, cluster-wise regression, principal component analysis, self-organizing maps, density based clustering
Big Data Analytics: Machine Learning

(Hothorn, 2016; Templ, 2016)

neural networks, recursive partitioning, random forests, regularized and shrinkage methods, boosting, support vector machines, association rules, fuzzy rules based systems, model selection and validation, tree methods, expectation-maximization, nearest neighbor
Big Data Analytics: Natural Language Processing (Wild, 2015)

 

Frameworks, lexical databases, keyword extraction, string manipulation, stemming, semantic, pragmatics
Big Data Analytics: Optimization and Mathematical Programing (Theussl & Borchers, 2016)

 

optimization infrastructure packages, general purpose continuous solvers, least-squares problems, semidefinite and convex solvers, global and stochastic optimization, mathematical programming solvers

 

References

  • Davenport, T. H., Barth, P., & Bean, R. (2012). How big data is different. MIT Sloan Management Review, 54(1), 43.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Hothorn, T. (2016). CRAN task view: Machine learning & statistical learning. Retrieved from https://cran.r-project.org/web/views/MachineLearning.html
  • Kaur, K., & Rani, R. (2015). Managing Data in Healthcare Information Systems: Many Models, One Solution. Big Data Management, 52–59.
  • Leisch, F. & Gruen, B. (2016). CRAN task view: Cluster analysis & finite mixture models. Retrieved from https://cran.r-project.org/web/views/Cluster.html
  • Mao, R., Xu, H., Wu, W., Li, J., Li, Y., & Lu, M. (2015). Overcoming the Challenge of Variety: Big Data Abstraction, the Next Evolution of Data Management for AAL Communication Systems. Ambient Assisted Living Communications, 42–47.
  • Minelli, M., Chambers M., & Dhiraj A. (2013) Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.
  • Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
  • Revolution Analytics (n.d.). What is R? Retrieved from http://www.revolutionanalytics.com/what-r
  • Richards, N. M., & King, J. H. (2014). Big Data Ethics. Wake Forest Law Review, 49, 393–432.
  • Sagiroglu, S., & Sinanc, D. (2013). Big Data : A Review. Collaboration Technologies and Systems (CTS), 42–47.
  • Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.
  • Schumacker, R. E. (2016) Using R with multivariate statistics. California, SAGE Publications, Inc.
  • Templ, M. (2016). CRAN task view: Official statistics & survey methodology. Retrieved from https://cran.r-project.org/web/views/OfficialStatistics.html
  • Theussl, S. & Borchers, H. W. (2016). CRAN task view: Optimization and mathematical programming. Retrieved from https://cran.r-project.org/web/views/Optimization.html
  • Ward, J. S., & Barker, A. (2013). Undefined by data: a survey of big data definitions. arXiv preprint arXiv:1309.5821.
  • Wild, F. (2015). CRAN task view: Natural language processing. Retrieved from https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
  • Zikopoulos, P., &Eaton, C. (2012). Understanding Big Data: Analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media.

Big Data Analytics: Compelling Topics

Big Data and Hadoop:

According to Gray et al. (2005), traditional data management relies on arrays and tables in order to analyze objects, which can range from financial data, galaxies, proteins, events, spectra data, 2D weather, etc., but when it comes to N-dimensional arrays there is an “impedance mismatch” between the data and the database.    Big data, can be N-dimensional, which can also vary across time, i.e. text data (Gray et al., 2005). Big data, by its name, is voluminous. Thus, given the massive amounts of data in Big Data that needs to get processed, manipulated, and calculated upon, parallel processing and programming are there to use the benefits of distributed systems to get the job done (Minelli, Chambers, & Dhiraj, 2013).  Parallel processing allows making quick work on a big data set, because rather than having one processor doing all the work, you split up the task amongst many processors.

Hadoop’s Distributed File System (HFDS), breaks up big data into smaller blocks (IBM, n.d.), which can be aggregated like a set of Legos throughout a distributed database system. Data blocks are distributed across multiple servers. Hadoop is Java-based and pulls on the data that is stored on their distributed servers, to map key items/objects, and reduces the data to the query at hand (MapReduce function). Hadoop is built to deal with big data stored in the cloud.

Cloud Computing:

Clouds come in three different privacy flavors: Public (all customers and companies share the all same resources), Private (only one group of clients or company can use a particular cloud resources), and Hybrid (some aspects of the cloud are public while others are private depending on the data sensitivity.  Cloud technology encompasses Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).  These types of cloud differ in what the company managers on what is managed by the cloud provider (Lau, 2011).  Cloud differs from the conventional data centers where the company managed it all: application, data, O/S, virtualization, servers, storage, and networking.  Cloud is replacing the conventional data center because infrastructure costs are high.  For a company to be spending that much money on a conventional data center that will get outdated in 18 months (Moore’s law of technology), it’s just a constant sink in money.  Thus, outsourcing the data center infrastructure is the first step of company’s movement into the cloud.

Key Components to Success:

You need to have the buy-in of the leaders and employees when it comes to using big data analytics for predictive, prescriptive or descriptive purposes.  When it came to buy-in, Lt. Palmer had to nurture top-down support as well as buy-in from the bottom-up (ranks).  It was much harder to get buy-in from more experienced detectives, who feel that the introduction of tools like analytics, is a way to tell them to give up their long-standing practices and even replace them.  So, Lt. Palmer had sold Blue PALMS as “What’s worked best for us is proving [the value of Blue PALMS] one case at a time, and stressing that it’s a tool, that it’s a compliment to their skills and experience, not a substitute”.  Lt. Palmer got buy-in from a senior and well-respected officer, by helping him solve a case.  The senior officer had a suspect in mind, and after feeding in the data, the tool was able to predict 20 people that could have done it in an order of most likely.  The suspect was on the top five, and when apprehended, the suspect confessed.  Doing, this case by case has built the trust amongst veteran officers and thus eventually got their buy in.

Applications of Big Data Analytics:

A result of Big Data Analytics is online profiling.  Online profiling is using a person’s online identity to collect information about them, their behaviors, their interactions, their tastes, etc. to drive a targeted advertising (McNurlin et al., 2008).  Profiling has its roots in third party cookies and profiling has now evolved to include 40 different variables that are collected from the consumer (Pophal, 2014).  Online profiling allows for marketers to send personalized and “perfect” advertisements to the consumer, instantly.

Moving from online profiling to studying social media, He, Zha, and Li (2013) stated their theory, that with higher positive customer engagement, customers can become brand advocates, which increases their brand loyalty and push referrals to their friends, and approximately 1/3 people followed a friend’s referral if done through social media. This insight came through analyzing the social media data from Pizza Hut, Dominos and Papa Johns, as they aim to control more of the market share to increase their revenue.  But, is this aiding in protecting people’s privacy when we analyze their social media content when they interact with a company?

HIPAA described how we should conduct de-identification of 18 identifiers/variables that would help protect people from ethical issues that could arise from big data.   HIPAA legislation is not standardized for all big data applications/cases; it is good practice. However, HIPAA legislation is mostly concerned with the health care industry, listing those 18 identifiers that have to be de-identified: Names, Geographic data, Dates, Telephone Numbers, VIN, Fax, Device ID and serial numbers, emails addresses, URLs, SSN, IP address, Medical Record Numbers, Biometric ID (fingerprints, iris scans, voice prints, etc), full face photos, health plan beneficiary numbers, account numbers, any other unique ID number (characteristic, codes, etc), and certifications/license numbers (HHS, n.d.).  We must be aware that HIPAA compliance is more a feature of the data collector and data owner than the cloud provider.

HIPAA arose from the human genome project 25 years ago, where they were trying to sequence its first 3B base pair of the human genome over a 13 year period (Green, Watson, & Collins, 2015).  This 3B base pair is about 100 GB uncompressed and by 2011, 13 quadrillion bases were sequenced (O’Driscoll et al., 2013). Studying genomic data comes with a whole host of ethical issues.  Some of those were addressed by the HIPPA legislation while other issues are left unresolved today.

One of the ethical issues that arose were mentioned in McEwen et al. (2013), for people who have submitted their genomic data 25 years ago can that data be used today in other studies? What about if it was used to help the participants of 25 years ago to take preventative measures for adverse health conditions?  However, ethical issues extend beyond privacy and compliance.  McEwen et al. (2013) warn that data has been collected for 25 years, and what if data from 20 years ago provides data that a participant can suffer an adverse health condition that could be preventable.  What is the duty of the researchers today to that participant?

Resources:

Big Data Analytics: Future Predictions?

Big data analytics and stifling future innovation?

One way to make a prediction about the future is to understand the current challenges faced in certain parts of a particular field.  In the case of big data analytics, machine learning analyzes data from the past to make a prediction or understanding of the future (Ahlemeyer-Stubbe & Coleman, 2014).  Ahlemeyer-Stubbe and Coleman (2014), argued that learning from the past can hinder innovation.  Although Basole, Seuss, and Rouse (2013), studied past popular IT journal articles to see how the field of IT is evolving, and in Yang, Klose, Lippy,  Barcelon-Yang, and Zhang, (2014) they conclude that analyzing current patent information can lead to discovering trends, and help provide companies actionable items to guide and build future business strategies around a patent trend.  The danger of stifling innovation per Ahlemeyer-Stubbe and Coleman (2014), comes from when we consider a situation of only relying on past data and experiences and not allowing for experiencing or trying anything new.  An example is like trying to optimize a horse-drawn carriage; then the automobile will never have been invented (Ahlemeyer-Stubbe & Coleman, 2014).   This example is a very bad analogy.  We should not focus on only collecting data on one item, but its tangential items as well.  We should focus on collecting a wide range of data from different fields and different sources, to allow for new patterns to form, connections to be made, which can promote innovation (Basole et al. 2013).

Future of Health Analytics:

Another way to analyze the future is to dream big or from a movie (Carter, Farmer, and Siegel, 2014). What if we could analyze our blood daily to aid in tracking our overall health, besides the daily blood sugar levels data that most diabetics are accustom to?  The information generated from here can aid in generating a healthier lifestyle.  Currently, doctors aid patients in their care with their diet and monitor their overall health.  When you are going home, this care disappears.  But, constant monitoring may help outpatient care and daily living.  Alerts could be sent to your doctor or to other family members if certain biomarkers get to a critical threshold.  This could aid in better care, allowing people’s social network to help them keep accountable in making healthy life and lifestyle choices, and possibly lessen the time between symptom detection to emergency care in severe cases (Carter, Farmer, and Siegel, 2014).

Generating revenue from analyzing consumers:

Soon, it is not enough to conduct item affinity analysis (market basket analysis).  Item affinity (market basket analysis) uses rules-based analytics to understand what items frequently co-occur during transactions (Snowplow Analytics, 2016). Item affinity is similar to the Amazon.com current method to drive more sales through getting their customers to consume more.  However, what if we started to look at what a consumer intends to buy (Minelli, Chambers, and Dhiraj, 2013)? Analyzing data from consumer product awareness, brand awareness, opinion (sentiment analysis), consideration, preferences, and purchases from a consumer’s multiple social media platforms account in real time can allow marketers to create the perfect advertisement (Minelli et al., 2013).  Establishing the perfect advertisement will allow companies to gain a bigger market share, or to lure customers to their product and/or services from their competitors.  According to Minelli et al. (2013) predicted that companies in the future should be moving towards:

  • Data that can be refreshed every second
  • Data validation exists in real time and alerts sent if something is wrong before data is published in aiding data driven decisions
  • Executives will receive daily data briefs from their internal processes and from their competitors to allow them to make data-driven decisions to increase revenue
  • Questions that were raised in staff meetings or other organizational meetings can be answered in minutes to hours, not weeks
  • A cultural change in companies where data is easily available and the phrase “let me show you the facts” can be easily heard amongst colleagues

Big data analytics can affect many other areas as well, and there is a whole new world opening up to this.  More and more companies and government agencies are hiring data scientists, because they don’t just see the current value that these scientists bring, but they see their potential value 10-15 years from now.  Thus, the field is expected to change as more and more talent is being recruited into the field of big data analytics.

References:

Ahlemeyer-Stubbe, A., & Coleman, S.  (2014). A Practical Guide to Data Mining for Business and Industry. Wiley-Blackwell. VitalBook file.

Basole, R. C., Seuss, D. C., & Rouse, W. B. (2013). IT innovation adoption by enterpirses: knowledge discovery through text analyztics. Decision Support Systems V(54). 1044-1054.

Carter, K.  B., Farmer, D., Siegel, C. (2014). Actionable Intelligence: A Guide to Delivering Business Results with Big Data Fast!. John Wiley & Sons P&T. VitalBook file.

Minelli, M., Chambers, M., Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.

Snowplow Analytics (2016). Market basket analysis: identifying products and content that go well together. Retrieved from http://snowplowanalytics.com/analytics/recipes/catalog-analytics/market-basket-analysis-identifying-products-that-sell-well-together.html

Yang, Y. Y., Klose, T., Lippy, J., Barcelon-Yang, C. S. & Zhang, L. (2014). Leveraging text analytics in patent analysis to empower business decisions – a competitive differentiation of kinase assay technology platforms by I2E text mining software. World Patent Information V(39). 24-34.