Compelling topics on analytics of big data

  • Big data is defined as high volume, high variety/complexity, and high velocity, which are known as the 3Vs (Services, 2015).
  • The goals and objectives of the problem should help define which theories and techniques of big data analytics to use. Fayyad, Piatetsky-Shapiro, and Smyth (1996) wrote that data analytics can be divided into descriptive and predictive analytics. Vardarlier and Silahtaroglu (2016) agreed with Fayyad et al.’s (1996) division but added prescriptive analytics. Thus, these three divisions of big data analytics are (a code sketch contrasting the three follows this list):
    • Descriptive analytics answers “What happened?”
    • Predictive analytics answers “What will happen?”
    • Prescriptive analytics answers “What should be done?”
  • The scientific method gives a framework for the data analytics lifecycle (Dietrich, 2013; Services, 2015). According to Dietrich (2013), it is a cyclical lifecycle with iterative parts in each of its six steps: discovery; pre-processing data; model planning; model building; communicating results; and operationalizing.
  • Data-in-motion is the real-time streaming of data from a broad spectrum of technologies, which also encompasses data transmission between systems (Katal, Wazid, & Goudar, 2013; Kishore & Sharma, 2016; Ovum, 2016; Ramachandran & Chang, 2016). Data stored on a database or cloud system is considered data-at-rest, and data being processed and analyzed is considered data-in-use (Ramachandran & Chang, 2016). The timely analysis of real-time streaming data is also known as stream reasoning, and implementing solutions for stream reasoning revolves around high-throughput systems and storage space with low latency (Della Valle et al., 2016); a sliding-window sketch follows this list.
  • Data brokers are tasked with collecting data on people, building a particular type of profile of each person, and selling it to companies (Angwin, 2014; Beckett, 2014; Tsesis, 2014). The data brokers’ main mission is to collect data and drop the barriers of geographic location, cognitive or cultural gaps, different professions, or parties that don’t trust each other (Long, Cunningham, & Braithwaite, 2013). The danger of collecting this data is that it can raise incidents of discrimination based on race or income, whether direct or indirect (Beckett, 2014).
  • Data auditing is assessing the quality and fitness for purpose of data via key metrics and properties of the data (Techopedia, n.d.). Data auditing processes and procedures are the business’ way of assessing and controlling its data quality (Eichhorn, 2014).
  • If following an agile development process, the key stakeholders should be involved throughout the lifecycle: the business user, project sponsor, project manager, business intelligence analyst, database administrator, data engineer, and data scientist (Services, 2015).
  • Lawyers define privacy as: invasions into protected spaces, relationships, or decisions; the collection of information; the use of information; and the disclosure of information (Richards & King, 2014).
  • Richards and King (2014) describe that a binary notion of data privacy does not exist. Data is never completely private/confidential nor completely divulged, but lies in between these two extremes. Privacy laws should focus on the flow of personal information, where an emphasis should be placed on a type of privacy called confidentiality, in which data is agreed to flow to a certain individual or group of individuals (Richards & King, 2014).
  • Fraud is deception; fraud detection is needed because even as fraud detection algorithms improve, the rate of fraud is increasing (Minelli, Chambers, & Dhiraj, 2013). Data mining has allowed for fraud detection via multi-attribute monitoring, which tries to find hidden anomalies by identifying hidden patterns through the use of class description and class discrimination (Brookshear & Brylow, 2014; Minelli et al., 2013).
  • High-performance computing is where a cluster or grid of servers or virtual machines is connected by a network for distributed storage and workflow (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli et al., 2013).
  • Parallel computing environments draw on the distributed storage and workflow of that cluster or grid of servers or virtual machines for processing big data (Bhokare et al., 2016; Minelli et al., 2013).
  • NoSQL (Not only Structured Query Language) databases store data in non-relational models, i.e., graph, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). NoSQL databases are beneficial because they provide a data model for applications that require little code and less debugging, run on clusters, handle large-scale data, and evolve with time (Sadalage & Fowler, 2012).
    • Document store NoSQL databases use a key/value pair in which the value is the document itself, which could be in JSON, BSON, or XML (Sadalage & Fowler, 2012; Services, 2015). These documents are hierarchical trees (Sadalage & Fowler, 2012). Sample document databases include MongoDB and CouchDB; a document-store sketch follows this list.
    • Graph NoSQL databases are used to draw networks, showing the relationships between items in a graphical format that has been optimized for easy searching and editing (Services, 2015). Each item is considered a node, and adding more nodes or relationships while traversing through them is made simpler through a graph database rather than a traditional database (Sadalage & Fowler, 2012). Sample graph databases include Neo4j and Pregel (Park et al., 2014); a graph-traversal sketch follows this list.
    • Column-oriented databases are perfect for sparse datasets, i.e., ones with many null values; when columns do have data, the related columns are grouped together (Services, 2015). Grouping demographic data like age, income, gender, marital status, sexual orientation, etc., is a great example of a use case for this NoSQL database. Cassandra is an example of a column-oriented database.
  • Public cloud environments are where a supplier provides a company with a cluster or grid of servers through the internet, such as Spark on AWS EC2 (Connolly & Begg, 2014; Minelli et al., 2013).
  • A community cloud environment is a cloud that is shared exclusively by a set of companies that share similar characteristics, compliance, security, jurisdiction, etc. (Connolly & Begg, 2014).
  • Private cloud environments have a similar infrastructure to a public cloud, but the infrastructure holds the data of one company exclusively, and its services are shared across the different business units of that one company (Connolly & Begg, 2014; Minelli et al., 2013).
  • Hybrid clouds are two or more cloud structures that have either a private, community, or public aspect to them (Connolly & Begg, 2014).
  • Cloud computing allows a company to purchase the services it needs, without having to purchase the infrastructure to support the services it might think it will need. This allows for hyper-scaling computing in a distributed environment, also known as hyper-scale cloud computing, where the volume of and demand for data can explode exponentially and still be accommodated cost-efficiently in a public, community, private, or hybrid cloud (Mainstay, 2016; Minelli et al., 2013).
  • A building-block approach to big data analytics involves a few steps (Burkle et al., 2001):
    • Define the purpose that the new data will and should serve.
      • How many functions should it support?
      • Mark which parts of that new data are needed for each function.
    • Identify the tools needed to support the purpose of that new data.
    • Create a top-level architecture plan view.
    • Build based on the plan, but leave room to pivot when needed.
      • Some modifications occur to allow the final vision to be achieved given the conditions at the time of building the architecture.
      • Other modifications come from closer inspection of certain components in the architecture.
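
The contrast among the three divisions of analytics is easiest to see in code. Below is a minimal Python sketch using hypothetical monthly sales figures: a descriptive summary, a naive predictive extrapolation, and a prescriptive decision rule. The numbers and the 20% threshold are illustrative assumptions, not from the cited sources.

```python
# Hypothetical monthly sales figures.
monthly_sales = [110, 125, 123, 140, 152, 160]

# Descriptive analytics: "What happened?"
average = sum(monthly_sales) / len(monthly_sales)
print(f"Average monthly sales: {average:.1f}")

# Predictive analytics: "What will happen?" (naive linear extrapolation)
trend = (monthly_sales[-1] - monthly_sales[0]) / (len(monthly_sales) - 1)
forecast = monthly_sales[-1] + trend
print(f"Forecast for next month: {forecast:.1f}")

# Prescriptive analytics: "What should be done?" (simple decision rule)
if forecast > 1.2 * average:
    print("Recommendation: increase inventory ahead of rising demand.")
else:
    print("Recommendation: hold inventory steady.")
```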
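
For the data-in-motion bullet above, stream reasoning boils down to computing over a bounded window of recent events instead of a full data-at-rest store. A minimal sketch, assuming a plain Python iterable stands in for a real stream source such as a message queue:

```python
from collections import deque
from statistics import mean

def sliding_window_average(stream, window_size=3):
    """Yield the rolling average over the last `window_size` readings.

    The deque keeps memory bounded, reflecting the low-storage,
    low-latency constraint of stream reasoning noted above.
    """
    window = deque(maxlen=window_size)
    for reading in stream:
        window.append(reading)
        yield mean(window)

# Hypothetical sensor readings standing in for a real-time feed.
sensor_feed = [3.1, 3.3, 9.8, 3.2, 3.0, 3.4]
for avg in sliding_window_average(sensor_feed):
    print(f"rolling average: {avg:.2f}")
```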
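
For the document-store bullet, a toy in-memory example makes the key/document idea concrete. This sketches the data model only; real systems such as MongoDB or CouchDB add indexing, replication, and query languages. The record fields are hypothetical.

```python
import json

# The value in a document store is the hierarchical document itself,
# serialized here as JSON text; the key is the document id.
patient_doc = {
    "_id": "patient-001",  # key
    "name": "Jane Doe",    # hypothetical record fields
    "visits": [            # nested, tree-like structure
        {"date": "2016-05-01", "diagnosis": "flu"},
        {"date": "2016-09-12", "diagnosis": "sprain"},
    ],
}

store = {}  # a dict stands in for the database
store[patient_doc["_id"]] = json.dumps(patient_doc)

# Retrieval by key, then navigation down the hierarchical tree.
doc = json.loads(store["patient-001"])
print(doc["visits"][0]["diagnosis"])  # -> flu
```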
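
For the graph bullet, the point is that relationships are first-class and traversal is a pointer chase rather than a join. A toy adjacency-list sketch (a real graph database such as Neo4j persists and indexes this structure; the names and relationship types are made up):

```python
# Nodes and labeled relationships in a toy social graph.
graph = {
    "Alice": [("KNOWS", "Bob")],
    "Bob":   [("KNOWS", "Carol"), ("WORKS_WITH", "Alice")],
    "Carol": [],
}

def neighbors(node, relation):
    """Traverse one hop along a given relationship type."""
    return [target for rel, target in graph.get(node, []) if rel == relation]

# Adding a node or relationship is a simple append, with no schema migration.
graph["Carol"].append(("KNOWS", "Alice"))

print(neighbors("Bob", "KNOWS"))    # -> ['Carol']
print(neighbors("Carol", "KNOWS"))  # -> ['Alice']
```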

 

References

  • Angwin, J. (2014). Privacy tools: Opting out from data brokers. Pro Publica. Retrieved from https://www.propublica.org/article/privacy-tools-opting-out-from-data-brokers
  • Beckett, L. (2014). Everything we know about what data brokers know about you. Pro Publica. Retrieved from https://www.propublica.org/article/everything-we-know-about-what-data-brokers-know-about-you
  • Bhokare, P., Bhagwat, P., Bhise, P., Lalwani, V., & Mahajan, M. R. (2016). Private cloud using GlusterFS and Docker. International Journal of Engineering Science, 5016.
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, (12th). Pearson Learning Solutions. VitalBook file.
  • Burkle, T., Hain, T., Hossain, H., Dudeck, J., & Domann, E. (2001). Bioinformatics in medical practice: What is necessary for a hospital? Studies in Health Technology and Informatics, (2), 951–955.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th). Pearson Learning Solutions. [Bookshelf Online].
  • Della Valle, E., Dell’Aglio, D., & Margara, A. (2016). Tutorial: Taming velocity and variety simultaneous big data and stream reasoning. Retrieved from https://pdfs.semanticscholar.org/1fdf/4d05ebb51193088afc7b63cf002f01325a90.pdf
  • Dietrich, D. (2013). The genesis of EMC’s data analytics lifecycle. Retrieved from https://infocus.emc.com/david_dietrich/the-genesis-of-emcs-data-analytics-lifecycle/
  • Eichhorn, G. (2014). Why exactly is data auditing important? Retrieved from http://www.realisedatasystems.com/why-exactly-is-data-auditing-important/
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37. Retrieved from: http://www.aaai.org/ojs/index.php/aimagazine/article/download/1230/1131/
  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: Issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404–409). IEEE.
  • Kishore, N., & Sharma, S. (2016). Secure data migration from enterprise to cloud storage – analytical survey. BIJIT – BVICAM’s International Journal of Information Technology. Retrieved from http://bvicam.ac.in/bijit/downloads/pdf/issue15/09.pdf
  • Long, J. C., Cunningham, F. C., & Braithwaite, J. (2013). Bridges, brokers and boundary spanners in collaborative networks: A systematic review. BMC Health Services Research, 13(1), 158.
  • Mainstay. (2016). An economic study of the hyper-scale data center. Castle Rock, CO: Mainstay, LLC. Retrieved from http://cloudpages.ericsson.com/transforming-the-economics-of-data-center
  • Minelli, M., Chambers, M., &, Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. [Bookshelf Online].
  • Ovum (2016). 2017 Trends to watch: Big Data. Retrieved from http://info.ovum.com/uploads/files/2017_Trends_to_Watch_Big_Data.pdf
  • Park, Y., Shankar, M., Park, B. H., & Ghosh, J. (2014, March). Graph databases for large-scale healthcare systems: A framework for efficient data management and data services. In Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on (pp. 12-19). IEEE.
  • Ramachandran, M. & Chang, V. (2016). Toward validating cloud service providers using business process modeling and simulation. Retrieved from http://eprints.soton.ac.uk/390478/1/cloud_security_bpmn1%20paper%20_accepted.pdf
  • Richards, N. M., & King, J. H. (2014). Big Data Ethics. Wake Forest Law Review, 49, 393–432.
  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, (1st). [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, (1st). [Bookshelf Online].
  • Techopedia (n.d.). Data audit. Retrieved from https://www.techopedia.com/definition/28032/data-audit
  • Tsesis, A. (2014). The right to erasure: Privacy, data brokers, and the indefinite retention of data. Wake Forest Law Review, 49, 433.
  • Vardarlier, P., & Silahtaroglu, G. (2016). Gossip management at universities using big data warehouse model integrated with a decision support system. International Journal of Research in Business and Social Science, 5(1), 1–14. doi: http://doi.org/10.1108/17506200710779521

Healthcare as a Service (HaaS) – cloud solution

Health Cloud – Healthcare as a Service (HaaS) case study (John & Shenoy, 2014):

The goal of this study is to provide a framework for building Health Cloud, a healthcare system that helps solve some of the issues currently dealt with in the healthcare data analytics field, especially when paper images and data are limited to one healthcare provider’s facility until they are faxed, scanned, or mailed. The Health Cloud will be able to store and index medical data, process images, generate reports, chart, perform trend analysis, and be secured with identification and access control. The image processing capabilities of Health Cloud enable better diagnosis of a patient’s medical condition. The image processing structure was built in C++; requesting data and reporting out are done in Binary JSON (BSON) or text formats. Finally, the system built allows for the image to be framed, visualized, panned, zoomed, and annotated.
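
As a rough illustration of the BSON/text interface described above, here is a hypothetical JSON request for a stored image; the action name, field names, and values are invented for this sketch and are not from John and Shenoy (2014).

```python
import json

# Hypothetical request for a stored medical image in text (JSON) format;
# BSON would be the binary equivalent mentioned in the case study.
request = {
    "action": "get_image",
    "patient_id": "P-1024",                   # invented identifiers
    "study": "chest-ct-2014-03",
    "view": {"pan": [120, 80], "zoom": 2.0},  # framing/visualization state
    "annotations": True,
}
payload = json.dumps(request)
print(payload)
# The C++ backend would parse this payload, apply the pan/zoom, and
# return the framed, annotated image along with any report text.
```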

Issues related to health care data on the cloud (John & Shenoy, 2014):

  1. The amount of MRI data has doubled in a decade, and CT data has increased by 50%, increasing the number of images primary providers request on their patients to improve and inform patient care. Thus, there is a need for hyper-scale cloud features.
  2. The Health Insurance Portability and Accountability Act (HIPAA) requires data to be stored for six years after a patient has been discharged, further increasing the volume of data. Consequently, there is another need for hyper-scale cloud features.
  3. Healthcare providers should be able to share medical data from anywhere and at any time per the Health Information Technology for Economic and Clinical Health Act (HITECH) and the American Recovery and Reinvestment Act (ARRA), which aim to reduce duplication of data and improve data quality and access. HIPAA has created security regulations on data backup, recovery, and access. Hence, there is a need for a community cloud provider familiar with HIPAA and other regulations.
  4. Each hospital system is developed in silos or purchased from different suppliers. Thus, if data is shared, it may not be in a format that is easily received by the other systems; a common architecture and data model must be developed. This can be resolved under a community cloud.
  5. Seamless access to the data stored in the cloud must be created across various mobile platforms. Thus, a cloud-provided option such as Software as a Service (SaaS) may best fit this requirement.
  6. Healthcare workflows are better managed in cloud-based solutions than in paper-based ones.
  7. Cloud capabilities can be used for processing data, depending on what is purchased from which supplier.

Pros and Cons of healthcare data on the public or private cloud:

On-site private clouds can have limited storage space, and the data may not be in a format that is easily transferable to other on-site private clouds (Bhokare et al., 2016). Upgrade, maintenance, and infrastructure costs fall 100% on the healthcare provider. Although these clouds are expensive, they offer the provider the most control over its data and more control over the specialization of reports.

Public clouds distribute the cost of upgrades, maintenance, and infrastructure across all parties requesting the servers (Connolly & Begg, 2014). However, the servers may not be specialized 100% to all regulatory and legal specifications, or the servers could carry additional regulatory and legal specifications not advantageous to the healthcare cloud system. Also, public cloud servers are shared with other companies, which can leave healthcare providers feeling vulnerable about their data’s security within the public cloud (Sumana & Biswal, 2016).

The solution should be a private or public community cloud. A community cloud environment is a cloud that is shared exclusively by a set of companies that share similar characteristics, compliance, security, jurisdiction, etc. (Connolly & Begg, 2014). Thus, the infrastructure of all of these servers and grids meets industry standards and best practices, with the cost of the infrastructure shared across the community. Certain community services would be optimized for HIPAA, HITECH, ARRA, etc., with little overhead to the individual IT teams that make up the overall community (John & Shenoy, 2014).

References

  • Bhokare, P., Bhagwat, P., Bhise, P., Lalwani, V., & Mahajan, M. R. (2016). Private cloud using GlusterFS and Docker. International Journal of Engineering Science, 5016.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th). Pearson Learning Solutions. [Bookshelf Online].
  • John, N., & Shenoy, S. (2014). Health cloud – Healthcare as a service (HaaS). In Advances in Computing, Communications and Informatics (ICACCI), 2014 International Conference on (pp. 1963–1966). IEEE.
  • Sumana, P., & Biswal, B. K. (2016). Secure privacy protected data sharing between groups in public cloud. International Journal of Engineering Science, 3285.

Cloud computing and big data

High-performance computing is where a cluster or grid of servers or virtual machines is connected by a network for distributed storage and workflow (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli, Chambers, & Dhiraj, 2013). Parallel computing environments draw on the distributed storage and workflow of that cluster or grid of servers or virtual machines for processing big data (Bhokare et al., 2016; Minelli et al., 2013). NoSQL databases are beneficial because they provide a data model for applications that require little code and less debugging, run on clusters, handle large-scale data stored across distributed systems, use parallel processing, and evolve with time (Sadalage & Fowler, 2012). Cloud technology is the integration of data storage across a distributed set of servers or virtual machines, through either traditional relational database systems or NoSQL database systems, while allowing for data preprocessing and processing through parallel processing (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli et al., 2013; Sadalage & Fowler, 2012).
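
The parallel-processing pattern is simple to sketch on a single machine: a minimal, hypothetical word count fanned out across worker processes, standing in for the map-reduce style workflow a real cluster or grid would run. Python’s `multiprocessing` parallelizes across local cores; frameworks such as Hadoop or Spark extend the same pattern across servers.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Map step: count words in one partition of the data."""
    return Counter(chunk.split())

if __name__ == "__main__":
    # Hypothetical partitions of a large corpus, one per storage node.
    partitions = [
        "big data needs distributed storage",
        "distributed workflow needs parallel processing",
        "parallel processing needs big data",
    ]
    with Pool() as pool:
        partial_counts = pool.map(count_words, partitions)  # parallel map
    total = sum(partial_counts, Counter())                  # reduce step
    print(total.most_common(3))
```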

Clouds come in different flavors depending on how much the organization and the supplier each want to manage: Infrastructure as a Service, Platform as a Service, and Software as a Service (Connolly & Begg, 2014). This makes enterprise IT act as a broker across the various cloud options. Also, organizations must analyze exactly how and where data are stored to ensure compliance with various national and international data rules and regulations, and to preserve data privacy, whatever the type of cloud in use: public, community, private, or hybrid (Minelli et al., 2013; Connolly & Begg, 2014).

Public cloud environments are where a supplier provides a company with a cluster or grid of servers through the internet, such as Spark on AWS EC2 (Connolly & Begg, 2014; Minelli et al., 2013). Cloud computing can be thought of as a set of building blocks. The company can grow or shrink the number of servers and services dynamically when needed, which allows it to request the right amount of services for its data collection, storage, preprocessing, and processing needs (Bhokare et al., 2016; Minelli et al., 2013; Sadalage & Fowler, 2012). This allows the company to purchase the services it needs without having to purchase the infrastructure to support the services it might think it will need. It also allows for hyper-scaling computing in a distributed environment, also known as hyper-scale cloud computing, where the volume of and demand for data can explode exponentially and still be accommodated cost-efficiently in a public, community, private, or hybrid cloud (Mainstay, 2016; Minelli et al., 2013).
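
A hedged sketch of the grow-or-shrink idea using the AWS SDK for Python (boto3): the AMI ID is a placeholder, credentials are assumed to be configured, and a production setup would more likely use an Auto Scaling group. The point is only that capacity is requested on demand rather than purchased up front.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def scale_to(desired_workers):
    """Launch the requested number of worker servers on demand."""
    return ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
        InstanceType="m5.large",
        MinCount=desired_workers,
        MaxCount=desired_workers,
    )

# Grow for a nightly batch job, then shrink back down afterwards.
response = scale_to(4)
instance_ids = [i["InstanceId"] for i in response["Instances"]]
ec2.terminate_instances(InstanceIds=instance_ids)
```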

Data storage and sharing are key components of using enterprise public clouds (Sumana & Biswal, 2016). However, it should be noted that data stored in the public cloud may sit on the same servers as, quite possibly, a competitor’s data, so data security is an issue. Sumana and Biswal (2016) proposed a key-aggregate cryptosystem, where the enterprise holds the master key for all its enterprise files; going a layer deeper, users can have other data encrypted to send within the enterprise without needing to know the enterprise file key. This proposed solution for data security in a public cloud allows for end-user registration, end-user revocation, file generation and deletion, and file access and traceability.
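
The key-aggregate construction itself is not reproduced here; the sketch below shows the simpler envelope-encryption pattern it resembles, using the `cryptography` package: a master key held by the enterprise wraps per-file keys, so a file can be shared by handing over its wrapped key without exposing the master key.

```python
from cryptography.fernet import Fernet

# Enterprise-held master key (kept in an HSM or key service in practice).
master = Fernet(Fernet.generate_key())

def encrypt_file(data):
    """Encrypt data under a fresh per-file key; wrap that key with the master."""
    file_key = Fernet.generate_key()
    ciphertext = Fernet(file_key).encrypt(data)
    wrapped_key = master.encrypt(file_key)  # only the enterprise can unwrap
    return ciphertext, wrapped_key

def decrypt_file(ciphertext, wrapped_key):
    file_key = master.decrypt(wrapped_key)
    return Fernet(file_key).decrypt(ciphertext)

ct, wk = encrypt_file(b"hypothetical enterprise file contents")
assert decrypt_file(ct, wk) == b"hypothetical enterprise file contents"
```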

A community cloud environment is a cloud that is shared exclusively by a set of companies that share similar characteristics, compliance, security, jurisdiction, etc. (Connolly & Begg, 2014). Thus, the infrastructure of all of these servers and grids meets industry standards and best practices, with the cost of the infrastructure shared across the community.

Private cloud environments have a similar infrastructure to a public cloud, but the infrastructure holds the data of one company exclusively, and its services are shared across the different business units of that one company (Connolly & Begg, 2014; Minelli et al., 2013). An organization may already have all the components needed to build a cloud through various on-premise computing resources, and thus may build a cloud system using open source code on its internal infrastructure; this is called an on-premise private cloud (Bhokare et al., 2016). The benefit of the private cloud is full control of your data, and the cost of the servers is spread across all the business units, but the infrastructure costs (initial, upgrade, and maintenance costs) are borne entirely by the company.

Hybrid clouds are two or more cloud structures that have either a private, community, or public aspect to them (Connolly & Begg, 2014). This allows some data to be retained in-house if need be, reducing the capital expenditure for the internal cloud infrastructure, while other data is stored externally, where the cost of the infrastructure is not directly felt by the organization.

References

  • Bhokare, P., Bhagwat, P., Bhise, P., Lalwani, V., & Mahajan, M. R. (2016). Private cloud using GlusterFS and Docker. International Journal of Engineering Science, 5016.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th). Pearson Learning Solutions. [Bookshelf Online].
  • Mainstay. (2016). An economic study of the hyper-scale data center. Castle Rock, CO: Mainstay, LLC. Retrieved from http://cloudpages.ericsson.com/transforming-the-economics-of-data-center
  • Minelli, M., Chambers, M., &, Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. [Bookshelf Online].
  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, (1st). [Bookshelf Online].
  • Sumana, P., & Biswal, B. K. (2016). Secure privacy protected data sharing between groups in public cloud. International Journal of Engineering Science, 3285.