Adv DB: Data Services in the Cloud Service Platform

Rimal et al. (2009) state that cloud computing systems are a disruptive service that has gained momentum. What makes them disruptive is that they offer properties similar to prior technologies while adding new features and capabilities (such as big data processing) in a very cost-effective way. Cloud computing has become part of the XaaS family ("X as a Service," where X can be infrastructure, hardware, software, etc.). According to Connolly & Begg (2014), Data as a Service (DaaS) and Database as a Service (DBaaS) are considered cloud-based solutions. DaaS does not use SQL interfaces, but it does enable corporations to access data to analyze value streams they own or can easily expand into. DBaaS must be continuously monitored and improved because it usually serves multiple organizations, with the added benefit of providing charge-back functions per organization (Connolly & Begg, 2014) or a pay-for-use model (Rimal et al., 2009). However, one must pick the solution that best serves the organization's needs.

Benefits of the service

Connolly & Begg (2014) state that cloud computing offers several benefits: cost reduction due to lower capital expenditure (CapEx), the ability to scale up or down based on data demands, stronger security (often making data stored in the cloud more secure than in-house solutions), 24/7 reliability, faster development time because effort is not spent building the DBMS from scratch, and, finally, sensitivity/load testing that is readily available and cheap thanks to that same lower CapEx. Rimal et al. (2009) attribute the benefits to lower costs, improved flexibility and agility, and scalability: systems can be set up as quickly as needed under a pay-as-you-go model, which is also valuable for disaster recovery (as long as the disaster affects your systems, not the provider's). Other benefits appear in an article on Data-as-a-Service in the health field: low cost of implementing and maintaining databases, defragmentation of data, exchange of patient data across the health provider organization, and a means of standardizing data types, forms, and capture frequency. From a health-care perspective, this could support research and strategic planning of medical decisions, improve data quality, reduce cost, reduce IT resource-scarcity issues, and ultimately provide better patient care (AbuKhousa et al., 2012).

Problems that can arise because of the service

Unfortunately, there are two sides to a coin. Cloud services introduce network dependency: if the supplier has a power outage, the consumer loses access to their data. Other network dependencies occur during peak service hours, when service tends to be degraded compared to off-peak hours. Quite a few organizations use Amazon EC2 (Rimal et al., 2009) as their cloud DBMS, so if that system is hacked or its security is breached, the problem is far bigger than a breach at a single company. There are also system dependencies, for example in disaster recovery (AbuKhousa et al., 2012): organizations are only as strong as their weakest link, and if the point of failure is in the service and there are no other mitigation plans, the organization may have a hard time recuperating its losses. Also, by placing data into these services, the organization loses a degree of control over the data (availability, reliability, maintainability, integrity, confidentiality, intervenability, and isolation) (Connolly & Begg, 2014). Rimal et al. (2009) gave clear examples of outages in services such as Microsoft (down 22 hours in 2008) and Google Gmail (2.5 hours in 2009). This lack of control is perhaps one of the main reasons certain government agencies have had a hard time adopting commercial cloud service providers, although they are attempting to create internal clouds.

Architectural view of the system

The overall cloud architecture is a layered system that serves as a single point of contact and delivers software applications over the web, running on an infrastructure that draws on the necessary hardware resources to complete a task (Rimal et al., 2009). The figure below, adapted from Rimal et al. (2009), is what these authors describe as the layered cloud architecture.

  • Software-as-a-Service (SaaS)
  • Platform-as-a-Service (PaaS): the layer where developers implement cloud applications
  • Infrastructure-as-a-Service (IaaS): (virtualization, storage, networks) as-a-service
  • Hardware-as-a-Service (HaaS)

Figure 1. A layered cloud architecture (adapted from Rimal et al., 2009), from SaaS at the top down to HaaS at the bottom.

A cloud DBMS can be private (an internal provider holds the data within the organization), public (an external provider manages resources dynamically across its system and across the multiple organizations supplying it data), or hybrid (a mix of internal and external providers), as defined in Rimal et al. (2009).

A problem with DBaaS is that databases belonging to multiple organizations are stored by the same service provider. Since data can be what separates one organization from its competitors, they must consider the following architectures: separate servers; shared server but separate database processes; shared server but separate databases; shared database but separate schemas; or shared database and shared schema (Connolly & Begg, 2014).

Let's take two competitors: Walmart and Target, two giant organizations with enormous sums of money and inventory flowing in and out of their systems monthly. Let's also assume three database tables with automated primary keys and foreign keys that connect the data together: Product, Purchasing, and Inventory. Two further assumptions: (1) their data can be stored by the same DBaaS provider, and (2) that DBaaS provider is Amazon.

While Target and Walmart may use the same DBaaS supplier, they can select one of those five architectural solutions to keep their valuable data from being seen by the other. If Walmart and Target purchased separate servers, their data would be safest; they might also go this route if they need to store huge volumes of data and/or support many users. Narrowing the assumption to the children's toy section (to break the datasets into manageable chunks), both Target and Walmart could store their data on a shared server but in separate database processes, sharing no resources like memory or disk, only the virtualized environment. A shared server with separate databases would allow better resource management between the two organizations. If they decided to use the same database (which is unlikely) but hold separate schemas, strong access-management controls would be required; this option might be elected between CVS and Walgreens pharmacy databases, where patient data is vital but not impossible to move from one schema to another, although such cooperation is highly unlikely. The final structure, sharing both database and schema, is highly unlikely for most corporations; it is best suited to hospitals sharing patient data (AbuKhousa et al., 2012), though HIPAA must still be observed under this architecture. Though this is the desired state for some hospitals, it may take years to reach a full system. Security is a big issue here and will take many development hours, but in the long run it is the cheapest solution available (Connolly & Begg, 2014).
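
To make the trade-off concrete, the following is a minimal, hypothetical sketch (in Python, using SQLite purely as a stand-in for a cloud database) of the shared-database, shared-schema option, where a tenant identifier is the only thing separating one organization's rows from another's. The table, column, and tenant names are invented for illustration and are not from any of the cited sources.

```python
# Minimal sketch of the "shared database, shared schema" multi-tenancy option:
# every tenant's rows live in the same table, separated only by a tenant_id
# column that every query must filter on. Names here are placeholders.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        tenant_id  TEXT NOT NULL,      -- e.g. 'walmart' or 'target'
        name       TEXT NOT NULL,
        unit_price REAL NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO product (tenant_id, name, unit_price) VALUES (?, ?, ?)",
    [("walmart", "Toy Robot", 19.99), ("target", "Toy Robot", 21.49)],
)

def products_for(tenant: str):
    # Access control hinges on always applying the tenant filter;
    # omitting it would expose a competitor's rows.
    cur = conn.execute(
        "SELECT name, unit_price FROM product WHERE tenant_id = ?", (tenant,)
    )
    return cur.fetchall()

print(products_for("walmart"))   # [('Toy Robot', 19.99)]
```

The cheapness of this option comes from sharing everything; the cost is that data isolation depends entirely on disciplined access management rather than physical separation.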

Requirements on services that must be met

Looking at Amazon's cloud database offerings, they should be easy to set up, easy to operate, and easy to scale. The system should offer better availability and reliability for live databases than in-house solutions. There should be built-in software to back up the database so it can be recovered at any time. Security patching and maintenance of the underlying hardware and software of the cloud DBMS should largely be the provider's responsibility rather than a burden placed on the organization. The goal of the cloud DBMS should be to shift development effort away from managing the DBMS and toward the applications and the data stored in the system. The provider should also offer infrastructure and hardware as a service to reduce the overhead of managing those systems. Amazon's Relational Database Service can use MySQL, Oracle, SQL Server, or PostgreSQL databases; Amazon's DynamoDB is a NoSQL database service; and Amazon's Redshift costs less than $1K per terabyte of data stored per year (Varia & Mathew, 2013).
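
As a hedged illustration of the "easy to set up" claim, the snippet below sketches how a managed RDS instance might be provisioned with the AWS SDK for Python (boto3). It assumes boto3 is installed and AWS credentials are configured; the identifiers, instance class, and password are placeholders, and a real deployment would also configure networking, security groups, and backup policy in more detail.

```python
# Hypothetical sketch: provisioning a managed MySQL instance on Amazon RDS
# with boto3. All identifiers and credentials below are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

response = rds.create_db_instance(
    DBInstanceIdentifier="example-inventory-db",  # placeholder name
    Engine="mysql",                               # RDS also supports Oracle, SQL Server, PostgreSQL
    DBInstanceClass="db.t3.micro",                # small class, for illustration only
    AllocatedStorage=20,                          # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me-immediately",   # never hard-code real credentials
    BackupRetentionPeriod=7,                      # keep automated backups for one week
    MultiAZ=False,
)
print(response["DBInstance"]["DBInstanceStatus"])  # typically 'creating'
```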

Issues to be resolved during implementation

Rimal et al. (2010) identified several things to consider before implementing a cloud DBMS. What are the Service-Level Agreements (SLAs) of the cloud DBMS supplier? The cloud DBMS may be up and running 24x7, but if the supplier experiences a power outage, what do its SLAs promise so the organization is not impacted? What is the backup/replication scheme? What is the discovery scheme (which assists in reusability)? How is load balancing handled (to avoid bottlenecks), especially since most suppliers cater to more than one organization? What does the resource management plan look like? Most cloud DBMSs keep several copies of the data spread across several servers, so how does the vendor ensure no data loss? What types of security are provided? How strong is the encryption and decryption for data held on its servers? How private will the organization's data be if it is hosted on the same server, or in the same database but a separate schema? What are the authorization and authentication safeguards? Varia and Mathew (2013) describe all the cloud DBMS services that Amazon provides, and these questions should be addressed for each of those solutions. Thus, when evaluating a cloud DBMS supplier, having technical requirements that meet the Business Goals & Objectives (BG&Os) helps guide the organization toward the right supplier and solution, given the issues that need to be resolved. Other issues were identified by Chaudhuri (2012): data privacy through access control, auditing, and statistical privacy; support for data exploration to enable deeper analytics; data enrichment with Web 2.0 and social media; query optimization; scalable data platforms; manageability; and performance isolation for multiple organizations occupying the same server.

Data migration strategy & Security Enforcement

When migrating an organization's data into a cloud DBMS, the taxonomy of the data must be preserved. Along with taxonomy, one must ensure that no data is lost in the transfer, that data remains available to the end-user before, during, and after the migration, and that the transfer is done in a cost-efficient way (Rimal et al., 2010). Furthermore, migration should be seamless and efficient enough that the organization could later move between suppliers, so that no supplier becomes so entangled with the organization that it is the only solution the organization can see itself using. Finally, what type of data should be migrated? Mission-critical data may be too precious for certain types of cloud DBMS, yet well suited to a virtualized disaster-recovery system. Which data to migrate depends on why the cloud system is needed in the first place and which services the organization wants to pay for, as it goes, now and in the near future. The type of data to migrate may also depend on the security features provided by the supplier.
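
One of the concerns above, that no data is lost in the transfer, can be checked with simple reconciliation: compare row counts and content fingerprints on both sides after the copy. The sketch below is a minimal, hypothetical illustration using SQLite connections as stand-ins for the on-premise source and the cloud target; the file paths and table names are assumptions, not any vendor's migration tooling.

```python
# Minimal post-migration reconciliation sketch: compare row counts and an
# order-independent content hash between the source and the migrated copy.
# Connections and table names are placeholders for illustration.
import hashlib
import sqlite3

def table_fingerprint(conn: sqlite3.Connection, table: str):
    """Return (row_count, hex digest) for a table, independent of row order."""
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    digest = hashlib.sha256()
    for encoded_row in sorted(repr(r).encode() for r in rows):
        digest.update(encoded_row)
    return len(rows), digest.hexdigest()

source = sqlite3.connect("source.db")     # on-premise database (placeholder path)
target = sqlite3.connect("migrated.db")   # copy loaded into the cloud DBMS (placeholder path)

for table in ("product", "purchasing", "inventory"):
    if table_fingerprint(source, table) != table_fingerprint(target, table):
        print(f"Mismatch detected in {table}: investigate before cutover.")
```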

Organizational information is a vital resource, and controlling access to it and keeping it proprietary is key. If security is not enforced, data such as employee Social Security numbers or past customers' credit card numbers can be compromised. Rimal et al. (2009) compared the security mechanisms of the cloud systems current at the time:

  • Google: uses 128-bit or higher server authentication
  • GigaSpaces: SSH tunneling
  • Microsoft Azure: Token services
  • OpenNebula: Firewall and virtual private network tunnel

In 2010, Rimal et al. expanded these security considerations, suggesting that organizations should look into authentication and authorization protocols (access management), privacy and federated data, encryption and decryption schemes, and so on.
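
As a hedged illustration of one such encryption scheme, the sketch below encrypts a sensitive field on the client side, before it ever reaches the supplier, using the third-party `cryptography` package's Fernet recipe (symmetric, authenticated encryption). This is a generic example, not any of the providers' mechanisms listed above, and key storage and rotation are deliberately left out.

```python
# Hypothetical sketch: encrypt a sensitive field locally before storing it in
# a cloud DBMS, so the supplier only ever sees ciphertext.
# Requires the third-party 'cryptography' package (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, keep this in a key-management system
fernet = Fernet(key)

ssn_plaintext = b"123-45-6789"       # fake value, for illustration only
ssn_ciphertext = fernet.encrypt(ssn_plaintext)

# Store ssn_ciphertext in the cloud database; decrypt only on trusted clients.
assert fernet.decrypt(ssn_ciphertext) == ssn_plaintext
```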

References

  • AbuKhousa, E., Mohamed, N., & Al-Jaroodi, J. (2012). e-Health cloud: opportunities and challenges. Future Internet, 4(3), 621-645.
  • Chaudhuri, S. (2012, May). What next?: a half-dozen data management research goals for big data and the cloud. In Proceedings of the 31st symposium on Principles of Database Systems (pp. 1-4). ACM.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf version]. Retrieved from http://online.vitalsource.com/books/9781323135761/epubcfi/6/210
  • Rimal, B. P., Choi, E., & Lumb, I. (2009, August). A taxonomy and survey of cloud computing systems. In INC, IMS and IDC, 2009. NCM '09. Fifth International Joint Conference on (pp. 44-51). IEEE.
  • Rimal, B. P., Choi, E., & Lumb, I. (2010). A taxonomy, survey, and issues of cloud computing ecosystems. In Cloud Computing (pp. 21-46). Springer London.
  • Varia, J., & Mathew, S. (2013). Overview of Amazon Web Services. January 2014.

Adv Topics: Security Issues with Cloud Technology

Big data requires huge amounts of resources to analyze it for data-driven decisions; thus, there has been a gravitation toward cloud computing in this era of big data (Sakr, 2014). Cloud technology differs from personal systems in ways that place different demands on cyber security: personal systems can have a single authority, whereas cloud computing systems have no individual owner, have multiple users and group rights, and carry shared responsibility (Brookshear & Brylow, 2014; Prakash & Darbari, 2012). Cloud security can be just as good as or better than that of personal systems, because cloud providers have the economies of scale to fund an information security team that many organizations could not afford on their own (Connolly & Begg, 2014). Cloud security can also be designed to be independently modular, which suits heterogeneous distributed systems (Prakash & Darbari, 2012).

For cloud computing, eavesdropping, masquerading, message tampering, message replay, and denial of service are security issues that should be addressed (Prakash & Darbari, 2012). Sakr (2014) stated that exploitation of co-tenancy, a secure architecture for the cloud, accountability for outsourced data, confidentiality of data and computation, privacy, verifying outsourced computation, verifying capability, cloud forensics, misuse detection, and resource accounting and economic attacks are the big issues for cloud security. This post discusses the exploitation of co-tenancy and the confidentiality of data and computation.

Exploitation of co-tenancy: one issue with cloud security lies in one of its defining properties, that it is a shared environment (Prakash & Darbari, 2012; Sakr, 2014). In a shared environment, people with malicious intent can pretend to be someone they are not in order to gain access, in other words masquerading (Prakash & Darbari, 2012). Once inside, these attackers tend to gather information about the cloud system and the data it contains (Sakr, 2014). The service can also be abused by using its computational resources to carry out denial-of-service attacks on others. Prakash and Darbari (2012) noted that two-factor authentication has been used on personal devices, and that three-factor authentication has been proposed for shared distributed systems. The first two factors are passwords and smart cards; the third can be either biometrics or digital certificates. Digital certificates can be applied automatically to reduce end-user fatigue from multiple authentications (Connolly & Begg, 2014). The third factor helps create a trusted system, so three-factor authentication could primarily mitigate masquerading. Sakr (2014) proposed using a tool that hides the IP addresses of the infrastructure components that make up the cloud, to prevent the cloud from being exploited if entry is granted to a malicious person.
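
The following is a toy, hypothetical sketch of the three-factor idea described above (something you know, something you have, something you are). The stored values and verification functions are invented for illustration and are nothing like production-grade authentication.

```python
# Toy three-factor authentication sketch: password (knowledge), smart-card
# challenge-response (possession), and a biometric check (inherence).
# All stored values and checks here are placeholders for illustration.
import hashlib
import hmac

STORED_PASSWORD_HASH = hashlib.sha256(b"correct horse battery staple").hexdigest()
STORED_CARD_SECRET = b"smart-card-secret"      # would live on the card / HSM
ENROLLED_FINGERPRINT_ID = "fp-7f3a"            # placeholder biometric template ID

def verify_password(password: str) -> bool:
    candidate = hashlib.sha256(password.encode()).hexdigest()
    return hmac.compare_digest(candidate, STORED_PASSWORD_HASH)

def verify_smart_card(challenge: bytes, response: bytes) -> bool:
    expected = hmac.new(STORED_CARD_SECRET, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)

def verify_biometric(fingerprint_id: str) -> bool:
    return fingerprint_id == ENROLLED_FINGERPRINT_ID

def three_factor_login(password, challenge, card_response, fingerprint_id) -> bool:
    # All three factors must pass; failing any one denies access.
    return (verify_password(password)
            and verify_smart_card(challenge, card_response)
            and verify_biometric(fingerprint_id))
```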

Confidentiality of data and computation: if data in the cloud is accessed, malicious people can both gain information and change its content. Data stored on distributed systems is sensitive to its owners, like health-care data, which is heavily regulated for privacy (Sakr, 2014). Prakash and Darbari (2012) suggested that public-key cryptography, software agents, XML binding technology, public-key infrastructure, and role-based access control be used to deal with eavesdropping and message tampering. This essentially hides the data in such a way that it is hard to read without key material stored elsewhere in the cloud system. Sakr (2014) suggested that homomorphic encryption may be needed, but warned that encryption techniques increase cost and degrade performance. Finally, Lublinsky, Smith, and Yakubovich (2013) stated that encrypting the network to protect data in motion is also needed.
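
To make the homomorphic-encryption idea concrete, here is a minimal sketch using the third-party python-paillier package (`phe`), which implements additively homomorphic Paillier encryption. The package choice and the numbers are assumptions for illustration, and, as Sakr (2014) warns, arithmetic on ciphertexts like this is far slower than on plaintext.

```python
# Hypothetical sketch of additively homomorphic encryption with the
# third-party 'phe' (python-paillier) package: the cloud can sum encrypted
# values without ever seeing the plaintexts. (pip install phe)
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Data owner encrypts sensitive values before uploading them.
encrypted_charges = [public_key.encrypt(x) for x in (120.50, 89.99, 300.00)]

# The cloud adds the ciphertexts; it never needs the decryption key.
encrypted_total = encrypted_charges[0] + encrypted_charges[1] + encrypted_charges[2]

# Only the data owner can decrypt the result.
print(private_key.decrypt(encrypted_total))   # approximately 510.49
```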

Overall, a combination of data encryption, hiding the IP addresses of computational components, and three-factor authentication may mitigate some cloud computing security concerns, such as eavesdropping, masquerading, message tampering, and denial of service. However, these techniques increase the time it takes to process big data. Thus, a cost-benefit analysis must be conducted to compare and contrast these methods while balancing data risk profiles against current risk models.

Resources:

  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, (12th ed.). Pearson Learning Solutions. VitalBook file.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. Pearson Learning Solutions. VitalBook file.
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox. VitalBook file.
  • Prakash, V., & Darbari, M. (2012). A review on security issues in distributed systems. International Journal of Scientific & Engineering Research, 3(9), 300–304.
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.

Adv Topics: Graph dataset partitioning in Cloud Computing

When dealing with cloud environments, network management systems are key to processing big data, because some computational components within the network can have bandwidth unevenness (Connolly & Begg, 2014; Sakr, 2014). Chen et al. (2016) stated that cloud infrastructure is often designed as interconnected computational resources in a tree structure, where machines are first grouped into pods, which are then connected to other pods. This unevenness can be defined as differing network performance between computational components within and across racks (Lublinsky, Smith, & Yakubovich, 2013), and it grows as the cloud is upgraded and its computational components become increasingly heterogeneous over time (Chen et al., 2016). Thus, network management systems, which are part of resource pooling in cloud environments, help with data load balancing, routing connections, and diagnosing problems along the chain "computational node - link - computational node - link - computational node ...", where each node is connected to multiple databases through multiple links (Connolly & Begg, 2014).

Network bandwidth unevenness becomes an issue when trying to process graph and network data because (a) the complexity of the data cannot be stored in typical relational database systems, (b) the data is complex to process, and (c) there are scalability issues (Chen et al., 2016). Sakr (2014) stated there are two ways to scale: vertical scaling, where researchers find the few computational components that can handle huge data loads, or horizontal scaling, where researchers spread the data across multiple computational components. Chen et al. (2016) and Sakr (2014) recommended the following two large-scale graph dataset partitioning methods for cloud computing systems:

  • Machine graph: models the data by treating each computational node as a vertex, such that the interconnectivity between computational nodes represents the connections between graph data elements. A benefit of this design is that it does not require knowledge of the network topology and can be built by measuring the network bandwidth between computational nodes.
  • Partition sketch: the graph dataset is partitioned to form a tree structure, and each computational node addresses a single level of the tree. From the perspective of the partitioned tree, the root node holds the input graph, non-leaf nodes hold partition iteration data, and leaf nodes hold the partitioned graph data points. The tree is built iteratively, with the goal of reducing the number of cross-partition edges between leaf nodes. Design-wise, each partition sketch is locally optimized, addresses monotonicity (cross leaf-node edges and non-leaf-node depth), and addresses the proximity of parent nodes by defining common parent data between leaf and non-leaf nodes. For instance, data sharing a close common parent should be stored together on high-bandwidth computational nodes, whereas more distant leaf-node data can be stored on low-bandwidth computational nodes. The limitation of this design is that the partition size must be chosen carefully to satisfy monotonicity without creating so many nodes that input/output errors occur when retrieving the data. Thus, more processing nodes should be used to decrease the number of low-bandwidth cross leaf-node edges, but not so many as to cause memory issues. (A toy edge-cut partitioning sketch follows this list.)
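
Neither method above is reproduced as code here, but the underlying objective, cutting as few edges as possible when splitting a graph across machines, can be illustrated with a generic partitioning routine. The snippet below uses NetworkX's Kernighan-Lin bisection as a stand-in (an assumption, not the algorithm from Chen et al., 2016) on a tiny invented graph.

```python
# Toy illustration of graph partitioning that minimizes cross-partition edges,
# using NetworkX's Kernighan-Lin bisection as a generic stand-in for the
# partitioning schemes discussed above. (pip install networkx)
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

# A tiny invented graph: two dense clusters joined by a single bridge edge.
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (1, 3),    # cluster A
                  (4, 5), (5, 6), (4, 6),    # cluster B
                  (3, 4)])                   # bridge between the clusters

part_a, part_b = kernighan_lin_bisection(G)

cut_edges = [(u, v) for u, v in G.edges()
             if (u in part_a) != (v in part_a)]
print(sorted(part_a), sorted(part_b))
print("edges crossing the partition:", cut_edges)   # ideally just the bridge (3, 4)
```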

Resources

  • Chen, R., Weng, X., He, B., Choi, B., & Yang, M. (2016). Network performance aware graph partitioning for large graph processing systems in the cloud. Retrieved from http://www.comp.nus.edu.sg/~hebs/pub/rishannetwork_crc14.pdf
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. Pearson Learning Solutions. VitalBook file.
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox, VitalBook file.
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.

Cloud computing and big data

High-performance computing is where a cluster or grid of servers or virtual machines is connected by a network for distributed storage and workflow (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli, Chambers, & Dhiraj, 2013). Parallel computing environments draw on that distributed storage and workflow across the cluster or grid for processing big data (Bhokare et al., 2016; Minelli et al., 2013). NoSQL databases are beneficial because they provide a data model for applications that requires less code and less debugging, run on clusters, handle large-scale data stored across distributed systems, use parallel processing, and evolve over time (Sadalage & Fowler, 2012). Cloud technology integrates data storage across a distributed set of servers or virtual machines, through either traditional relational database systems or NoSQL database systems, while allowing data preprocessing and processing through parallel processing (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli et al., 2013; Sadalage & Fowler, 2012).

Clouds come in different flavors depending on how much the organization versus the supplier wants to manage: Infrastructure as a Service, Platform as a Service, and Software as a Service (Connolly & Begg, 2014). This makes enterprise IT act as a broker across the various cloud options. Organizations must also analyze exactly how and where data are stored, to ensure compliance with national and international data rules and regulations while preserving data privacy; this varies with the type of cloud used: public, community, private, and hybrid clouds (Minelli et al., 2013; Connolly & Begg, 2014).

Public cloud environments are those where a supplier provides a company a cluster or grid of servers over the internet, such as Spark on AWS EC2 (Connolly & Begg, 2014; Minelli et al., 2013). Cloud computing can be thought of as a set of building blocks: the company can dynamically grow or shrink the number of servers and services as needed, which lets it request the right amount of service for its data collection, storage, preprocessing, and processing needs (Bhokare et al., 2016; Minelli et al., 2013; Sadalage & Fowler, 2012). The company purchases only the services it needs, without buying infrastructure to support services it merely thinks it will need. This enables hyper-scale cloud computing in a distributed environment, where the volume of and demand for data can explode exponentially yet still be accommodated cost-efficiently in a public, community, private, or hybrid cloud (Mainstay, 2016; Minelli et al., 2013).

Data storage and sharing are key components of using enterprise public clouds (Sumana & Biswal, 2016). However, data stored in a public cloud may sit on the same servers as a competitor's data, so data security is an issue. Sumana and Biswal (2016) proposed a key-aggregate cryptosystem in which the enterprise holds the master key for all its files, while at a deeper layer users can encrypt other data to share within the enterprise without needing to know the enterprise file key. This proposed solution for data security in a public cloud allows for end-user registration, end-user revocation, file generation and deletion, and file access and traceability.

A community cloud is a cloud shared exclusively by a set of companies with similar characteristics, compliance needs, security requirements, jurisdiction, etc. (Connolly & Begg, 2014). Thus, the infrastructure of all of these servers and grids meets industry standards and best practices, with the cost of the infrastructure shared by the community.

Private cloud environments have infrastructure similar to a public cloud, but the infrastructure holds the data of one company exclusively, and its services are shared across the different business units of that one company (Connolly & Begg, 2014; Minelli et al., 2013). An organization may already have the components needed to build a cloud from its various on-premise computing resources, and may therefore build one using open-source code on its internal infrastructure; this is called an on-premise private cloud (Bhokare et al., 2016). The benefit of the private cloud is full control of your data, with server costs spread across all the business units, but the infrastructure costs (initial, upgrade, and maintenance) remain within the company.

Hybrid clouds combine two or more cloud structures with private, community, or public aspects (Connolly & Begg, 2014). This allows some data to be retained in-house if need be, reducing the capital expenditure for internal cloud infrastructure, while other data is stored externally, where the infrastructure cost is not directly borne by the organization.

References

  • Bhokare, P., Bhagwat, P., Bhise, P., Lalwani, V., & Mahajan, M. R. (2016). Private cloud using GlusterFS and Docker. International Journal of Engineering Science, 5016.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). Pearson Learning Solutions. [Bookshelf Online].
  • Mainstay, LLC. (2016). An economic study of the hyper-scale data center. Castle Rock, CO. Retrieved from http://cloudpages.ericsson.com/transforming-the-economics-of-data-center
  • Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons P&T. [Bookshelf Online].
  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. [Bookshelf Online].
  • Sumana, P., & Biswal, B. K. (2016). Secure privacy protected data sharing between groups in public cloud. International Journal of Engineering Science, 3285.

Big Data Analytics: Compelling Topics

Big Data and Hadoop:

According to Gray et al. (2005), traditional data management relies on arrays and tables to analyze objects, which can range from financial data, galaxies, proteins, and events to spectra and 2D weather; but when it comes to N-dimensional arrays, there is an "impedance mismatch" between the data and the database. Big data can be N-dimensional and can also vary across time, e.g., text data (Gray et al., 2005). Big data, as its name suggests, is voluminous. Thus, given the massive amounts of data that need to be processed, manipulated, and calculated upon, parallel processing and programming exploit the benefits of distributed systems to get the job done (Minelli, Chambers, & Dhiraj, 2013). Parallel processing makes quick work of a big data set because, rather than having one processor do all the work, the task is split among many processors.

Hadoop's Distributed File System (HDFS) breaks big data into smaller blocks (IBM, n.d.), which can be assembled like Lego bricks across a distributed database system; the data blocks are spread across multiple servers. Hadoop is Java-based and pulls on the data stored on these distributed servers to map key items/objects and then reduce the data down to the query at hand (the MapReduce function). Hadoop is built to deal with big data stored in the cloud.
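
The following is a toy, single-machine sketch of the MapReduce pattern (word counting), written in plain Python rather than on an actual Hadoop cluster; it only illustrates the map-then-group-then-reduce flow that Hadoop distributes across many servers.

```python
# Toy single-machine illustration of the MapReduce pattern (word count).
# Real Hadoop distributes the map and reduce phases across many servers;
# this sketch only shows the shape of the computation.
from collections import defaultdict

documents = ["big data needs parallel processing",
             "hadoop processes big data in parallel"]

# Map phase: emit (key, 1) pairs for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, 'parallel': 2, ...}
```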

Cloud Computing:

Clouds come in three different privacy flavors: public (all customers and companies share the same resources), private (only one group of clients or one company can use particular cloud resources), and hybrid (some aspects of the cloud are public while others are private, depending on data sensitivity). Cloud technology encompasses Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS); these differ in what the company manages versus what the cloud provider manages (Lau, 2011). The cloud differs from the conventional data center, where the company manages it all: application, data, O/S, virtualization, servers, storage, and networking. The cloud is replacing the conventional data center because infrastructure costs are high: for a company, spending that much money on a data center that will be outdated in roughly 18 months (Moore's law) is a constant money sink. Thus, outsourcing the data center infrastructure is the first step of a company's movement into the cloud.

Key Components to Success:

You need buy-in from both leaders and employees when using big data analytics for predictive, prescriptive, or descriptive purposes. Lt. Palmer had to nurture top-down support as well as buy-in from the bottom up (the ranks). It was much harder to get buy-in from more experienced detectives, who felt that the introduction of tools like analytics was a way of telling them to give up their long-standing practices, or even of replacing them. So Lt. Palmer sold Blue PALMS this way: "What's worked best for us is proving [the value of Blue PALMS] one case at a time, and stressing that it's a tool, that it's a complement to their skills and experience, not a substitute." Lt. Palmer got buy-in from a senior, well-respected officer by helping him solve a case. The senior officer had a suspect in mind, and after the data were fed in, the tool produced a ranked list of the 20 most likely suspects; the officer's suspect was in the top five and confessed when apprehended. Working case by case like this built trust among veteran officers and eventually won their buy-in.

Applications of Big Data Analytics:

A result of big data analytics is online profiling. Online profiling uses a person's online identity to collect information about them, their behaviors, their interactions, their tastes, etc., to drive targeted advertising (McNurlin et al., 2008). Profiling has its roots in third-party cookies and has since evolved to include 40 different variables collected from the consumer (Pophal, 2014). Online profiling allows marketers to send personalized, "perfect" advertisements to the consumer, instantly.

Moving from online profiling to studying social media, He, Zha, and Li (2013) theorized that with higher positive customer engagement, customers can become brand advocates, which increases their brand loyalty and drives referrals to their friends; approximately one in three people followed a friend's referral made through social media. This insight came from analyzing social media data from Pizza Hut, Domino's, and Papa John's as they aim to control more of the market share and increase their revenue. But does analyzing people's social media content when they interact with a company do anything to protect their privacy?

HIPAA describes how to de-identify 18 identifiers/variables to help protect people from the ethical issues that can arise from big data. HIPAA legislation is not standardized for all big data applications and cases, but it is good practice. HIPAA is mostly concerned with the health care industry, listing the 18 identifiers that must be de-identified: names, geographic data, dates, telephone numbers, VINs, fax numbers, device IDs and serial numbers, email addresses, URLs, SSNs, IP addresses, medical record numbers, biometric IDs (fingerprints, iris scans, voice prints, etc.), full-face photos, health plan beneficiary numbers, account numbers, any other unique identifying numbers (characteristics, codes, etc.), and certification/license numbers (HHS, n.d.). We must be aware that HIPAA compliance is more a responsibility of the data collector and data owner than of the cloud provider.
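
As a hedged illustration (not a compliant or complete implementation), the sketch below scrubs a few of those identifier types, SSNs, phone numbers, and email addresses, from free text with regular expressions; real de-identification must also cover names, dates, device IDs, and the rest of the list, and typically uses specialized, validated tooling.

```python
# Toy de-identification sketch: mask a few HIPAA identifier types in free text
# with regular expressions. Real de-identification must cover all 18 identifier
# categories and is usually done with specialized, validated tooling.
import re

PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def deidentify(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REMOVED]", text)
    return text

note = "Patient 123-45-6789 can be reached at 555-867-5309 or jane.doe@example.com."
print(deidentify(note))
# Patient [SSN REMOVED] can be reached at [PHONE REMOVED] or [EMAIL REMOVED].
```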

Ethical concerns over genomic data trace back to the Human Genome Project, which began 25 years ago and sequenced the first 3 billion base pairs of the human genome over a 13-year period (Green, Watson, & Collins, 2015). Those 3 billion base pairs amount to about 100 GB uncompressed, and by 2011, 13 quadrillion bases had been sequenced (O'Driscoll et al., 2013). Studying genomic data comes with a whole host of ethical issues. Some are addressed by HIPAA legislation, while others remain unresolved today.

One of the ethical issues mentioned in McEwen et al. (2013): for people who submitted their genomic data 25 years ago, can that data be used today in other studies? What if it were used to help those participants take preventative measures against adverse health conditions? Ethical issues extend beyond privacy and compliance. McEwen et al. (2013) warn that data have been collected for 25 years, and data from 20 years ago may indicate that a participant could suffer an adverse health condition that is preventable. What is the duty of today's researchers to that participant?


Big Data Analytics: Cloud Computing

Clouds come in three different privacy flavors: public (all customers and companies share the same resources), private (only one group of clients or one company can use particular cloud resources), and hybrid (some aspects of the cloud are public while others are private, depending on data sensitivity).

Cloud technology encompasses Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). These types of cloud differ in what the company manages versus what the cloud provider manages. For IaaS, the company manages the applications, data, runtime, and middleware, whereas the provider administers the O/S, virtualization, servers, storage, and networking. For PaaS, the company manages the applications and data, whereas the vendor administers the runtime, middleware, O/S, virtualization, servers, storage, and networking. Finally, for SaaS, the provider manages it all: applications, data, runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2011). This differs from the conventional data center, where the company manages it all.
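
To keep that responsibility split straight, here is a small sketch encoding the paragraph above as a lookup table; the layer names follow Lau (2011), while the dictionary and helper function are invented conveniences, not any vendor's API.

```python
# Responsibility split by service model (per the paragraph above).
# Everything not managed by the company is managed by the cloud provider.
LAYERS = ["applications", "data", "runtime", "middleware",
          "os", "virtualization", "servers", "storage", "networking"]

COMPANY_MANAGES = {
    "on-premise": LAYERS,                                    # conventional data center
    "iaas": ["applications", "data", "runtime", "middleware"],
    "paas": ["applications", "data"],
    "saas": [],                                               # provider manages it all
}

def provider_manages(model: str):
    """Return the layers the cloud provider administers for a given model."""
    company = set(COMPANY_MANAGES[model])
    return [layer for layer in LAYERS if layer not in company]

print(provider_manages("paas"))
# ['runtime', 'middleware', 'os', 'virtualization', 'servers', 'storage', 'networking']
```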

Examples of IaaS are Amazon Web Services, Rack Space, and VMware vCloud.  Examples of PaaS are Google App Engine, Windows Azure Platform, and force.com. Examples of SaaS are Gmail, Office 365, and Google Docs (Lau, 2011).

There are benefits to the cloud's pay-as-you-go business model. One, the company can pay for as much (SaaS) or as little (IaaS) of the service as it needs and for only as much storage as it requires. Two, the company can use an on-demand model, scaling up and down as needed (Dikaiakos, Katsaros, Mehra, Pallis, & Vakali, 2009). For example, if a company would like a development environment for three weeks, it can build one in the cloud for that period and pay for three weeks of service rather than buying a new set of infrastructure and setting up all the libraries. Electing the cloud over buying new infrastructure can greatly speed development across many applications. These models are like renting a car: you rent what you need and pay for what you use (Lau, 2011).

Replacing Conventional Data Center?

Infrastructure costs are really high. For a company, spending that much money on something that will be outdated in roughly 18 months (Moore's law) is a constant money sink. Outsourcing infrastructure is the first step of a company's movement into the cloud. However, companies need to understand the different privacy flavors well: if data is stored in a public cloud, it will be hard to physically destroy the hardware, because doing so would destroy not only your data but other people's and other companies' data as well. Private clouds are best for government agencies that may need or require physical destruction of hardware. Government agencies may even use hybrid structures, keeping private data in private clouds and public material in a public cloud. Companies that contract with the government could migrate to hybrid clouds in the future, and businesses without government contracts could go onto a public cloud. There may always be a need to store some data on a private server, like patents or KFC's 11 herbs and spices recipe, but for the majority of data, the cloud may be a grand place to store and work from.

Note: companies that do venture into a cloud platform for storing data should focus on migrating data and data dictionaries slowly and with uniformity. Data variables should have the same naming convention, one definition, a list of who is responsible for the data, metadata, etc. Migration to a new infrastructure is also a great chance for companies to clean up their data.


Big Data Analytics: Privacy & HIPAA

Since its inception 25 years ago, the Human Genome Project has sequenced the 3 billion base pairs of the human genome (Green, Watson, & Collins, 2015). This project gave rise to a new program, the Ethical, Legal and Social Implications (ELSI) project. ELSI received 5% of the National Institutes of Health budget to study the ethical implications of these data, opening up a new field of study (Green et al., 2015; O'Driscoll, Daugelaite, & Sleator, 2013). Data sharing must occur to leverage the benefits of the genome project and others like it. Poldrack and Gorgolewski (2014) stated that sharing data advances the field in several ways: maximizing the contribution of research subjects, enabling responses to new questions, enabling the generation of new questions, enhancing reproducibility of research results (especially when the data and software used are shared together), providing a test bed for new big data analysis methods, improving research practices (development of a standard of ethics), reducing the cost of doing the science (beyond what is feasible for one scientist), and protecting valuable scientific resources (by indirectly creating a redundant backup for disaster recovery). Allowing genomic data to be shared can present ethical challenges, yet it allows multiple countries and disciplines to come together and analyze data sets to produce new insights (Green et al., 2015).

Richards and King (2014) state that, concerning privacy, we must think in terms of the flow of personal information. Privacy cannot be thought of as binary, with data either private or public, but as a spectrum. Richards and King (2014) argue that data exchanged between two people carries a certain expectation of privacy and can remain confidential, but data is never absolutely private or absolutely public. Not everyone in the world would know or care about every single data point, nor will any data point be kept permanently secret once it is uttered aloud by its source. Thus, Richards and King (2014) state that transparency can help prevent abuse of the data flow. That is why McEwen, Boyer, and Sun (2013) discussed options for open consent (your data can be used for any future research project), broad consent (various possible uses of the data are described, but not universally), and opt-out consent (participants can say what their data should not be used for).

Attempts are being made through the Genetic Information Nondiscrimination Act (GINA) to protect identifying data, out of fear that it could be used to discriminate against a person with a certain genomic indicator (McEwen et al., 2013). Institutional Review Boards and the Common Rule, along with the Office for Human Research Protections (OHRP), provide guidance on the flow of de-identified information. De-identified information can be shared and is valid under the current rules of the Health Insurance Portability and Accountability Act of 1996 (HIPAA) (McEwen et al., 2013). However, fear of losing control of the data flow grows with advances in decryption and de-anonymization techniques (O'Driscoll et al., 2013; McEwen et al., 2013).

Data must be seen and recognized as part of a person's identity, which can be defined as the "ability of individuals to define who they are" (Richards & King, 2014). Thus, the assertion made by O'Driscoll et al. (2013) about how hard it is to protect medical data, given big data and the changing conceptual, definitional, and legal landscape of privacy, is valid. Because of HIPAA, cloud computing is currently on a watch list. Cloud computing can provide a lot of opportunity for cost savings; however, according to O'Driscoll et al. (2013), Amazon's cloud was not HIPAA compliant at the time, hybrid clouds could become HIPAA compliant, and commercial cloud options like GenomeQuest and DNANexus are HIPAA compliant.

However, ethical issues extend beyond privacy and compliance. McEwen et al. (2013) warn that data have been collected for 25 years; what if data from 20 years ago indicate that a participant may suffer an adverse health condition that is preventable? What is the duty of today's researchers to that participant? And how many years back should that duty extend?

Other ethical issues to consider: when it comes to data sharing, how should the researchers who collected the data, but did not analyze it, be positively incentivized? One option is to make them co-authors of any publication built on their data, but that conflicts with standards of authorship (Poldrack & Gorgolewski, 2014).

 

Resources:

  • Green, E. D., Watson, J. D., & Collins, F. S. (2015). Twenty-five years of big biology. Nature, 526.
  • McEwen, J. E., Boyer, J. T., & Sun, K. Y. (2013). Evolving approaches to the ethical management of genomic data. Trends in Genetics, 29(6), 375-382.
  • Poldrack, R. A., & Gorgolewski, K. J. (2014). Making big data open: data sharing in neuroimaging. Nature Neuroscience, 17(11), 1510-1517
  • O’Driscoll, A., Daugelaite, J., & Sleator, R. D. (2013). ‘Big data,’ Hadoop and cloud computing in genomics. Journal of biomedical informatics, 46(5), 774-781.
  • Richards, N. M., & King, J. H. (2014). Big data ethics. Wake Forest L. Rev., 49, 393.