Adv Topics: Security Issues associated with Big Data

The scientific method helps give a framework for the data analytics lifecycle (Dietrich, 2013). Per Khan et al. (2014), the entire data lifecycle consists of the following eight stages:

  • Raw big data
  • Collection, cleaning, and integration of big data
  • Filtering and classification of data usually by some filtering criteria
  • Data analysis which includes tool selection, techniques, technology, and visualization
  • Storing data with consideration of the CAP theorem
  • Sharing and publishing data, while understanding ethical and legal requirements
  • Security and governance
  • Retrieval, reuse, and discovery to help in making data-driven decisions

Prajapati (2013) stated that the entire data lifecycle consists of the following five steps:

  • Identifying the problem
  • Designing data requirements
  • Pre-processing data
  • Data analysis
  • Data visualizing

It should be noted that Prajapati's lifecycle begins by asking what, when, who, where, why, and how about the problem to be solved, rather than diving straight into collecting data. Combining the Prajapati (2013) and Khan et al. (2014) lifecycles therefore provides a better data lifecycle. However, there is one item to point out about the combined lifecycle: the security phase is an abstract phase, because security considerations are involved in several other stages, namely storing data, sharing and publishing data, and retrieving, reusing, and discovering data.

Over time the threat landscape has gotten worse, and thus big data security is a major issue. Khan et al. (2014) describe four aspects of data security: (a) privacy, (b) integrity, (c) availability, and (d) confidentiality. Minelli, Chambers, and Dhiraj (2013) stated that one challenge of data security is understanding who owns and has authority over the data and its attributes: the generator of the data, or the organization collecting, processing, and analyzing it. Carter, Farmer, and Siegel (2014) stated that access to data is important, because if competitors and providers of substitute services or products have access to the same data, then the data gives the company no advantage. Richard and King (2014) argued that a binary notion of data privacy does not exist. Data is never completely private or confidential, nor completely divulged; it lies between these two extremes. Privacy laws should focus on the flow of personal information, with an emphasis on a type of privacy called confidentiality, in which data is agreed to flow to a certain individual or group of individuals (Richard & King, 2014).

Carter et al. (2014) focused on data access, where access management makes data available to certain individuals, whereas Minelli et al. (2013) focused on data ownership. Richard and King (2014), however, were able to tie those two concepts into data privacy. Thus, these data security aspects are interrelated, and data ownership, availability, and privacy impact all stages of the lifecycle. A root cause of security issues in big data is the reliance on dated techniques that are considered best practices but do not lead to action plans for zero-day vulnerabilities; these techniques focus on prevention, perimeter access, and signatures (RSA, 2013). Specifically, attacks such as denial-of-service attacks are a threat to, and a root cause of, data availability issues (Khan et al., 2014). RSA (2013) also reported that a sample of 257 security officials felt that the major challenges to security were a lack of staffing, a large number of false positives that creates too much noise, a lack of security analysis skills, etc. Subsequently, data privacy issues arise from balancing compensation risks, maintaining privacy, and maintaining ownership of the data, similar to a cost-benefit analysis problem (Khan et al., 2014).

One way to address security concerns when dealing with big data access, privacy, and ownership is to place a single-entry-point gateway between the data warehouse and the end users (The Carology, 2013). This gateway is essentially middleware, which helps ensure data privacy and confidentiality by acting on behalf of an individual (Minelli et al., 2013). The gateway should therefore aid in threat detection, help recognize an excessive number of requests for data, which can cause a denial-of-service attack, provide an audit trail, and require no changes to the data warehouse (The Carology, 2013). Thus, the use of middleware can address data access, privacy, and ownership issues. RSA (2013) proposed using data analytics to solve security issues by automating detection and responses, which will be covered in detail in another post.
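
The gateway idea described above can be sketched in a few lines of Python. This is only a minimal illustration, not the design from any of the cited sources: the class name DataGateway, the permissions table, the sliding-window request limit, and the in-memory audit log are all assumptions made for the example, and the backend function simply stands in for the data warehouse, which stays unchanged.

```python
import time
from collections import defaultdict, deque


class DataGateway:
    """Hypothetical single entry point placed in front of a data warehouse."""

    def __init__(self, backend, permissions, max_requests=5, window_seconds=60.0):
        self.backend = backend             # callable: dataset name -> result (stands in for the warehouse)
        self.permissions = permissions     # user -> set of dataset names the user may read
        self.max_requests = max_requests   # requests allowed per user per window
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # user -> timestamps of that user's recent requests
        self.audit_log = []                # entries of (timestamp, user, dataset, outcome)

    def request(self, user, dataset, now=None):
        now = time.monotonic() if now is None else now
        recent = self._recent[user]
        # Drop timestamps that have fallen outside the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        # Too many requests in the window: refuse, to protect availability.
        if len(recent) >= self.max_requests:
            self.audit_log.append((now, user, dataset, "throttled"))
            return None
        # Every non-throttled attempt counts toward the limit, even a denied one.
        recent.append(now)
        if dataset not in self.permissions.get(user, set()):
            self.audit_log.append((now, user, dataset, "denied"))
            return None
        self.audit_log.append((now, user, dataset, "allowed"))
        return self.backend(dataset)
```

Because every request passes through request(), the audit log records who asked for what and with which outcome, and the per-user limit gives a crude guard against request floods. A real deployment would need authentication, persistent logging, and far more careful threat detection.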

Resources:

  • Carter, K. B., Farmer, D., & Siegel, C. (2014). Actionable Intelligence: A Guide to Delivering Business Results with Big Data Fast! John Wiley & Sons P&T. VitalBook file.
  • Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Ali, W. K. M., Alam, M., Shiraz, M., & Gani, A. (2014). Big data: Survey, technologies, opportunities, and challenges. The Scientific World Journal, 2014. Retrieved from http://www.hindawi.com/journals/tswj/2014/712826/
  • Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.

Big Data Analytics: Cloud Computing

Clouds come in three different privacy flavors: public (all customers and companies share the same resources), private (only one company or group of clients can use a particular cloud's resources), and hybrid (some aspects of the cloud are public while others are private, depending on the sensitivity of the data).

Cloud technology encompasses Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). These types of cloud differ in what the company manages versus what the cloud provider manages. For IaaS, the company manages the applications, data, runtime, and middleware, whereas the provider administers the O/S, virtualization, servers, storage, and networking. For PaaS, the company manages the applications and data, whereas the vendor administers the runtime, middleware, O/S, virtualization, servers, storage, and networking. Finally, for SaaS, the provider manages it all: applications, data, runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2011). This differs from conventional data centers, where the company manages it all: applications, data, runtime, middleware, O/S, virtualization, servers, storage, and networking.
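
The split of responsibilities described above can be written down as a small lookup. The following is a rough sketch in Python; the names LAYERS, COMPANY_MANAGED, and responsibilities are made up for this illustration, and the layer ordering (from the application down to the network) is an assumption that lets each model be described by how many layers the company keeps.

```python
# Stack layers ordered from the top (applications) down to the bottom (networking).
LAYERS = [
    "applications", "data", "runtime", "middleware",
    "os", "virtualization", "servers", "storage", "networking",
]

# How many layers, counted from the top, the company itself manages in each model.
COMPANY_MANAGED = {
    "conventional": 9,  # the company manages everything
    "iaas": 4,          # applications, data, runtime, middleware
    "paas": 2,          # applications, data
    "saas": 0,          # the provider manages everything
}


def responsibilities(model):
    """Return which layers the company and the provider manage for a given model."""
    n = COMPANY_MANAGED[model]
    return {"company": LAYERS[:n], "provider": LAYERS[n:]}


for model in COMPANY_MANAGED:
    split = responsibilities(model)
    print(model, "| company:", split["company"], "| provider:", split["provider"])
```

Seen this way, moving from a conventional data center to IaaS, PaaS, and then SaaS just shifts the boundary between company and provider further up the stack.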

Examples of IaaS are Amazon Web Services, Rackspace, and VMware vCloud. Examples of PaaS are Google App Engine, Windows Azure Platform, and force.com. Examples of SaaS are Gmail, Office 365, and Google Docs (Lau, 2011).

One benefit of the cloud is its pay-as-you-go business model. First, a company can pay for as much (SaaS) or as little (IaaS) of the service as it needs, and for only as much space as it requires. Second, the company can adopt an on-demand model, in which it scales up and down as needed (Dikaiakos, Katsaros, Mehra, Pallis, & Vakali, 2009). For example, if a company would like a development environment for 3 weeks, it can build one in the cloud and pay for 3 weeks of service rather than buying new infrastructure and setting up all the libraries. Choosing the cloud over buying new infrastructure can speed up development for many applications. These models are like renting a car: you rent the car you need, and you pay only for what you use (Lau, 2011).

Replacing Conventional Data Center?

Infrastructure costs are really high. For a company to spend that much money on something that will be outdated in 18 months (Moore's law) is a constant money sink. Outsourcing infrastructure is the first step in a company's move into the cloud. However, companies need to understand the different privacy flavors well, because if data is stored in a public cloud, it will be hard to destroy the hardware: doing so would destroy not only your data, but other people's and companies' data as well. Private clouds are best for government agencies, which may need or require physical destruction of the hardware. Government agencies may even use hybrid structures, keeping private data in private clouds and public data in a public cloud. Companies that contract with the government could migrate to hybrid clouds in the future, and businesses without government contracts could move to a public cloud. There may always be a need to store some data on a private server, such as patents or KFC's recipe of 11 herbs and spices, but for the majority of data, I personally think the cloud may be a grand place to store it and work from.

Note: Companies that do venture into a cloud platform for storing data should focus on migrating data and data dictionaries slowly and with uniformity. Each data variable should have the same naming convention, a single definition, a list of who is responsible for the data, metadata, etc. Migrating to a new infrastructure is a great chance for companies to clean up their data.
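
As a rough illustration of the kind of uniformity check mentioned in the note above, the Python sketch below validates data dictionary entries before a migration. Everything in it is an assumption made for the example: the DictionaryEntry fields, the choice of snake_case as the naming convention, and the validate function are hypothetical, not part of any cited source or tool.

```python
import re
from dataclasses import dataclass, field

# Assumed naming convention for this example: lowercase snake_case.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


@dataclass
class DictionaryEntry:
    name: str          # variable name as it will appear after migration
    definition: str    # the single agreed definition of the variable
    owner: str         # who is responsible for the data
    metadata: dict = field(default_factory=dict)  # units, source system, etc.


def validate(entries):
    """Return a list of human-readable problems found in the data dictionary."""
    problems = []
    seen = {}  # name -> first definition encountered
    for entry in entries:
        if not SNAKE_CASE.match(entry.name):
            problems.append(f"{entry.name}: name is not snake_case")
        if not entry.owner.strip():
            problems.append(f"{entry.name}: no owner listed")
        if entry.name in seen and seen[entry.name] != entry.definition:
            problems.append(f"{entry.name}: conflicting definitions")
        seen.setdefault(entry.name, entry.definition)
    return problems
```

Running a check like this on each batch of variables before it is migrated is one way to catch inconsistent names, missing owners, and duplicate definitions early, while the clean-up is still cheap.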

Resources: