Adv Topics: Big data addressing Security Issues

Cybersecurity attacks are limited by their physical path (network connectivity and reachability) and by their structure, which exploits a vulnerability to enable the attack (Xie et al., 2010). Previously, automated systems and tools were implemented to deal with moderately skilled cyber-attackers, and white-hat hackers were used to identify security vulnerabilities, but these measures are not enough to keep up with today’s threats (Peterson, 2012). Preventative measures only deal with newly discovered threats, not those that have yet to be discovered (Fink, Sharifi, & Carbonell, 2011). Both methods are preventative, aiming to protect big data and the cyberinfrastructure that stores and processes it from malicious intent. Setting up preventative measures alone is no longer good enough to protect big data and its infrastructure, so there has been a migration toward real-time analysis of monitored data (Glick, 2013). Real-time analysis is concerned with the question “What is really happening?” (Xie et al., 2010).

If the algorithms used to process big data are pointed toward cyber security, as in Security Information and Event Management (SIEM), they can add another solution for identifying cyber security threats (Peterson, 2012). Big data cyber security analysis will only make security teams react faster when they have the right context for the analysis; it will not make them act more proactively (Glick, 2013). SIEM has gone beyond current cyber security prevention measures, typically by collecting log data in real time as it is generated and processing it in real time with algorithms such as correlation, pattern recognition, behavioral analysis, and anomaly analysis (Glick, 2013; Peterson, 2012). Glick (2013) reported that data from a variety of sources can build a cyber security risk and threat profile in real time for security teams to react to, but this approach works only on small data sets.
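As a toy sketch of the anomaly-analysis step described above (the window size, threshold, and event counts below are all invented for illustration, not taken from any of the cited systems), a streaming detector can flag when a monitored log metric, such as failed logins per minute, jumps far outside its rolling baseline:

```python
from collections import deque
from statistics import mean, stdev

def make_anomaly_detector(window=20, threshold=3.0):
    """Flag a count as anomalous when it deviates from the rolling
    baseline by more than `threshold` standard deviations."""
    history = deque(maxlen=window)

    def check(count):
        anomalous = False
        if len(history) >= 5:  # need a minimal baseline first
            mu, sigma = mean(history), stdev(history)
            anomalous = sigma > 0 and abs(count - mu) > threshold * sigma
        history.append(count)
        return anomalous

    return check

# Hypothetical failed-login counts per minute from a log feed.
detector = make_anomaly_detector()
counts = [4, 5, 3, 6, 4, 5, 4, 6, 5, 4, 80]  # sudden spike at the end
flags = [detector(c) for c in counts]        # only the spike is flagged
```

A real SIEM pipeline would run many such detectors, plus correlation and pattern-matching rules, over log streams far too large for a single process.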

SIEM could not handle the vast volumes of big data, so analyzing the next cyber threats came from tools like Splunk, which identify anomalies in the data (Glick, 2013). SIEM was proposed for use at the Olympic Games, while Splunk was being used for investment banking (Glick, 2013; Peterson, 2012). FireEye is another big data security analytics tool used to identify network threats (Glick, 2013).

  • Xie et al. (2010) proposed the use of Bayesian networks for cyber security analysis. This solution recognizes that cyber security profiles are difficult to construct and inherently uncertain, and the tool was built for near real-time systems, because Bayesian models capture cause-and-effect relationships. Deterministic security models are unrealistic: they do not capture the full breadth of a cyber attack and cannot cover every scenario for real-time analysis. If Bayesian models are built to reflect reality, they can be used for near real-time analysis. In real-time cyber security analysis, analysts must treat an attacker’s choices as unknown, along with whether the attacker will succeed at their targets and goals. Building a modular graphical attack model helps calculate these uncertainties by decomposing the problem into small, finite parts whose parameters can be pre-populated with realistic data. These modular graphical attack models should consider the physical paths in both explicit and abstract form. Thus, the near real-time Bayesian network accounts for the three important uncertainties introduced in a real-time attack. The authors found the method to be robust, as determined by a holistic sensitivity analysis.
  • Fink et al. (2011) proposed a mashup of crowdsourcing, machine learning, and natural language processing to deal with both vulnerabilities and careless end-user actions, for automated threat detection. Their study focused on scam websites and cross-site request forgeries. For scam website identification, the key is crowdsourced end users flagging certain websites as scams; when a new end user approaches a flagged website, a popup appears stating “This website is a scam! Do not provide personal information.” The authors’ solution ties together data from heterogeneous web scam blacklist databases. It achieved high precision (98%) and high recall (98.1%) on a test of 837 manually labeled sites, validated with ten-fold cross-validation against the blacklist databases. The current system’s limitation is that it does not address new or different sets of threats.
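The Bayesian reasoning behind the first bullet can be illustrated with a minimal sketch (the prior, alert likelihoods, and number of alerts are all hypothetical; a real Bayesian attack network models many interdependent variables, not a single one). Bayes’ rule updates the belief that a host is under attack as alerts arrive in near real time:

```python
# Hypothetical numbers for illustration only.
p_attack = 0.05                 # prior belief the host is under attack
p_alert_given_attack = 0.90     # IDS alert likelihood during an attack
p_alert_given_no_attack = 0.10  # false-positive alert rate

def posterior(prior, p_obs_given_h, p_obs_given_not_h):
    """Bayes' rule: P(hypothesis | observation)."""
    num = p_obs_given_h * prior
    return num / (num + p_obs_given_not_h * (1 - prior))

# Each new alert updates the belief, capturing uncertainty about
# the attacker's choices and chances of success.
belief = p_attack
for _ in range(3):  # three independent alerts arrive
    belief = posterior(belief, p_alert_given_attack, p_alert_given_no_attack)
# belief rises from 0.05 toward certainty as evidence accumulates
```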
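A minimal sketch of the blacklist-aggregation idea in the second bullet (the feed names, domains, and 50% vote threshold are hypothetical; the authors’ actual system layers machine learning and natural language processing on top of this):

```python
# Hypothetical blacklist feeds; a real system would pull these from
# heterogeneous scam-reporting databases, as Fink et al. describe.
BLACKLISTS = {
    "feed_a": {"scam-example.test", "phish-example.test"},
    "feed_b": {"scam-example.test"},
    "crowd_flags": {"scam-example.test", "odd-shop.test"},
}

def scam_score(domain):
    """Fraction of sources that flag the domain as a scam."""
    hits = sum(domain in bl for bl in BLACKLISTS.values())
    return hits / len(BLACKLISTS)

def warn(domain, threshold=0.5):
    """Return the popup text when enough sources agree."""
    if scam_score(domain) >= threshold:
        return "This website is a scam! Do not provide personal information."
    return None
```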

These studies and articles illustrate that using big data analytics for cybersecurity analysis provides the following benefits (Fink et al., 2011; Glick, 2013; IBM Software, 2013; Peterson, 2012; Xie et al., 2010):

(a) moving away from preventative cybersecurity toward real-time analysis, so teams can react faster to a current threat;

(b) creating security models that more accurately reflect the reality and uncertainty that exist among physical paths, successful attacks, and the unpredictability of humans, for near real-time analysis;

(c) providing a robust identification technique; and

(d) reducing false positives, which eat up the security team’s time.

These benefits help security teams solve difficult issues in real time. However, applying big data analytics to cybersecurity is a new and evolving field, so many tools are expected to be developed; the most successful will provide real-time cybersecurity data analysis with a large set of algorithms, each aimed at studying a different type of attack. One day, artificial intelligence may even become the next phase of real-time cyber security analysis and resolution.


Adv Topics: Security Issues with Cloud Technology

Big data requires huge amounts of resources to analyze for data-driven decisions, so there has been a gravitation toward cloud computing in this era of big data (Sakr, 2014). Cloud technology places different demands on cyber security than personal systems: personal systems can have a single authority, whereas cloud computing systems have no individual owner, multiple users, group rights, and shared responsibility (Brookshear & Brylow, 2014; Prakash & Darbari, 2012). Cloud security can be just as good as or better than that of personal systems, because cloud providers can have economies of scale that support an information security team many organizations could not afford (Connolly & Begg, 2014). Cloud security can also be designed to be independently modular, which suits heterogeneous distributed systems (Prakash & Darbari, 2012).

For cloud computing, eavesdropping, masquerading, message tampering, message replay, and denial of service are security issues that should be addressed (Prakash & Darbari, 2012). Sakr (2014) stated that exploitation of co-tenancy, a secure architecture for the cloud, accountability for outsourced data, confidentiality of data and computation, privacy, verifying outsourced computation, verifying capability, cloud forensics, misuse detection, and resource accounting and economic attacks are the big issues for cloud security. This post will discuss the exploitation of co-tenancy and the confidentiality of data and computation.

Exploitation of Co-Tenancy: One issue with cloud security lies in one of its defining properties: it is a shared environment (Prakash & Darbari, 2012; Sakr, 2014). Because it is shared, people with malicious intent can pretend to be someone they are not to gain access, in other words masquerading (Prakash & Darbari, 2012). Once inside, they tend to gather information about the cloud system and the data it contains (Sakr, 2014). They can also use the cloud’s computational resources to carry out denial of service attacks on others. Prakash and Darbari (2012) stated that two-factor authentication has been used on personal devices, and for shared distributed systems a three-factor authentication has been proposed. The first two factors are passwords and smart cards; the third can be either biometrics or digital certificates. Digital certificates can be applied automatically to reduce end-user fatigue from multiple authentications (Connolly & Begg, 2014). The third factor helps create a trusted system, so three-factor authentication can largely mitigate masquerading. Sakr (2014) proposed using a tool that hides the IP addresses of the infrastructure components that make up the cloud, to prevent the cloud from being misused if entry is granted to a malicious person.
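A rough sketch of such a three-factor check (the stored credentials, secrets, and challenge-response protocol below are invented for illustration; real smart cards, biometrics, and certificate validation are far more involved):

```python
import hashlib
import hmac

# Hypothetical stored credentials for one user (illustration only).
STORED = {
    "password_hash": hashlib.sha256(b"correct horse").hexdigest(),
    "card_secret": b"smart-card-key",   # shared with the smart card
    "cert_fingerprint": "ab:cd:ef",     # stands in for cert validation
}

def three_factor_ok(password, card_response, challenge, cert_fingerprint):
    """Require all three factors: something you know (password),
    something you have (a smart card answering a challenge), and a
    digital certificate (or biometric) as the third factor."""
    knows = hmac.compare_digest(
        hashlib.sha256(password.encode()).hexdigest(),
        STORED["password_hash"])
    expected = hmac.new(STORED["card_secret"], challenge, "sha256").hexdigest()
    has = hmac.compare_digest(card_response, expected)
    cert = cert_fingerprint == STORED["cert_fingerprint"]
    return knows and has and cert

# A genuine smart card answers the server's challenge with an HMAC.
challenge = b"nonce-123"
card_answer = hmac.new(b"smart-card-key", challenge, "sha256").hexdigest()
ok = three_factor_ok("correct horse", card_answer, challenge, "ab:cd:ef")
```

Failing any one factor denies access, which is what makes masquerading with a single stolen credential much harder.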

Confidentiality of data and computation: If data in the cloud is accessed, malicious people can gain information and change its content. Data stored on distributed systems is sensitive to its owners, like health care data, which is heavily regulated for privacy (Sakr, 2014). Prakash and Darbari (2012) suggested that public key cryptography, software agents, XML binding technology, public key infrastructure, and role-based access control be used to deal with eavesdropping and message tampering. These essentially hide the data so that it is hard to read without key items stored elsewhere in the cloud system. Sakr (2014) suggested homomorphic encryption may be needed, but warned that encryption techniques increase cost and processing time. Finally, Lublinsky, Smith, and Yakubovich (2013) stated that encrypting the network to protect data in motion is also needed.
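Role-based access control, one of the measures suggested above, can be sketched as a deny-by-default permission table (the roles and actions here are hypothetical, chosen to echo the regulated health care example):

```python
# Hypothetical role-based access control table (illustration only).
ROLE_PERMISSIONS = {
    "clinician": {"read_records", "write_records"},
    "billing": {"read_records"},
    "guest": set(),
}

def authorize(role, action):
    """Deny by default; allow only actions explicitly granted
    to the role."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

The deny-by-default design matters in a shared cloud environment: an unknown or misconfigured role gets no access rather than accidental access.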

Overall, a combination of data encryption, hiding the IP addresses of computational components, and three-factor authentication may mitigate some cloud computing security concerns, such as eavesdropping, masquerading, message tampering, and denial of service. However, these techniques increase the time it takes to process big data, so a cost-benefit analysis must be conducted to compare and contrast these methods while balancing data risk profiles and current risk models.


  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, (12th ed.). Pearson Learning Solutions. VitalBook file.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox. VitalBook file.
  • Prakash, V., & Darbari, M. (2012). A review on security issues in distributed systems. International Journal of Scientific & Engineering Research, 3(9), 300–304.
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.

Adv Topics: IP size distributions and detecting network traffic anomalies

In 2011, internet advertising generated over $31B in the US (Sakr, 2014). Much of this revenue comes from contextual advertising, in which online advertisers, search engine optimizers, and sponsored search providers display relevant, context-based ads to engage users and generate revenue (Chakrabarti, Agarwal, & Josifovski, 2008). Thus, understanding the click rates of online advertisers, search engine optimizers, and sponsored search providers can drive online revenue for any business’s products or services (Regelson & Fain, 2006). When a consumer clicks on an ad to decide whether to purchase the product or service, a small amount of money is withdrawn from the company’s online advertising budget (Regelson & Fain, 2006; Sakr, 2014).

This business model is subject to cyber-attacks: a competitor can create an automated piece of code that clicks on the advertising without making a purchase, which eventually depletes the online advertising budget (Sakr, 2014). This automated code usually operates across an IP size distribution, a group of IPs set to target one ad while pretending to be actual consumers, which sounds like a denial of service (DoS) attack (Park & Lee, 2001; Sakr, 2014). However, a DoS attack uses masses of traffic to block services from a website, and the best prevention is to trace back the source and block it (Park & Lee, 2001). Click fraud is slightly different: it does not deny the company’s service or products, it depletes their online advertisement budget, which reduces the company’s online market share.

Sakr (2014) says that IP size distributions are defined by two dimensions, (a) application and (b) time, and that they change over time due to business cycles, flash crowds, etc. IP size distributions are generated in three ways: (a) by legitimate users, (b) by a publisher’s friends, which could include sponsored providers with some fraudulent clicks, and (c) by a bot-master with botnets (Sakr, 2014; Soldo & Metwally, 2012). The goal is to identify the bot-masters, botnets, and fraudulent clicks. Thus, companies need to detect network traffic anomalies based on the IP size distribution:

  • Sakr (2014) and Soldo and Metwally (2012) suggested using anomaly detection algorithms that rely on the current IP size distribution and analyze the data for patterns characteristic of these attacks. These detection methods are robust because they use the characteristics of fraudulent clicks, have low complexity, and can be written to run in parallel on MapReduce. The method assigns a distinct cookie ID for analysis when a click is generated. It uses a regression model and compares IP rates to a Poisson distribution, and it uses an explanatory diversity feature that counts distinct cookies and measures the entropy of that distribution, setting this as the true IP sizes. This information generates explanatory diversity models, which can then be analyzed using quantile regression, linear regression, percentage regression, and principal component analysis. Each analysis has its root mean square error, relative error, and bucket error computed to allow inter-comparison between each model’s results and the true values. This inter-comparison enables detection of anomalous activities because each method measures different properties within the same data. Once IP addresses have been identified as fraudulent, they are flagged.
  • Regelson and Fain (2006) suggested using historical data, when available, to create a reliable prior IP size distribution to compare against current distributions. Although the authors proposed this for studying click-through rates (clicks on an ad leading toward a purchase), it could also be used in this scenario. Using historical data works when there is a wealth of it; when there is little to no historical information, a creative aggregation technique can work instead. This technique clusters less frequent, similar items together with completely novel items to build the historical context needed to construct the historical IP size distribution, and it uses logistic regression analysis. This method could reduce error by about 50% when there was no historical data to compare against.
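Two pieces of the first method can be sketched simply: comparing a per-IP click count against a Poisson baseline, and measuring the entropy of the distinct-cookie distribution (the baseline rate, IP addresses, and cutoff below are invented; the real system combines several regression models and error metrics):

```python
import math
from collections import Counter

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): how surprising k clicks are
    given the expected rate lam."""
    return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i)
                     for i in range(k))

def cookie_entropy(cookie_ids):
    """Shannon entropy of the distinct-cookie distribution; bot traffic
    that reuses few cookies tends toward low entropy."""
    counts = Counter(cookie_ids)
    n = len(cookie_ids)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical per-IP click counts against a baseline of 3 clicks/window.
baseline_rate = 3.0
clicks = {"10.0.0.1": 4, "10.0.0.2": 2, "10.0.0.3": 40}
flagged = [ip for ip, k in clicks.items()
           if poisson_tail(k, baseline_rate) < 0.001]
```

An IP whose click count is wildly improbable under the Poisson baseline, or whose clicks show suspiciously low cookie diversity, becomes a candidate for fraud flagging.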

On further analysis, the strengths of the first method are:

  • no need to obtain personally identifiable information
  • no need to authenticate end-user clicks
  • a fully automated statistical aggregation method that scales linearly using MapReduce
  • creating a legitimate-looking IP size distribution is very difficult

while its limitations are:

  • it requires a large amount of actual click data to create these models
  • colluding with other companies to pool their click data could supply the volume needed, but that data is usually proprietary.

That is why the second method, from Regelson and Fain (2006), was mentioned: it addresses the limitations of the Sakr (2014) and Soldo and Metwally (2012) method.
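The back-off aggregation idea from the second method can be sketched roughly as follows (the terms, clusters, counts, and impression threshold are hypothetical, and the actual paper fits a logistic regression over keyword clusters rather than the simple averaging shown here):

```python
# Hypothetical history: (term, cluster, clicks, impressions).
HISTORY = [
    ("cheap flights", "travel", 30, 1000),
    ("last minute flights", "travel", 20, 400),
    ("running shoes", "sports", 50, 2000),
]

def ctr_estimate(term, cluster, min_impressions=500):
    """Estimate click-through rate, backing off from term-level to
    cluster-level history when the term's data is too sparse."""
    t_clicks = sum(c for t, _, c, _ in HISTORY if t == term)
    t_imps = sum(i for t, _, _, i in HISTORY if t == term)
    if t_imps >= min_impressions:
        return t_clicks / t_imps
    # Sparse or novel term: aggregate over its keyword cluster instead.
    cl_clicks = sum(c for _, cl, c, _ in HISTORY if cl == cluster)
    cl_imps = sum(i for _, cl, _, i in HISTORY if cl == cluster)
    return cl_clicks / cl_imps if cl_imps else 0.0
```

A completely novel term still gets a usable estimate from its cluster, which is exactly the gap in the first method that this approach fills.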


  • Chakrabarti, D., Agarwal, D., & Josifovski, V. (2008). Contextual advertising by combining relevance with click feedback. In Proceedings of the 17th international conference on World Wide Web (pp. 417-426). ACM.
  • Park, K., & Lee, H. (2001). On the effectiveness of probabilistic packet marking for IP traceback under denial of service attack. In INFOCOM 2001. Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE (Vol. 1, pp. 338-347). IEEE.
  • Regelson, M., & Fain, D. (2006). Predicting click-through rate using keyword clusters. In Proceedings of the Second Workshop on Sponsored Search Auctions (Vol. 9623).
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.
  • Soldo, F., & Metwally, A. (2012). Traffic anomaly detection based on the IP size distribution. In INFOCOM, 2012 Proceedings IEEE (pp. 2005-2013). IEEE.