Adv Topics: IP size distributions and detecting network traffic anomalies

In 2011, internet advertising generated over $31B in the US (Sakr, 2014). Much of this revenue comes from contextual advertising, in which online advertisers, search engine optimizers, and sponsored search providers try to improve the user experience and generate revenue by displaying relevant, context-based ads online (Chakrabarti, Agarwal, & Josifovski, 2008). Thus, understanding the click rates of online advertisers, search engine optimizers, and sponsored search providers can help drive online revenue for any business’s products or services (Regelson & Fain, 2006). When a consumer clicks on an ad to decide whether to purchase the product or service, a small amount of money is withdrawn from the company’s online advertising budget (Regelson & Fain, 2006; Sakr, 2014).

This business model is subject to cyber-attacks: a competitor can create an automated piece of code to click on the advertising without making a purchase, which eventually depletes the online advertising budget (Sakr, 2014). This automated code usually shows up in the IP size distribution as a group of IPs set to target one ad while pretending to be actual consumers, which sounds like a Denial of Service (DoS) attack (Park & Lee, 2001; Sakr, 2014). However, a DoS attack aims to block services from a website, and the best way to prevent that situation is to trace the traffic back to its source and block it (Park & Lee, 2001). This scenario is slightly different: it is not denying the company’s services or products, it is depleting their online advertisement budget, which in turn reduces the company’s online market share.

Sakr (2014) says that IP size distributions are defined by two dimensions, (a) application and (b) time, and that they change over time due to business cycles, flash crowds, etc. IP size distributions are generated in three ways: (a) by legitimate users, (b) by a publisher’s friends, which could include sponsored providers with some fraudulent clicks, and (c) by a bot-master with botnets (Sakr, 2014; Soldo & Metwally, 2012). The goal is then to identify the bot-master with botnets and the fraudulent clicks. Thus, companies need to be able to detect network traffic anomalies based on the IP size distribution:

  • Sakr (2014) and Soldo and Metwally (2012) suggested using anomaly detection algorithms, which rely on the current IP size distribution and analyze the data to search for patterns that are characteristic of these attacks. These methods of detection are robust because they use the characteristics of fraudulent clicks, have low complexity, and can be written to run on MapReduce for parallel processing. The method assigns a distinct cookie ID for analysis when a click is generated. The technique uses a regression model and compares IP rates to a Poisson distribution, along with an explanatory diversity feature that counts the distinct cookies and measures the entropy of that distribution, treating this as the true IP sizes. This information is used to generate explanatory diversity models, which can then be analyzed using quantile regression, linear regression, percentage regression, and principal component analysis. Each of these analyses then has its root mean square error, relative error, and bucket error computed to allow inter-comparison between the results of each model and the true value. This inter-comparison allows for the detection of anomalous activities because each method measures different properties within the same data. Once IP addresses have been identified as fraudulent, they are flagged (a minimal sketch of the cookie-diversity idea appears after this list).
  • Regelson and Fain (2006) suggest using historical data, if available, to create a reliable prior IP size distribution to compare against current IP size distributions. Although the authors proposed this for studying click-through rates (clicking on the ad to purchase), it could also be used in this scenario. Using historical data works when there is a wealth of historical information, but when there is little to none, a creative aggregation technique can help. The technique clusters less frequent and similar items, as well as completely novel items, to build the historical context needed to construct the historical IP size distribution, and it relies on a logistic regression analysis. This method reduced error by about 50% when there was no historical data to compare against.
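
The sketch below illustrates only the cookie-diversity part of the first method: score each IP by the entropy of the distinct cookie IDs behind its clicks and flag high-volume, low-diversity IPs. It is a minimal illustration, not the authors' full pipeline (which also uses the Poisson comparison and the regression error analysis); the log format, thresholds, and function names are assumptions made for this example.

```python
import math
from collections import Counter

def cookie_entropy(cookie_ids):
    """Shannon entropy of the distinct-cookie distribution behind one IP.

    A legitimate shared IP (proxy/NAT) tends to show many distinct cookies
    (high entropy); a bot replaying one cookie shows near-zero entropy.
    """
    counts = Counter(cookie_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def flag_suspicious_ips(clicks_by_ip, min_clicks=50, entropy_threshold=1.0):
    """Flag IPs whose click volume is high but whose cookie diversity is low.

    clicks_by_ip: dict mapping IP -> list of cookie IDs observed on clicks.
    The thresholds are illustrative placeholders, not values from the papers.
    """
    flagged = []
    for ip, cookies in clicks_by_ip.items():
        if len(cookies) >= min_clicks and cookie_entropy(cookies) < entropy_threshold:
            flagged.append(ip)
    return flagged

# Example: an IP with 100 clicks from a single cookie gets flagged, while an
# IP with 100 clicks spread over many cookies does not.
logs = {
    "203.0.113.7": ["bot-cookie"] * 100,
    "198.51.100.3": [f"user-{i % 40}" for i in range(100)],
}
print(flag_suspicious_ips(logs))   # ['203.0.113.7']
```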

With further analysis of the first method, the strengths of this method are:

  • that there is no need to obtain personally identifiable information
  • no need to authenticate end user clicks
  • fully automated statistical aggregation method that can scale linearly using MapReduce
  • creating a legitimate-looking IP size distribution is difficult

while the limitations of this method are:

  • It requires a large amount of actual click data to create these models
  • Colluding with other companies to pool their click data could provide the volume of click data needed, but usually that data is proprietary.

That is why the second method, from Regelson and Fain (2006), was mentioned: it addresses the limitations of the Sakr (2014) and Soldo and Metwally (2012) method.

Resources:

  • Chakrabarti, D., Agarwal, D., & Josifovski, V. (2008). Contextual advertising by combining relevance with click feedback. In Proceedings of the 17th international conference on World Wide Web (pp. 417-426). ACM.
  • Park, K., & Lee, H. (2001). On the effectiveness of probabilistic packet marking for IP traceback under denial of service attack. In INFOCOM 2001. Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE (Vol. 1, pp. 338-347). IEEE.
  • Regelson, M., & Fain, D. (2006). Predicting click-through rate using keyword clusters. In Proceedings of the Second Workshop on Sponsored Search Auctions (Vol. 9623).
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.
  • Soldo, F., & Metwally, A. (2012). Traffic anomaly detection based on the IP size distribution. In INFOCOM, 2012 Proceedings IEEE (pp. 2005-2013). IEEE.

Adv Topics: Stochastic Modeling and three-pool Cloud Architecture

There is a need for an easy and effective performance analysis method suited to large Infrastructure as a Service (IaaS) cloud computing environments. Typically, three approaches can be used to conduct performance analysis of any type of target system:

  • Experiment-based performance analysis
  • Discrete event-simulation-based performance analysis
  • Stochastic model-based performance analysis

Experiment-based performance analysis can be cost and time prohibitive as data and processing scales increase in IaaS computing environments, and discrete event simulations take too much time to compute (Sakr, 2014). A scalable, multi-level stochastic model-based performance analysis has therefore been proposed by Ghosh, Longo, Naik, and Trivedi (2012) for IaaS cloud computing. Stochastic analysis and models predict how probable an outcome is; they act as a chaotic counterpart to deterministic models and help in analyzing one or more outcomes that are cloaked in uncertainty (Anadale, 2016; Investopedia, n.d.). The properties of stochastic models consist of (Anadale, 2016):

  1. All aspects of uncertainty should be represented as variables, with a solution space that contains all possible outcomes
  2. Each of the possible variables has a probability distribution attached to it
  3. The distribution of variables is run through thousands of simulations to identify the probability of a preferred key outcome

These steps are essential for gaining information on a variety of outcomes with varying variables and for making data-driven decisions, because stochastic modeling runs thousands of simulations to understand the eccentricities of each variable per outcome (Investopedia, n.d.). Ghosh et al. (2012) selected a stochastic model for performance analysis because it is a low-cost analysis tool that covers a large solution space. Sakr (2014) shared this same solution for big data IaaS cloud environments. The solution consists of a three-pool cloud architecture.

Three-pool cloud architecture

In an IaaS cloud scenario, a request for resources initiates a request for one or more virtual machines to be deployed on a physical machine (Ghosh et al., 2012). The architecture (Figure 1) assumes that physical machines are grouped into hot, warm, and cold pools (Sakr, 2014). This architectural technique allows for cost savings because it minimizes power and cooling costs (Ghosh et al., 2012). A Resource Provisioning Decision Engine (RPDE) handles the incoming queue of requests for computational resources (Ghosh et al., 2012; Sakr, 2014). The RPDE first searches for physical machines in the hot pool on which to build a virtual machine. If all the hot pool resources are currently in use, it searches the warm pool, then the cold pool, and if none are available the request is rejected. This three-pool architecture is scalable and practical for performance analysis of an IaaS cloud computing environment.

The hot pool consists of physical machines that are constantly on, and virtual machines are deployed on them upon request (Ghosh et al., 2012; Sakr, 2014). A warm pool keeps its physical machines in power-saving mode, and a cold pool keeps its physical machines turned off. For both warm and cold pools, setting up a virtual machine is delayed compared to the hot pool, since the physical machines need to be awakened or powered up before a virtual machine can be deployed (Ghosh et al., 2012). The optimal number of physical machines in each pool is predetermined by the information technology architects.


Figure 1: Request provisioning steps in a three-pool cloud architecture adapted from Ghosh et al. (2012).
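
The provisioning order described above can be captured in a few lines. This is a minimal sketch, assuming simple free-capacity counters per pool rather than the full submodels in Ghosh et al. (2012); the pool capacities and the `provision` function are illustrative placeholders.

```python
# Minimal sketch of the RPDE search order: hot -> warm -> cold -> reject.
POOLS = ["hot", "warm", "cold"]

def provision(request_vms, free_capacity):
    """Place a request in the first pool with enough free machines.

    free_capacity: dict mapping pool name -> number of machines available.
    Returns the pool used, or None if the request is rejected.
    """
    for pool in POOLS:
        if free_capacity.get(pool, 0) >= request_vms:
            free_capacity[pool] -= request_vms
            return pool
    return None  # all pools exhausted: the request is rejected

capacity = {"hot": 2, "warm": 1, "cold": 2}
print(provision(1, capacity))  # 'hot'
print(provision(2, capacity))  # 'cold'  (hot and warm each have only 1 left)
print(provision(2, capacity))  # None    (rejected)
```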

Stochastic modeling comes into play when calculating the job rejection probability, which is the probability that no physical machine in any pool can take the job, with input variables such as the job arrival rate, the mean searching delay to find a physical machine, the probabilities that a physical machine can accept the job, the maximum number of jobs, and the sizes of the jobs (Ghosh et al., 2012; Sakr, 2014). Each of these variables feeds a submodel used to help define the job rejection probability. In other words, each of these variables has its own defined statistical distribution, so when thousands of simulations are run, a given job’s rejection probability can be estimated.
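
As a rough illustration of "thousands of simulations with distributions attached to each variable," the sketch below Monte Carlo-estimates a job rejection probability for a deliberately simplified loss model. It is not the interacting analytic submodels of Ghosh et al. (2012); every rate, capacity, and run count here is a made-up placeholder.

```python
import random

def estimate_rejection_probability(arrival_rate=5.0, service_rate=1.0,
                                   capacity=8, horizon=200.0, runs=200):
    """Monte Carlo estimate of the probability that an arriving job is rejected.

    Simplified model: jobs arrive as a Poisson process, each holds one machine
    for an exponential service time, and a job is rejected if all `capacity`
    machines are busy.  This stands in for the full set of submodels (RPDE
    queue, hot/warm/cold pools) described in the text.
    """
    rejected = total = 0
    for _ in range(runs):
        busy_until = []                                    # departure times of running jobs
        t = 0.0
        while t < horizon:
            t += random.expovariate(arrival_rate)          # next job arrival
            busy_until = [d for d in busy_until if d > t]  # release finished machines
            total += 1
            if len(busy_until) >= capacity:
                rejected += 1                              # no machine free: reject
            else:
                busy_until.append(t + random.expovariate(service_rate))
    return rejected / total

print(f"Estimated job rejection probability: {estimate_rejection_probability():.4f}")
```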

Limitations

Unfortunately, jobs can be rejected for two reasons: (a) the buffer is full or (b) there are insufficient physical machine resources (Ghosh et al., 2012; Sakr, 2014). The use of stochastic modeling allows the job rejection rate to be predicted, so a request could be resubmitted at a more optimal time if the rejection probability is currently high.

Also, both Ghosh et al. (2012) and Sakr (2014) stated that there is a limitation based on the assumption that all the physical machines in IaaS cloud platforms are homogeneous, as are the virtual machines. Adding more submodels and variables to the main stochastic model should alleviate this limitation; however, it would build a bigger model that is slower to solve than the original.

Potential applications that can be developed

Ghosh et al. (2012) suggested that this model can be used to calculate the job rejection probability and mean response delay of this architecture by varying inputs beyond those previously discussed, such as job arrival rates, mean job service times, the number of physical machines per pool, and the number of virtual machines that can be supported on a particular physical machine.

Another possible application of the three-pool architecture is data recovery. Data and computational algorithms that are mission critical to the success of an organization can be prioritized for storage in a hot pool, whereas data that are not time sensitive can be stored in warm and cold pools. Storing key data and algorithms in a hot pool means that if the main architecture goes down, the backups stored in the cloud can be recovered relatively quickly. These solutions could be implemented in a private, hybrid, or public cloud. This solution is a cloud version of the different types of sites used for disaster recovery (Segue Technologies, 2013).

Conclusions

Stochastic modeling is a great way to use a chaotic counterpart of deterministic modeling to predict the future when the variables can be defined by probability distributions. In an IaaS cloud architecture that uses differing pools of physical machines, requests for computational resources can be rejected, resulting in a loss of service. Stochastic modeling can describe the probability that a given queued request will be rejected, which allows people to make a data-driven decision on when it would be an optimal time to request resources from the IaaS cloud architecture.

Reference

Adv Topics: Big Data Visualization

Volume visualization is used to understand large amounts of data, in other words big data, where the data can be processed on a server or in the cloud and rendered onto a hand-held device so the end user can interact with it (Johnson, 2011). Tanahashi, Chen, Marchesin, and Ma (2010) define a framework for creating an entirely web-based visualization interface (or web application) that leverages the cloud computing environment. The benefit of this type of interface is that there is no need to download or install software; it can be accessed through any mobile device with a connection to the internet. A scientist can then use visualization and multimedia functionality as a tool to enhance their thinking and understanding of current problems, from the 3-dimensional structure of DNA to the 3-dimensional structure of a hurricane (Minelli, Chambers, & Dhiraj, 2013; Johnson, 2011).


Figure 1: Hurricane Joaquin’s 3-dimensional rain structure rendered from the NASA-enhanced infrared satellite image and GPM data. Adapted from NASA (September 29, 2015). The image shows snow particles in the storm’s anvil, and it also shows significant amounts of heat being released by the storm’s core, which drives the circulation of the storm and provides the energy required for further intensification.

McNurlin, Sprague, and Bui (2008) stated that the ideal web-based visualization tool would have simplified operations, reusable templates, rapid deployment, multilingual support, and control over creation, update, access, customization, and destruction. Tanahashi et al. (2010) proposed a web-based visualization framework that has:

(a)   preprocessing phase = data is collected, indexed, and stored

(b)   interface = the end user connects to the data in the databases and makes requests for processing and modifying the data

(c)   processing phase = a set of images, videos, or 3-dimensional renderings is returned per request

(d)   modification phase = the end user can request further modifications

(e)   reprocessing phase = a set of images, videos, or 3-dimensional renderings is returned per request, which creates an iterative loop between parts (d) and (e) until a final product is rendered to the end user

It was designed for all people to use, following the philosophy that “knowledge should be openly accessible to the broader community” (Tanahashi et al., 2010; Sakr, 2014). Performance bottlenecks of the Tanahashi et al. (2010) framework include dealing with different data formats, different rendering algorithms, transferring cloud-based renderings onto the web interface, and organizing big data for efficient retrieval. Since the goal of any visualization is to provide the right user with the right information in their preferred or a suggested rendering, these bottlenecks must be addressed (McNurlin et al., 2008). Thus, the rendering algorithms can be indexed to allow classification by the algorithms’ aesthetic properties or analytical significance, which an end user can then query through a search bar (Tanahashi et al., 2010).

Subsequently, it is proposed that using and indexing metadata can resolve the issues of data organization (Tanahashi et al., 2010). Data transfer issues can be mitigated by minimizing the amount of data in motion via a MapReduce paradigm (Tanahashi et al., 2010; Sakr, 2014). In the MapReduce paradigm, the mappers process and render the data and the reducers create the final composition of the data (Tanahashi et al., 2010), as sketched below.
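
The toy sketch below shows the mapper-render / reducer-composite split in miniature, using Python's multiprocessing pool as a stand-in for a real MapReduce cluster. The "rendering" is just a per-chunk pixel transform over synthetic values; it only illustrates the shape of the computation, not the cited framework's actual rendering code.

```python
from multiprocessing import Pool
from functools import reduce

def render_chunk(chunk):
    """Mapper: turn a chunk of raw values into 'pixels' (here, scaled ints)."""
    return [min(255, int(v * 255)) for v in chunk]

def composite(image_a, image_b):
    """Reducer: combine two partial renderings (here, a pixel-wise maximum)."""
    return [max(a, b) for a, b in zip(image_a, image_b)]

if __name__ == "__main__":
    # Synthetic "volume data" split into chunks, one per mapper.
    chunks = [[0.1, 0.5, 0.9], [0.3, 0.8, 0.2], [0.7, 0.4, 0.6]]
    with Pool() as pool:
        partial_images = pool.map(render_chunk, chunks)   # map phase
    final_image = reduce(composite, partial_images)       # reduce phase
    print(final_image)   # [178, 204, 229]
```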

Resources

  • Johnson, C. (2011) Visualizing large data sets. TEDx Salt Lake City. Retrieved from https://www.youtube.com/watch?v=5UxC9Le1eOY
  • McNurlin, B., Sprague, R., & Bui, T. (2008) Information Systems Management, (8th ed.). Pearson Learning Solution. VitalBook file.
  • Minelli, M., Chambers, M., & Dhiraj, M. (2013) Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.
  • NASA (2015). October 02, 2015 – Update #1 – A 3-D Look at Hurricane Joaquin from NASA’s GPM Satellite.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.
  • Tanahashi, Y., Chen, C., Marchesin, S., & Ma, K. (2010). An interface design for future cloud-based visualization services. Proceedings of 2010 IEEE Second International Conference on Cloud Computing Technology and Service, 609–613. doi: 10.1109/CloudCom.2010.46

Adv Topics: CAP Theory and NoSQL Databases

Brewer (2000) and Gilbert and Lynch (2012) concluded that a distributed shared-data system can have at most two of the three properties consistency, availability, and partition-tolerance (CAP theory). Gilbert and Lynch (2012) describe these three as akin to the safety of the data, live data, and the reliability of the data. Thus, a system that gives up

  • consistency needs expirations, conflict resolution, and optimistic locking (Brewer, 2000). A lack of consistency means that there is a chance the data or processes may not return the right response to a request (Gilbert & Lynch, 2012).
  • availability needs pessimistic locking and must make some partitions unavailable (Brewer, 2000). A lack of availability means that there is a chance a request may not get a response (Gilbert & Lynch, 2012).
  • partition-tolerance needs a 2-phase commit and cache validation protocols (Brewer, 2000). A lack of partition-tolerance means that there is a chance messages between servers, tasks, and threads can be lost forever and never committed (Gilbert & Lynch, 2012).

Therefore, in a NoSQL distributed database system (DDBS), partition-tolerance must exist, and administrators should then select between consistency and availability (Gilbert & Lynch, 2012; Sakr, 2014). If the administrators focus on availability, they can try to achieve weak consistency; if they focus on consistency, they are planning on having a strongly consistent system. An availability focus means having access to the data even during downtimes (Sakr, 2014). However, providing high levels of availability costs money. Per the web application Uptime.is:

Availability level    Monthly downtime    Yearly downtime
99.9%                 43m 49.7s           8h 45m 57.6s
99.99%                4m 23.0s            52m 35.7s
99.999%               26.3s               5m 15.6s
99.9999%              2.6s                31.6s

Achieving these high levels of availability means having a set of fail-safe systems built for fault tolerance.
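
The downtime figures in the table above follow directly from the availability percentage. A quick way to reproduce them, assuming the common convention of a 365.25-day year and an average-length month (the exact rounding used by Uptime.is may differ slightly):

```python
def downtime_seconds(availability_pct, period_seconds):
    """Seconds of allowed downtime for a given availability over a period."""
    return (1 - availability_pct / 100) * period_seconds

MONTH = 365.25 / 12 * 86_400   # average month, in seconds
YEAR = 365.25 * 86_400         # average year, in seconds

for pct in (99.9, 99.99, 99.999, 99.9999):
    print(f"{pct}%: {downtime_seconds(pct, MONTH):8.1f} s/month, "
          f"{downtime_seconds(pct, YEAR):9.1f} s/year")
# 99.9% -> ~2629.8 s/month (43m 49.8s) and ~31557.6 s/year (8h 45m 57.6s)
```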

As noted above, there is both strong and weak consistency. Strong consistency ensures that all copies of the data are updated in real time, whereas weak consistency means that eventually all copies of the data will be updated (Connolly & Begg, 2014; Sakr, 2014). Thus, there is a resource cost to stronger consistency over weaker consistency, driven by how quickly the data must be updated (Gilbert & Lynch, 2012). Consequently, this is where the savings come from when handling overhead in a NoSQL DDBS; a small sketch of the tradeoff follows.
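
A minimal sketch of that cost difference: a strongly consistent write waits for every replica before acknowledging, while an eventually consistent write acknowledges after one replica and lets the rest catch up in the background. The replica count and artificial delay are made up for illustration and are not tied to any particular NoSQL product.

```python
import time
import threading

REPLICAS = [{} for _ in range(3)]      # three in-memory replicas
REPLICA_WRITE_DELAY = 0.05             # pretend network/disk latency, seconds

def write_replica(replica, key, value):
    time.sleep(REPLICA_WRITE_DELAY)    # simulated propagation cost
    replica[key] = value

def strong_write(key, value):
    """Update every copy before acknowledging (strong consistency)."""
    for replica in REPLICAS:
        write_replica(replica, key, value)

def eventual_write(key, value):
    """Acknowledge after the first copy; update the rest in the background."""
    write_replica(REPLICAS[0], key, value)
    for replica in REPLICAS[1:]:
        threading.Thread(target=write_replica, args=(replica, key, value)).start()

start = time.time(); strong_write("x", 1)
print(f"strong write acknowledged after {time.time() - start:.2f}s")
start = time.time(); eventual_write("x", 2)
print(f"eventual write acknowledged after {time.time() - start:.2f}s "
      "(remaining replicas catch up later)")
```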

Finally, the table below illustrates some of the NoSQL databases that are either an AP or CP system (Hurst, 2010).

Availability & Partition Tolerance (AP) NoSQL systems: Dynamo, Voldemort, Tokyo Cabinet, KAI, Riak, CouchDB, SimpleDB, Cassandra

Consistency & Partition Tolerance (CP) NoSQL systems: Big Table, MongoDB, Terrastore, Hypertable, HBase, Scalaris, Berkeley DB, MemcacheDB, Redis

 Resources

  • Brewer, E. (2000). Towards robust distributed systems. Proceedings of 19th Annual ACM Symposium Principles of Distributed Computing (PODC00). 7–10.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Gilbert, S., and Lynch N. A. (2012). Perspectives on the CAP Theorem. Computer 45(2), 30–36. doi: 10.1109/MC.2011.389

 

Adv Topics: Distributed Programming

Distributed programming can be divided into the following two models:

  • Shared memory distributed programming: serialized programs run on multiple threads, where all the threads have access to the underlying data stored in shared memory (Sakr, 2014). The threads must be synchronized to ensure that read and write operations are not performed on the same segment of the shared data at the same time. Sandén (2011) and Sakr (2014) stated that this can be achieved via semaphores (which signal other threads that data is being written/posted and that they should wait to use the data until a condition is met), locks (data can be locked or unlocked for reading and writing), and barriers (threads cannot proceed to the next step until everything preceding it is completed). A famous example of this style of parallel programming is the use of MapReduce on data stored in the Hadoop Distributed File System (HDFS) (Lublinsky, Smith, & Yakubovich, 2013; Sakr, 2014). The HDFS is where the data is stored, and the mapper and reducer functions access the data stored in the HDFS.
  • Message passing distributed programming: data is stored in one location, and a master thread spreads chunks of the data onto sub-tasks and threads to process the overall data in parallel (Sakr, 2014). There are explicit, direct send and receive messages with synchronized communication (Lublinsky et al., 2013; Sakr, 2014). At the end of the runs, the data is merged back together by the master thread (Sakr, 2014). A famous example of this style of parallel programming is the Message Passing Interface (MPI); many weather models, such as the Weather Research and Forecasting (WRF) model, benefit from this form of distributed programming (Sakr, 2014; WRF, n.d.). The initial weather conditions are stored in one location, chunked into small pieces, spread across the threads, and eventually joined at the end to produce one cohesive forecast. A toy sketch of both models appears after this list.
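
The sketch below contrasts the two models in miniature: Python threads updating one lock-guarded counter stand in for shared-memory synchronization, and worker processes sending partial sums back over a queue stand in for explicit send/receive message passing. It is purely illustrative, not MPI, Hadoop, or code from the cited texts.

```python
import threading
from multiprocessing import Process, Queue

# --- Shared-memory style: threads update one counter guarded by a lock. ---
counter = 0
lock = threading.Lock()

def add_chunk(chunk):
    """Thread body: every thread writes to the same shared counter."""
    global counter
    for value in chunk:
        with lock:                      # synchronize access to the shared data
            counter += value

# --- Message-passing style: each worker process sends a partial sum back. ---
def worker(chunk, out_queue):
    """Process body: no shared state, just an explicit 'send' of a result."""
    out_queue.put(sum(chunk))

if __name__ == "__main__":
    threads = [threading.Thread(target=add_chunk, args=(range(1000),))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("shared-memory total:", counter)       # 4 * sum(range(1000)) = 1998000

    results = Queue()
    chunks = [range(i * 1000, (i + 1) * 1000) for i in range(4)]
    procs = [Process(target=worker, args=(c, results)) for c in chunks]
    for p in procs:
        p.start()
    total = sum(results.get() for _ in procs)    # master 'receives' and merges
    for p in procs:
        p.join()
    print("message-passing total:", total)       # sum(range(4000)) = 7998000
```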

However, there are six challenges to the distributed programming model: heterogeneity, scalability, communication, synchronization, fault-tolerance, and scheduling (Sakr, 2014). These six challenges are interrelated, so an increase in complexity in one of them can increase the level of complexity of one or more of the others. Therefore, both shared memory and message passing distributed programming can be insufficient when processing large-scale data in a cloud computing environment. This post will focus on two of the six:

  • Scalability issues arise when the number of users, the amount of data, and the requests for resources increase and the distributed processing system must still remain effective (Sakr, 2014). Using Hadoop and HDFS in the cloud mitigates scalability issues by providing a free, open-source way of managing such an explosion of data and demand on resources. But storage costs on the cloud will also increase, even though they are usually around 10% of the cost of traditional information technology infrastructure (Minelli, Chambers, & Dhiraj, 2013). As the scale of resources increases, it can also increase the number of resources needed to deal with communication and synchronization (Sakr, 2014).
  • Synchronization is a critical challenge that must be addressed because multiple threads should be able to share data without corrupting it or causing inconsistencies (Sandén, 2011; Sakr, 2014). Lublinsky et al. (2013) stated that MapReduce requires proper synchronization between the mapper and reducer functions to work. Improper synchronization can lead to issues in fault tolerance. Thus, efficient synchronization between read and write operations is vital and is within the control of the programmers (Sakr, 2014). The challenge comes when scalability issues are introduced and synchronization methods must be applied without degrading performance, causing deadlocks where two tasks want access to the same data, creating load-balancing issues, or wasting computational resources (Lublinsky et al., 2013; Sandén, 2011; Sakr, 2014). A small illustration of the deadlock issue, and a common remedy, follows this list.
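
One concrete synchronization pitfall mentioned above is deadlock: task 1 holds lock A and waits for lock B while task 2 holds B and waits for A. A common remedy is to always acquire locks in one fixed global order. The sketch below is an illustrative example of that rule, not code from the cited texts.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def transfer(first, second, label):
    """Acquire both locks, always in the same global order, so two tasks
    requesting the locks in opposite orders cannot deadlock each other."""
    ordered = sorted((first, second), key=id)   # fixed ordering rule
    with ordered[0]:
        with ordered[1]:
            print(f"{label}: holding both locks, doing synchronized work")

t1 = threading.Thread(target=transfer, args=(lock_a, lock_b, "task-1"))
t2 = threading.Thread(target=transfer, args=(lock_b, lock_a, "task-2"))
t1.start(); t2.start(); t1.join(); t2.join()
```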

Resources

  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
  • Minelli, M., Chambers, M., & Dhiraj, M. (2013) Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.
  • Sandén, B. I. (2011) Design of Multithreaded Software: The Entity-Life Modeling Approach. Wiley-Blackwell. VitalBook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.