In 2011, internet advertising generated over $31B in the US (Sakr, 2014). Much of this revenue comes from contextual advertising, in which online advertisers, search engine optimizers, and sponsored search providers try to improve both the user experience and revenue by displaying relevant, context-based ads online (Chakrabarti, Agarwal, & Josifovski, 2008). Understanding the click rates achieved by online advertisers, search engine optimizers, and sponsored search providers can therefore help generate online revenue for any business’s products or services (Regelson & Fain, 2006). When a consumer clicks on an ad to decide whether to purchase the product or service, a small amount of money is deducted from the company’s online advertising budget (Regelson & Fain, 2006; Sakr, 2014).
This business model is subject to cyber-attacks: a competitor can create an automated piece of code that clicks on the ads without making a purchase, which eventually depletes the online advertising budget (Sakr, 2014). This automated code usually comes from an IP size distribution, that is, a group of IPs set to target one ad while pretending to be actual consumers, which resembles a Denial of Service (DoS) attack (Park & Lee, 2001; Sakr, 2014). However, a DoS attack uses IP size distributions to block a website’s services, and the best way to counter it is to trace the IP size distribution back to its source and block it (Park & Lee, 2001). This attack is slightly different, though: it does not deny access to the company’s services or products; it depletes the company’s online advertising budget, which reduces that company’s online market share.
According to Sakr (2014), IP size distributions are defined along two dimensions, (a) application and (b) time, and they change over time because of business cycles, flash crowds, etc. IP size distributions are generated in three ways: by (a) legitimate users, (b) a publisher’s friends, who could include sponsored providers, with some fraudulent clicks, and (c) a bot-master with botnets (Sakr, 2014; Soldo & Metwally, 2012). The goal is to identify the bot-master with botnets and the fraudulent clicks. Thus, companies need to be able to detect network traffic anomalies based on the IP size distribution:
- Sakr (2014) and Soldo and Metwally (2012) suggested using anomaly detection algorithms, which rely on the current IP size distribution and analyze the data to search for patterns that are characteristic of these attacks. This method of detection is robust because it uses these characteristics of fraudulent clicks; it also has low complexity and can be written to run in parallel with MapReduce. A distinct cookie ID can be assigned for analysis whenever a click is generated. The technique uses a regression model and compares IP rates to a Poisson distribution. It also uses an explanatory diversity feature, which counts the distinct cookies and measures the entropy of that distribution, and treats the result as the true IP sizes. This information is used to build explanatory diversity models, which can then be analyzed using quantile regression, linear regression, percentage regression, and principal component analysis. For each of these analyses, the root mean square error, relative error, and bucket error are computed, so that the results of each model can be compared with one another and with the true value. This inter-comparison allows anomalous activity to be detected because each method measures different properties of the same data. Once IP addresses have been identified as fraudulent, they are flagged.
- Regelson and Fain (2006) suggest using historical data, if it is available, to create a reliable prior IP size distribution that can be compared with current IP size distributions. Though the authors suggested this for studying click-through rates, that is, clicking on an ad to purchase, it could also be used for this scenario. Using historical data works when there is a wealth of historical information, but when there is little or no historical information, a creative aggregation technique could work. This technique uses clusters of less frequent, similar items, as well as completely novel items, to develop the context needed to build the historical IP size distribution. It relies on a logistic regression analysis and could reduce error by 50% when there was no historical data to compare to.
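To make the first method more concrete, the following is a minimal Python sketch of its per-IP features: the distinct-cookie count, the entropy of the cookie distribution, and a comparison of the click count against a Poisson expectation. It is not the regression model from Sakr (2014) or Soldo and Metwally (2012). The function names, the `clicks_per_cookie` rate, and the `alpha` cutoff are all hypothetical values chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def cookie_entropy(cookie_ids):
    """Shannon entropy (in bits) of the cookie IDs seen behind one IP."""
    counts = Counter(cookie_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def summarize_clicks(clicks):
    """Group (ip, cookie_id) click records into per-IP features."""
    by_ip = defaultdict(list)
    for ip, cookie in clicks:
        by_ip[ip].append(cookie)
    return {
        ip: {
            "clicks": len(cookies),
            "distinct_cookies": len(set(cookies)),  # proxy for the IP size
            "entropy": cookie_entropy(cookies),
        }
        for ip, cookies in by_ip.items()
    }

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), with terms computed in log space."""
    cdf = sum(
        math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))
        for i in range(k)
    )
    return max(0.0, 1.0 - cdf)

def flag_suspicious(features, clicks_per_cookie=2.0, alpha=1e-3):
    """Flag IPs whose click count is improbably high for their size.

    Assumption: a legitimate user (cookie) generates about
    `clicks_per_cookie` clicks in the observation window.
    """
    flagged = []
    for ip, f in features.items():
        expected = clicks_per_cookie * f["distinct_cookies"]
        if poisson_tail(f["clicks"], expected) < alpha:
            flagged.append(ip)
    return sorted(flagged)
```

Calling `flag_suspicious(summarize_clicks(records))` on a list of `(ip, cookie_id)` pairs returns the IPs whose traffic looks anomalous under these assumptions. An IP with many clicks but a single cookie (entropy of zero) is the kind of pattern this sketch is meant to surface.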
On further analysis, the strengths of the first method are:
- there is no need to obtain personally identifiable information
- there is no need to authenticate end user clicks
- it is a fully automated statistical aggregation method that can scale linearly using MapReduce
- creating a legitimate-looking IP size distribution is really difficult
while the limitations of this method are:
- It requires a large amount of actual click data to create these models
- Collaborating with other companies to pool their click data could help provide the large amount of click data needed, but that data is usually proprietary.
That is why the second method, from Regelson and Fain (2006), was mentioned: it addresses the limitations of the Sakr (2014) and Soldo and Metwally (2012) method.
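The idea of borrowing history from a cluster of similar items when an item has little or none of its own can be illustrated with a short Python sketch. This is an assumption-laden stand-in, not the logistic regression model that Regelson and Fain (2006) used: it simply blends an item's own click rate with the pooled rate of its cluster. The function names and the `prior_strength` weight are hypothetical.

```python
def cluster_rate(members):
    """Pooled click rate over (clicks, impressions) pairs of similar items."""
    total_clicks = sum(clicks for clicks, _ in members)
    total_impressions = sum(impressions for _, impressions in members)
    return total_clicks / total_impressions if total_impressions else 0.0

def backoff_ctr(clicks, impressions, cluster_ctr, prior_strength=20.0):
    """Blend an item's own click rate with its cluster's rate.

    `prior_strength` acts like a number of pseudo-impressions at the
    cluster rate: with no history the estimate equals the cluster rate,
    and with a lot of history the item's own data dominates.
    """
    return (clicks + prior_strength * cluster_ctr) / (impressions + prior_strength)
```

For a completely novel item, `backoff_ctr(0, 0, cluster_rate(similar_items))` returns the cluster's rate, which is what gives the estimate a usable historical baseline where none existed.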
- Chakrabarti, D., Agarwal, D., & Josifovski, V. (2008). Contextual advertising by combining relevance with click feedback. In Proceedings of the 17th international conference on World Wide Web (pp. 417-426). ACM.
- Park, K., & Lee, H. (2001). On the effectiveness of probabilistic packet marking for IP traceback under denial of service attack. In INFOCOM 2001. Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE (Vol. 1, pp. 338-347). IEEE.
- Regelson, M., & Fain, D. (2006). Predicting click-through rate using keyword clusters. In Proceedings of the Second Workshop on Sponsored Search Auctions (Vol. 9623).
- Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.
- Soldo, F., & Metwally, A. (2012). Traffic anomaly detection based on the IP size distribution. In INFOCOM, 2012 Proceedings IEEE (pp. 2005-2013). IEEE.