├── image
    └── wechat.jpg
├── papers
    ├── 10.1.1.704.3668.pdf
    └── Alexandr_Tuzhilin_Report.pdf
├── notes
    ├── fighting-online-click-fraud-using-bluff-ads.md
    ├── lane-gifts-google-report.md
    └── novel-approach-based-on-ensemble-learning-fraud-detection-mobile-advertising.md
└── README.md


/image/wechat.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IPL/fraud-detection-papers/HEAD/image/wechat.jpg
--------------------------------------------------------------------------------
/papers/10.1.1.704.3668.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IPL/fraud-detection-papers/HEAD/papers/10.1.1.704.3668.pdf
--------------------------------------------------------------------------------
/papers/Alexandr_Tuzhilin_Report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IPL/fraud-detection-papers/HEAD/papers/Alexandr_Tuzhilin_Report.pdf
--------------------------------------------------------------------------------
/notes/fighting-online-click-fraud-using-bluff-ads.md:
--------------------------------------------------------------------------------
## [Fighting Online Click-Fraud Using Bluff Ads](https://arxiv.org/pdf/1002.2353.pdf)

A very short paper, only 4 pages, but it has been cited more than 70 times, mainly because the idea is original.

A bluff ad is an ad deliberately designed to be picked up and clicked by bots or by poorly trained human click-fraud workers. One form shows ad text that is irrelevant; another targets the ad at an irrelevant audience. Either way, a legitimate user would normally not click on it.

The author then lists four types of fraud and explains how bluff ads work against them. The four types are Profiling the customer, Publisher fraud, Attacks on advertisers, and Attacks on publisher. A high ratio of bluff-ad clicks to real-ad clicks suggests a higher likelihood of fraud. There are no formulas here; the discussion is purely qualitative.

The author ran a very simple experiment on Google AdWords with one normal ad and three bluff ads. The data show that the bluff ads do have a much lower click-through rate than the normal ad. However, the experiment is too simple, the data volume is small, and confounding factors are not ruled out. For example, two of the ads received no impressions at all, so their click-through rates cannot be compared.

Overall, it is a highly original paper with limited content. A search of the author's other publications did not turn up any deeper follow-up work.
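
The paper describes the bluff/real signal only in words. As a rough illustration, the sketch below scores each traffic source by the share of its clicks that landed on bluff ads and flags sources above a cutoff. The record layout, the function names, the source names, and the 0.5 threshold are all assumptions made for this example and do not come from the paper. It uses the bluff share (bluff / total) rather than bluff / real to avoid dividing by zero when a source has no real-ad clicks.

```python
from collections import defaultdict


def bluff_share_by_source(clicks):
    """Return {source: bluff_clicks / total_clicks}.

    `clicks` is an iterable of (source, is_bluff) pairs; this record
    layout is hypothetical and chosen only for illustration.
    """
    bluff = defaultdict(int)
    total = defaultdict(int)
    for source, is_bluff in clicks:
        total[source] += 1
        if is_bluff:
            bluff[source] += 1
    return {source: bluff[source] / total[source] for source in total}


def flag_suspicious(shares, threshold=0.5):
    """Return the sources whose bluff share reaches the (assumed) threshold."""
    return sorted(s for s, share in shares.items() if share >= threshold)


# Toy data: pub_a clicks mostly on real ads, pub_b mostly on bluff ads.
clicks = [
    ("pub_a", False), ("pub_a", False), ("pub_a", True),
    ("pub_b", True), ("pub_b", True), ("pub_b", False),
]
shares = bluff_share_by_source(clicks)
suspicious = flag_suspicious(shares)
print(shares, suspicious)
```

In practice a cutoff like this would need to be calibrated against the bluff-ad click rate of known legitimate traffic, which the paper's small experiment does not provide.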
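--------------------------------------------------------------------------------
/notes/lane-gifts-google-report.md:
--------------------------------------------------------------------------------
## [The Lane’s Gifts v. Google Report](/papers/Alexandr_Tuzhilin_Report.pdf)

This is an independent evaluation of Google's invalid-click detection system led by Alexander Tuzhilin. It reflects the state of Google's anti-fraud system at the time (2006) and how it had evolved. Some interesting points:
1. Staffing. The engineering side of the Click Quality team had more than 10 people and Spam Operations had more than 20; they could also draw on support from many other teams.
2. Google mainly used Anomaly-based and Rule-based methods to detect invalid clicks, with limited use of Classifier-based methods.
3. Google relied on both Prevention and Detection. Detection can be described along several dimensions: Online filtering vs. Offline monitoring and analysis, Automated vs. Manual detection, Proactive vs. Reactive detection, and where the invalid clicks were made.
4. The main detection stages were Pre-filtering, Online Filtering, and Post-filtering.
5. Google analyzed invalid clicks that were missed online but caught offline, and used them to develop new filters or adjust existing ones.
6. The Click Quality team used indirect evidence to show that the system worked: newly added filters caught only a small number of additional invalid clicks, and offline analysis likewise caught relatively few additional invalid clicks.
7. Most filters were very simple. Possible reasons the system still worked well: multiple filters were combined, some filters were fairly complex, most attacks were very simple, and invalid clicks have a long-tail distribution (this is partly conjecture).
8. What Google's filters mainly lacked: Deployment of Data Mining Methods, Using the Conversion Data in Filters, and Developing More Advanced Types of Filters.
9. With few exceptions, all filters and their thresholds were introduced and set by engineers with the goal of protecting advertisers, without influence from the finance department.
10. History of Google's invalid-click detection:
    1. The Early Days (February 2002 – Summer 2003): AdWords launched. There were only three filters, which could catch only basic invalid clicks.
    2. The Formation Stage (Summer 2003 – Fall 2005): AdSense launched, the Click Quality team was formed, three more filters were added, and work began on the overall architecture. The invalid-click problem was "under control" by the end of 2005.
    3. The Consolidation Stage (Fall 2005 – present): tuning existing methods, developing next-generation filters, and preparing for more sophisticated attacks.

The report reflects the situation in 2006. Google's invalid-click detection system has presumably matured a great deal since then, but the report is still a useful reference.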
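
Point 7 observes that most filters were simple and that combining several of them is one likely reason the system worked. The report does not publish Google's actual filters, so the sketch below is only an illustration of the general idea of chaining simple rule-based filters: a click is marked invalid if any filter fires. The two filters (a duplicate-click rule and a per-IP rate rule), the `Click` record, and every threshold are hypothetical. Clicks are assumed to arrive in timestamp order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    ip: str
    ad_id: str
    timestamp: float  # seconds since some reference point


def make_duplicate_filter(window_seconds):
    """Flag a click that repeats the same (ip, ad) pair within the window."""
    last_seen = {}

    def is_invalid(click):
        key = (click.ip, click.ad_id)
        previous = last_seen.get(key)
        last_seen[key] = click.timestamp
        return previous is not None and click.timestamp - previous <= window_seconds

    return is_invalid


def make_rate_filter(max_clicks, window_seconds):
    """Flag a click once its IP exceeds max_clicks within the window."""
    history = {}

    def is_invalid(click):
        recent = [t for t in history.get(click.ip, [])
                  if click.timestamp - t < window_seconds]
        recent.append(click.timestamp)
        history[click.ip] = recent
        return len(recent) > max_clicks

    return is_invalid


def run_filters(clicks, filters):
    """Split clicks into (valid, invalid); invalid if any filter fires."""
    valid, invalid = [], []
    for click in clicks:
        # Evaluate every filter so each stateful filter sees every click.
        hits = [f(click) for f in filters]
        (invalid if any(hits) else valid).append(click)
    return valid, invalid


clicks = [
    Click("192.0.2.1", "ad1", 0.0),
    Click("192.0.2.1", "ad1", 1.0),    # repeat of the same ad within 5 s
    Click("192.0.2.2", "ad1", 2.0),
    Click("192.0.2.2", "ad2", 3.0),
    Click("192.0.2.2", "ad3", 4.0),    # third click from this IP within 60 s
    Click("192.0.2.3", "ad1", 100.0),
]
filters = [make_duplicate_filter(5.0), make_rate_filter(2, 60.0)]
valid, invalid = run_filters(clicks, filters)
print(len(valid), [c.timestamp for c in invalid])
```

Keeping each filter as an independent function makes it easy to add or retune one rule without touching the others, which matches the report's description of filters being introduced and adjusted over time.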
--------------------------------------------------------------------------------
/notes/novel-approach-based-on-ensemble-learning-fraud-detection-mobile-advertising.md:
--------------------------------------------------------------------------------
## [A Novel Approach Based on Ensemble Learning for Fraud Detection in Mobile Advertising](http://research.larc.smu.edu.sg/fdma2012/doc/SecondWinner-TeamMasdar-Paper.pdf)

1. Feature extraction: the authors borrowed some features from Google AdSense's anti-fraud work and also drew on research on conventional anti-fraud systems, because fraud detection for mobile advertising was still an emerging field.
    1. Temporal features: fraudsters often use various tricks to hide their activity, such as generating very sparse click sequences, changing IP addresses, or issuing clicks from computers in different countries. Others stick to the traditional approach of simply generating as many clicks as possible within a given time interval. An anti-fraud system has to recognize both patterns. The authors count clicks over intervals of 1 minute, 5 minutes, 1 hour, 3 hours, and 6 hours, then compute the mean, maximum, variance, and skewness of those counts over all intervals as features.
    2. IP features: the maximum number of clicks from a single IP, the number of distinct IPs, the average number of clicks per IP, the entropy of clicks across IPs, and the variance of clicks across IPs.
    3. Similar features were extracted for other attributes such as Agent, Country, and Campaign ID.
    4. In total 41 features were generated. The full list is at http://www.dnagroup.org/PDF/FDMA12_TeamMasdar_AppendixA.pdf.
2. Feature selection: the authors tried Principal Component Analysis (PCA), Common Spatial Patterns (CSP), and wrapper subset evaluation. In comparison tests, the third method performed better than the first two. (In practice, though, feature selection did not really help; see the result analysis in item 7.)
3. Methods: the authors tried decision trees, regression trees, neural networks, and SVMs, each with several different learning algorithms. Their analysis found decision trees to perform best. (Decision tree results are also the easiest to interpret, which suits an anti-fraud system.)
4. The sample data is highly imbalanced: 2.336% fraudulent, 2.596% uncertain (Observation), and 95.068% non-fraudulent. To deal with the imbalance, the authors used Resampling and SMOTE; the results are discussed in item 7.
5. For ensemble learning the authors used Bagging, MetaCost, Random Subspace, and LogitBoost.
6. The final model averages six models: Bagging with J48, Bagging with REPTree, Bagging with Random Forest, MetaCost with J48, LogitBoost with J48, and Random Subspace with J48.
7. Result analysis
    1. Feature selection: the goal of feature selection is to avoid overfitting, but the comparison shows that results with feature selection are slightly worse than without it, consistently across algorithms. One possible explanation is that decision tree algorithms already include a pruning step, which is itself a form of feature selection, so an extra feature selection stage brings little benefit and can even hurt.
    2. Sampling: because the data is highly imbalanced, the authors tried Resampling and SMOTE. Both performed well on the training set but poorly on the validation set. In the end, no sampling gave results 20% better than sampling.
    3. 2-class/3-class classifications: the authors compared two 2-class setups (Observation relabeled as OK vs. Observation relabeled as Fraud) with keeping all three labels. Of the three, relabeling Observation as OK worked best.
    4. Combining multiple algorithms gave better results than any single algorithm; the final choice was a combination of six.
8. Conclusion: 90% of fraudsters generate relatively few clicks in every time interval, with very low variance and skewness. This lets them hide among legitimate users, but such systematic clicking can itself serve as a signal. For the Agent attribute, the authors observed very high variance for fraudulent publishers, indicating that many fraudulent publishers use a large number of agents to imitate normal users.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Fraud Detection Papers

This repository collects papers and reference material on ad fraud detection that I have read in the course of my work. Apart from a few papers, everything is provided as a link. Ad fraud detection tends to be a secretive function at every company, and little material is public; I hope this collection is useful to engineers working in the area. It also includes fraud detection papers from other fields, such as finance and e-commerce, in the hope that ideas from those fields can offer a fresh perspective on ad fraud detection.

If you are interested in ad fraud detection, feel free to get in touch:

* email: xinyu.wang1@gmail.com
* Zhihu: [王新宇 on Zhihu](https://www.zhihu.com/people/master)
* WeChat: ![](/image/wechat.jpg)

## Advertising

### Report
* [The Lane’s Gifts v. Google Report](/papers/Alexandr_Tuzhilin_Report.pdf) by Alexander Tuzhilin. 2006. [Note](/notes/lane-gifts-google-report.md)
* [Click Fraud Detection: Adversarial Pattern Recognition over 5 Years at Microsoft](http://www.appliedaisystems.com/papers/ClickQualitySystems54_LNCSFormat_clean.pdf) by Brendan Kitts et al. Real World Data Mining Applications 2015.

### Bluff Ads
* [Fighting Online Click-Fraud Using Bluff Ads](https://arxiv.org/pdf/1002.2353.pdf) by Hamed Haddadi. ACM Computer Communication Review 2010. [Note](/notes/fighting-online-click-fraud-using-bluff-ads.md)
* [Measuring and Fingerprinting Click-Spam in Ad Networks](http://www.cs.utexas.edu/users/yzhang/papers/clickspam-sigc12.pdf) by Vacha Dave et al. ACM SIGCOMM Conference on Data Communication 2012.

### Mobile Apps
* [DECAF: Detecting and Characterizing Ad Fraud in Mobile Apps](/papers/10.1.1.704.3668.pdf) by Bin Liu et al. Proc. 11th USENIX Conf. Netw. Syst. Des. Implementation 2014.
* [MAdFraud: Investigating Ad Fraud in Android Applications](https://pdfs.semanticscholar.org/f4b7/7a7f3c868b48a44c3480843aff22dc67df70.pdf) by Jonathan Crussell et al. Proc. 12th International Conference on Mobile Systems Applications and Services (MobiSys'14) 2014.

### Duplicate Detection
* [Detecting Click Fraud in Pay-Per-Click Streams of Online Advertising Networks](https://www.researchgate.net/profile/Linfeng_Zhang4/publication/4365107_Detecting_Click_Fraud_in_Pay-Per-Click_Streams_of_Online_Advertising_Networks/links/56f12ac908ae9a58a829435f.pdf) by Linfeng Zhang et al. ICDCS 2008.

### Association Rule
* [Using Association Rules for Fraud Detection in Web Advertising Networks](https://p2p.cs.ucsb.edu/research/tech_reports/reports/2005-13.pdf) by Ahmed Metwally et al. VLDB 2005.

### Competition
* [Detecting Click Fraud in Online Advertising: A Data Mining Approach](http://www.jmlr.org/papers/volume15/oentaryo14a/oentaryo14a.pdf) by Richard Oentaryo et al. JMLR 2014.
* [Feature Engineering for Click Fraud Detection](http://research.larc.smu.edu.sg/fdma2012/doc/FirstWinner-Starrystarrynight-Paper.pdf) by Clifton Phua et al. International Workshop on Fraud Detection in Mobile Advertising (FDMA) 2012.
* [A Novel Approach Based on Ensemble Learning for Fraud Detection in Mobile Advertising](http://research.larc.smu.edu.sg/fdma2012/doc/SecondWinner-TeamMasdar-Paper.pdf) by Kasun S. Perera et al. International Workshop on Fraud Detection in Mobile Advertising (FDMA) 2012. [Note](/notes/novel-approach-based-on-ensemble-learning-fraud-detection-mobile-advertising.md)
* [Hybrid Models for Click Fraud Detection in Mobile Advertising](http://research.larc.smu.edu.sg/fdma2012/doc/ThirdWinner-DB2-Paper.pdf) by Chen Wei et al. International Workshop on Fraud Detection in Mobile Advertising (FDMA) 2012.
* [Random Forests for the Detection of Click Fraud in Online Mobile Advertising](http://research.larc.smu.edu.sg/fdma2012/doc/FirstRunnerUp-Tea-Paper.pdf) by Daniel Berrar et al. International Workshop on Fraud Detection in Mobile Advertising (FDMA) 2012.
* [Hierarchical Committee Machines for Fraud Detection in Mobile Advertising](http://research.larc.smu.edu.sg/fdma2012/doc/SecondRunnerUp-Kites-Paper.pdf) by S. Shivashankar et al. International Workshop on Fraud Detection in Mobile Advertising (FDMA) 2012.

### Dataset
* [FDMA 2012 Competition Dataset](https://docs.google.com/file/d/0B77LA4oEl-AQTGRHSVNMczJhVTg) by BuzzCity Pte. Ltd. FDMA 2012.

### White Paper
* [2017广告反欺诈白皮书](https://3gimg.qq.com/mig_op/beacon/download/baipishu.pdf) (2017 Ad Anti-Fraud White Paper) by 腾讯灯塔, 秒针, AdMaster. 2017.
* [The State of Mobile Fraud Q1 2018](https://hub.appsflyer.com/hubfs/State%20of%20Mobile%20Fraud%20Q1%202018%20AppsFlyer.pdf) by AppsFlyer. 2018.
* [2020中国移动广告反欺诈白皮书](https://s.tencent.com/subjects/downloads/pdf/20201229/2020%E4%B8%AD%E5%9B%BD%E7%A7%BB%E5%8A%A8%E5%B9%BF%E5%91%8A%E5%8F%8D%E6%AC%BA%E8%AF%88%E7%99%BD%E7%9A%AE%E4%B9%A6.pdf) (2020 China Mobile Ad Anti-Fraud White Paper) by 腾讯安全天御, 腾讯防火墙, InMobi. 2020.
* [阿里妈妈流量反作弊算法实践](https://mp.weixin.qq.com/s/WUDTT-0dTD4g3W5t1p48Zg) (Alimama Traffic Anti-Fraud Algorithms in Practice) by the Alimama risk control team (阿里妈妈风控团队). 2021.

## Other Areas

### Anomaly Detection

#### Survey
* [Anomaly Detection: A Survey](https://www.cs.umn.edu/sites/cs.umn.edu/files/tech_reports/07-017.pdf) by Varun Chandola et al. ACM Computing Surveys, Vol. 41, No. 3, 15, 01.07.2009.

#### Open Source Toolkit
* [Scikit-learn Novelty and Outlier Detection](http://scikit-learn.org/stable/modules/outlier_detection.html)
* [Python Outlier Detection (PyOD)](http://pyod.readthedocs.io)
* [ELKI: Environment for Developing KDD-Applications Supported by Index-Structures](https://elki-project.github.io)

### Report
* [Facebook Immune System](https://css.csail.mit.edu/6.858/2012/readings/facebook-immune.pdf) by Tao Stein et al. Proceedings of the 4th Workshop on Social Network Systems, SNS, 2011.

### Credit Card Transaction Fraud
* [Learned lessons in credit card fraud detection from a practitioner perspective](http://www.ulb.ac.be/di/map/adalpozz/pdf/FraudDetectionPaper_8.pdf) by A Dal Pozzolo et al. Expert Systems with Applications, 41(10):4915–4928, 2014.
* [APATE: A Novel Approach for Automated Credit Card Transaction Fraud Detection using Network-Based Extensions](https://lirias.kuleuven.be/bitstream/123456789/496406/1/APATE.pdf) by Veronique Van Vlasselaer et al. Decision Support Systems, 2015.

### RNN
* [Detecting Fraudulent Behavior Using Recurrent Neural Networks](http://lab.iisec.ac.jp/~tanaka_lab/images/pdf/kennkyukai/kennkyukai-2016-10.pdf) by Yoshihiro Ando et al. Computer Security Symposium 2016.
* [Session-Based Fraud Detection in Online E-Commerce Transactions Using Recurrent Neural Networks](http://iiis.tsinghua.edu.cn/~weixu/files/SWang_ECMLPKDD_2017.pdf) by Shuhao Wang et al. PKDD 2017. [Slides](http://iiis.tsinghua.edu.cn/~weixu/files/SWang_ECMLPKDD_2017_Slides.pdf)

### White Paper
* [2017电子商务生态安全白皮书](http://hchdownload.oss-cn-hangzhou.aliyuncs.com/%E4%BC%9A%E8%AE%AE%E6%96%87%E6%A1%A3/2017%E7%94%B5%E5%AD%90%E5%95%86%E5%8A%A1%E7%94%9F%E6%80%81%E5%AE%89%E5%85%A8%E7%99%BD%E7%9A%AE%E4%B9%A6.pdf) (2017 E-Commerce Ecosystem Security White Paper) by 电子商务生态安全联盟. 2017.

### Curated Collections
* [A collection of publicly available risk-control and security architectures, solutions, and algorithms from different industries and companies](https://github.com/csearch/risky-company-project)
* [A collection of algorithm papers in the risk-control field](https://github.com/csearch/risky-algorithm-research)

--------------------------------------------------------------------------------