├── .gitignore ├── Amazon ├── AmazonBeauty_m1 │ └── README.md ├── AmazonBooks_m1 │ └── README.md ├── AmazonCDs_m1 │ └── README.md ├── AmazonElectronics_m1 │ └── README.md ├── AmazonElectronics_x1 │ ├── README.md │ └── convert_amazonelectronics_x1.py ├── AmazonMovies_m1 │ └── README.md └── README.md ├── Avazu ├── Avazu_x1 │ ├── README.md │ ├── convert_avazu_x1.py │ └── download_avazu_x1.py ├── Avazu_x2 │ └── README.md ├── Avazu_x4 │ ├── README.md │ └── convert_avazu_x4.py └── README.md ├── CiteULike └── CiteUlikeA_m1 │ └── README.md ├── Criteo ├── Criteo_x1 │ ├── README.md │ ├── convert_criteo_x1.py │ └── download_criteo_x1.py ├── Criteo_x2 │ └── README.md ├── Criteo_x4 │ ├── README.md │ └── convert_criteo_x4.py └── README.md ├── Frappe ├── Frappe_x1 │ ├── README.md │ └── convert_frappe_x1.py └── README.md ├── Gowalla └── Gowalla_m1 │ └── README.md ├── KKBox ├── KKBox_x1 │ └── README.md └── README.md ├── KuaiShou ├── KuaiVideo_x1 │ ├── README.md │ └── convert_kuaivideo_x1.py └── README.md ├── MIND ├── MIND_large_x1 │ ├── README.md │ └── convert_MIND_large_x1.py └── MIND_small_x1 │ ├── README.md │ └── convert_MIND_small_x1.py ├── MicroVideo └── MicroVideo1.7M_x1 │ ├── README.md │ └── convert_microvideo1.7m_x1.py ├── MovieLens ├── Movielens1M_m1 │ └── README.md ├── MovielensLatest_x1 │ ├── README.md │ └── convert_movielenslatest_x1.py └── README.md ├── README.md ├── Taobao └── TaobaoAd_x1 │ ├── README.md │ └── convert_taobaoad_x1.py ├── Yelp └── Yelp18_m1 │ └── README.md ├── iFlytek └── iFlyteckAds_x1 │ └── convert_iFlyteckAds_x1.py ├── iPinYou └── iPinYou_x1 │ └── README.md └── tracking.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | local_settings.py 56 | db.sqlite3 57 | 58 | # Flask stuff: 59 | instance/ 60 | .webassets-cache 61 | 62 | # Scrapy stuff: 63 | .scrapy 64 | 65 | # Sphinx documentation 66 | docs/_build/ 67 | 68 | # PyBuilder 69 | target/ 70 | 71 | # Jupyter Notebook 72 | .ipynb_checkpoints 73 | 74 | # pyenv 75 | .python-version 76 | 77 | # celery beat schedule file 78 | celerybeat-schedule 79 | 80 | # SageMath parsed files 81 | *.sage.py 82 | 83 | # Environments 84 | .env 85 | .venv 86 | env/ 87 | venv/ 88 | ENV/ 89 | env.bak/ 90 | venv.bak/ 91 | 92 | # Spyder project settings 93 | .spyderproject 94 | .spyproject 95 | 96 | # Rope project settings 97 | .ropeproject 98 | 99 | # mkdocs documentation 100 | /site 101 | 102 | # mypy 103 | .mypy_cache/ 104 | .ipynb_checkpoints 105 | .DS_Store 106 | _build 107 | -------------------------------------------------------------------------------- /Amazon/AmazonBeauty_m1/README.md: -------------------------------------------------------------------------------- 1 | # AmazonBeauty_m1 2 | 3 | + **Data format:** 4 | user_id item1 item2 ... 5 | 6 | + **Source:** https://cseweb.ucsd.edu/~jmcauley/datasets.html 7 | + **Download:** https://huggingface.co/datasets/reczoo/AmazonBeauty_m1/tree/main 8 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 9 | 10 | + **Used by papers:** 11 | - Jianing Sun, Wei Guo, Dengcheng Zhang, Yingxue Zhang, Florence Regol, Yaochen Hu, Huifeng Guo, Ruiming Tang, Han Yuan, Xiuqiang He, Mark Coates. [A Framework for Recommending Accurate and Diverse Items Using Bayesian Graph Convolutional Neural Networks](https://hyclex.github.io/papers/paper_sun2019BGCN.pdf). In KDD 2020. 12 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filterin](https://arxiv.org/abs/2109.12613). In CIKM 2021. 13 | 14 | + **Check the md5sum for data integrity:** 15 | ```bash 16 | $ md5sum *.txt 17 | 66fb687136d55b51742905ece189da31 test.txt 18 | 53cc9d39bc79f13c9bd3e75bd5121d1d train.txt 19 | ``` 20 | -------------------------------------------------------------------------------- /Amazon/AmazonBooks_m1/README.md: -------------------------------------------------------------------------------- 1 | # AmazonBooks_m1 2 | 3 | + **Dataset description:** 4 | 5 | The data statistics are summarized as follows: 6 | 7 | | Dataset ID | #Users | #Items | #Interactions | #Train | #Test | Density | 8 | |:--------------:|:------:|:------:|:-------------:|:---------:|:-------:|:-------:| 9 | | AmazonBooks_m1 | 52,643 | 91,599 | 2,984,108 | 2,380,730 | 603,378 | 0.00062 | 10 | 11 | 12 | + **Data format:** 13 | user_id item1 item2 ... 14 | 15 | + **Source:** https://cseweb.ucsd.edu/~jmcauley/datasets.html 16 | + **Download:** https://huggingface.co/datasets/reczoo/AmazonBooks_m1/tree/main 17 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 18 | 19 | + **Used by papers:** 20 | - Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, Meng Wang. [LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation](https://arxiv.org/abs/2002.02126). In SIGIR 2020. 
21 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filtering](https://arxiv.org/abs/2109.12613). In CIKM 2021. 22 | - Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. [UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation](https://arxiv.org/abs/2110.15114). In CIKM 2021. 23 | 24 | + **Check the md5sum for data integrity:** 25 | ```bash 26 | $ md5sum *.txt 27 | 5b1125ef3bf4118a7988f1fd8ce52ef9 item_list.txt 28 | 30f8ccfea18d25007ba9fb9aba4e174d test.txt 29 | c916ecac04ca72300a016228258b41ed train.txt 30 | 132f8a5d6d35d5fdde1e0396488be235 user_list.txt 31 | ``` 32 | -------------------------------------------------------------------------------- /Amazon/AmazonCDs_m1/README.md: -------------------------------------------------------------------------------- 1 | # AmazonCDs_m1 2 | 3 | + **Data format:** 4 | user_id item1 item2 ... 5 | 6 | + **Source:** https://cseweb.ucsd.edu/~jmcauley/datasets.html 7 | + **Download:** https://huggingface.co/datasets/reczoo/AmazonCDs_m1/tree/main 8 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 9 | 10 | + **Used by papers:** 11 | - Jianing Sun, Wei Guo, Dengcheng Zhang, Yingxue Zhang, Florence Regol, Yaochen Hu, Huifeng Guo, Ruiming Tang, Han Yuan, Xiuqiang He, Mark Coates. [A Framework for Recommending Accurate and Diverse Items Using Bayesian Graph Convolutional Neural Networks](https://hyclex.github.io/papers/paper_sun2019BGCN.pdf). In KDD 2020. 12 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filtering](https://arxiv.org/abs/2109.12613). In CIKM 2021. 13 | - Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. [UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation](https://arxiv.org/abs/2110.15114). In CIKM 2021. 14 | 15 | + **Check the md5sum for data integrity:** 16 | ```bash 17 | $ md5sum *.txt 18 | d29acb66d0fb74bc3bc0791cbbce5cf2 test.txt 19 | 2df6a35cac4373cf3eef95f75568da0a train.txt 20 | ``` 21 | -------------------------------------------------------------------------------- /Amazon/AmazonElectronics_m1/README.md: -------------------------------------------------------------------------------- 1 | # AmazonElectronics_m1 2 | 3 | + **Data format:** 4 | 5 | Each user corresponds to a list of interacted items: [[item1, item2], [item3, item4, item5], ...] 6 | 7 | + **Source:** https://cseweb.ucsd.edu/~jmcauley/datasets.html 8 | + **Download:** https://huggingface.co/datasets/reczoo/AmazonElectronics_m1/tree/main 9 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 10 | 11 | + **Used by papers:** 12 | - Wenhui Yu, Zheng Qin. [Sampler Design for Implicit Feedback Data by Noisy-label Robust Learning](https://arxiv.org/abs/2007.07204). In SIGIR 2020. 13 | - Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. [UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation](https://arxiv.org/abs/2110.15114). In CIKM 2021. 
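A minimal loading sketch for the JSON splits listed in the md5sum block below. It assumes each `*_data.json` file is a list whose u-th entry holds the interacted item list of user u, matching the data format described above; the exact nesting is not documented here, so treat this as illustrative only.

```python
import json

# Minimal sketch (structure assumed, see note above): load per-user interaction lists.
with open("train_data.json", "r") as fin:
    train_data = json.load(fin)

# Flatten into (user_index, item_id) pairs, e.g. for building an interaction matrix.
pairs = [(u, item) for u, items in enumerate(train_data) for item in items]
print(len(train_data), "users,", len(pairs), "interactions")
```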
14 | 15 | + **Check the md5sum for data integrity:** 16 | ```bash 17 | $ md5sum *.json 18 | 7a0fa5d0da5dc5d5008da02b554ef688 test_data.json 19 | ca71f3f5b9ada393ffd5490eba84c7db train_data.json 20 | 7f2db9b5b0de91c7d757ed6ed6095a5a validation_data.json 21 | ``` 22 | -------------------------------------------------------------------------------- /Amazon/AmazonElectronics_x1/README.md: -------------------------------------------------------------------------------- 1 | # AmazonElectronics_x1 2 | 3 | + **Dataset description:** 4 | 5 | The [Amazon dataset](http://jmcauley.ucsd.edu/data/amazon) contains product reviews and metadata from Amazon, which is a widely-used benchmark dataset. We use the preprocessed subset named Amazon-Electronics from the [DIN](https://arxiv.org/abs/1706.06978) work. It contains 192,403 users, 63,001 goods, 801 categories and 1,689,188 samples. User behaviors in this dataset are rich, with more than 5 reviews for each user and goods. Features include goods_id, cate_id, user reviewed goods_id_list and cate_id_list. Following DIN, the task is to predict the probability of reviewing the (k+1)-th goods by making use of the first k reviewed goods. The last item of each behavior sequence is reserved for testing. 6 | 7 | The dataset statistics are summarized as follows. 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | AmazonElectronics_x1 | 2,993,570 | 2,608,764 | | 384,806 | 12 | 13 | + **Data format:** 14 | label, user_id, item_id, cate_id, item_history, cate_history 15 | 16 | + **Source:** https://cseweb.ucsd.edu/~jmcauley/datasets.html 17 | + **Download:** https://huggingface.co/datasets/reczoo/AmazonElectronics_x1/tree/main 18 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 19 | 20 | + **Used by papers:** 21 | - Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, Kun Gai. [Deep Interest Network for Click-Through Rate Prediction](https://arxiv.org/abs/1706.06978). In KDD 2018. 22 | - Jieming Zhu, Guohao Cai, Junjie Huang, Zhenhua Dong, Ruiming Tang, Weinan Zhang. [ReLoop2: Building Self-Adaptive Recommendation Models via Responsive Error Compensation Loop](https://arxiv.org/abs/2306.08808). In KDD 2023. 
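The `item_history` and `cate_history` columns in these CSV files are written as `^`-joined id strings by the conversion script below. A minimal pandas sketch for reading a split and recovering the behavior sequences as lists of integer ids:

```python
import pandas as pd

# Minimal sketch: split the "^"-joined behavior sequences back into id lists.
df = pd.read_csv("test.csv")
for col in ["item_history", "cate_history"]:
    df[col] = df[col].fillna("").map(
        lambda s: [int(i) for i in s.split("^")] if s else [])

print(df[["label", "user_id", "item_id", "cate_id"]].head())
print(df["item_history"].map(len).describe())  # behavior sequence length statistics
```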
23 | 24 | + **Check the md5sum for data integrity:** 25 | ```bash 26 | $ md5sum *.csv 27 | 57a20e82fe736dd495f2eaf0669bf6d0 test.csv 28 | e9bf80b92985e463db18fdc753d347b5 train.csv 29 | ``` 30 | -------------------------------------------------------------------------------- /Amazon/AmazonElectronics_x1/convert_amazonelectronics_x1.py: -------------------------------------------------------------------------------- 1 | """ Convert AmazonElectronics dataset used by the DIN paper from pickle file to csv file 2 | Run the following cat command to get `dataset.pkl` 3 | cat aa ab ac > dataset.pkl 4 | after downloading from https://github.com/zhougr1993/DeepInterestNetwork/tree/master/din 5 | """ 6 | 7 | import pickle 8 | import pandas as pd 9 | import hashlib 10 | 11 | 12 | with open('dataset.pkl', 'rb') as f: 13 | train_set = pickle.load(f, encoding='bytes') 14 | test_set = pickle.load(f, encoding='bytes') 15 | cate_list = pickle.load(f, encoding='bytes') 16 | user_count, item_count, cate_count = pickle.load(f, encoding='bytes') 17 | 18 | train_data = [] 19 | for sample in train_set: 20 | user_id = sample[0] 21 | item_id = sample[2] 22 | item_history = "^".join([str(i) for i in sample[1]]) 23 | label = sample[3] 24 | cate_id = cate_list[item_id] 25 | cate_history = "^".join([str(i) for i in cate_list[sample[1]]]) 26 | train_data.append([label, user_id, item_id, cate_id, item_history, cate_history]) 27 | train_df = pd.DataFrame(train_data, columns=['label', 'user_id', 'item_id', 'cate_id', 'item_history', 'cate_history']) 28 | train_df.to_csv("train.csv", index=False) 29 | 30 | test_data = [] 31 | for sample in test_set: 32 | user_id = sample[0] 33 | item_pair = sample[2] 34 | item_history = "^".join([str(i) for i in sample[1]]) 35 | cate_history = "^".join([str(i) for i in cate_list[sample[1]]]) 36 | test_data.append([1, user_id, item_pair[0], cate_list[item_pair[0]], item_history, cate_history]) 37 | test_data.append([0, user_id, item_pair[1], cate_list[item_pair[1]], item_history, cate_history]) 38 | test_df = pd.DataFrame(test_data, columns=['label', 'user_id', 'item_id', 'cate_id', 'item_history', 'cate_history']) 39 | test_df.to_csv("test.csv", index=False) 40 | 41 | # Check md5sum for correctness 42 | assert("e9bf80b92985e463db18fdc753d347b5" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 43 | assert("57a20e82fe736dd495f2eaf0669bf6d0" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 44 | 45 | print("Reproducing data succeeded!") 46 | -------------------------------------------------------------------------------- /Amazon/AmazonMovies_m1/README.md: -------------------------------------------------------------------------------- 1 | # AmazonMovies_m1 2 | 3 | + **Data format:** 4 | user_id item1 item2 ... 5 | 6 | + **Source:** https://cseweb.ucsd.edu/~jmcauley/datasets.html 7 | + **Download:** https://huggingface.co/datasets/reczoo/AmazonMovies_m1/tree/main 8 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 9 | 10 | + **Used by papers:** 11 | - Jianing Sun, Wei Guo, Dengcheng Zhang, Yingxue Zhang, Florence Regol, Yaochen Hu, Huifeng Guo, Ruiming Tang, Han Yuan, Xiuqiang He, Mark Coates. [A Framework for Recommending Accurate and Diverse Items Using Bayesian Graph Convolutional Neural Networks](https://hyclex.github.io/papers/paper_sun2019BGCN.pdf). In KDD 2020. 12 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. 
[SimpleX: A Simple and Strong Baseline for Collaborative Filterin](https://arxiv.org/abs/2109.12613). In CIKM 2021. 13 | 14 | + **Check the md5sum for data integrity:** 15 | ```bash 16 | $ md5sum *.txt 17 | c02e5f6579aa51950aa875c462a0204b test.txt 18 | 3e9d30eacd30330a9feaa0fdb17760ba train.txt 19 | ``` 20 | -------------------------------------------------------------------------------- /Amazon/README.md: -------------------------------------------------------------------------------- 1 | # Amazon 2 | 3 | + [AmazonBeauty_m1](./AmazonBeauty_m1/README.md) 4 | + [AmazonBooks_m1](./AmazonBooks_m1/README.md) 5 | + [AmazonCDs_m1](./AmazonCDs_m1/README.md) 6 | + [AmazonElectronics_x1](./AmazonElectronics_x1/README.md) 7 | -------------------------------------------------------------------------------- /Avazu/Avazu_x1/README.md: -------------------------------------------------------------------------------- 1 | # Avazu_x1 2 | 3 | + **Dataset description:** 4 | 5 | This dataset contains about 10 days of labeled click-through data on mobile advertisements. It has 22 feature fields including user features and advertisement attributes. As with the [AFN](https://ojs.aaai.org/index.php/AAAI/article/view/5768) work, the data are randomly split into 7:1:2 as the training set, validation set, and test set, respectively. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Avazu_x1 | 40,428,967 | 28,300,276 | 4,042,897 | 8,085,794 | 12 | 13 | + **Source:** https://www.kaggle.com/c/avazu-ctr-prediction/data 14 | + **Download:** https://huggingface.co/datasets/reczoo/Avazu_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Weiyu Cheng, Yanyan Shen, Linpeng Huang. [Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions](https://ojs.aaai.org/index.php/AAAI/article/view/5768). In AAAI 2020. 19 | - Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, Zhenhua Dong. [FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction](https://arxiv.org/abs/2304.00902). In AAAI 2023. 20 | - Jieming Zhu, Qinglin Jia, Guohao Cai, Quanyu Dai, Jingjie Li, Zhenhua Dong, Ruiming Tang, Rui Zhang. [FINAL: Factorized Interaction Layer for CTR Prediction](https://dl.acm.org/doi/10.1145/3539618.3591988). In SIGIR 2023. 
21 | 22 | + **Check the md5sum for data integrity:** 23 | ```bash 24 | $ md5sum train.csv valid.csv test.csv 25 | f1114a07aea9e996842c71648e0f6395 train.csv 26 | d9568f246357d156c4b8030fadb8b623 valid.csv 27 | 9e2fe9c48705c9315ae7a0953eb57acf test.csv 28 | ``` 29 | -------------------------------------------------------------------------------- /Avazu/Avazu_x1/convert_avazu_x1.py: -------------------------------------------------------------------------------- 1 | """ Convert libsvm data from AFN paper to csv format """ 2 | import pandas as pd 3 | from pathlib import Path 4 | import gc 5 | import hashlib 6 | 7 | headers = ["label", "feat_1", "feat_2", "feat_3", "feat_4", "feat_5", "feat_6", "feat_7", "feat_8", "feat_9", "feat_10", 8 | "feat_11", "feat_12", "feat_13", "feat_14", "feat_15", "feat_16", "feat_17", "feat_18", "feat_19", "feat_20", "feat_21", "feat_22"] 9 | 10 | data_files = ["train.libsvm", "valid.libsvm", "test.libsvm"] 11 | for f in data_files: 12 | df = pd.read_csv(f, sep=" ", names=headers) 13 | for col in headers[1:]: 14 | df[col] = df[col].apply(lambda x: x.split(':')[0]) 15 | df.to_csv(Path(f).stem + ".csv", index=False) 16 | del df 17 | gc.collect() 18 | 19 | 20 | # Check md5sum for correctness 21 | assert("f1114a07aea9e996842c71648e0f6395" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 22 | assert("d9568f246357d156c4b8030fadb8b623" == hashlib.md5(open('valid.csv', 'r').read().encode('utf-8')).hexdigest()) 23 | assert("9e2fe9c48705c9315ae7a0953eb57acf" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 24 | 25 | print("Reproducing data succeeded!") 26 | 27 | -------------------------------------------------------------------------------- /Avazu/Avazu_x1/download_avazu_x1.py: -------------------------------------------------------------------------------- 1 | # This file is modified from https://github.com/WeiyuCheng/AFN-AAAI-20/blob/master/src/download_criteo_and_avazu.py 2 | # to download the preprocessed data split Avazu_x1 3 | 4 | import os 5 | import zipfile 6 | import urllib.request 7 | from tqdm import tqdm 8 | 9 | 10 | class DownloadProgressBar(tqdm): 11 | def update_to(self, b=1, bsize=1, tsize=None): 12 | if tsize is not None: 13 | self.total = tsize 14 | self.update(b * bsize - self.n) 15 | 16 | def download(url, output_path): 17 | with DownloadProgressBar(unit='B', unit_scale=True, 18 | miniters=1, desc=url.split('/')[-1]) as t: 19 | urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to) 20 | 21 | if __name__ == "__main__": 22 | print("Begin to download avazu data, the total size is 683MB...") 23 | download('https://worksheets.codalab.org/rest/bundles/0xf5ab597052744680b1a55986557472c7/contents/blob/', './avazu.zip') 24 | print("Unzipping avazu dataset...") 25 | with zipfile.ZipFile('./avazu.zip', 'r') as zip_ref: 26 | zip_ref.extractall('./avazu/') 27 | print("Done.") 28 | -------------------------------------------------------------------------------- /Avazu/Avazu_x2/README.md: -------------------------------------------------------------------------------- 1 | # Avazu_x2 2 | 3 | + **Dataset description:** 4 | 5 | This dataset contains about 10 days of labeled click-through data on mobile advertisements. It has 22 feature fields including user features and advertisement attributes. 
Following the same setting as the [AutoGroup](https://dl.acm.org/doi/abs/10.1145/3397271.3401082) work, we randomly split 80% of the data for training and validation, and use the remaining 20% for testing. For all categorical fields, we filter infrequent features by setting the threshold min_category_count=20 and replace them with a default ``<OOV>`` token. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Avazu_x2 | 40,428,967 | 32,343,173 | | 8,085,794 | 12 | 13 | + **Source:** https://www.kaggle.com/c/avazu-ctr-prediction/data 14 | + **Download:** https://huggingface.co/datasets/reczoo/Avazu_x2/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Bin Liu, Niannan Xue, Huifeng Guo, Ruiming Tang, Stefanos Zafeiriou, Xiuqiang He, Zhenguo Li. [AutoGroup: Automatic Feature Grouping for Modelling Explicit High-Order Feature Interactions in CTR Prediction](https://dl.acm.org/doi/abs/10.1145/3397271.3401082). In SIGIR 2020. 19 | 20 | + **Check the md5sum for data integrity:** 21 | ```bash 22 | $ md5sum train.csv test.csv 23 | c41d786896e2ebe68e08a022199f0ce8 train.csv 24 | e641ea94c72cdc99b49656d3404f536e test.csv 25 | ``` 26 | -------------------------------------------------------------------------------- /Avazu/Avazu_x4/README.md: -------------------------------------------------------------------------------- 1 | # Avazu_x4 2 | 3 | + **Dataset description:** 4 | 5 | This dataset contains about 10 days of labeled click-through data on mobile advertisements. It has 22 feature fields including user features and advertisement attributes. Following the same setting as the [AutoInt](https://arxiv.org/abs/1810.11921) work, we split the data randomly into 8:1:1 as the training set, validation set, and test set, respectively. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Avazu_x4 | 40,428,967 | 32,343,172 | 4,042,897 | 4,042,898 | 12 | 13 | 14 | - Avazu_x4_001 15 | 16 | In this setting, we preprocess the data split by removing the ``id`` field that is useless for CTR prediction. In addition, we transform the timestamp field into three fields: hour, weekday, and is_weekend. For all categorical fields, we filter infrequent features by setting the threshold min_category_count=2 (performs well) and replace them with a default ``<OOV>`` token. Note that we do not follow the exact preprocessing steps in AutoInt, because the authors neither remove the useless ``id`` field nor specially preprocess the timestamp field. We fix **embedding_dim=16** following the existing [AutoInt work](https://arxiv.org/abs/1810.11921). 17 | 18 | - Avazu_x4_002 19 | 20 | In this setting, we preprocess the data split by removing the ``id`` field that is useless for CTR prediction. In addition, we transform the timestamp field into three fields: hour, weekday, and is_weekend. For all categorical fields, we filter infrequent features by setting the threshold min_category_count=1 and replace them with a default ``<OOV>`` token. Note that we found that min_category_count=1 performs the best, which is surprising. We fix **embedding_dim=40** following the existing [FGCNN work](https://arxiv.org/abs/1904.04447).
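A rough pandas sketch of the shared preprocessing steps described in the two settings above: dropping the `id` column and expanding the YYMMDDHH `hour` field into hour, weekday, and is_weekend. The file name and the exact derivation rules (e.g., which days count as the weekend) are assumptions, not the benchmark's reference code.

```python
import pandas as pd

# Minimal sketch (assumptions noted above), not the benchmark's reference preprocessing.
df = pd.read_csv("train.csv", dtype=str)            # raw Avazu split, file name assumed
df = df.drop(columns=["id"], errors="ignore")       # drop the per-sample id field

ts = pd.to_datetime(df["hour"], format="%y%m%d%H")  # e.g. 14091123 -> 2014-09-11 23:00 UTC
df["hour"] = ts.dt.hour.astype(str)
df["weekday"] = ts.dt.dayofweek.astype(str)         # 0 = Monday
df["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int).astype(str)
```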
21 | 22 | 23 | + **Source:** https://www.kaggle.com/c/avazu-ctr-prediction/data 24 | + **Download:** https://huggingface.co/datasets/reczoo/Avazu_x4/tree/main 25 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 26 | 27 | + **Used by papers:** 28 | - Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, Jian Tang. [AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/abs/1810.11921). In CIKM 2019. 29 | - Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, Xiuqiang He. [BARS-CTR: Open Benchmarking for Click-Through Rate Prediction](https://arxiv.org/abs/2009.05794). In CIKM 2021. 30 | 31 | + **Check the md5sum for data integrity:** 32 | ```bash 33 | $ md5sum train.csv valid.csv test.csv 34 | de3a27264cdabf66adf09df82328ccaa train.csv 35 | 33232931d84d6452d3f956e936cab2c9 valid.csv 36 | 3ebb774a9ca74d05919b84a3d402986d test.csv 37 | ``` 38 | -------------------------------------------------------------------------------- /Avazu/Avazu_x4/convert_avazu_x4.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import hashlib 4 | from sklearn.model_selection import StratifiedKFold 5 | 6 | """ 7 | NOTICE: We found that even though we fix the random seed, the resulting data split can be different 8 | due to the potential StratifiedKFold API change in different scikit-learn versions. For 9 | reproduciblity, `sklearn==0.19.1` is required. We use the python environement by installing 10 | `Anaconda3-5.2.0-Linux-x86_64.sh`. 11 | """ 12 | 13 | RANDOM_SEED = 2018 # Fix seed for reproduction 14 | ddf = pd.read_csv('train/train.csv', encoding='utf-8', dtype=object) 15 | X = ddf.values 16 | y = ddf['click'].map(lambda x: float(x)).values 17 | print(str(len(X)) + ' lines in total') 18 | 19 | folds = StratifiedKFold(n_splits=10, shuffle=True, 20 | random_state=RANDOM_SEED).split(X, y) 21 | 22 | fold_indexes = [] 23 | for train_id, valid_id in folds: 24 | fold_indexes.append(valid_id) 25 | test_index = fold_indexes[0] 26 | valid_index = fold_indexes[1] 27 | train_index = np.concatenate(fold_indexes[2:]) 28 | 29 | test_df = ddf.loc[test_index, :] 30 | test_df.to_csv('test.csv', index=False, encoding='utf-8') 31 | valid_df = ddf.loc[valid_index, :] 32 | valid_df.to_csv('valid.csv', index=False, encoding='utf-8') 33 | ddf.loc[train_index, :].to_csv('train.csv', index=False, encoding='utf-8') 34 | 35 | print('Train lines:', len(train_index)) 36 | print('Validation lines:', len(valid_index)) 37 | print('Test lines:', len(test_index)) 38 | print('Postive ratio:', np.sum(y) / len(y)) 39 | 40 | # Check md5sum for correctness 41 | assert("de3a27264cdabf66adf09df82328ccaa" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 42 | assert("33232931d84d6452d3f956e936cab2c9" == hashlib.md5(open('valid.csv', 'r').read().encode('utf-8')).hexdigest()) 43 | assert("3ebb774a9ca74d05919b84a3d402986d" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 44 | 45 | print("Reproducing data succeeded!") 46 | -------------------------------------------------------------------------------- /Avazu/README.md: -------------------------------------------------------------------------------- 1 | # Avazu 2 | 3 | + [Avazu_x1](./Avazu_x1/README.md) 4 | + [Avazu_x2](./Avazu_x2/README.md) 5 | + [Avazu_x4](./Avazu_x4/README.md) 6 | 7 | It is a [Kaggle challenge dataset](https://www.kaggle.com/c/avazu-ctr-prediction/data) for Avazu CTR prediction. 
[Avazu](http://avazuinc.com/home) is one of the leading mobile advertising platforms globally. The Kaggle competition targets at predicting whether a mobile ad will be clicked and has provided 11 days worth of Avazu data to build and test prediction models. It consists of 10 days of labeled click-through data for training and 1 day of ads data for testing (yet without labels). Note that only the first 10 days of labeled data are used for benchmarking. 8 | 9 | Data fields consist of: 10 | + id: ad identifier (``Note: This column is more like unique sample id, where each row has a distinct value, and thus should be dropped.``) 11 | + click: 0/1 for non-click/click 12 | + hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC. (``Note: It is a common practice to bucketize the timestamp into hour, day, is_weekend, and so on.``) 13 | + C1: anonymized categorical variable 14 | + banner_pos 15 | + site_id 16 | + site_domain 17 | + site_category 18 | + app_id 19 | + app_domain 20 | + app_category 21 | + device_id 22 | + device_ip 23 | + device_model 24 | + device_type 25 | + device_conn_type 26 | + C14-C21: anonymized categorical variables 27 | -------------------------------------------------------------------------------- /CiteULike/CiteUlikeA_m1/README.md: -------------------------------------------------------------------------------- 1 | # CiteUlikeA_m1 2 | 3 | + **Data format:** 4 | user_id item1 item2 ... 5 | 6 | + **Source:** http://www.citeulike.org 7 | + **Download:** https://huggingface.co/datasets/reczoo/CiteUlikeA_m1/tree/main 8 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 9 | 10 | + **Used by papers:** 11 | - Shuyi Ji, Yifan Feng, Rongrong Ji, Xibin Zhao, Wanwan Tang, Yue Gao. [Dual Channel Hypergraph Collaborative Filtering](https://dl.acm.org/doi/10.1145/3394486.3403253). In KDD 2020. 12 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filterin](https://arxiv.org/abs/2109.12613). In CIKM 2021. 13 | 14 | + **Check the md5sum for data integrity:** 15 | ```bash 16 | $ md5sum *.txt 17 | c9d2de139ac69d480264b6221a567324 test.txt 18 | f037c7ac8f9d8142bb5fd137ff61ad0c train.txt 19 | ``` 20 | -------------------------------------------------------------------------------- /Criteo/Criteo_x1/README.md: -------------------------------------------------------------------------------- 1 | # Criteo_x1 2 | 3 | + **Dataset description:** 4 | 5 | The Criteo dataset is a widely-used benchmark dataset for CTR prediction, which contains about one week of click-through data for display advertising. It has 13 numerical feature fields and 26 categorical feature fields. Following the [AFN](https://ojs.aaai.org/index.php/AAAI/article/view/5768) work, we randomly split the data into 7:2:1\* as the training set, validation set, and test set, respectively. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Criteo_x1 | 45,840,617 | 33,003,326 | 8,250,124 | 4,587,167 | 12 | 13 | + **Source:** https://www.kaggle.com/c/criteo-display-ad-challenge/data 14 | + **Download:** https://huggingface.co/datasets/reczoo/Criteo_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Weiyu Cheng, Yanyan Shen, Linpeng Huang. 
[Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions](https://ojs.aaai.org/index.php/AAAI/article/view/5768). In AAAI 2020. 19 | - Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, Zhenhua Dong. [FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction](https://arxiv.org/abs/2304.00902). In AAAI 2023. 20 | - Jieming Zhu, Qinglin Jia, Guohao Cai, Quanyu Dai, Jingjie Li, Zhenhua Dong, Ruiming Tang, Rui Zhang. [FINAL: Factorized Interaction Layer for CTR Prediction](https://dl.acm.org/doi/10.1145/3539618.3591988). In SIGIR 2023. 21 | 22 | + **Check the md5sum for data integrity:** 23 | ```bash 24 | $ md5sum train.csv valid.csv test.csv 25 | 30b89c1c7213013b92df52ec44f52dc5 train.csv 26 | f73c71fb3c4f66b6ebdfa032646bea72 valid.csv 27 | 2c48b26e84c04a69b948082edae46f8c test.csv 28 | ``` 29 | -------------------------------------------------------------------------------- /Criteo/Criteo_x1/convert_criteo_x1.py: -------------------------------------------------------------------------------- 1 | """ Convert libsvm data from AFN paper to csv format """ 2 | import pandas as pd 3 | from pathlib import Path 4 | import gc 5 | import hashlib 6 | 7 | headers = ["label", "I1", "I2", "I3", "I4", "I5", "I6", "I7", "I8", "I9", "I10", 8 | "I11", "I12", "I13", "C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9", "C10", 9 | "C11", "C12", "C13", "C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21", "C22", 10 | "C23", "C24", "C25", "C26"] 11 | 12 | data_files = ["train.libsvm", "valid.libsvm", "test.libsvm"] 13 | for f in data_files: 14 | df = pd.read_csv(f, sep=" ", names=headers) 15 | for col in headers[1:]: 16 | if col.startswith("I"): 17 | df[col] = df[col].apply(lambda x: x.split(':')[-1]) 18 | elif col.startswith("C"): 19 | df[col] = df[col].apply(lambda x: x.split(':')[0]) 20 | df.to_csv(Path(f).stem + ".csv", index=False) 21 | del df 22 | gc.collect() 23 | 24 | # Check md5sum for correctness 25 | assert("30b89c1c7213013b92df52ec44f52dc5" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 26 | assert("f73c71fb3c4f66b6ebdfa032646bea72" == hashlib.md5(open('valid.csv', 'r').read().encode('utf-8')).hexdigest()) 27 | assert("2c48b26e84c04a69b948082edae46f8c" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 28 | 29 | print("Reproducing data succeeded!") 30 | 31 | -------------------------------------------------------------------------------- /Criteo/Criteo_x1/download_criteo_x1.py: -------------------------------------------------------------------------------- 1 | # This file is modified from https://github.com/WeiyuCheng/AFN-AAAI-20/blob/master/src/download_criteo_and_avazu.py 2 | # to download the preprocessed data split Criteo_x1 3 | 4 | import os 5 | import zipfile 6 | import urllib.request 7 | from tqdm import tqdm 8 | 9 | 10 | class DownloadProgressBar(tqdm): 11 | def update_to(self, b=1, bsize=1, tsize=None): 12 | if tsize is not None: 13 | self.total = tsize 14 | self.update(b * bsize - self.n) 15 | 16 | def download(url, output_path): 17 | with DownloadProgressBar(unit='B', unit_scale=True, 18 | miniters=1, desc=url.split('/')[-1]) as t: 19 | urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to) 20 | 21 | if __name__ == "__main__": 22 | print("Begin to download criteo data, the total size is 3GB...") 23 | download('https://worksheets.codalab.org/rest/bundles/0x8dca5e7bac42470aa445f9a205d177c6/contents/blob/', './criteo.zip') 24 | print("Unzipping criteo dataset...") 25 | with 
zipfile.ZipFile('./criteo.zip', 'r') as zip_ref: 26 | zip_ref.extractall('./criteo/') 27 | print("Done.") 28 | 29 | -------------------------------------------------------------------------------- /Criteo/Criteo_x2/README.md: -------------------------------------------------------------------------------- 1 | # Criteo_x2 2 | 3 | + **Dataset description:** 4 | 5 | This dataset employs the [Criteo 1TB Click Logs](https://ailab.criteo.com/criteo-1tb-click-logs-dataset/) for display advertising, which contains one month of click-through data with billions of data samples. Following the same setting with the [AutoGroup](https://dl.acm.org/doi/abs/10.1145/3397271.3401082) work, we select "data 6-12" as the training set while using "day-13" for testing. To reduce label imbalance, we perform negative sub-sampling to keep the positive ratio roughly at 50%. It has 13 numerical feature fields and 26 categorical feature fields. In this setting, 13 numerical fields are converted into categorical values through bucketizing, while categorical features appearing less than 20 times are set as a default ```` feature. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Criteo_x2 | 99,616,043 | 86,883,012 | | 12,733,031 | 12 | 13 | + **Source:** https://ailab.criteo.com/criteo-1tb-click-logs-dataset 14 | + **Download:** https://huggingface.co/datasets/reczoo/Criteo_x2/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Bin Liu, Niannan Xue, Huifeng Guo, Ruiming Tang, Stefanos Zafeiriou, Xiuqiang He, Zhenguo Li. [AutoGroup: Automatic Feature Grouping for Modelling Explicit High-Order Feature Interactions in CTR Prediction](https://dl.acm.org/doi/abs/10.1145/3397271.3401082). In SIGIR 2020. 19 | 20 | + **Check the md5sum for data integrity:** 21 | ```bash 22 | $ md5sum train.csv test.csv 23 | d4d08405e95836ee049455cae0f8b0d6 train.csv 24 | 32c14fbc7bfe02e72b501793e8db660b test.csv 25 | ``` 26 | -------------------------------------------------------------------------------- /Criteo/Criteo_x4/README.md: -------------------------------------------------------------------------------- 1 | # Criteo_x4 2 | 3 | + **Dataset description:** 4 | 5 | The Criteo dataset is a widely-used benchmark dataset for CTR prediction, which contains about one week of click-through data for display advertising. It has 13 numerical feature fields and 26 categorical feature fields. Following the setting with the [AutoInt work](https://arxiv.org/abs/1810.11921), we randomly split the data into 8:1:1 as the training set, validation set, and test set, respectively. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Criteo_x4 | 45,840,617 | 36,672,493 | 4,584,062 | 4,584,062 | 12 | 13 | 14 | - Criteo_x4_001 15 | 16 | In this setting, we follow the winner's solution of the Criteo challenge to discretize each integer value x to ⌊log2(x)⌋, if x > 2; and x = 1 otherwise. For all categorical fields, we replace infrequent features with a default ```` token by setting the threshold min_category_count=10. Note that we do not follow the exact preprocessing steps in AutoInt, because this preprocessing performs much better. We fix **embedding_dim=16** as with AutoInt. 
17 | 18 | - Criteo_x4_002 19 | 20 | In this setting, we follow the winner's solution of the Criteo challenge to discretize each integer value x to ⌊log2(x)⌋, if x > 2; and x = 1 otherwise. For all categorical fields, we replace infrequent features with a default ``<OOV>`` token by setting the threshold min_category_count=2. We fix **embedding_dim=40** in this setting. 21 | 22 | 23 | + **Source:** https://www.kaggle.com/c/criteo-display-ad-challenge/data 24 | + **Download:** https://huggingface.co/datasets/reczoo/Criteo_x4/tree/main 25 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 26 | 27 | + **Used by papers:** 28 | - Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, Jian Tang. [AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/abs/1810.11921). In CIKM 2019. 29 | - Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, Xiuqiang He. [BARS-CTR: Open Benchmarking for Click-Through Rate Prediction](https://arxiv.org/abs/2009.05794). In CIKM 2021. 30 | 31 | + **Check the md5sum for data integrity:** 32 | ```bash 33 | $ md5sum train.csv valid.csv test.csv 34 | 4a53bb7cbc0e4ee25f9d6a73ed824b1a train.csv 35 | fba5428b22895016e790e2dec623cb56 valid.csv 36 | cfc37da0d75c4d2d8778e76997df2976 test.csv 37 | ``` 38 | -------------------------------------------------------------------------------- /Criteo/Criteo_x4/convert_criteo_x4.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import hashlib 4 | from sklearn.model_selection import StratifiedKFold 5 | 6 | """ 7 | NOTICE: We found that even though we fix the random seed, the resulting data split can be different 8 | due to the potential StratifiedKFold API change in different scikit-learn versions. For 9 | reproducibility, `sklearn==0.19.1` is required. We use the Python environment obtained by installing 10 | `Anaconda3-5.2.0-Linux-x86_64.sh`.
11 | """ 12 | 13 | RANDOM_SEED = 2018 # Fix seed for reproduction 14 | cols = ['Label'] 15 | for i in range(1, 14): 16 | cols.append('I' + str(i)) 17 | for i in range(1, 27): 18 | cols.append('C' + str(i)) 19 | 20 | ddf = pd.read_csv('dac/train.txt', sep='\t', header=None, names=cols, encoding='utf-8', dtype=object) 21 | X = ddf.values 22 | y = ddf['Label'].map(lambda x: float(x)).values 23 | print(str(len(X)) + ' lines in total') 24 | 25 | folds = StratifiedKFold(n_splits=10, shuffle=True, 26 | random_state=RANDOM_SEED).split(X, y) 27 | 28 | fold_indexes = [] 29 | for train_id, valid_id in folds: 30 | fold_indexes.append(valid_id) 31 | test_index = fold_indexes[0] 32 | valid_index = fold_indexes[1] 33 | train_index = np.concatenate(fold_indexes[2:]) 34 | 35 | test_df = ddf.loc[test_index, :] 36 | test_df.to_csv('test.csv', index=False, encoding='utf-8') 37 | valid_df = ddf.loc[valid_index, :] 38 | valid_df.to_csv('valid.csv', index=False, encoding='utf-8') 39 | ddf.loc[train_index, :].to_csv('train.csv', index=False, encoding='utf-8') 40 | 41 | print('Train lines:', len(train_index)) 42 | print('Validation lines:', len(valid_index)) 43 | print('Test lines:', len(test_index)) 44 | print('Postive ratio:', np.sum(y) / len(y)) 45 | 46 | # Check md5sum for correctness 47 | assert("4a53bb7cbc0e4ee25f9d6a73ed824b1a" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 48 | assert("fba5428b22895016e790e2dec623cb56" == hashlib.md5(open('valid.csv', 'r').read().encode('utf-8')).hexdigest()) 49 | assert("cfc37da0d75c4d2d8778e76997df2976" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 50 | 51 | print("Reproducing data succeeded!") 52 | 53 | -------------------------------------------------------------------------------- /Criteo/README.md: -------------------------------------------------------------------------------- 1 | # Criteo 2 | 3 | + [Criteo_x1](./Criteo_x1) 4 | + [Criteo_x2](./Criteo_x2) 5 | + [Criteo_x4](./Criteo_x4) 6 | 7 | The dataset is from a [Kaggle challenge for Criteo display advertising](https://www.kaggle.com/c/criteo-display-ad-challenge/data). Criteo is a personalized retargeting company that works with Internet retailers to serve personalized online display advertisements to consumers. The goal of this Kaggle challenge is to predict click-through rates on display ads. It offers a week's worth of data from Criteo's traffic. In the labeled training set over a period of 7 days, each row corresponds to a display ad served by Criteo. The samples are chronologically ordered. Positive and negatives samples have both been subsampled at different rates in order to reduce the dataset size. There are 13 count features and 26 categorical features. The semantic of these features is undisclosed. Some feature have missing values. Note that only the labeled part (i.e., `train.txt`) of the data is used for benchmarking. 8 | 9 | Data fields consist of: 10 | + Label: Target variable that indicates if an ad was clicked (1) or not (0). 11 | + I1-I13: A total of 13 columns of integer features (mostly count features). 12 | + C1-C26: A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes. 
13 | -------------------------------------------------------------------------------- /Frappe/Frappe_x1/README.md: -------------------------------------------------------------------------------- 1 | # Frappe_x1 2 | 3 | + **Dataset description:** 4 | 5 | The Frappe dataset contains a context-aware app usage log, which comprises 96203 entries by 957 users for 4082 apps used in various contexts. It has 10 feature fields including user_id, item_id, daytime, weekday, isweekend, homework, cost, weather, country, city. The target value indicates whether the user has used the app under the context. Following the [AFN](https://ojs.aaai.org/index.php/AAAI/article/view/5768) work, we randomly split the data into 7:2:1 as the training set, validation set, and test set, respectively. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Frappe_x1 | 288,609 | 202,027 | 57,722 | 28,860 | 12 | 13 | + **Source:** https://www.baltrunas.info/context-aware 14 | + **Download:** https://huggingface.co/datasets/reczoo/Frappe_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Weiyu Cheng, Yanyan Shen, Linpeng Huang. [Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions](https://ojs.aaai.org/index.php/AAAI/article/view/5768). In AAAI 2020. 19 | - Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, Zhenhua Dong. [FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction](https://arxiv.org/abs/2304.00902). In AAAI 2023. 20 | - Jieming Zhu, Qinglin Jia, Guohao Cai, Quanyu Dai, Jingjie Li, Zhenhua Dong, Ruiming Tang, Rui Zhang. [FINAL: Factorized Interaction Layer for CTR Prediction](https://dl.acm.org/doi/10.1145/3539618.3591988). In SIGIR 2023. 21 | 22 | + **Check the md5sum for data integrity:** 23 | ```bash 24 | $ md5sum train.csv valid.csv test.csv 25 | ba7306e6c4fc19dd2cd84f2f0596d158 train.csv 26 | 88d51bf2173505436d3a8f78f2a59da8 valid.csv 27 | 3470f6d32713dc5f7715f198ca7c612a test.csv 28 | ``` 29 | -------------------------------------------------------------------------------- /Frappe/Frappe_x1/convert_frappe_x1.py: -------------------------------------------------------------------------------- 1 | # Convert libsvm data from AFN [AAAI'2020] to csv format 2 | 3 | import pandas as pd 4 | from pathlib import Path 5 | import gc 6 | 7 | headers = ["label", "user", "item", "daytime", "weekday", "isweekend", "homework", "cost", "weather", "country", "city"] 8 | 9 | data_files = ["train.libsvm", "valid.libsvm", "test.libsvm"] 10 | for f in data_files: 11 | df = pd.read_csv(f, sep=" ", names=headers) 12 | for col in headers[1:]: 13 | df[col] = df[col].apply(lambda x: x.split(':')[0]) 14 | df.to_csv(Path(f).stem + ".csv", index=False) 15 | del df 16 | gc.collect() -------------------------------------------------------------------------------- /Frappe/README.md: -------------------------------------------------------------------------------- 1 | # Frappe 2 | 3 | + [Frappe_x1](./Frappe_x1) 4 | 5 | The frappe dataset contains a context-aware app usage log. It consists of 96203 entries by 957 users for 4082 apps used in various contexts. 
6 | 7 | Data fields consist of: 8 | + user: anonymized user id 9 | + item: anonymized app id 10 | + daytime 11 | + weekday 12 | + isweekend 13 | + homework 14 | + cost 15 | + weather 16 | + country 17 | + city 18 | + cnt: how many times the app has been used by the user 19 | 20 | Any scientific publications that use this dataset should cite the following paper: 21 | 22 | + Linas Baltrunas, Karen Church, Alexandros Karatzoglou, Nuria Oliver. [Frappe: Understanding the Usage and Perception of Mobile App Recommendations In-The-Wild](https://arxiv.org/abs/1505.03014), Arxiv 1505.03014, 2015. 23 | -------------------------------------------------------------------------------- /Gowalla/Gowalla_m1/README.md: -------------------------------------------------------------------------------- 1 | # Gowalla_m1 2 | 3 | + **Dataset description:** 4 | 5 | The dataset statistics are summarized as follows: 6 | 7 | | Dataset ID | #Users | #Items | #Interactions | #Train | #Test | Density | 8 | |:--------------:|:------:|:------:|:-------------:|:---------:|:-------:|:-------:| 9 | | Gowalla_m1 | 29,858 | 40,981 | 1,027,370 | 810,128 | 217,242 | 0.00084 | 10 | 11 | + **Source:** https://snap.stanford.edu/data/loc-gowalla.html 12 | + **Download:** https://huggingface.co/datasets/reczoo/Gowalla_m1/tree/main 13 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 14 | 15 | + **Used by papers:** 16 | - Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, Meng Wang. [LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation](https://arxiv.org/abs/2002.02126). In SIGIR 2020. 17 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filtering](https://arxiv.org/abs/2109.12613). In CIKM 2021. 18 | - Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. [UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation](https://arxiv.org/abs/2110.15114). In CIKM 2021. 19 | 20 | + **Check the md5sum for data integrity:** 21 | ```bash 22 | $ md5sum *.txt 23 | 13b1c0d75b07b8cea9413f40042f476f item_list.txt 24 | c04e2c4bcd2389f53ed8281816166149 test.txt 25 | 5eec1eb2edb8dd648377d348b8e136cf train.txt 26 | f83ec6f2cd974ba6470e8808830cc144 user_list.txt 27 | ``` 28 | -------------------------------------------------------------------------------- /KKBox/KKBox_x1/README.md: -------------------------------------------------------------------------------- 1 | # KKBox_x1 2 | 3 | + **Dataset description:** 4 | 5 | KKBox is a challenge dataset for music recommendation at WSDM 2018. The data consist of user-song pairs in a given time period, with a total of 19 user features (e.g., city, gender) and song features (e.g., language, genre, artist). We randomly split the data into 8:1:1 as the training set, validation set, and test set, respectively. In this setting, for all categorical fields, we replace infrequent features with a default ```` token by setting the threshold min_category_count=10. 
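A minimal sketch of the min_category_count filtering described above: categories seen fewer than 10 times in the training split are mapped to a default `<OOV>` token before feature encoding. The label column name, the strict "fewer than" rule, and treating every non-label column as categorical are assumptions.

```python
import pandas as pd

MIN_CATEGORY_COUNT = 10  # threshold used for KKBox_x1, per the description above

train = pd.read_csv("train.csv", dtype=str)
valid = pd.read_csv("valid.csv", dtype=str)
test = pd.read_csv("test.csv", dtype=str)

cat_cols = [c for c in train.columns if c != "target"]  # label column name assumed
for col in cat_cols:
    counts = train[col].value_counts()                  # frequencies from the training split only
    keep = set(counts[counts >= MIN_CATEGORY_COUNT].index)
    for split in (train, valid, test):
        split[col] = split[col].where(split[col].isin(keep), "<OOV>")
```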
6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | KKBox_x1 | 7,377,418 | 5,901,932 | 737,743 | 737,743 | 12 | 13 | + **Source:** https://www.kaggle.com/c/kkbox-music-recommendation-challenge 14 | + **Download:** https://huggingface.co/datasets/reczoo/KKBox_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Jieming Zhu, Quanyu Dai, Liangcai Su, Rong Ma, Jinyang Liu, Guohao Cai, Xi Xiao, Rui Zhang. [BARS: Towards Open Benchmarking for Recommender Systems](https://arxiv.org/abs/2205.09626). In SIGIR 2022. 19 | 20 | + **Check the md5sum for data integrity:** 21 | ```bash 22 | $ md5sum train.csv valid.csv test.csv 23 | 195b1ae8fc2d9267d7c8656c07ea1304 train.csv 24 | 398e97ac139611a09bd61a58e4240a3e valid.csv 25 | 8c5f7add05a6f5258b6b3bcc00ba640b test.csv 26 | ``` 27 | -------------------------------------------------------------------------------- /KKBox/README.md: -------------------------------------------------------------------------------- 1 | # KKBox 2 | 3 | + [KKBox_x1](./KKBox_x1) 4 | 5 | It is a [WSDM challenge dataset for KKBox's music recommendation](https://www.kaggle.com/c/kkbox-music-recommendation-challenge) in 2018. The dataset is from [KKBox](https://www.kkbox.com), Asia's leading music streaming service, which holds the world's most comprehensive Asia-Pop music library with over 30 million tracks. 6 | 7 | The task is to predict the chances of a user listening to a song repetitively after the first observable listening event within a time window was triggered. If there are recurring listening event(s) triggered within a month after the user's very first observable listening event, its target is marked 1, and 0 otherwise in the training set. KKBox provides a training data set consists of information of the first observable listening event for each unique user-song pair within a specific time duration. Metadata of each unique user and song pair is also provided. The train and the test data are selected from users listening history in a given time period, and are split based on time. Note that only the labeled train set of the dataset is used for benchmarking. 8 | 9 | Data fields consist of: 10 | + target: this is the target variable. target=1 means there are recurring listening event(s) triggered within a month after the user’s very first observable listening event, target=0 otherwise . 11 | + msno: user id 12 | + song_id: song id 13 | + source_system_tab: the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile apps functions. For example, tab my library contains functions to manipulate the local storage, and tab search contains functions relating to search. 14 | + source_screen_name: name of the layout a user sees. 15 | + source_type: an entry point a user first plays music on mobile apps. An entry point could be album, online-playlist, song, etc. 16 | 17 | Song features: 18 | + song_length: in ms 19 | + genre_ids: genre category. Some songs have multiple genres and they are separated by | 20 | + artist_name 21 | + composer 22 | + lyricist 23 | + language 24 | + song name: the name of the song. 25 | + isrc: International Standard Recording Code 26 | 27 | User features: 28 | + city 29 | + bd: age. Note: this column has outlier values, please use your judgement. 
30 | + gender 31 | + registered_via: registration method 32 | + registration_init_time: format %Y%m%d 33 | + expiration_date: format %Y%m%d 34 | -------------------------------------------------------------------------------- /KuaiShou/KuaiVideo_x1/README.md: -------------------------------------------------------------------------------- 1 | # KuaiVideo_x1 2 | 3 | + **Dataset description:** 4 | 5 | The raw dataset is released by the Kuaishou Competition in the China MM 2018 conference, which aims to predict users' click probabilities for new micro-videos. In this dataset, there are multiple types of interactions between users and micro-videos, such as "click", "not click", "like", and "follow". Particularly, "not click" means the user did not click the micro-video after previewing its thumbnail. Note that the timestamp associated with each behaviour has been processed such that the absolute time is unknown, but the sequential order can be obtained according to the timestamp. For each micro-video, we can access its 2,048-d visual embedding of its thumbnail. In total, 10,000 users and their 3,239,534 interacted micro-videos are randomly selected. We follow the train-test data splitting from the [ALPINE](https://github.com/liyongqi67/ALPINE) work. In this setting, we filter infrequent categorical features with the threshold min_category_count=10. We further set the maximal length of user behavior sequence to 100. 6 | 7 | Note that the 3239534 item ids in behavior data are not continous (0 ~ 3242314), thus `item_visual_emb_dim64.h5` has 3242315 rows, each of which corresponds to an item id and its visual embedding. 8 | 9 | The dataset statistics are summarized as follows: 10 | 11 | | Dataset Split | Total | #Train | #Validation | #Test | 12 | | :--------: | :-----: |:-----: | :----------: | :----: | 13 | | KuaiVideo_x1 | 13,661,383 | 10,931,092 | | 2,730,291 | 14 | 15 | + **Source:** https://www.kuaishou.com/activity/uimc 16 | + **Download:** https://huggingface.co/datasets/reczoo/KuaiVideo_x1/tree/main 17 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 18 | 19 | + **Used by papers:** 20 | - Yongqi Li, Meng Liu, Jianhua Yin, Chaoran Cui, Xinshun-Xu, and Liqiang Nie. [Routing Micro-videos via A Temporal Graph-guided Recommendation System](https://liyongqi67.github.io/papers/MM2019_Routing_Micro_videos_via_A_Temporal_Graph_guided_Recommendation_System.pdf). In MM 2020. 21 | - Jieming Zhu, Guohao Cai, Junjie Huang, Zhenhua Dong, Ruiming Tang, Weinan Zhang. [ReLoop2: Building Self-Adaptive Recommendation Models via Responsive Error Compensation Loop](https://arxiv.org/abs/2306.08808). In KDD 2023. 22 | 23 | + **Check the md5sum for data integrity:** 24 | ```bash 25 | $ md5sum train.csv test.csv 26 | 16f13734411532cc313caf2180bfcd56 train.csv 27 | ba26c01caaf6c65c272af11aa451fc7a test.csv 28 | ``` 29 | -------------------------------------------------------------------------------- /KuaiShou/KuaiVideo_x1/convert_kuaivideo_x1.py: -------------------------------------------------------------------------------- 1 | """ Convert the raw `dataset.pkl` from pickle to csv format, which is 2 | obtained from the following paper: Li et al., Routing Micro-videos 3 | via A Temporal Graph-guided Recommendation System, MM 2019. 
4 | See https://github.com/liyongqi67/ALPINE 5 | """ 6 | 7 | import pickle 8 | import numpy as np 9 | import h5py 10 | import pandas as pd 11 | import hashlib 12 | 13 | 14 | data_path = "./" 15 | MAX_SEQ_LEN = 100 # chunk the max length of behavior sequence to 100 16 | 17 | with open(data_path + "dataset.pkl", "rb") as f: 18 | train = pickle.load(f) 19 | test = pickle.load(f) 20 | pos_seq = pickle.load(f) 21 | neg_seq = pickle.load(f) 22 | pos_edge = pickle.load(f) 23 | neg_edge = pickle.load(f) 24 | 25 | for part in ["train", "test"]: 26 | sample_list = [] 27 | for sample in eval(part): 28 | user_id = sample[0][0] 29 | item_id = sample[0][1] 30 | is_click = sample[0][2] 31 | is_like = sample[0][3] 32 | is_follow = sample[0][4] 33 | timestamp = sample[0][5] 34 | pos_len = sample[1] 35 | neg_len = sample[2] 36 | pos_items = "^".join(map(str, pos_seq[user_id][0:min(pos_len, MAX_SEQ_LEN)])) 37 | neg_items = "^".join(map(str, neg_seq[user_id][0:min(neg_len, MAX_SEQ_LEN)])) 38 | sample_list.append([timestamp, user_id, item_id, is_click, is_like, is_follow, pos_items, neg_items]) 39 | data = pd.DataFrame(sample_list, columns=["timestamp", "user_id", "item_id", "is_click", "is_like", "is_follow", "pos_items", "neg_items"]) 40 | data.sort_values(by="timestamp", inplace=True) 41 | data.to_csv(f"{part}" + ".csv", index=False) 42 | 43 | user_emb = np.load(data_path + "user_like.npy") 44 | image_emb = np.load(data_path + "visual64_select.npy") 45 | 46 | with h5py.File("item_visual_emb_dim64.h5", 'w') as hf: 47 | hf.create_dataset("key", data=list(range(len(image_emb)))) 48 | hf.create_dataset("value", data=image_emb) 49 | 50 | with h5py.File("user_visual_emb_dim64.h5", 'w') as hf: 51 | hf.create_dataset("key", data=list(range(len(user_emb)))) 52 | hf.create_dataset("value", data=user_emb) 53 | 54 | # Check md5sum for correctness 55 | assert("16f13734411532cc313caf2180bfcd56" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 56 | assert("ba26c01caaf6c65c272af11aa451fc7a" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 57 | 58 | print("Reproducing data succeeded!") 59 | -------------------------------------------------------------------------------- /KuaiShou/README.md: -------------------------------------------------------------------------------- 1 | # KuaiShou 2 | 3 | + [KuaiVideo_x1](./KuaiVideo_x1) 4 | -------------------------------------------------------------------------------- /MIND/MIND_large_x1/README.md: -------------------------------------------------------------------------------- 1 | # MIND_large_x1 2 | 3 | + **Dataset description:** 4 | 5 | MIND is a large-scale Microsoft news dataset for news recommendation. It was collected from anonymized behavior logs of Microsoft News website. MIND totally contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression. 
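The conversion script below stores the pretrained entity embeddings as paired `key`/`value` datasets in `entity_emb_dim100.h5`; the KuaiVideo script above uses the same layout for its visual embeddings. A minimal sketch for loading such a file back into a dict:

```python
import h5py

# Minimal sketch: read back a key/value embedding file written by the conversion scripts.
with h5py.File("entity_emb_dim100.h5", "r") as hf:
    keys = [k.decode("utf-8") if isinstance(k, bytes) else k for k in hf["key"][:]]
    values = hf["value"][:]  # shape: (num_entities, 100)

entity_emb = dict(zip(keys, values))
print(len(entity_emb), "entities,", values.shape[1], "dims")
```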
6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | MIND_large_x1 | | | | | 12 | 13 | + **Source:** https://msnews.github.io/index.html 14 | + **Download:** https://huggingface.co/datasets/reczoo/MIND_large_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, Ming Zhou. [MIND: A Large-scale Dataset for News Recommendation](https://aclanthology.org/2020.acl-main.331). In ACL 2020. 19 | - Jian Li, Jieming Zhu, Qiwei Bi, Guohao Cai, Lifeng Shang, Zhenhua Dong, Xin Jiang, Qun Liu. [MINER: Multi-Interest Matching Network for News Recommendation](https://aclanthology.org/2022.findings-acl.29.pdf). In ACL 2022. 20 | - Qijiong Liu, Jieming Zhu, Quanyu Dai, Xiaoming Wu. [Boosting Deep CTR Prediction with a Plug-and-Play Pre-trainer for News Recommendation](https://aclanthology.org/2022.coling-1.249.pdf). In COLING 2022. 21 | 22 | + **Check the md5sum for data integrity:** 23 | ```bash 24 | $ md5sum train.csv valid.csv test.csv news_corpus.tsv 25 | 955b80b959fb15076a0568d82da6bf05 train.csv 26 | 4942111ca7ba975b5f5dae8e2c54f1f0 valid.csv 27 | cbd5e69d573dc471d9f9ae91f2b5690f test.csv 28 | 9007e6b9127ff71bf146b7cfc1dc842d news_corpus.tsv 29 | ``` 30 | -------------------------------------------------------------------------------- /MIND/MIND_large_x1/convert_MIND_large_x1.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import h5py 4 | import os 5 | import hashlib 6 | 7 | MAX_SEQ_LEN = 50 8 | train_path = "./MINDlarge_train/" 9 | dev_path = "./MINDlarge_dev/" 10 | test_path = "./MINDlarge_test/" 11 | 12 | print("Preprocess news profile...") 13 | train_wiki_file = os.path.join(train_path, "entity_embedding.vec") 14 | dev_wiki_file = os.path.join(dev_path, "entity_embedding.vec") 15 | test_wiki_file = os.path.join(test_path, "entity_embedding.vec") 16 | entity_dict = dict() 17 | with open(train_wiki_file, "r") as fin: 18 | for line in fin: 19 | l = line.strip().split("\t") 20 | entity_dict[l[0]] = [float(v) for v in l[1:]] 21 | with open(dev_wiki_file, "r") as fin: 22 | for line in fin: 23 | l = line.strip().split("\t") 24 | entity_dict[l[0]] = [float(v) for v in l[1:]] 25 | with open(test_wiki_file, "r") as fin: 26 | for line in fin: 27 | l = line.strip().split("\t") 28 | entity_dict[l[0]] = [float(v) for v in l[1:]] 29 | 30 | train_news_file = os.path.join(train_path, "news.tsv") 31 | train_news = pd.read_csv(train_news_file, sep="\t", header=None, 32 | names=["news_id", "cat", "sub_cat", "title", "abstract", "url", 33 | "title_entities", "abstract_entities"]) 34 | dev_news_file = os.path.join(dev_path, "news.tsv") 35 | dev_news = pd.read_csv(dev_news_file, sep="\t", header=None, 36 | names=["news_id", "cat", "sub_cat", "title", "abstract", "url", 37 | "title_entities", "abstract_entities"]) 38 | test_news_file = os.path.join(test_path, "news.tsv") 39 | test_news = pd.read_csv(test_news_file, sep="\t", header=None, 40 | names=["news_id", "cat", "sub_cat", "title", "abstract", "url", 41 | "title_entities", "abstract_entities"]) 42 | news = pd.concat([train_news, dev_news, test_news], axis=0) 43 | news = news.drop_duplicates(subset=['news_id']).reset_index(drop=True) 44 | news = news[["news_id", 
"cat", "sub_cat", "title", "abstract", "title_entities", "abstract_entities"]] 45 | news["title_entities"] = news["title_entities"].fillna("[]") 46 | news["abstract_entities"] = news["abstract_entities"].fillna("[]") 47 | news["title_entities"] = news["title_entities"] \ 48 | .map(lambda x: "^".join([v["WikidataId"] for v in eval(x) if v["WikidataId"] in entity_dict])) 49 | news["abstract_entities"] = news["abstract_entities"] \ 50 | .map(lambda x: "^".join([v["WikidataId"] for v in eval(x) if v["WikidataId"] in entity_dict])) 51 | news.to_csv("news_corpus.tsv", sep="\t", index=False) 52 | print(news.head()) 53 | 54 | entity_set = set(list(news["title_entities"].values) + list(news["abstract_entities"].values)) 55 | entity_keys = [] 56 | entity_values = [] 57 | for k, v in entity_dict.items(): 58 | if k in entity_set: 59 | entity_keys.append(k) 60 | entity_values.append(v) 61 | with h5py.File("entity_emb_dim100.h5", 'w') as hf: 62 | hf.create_dataset("key", data=np.array(entity_keys, dtype=h5py.special_dtype(vlen=str))) 63 | hf.create_dataset("value", data=np.array(entity_values)) 64 | 65 | news2cat = dict(zip(news["news_id"], news["cat"])) 66 | news2subcat = dict(zip(news["news_id"], news["sub_cat"])) 67 | news2title_entities = dict(zip(news["news_id"], news["title_entities"])) 68 | news2abstract_entities = dict(zip(news["news_id"], news["abstract_entities"])) 69 | used_feat = [ 70 | "imp_id", 71 | "click", 72 | "hour", 73 | "user_id", 74 | "news_id", 75 | "cat", 76 | "sub_cat", 77 | "title_entities", 78 | "abstract_entities", 79 | "news_his", 80 | "cat_his", 81 | "subcat_his" 82 | ] 83 | 84 | def join_data(in_path, out_path): 85 | df = pd.read_csv(in_path, sep="\t", header=None, 86 | names=["imp_id", "user_id", "timestamp", "news_his", "impression_list"]) 87 | df["news_his"] = df["news_his"].fillna("").map(lambda x: \ 88 | "^".join([v for v in x.split() if v in news2cat][-MAX_SEQ_LEN:])) 89 | df = df.drop('impression_list', axis=1).join( \ 90 | df['impression_list'].str.split(' ', expand=True).stack(). 
\ 91 | reset_index(level=1, drop=True).rename('impression')) 92 | df["hour"] = df["timestamp"].map(lambda t: t.split(" ")[1].split(":")[0] + t.split(" ")[-1]) 93 | try: 94 | df[["news_id", "click"]] = df["impression"].str.split("-", expand=True) 95 | except: 96 | df["news_id"] = df["impression"] 97 | df["click"] = [-1] * len(df["impression"]) 98 | df = pd.merge(df, news, how="left", on="news_id") 99 | df["cat_his"] = df["news_his"].map(lambda x: "^".join([news2cat.get(i, "") for i in x.split("^")])) 100 | df["subcat_his"] = df["news_his"].map(lambda x: "^".join([news2subcat.get(i, "") for i in x.split("^")])) 101 | df[used_feat].to_csv(out_path, index=False) 102 | 103 | print("Preprocess train data...") 104 | join_data(os.path.join(train_path, "behaviors.tsv"), "train.csv") 105 | print("Preprocess dev data...") 106 | join_data(os.path.join(dev_path, "behaviors.tsv"), "valid.csv") 107 | print("Preprocess test data...") 108 | join_data(os.path.join(test_path, "behaviors.tsv"), "test.csv") 109 | 110 | # Check md5sum for correctness 111 | assert("9007e6b9127ff71bf146b7cfc1dc842d" == hashlib.md5(open('news_corpus.tsv', 'r').read().encode('utf-8')).hexdigest()) 112 | assert("955b80b959fb15076a0568d82da6bf05" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 113 | assert("4942111ca7ba975b5f5dae8e2c54f1f0" == hashlib.md5(open('valid.csv', 'r').read().encode('utf-8')).hexdigest()) 114 | assert("cbd5e69d573dc471d9f9ae91f2b5690f" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 115 | print("Reproducing data succeeded!") 116 | -------------------------------------------------------------------------------- /MIND/MIND_small_x1/README.md: -------------------------------------------------------------------------------- 1 | # MIND_small_x1 2 | 3 | + **Dataset description:** 4 | 5 | MIND is a large-scale Microsoft news dataset for news recommendation. It was collected from anonymized behavior logs of Microsoft News website. MIND totally contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression. The MIND-small version of the dataset is made by randomly sampling 50,000 users and their behavior logs from the MIND dataset. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | MIND_small_x1 | 8,584,442 | 5,843,444 | 2,740,998 | | 12 | 13 | + **Source:** https://msnews.github.io/index.html 14 | + **Download:** https://huggingface.co/datasets/reczoo/MIND_small_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, Ming Zhou. [MIND: A Large-scale Dataset for News Recommendation](https://aclanthology.org/2020.acl-main.331). In ACL 2020. 19 | - Jian Li, Jieming Zhu, Qiwei Bi, Guohao Cai, Lifeng Shang, Zhenhua Dong, Xin Jiang, Qun Liu. [MINER: Multi-Interest Matching Network for News Recommendation](https://aclanthology.org/2022.findings-acl.29.pdf). In ACL 2022. 20 | - Qijiong Liu, Jieming Zhu, Quanyu Dai, Xiaoming Wu. 
[Boosting Deep CTR Prediction with a Plug-and-Play Pre-trainer for News Recommendation](https://aclanthology.org/2022.coling-1.249.pdf). In COLING 2022. 21 | 22 | + **Check the md5sum for data integrity:** 23 | ```bash 24 | $ md5sum train.csv valid.csv news_corpus.tsv 25 | 51ac2a4514754078ad05b1028a4c7b9a train.csv 26 | 691961eb780f97b68606e4decebf2296 valid.csv 27 | 51e0b3ae69deab32c7c3f6590f0dab72 news_corpus.tsv 28 | ``` 29 | -------------------------------------------------------------------------------- /MIND/MIND_small_x1/convert_MIND_small_x1.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import h5py 4 | import os 5 | import hashlib 6 | 7 | MAX_SEQ_LEN = 50 8 | train_path = "./MINDsmall_train/" 9 | dev_path = "./MINDsmall_dev/" 10 | 11 | print("Preprocess news profile...") 12 | train_wiki_file = os.path.join(train_path, "entity_embedding.vec") 13 | dev_wiki_file = os.path.join(dev_path, "entity_embedding.vec") 14 | entity_dict = dict() 15 | with open(train_wiki_file, "r") as fin: 16 | for line in fin: 17 | l = line.strip().split("\t") 18 | entity_dict[l[0]] = [float(v) for v in l[1:]] 19 | with open(dev_wiki_file, "r") as fin: 20 | for line in fin: 21 | l = line.strip().split("\t") 22 | entity_dict[l[0]] = [float(v) for v in l[1:]] 23 | 24 | train_news_file = os.path.join(train_path, "news.tsv") 25 | train_news = pd.read_csv(train_news_file, sep="\t", header=None, 26 | names=["news_id", "cat", "sub_cat", "title", "abstract", "url", 27 | "title_entities", "abstract_entities"]) 28 | dev_news_file = os.path.join(dev_path, "news.tsv") 29 | dev_news = pd.read_csv(dev_news_file, sep="\t", header=None, 30 | names=["news_id", "cat", "sub_cat", "title", "abstract", "url", 31 | "title_entities", "abstract_entities"]) 32 | news = pd.concat([train_news, dev_news], axis=0) 33 | news = news.drop_duplicates(subset=['news_id']).reset_index(drop=True) 34 | news = news[["news_id", "cat", "sub_cat", "title_entities", "abstract_entities", "title", "abstract"]] 35 | news["title_entities"] = news["title_entities"].fillna("[]") 36 | news["abstract_entities"] = news["abstract_entities"].fillna("[]") 37 | news["title_entities"] = news["title_entities"] \ 38 | .map(lambda x: "^".join([v["WikidataId"] for v in eval(x) if v["WikidataId"] in entity_dict])) 39 | news["abstract_entities"] = news["abstract_entities"] \ 40 | .map(lambda x: "^".join([v["WikidataId"] for v in eval(x) if v["WikidataId"] in entity_dict])) 41 | news.to_csv("news_corpus.tsv", sep="\t", index=False) 42 | print(news.head()) 43 | 44 | entity_set = set(list(news["title_entities"].values) + list(news["abstract_entities"].values)) 45 | entity_keys = [] 46 | entity_values = [] 47 | for k, v in entity_dict.items(): 48 | if k in entity_set: 49 | entity_keys.append(k) 50 | entity_values.append(v) 51 | with h5py.File("entity_emb_dim100.h5", 'w') as hf: 52 | hf.create_dataset("key", data=np.array(entity_keys, dtype=h5py.special_dtype(vlen=str))) 53 | hf.create_dataset("value", data=np.array(entity_values)) 54 | 55 | news2cat = dict(zip(news["news_id"], news["cat"])) 56 | news2subcat = dict(zip(news["news_id"], news["sub_cat"])) 57 | news2title_entities = dict(zip(news["news_id"], news["title_entities"])) 58 | news2abstract_entities = dict(zip(news["news_id"], news["abstract_entities"])) 59 | used_feat = [ 60 | "imp_id", 61 | "click", 62 | "hour", 63 | "user_id", 64 | "news_id", 65 | "cat", 66 | "sub_cat", 67 | "title_entities", 68 | "abstract_entities", 69 | 
"news_his", 70 | "cat_his", 71 | "subcat_his" 72 | ] 73 | 74 | def join_data(in_path, out_path): 75 | df = pd.read_csv(in_path, sep="\t", header=None, 76 | names=["imp_id", "user_id", "timestamp", "news_his", "impression_list"]) 77 | df["news_his"] = df["news_his"].fillna("").map(lambda x: \ 78 | "^".join([v for v in x.split() if v in news2cat][-MAX_SEQ_LEN:])) 79 | df = df.drop('impression_list', axis=1).join( \ 80 | df['impression_list'].str.split(' ', expand=True).stack(). \ 81 | reset_index(level=1, drop=True).rename('impression')) 82 | df["hour"] = df["timestamp"].map(lambda t: t.split(" ")[1].split(":")[0] + t.split(" ")[-1]) 83 | df[["news_id", "click"]] = df["impression"].str.split("-", expand=True) 84 | df = pd.merge(df, news, how="left", on="news_id") 85 | df["cat_his"] = df["news_his"].map(lambda x: "^".join([news2cat.get(i, "") for i in x.split("^")])) 86 | df["subcat_his"] = df["news_his"].map(lambda x: "^".join([news2subcat.get(i, "") for i in x.split("^")])) 87 | df[used_feat].to_csv(out_path, index=False) 88 | 89 | print("Preprocess train data...") 90 | join_data(os.path.join(train_path, "behaviors.tsv"), "train.csv") 91 | print("Preprocess dev data...") 92 | join_data(os.path.join(dev_path, "behaviors.tsv"), "valid.csv") 93 | 94 | # Check md5sum for correctness 95 | assert("fe0ec15c20424535b5e5471a7f32d61e" == hashlib.md5(open('news_corpus.tsv', 'r').read().encode('utf-8')).hexdigest()) 96 | assert("18b16481ad421986de3f80ce3295f5ed" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 97 | assert("1223a2a14fa65e880bc8158836da3894" == hashlib.md5(open('valid.csv', 'r').read().encode('utf-8')).hexdigest()) 98 | print("Reproducing data succeeded!") 99 | -------------------------------------------------------------------------------- /MicroVideo/MicroVideo1.7M_x1/README.md: -------------------------------------------------------------------------------- 1 | # MicroVideo1.7M_x1 2 | 3 | + **Dataset description:** 4 | 5 | This is a micro-video dataset provided by the [THACIL work](https://dl.acm.org/doi/10.1145/3240508.3240617), which contains 12,737,617 interactions that 10,986 users have made on 1,704,880 micro-videos. The features include user id, item id, category, and the extracted image embedding vectors of cover images of micro-videos. Note that the dataset has been split such that the items in the test set are all new micro-videos, which have no overlap with the items in the training set. This helps validate the generability of multimodal embedding vectors for new micro-videos. In this setting, we set the maximal length of user behavior sequence to 100. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | MicroVideo1.7M_x1 | 12,737,617 | 8,970,309 | | 3,767,308 | 12 | 13 | + **Source:** https://github.com/Ocxs/THACIL 14 | + **Download:** https://huggingface.co/datasets/reczoo/MicroVideo1.7M_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Xusong Chen, Dong Liu, Zheng-Jun Zha, Wengang Zhou, Zhiwei Xiong, Yan Li. [Temporal Hierarchical Attention at Category- and Item-Level for Micro-Video Click-Through Prediction](https://dl.acm.org/doi/10.1145/3240508.3240617). In MM 2018. 19 | - Jieming Zhu, Guohao Cai, Junjie Huang, Zhenhua Dong, Ruiming Tang, Weinan Zhang. 
[ReLoop2: Building Self-Adaptive Recommendation Models via Responsive Error Compensation Loop](https://arxiv.org/abs/2306.08808). In KDD 2023. 20 | 21 | + **Check the md5sum for data integrity:** 22 | ```bash 23 | $ md5sum train.csv test.csv 24 | 936e6612714c887e76226a60829b4e0a train.csv 25 | 9417a18304fb62411ac27c26c5e0de56 test.csv 26 | ``` 27 | -------------------------------------------------------------------------------- /MicroVideo/MicroVideo1.7M_x1/convert_microvideo1.7m_x1.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import h5py 3 | import sys 4 | from collections import defaultdict 5 | import numpy as np 6 | import hashlib 7 | from sklearn.decomposition import PCA 8 | 9 | 10 | 11 | sequence_maxlen = 128 12 | 13 | train = pd.read_csv("train_data.csv", dtype=object) 14 | print("train.shape", train.shape) 15 | # print(train.columns) 16 | train = train.sort_values(by=["user_id", "timestamp"]).reset_index(drop=True) 17 | item_ID = sorted(list(train["item_id"].unique())) 18 | user_ID = sorted(list(train["user_id"].unique())) 19 | print("Number of users: ", len(user_ID)) 20 | print("Number of items: ", len(item_ID)) 21 | 22 | clicked_items_queue = defaultdict(list) 23 | clicked_categories_queue = defaultdict(list) 24 | clicked_items_list = [] 25 | clicked_categories_list = [] 26 | click_time = "" 27 | for idx, row in train.iterrows(): 28 | if idx % 10000 == 0: 29 | print("Processing {} lines".format(idx)) 30 | click_time = row['timestamp'] 31 | user_id = row["user_id"] 32 | item_id = row["item_id"] 33 | cate_id = row["cate_id"] 34 | click = row['is_click'] 35 | click_history = clicked_items_queue[user_id] 36 | if len(click_history) > sequence_maxlen: 37 | click_history = click_history[-sequence_maxlen:] 38 | clicked_items_queue[user_id] = click_history 39 | clicked_items_list.append("^".join(click_history)) 40 | category_history = clicked_categories_queue[user_id] 41 | if len(category_history) > sequence_maxlen: 42 | category_history = category_history[-sequence_maxlen:] 43 | clicked_categories_queue[user_id] = category_history 44 | clicked_categories_list.append("^".join(category_history)) 45 | if click == "1": 46 | clicked_items_queue[user_id].append(item_id) 47 | clicked_categories_queue[user_id].append(cate_id) 48 | 49 | train["clicked_items"] = clicked_items_list 50 | train["clicked_categories"] = clicked_categories_list 51 | train.to_csv("train.csv", index=False) 52 | 53 | test = pd.read_csv("test_data.csv", dtype=object) 54 | print("test.shape", test.shape) 55 | test = test.sort_values(by=["user_id", "timestamp"]).reset_index(drop=True) 56 | test["item_id"] = test["item_id"].map(lambda x: str(len(item_ID) + int(x))) # re-map item ids of test 57 | test_item_ID = sorted(list(test["item_id"].unique())) 58 | test_user_ID = sorted(list(train["user_id"].unique())) 59 | print("Number of users: ", len(test_user_ID)) 60 | print("Number of items: ", len(test_item_ID)) 61 | test["clicked_items"] = test["user_id"].map(lambda x: "^".join(clicked_items_queue[x][-sequence_maxlen:])) 62 | test["clicked_categories"] = test["user_id"].map(lambda x: "^".join(clicked_categories_queue[x][-sequence_maxlen:])) 63 | test.to_csv("test.csv", index=False) 64 | 65 | # Embedding dimension reduction via PCA 66 | train_emb = np.load("train_cover_image_feature.npy") 67 | test_emb = np.load("test_cover_image_feature.npy") 68 | item_emb = np.vstack([train_emb, test_emb]) 69 | pca = PCA(n_components=64) 70 | item_emb = pca.fit_transform(item_emb) 
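# Note (illustrative): `item_emb` stacks the reduced train embeddings first and the test
# embeddings after them. Because the test item ids were re-mapped above with an offset of
# len(item_ID), row i of `item_emb` is assumed to align with item id i when the embeddings
# are written out with keys range(len(item_emb)) below.
# Optionally, check how much variance the 64 retained PCA components explain:
# print("PCA explained variance:", pca.explained_variance_ratio_.sum())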
71 | print("item_emb.shape", item_emb.shape) 72 | 73 | with h5py.File("item_image_emb_dim64.h5", 'w') as hf: 74 | hf.create_dataset("key", data=list(range(len(item_emb)))) 75 | hf.create_dataset("value", data=item_emb) 76 | 77 | # Check md5sum for correctness 78 | assert("936e6612714c887e76226a60829b4e0a" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 79 | assert("9417a18304fb62411ac27c26c5e0de56" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 80 | 81 | print("Reproducing data succeeded!") 82 | -------------------------------------------------------------------------------- /MovieLens/Movielens1M_m1/README.md: -------------------------------------------------------------------------------- 1 | # Movielens1M_m1 2 | 3 | + **Dataset description:** 4 | 5 | The MovieLens-1M dataset contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users. We follow the LCF work to split and preprocess the data into training, validation, and test sets, respectively. 6 | 7 | + **Data format:** 8 | 9 | Each user corresponds to a list of interacted items: [[item1, item2], [item3, item4, item5], ...] 10 | 11 | + **Source:** https://grouplens.org/datasets/movielens/1m/ 12 | + **Download:** https://huggingface.co/datasets/reczoo/Movielens1M_m1/tree/main 13 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 14 | 15 | + **Used by papers:** 16 | - Wenhui Yu, Zheng Qin. [Graph Convolutional Network for Recommendation with Low-pass Collaborative Filters](https://arxiv.org/abs/2006.15516). In ICML 2020. 17 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filtering](https://arxiv.org/abs/2109.12613). In CIKM 2021. 18 | - Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. [UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation](https://arxiv.org/abs/2110.15114). In CIKM 2021. 19 | 20 | + **Check the md5sum for data integrity:** 21 | ```bash 22 | $ md5sum *.json 23 | cdd3ad819512cb87dad2f098c8437df2 test_data.json 24 | 4229bc5369f943918103daf7fd92e920 train_data.json 25 | 60be3b377d39806f80a43e37c94449f6 validation_data.json 26 | ``` 27 | -------------------------------------------------------------------------------- /MovieLens/MovielensLatest_x1/README.md: -------------------------------------------------------------------------------- 1 | # MovielensLatest_x1 2 | 3 | + **Dataset description:** 4 | 5 | The MovieLens dataset consists of users' tagging records on movies. The task is formulated as personalized tag recommendation with each tagging record (user_id, item_id, tag_id) as an data instance. The target value denotes whether the user has assigned a particular tag to the movie. Following the [AFN](https://ojs.aaai.org/index.php/AAAI/article/view/5768) work, we randomly split the data into 7:2:1 as the training set, validation set, and test set, respectively. 
6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | MovielensLatest_x1 | 2,006,859 | 1,404,801 | 401,373 | 200,686 | 12 | 13 | + **Source:** https://grouplens.org/datasets/movielens 14 | + **Download:** https://huggingface.co/datasets/reczoo/MovielensLatest_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Weiyu Cheng, Yanyan Shen, Linpeng Huang. [Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions](https://ojs.aaai.org/index.php/AAAI/article/view/5768). In AAAI 2020. 19 | - Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, Zhenhua Dong. [FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction](https://arxiv.org/abs/2304.00902). In AAAI 2023. 20 | - Jieming Zhu, Qinglin Jia, Guohao Cai, Quanyu Dai, Jingjie Li, Zhenhua Dong, Ruiming Tang, Rui Zhang. [FINAL: Factorized Interaction Layer for CTR Prediction](https://dl.acm.org/doi/10.1145/3539618.3591988). In SIGIR 2023. 21 | 22 | + **Check the md5sum for data integrity:** 23 | ```bash 24 | $ md5sum train.csv valid.csv test.csv 25 | efc8bceeaa0e895d566470fc99f3f271 train.csv 26 | e1930223a5026e910ed5a48687de8af1 valid.csv 27 | 54e8c6baff2e059fe067fb9b69e692d0 test.csv 28 | ``` 29 | -------------------------------------------------------------------------------- /MovieLens/MovielensLatest_x1/convert_movielenslatest_x1.py: -------------------------------------------------------------------------------- 1 | # Convert libsvm data from AFN [AAAI'2020] to csv format 2 | 3 | import pandas as pd 4 | from pathlib import Path 5 | import gc 6 | 7 | headers = ["label", "user_id", "item_id", "tag_id"] 8 | 9 | data_files = ["train.libsvm", "valid.libsvm", "test.libsvm"] 10 | for f in data_files: 11 | df = pd.read_csv(f, sep=" ", names=headers) 12 | for col in headers[1:]: 13 | df[col] = df[col].apply(lambda x: x.split(':')[0]) 14 | df.to_csv(Path(f).stem + ".csv", index=False) 15 | del df 16 | gc.collect() -------------------------------------------------------------------------------- /MovieLens/README.md: -------------------------------------------------------------------------------- 1 | # MovieLens 2 | 3 | + [MovielensLatest_x1](./MovielensLatest_x1) 4 | + [Movielens1M_m1](./Movielens1M_m1) 5 | 6 | The MovieLens datasets are collected by GroupLens Research from the MovieLens web site (https://movielens.org) where movie rating data are made available. The datasets have been widely used in various research on recommender systems. 
7 | 8 | MovieLens datasets https://grouplens.org/datasets/movielens 9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # RecZoo Datasets 2 | 3 | + [CTR Prediction](#ctr-prediction) 4 | + [Matching](#matching) 5 | + [Reranking](#reranking) 6 | + [Multimodal](#multimodal) 7 | + [Multitask](#multitask) 8 | + [Multidomain](#multidomain) 9 | 10 | 11 | ## CTR Prediction 12 | 13 | | Dataset | Dataset ID | Domain | Use Cases | Download | Leaderboard | 14 | |:-----------|:--------------------|:------------------------|:-------------------- |:---------------------:|:---------------------:| 15 | | [Criteo](https://github.com/reczoo/Datasets/tree/main/Criteo) | [Criteo_x1](https://github.com/reczoo/Datasets/tree/main/Criteo/Criteo_x1) | Ads | Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Criteo_x1/resolve/main/Criteo_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/criteo_x1.html) | 16 | | | [Criteo_x2](https://github.com/reczoo/Datasets/tree/main/Criteo/Criteo_x2) | Ads | Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Criteo_x2/resolve/main/Criteo_x2.zip?download=true) | 17 | | | [Criteo_x4](https://github.com/reczoo/Datasets/tree/main/Criteo/Criteo_x4) | Ads |Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Criteo_x4/resolve/main/Criteo_x4.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/criteo_x4.html) | 18 | | [Avazu](https://github.com/reczoo/Datasets/tree/main/Avazu) | [Avazu_x1](https://github.com/reczoo/Datasets/tree/main/Avazu/Avazu_x1) | Ads |Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Avazu_x1/resolve/main/Avazu_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/avazu_x1.html) | 19 | | | [Avazu_x2](https://github.com/reczoo/Datasets/tree/main/Avazu/Avazu_x2) | Ads |Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Avazu_x2/resolve/main/Avazu_x2.zip?download=true) | 20 | | | [Avazu_x4](https://github.com/reczoo/Datasets/tree/main/Avazu/Avazu_x4) | Ads |Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Avazu_x4/resolve/main/Avazu_x4.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/avazu_x4.html) | 21 | | [KKBox](https://github.com/reczoo/Datasets/tree/main/KKBox) | [KKBox_x1](https://github.com/reczoo/Datasets/tree/main/KKBox/KKBox_x1) | Music | Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/KKBox_x1/resolve/main/KKBox_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/kkbox_x1.html) | 22 | | [Frappe](https://github.com/reczoo/Datasets/tree/main/Frappe) | [Frappe_x1](https://github.com/reczoo/Datasets/tree/main/Frappe/Frappe_x1) | Apps | Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Frappe_x1/resolve/main/Frappe_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/frappe_x1.html) | 23 | | [MovieLens](https://github.com/reczoo/Datasets/tree/main/MovieLens) | [MovielensLatest_x1](https://github.com/reczoo/Datasets/tree/main/MovieLens/MovielensLatest_x1) | Movies | Feature interactions | 
[:link:](https://huggingface.co/datasets/reczoo/MovielensLatest_x1/resolve/main/MovielensLatest_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/movielenslatest_x1.html) | 24 | | [Taobao](https://github.com/reczoo/Datasets/tree/main/Taobao) | [TaobaoAd_x1](https://github.com/reczoo/Datasets/tree/main/Taobao/TaobaoAd_x1) | Ads | Sequential | [:link:](https://huggingface.co/datasets/reczoo/TaobaoAd_x1/resolve/main/TaobaoAd_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/taobaoad_x1.html) | 25 | | [Amazon](https://github.com/reczoo/Datasets/tree/main/Amazon) | [AmazonElectronics_x1](https://github.com/reczoo/Datasets/tree/main/Amazon/AmazonElectronics_x1) | Electronics | Sequential | [:link:](https://huggingface.co/datasets/reczoo/AmazonElectronics_x1/resolve/main/AmazonElectronics_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/amazonelectronics_x1.html) | 26 | | [iPinYou](https://github.com/reczoo/Datasets/tree/main/iPinYou) | [iPinYou_x1](https://github.com/reczoo/Datasets/tree/main/iPinYou/iPinYou_x1) | Ads | Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/iPinYou_x1/resolve/main/iPinYou_x1.zip?download=true) | 27 | | [MicroVideo](https://github.com/reczoo/Datasets/tree/main/MicroVideo) | [MicroVideo1.7M_x1](https://github.com/reczoo/Datasets/tree/main/MicroVideo/MicroVideo1.7M_x1) | MicroVideo | Sequential, multimodal | [:link:](https://huggingface.co/datasets/reczoo/MicroVideo1.7M_x1/resolve/main/MicroVideo1.7M_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/microvideo1.7m_x1.html) | 28 | | [KuaiShou](https://github.com/reczoo/Datasets/tree/main/KuaiShou) | [KuaiVideo_x1](https://github.com/reczoo/Datasets/tree/main/KuaiShou/KuaiVideo_x1) | MicroVideo | Sequential, multimodal | [:link:](https://huggingface.co/datasets/reczoo/KuaiVideo_x1/resolve/main/KuaiVideo_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/kuaivideo_x1.html) | 29 | | [MIND](https://github.com/reczoo/Datasets/tree/main/MIND) | [MIND_small_x1](https://github.com/reczoo/Datasets/tree/main/MIND/MIND_small_x1) | News | Sequential, pretraining | [:link:](https://huggingface.co/datasets/reczoo/MIND_small_x1/resolve/main/MIND_small_x1.zip?download=true) | 30 | 31 | 32 | ## Matching 33 | 34 | | Dataset | Dataset ID | Domain | Use Cases | Download | Leaderboard | 35 | |:-------------------|:----------------------|:-----------------|:-------------|:----------------------:|:----------------------:| 36 | | [Amazon](https://github.com/reczoo/Datasets/tree/main/Amazon) | [AmazonBooks_m1](https://github.com/reczoo/Datasets/tree/main/Amazon/AmazonBooks_m1) | Books | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/AmazonBooks_m1/tree/main) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/Matching/leaderboard/amazonbooks_m1.html) | 37 | | | [AmazonCDs_m1](https://github.com/reczoo/Datasets/tree/main/Amazon/AmazonCDs_m1) | CDs | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/AmazonCDs_m1/tree/main) | 38 | | | [AmazonMovies_m1](https://github.com/reczoo/Datasets/tree/main/Amazon/AmazonMovies_m1) | Movies | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/AmazonMovies_m1/tree/main) | 39 | | | [AmazonBeauty_m1](https://github.com/reczoo/Datasets/tree/main/Amazon/AmazonBeauty_m1) | Beauty | CF, GNN | 
[:link:](https://huggingface.co/datasets/reczoo/AmazonBeauty_m1/tree/main) | 40 | | | [AmazonElectronics_m1](https://github.com/reczoo/Datasets/tree/main/Amazon/AmazonElectronics_m1) | Electronics | CF | [:link:](https://huggingface.co/datasets/reczoo/AmazonElectronics_m1/tree/main) | 41 | | [MovieLens](https://github.com/reczoo/Datasets/tree/main/MovieLens) | [MovieLens1M_m1](https://github.com/reczoo/Datasets/tree/main/Amazon/MovieLens1M_m1) | Movies | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/MovieLens1M_m1/tree/main) | 42 | | [Yelp](https://github.com/reczoo/Datasets/tree/main/Yelp) | [Yelp18_m1](https://github.com/reczoo/Datasets/tree/main/Yelp/Yelp18_m1) | Restaurants | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/Yelp18_m1/tree/main) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/Matching/leaderboard/yelp18_m1.html) | 43 | | [Gowalla](https://github.com/reczoo/Datasets/tree/main/Gowalla) | [Gowalla_m1](https://github.com/reczoo/Datasets/tree/main/Gowalla/Gowalla_m1) | POIs | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/Gowalla_m1/tree/main) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/Matching/leaderboard/gowalla_m1.html) | 44 | | [CiteULike](https://github.com/reczoo/Datasets/tree/main/CiteULike) | [CiteUlikeA_m1](https://github.com/reczoo/Datasets/tree/main/CiteULike/CiteUlikeA_m1) | Citation | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/CiteUlikeA_m1/tree/main) | 45 | 46 | 47 | ## Reranking 48 | TODO 49 | 50 | ## Multimodal 51 | 52 | | Dataset | Dataset ID | Domain | Use Cases | Download | Leaderboard | 53 | |:-----------|:--------------------|:------------------------|:-------------------- |:---------------------:|:---------------------:| 54 | | [MicroVideo](https://github.com/reczoo/Datasets/tree/main/MicroVideo) | [MicroVideo1.7M_x1](https://github.com/reczoo/Datasets/tree/main/MicroVideo/MicroVideo1.7M_x1) | MicroVideo | Sequential, multimodal | [:link:](https://huggingface.co/datasets/reczoo/MicroVideo1.7M_x1/resolve/main/MicroVideo1.7M_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/microvideo1.7m_x1.html) | 55 | | [KuaiShou](https://github.com/reczoo/Datasets/tree/main/KuaiShou) | [KuaiVideo_x1](https://github.com/reczoo/Datasets/tree/main/KuaiShou/KuaiVideo_x1) | MicroVideo | Sequential, multimodal | [:link:](https://huggingface.co/datasets/reczoo/KuaiVideo_x1/resolve/main/KuaiVideo_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/kuaivideo_x1.html) | 56 | 57 | 58 | ## Multitask 59 | TODO 60 | 61 | ## Multidomain 62 | TODO 63 | 64 | -------------------------------------------------------------------------------- /Taobao/TaobaoAd_x1/README.md: -------------------------------------------------------------------------------- 1 | # TaobaoAd_x1 2 | 3 | + **Dataset description:** 4 | 5 | Taobao is a dataset provided by Alibaba, which contains 8 days of ad click-through data (26 million records) that are randomly sampled from 1140000 users. By default, the first 7 days (i.e., 20170506-20170512) of samples are used as training samples, and the last day's samples (i.e., 20170513) are used as test samples. Meanwhile, the dataset also covers the shopping behavior of all users in the recent 22 days, including totally seven hundred million records. We follow the preprocessing steps that have been applied to [reproducing the DMR work](https://aistudio.baidu.com/aistudio/projectdetail/1805731). 
We note that a small portion (~5%) of the samples has been dropped during preprocessing due to missing user or item profiles. In this setting, we filter infrequent categorical features with the threshold min_category_count=10. We further set the maximal length of user behavior sequence to 50. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | TaobaoAd_x1 | 25,029,426 | 21,929,911 | | 3,099,515 | 12 | 13 | + **Data format:** 14 | + user: User ID (int); 15 | + time_stamp: time stamp (Bigint, 1494032110 stands for 2017-05-06 08:55:10); 16 | + adgroup_id: adgroup ID (int); 17 | + pid: scenario; 18 | + noclk: 1 for not click, 0 for click; 19 | + clk: 1 for click, 0 for not click; 20 | 21 | ad_feature: 22 | + adgroup_id: Ad ID (int); 23 | + cate_id: category ID; 24 | + campaign_id: campaign ID; 25 | + brand: brand ID; 26 | + customer_id: Advertiser ID; 27 | + price: the price of the item 28 | 29 | user_profile: 30 | + userid: user ID; 31 | + cms_segid: Micro group ID; 32 | + cms_group_id: cms group_id; 33 | + final_gender_code: gender, 1 for male, 2 for female 34 | + age_level: age_level 35 | + pvalue_level: Consumption grade, 1: low, 2: mid, 3: high 36 | + shopping_level: Shopping depth, 1: shallow user, 2: moderate user, 3: deep user 37 | + occupation: whether the user is a college student, 1: yes, 0: no 38 | + new_user_class_level: City level 39 | 40 | raw_behavior_log: 41 | + nick: User ID (int); 42 | + time_stamp: time stamp (Bigint, 1494032110 stands for 2017-05-06 08:55:10); 43 | + btag: type of behavior, including ipv/cart/fav/buy; 44 | + cate: category ID (int); 45 | + brand: brand ID (int); 46 | 47 | + **Source:** https://tianchi.aliyun.com/dataset/dataDetail?dataId=56 48 | + **Download:** https://huggingface.co/datasets/reczoo/TaobaoAd_x1/tree/main 49 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 50 | 51 | + **Used by papers:** 52 | - Ze Lyu, Yu Dong, Chengfu Huo, Weijun Ren. [Deep Match to Rank Model for Personalized Click-Through Rate Prediction](https://ojs.aaai.org/index.php/AAAI/article/view/5346). In AAAI 2020. 53 | 54 | + **Check the md5sum for data integrity:** 55 | ```bash 56 | $ md5sum train.csv test.csv 57 | eaabfc8629f23519b04593e26c7522fc train.csv 58 | f5ae6197e52385496d46e2867c1c8da1 test.csv 59 | ``` 60 | -------------------------------------------------------------------------------- /Taobao/TaobaoAd_x1/convert_taobaoad_x1.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | The raw dataset is available at https://tianchi.aliyun.com/dataset/56 4 | The preprocessed dataset is used by the following work: 5 | Lyu et al., Deep Match to Rank Model for Personalized Click-Through Rate Prediction, AAAI 2020. 6 | The preprocessing steps follow the scripts at https://aistudio.baidu.com/aistudio/projectdetail/1805731 7 | The required data `dataset_full.zip` can be downloaded at https://aistudio.baidu.com/aistudio/datasetdetail/81892 8 | However, we note that the ID mapping of categorical features in `dataset_full.zip` has a known bug. 9 | Please refer to https://github.com/PaddlePaddle/PaddleRec/issues/821 10 | Thus, we suggest re-mapping the categorical IDs to new indices when using this dataset.
11 | """ 12 | 13 | import pandas as pd 14 | import hashlib 15 | 16 | 17 | train_path = "./work/train_sorted.csv" 18 | test_path = "./work/test.csv" 19 | 20 | data_parts = ["train", "test"] 21 | for part in data_parts: 22 | data_df = pd.read_csv(eval(part + '_path'), header=None, dtype=object) 23 | data_df.fillna("0", inplace=True) 24 | part_df = pd.DataFrame() 25 | part_df["clk"] = data_df.iloc[:, 266] 26 | part_df["btag_his"] = ["^".join(filter(lambda k: k != "0", x.tolist())) for x in data_df.iloc[:, 0:50].values] 27 | part_df["cate_his"] = ["^".join(filter(lambda k: k != "0", x.tolist())) for x in data_df.iloc[:, 50:100].values] 28 | part_df["brand_his"] = ["^".join(filter(lambda k: k != "0", x.tolist())) for x in data_df.iloc[:, 100:150].values] 29 | part_df["userid"] = data_df.iloc[:, 250] 30 | part_df["cms_segid"] = data_df.iloc[:, 251] 31 | part_df["cms_group_id"] = data_df.iloc[:, 252] 32 | part_df["final_gender_code"] = data_df.iloc[:, 253] 33 | part_df["age_level"] = data_df.iloc[:, 254] 34 | part_df["pvalue_level"] = data_df.iloc[:, 255] 35 | part_df["shopping_level"] = data_df.iloc[:, 256] 36 | part_df["occupation"] = data_df.iloc[:, 257] 37 | part_df["new_user_class_level"] = data_df.iloc[:, 258] 38 | part_df["adgroup_id"] = data_df.iloc[:, 259] 39 | part_df["cate_id"] = data_df.iloc[:, 260] 40 | part_df["campaign_id"] = data_df.iloc[:, 261] 41 | part_df["customer"] = data_df.iloc[:, 262] 42 | part_df["brand"] = data_df.iloc[:, 263] 43 | part_df["price"] = data_df.iloc[:, 264] 44 | part_df["pid"] = data_df.iloc[:, 265] 45 | part_df["btag"] = [1] * len(data_df) 46 | part_df.to_csv(part + ".csv", index=False) 47 | 48 | # Check md5sum for correctness 49 | assert("eaabfc8629f23519b04593e26c7522fc" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 50 | assert("f5ae6197e52385496d46e2867c1c8da1" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 51 | 52 | print("Reproducing data succeeded!") 53 | -------------------------------------------------------------------------------- /Yelp/Yelp18_m1/README.md: -------------------------------------------------------------------------------- 1 | # Yelp18_m1 2 | 3 | + **Dataset description:** 4 | 5 | The data statistics are summarized as follows: 6 | 7 | | Dataset ID | #Users | #Items | #Interactions | #Train | #Test | Density | 8 | | :-------: | :----: | :----: | :-----------: | :-------: | :-----: | :-----: | 9 | | Yelp18_m1 | 31,668 | 38,048 | 1,561,406 | 1,237,259 | 324,147 | 0.00130 | 10 | 11 | 12 | + **Data format:** 13 | user_id item1 item2 ... 14 | 15 | + **Source:** https://www.yelp.com/dataset 16 | + **Download:** https://huggingface.co/datasets/reczoo/Yelp18_m1/tree/main 17 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 18 | 19 | + **Used by papers:** 20 | - Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, Meng Wang. [LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation](https://arxiv.org/abs/2002.02126). In SIGIR 2020. 21 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filtering](https://arxiv.org/abs/2109.12613). In CIKM 2021. 22 | - Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. [UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation](https://arxiv.org/abs/2110.15114). In CIKM 2021. 
23 | 24 | + **Check the md5sum for data integrity:** 25 | ```bash 26 | $ md5sum *.txt 27 | 520fe559761ff2c654629201c807f353 item_list.txt 28 | 0d57d7399862c32152b045ec5d2698e7 test.txt 29 | 1b8b5d22a227e01d6de002c53d32b4c4 train.txt 30 | ae4f810cd6e827f10fc418753c7d92f9 user_list.txt 31 | ``` 32 | -------------------------------------------------------------------------------- /iFlytek/iFlyteckAds_x1/convert_iFlyteckAds_x1.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from collections import Counter 3 | import random 4 | 5 | train_file = "train_data.txt" 6 | test_file = "test_data.txt" 7 | 8 | columns = ['label'] + ["f{}".format(i + 1) for i in range(245)] 9 | train_data = pd.read_csv(train_file, header=None, names=columns, dtype=object, memory_map=True) 10 | train_data.drop([1613993, 1895025], axis=0, inplace=True) # drop abnormal rows 11 | num_train = len(train_data) 12 | test_data = pd.read_csv(test_file, header=None, names=columns[1:], dtype=object, memory_map=True) 13 | test_data[columns[0]] = [0] * len(test_data) 14 | num_test = len(test_data) 15 | all_data = pd.concat([train_data, test_data], sort=False).reset_index(drop=True) 16 | all_data["label"] = all_data["label"].astype(int) 17 | print("num_train, num_test:", num_train, num_test) 18 | 19 | for col in ["f1", "f2", "f3", "f4"]: 20 | col_dict = Counter(all_data[col]) 21 | vocab = dict(zip(col_dict.keys(), range(len(col_dict)))) 22 | all_data[col] = all_data[col].map(lambda x: vocab[x]) 23 | print(all_data.head()) 24 | 25 | train_df = all_data.iloc[0:num_train, :].reset_index(drop=True) 26 | test_df = all_data.iloc[num_train:, :] 27 | 28 | def get_statistics_by_slotid(df, fout): 29 | slotid_impressions = df.groupby(['f1'])['f1'].count() 30 | slotid_clicks = df.groupby(['f1'])['label'].sum() 31 | slotid_impressions.to_csv(fout + "_slotid_impressions.csv") 32 | slotid_clicks.to_csv(fout + "_slotid_clicks.csv") 33 | 34 | get_statistics_by_slotid(train_df, "train_all") 35 | get_statistics_by_slotid(test_df, "test") 36 | 37 | # train-validation-test splitting 38 | sample_index = list(range(num_train)) 39 | random.seed(2022) 40 | random.shuffle(sample_index) 41 | train_index = sample_index[0:int(num_train * 0.9)] 42 | valid_index = sample_index[int(num_train * 0.9):] 43 | valid_df = train_df.iloc[valid_index, :] 44 | train_df = train_df.iloc[train_index, :] 45 | train_df.to_csv("train.csv", index=False) 46 | valid_df.to_csv("valid.csv", index=False) 47 | test_df.to_csv("test.csv", index=False) 48 | print("train:valid:test samples:", len(train_df), len(valid_df), len(test_df)) 49 | 50 | get_statistics_by_slotid(train_df, "train") 51 | get_statistics_by_slotid(valid_df, "valid") 52 | print("All done.") 53 | -------------------------------------------------------------------------------- /iPinYou/iPinYou_x1/README.md: -------------------------------------------------------------------------------- 1 | # iPinYou_x1 2 | 3 | + **Dataset description:** 4 | 5 | The iPinYou Global Real-Time Bidding Algorithm Competition was organized by iPinYou from April 1st, 2013 to December 31st, 2013. The competition has been divided into three seasons. For each season, a training dataset is released to the competition participants, while the testing dataset is reserved by iPinYou.
The complete testing dataset is randomly divided into two parts: one part is the leaderboard testing dataset, used to score and rank the participating teams on the leaderboard, and the other part is reserved for the final offline evaluation. Each participant's last offline submission is evaluated on the reserved testing dataset to obtain the team's final offline score. This dataset contains the training datasets and leaderboard testing datasets of all three seasons. The reserved testing datasets are withheld by iPinYou. The training dataset includes a set of processed iPinYou DSP bidding, impression, click, and conversion logs. 6 | 7 | + **Source:** https://contest.ipinyou.com/ 8 | + **Download:** https://huggingface.co/datasets/reczoo/iPinYou_x1/tree/main 9 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 10 | 11 | + **Used by papers:** 12 | - Bin Liu, Niannan Xue, Huifeng Guo, Ruiming Tang, Stefanos Zafeiriou, Xiuqiang He, Zhenguo Li. [AutoGroup: Automatic Feature Grouping for Modelling Explicit High-Order Feature Interactions in CTR Prediction](https://dl.acm.org/doi/abs/10.1145/3397271.3401082). In SIGIR 2020. 13 | 14 | + **Check the md5sum for data integrity:** 15 | ```bash 16 | $ md5sum *.csv 17 | a94374868687794ff8c0c4d0b124a400 test.csv 18 | 9dd8979d265ab1ed7662ffd49fd73247 train.csv 19 | ``` 20 | -------------------------------------------------------------------------------- /tracking.md: -------------------------------------------------------------------------------- 1 | # Tracking Records 2 | 3 | We track dataset splits from the published papers in order to make the research results reproducible and reusable. We directly reuse the data splits or preprocessing steps if a paper has made the details publicly available. If not, we request the data splits by sending emails to the authors. 4 | 5 | - :question: The authors have not responded to our request for the data splits needed for reproduction. 6 | - :x: The data splits cannot be reproduced. 7 | - :heart: We hope everyone can join us in building reusable dataset splits! 8 | 9 | | Dataset Splits | Paper Title | 10 | |:-----------|:--------------------| 11 | | [Criteo_x1](https://github.com/reczoo/Datasets/tree/main/Criteo#Criteo_x1), [Avazu_x1](https://github.com/reczoo/Datasets/tree/main/Avazu#Avazu_x1), [Frappe_x1](https://github.com/reczoo/Datasets/tree/main/Frappe#frappe_x1), [MovielensLatest_x1](https://github.com/reczoo/Datasets/tree/main/MovieLens#movielenslatest_x1) | [**AAAI'20**] [Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions](https://ojs.aaai.org/index.php/AAAI/article/view/5768), Weiyu Cheng, Yanyan Shen, Linpeng Huang. | 12 | | [Criteo_x4](https://github.com/reczoo/Datasets/tree/main/Criteo#Criteo_x4), [Avazu_x4](https://github.com/reczoo/Datasets/tree/main/Avazu#Avazu_x4) | [**CIKM'19**] [AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/abs/1810.11921), Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, Jian Tang. | 13 | --------------------------------------------------------------------------------