├── .gitignore ├── Amazon ├── AmazonBeauty_m1 │ └── README.md ├── AmazonBooks_m1 │ └── README.md ├── AmazonCDs_m1 │ └── README.md ├── AmazonElectronics_m1 │ └── README.md ├── AmazonElectronics_x1 │ ├── README.md │ └── convert_amazonelectronics_x1.py ├── AmazonMovies_m1 │ └── README.md └── README.md ├── Avazu ├── Avazu_x1 │ ├── README.md │ ├── convert_avazu_x1.py │ └── download_avazu_x1.py ├── Avazu_x2 │ └── README.md ├── Avazu_x4 │ ├── README.md │ └── convert_avazu_x4.py └── README.md ├── CiteULike └── CiteUlikeA_m1 │ └── README.md ├── Criteo ├── Criteo_x1 │ ├── README.md │ ├── convert_criteo_x1.py │ └── download_criteo_x1.py ├── Criteo_x2 │ └── README.md ├── Criteo_x4 │ ├── README.md │ └── convert_criteo_x4.py └── README.md ├── Frappe ├── Frappe_x1 │ ├── README.md │ └── convert_frappe_x1.py └── README.md ├── Gowalla └── Gowalla_m1 │ └── README.md ├── KKBox ├── KKBox_x1 │ └── README.md └── README.md ├── KuaiShou ├── KuaiVideo_x1 │ ├── README.md │ └── convert_kuaivideo_x1.py └── README.md ├── MIND ├── MIND_large_x1 │ ├── README.md │ └── convert_MIND_large_x1.py └── MIND_small_x1 │ ├── README.md │ └── convert_MIND_small_x1.py ├── MicroVideo └── MicroVideo1.7M_x1 │ ├── README.md │ └── convert_microvideo1.7m_x1.py ├── MovieLens ├── Movielens1M_m1 │ └── README.md ├── MovielensLatest_x1 │ ├── README.md │ └── convert_movielenslatest_x1.py └── README.md ├── README.md ├── Taobao └── TaobaoAd_x1 │ ├── README.md │ └── convert_taobaoad_x1.py ├── Yelp └── Yelp18_m1 │ └── README.md ├── iFlytek └── iFlyteckAds_x1 │ └── convert_iFlyteckAds_x1.py ├── iPinYou └── iPinYou_x1 │ └── README.md └── tracking.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | local_settings.py 56 | db.sqlite3 57 | 58 | # Flask stuff: 59 | instance/ 60 | .webassets-cache 61 | 62 | # Scrapy stuff: 63 | .scrapy 64 | 65 | # Sphinx documentation 66 | docs/_build/ 67 | 68 | # PyBuilder 69 | target/ 70 | 71 | # Jupyter Notebook 72 | .ipynb_checkpoints 73 | 74 | # pyenv 75 | .python-version 76 | 77 | # celery beat schedule file 78 | celerybeat-schedule 79 | 80 | # SageMath parsed files 81 | *.sage.py 82 | 83 | # Environments 84 | .env 85 | .venv 86 | env/ 87 | venv/ 88 | ENV/ 89 | env.bak/ 90 | venv.bak/ 91 | 92 | # Spyder project settings 93 | .spyderproject 94 | .spyproject 95 | 96 | # Rope project settings 97 | .ropeproject 98 | 99 | # mkdocs documentation 100 | /site 101 | 102 | # mypy 103 | .mypy_cache/ 104 | .ipynb_checkpoints 105 | .DS_Store 106 | _build 107 | -------------------------------------------------------------------------------- /Amazon/AmazonBeauty_m1/README.md: -------------------------------------------------------------------------------- 1 | # AmazonBeauty_m1 2 | 3 | + **Data format:** 4 | user_id item1 item2 ... 5 | 6 | + **Source:** https://cseweb.ucsd.edu/~jmcauley/datasets.html 7 | + **Download:** https://huggingface.co/datasets/reczoo/AmazonBeauty_m1/tree/main 8 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 9 | 10 | + **Used by papers:** 11 | - Jianing Sun, Wei Guo, Dengcheng Zhang, Yingxue Zhang, Florence Regol, Yaochen Hu, Huifeng Guo, Ruiming Tang, Han Yuan, Xiuqiang He, Mark Coates. [A Framework for Recommending Accurate and Diverse Items Using Bayesian Graph Convolutional Neural Networks](https://hyclex.github.io/papers/paper_sun2019BGCN.pdf). In KDD 2020. 12 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filterin](https://arxiv.org/abs/2109.12613). In CIKM 2021. 13 | 14 | + **Check the md5sum for data integrity:** 15 | ```bash 16 | $ md5sum *.txt 17 | 66fb687136d55b51742905ece189da31 test.txt 18 | 53cc9d39bc79f13c9bd3e75bd5121d1d train.txt 19 | ``` 20 | -------------------------------------------------------------------------------- /Amazon/AmazonBooks_m1/README.md: -------------------------------------------------------------------------------- 1 | # AmazonBooks_m1 2 | 3 | + **Dataset description:** 4 | 5 | The data statistics are summarized as follows: 6 | 7 | | Dataset ID | #Users | #Items | #Interactions | #Train | #Test | Density | 8 | |:--------------:|:------:|:------:|:-------------:|:---------:|:-------:|:-------:| 9 | | AmazonBooks_m1 | 52,643 | 91,599 | 2,984,108 | 2,380,730 | 603,378 | 0.00062 | 10 | 11 | 12 | + **Data format:** 13 | user_id item1 item2 ... 14 | 15 | + **Source:** https://cseweb.ucsd.edu/~jmcauley/datasets.html 16 | + **Download:** https://huggingface.co/datasets/reczoo/AmazonBooks_m1/tree/main 17 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 18 | 19 | + **Used by papers:** 20 | - Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, Meng Wang. [LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation](https://arxiv.org/abs/2002.02126). In SIGIR 2020. 
21 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filtering](https://arxiv.org/abs/2109.12613). In CIKM 2021. 22 | - Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. [UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation](https://arxiv.org/abs/2110.15114). In CIKM 2021. 23 | 24 | + **Check the md5sum for data integrity:** 25 | ```bash 26 | $ md5sum *.txt 27 | 5b1125ef3bf4118a7988f1fd8ce52ef9 item_list.txt 28 | 30f8ccfea18d25007ba9fb9aba4e174d test.txt 29 | c916ecac04ca72300a016228258b41ed train.txt 30 | 132f8a5d6d35d5fdde1e0396488be235 user_list.txt 31 | ``` 32 | -------------------------------------------------------------------------------- /Amazon/AmazonCDs_m1/README.md: -------------------------------------------------------------------------------- 1 | # AmazonCDs_m1 2 | 3 | + **Data format:** 4 | user_id item1 item2 ... 5 | 6 | + **Source:** https://cseweb.ucsd.edu/~jmcauley/datasets.html 7 | + **Download:** https://huggingface.co/datasets/reczoo/AmazonCDs_m1/tree/main 8 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 9 | 10 | + **Used by papers:** 11 | - Jianing Sun, Wei Guo, Dengcheng Zhang, Yingxue Zhang, Florence Regol, Yaochen Hu, Huifeng Guo, Ruiming Tang, Han Yuan, Xiuqiang He, Mark Coates. [A Framework for Recommending Accurate and Diverse Items Using Bayesian Graph Convolutional Neural Networks](https://hyclex.github.io/papers/paper_sun2019BGCN.pdf). In KDD 2020. 12 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filtering](https://arxiv.org/abs/2109.12613). In CIKM 2021. 13 | - Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. [UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation](https://arxiv.org/abs/2110.15114). In CIKM 2021. 14 | 15 | + **Check the md5sum for data integrity:** 16 | ```bash 17 | $ md5sum *.txt 18 | d29acb66d0fb74bc3bc0791cbbce5cf2 test.txt 19 | 2df6a35cac4373cf3eef95f75568da0a train.txt 20 | ``` 21 | -------------------------------------------------------------------------------- /Amazon/AmazonElectronics_m1/README.md: -------------------------------------------------------------------------------- 1 | # AmazonElectronics_m1 2 | 3 | + **Data format:** 4 | 5 | Each user corresponds to a list of interacted items: [[item1, item2], [item3, item4, item5], ...] 6 | 7 | + **Source:** https://cseweb.ucsd.edu/~jmcauley/datasets.html 8 | + **Download:** https://huggingface.co/datasets/reczoo/AmazonElectronics_m1/tree/main 9 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 10 | 11 | + **Used by papers:** 12 | - Wenhui Yu, Zheng Qin. [Sampler Design for Implicit Feedback Data by Noisy-label Robust Learning](https://arxiv.org/abs/2007.07204). In SIGIR 2020. 13 | - Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. [UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation](https://arxiv.org/abs/2110.15114). In CIKM 2021. 
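A minimal loading sketch for the JSON splits listed in the md5sum block below. It assumes each `*_data.json` file is a list whose u-th entry holds the interacted item list of user u, matching the data format described above; the exact nesting is not documented here, so treat this as illustrative only.

```python
import json

# Minimal sketch (structure assumed, see note above): load per-user interaction lists.
with open("train_data.json", "r") as fin:
    train_data = json.load(fin)

# Flatten into (user_index, item_id) pairs, e.g. for building an interaction matrix.
pairs = [(u, item) for u, items in enumerate(train_data) for item in items]
print(len(train_data), "users,", len(pairs), "interactions")
```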
14 | 15 | + **Check the md5sum for data integrity:** 16 | ```bash 17 | $ md5sum *.json 18 | 7a0fa5d0da5dc5d5008da02b554ef688 test_data.json 19 | ca71f3f5b9ada393ffd5490eba84c7db train_data.json 20 | 7f2db9b5b0de91c7d757ed6ed6095a5a validation_data.json 21 | ``` 22 | -------------------------------------------------------------------------------- /Amazon/AmazonElectronics_x1/README.md: -------------------------------------------------------------------------------- 1 | # AmazonElectronics_x1 2 | 3 | + **Dataset description:** 4 | 5 | The [Amazon dataset](http://jmcauley.ucsd.edu/data/amazon) contains product reviews and metadata from Amazon, which is a widely-used benchmark dataset. We use the preprocessed subset named Amazon-Electronics from the [DIN](https://arxiv.org/abs/1706.06978) work. It contains 192,403 users, 63,001 goods, 801 categories and 1,689,188 samples. User behaviors in this dataset are rich, with more than 5 reviews for each user and goods. Features include goods_id, cate_id, user reviewed goods_id_list and cate_id_list. Following DIN, the task is to predict the probability of reviewing the (k+1)-th goods by making use of the first k reviewed goods. The last item of each behavior sequence is reserved for testing. 6 | 7 | The dataset statistics are summarized as follows. 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | AmazonElectronics_x1 | 2,993,570 | 2,608,764 | | 384,806 | 12 | 13 | + **Data format:** 14 | label, user_id, item_id, cate_id, item_history, cate_history 15 | 16 | + **Source:** https://cseweb.ucsd.edu/~jmcauley/datasets.html 17 | + **Download:** https://huggingface.co/datasets/reczoo/AmazonElectronics_x1/tree/main 18 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 19 | 20 | + **Used by papers:** 21 | - Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, Kun Gai. [Deep Interest Network for Click-Through Rate Prediction](https://arxiv.org/abs/1706.06978). In KDD 2018. 22 | - Jieming Zhu, Guohao Cai, Junjie Huang, Zhenhua Dong, Ruiming Tang, Weinan Zhang. [ReLoop2: Building Self-Adaptive Recommendation Models via Responsive Error Compensation Loop](https://arxiv.org/abs/2306.08808). In KDD 2023. 
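The `item_history` and `cate_history` columns in these CSV files are written as `^`-joined id strings by the conversion script below. A minimal pandas sketch for reading a split and recovering the behavior sequences as lists of integer ids:

```python
import pandas as pd

# Minimal sketch: split the "^"-joined behavior sequences back into id lists.
df = pd.read_csv("test.csv")
for col in ["item_history", "cate_history"]:
    df[col] = df[col].fillna("").map(
        lambda s: [int(i) for i in s.split("^")] if s else [])

print(df[["label", "user_id", "item_id", "cate_id"]].head())
print(df["item_history"].map(len).describe())  # behavior sequence length statistics
```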
23 | 24 | + **Check the md5sum for data integrity:** 25 | ```bash 26 | $ md5sum *.csv 27 | 57a20e82fe736dd495f2eaf0669bf6d0 test.csv 28 | e9bf80b92985e463db18fdc753d347b5 train.csv 29 | ``` 30 | -------------------------------------------------------------------------------- /Amazon/AmazonElectronics_x1/convert_amazonelectronics_x1.py: -------------------------------------------------------------------------------- 1 | """ Convert AmazonElectronics dataset used by the DIN paper from pickle file to csv file 2 | Run the following cat command to get `dataset.pkl` 3 | cat aa ab ac > dataset.pkl 4 | after downloading from https://github.com/zhougr1993/DeepInterestNetwork/tree/master/din 5 | """ 6 | 7 | import pickle 8 | import pandas as pd 9 | import hashlib 10 | 11 | 12 | with open('dataset.pkl', 'rb') as f: 13 | train_set = pickle.load(f, encoding='bytes') 14 | test_set = pickle.load(f, encoding='bytes') 15 | cate_list = pickle.load(f, encoding='bytes') 16 | user_count, item_count, cate_count = pickle.load(f, encoding='bytes') 17 | 18 | train_data = [] 19 | for sample in train_set: 20 | user_id = sample[0] 21 | item_id = sample[2] 22 | item_history = "^".join([str(i) for i in sample[1]]) 23 | label = sample[3] 24 | cate_id = cate_list[item_id] 25 | cate_history = "^".join([str(i) for i in cate_list[sample[1]]]) 26 | train_data.append([label, user_id, item_id, cate_id, item_history, cate_history]) 27 | train_df = pd.DataFrame(train_data, columns=['label', 'user_id', 'item_id', 'cate_id', 'item_history', 'cate_history']) 28 | train_df.to_csv("train.csv", index=False) 29 | 30 | test_data = [] 31 | for sample in test_set: 32 | user_id = sample[0] 33 | item_pair = sample[2] 34 | item_history = "^".join([str(i) for i in sample[1]]) 35 | cate_history = "^".join([str(i) for i in cate_list[sample[1]]]) 36 | test_data.append([1, user_id, item_pair[0], cate_list[item_pair[0]], item_history, cate_history]) 37 | test_data.append([0, user_id, item_pair[1], cate_list[item_pair[1]], item_history, cate_history]) 38 | test_df = pd.DataFrame(test_data, columns=['label', 'user_id', 'item_id', 'cate_id', 'item_history', 'cate_history']) 39 | test_df.to_csv("test.csv", index=False) 40 | 41 | # Check md5sum for correctness 42 | assert("e9bf80b92985e463db18fdc753d347b5" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 43 | assert("57a20e82fe736dd495f2eaf0669bf6d0" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 44 | 45 | print("Reproducing data succeeded!") 46 | -------------------------------------------------------------------------------- /Amazon/AmazonMovies_m1/README.md: -------------------------------------------------------------------------------- 1 | # AmazonMovies_m1 2 | 3 | + **Data format:** 4 | user_id item1 item2 ... 5 | 6 | + **Source:** https://cseweb.ucsd.edu/~jmcauley/datasets.html 7 | + **Download:** https://huggingface.co/datasets/reczoo/AmazonMovies_m1/tree/main 8 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 9 | 10 | + **Used by papers:** 11 | - Jianing Sun, Wei Guo, Dengcheng Zhang, Yingxue Zhang, Florence Regol, Yaochen Hu, Huifeng Guo, Ruiming Tang, Han Yuan, Xiuqiang He, Mark Coates. [A Framework for Recommending Accurate and Diverse Items Using Bayesian Graph Convolutional Neural Networks](https://hyclex.github.io/papers/paper_sun2019BGCN.pdf). In KDD 2020. 12 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. 
[SimpleX: A Simple and Strong Baseline for Collaborative Filterin](https://arxiv.org/abs/2109.12613). In CIKM 2021. 13 | 14 | + **Check the md5sum for data integrity:** 15 | ```bash 16 | $ md5sum *.txt 17 | c02e5f6579aa51950aa875c462a0204b test.txt 18 | 3e9d30eacd30330a9feaa0fdb17760ba train.txt 19 | ``` 20 | -------------------------------------------------------------------------------- /Amazon/README.md: -------------------------------------------------------------------------------- 1 | # Amazon 2 | 3 | + [AmazonBeauty_m1](./AmazonBeauty_m1/README.md) 4 | + [AmazonBooks_m1](./AmazonBooks_m1/README.md) 5 | + [AmazonCDs_m1](./AmazonCDs_m1/README.md) 6 | + [AmazonElectronics_x1](./AmazonElectronics_x1/README.md) 7 | -------------------------------------------------------------------------------- /Avazu/Avazu_x1/README.md: -------------------------------------------------------------------------------- 1 | # Avazu_x1 2 | 3 | + **Dataset description:** 4 | 5 | This dataset contains about 10 days of labeled click-through data on mobile advertisements. It has 22 feature fields including user features and advertisement attributes. As with the [AFN](https://ojs.aaai.org/index.php/AAAI/article/view/5768) work, the data are randomly split into 7:1:2 as the training set, validation set, and test set, respectively. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Avazu_x1 | 40,428,967 | 28,300,276 | 4,042,897 | 8,085,794 | 12 | 13 | + **Source:** https://www.kaggle.com/c/avazu-ctr-prediction/data 14 | + **Download:** https://huggingface.co/datasets/reczoo/Avazu_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Weiyu Cheng, Yanyan Shen, Linpeng Huang. [Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions](https://ojs.aaai.org/index.php/AAAI/article/view/5768). In AAAI 2020. 19 | - Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, Zhenhua Dong. [FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction](https://arxiv.org/abs/2304.00902). In AAAI 2023. 20 | - Jieming Zhu, Qinglin Jia, Guohao Cai, Quanyu Dai, Jingjie Li, Zhenhua Dong, Ruiming Tang, Rui Zhang. [FINAL: Factorized Interaction Layer for CTR Prediction](https://dl.acm.org/doi/10.1145/3539618.3591988). In SIGIR 2023. 
21 | 22 | + **Check the md5sum for data integrity:** 23 | ```bash 24 | $ md5sum train.csv valid.csv test.csv 25 | f1114a07aea9e996842c71648e0f6395 train.csv 26 | d9568f246357d156c4b8030fadb8b623 valid.csv 27 | 9e2fe9c48705c9315ae7a0953eb57acf test.csv 28 | ``` 29 | -------------------------------------------------------------------------------- /Avazu/Avazu_x1/convert_avazu_x1.py: -------------------------------------------------------------------------------- 1 | """ Convert libsvm data from AFN paper to csv format """ 2 | import pandas as pd 3 | from pathlib import Path 4 | import gc 5 | import hashlib 6 | 7 | headers = ["label", "feat_1", "feat_2", "feat_3", "feat_4", "feat_5", "feat_6", "feat_7", "feat_8", "feat_9", "feat_10", 8 | "feat_11", "feat_12", "feat_13", "feat_14", "feat_15", "feat_16", "feat_17", "feat_18", "feat_19", "feat_20", "feat_21", "feat_22"] 9 | 10 | data_files = ["train.libsvm", "valid.libsvm", "test.libsvm"] 11 | for f in data_files: 12 | df = pd.read_csv(f, sep=" ", names=headers) 13 | for col in headers[1:]: 14 | df[col] = df[col].apply(lambda x: x.split(':')[0]) 15 | df.to_csv(Path(f).stem + ".csv", index=False) 16 | del df 17 | gc.collect() 18 | 19 | 20 | # Check md5sum for correctness 21 | assert("f1114a07aea9e996842c71648e0f6395" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 22 | assert("d9568f246357d156c4b8030fadb8b623" == hashlib.md5(open('valid.csv', 'r').read().encode('utf-8')).hexdigest()) 23 | assert("9e2fe9c48705c9315ae7a0953eb57acf" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 24 | 25 | print("Reproducing data succeeded!") 26 | 27 | -------------------------------------------------------------------------------- /Avazu/Avazu_x1/download_avazu_x1.py: -------------------------------------------------------------------------------- 1 | # This file is modified from https://github.com/WeiyuCheng/AFN-AAAI-20/blob/master/src/download_criteo_and_avazu.py 2 | # to download the preprocessed data split Avazu_x1 3 | 4 | import os 5 | import zipfile 6 | import urllib.request 7 | from tqdm import tqdm 8 | 9 | 10 | class DownloadProgressBar(tqdm): 11 | def update_to(self, b=1, bsize=1, tsize=None): 12 | if tsize is not None: 13 | self.total = tsize 14 | self.update(b * bsize - self.n) 15 | 16 | def download(url, output_path): 17 | with DownloadProgressBar(unit='B', unit_scale=True, 18 | miniters=1, desc=url.split('/')[-1]) as t: 19 | urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to) 20 | 21 | if __name__ == "__main__": 22 | print("Begin to download avazu data, the total size is 683MB...") 23 | download('https://worksheets.codalab.org/rest/bundles/0xf5ab597052744680b1a55986557472c7/contents/blob/', './avazu.zip') 24 | print("Unzipping avazu dataset...") 25 | with zipfile.ZipFile('./avazu.zip', 'r') as zip_ref: 26 | zip_ref.extractall('./avazu/') 27 | print("Done.") 28 | -------------------------------------------------------------------------------- /Avazu/Avazu_x2/README.md: -------------------------------------------------------------------------------- 1 | # Avazu_x2 2 | 3 | + **Dataset description:** 4 | 5 | This dataset contains about 10 days of labeled click-through data on mobile advertisements. It has 22 feature fields including user features and advertisement attributes. 
Following the same setting as the [AutoGroup](https://dl.acm.org/doi/abs/10.1145/3397271.3401082) work, we randomly split 80% of the data for training and validation, and use the remaining 20% for testing. For all categorical fields, we filter infrequent features by setting the threshold min_category_count=20 and replace them with a default ``<OOV>`` token. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Avazu_x2 | 40,428,967 | 32,343,173 | | 8,085,794 | 12 | 13 | + **Source:** https://www.kaggle.com/c/avazu-ctr-prediction/data 14 | + **Download:** https://huggingface.co/datasets/reczoo/Avazu_x2/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Bin Liu, Niannan Xue, Huifeng Guo, Ruiming Tang, Stefanos Zafeiriou, Xiuqiang He, Zhenguo Li. [AutoGroup: Automatic Feature Grouping for Modelling Explicit High-Order Feature Interactions in CTR Prediction](https://dl.acm.org/doi/abs/10.1145/3397271.3401082). In SIGIR 2020. 19 | 20 | + **Check the md5sum for data integrity:** 21 | ```bash 22 | $ md5sum train.csv test.csv 23 | c41d786896e2ebe68e08a022199f0ce8 train.csv 24 | e641ea94c72cdc99b49656d3404f536e test.csv 25 | ``` 26 | -------------------------------------------------------------------------------- /Avazu/Avazu_x4/README.md: -------------------------------------------------------------------------------- 1 | # Avazu_x4 2 | 3 | + **Dataset description:** 4 | 5 | This dataset contains about 10 days of labeled click-through data on mobile advertisements. It has 22 feature fields including user features and advertisement attributes. Following the same setting as the [AutoInt](https://arxiv.org/abs/1810.11921) work, we split the data randomly into 8:1:1 as the training set, validation set, and test set, respectively. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Avazu_x4 | 40,428,967 | 32,343,172 | 4,042,897 | 4,042,898 | 12 | 13 | 14 | - Avazu_x4_001 15 | 16 | In this setting, we preprocess the data split by removing the ``id`` field that is useless for CTR prediction. In addition, we transform the timestamp field into three fields: hour, weekday, and is_weekend. For all categorical fields, we filter infrequent features by setting the threshold min_category_count=2 (performs well) and replace them with a default ``<OOV>`` token. Note that we do not follow the exact preprocessing steps in AutoInt, because the authors neither remove the useless ``id`` field nor specially preprocess the timestamp field. We fix **embedding_dim=16** following the existing [AutoInt work](https://arxiv.org/abs/1810.11921). 17 | 18 | - Avazu_x4_002 19 | 20 | In this setting, we preprocess the data split by removing the ``id`` field that is useless for CTR prediction. In addition, we transform the timestamp field into three fields: hour, weekday, and is_weekend. For all categorical fields, we filter infrequent features by setting the threshold min_category_count=1 and replace them with a default ``<OOV>`` token. Note that we found that min_category_count=1 performs the best, which is surprising. We fix **embedding_dim=40** following the existing [FGCNN work](https://arxiv.org/abs/1904.04447).
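A rough pandas sketch of the shared preprocessing steps described in the two settings above: dropping the `id` column and expanding the YYMMDDHH `hour` field into hour, weekday, and is_weekend. The file name and the exact derivation rules (e.g., which days count as the weekend) are assumptions, not the benchmark's reference code.

```python
import pandas as pd

# Minimal sketch (assumptions noted above), not the benchmark's reference preprocessing.
df = pd.read_csv("train.csv", dtype=str)            # raw Avazu split, file name assumed
df = df.drop(columns=["id"], errors="ignore")       # drop the per-sample id field

ts = pd.to_datetime(df["hour"], format="%y%m%d%H")  # e.g. 14091123 -> 2014-09-11 23:00 UTC
df["hour"] = ts.dt.hour.astype(str)
df["weekday"] = ts.dt.dayofweek.astype(str)         # 0 = Monday
df["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int).astype(str)
```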
21 | 22 | 23 | + **Source:** https://www.kaggle.com/c/avazu-ctr-prediction/data 24 | + **Download:** https://huggingface.co/datasets/reczoo/Avazu_x4/tree/main 25 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 26 | 27 | + **Used by papers:** 28 | - Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, Jian Tang. [AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/abs/1810.11921). In CIKM 2019. 29 | - Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, Xiuqiang He. [BARS-CTR: Open Benchmarking for Click-Through Rate Prediction](https://arxiv.org/abs/2009.05794). In CIKM 2021. 30 | 31 | + **Check the md5sum for data integrity:** 32 | ```bash 33 | $ md5sum train.csv valid.csv test.csv 34 | de3a27264cdabf66adf09df82328ccaa train.csv 35 | 33232931d84d6452d3f956e936cab2c9 valid.csv 36 | 3ebb774a9ca74d05919b84a3d402986d test.csv 37 | ``` 38 | -------------------------------------------------------------------------------- /Avazu/Avazu_x4/convert_avazu_x4.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import hashlib 4 | from sklearn.model_selection import StratifiedKFold 5 | 6 | """ 7 | NOTICE: We found that even though we fix the random seed, the resulting data split can be different 8 | due to the potential StratifiedKFold API change in different scikit-learn versions. For 9 | reproduciblity, `sklearn==0.19.1` is required. We use the python environement by installing 10 | `Anaconda3-5.2.0-Linux-x86_64.sh`. 11 | """ 12 | 13 | RANDOM_SEED = 2018 # Fix seed for reproduction 14 | ddf = pd.read_csv('train/train.csv', encoding='utf-8', dtype=object) 15 | X = ddf.values 16 | y = ddf['click'].map(lambda x: float(x)).values 17 | print(str(len(X)) + ' lines in total') 18 | 19 | folds = StratifiedKFold(n_splits=10, shuffle=True, 20 | random_state=RANDOM_SEED).split(X, y) 21 | 22 | fold_indexes = [] 23 | for train_id, valid_id in folds: 24 | fold_indexes.append(valid_id) 25 | test_index = fold_indexes[0] 26 | valid_index = fold_indexes[1] 27 | train_index = np.concatenate(fold_indexes[2:]) 28 | 29 | test_df = ddf.loc[test_index, :] 30 | test_df.to_csv('test.csv', index=False, encoding='utf-8') 31 | valid_df = ddf.loc[valid_index, :] 32 | valid_df.to_csv('valid.csv', index=False, encoding='utf-8') 33 | ddf.loc[train_index, :].to_csv('train.csv', index=False, encoding='utf-8') 34 | 35 | print('Train lines:', len(train_index)) 36 | print('Validation lines:', len(valid_index)) 37 | print('Test lines:', len(test_index)) 38 | print('Postive ratio:', np.sum(y) / len(y)) 39 | 40 | # Check md5sum for correctness 41 | assert("de3a27264cdabf66adf09df82328ccaa" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 42 | assert("33232931d84d6452d3f956e936cab2c9" == hashlib.md5(open('valid.csv', 'r').read().encode('utf-8')).hexdigest()) 43 | assert("3ebb774a9ca74d05919b84a3d402986d" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 44 | 45 | print("Reproducing data succeeded!") 46 | -------------------------------------------------------------------------------- /Avazu/README.md: -------------------------------------------------------------------------------- 1 | # Avazu 2 | 3 | + [Avazu_x1](./Avazu_x1/README.md) 4 | + [Avazu_x2](./Avazu_x2/README.md) 5 | + [Avazu_x4](./Avazu_x4/README.md) 6 | 7 | It is a [Kaggle challenge dataset](https://www.kaggle.com/c/avazu-ctr-prediction/data) for Avazu CTR prediction. 
[Avazu](http://avazuinc.com/home) is one of the leading mobile advertising platforms globally. The Kaggle competition targets at predicting whether a mobile ad will be clicked and has provided 11 days worth of Avazu data to build and test prediction models. It consists of 10 days of labeled click-through data for training and 1 day of ads data for testing (yet without labels). Note that only the first 10 days of labeled data are used for benchmarking. 8 | 9 | Data fields consist of: 10 | + id: ad identifier (``Note: This column is more like unique sample id, where each row has a distinct value, and thus should be dropped.``) 11 | + click: 0/1 for non-click/click 12 | + hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC. (``Note: It is a common practice to bucketize the timestamp into hour, day, is_weekend, and so on.``) 13 | + C1: anonymized categorical variable 14 | + banner_pos 15 | + site_id 16 | + site_domain 17 | + site_category 18 | + app_id 19 | + app_domain 20 | + app_category 21 | + device_id 22 | + device_ip 23 | + device_model 24 | + device_type 25 | + device_conn_type 26 | + C14-C21: anonymized categorical variables 27 | -------------------------------------------------------------------------------- /CiteULike/CiteUlikeA_m1/README.md: -------------------------------------------------------------------------------- 1 | # CiteUlikeA_m1 2 | 3 | + **Data format:** 4 | user_id item1 item2 ... 5 | 6 | + **Source:** http://www.citeulike.org 7 | + **Download:** https://huggingface.co/datasets/reczoo/CiteUlikeA_m1/tree/main 8 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 9 | 10 | + **Used by papers:** 11 | - Shuyi Ji, Yifan Feng, Rongrong Ji, Xibin Zhao, Wanwan Tang, Yue Gao. [Dual Channel Hypergraph Collaborative Filtering](https://dl.acm.org/doi/10.1145/3394486.3403253). In KDD 2020. 12 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filterin](https://arxiv.org/abs/2109.12613). In CIKM 2021. 13 | 14 | + **Check the md5sum for data integrity:** 15 | ```bash 16 | $ md5sum *.txt 17 | c9d2de139ac69d480264b6221a567324 test.txt 18 | f037c7ac8f9d8142bb5fd137ff61ad0c train.txt 19 | ``` 20 | -------------------------------------------------------------------------------- /Criteo/Criteo_x1/README.md: -------------------------------------------------------------------------------- 1 | # Criteo_x1 2 | 3 | + **Dataset description:** 4 | 5 | The Criteo dataset is a widely-used benchmark dataset for CTR prediction, which contains about one week of click-through data for display advertising. It has 13 numerical feature fields and 26 categorical feature fields. Following the [AFN](https://ojs.aaai.org/index.php/AAAI/article/view/5768) work, we randomly split the data into 7:2:1\* as the training set, validation set, and test set, respectively. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Criteo_x1 | 45,840,617 | 33,003,326 | 8,250,124 | 4,587,167 | 12 | 13 | + **Source:** https://www.kaggle.com/c/criteo-display-ad-challenge/data 14 | + **Download:** https://huggingface.co/datasets/reczoo/Criteo_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Weiyu Cheng, Yanyan Shen, Linpeng Huang. 
[Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions](https://ojs.aaai.org/index.php/AAAI/article/view/5768). In AAAI 2020. 19 | - Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, Zhenhua Dong. [FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction](https://arxiv.org/abs/2304.00902). In AAAI 2023. 20 | - Jieming Zhu, Qinglin Jia, Guohao Cai, Quanyu Dai, Jingjie Li, Zhenhua Dong, Ruiming Tang, Rui Zhang. [FINAL: Factorized Interaction Layer for CTR Prediction](https://dl.acm.org/doi/10.1145/3539618.3591988). In SIGIR 2023. 21 | 22 | + **Check the md5sum for data integrity:** 23 | ```bash 24 | $ md5sum train.csv valid.csv test.csv 25 | 30b89c1c7213013b92df52ec44f52dc5 train.csv 26 | f73c71fb3c4f66b6ebdfa032646bea72 valid.csv 27 | 2c48b26e84c04a69b948082edae46f8c test.csv 28 | ``` 29 | -------------------------------------------------------------------------------- /Criteo/Criteo_x1/convert_criteo_x1.py: -------------------------------------------------------------------------------- 1 | """ Convert libsvm data from AFN paper to csv format """ 2 | import pandas as pd 3 | from pathlib import Path 4 | import gc 5 | import hashlib 6 | 7 | headers = ["label", "I1", "I2", "I3", "I4", "I5", "I6", "I7", "I8", "I9", "I10", 8 | "I11", "I12", "I13", "C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9", "C10", 9 | "C11", "C12", "C13", "C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21", "C22", 10 | "C23", "C24", "C25", "C26"] 11 | 12 | data_files = ["train.libsvm", "valid.libsvm", "test.libsvm"] 13 | for f in data_files: 14 | df = pd.read_csv(f, sep=" ", names=headers) 15 | for col in headers[1:]: 16 | if col.startswith("I"): 17 | df[col] = df[col].apply(lambda x: x.split(':')[-1]) 18 | elif col.startswith("C"): 19 | df[col] = df[col].apply(lambda x: x.split(':')[0]) 20 | df.to_csv(Path(f).stem + ".csv", index=False) 21 | del df 22 | gc.collect() 23 | 24 | # Check md5sum for correctness 25 | assert("30b89c1c7213013b92df52ec44f52dc5" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 26 | assert("f73c71fb3c4f66b6ebdfa032646bea72" == hashlib.md5(open('valid.csv', 'r').read().encode('utf-8')).hexdigest()) 27 | assert("2c48b26e84c04a69b948082edae46f8c" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 28 | 29 | print("Reproducing data succeeded!") 30 | 31 | -------------------------------------------------------------------------------- /Criteo/Criteo_x1/download_criteo_x1.py: -------------------------------------------------------------------------------- 1 | # This file is modified from https://github.com/WeiyuCheng/AFN-AAAI-20/blob/master/src/download_criteo_and_avazu.py 2 | # to download the preprocessed data split Criteo_x1 3 | 4 | import os 5 | import zipfile 6 | import urllib.request 7 | from tqdm import tqdm 8 | 9 | 10 | class DownloadProgressBar(tqdm): 11 | def update_to(self, b=1, bsize=1, tsize=None): 12 | if tsize is not None: 13 | self.total = tsize 14 | self.update(b * bsize - self.n) 15 | 16 | def download(url, output_path): 17 | with DownloadProgressBar(unit='B', unit_scale=True, 18 | miniters=1, desc=url.split('/')[-1]) as t: 19 | urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to) 20 | 21 | if __name__ == "__main__": 22 | print("Begin to download criteo data, the total size is 3GB...") 23 | download('https://worksheets.codalab.org/rest/bundles/0x8dca5e7bac42470aa445f9a205d177c6/contents/blob/', './criteo.zip') 24 | print("Unzipping criteo dataset...") 25 | with 
zipfile.ZipFile('./criteo.zip', 'r') as zip_ref: 26 | zip_ref.extractall('./criteo/') 27 | print("Done.") 28 | 29 | -------------------------------------------------------------------------------- /Criteo/Criteo_x2/README.md: -------------------------------------------------------------------------------- 1 | # Criteo_x2 2 | 3 | + **Dataset description:** 4 | 5 | This dataset employs the [Criteo 1TB Click Logs](https://ailab.criteo.com/criteo-1tb-click-logs-dataset/) for display advertising, which contains one month of click-through data with billions of data samples. Following the same setting with the [AutoGroup](https://dl.acm.org/doi/abs/10.1145/3397271.3401082) work, we select "data 6-12" as the training set while using "day-13" for testing. To reduce label imbalance, we perform negative sub-sampling to keep the positive ratio roughly at 50%. It has 13 numerical feature fields and 26 categorical feature fields. In this setting, 13 numerical fields are converted into categorical values through bucketizing, while categorical features appearing less than 20 times are set as a default ```` feature. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Criteo_x2 | 99,616,043 | 86,883,012 | | 12,733,031 | 12 | 13 | + **Source:** https://ailab.criteo.com/criteo-1tb-click-logs-dataset 14 | + **Download:** https://huggingface.co/datasets/reczoo/Criteo_x2/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Bin Liu, Niannan Xue, Huifeng Guo, Ruiming Tang, Stefanos Zafeiriou, Xiuqiang He, Zhenguo Li. [AutoGroup: Automatic Feature Grouping for Modelling Explicit High-Order Feature Interactions in CTR Prediction](https://dl.acm.org/doi/abs/10.1145/3397271.3401082). In SIGIR 2020. 19 | 20 | + **Check the md5sum for data integrity:** 21 | ```bash 22 | $ md5sum train.csv test.csv 23 | d4d08405e95836ee049455cae0f8b0d6 train.csv 24 | 32c14fbc7bfe02e72b501793e8db660b test.csv 25 | ``` 26 | -------------------------------------------------------------------------------- /Criteo/Criteo_x4/README.md: -------------------------------------------------------------------------------- 1 | # Criteo_x4 2 | 3 | + **Dataset description:** 4 | 5 | The Criteo dataset is a widely-used benchmark dataset for CTR prediction, which contains about one week of click-through data for display advertising. It has 13 numerical feature fields and 26 categorical feature fields. Following the setting with the [AutoInt work](https://arxiv.org/abs/1810.11921), we randomly split the data into 8:1:1 as the training set, validation set, and test set, respectively. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Criteo_x4 | 45,840,617 | 36,672,493 | 4,584,062 | 4,584,062 | 12 | 13 | 14 | - Criteo_x4_001 15 | 16 | In this setting, we follow the winner's solution of the Criteo challenge to discretize each integer value x to ⌊log2(x)⌋, if x > 2; and x = 1 otherwise. For all categorical fields, we replace infrequent features with a default ```` token by setting the threshold min_category_count=10. Note that we do not follow the exact preprocessing steps in AutoInt, because this preprocessing performs much better. We fix **embedding_dim=16** as with AutoInt. 
17 | 18 | - Criteo_x4_002 19 | 20 | In this setting, we follow the winner's solution of the Criteo challenge to discretize each integer value x to ⌊log2(x)⌋, if x > 2; and x = 1 otherwise. For all categorical fields, we replace infrequent features with a default ``<OOV>`` token by setting the threshold min_category_count=2. We fix **embedding_dim=40** in this setting. 21 | 22 | 23 | + **Source:** https://www.kaggle.com/c/criteo-display-ad-challenge/data 24 | + **Download:** https://huggingface.co/datasets/reczoo/Criteo_x4/tree/main 25 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 26 | 27 | + **Used by papers:** 28 | - Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, Jian Tang. [AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/abs/1810.11921). In CIKM 2019. 29 | - Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, Xiuqiang He. [BARS-CTR: Open Benchmarking for Click-Through Rate Prediction](https://arxiv.org/abs/2009.05794). In CIKM 2021. 30 | 31 | + **Check the md5sum for data integrity:** 32 | ```bash 33 | $ md5sum train.csv valid.csv test.csv 34 | 4a53bb7cbc0e4ee25f9d6a73ed824b1a train.csv 35 | fba5428b22895016e790e2dec623cb56 valid.csv 36 | cfc37da0d75c4d2d8778e76997df2976 test.csv 37 | ``` 38 | -------------------------------------------------------------------------------- /Criteo/Criteo_x4/convert_criteo_x4.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import hashlib 4 | from sklearn.model_selection import StratifiedKFold 5 | 6 | """ 7 | NOTICE: We found that even though we fix the random seed, the resulting data split can be different 8 | due to the potential StratifiedKFold API change in different scikit-learn versions. For 9 | reproducibility, `sklearn==0.19.1` is required. We use the Python environment obtained by installing 10 | `Anaconda3-5.2.0-Linux-x86_64.sh`.
11 | """ 12 | 13 | RANDOM_SEED = 2018 # Fix seed for reproduction 14 | cols = ['Label'] 15 | for i in range(1, 14): 16 | cols.append('I' + str(i)) 17 | for i in range(1, 27): 18 | cols.append('C' + str(i)) 19 | 20 | ddf = pd.read_csv('dac/train.txt', sep='\t', header=None, names=cols, encoding='utf-8', dtype=object) 21 | X = ddf.values 22 | y = ddf['Label'].map(lambda x: float(x)).values 23 | print(str(len(X)) + ' lines in total') 24 | 25 | folds = StratifiedKFold(n_splits=10, shuffle=True, 26 | random_state=RANDOM_SEED).split(X, y) 27 | 28 | fold_indexes = [] 29 | for train_id, valid_id in folds: 30 | fold_indexes.append(valid_id) 31 | test_index = fold_indexes[0] 32 | valid_index = fold_indexes[1] 33 | train_index = np.concatenate(fold_indexes[2:]) 34 | 35 | test_df = ddf.loc[test_index, :] 36 | test_df.to_csv('test.csv', index=False, encoding='utf-8') 37 | valid_df = ddf.loc[valid_index, :] 38 | valid_df.to_csv('valid.csv', index=False, encoding='utf-8') 39 | ddf.loc[train_index, :].to_csv('train.csv', index=False, encoding='utf-8') 40 | 41 | print('Train lines:', len(train_index)) 42 | print('Validation lines:', len(valid_index)) 43 | print('Test lines:', len(test_index)) 44 | print('Postive ratio:', np.sum(y) / len(y)) 45 | 46 | # Check md5sum for correctness 47 | assert("4a53bb7cbc0e4ee25f9d6a73ed824b1a" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 48 | assert("fba5428b22895016e790e2dec623cb56" == hashlib.md5(open('valid.csv', 'r').read().encode('utf-8')).hexdigest()) 49 | assert("cfc37da0d75c4d2d8778e76997df2976" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 50 | 51 | print("Reproducing data succeeded!") 52 | 53 | -------------------------------------------------------------------------------- /Criteo/README.md: -------------------------------------------------------------------------------- 1 | # Criteo 2 | 3 | + [Criteo_x1](./Criteo_x1) 4 | + [Criteo_x2](./Criteo_x2) 5 | + [Criteo_x4](./Criteo_x4) 6 | 7 | The dataset is from a [Kaggle challenge for Criteo display advertising](https://www.kaggle.com/c/criteo-display-ad-challenge/data). Criteo is a personalized retargeting company that works with Internet retailers to serve personalized online display advertisements to consumers. The goal of this Kaggle challenge is to predict click-through rates on display ads. It offers a week's worth of data from Criteo's traffic. In the labeled training set over a period of 7 days, each row corresponds to a display ad served by Criteo. The samples are chronologically ordered. Positive and negatives samples have both been subsampled at different rates in order to reduce the dataset size. There are 13 count features and 26 categorical features. The semantic of these features is undisclosed. Some feature have missing values. Note that only the labeled part (i.e., `train.txt`) of the data is used for benchmarking. 8 | 9 | Data fields consist of: 10 | + Label: Target variable that indicates if an ad was clicked (1) or not (0). 11 | + I1-I13: A total of 13 columns of integer features (mostly count features). 12 | + C1-C26: A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes. 
13 | -------------------------------------------------------------------------------- /Frappe/Frappe_x1/README.md: -------------------------------------------------------------------------------- 1 | # Frappe_x1 2 | 3 | + **Dataset description:** 4 | 5 | The Frappe dataset contains a context-aware app usage log, which comprises 96203 entries by 957 users for 4082 apps used in various contexts. It has 10 feature fields including user_id, item_id, daytime, weekday, isweekend, homework, cost, weather, country, city. The target value indicates whether the user has used the app under the context. Following the [AFN](https://ojs.aaai.org/index.php/AAAI/article/view/5768) work, we randomly split the data into 7:2:1 as the training set, validation set, and test set, respectively. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | Frappe_x1 | 288,609 | 202,027 | 57,722 | 28,860 | 12 | 13 | + **Source:** https://www.baltrunas.info/context-aware 14 | + **Download:** https://huggingface.co/datasets/reczoo/Frappe_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Weiyu Cheng, Yanyan Shen, Linpeng Huang. [Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions](https://ojs.aaai.org/index.php/AAAI/article/view/5768). In AAAI 2020. 19 | - Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, Zhenhua Dong. [FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction](https://arxiv.org/abs/2304.00902). In AAAI 2023. 20 | - Jieming Zhu, Qinglin Jia, Guohao Cai, Quanyu Dai, Jingjie Li, Zhenhua Dong, Ruiming Tang, Rui Zhang. [FINAL: Factorized Interaction Layer for CTR Prediction](https://dl.acm.org/doi/10.1145/3539618.3591988). In SIGIR 2023. 21 | 22 | + **Check the md5sum for data integrity:** 23 | ```bash 24 | $ md5sum train.csv valid.csv test.csv 25 | ba7306e6c4fc19dd2cd84f2f0596d158 train.csv 26 | 88d51bf2173505436d3a8f78f2a59da8 valid.csv 27 | 3470f6d32713dc5f7715f198ca7c612a test.csv 28 | ``` 29 | -------------------------------------------------------------------------------- /Frappe/Frappe_x1/convert_frappe_x1.py: -------------------------------------------------------------------------------- 1 | # Convert libsvm data from AFN [AAAI'2020] to csv format 2 | 3 | import pandas as pd 4 | from pathlib import Path 5 | import gc 6 | 7 | headers = ["label", "user", "item", "daytime", "weekday", "isweekend", "homework", "cost", "weather", "country", "city"] 8 | 9 | data_files = ["train.libsvm", "valid.libsvm", "test.libsvm"] 10 | for f in data_files: 11 | df = pd.read_csv(f, sep=" ", names=headers) 12 | for col in headers[1:]: 13 | df[col] = df[col].apply(lambda x: x.split(':')[0]) 14 | df.to_csv(Path(f).stem + ".csv", index=False) 15 | del df 16 | gc.collect() -------------------------------------------------------------------------------- /Frappe/README.md: -------------------------------------------------------------------------------- 1 | # Frappe 2 | 3 | + [Frappe_x1](./Frappe_x1) 4 | 5 | The frappe dataset contains a context-aware app usage log. It consists of 96203 entries by 957 users for 4082 apps used in various contexts. 
6 | 7 | Data fields consist of: 8 | + user: anonymized user id 9 | + item: anonymized app id 10 | + daytime 11 | + weekday 12 | + isweekend 13 | + homework 14 | + cost 15 | + weather 16 | + country 17 | + city 18 | + cnt: how many times the app has been used by the user 19 | 20 | Any scientific publications that use this dataset should cite the following paper: 21 | 22 | + Linas Baltrunas, Karen Church, Alexandros Karatzoglou, Nuria Oliver. [Frappe: Understanding the Usage and Perception of Mobile App Recommendations In-The-Wild](https://arxiv.org/abs/1505.03014), Arxiv 1505.03014, 2015. 23 | -------------------------------------------------------------------------------- /Gowalla/Gowalla_m1/README.md: -------------------------------------------------------------------------------- 1 | # Gowalla_m1 2 | 3 | + **Dataset description:** 4 | 5 | The dataset statistics are summarized as follows: 6 | 7 | | Dataset ID | #Users | #Items | #Interactions | #Train | #Test | Density | 8 | |:--------------:|:------:|:------:|:-------------:|:---------:|:-------:|:-------:| 9 | | Gowalla_m1 | 29,858 | 40,981 | 1,027,370 | 810,128 | 217,242 | 0.00084 | 10 | 11 | + **Source:** https://snap.stanford.edu/data/loc-gowalla.html 12 | + **Download:** https://huggingface.co/datasets/reczoo/Gowalla_m1/tree/main 13 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 14 | 15 | + **Used by papers:** 16 | - Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, Meng Wang. [LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation](https://arxiv.org/abs/2002.02126). In SIGIR 2020. 17 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filtering](https://arxiv.org/abs/2109.12613). In CIKM 2021. 18 | - Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. [UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation](https://arxiv.org/abs/2110.15114). In CIKM 2021. 19 | 20 | + **Check the md5sum for data integrity:** 21 | ```bash 22 | $ md5sum *.txt 23 | 13b1c0d75b07b8cea9413f40042f476f item_list.txt 24 | c04e2c4bcd2389f53ed8281816166149 test.txt 25 | 5eec1eb2edb8dd648377d348b8e136cf train.txt 26 | f83ec6f2cd974ba6470e8808830cc144 user_list.txt 27 | ``` 28 | -------------------------------------------------------------------------------- /KKBox/KKBox_x1/README.md: -------------------------------------------------------------------------------- 1 | # KKBox_x1 2 | 3 | + **Dataset description:** 4 | 5 | KKBox is a challenge dataset for music recommendation at WSDM 2018. The data consist of user-song pairs in a given time period, with a total of 19 user features (e.g., city, gender) and song features (e.g., language, genre, artist). We randomly split the data into 8:1:1 as the training set, validation set, and test set, respectively. In this setting, for all categorical fields, we replace infrequent features with a default ```` token by setting the threshold min_category_count=10. 
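A minimal sketch of the min_category_count filtering described above: categories seen fewer than 10 times in the training split are mapped to a default `<OOV>` token before feature encoding. The label column name, the strict "fewer than" rule, and treating every non-label column as categorical are assumptions.

```python
import pandas as pd

MIN_CATEGORY_COUNT = 10  # threshold used for KKBox_x1, per the description above

train = pd.read_csv("train.csv", dtype=str)
valid = pd.read_csv("valid.csv", dtype=str)
test = pd.read_csv("test.csv", dtype=str)

cat_cols = [c for c in train.columns if c != "target"]  # label column name assumed
for col in cat_cols:
    counts = train[col].value_counts()                  # frequencies from the training split only
    keep = set(counts[counts >= MIN_CATEGORY_COUNT].index)
    for split in (train, valid, test):
        split[col] = split[col].where(split[col].isin(keep), "<OOV>")
```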
6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | KKBox_x1 | 7,377,418 | 5,901,932 | 737,743 | 737,743 | 12 | 13 | + **Source:** https://www.kaggle.com/c/kkbox-music-recommendation-challenge 14 | + **Download:** https://huggingface.co/datasets/reczoo/KKBox_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Jieming Zhu, Quanyu Dai, Liangcai Su, Rong Ma, Jinyang Liu, Guohao Cai, Xi Xiao, Rui Zhang. [BARS: Towards Open Benchmarking for Recommender Systems](https://arxiv.org/abs/2205.09626). In SIGIR 2022. 19 | 20 | + **Check the md5sum for data integrity:** 21 | ```bash 22 | $ md5sum train.csv valid.csv test.csv 23 | 195b1ae8fc2d9267d7c8656c07ea1304 train.csv 24 | 398e97ac139611a09bd61a58e4240a3e valid.csv 25 | 8c5f7add05a6f5258b6b3bcc00ba640b test.csv 26 | ``` 27 | -------------------------------------------------------------------------------- /KKBox/README.md: -------------------------------------------------------------------------------- 1 | # KKBox 2 | 3 | + [KKBox_x1](./KKBox_x1) 4 | 5 | It is a [WSDM challenge dataset for KKBox's music recommendation](https://www.kaggle.com/c/kkbox-music-recommendation-challenge) in 2018. The dataset is from [KKBox](https://www.kkbox.com), Asia's leading music streaming service, which holds the world's most comprehensive Asia-Pop music library with over 30 million tracks. 6 | 7 | The task is to predict the chances of a user listening to a song repetitively after the first observable listening event within a time window was triggered. If there are recurring listening event(s) triggered within a month after the user's very first observable listening event, its target is marked 1, and 0 otherwise in the training set. KKBox provides a training data set consists of information of the first observable listening event for each unique user-song pair within a specific time duration. Metadata of each unique user and song pair is also provided. The train and the test data are selected from users listening history in a given time period, and are split based on time. Note that only the labeled train set of the dataset is used for benchmarking. 8 | 9 | Data fields consist of: 10 | + target: this is the target variable. target=1 means there are recurring listening event(s) triggered within a month after the user’s very first observable listening event, target=0 otherwise . 11 | + msno: user id 12 | + song_id: song id 13 | + source_system_tab: the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile apps functions. For example, tab my library contains functions to manipulate the local storage, and tab search contains functions relating to search. 14 | + source_screen_name: name of the layout a user sees. 15 | + source_type: an entry point a user first plays music on mobile apps. An entry point could be album, online-playlist, song, etc. 16 | 17 | Song features: 18 | + song_length: in ms 19 | + genre_ids: genre category. Some songs have multiple genres and they are separated by | 20 | + artist_name 21 | + composer 22 | + lyricist 23 | + language 24 | + song name: the name of the song. 25 | + isrc: International Standard Recording Code 26 | 27 | User features: 28 | + city 29 | + bd: age. Note: this column has outlier values, please use your judgement. 
30 | + gender 31 | + registered_via: registration method 32 | + registration_init_time: format %Y%m%d 33 | + expiration_date: format %Y%m%d 34 | -------------------------------------------------------------------------------- /KuaiShou/KuaiVideo_x1/README.md: -------------------------------------------------------------------------------- 1 | # KuaiVideo_x1 2 | 3 | + **Dataset description:** 4 | 5 | The raw dataset is released by the Kuaishou Competition in the China MM 2018 conference, which aims to predict users' click probabilities for new micro-videos. In this dataset, there are multiple types of interactions between users and micro-videos, such as "click", "not click", "like", and "follow". Particularly, "not click" means the user did not click the micro-video after previewing its thumbnail. Note that the timestamp associated with each behaviour has been processed such that the absolute time is unknown, but the sequential order can be obtained according to the timestamp. For each micro-video, we can access its 2,048-d visual embedding of its thumbnail. In total, 10,000 users and their 3,239,534 interacted micro-videos are randomly selected. We follow the train-test data splitting from the [ALPINE](https://github.com/liyongqi67/ALPINE) work. In this setting, we filter infrequent categorical features with the threshold min_category_count=10. We further set the maximal length of user behavior sequence to 100. 6 | 7 | Note that the 3239534 item ids in behavior data are not continous (0 ~ 3242314), thus `item_visual_emb_dim64.h5` has 3242315 rows, each of which corresponds to an item id and its visual embedding. 8 | 9 | The dataset statistics are summarized as follows: 10 | 11 | | Dataset Split | Total | #Train | #Validation | #Test | 12 | | :--------: | :-----: |:-----: | :----------: | :----: | 13 | | KuaiVideo_x1 | 13,661,383 | 10,931,092 | | 2,730,291 | 14 | 15 | + **Source:** https://www.kuaishou.com/activity/uimc 16 | + **Download:** https://huggingface.co/datasets/reczoo/KuaiVideo_x1/tree/main 17 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 18 | 19 | + **Used by papers:** 20 | - Yongqi Li, Meng Liu, Jianhua Yin, Chaoran Cui, Xinshun-Xu, and Liqiang Nie. [Routing Micro-videos via A Temporal Graph-guided Recommendation System](https://liyongqi67.github.io/papers/MM2019_Routing_Micro_videos_via_A_Temporal_Graph_guided_Recommendation_System.pdf). In MM 2020. 21 | - Jieming Zhu, Guohao Cai, Junjie Huang, Zhenhua Dong, Ruiming Tang, Weinan Zhang. [ReLoop2: Building Self-Adaptive Recommendation Models via Responsive Error Compensation Loop](https://arxiv.org/abs/2306.08808). In KDD 2023. 22 | 23 | + **Check the md5sum for data integrity:** 24 | ```bash 25 | $ md5sum train.csv test.csv 26 | 16f13734411532cc313caf2180bfcd56 train.csv 27 | ba26c01caaf6c65c272af11aa451fc7a test.csv 28 | ``` 29 | -------------------------------------------------------------------------------- /KuaiShou/KuaiVideo_x1/convert_kuaivideo_x1.py: -------------------------------------------------------------------------------- 1 | """ Convert the raw `dataset.pkl` from pickle to csv format, which is 2 | obtained from the following paper: Li et al., Routing Micro-videos 3 | via A Temporal Graph-guided Recommendation System, MM 2019. 
4 | See https://github.com/liyongqi67/ALPINE 5 | """ 6 | 7 | import pickle 8 | import numpy as np 9 | import h5py 10 | import pandas as pd 11 | import hashlib 12 | 13 | 14 | data_path = "./" 15 | MAX_SEQ_LEN = 100 # chunk the max length of behavior sequence to 100 16 | 17 | with open(data_path + "dataset.pkl", "rb") as f: 18 | train = pickle.load(f) 19 | test = pickle.load(f) 20 | pos_seq = pickle.load(f) 21 | neg_seq = pickle.load(f) 22 | pos_edge = pickle.load(f) 23 | neg_edge = pickle.load(f) 24 | 25 | for part in ["train", "test"]: 26 | sample_list = [] 27 | for sample in eval(part): 28 | user_id = sample[0][0] 29 | item_id = sample[0][1] 30 | is_click = sample[0][2] 31 | is_like = sample[0][3] 32 | is_follow = sample[0][4] 33 | timestamp = sample[0][5] 34 | pos_len = sample[1] 35 | neg_len = sample[2] 36 | pos_items = "^".join(map(str, pos_seq[user_id][0:min(pos_len, MAX_SEQ_LEN)])) 37 | neg_items = "^".join(map(str, neg_seq[user_id][0:min(neg_len, MAX_SEQ_LEN)])) 38 | sample_list.append([timestamp, user_id, item_id, is_click, is_like, is_follow, pos_items, neg_items]) 39 | data = pd.DataFrame(sample_list, columns=["timestamp", "user_id", "item_id", "is_click", "is_like", "is_follow", "pos_items", "neg_items"]) 40 | data.sort_values(by="timestamp", inplace=True) 41 | data.to_csv(f"{part}" + ".csv", index=False) 42 | 43 | user_emb = np.load(data_path + "user_like.npy") 44 | image_emb = np.load(data_path + "visual64_select.npy") 45 | 46 | with h5py.File("item_visual_emb_dim64.h5", 'w') as hf: 47 | hf.create_dataset("key", data=list(range(len(image_emb)))) 48 | hf.create_dataset("value", data=image_emb) 49 | 50 | with h5py.File("user_visual_emb_dim64.h5", 'w') as hf: 51 | hf.create_dataset("key", data=list(range(len(user_emb)))) 52 | hf.create_dataset("value", data=user_emb) 53 | 54 | # Check md5sum for correctness 55 | assert("16f13734411532cc313caf2180bfcd56" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 56 | assert("ba26c01caaf6c65c272af11aa451fc7a" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 57 | 58 | print("Reproducing data succeeded!") 59 | -------------------------------------------------------------------------------- /KuaiShou/README.md: -------------------------------------------------------------------------------- 1 | # KuaiShou 2 | 3 | + [KuaiVideo_x1](./KuaiVideo_x1) 4 | -------------------------------------------------------------------------------- /MIND/MIND_large_x1/README.md: -------------------------------------------------------------------------------- 1 | # MIND_large_x1 2 | 3 | + **Dataset description:** 4 | 5 | MIND is a large-scale Microsoft news dataset for news recommendation. It was collected from anonymized behavior logs of Microsoft News website. MIND totally contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression. 
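The conversion script below stores the pretrained entity embeddings as paired `key`/`value` datasets in `entity_emb_dim100.h5`; the KuaiVideo script above uses the same layout for its visual embeddings. A minimal sketch for loading such a file back into a dict:

```python
import h5py

# Minimal sketch: read back a key/value embedding file written by the conversion scripts.
with h5py.File("entity_emb_dim100.h5", "r") as hf:
    keys = [k.decode("utf-8") if isinstance(k, bytes) else k for k in hf["key"][:]]
    values = hf["value"][:]  # shape: (num_entities, 100)

entity_emb = dict(zip(keys, values))
print(len(entity_emb), "entities,", values.shape[1], "dims")
```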
6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | MIND_large_x1 | | | | | 12 | 13 | + **Source:** https://msnews.github.io/index.html 14 | + **Download:** https://huggingface.co/datasets/reczoo/MIND_large_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, Ming Zhou. [MIND: A Large-scale Dataset for News Recommendation](https://aclanthology.org/2020.acl-main.331). In ACL 2020. 19 | - Jian Li, Jieming Zhu, Qiwei Bi, Guohao Cai, Lifeng Shang, Zhenhua Dong, Xin Jiang, Qun Liu. [MINER: Multi-Interest Matching Network for News Recommendation](https://aclanthology.org/2022.findings-acl.29.pdf). In ACL 2022. 20 | - Qijiong Liu, Jieming Zhu, Quanyu Dai, Xiaoming Wu. [Boosting Deep CTR Prediction with a Plug-and-Play Pre-trainer for News Recommendation](https://aclanthology.org/2022.coling-1.249.pdf). In COLING 2022. 21 | 22 | + **Check the md5sum for data integrity:** 23 | ```bash 24 | $ md5sum train.csv valid.csv test.csv news_corpus.tsv 25 | 955b80b959fb15076a0568d82da6bf05 train.csv 26 | 4942111ca7ba975b5f5dae8e2c54f1f0 valid.csv 27 | cbd5e69d573dc471d9f9ae91f2b5690f test.csv 28 | 9007e6b9127ff71bf146b7cfc1dc842d news_corpus.tsv 29 | ``` 30 | -------------------------------------------------------------------------------- /MIND/MIND_large_x1/convert_MIND_large_x1.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import h5py 4 | import os 5 | import hashlib 6 | 7 | MAX_SEQ_LEN = 50 8 | train_path = "./MINDlarge_train/" 9 | dev_path = "./MINDlarge_dev/" 10 | test_path = "./MINDlarge_test/" 11 | 12 | print("Preprocess news profile...") 13 | train_wiki_file = os.path.join(train_path, "entity_embedding.vec") 14 | dev_wiki_file = os.path.join(dev_path, "entity_embedding.vec") 15 | test_wiki_file = os.path.join(test_path, "entity_embedding.vec") 16 | entity_dict = dict() 17 | with open(train_wiki_file, "r") as fin: 18 | for line in fin: 19 | l = line.strip().split("\t") 20 | entity_dict[l[0]] = [float(v) for v in l[1:]] 21 | with open(dev_wiki_file, "r") as fin: 22 | for line in fin: 23 | l = line.strip().split("\t") 24 | entity_dict[l[0]] = [float(v) for v in l[1:]] 25 | with open(test_wiki_file, "r") as fin: 26 | for line in fin: 27 | l = line.strip().split("\t") 28 | entity_dict[l[0]] = [float(v) for v in l[1:]] 29 | 30 | train_news_file = os.path.join(train_path, "news.tsv") 31 | train_news = pd.read_csv(train_news_file, sep="\t", header=None, 32 | names=["news_id", "cat", "sub_cat", "title", "abstract", "url", 33 | "title_entities", "abstract_entities"]) 34 | dev_news_file = os.path.join(dev_path, "news.tsv") 35 | dev_news = pd.read_csv(dev_news_file, sep="\t", header=None, 36 | names=["news_id", "cat", "sub_cat", "title", "abstract", "url", 37 | "title_entities", "abstract_entities"]) 38 | test_news_file = os.path.join(test_path, "news.tsv") 39 | test_news = pd.read_csv(test_news_file, sep="\t", header=None, 40 | names=["news_id", "cat", "sub_cat", "title", "abstract", "url", 41 | "title_entities", "abstract_entities"]) 42 | news = pd.concat([train_news, dev_news, test_news], axis=0) 43 | news = news.drop_duplicates(subset=['news_id']).reset_index(drop=True) 44 | news = news[["news_id", 
"cat", "sub_cat", "title", "abstract", "title_entities", "abstract_entities"]] 45 | news["title_entities"] = news["title_entities"].fillna("[]") 46 | news["abstract_entities"] = news["abstract_entities"].fillna("[]") 47 | news["title_entities"] = news["title_entities"] \ 48 | .map(lambda x: "^".join([v["WikidataId"] for v in eval(x) if v["WikidataId"] in entity_dict])) 49 | news["abstract_entities"] = news["abstract_entities"] \ 50 | .map(lambda x: "^".join([v["WikidataId"] for v in eval(x) if v["WikidataId"] in entity_dict])) 51 | news.to_csv("news_corpus.tsv", sep="\t", index=False) 52 | print(news.head()) 53 | 54 | entity_set = set(list(news["title_entities"].values) + list(news["abstract_entities"].values)) 55 | entity_keys = [] 56 | entity_values = [] 57 | for k, v in entity_dict.items(): 58 | if k in entity_set: 59 | entity_keys.append(k) 60 | entity_values.append(v) 61 | with h5py.File("entity_emb_dim100.h5", 'w') as hf: 62 | hf.create_dataset("key", data=np.array(entity_keys, dtype=h5py.special_dtype(vlen=str))) 63 | hf.create_dataset("value", data=np.array(entity_values)) 64 | 65 | news2cat = dict(zip(news["news_id"], news["cat"])) 66 | news2subcat = dict(zip(news["news_id"], news["sub_cat"])) 67 | news2title_entities = dict(zip(news["news_id"], news["title_entities"])) 68 | news2abstract_entities = dict(zip(news["news_id"], news["abstract_entities"])) 69 | used_feat = [ 70 | "imp_id", 71 | "click", 72 | "hour", 73 | "user_id", 74 | "news_id", 75 | "cat", 76 | "sub_cat", 77 | "title_entities", 78 | "abstract_entities", 79 | "news_his", 80 | "cat_his", 81 | "subcat_his" 82 | ] 83 | 84 | def join_data(in_path, out_path): 85 | df = pd.read_csv(in_path, sep="\t", header=None, 86 | names=["imp_id", "user_id", "timestamp", "news_his", "impression_list"]) 87 | df["news_his"] = df["news_his"].fillna("").map(lambda x: \ 88 | "^".join([v for v in x.split() if v in news2cat][-MAX_SEQ_LEN:])) 89 | df = df.drop('impression_list', axis=1).join( \ 90 | df['impression_list'].str.split(' ', expand=True).stack(). 
\ 91 | reset_index(level=1, drop=True).rename('impression')) 92 | df["hour"] = df["timestamp"].map(lambda t: t.split(" ")[1].split(":")[0] + t.split(" ")[-1]) 93 | try: 94 | df[["news_id", "click"]] = df["impression"].str.split("-", expand=True) 95 | except: 96 | df["news_id"] = df["impression"] 97 | df["click"] = [-1] * len(df["impression"]) 98 | df = pd.merge(df, news, how="left", on="news_id") 99 | df["cat_his"] = df["news_his"].map(lambda x: "^".join([news2cat.get(i, "") for i in x.split("^")])) 100 | df["subcat_his"] = df["news_his"].map(lambda x: "^".join([news2subcat.get(i, "") for i in x.split("^")])) 101 | df[used_feat].to_csv(out_path, index=False) 102 | 103 | print("Preprocess train data...") 104 | join_data(os.path.join(train_path, "behaviors.tsv"), "train.csv") 105 | print("Preprocess dev data...") 106 | join_data(os.path.join(dev_path, "behaviors.tsv"), "valid.csv") 107 | print("Preprocess test data...") 108 | join_data(os.path.join(test_path, "behaviors.tsv"), "test.csv") 109 | 110 | # Check md5sum for correctness 111 | assert("9007e6b9127ff71bf146b7cfc1dc842d" == hashlib.md5(open('news_corpus.tsv', 'r').read().encode('utf-8')).hexdigest()) 112 | assert("955b80b959fb15076a0568d82da6bf05" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 113 | assert("4942111ca7ba975b5f5dae8e2c54f1f0" == hashlib.md5(open('valid.csv', 'r').read().encode('utf-8')).hexdigest()) 114 | assert("cbd5e69d573dc471d9f9ae91f2b5690f" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 115 | print("Reproducing data succeeded!") 116 | -------------------------------------------------------------------------------- /MIND/MIND_small_x1/README.md: -------------------------------------------------------------------------------- 1 | # MIND_small_x1 2 | 3 | + **Dataset description:** 4 | 5 | MIND is a large-scale Microsoft news dataset for news recommendation. It was collected from anonymized behavior logs of Microsoft News website. MIND totally contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression. The MIND-small version of the dataset is made by randomly sampling 50,000 users and their behavior logs from the MIND dataset. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | MIND_small_x1 | 8,584,442 | 5,843,444 | 2,740,998 | | 12 | 13 | + **Source:** https://msnews.github.io/index.html 14 | + **Download:** https://huggingface.co/datasets/reczoo/MIND_small_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, Ming Zhou. [MIND: A Large-scale Dataset for News Recommendation](https://aclanthology.org/2020.acl-main.331). In ACL 2020. 19 | - Jian Li, Jieming Zhu, Qiwei Bi, Guohao Cai, Lifeng Shang, Zhenhua Dong, Xin Jiang, Qun Liu. [MINER: Multi-Interest Matching Network for News Recommendation](https://aclanthology.org/2022.findings-acl.29.pdf). In ACL 2022. 20 | - Qijiong Liu, Jieming Zhu, Quanyu Dai, Xiaoming Wu. 
[Boosting Deep CTR Prediction with a Plug-and-Play Pre-trainer for News Recommendation](https://aclanthology.org/2022.coling-1.249.pdf). In COLING 2022. 21 | 22 | + **Check the md5sum for data integrity:** 23 | ```bash 24 | $ md5sum train.csv valid.csv news_corpus.tsv 25 | 51ac2a4514754078ad05b1028a4c7b9a train.csv 26 | 691961eb780f97b68606e4decebf2296 valid.csv 27 | 51e0b3ae69deab32c7c3f6590f0dab72 news_corpus.tsv 28 | ``` 29 | -------------------------------------------------------------------------------- /MIND/MIND_small_x1/convert_MIND_small_x1.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import h5py 4 | import os 5 | import hashlib 6 | 7 | MAX_SEQ_LEN = 50 8 | train_path = "./MINDsmall_train/" 9 | dev_path = "./MINDsmall_dev/" 10 | 11 | print("Preprocess news profile...") 12 | train_wiki_file = os.path.join(train_path, "entity_embedding.vec") 13 | dev_wiki_file = os.path.join(dev_path, "entity_embedding.vec") 14 | entity_dict = dict() 15 | with open(train_wiki_file, "r") as fin: 16 | for line in fin: 17 | l = line.strip().split("\t") 18 | entity_dict[l[0]] = [float(v) for v in l[1:]] 19 | with open(dev_wiki_file, "r") as fin: 20 | for line in fin: 21 | l = line.strip().split("\t") 22 | entity_dict[l[0]] = [float(v) for v in l[1:]] 23 | 24 | train_news_file = os.path.join(train_path, "news.tsv") 25 | train_news = pd.read_csv(train_news_file, sep="\t", header=None, 26 | names=["news_id", "cat", "sub_cat", "title", "abstract", "url", 27 | "title_entities", "abstract_entities"]) 28 | dev_news_file = os.path.join(dev_path, "news.tsv") 29 | dev_news = pd.read_csv(dev_news_file, sep="\t", header=None, 30 | names=["news_id", "cat", "sub_cat", "title", "abstract", "url", 31 | "title_entities", "abstract_entities"]) 32 | news = pd.concat([train_news, dev_news], axis=0) 33 | news = news.drop_duplicates(subset=['news_id']).reset_index(drop=True) 34 | news = news[["news_id", "cat", "sub_cat", "title_entities", "abstract_entities", "title", "abstract"]] 35 | news["title_entities"] = news["title_entities"].fillna("[]") 36 | news["abstract_entities"] = news["abstract_entities"].fillna("[]") 37 | news["title_entities"] = news["title_entities"] \ 38 | .map(lambda x: "^".join([v["WikidataId"] for v in eval(x) if v["WikidataId"] in entity_dict])) 39 | news["abstract_entities"] = news["abstract_entities"] \ 40 | .map(lambda x: "^".join([v["WikidataId"] for v in eval(x) if v["WikidataId"] in entity_dict])) 41 | news.to_csv("news_corpus.tsv", sep="\t", index=False) 42 | print(news.head()) 43 | 44 | entity_set = set(list(news["title_entities"].values) + list(news["abstract_entities"].values)) 45 | entity_keys = [] 46 | entity_values = [] 47 | for k, v in entity_dict.items(): 48 | if k in entity_set: 49 | entity_keys.append(k) 50 | entity_values.append(v) 51 | with h5py.File("entity_emb_dim100.h5", 'w') as hf: 52 | hf.create_dataset("key", data=np.array(entity_keys, dtype=h5py.special_dtype(vlen=str))) 53 | hf.create_dataset("value", data=np.array(entity_values)) 54 | 55 | news2cat = dict(zip(news["news_id"], news["cat"])) 56 | news2subcat = dict(zip(news["news_id"], news["sub_cat"])) 57 | news2title_entities = dict(zip(news["news_id"], news["title_entities"])) 58 | news2abstract_entities = dict(zip(news["news_id"], news["abstract_entities"])) 59 | used_feat = [ 60 | "imp_id", 61 | "click", 62 | "hour", 63 | "user_id", 64 | "news_id", 65 | "cat", 66 | "sub_cat", 67 | "title_entities", 68 | "abstract_entities", 69 | 
"news_his", 70 | "cat_his", 71 | "subcat_his" 72 | ] 73 | 74 | def join_data(in_path, out_path): 75 | df = pd.read_csv(in_path, sep="\t", header=None, 76 | names=["imp_id", "user_id", "timestamp", "news_his", "impression_list"]) 77 | df["news_his"] = df["news_his"].fillna("").map(lambda x: \ 78 | "^".join([v for v in x.split() if v in news2cat][-MAX_SEQ_LEN:])) 79 | df = df.drop('impression_list', axis=1).join( \ 80 | df['impression_list'].str.split(' ', expand=True).stack(). \ 81 | reset_index(level=1, drop=True).rename('impression')) 82 | df["hour"] = df["timestamp"].map(lambda t: t.split(" ")[1].split(":")[0] + t.split(" ")[-1]) 83 | df[["news_id", "click"]] = df["impression"].str.split("-", expand=True) 84 | df = pd.merge(df, news, how="left", on="news_id") 85 | df["cat_his"] = df["news_his"].map(lambda x: "^".join([news2cat.get(i, "") for i in x.split("^")])) 86 | df["subcat_his"] = df["news_his"].map(lambda x: "^".join([news2subcat.get(i, "") for i in x.split("^")])) 87 | df[used_feat].to_csv(out_path, index=False) 88 | 89 | print("Preprocess train data...") 90 | join_data(os.path.join(train_path, "behaviors.tsv"), "train.csv") 91 | print("Preprocess dev data...") 92 | join_data(os.path.join(dev_path, "behaviors.tsv"), "valid.csv") 93 | 94 | # Check md5sum for correctness 95 | assert("fe0ec15c20424535b5e5471a7f32d61e" == hashlib.md5(open('news_corpus.tsv', 'r').read().encode('utf-8')).hexdigest()) 96 | assert("18b16481ad421986de3f80ce3295f5ed" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 97 | assert("1223a2a14fa65e880bc8158836da3894" == hashlib.md5(open('valid.csv', 'r').read().encode('utf-8')).hexdigest()) 98 | print("Reproducing data succeeded!") 99 | -------------------------------------------------------------------------------- /MicroVideo/MicroVideo1.7M_x1/README.md: -------------------------------------------------------------------------------- 1 | # MicroVideo1.7M_x1 2 | 3 | + **Dataset description:** 4 | 5 | This is a micro-video dataset provided by the [THACIL work](https://dl.acm.org/doi/10.1145/3240508.3240617), which contains 12,737,617 interactions that 10,986 users have made on 1,704,880 micro-videos. The features include user id, item id, category, and the extracted image embedding vectors of cover images of micro-videos. Note that the dataset has been split such that the items in the test set are all new micro-videos, which have no overlap with the items in the training set. This helps validate the generability of multimodal embedding vectors for new micro-videos. In this setting, we set the maximal length of user behavior sequence to 100. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | MicroVideo1.7M_x1 | 12,737,617 | 8,970,309 | | 3,767,308 | 12 | 13 | + **Source:** https://github.com/Ocxs/THACIL 14 | + **Download:** https://huggingface.co/datasets/reczoo/MicroVideo1.7M_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Xusong Chen, Dong Liu, Zheng-Jun Zha, Wengang Zhou, Zhiwei Xiong, Yan Li. [Temporal Hierarchical Attention at Category- and Item-Level for Micro-Video Click-Through Prediction](https://dl.acm.org/doi/10.1145/3240508.3240617). In MM 2018. 19 | - Jieming Zhu, Guohao Cai, Junjie Huang, Zhenhua Dong, Ruiming Tang, Weinan Zhang. 
[ReLoop2: Building Self-Adaptive Recommendation Models via Responsive Error Compensation Loop](https://arxiv.org/abs/2306.08808). In KDD 2023. 20 | 21 | + **Check the md5sum for data integrity:** 22 | ```bash 23 | $ md5sum train.csv test.csv 24 | 936e6612714c887e76226a60829b4e0a train.csv 25 | 9417a18304fb62411ac27c26c5e0de56 test.csv 26 | ``` 27 | -------------------------------------------------------------------------------- /MicroVideo/MicroVideo1.7M_x1/convert_microvideo1.7m_x1.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import h5py 3 | import sys 4 | from collections import defaultdict 5 | import numpy as np 6 | import hashlib 7 | from sklearn.decomposition import PCA 8 | 9 | 10 | 11 | sequence_maxlen = 128 12 | 13 | train = pd.read_csv("train_data.csv", dtype=object) 14 | print("train.shape", train.shape) 15 | # print(train.columns) 16 | train = train.sort_values(by=["user_id", "timestamp"]).reset_index(drop=True) 17 | item_ID = sorted(list(train["item_id"].unique())) 18 | user_ID = sorted(list(train["user_id"].unique())) 19 | print("Number of users: ", len(user_ID)) 20 | print("Number of items: ", len(item_ID)) 21 | 22 | clicked_items_queue = defaultdict(list) 23 | clicked_categories_queue = defaultdict(list) 24 | clicked_items_list = [] 25 | clicked_categories_list = [] 26 | click_time = "" 27 | for idx, row in train.iterrows(): 28 | if idx % 10000 == 0: 29 | print("Processing {} lines".format(idx)) 30 | click_time = row['timestamp'] 31 | user_id = row["user_id"] 32 | item_id = row["item_id"] 33 | cate_id = row["cate_id"] 34 | click = row['is_click'] 35 | click_history = clicked_items_queue[user_id] 36 | if len(click_history) > sequence_maxlen: 37 | click_history = click_history[-sequence_maxlen:] 38 | clicked_items_queue[user_id] = click_history 39 | clicked_items_list.append("^".join(click_history)) 40 | category_history = clicked_categories_queue[user_id] 41 | if len(category_history) > sequence_maxlen: 42 | category_history = category_history[-sequence_maxlen:] 43 | clicked_categories_queue[user_id] = category_history 44 | clicked_categories_list.append("^".join(category_history)) 45 | if click == "1": 46 | clicked_items_queue[user_id].append(item_id) 47 | clicked_categories_queue[user_id].append(cate_id) 48 | 49 | train["clicked_items"] = clicked_items_list 50 | train["clicked_categories"] = clicked_categories_list 51 | train.to_csv("train.csv", index=False) 52 | 53 | test = pd.read_csv("test_data.csv", dtype=object) 54 | print("test.shape", test.shape) 55 | test = test.sort_values(by=["user_id", "timestamp"]).reset_index(drop=True) 56 | test["item_id"] = test["item_id"].map(lambda x: str(len(item_ID) + int(x))) # re-map item ids of test 57 | test_item_ID = sorted(list(test["item_id"].unique())) 58 | test_user_ID = sorted(list(train["user_id"].unique())) 59 | print("Number of users: ", len(test_user_ID)) 60 | print("Number of items: ", len(test_item_ID)) 61 | test["clicked_items"] = test["user_id"].map(lambda x: "^".join(clicked_items_queue[x][-sequence_maxlen:])) 62 | test["clicked_categories"] = test["user_id"].map(lambda x: "^".join(clicked_categories_queue[x][-sequence_maxlen:])) 63 | test.to_csv("test.csv", index=False) 64 | 65 | # Embedding dimension reduction via PCA 66 | train_emb = np.load("train_cover_image_feature.npy") 67 | test_emb = np.load("test_cover_image_feature.npy") 68 | item_emb = np.vstack([train_emb, test_emb]) 69 | pca = PCA(n_components=64) 70 | item_emb = pca.fit_transform(item_emb) 
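# Note (illustrative): `item_emb` stacks the reduced train embeddings first and the test
# embeddings after them. Because the test item ids were re-mapped above with an offset of
# len(item_ID), row i of `item_emb` is assumed to align with item id i when the embeddings
# are written out with keys range(len(item_emb)) below.
# Optionally, check how much variance the 64 retained PCA components explain:
# print("PCA explained variance:", pca.explained_variance_ratio_.sum())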
71 | print("item_emb.shape", item_emb.shape) 72 | 73 | with h5py.File("item_image_emb_dim64.h5", 'w') as hf: 74 | hf.create_dataset("key", data=list(range(len(item_emb)))) 75 | hf.create_dataset("value", data=item_emb) 76 | 77 | # Check md5sum for correctness 78 | assert("936e6612714c887e76226a60829b4e0a" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 79 | assert("9417a18304fb62411ac27c26c5e0de56" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 80 | 81 | print("Reproducing data succeeded!") 82 | -------------------------------------------------------------------------------- /MovieLens/Movielens1M_m1/README.md: -------------------------------------------------------------------------------- 1 | # Movielens1M_m1 2 | 3 | + **Dataset description:** 4 | 5 | The MovieLens-1M dataset contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users. We follow the LCF work to split and preprocess the data into training, validation, and test sets, respectively. 6 | 7 | + **Data format:** 8 | 9 | Each user corresponds to a list of interacted items: [[item1, item2], [item3, item4, item5], ...] 10 | 11 | + **Source:** https://grouplens.org/datasets/movielens/1m/ 12 | + **Download:** https://huggingface.co/datasets/reczoo/Movielens1M_m1/tree/main 13 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 14 | 15 | + **Used by papers:** 16 | - Wenhui Yu, Zheng Qin. [Graph Convolutional Network for Recommendation with Low-pass Collaborative Filters](https://arxiv.org/abs/2006.15516). In ICML 2020. 17 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filtering](https://arxiv.org/abs/2109.12613). In CIKM 2021. 18 | - Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. [UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation](https://arxiv.org/abs/2110.15114). In CIKM 2021. 19 | 20 | + **Check the md5sum for data integrity:** 21 | ```bash 22 | $ md5sum *.json 23 | cdd3ad819512cb87dad2f098c8437df2 test_data.json 24 | 4229bc5369f943918103daf7fd92e920 train_data.json 25 | 60be3b377d39806f80a43e37c94449f6 validation_data.json 26 | ``` 27 | -------------------------------------------------------------------------------- /MovieLens/MovielensLatest_x1/README.md: -------------------------------------------------------------------------------- 1 | # MovielensLatest_x1 2 | 3 | + **Dataset description:** 4 | 5 | The MovieLens dataset consists of users' tagging records on movies. The task is formulated as personalized tag recommendation with each tagging record (user_id, item_id, tag_id) as an data instance. The target value denotes whether the user has assigned a particular tag to the movie. Following the [AFN](https://ojs.aaai.org/index.php/AAAI/article/view/5768) work, we randomly split the data into 7:2:1 as the training set, validation set, and test set, respectively. 
6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | MovielensLatest_x1 | 2,006,859 | 1,404,801 | 401,373 | 200,686 | 12 | 13 | + **Source:** https://grouplens.org/datasets/movielens 14 | + **Download:** https://huggingface.co/datasets/reczoo/MovielensLatest_x1/tree/main 15 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 16 | 17 | + **Used by papers:** 18 | - Weiyu Cheng, Yanyan Shen, Linpeng Huang. [Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions](https://ojs.aaai.org/index.php/AAAI/article/view/5768). In AAAI 2020. 19 | - Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, Zhenhua Dong. [FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction](https://arxiv.org/abs/2304.00902). In AAAI 2023. 20 | - Jieming Zhu, Qinglin Jia, Guohao Cai, Quanyu Dai, Jingjie Li, Zhenhua Dong, Ruiming Tang, Rui Zhang. [FINAL: Factorized Interaction Layer for CTR Prediction](https://dl.acm.org/doi/10.1145/3539618.3591988). In SIGIR 2023. 21 | 22 | + **Check the md5sum for data integrity:** 23 | ```bash 24 | $ md5sum train.csv valid.csv test.csv 25 | efc8bceeaa0e895d566470fc99f3f271 train.csv 26 | e1930223a5026e910ed5a48687de8af1 valid.csv 27 | 54e8c6baff2e059fe067fb9b69e692d0 test.csv 28 | ``` 29 | -------------------------------------------------------------------------------- /MovieLens/MovielensLatest_x1/convert_movielenslatest_x1.py: -------------------------------------------------------------------------------- 1 | # Convert libsvm data from AFN [AAAI'2020] to csv format 2 | 3 | import pandas as pd 4 | from pathlib import Path 5 | import gc 6 | 7 | headers = ["label", "user_id", "item_id", "tag_id"] 8 | 9 | data_files = ["train.libsvm", "valid.libsvm", "test.libsvm"] 10 | for f in data_files: 11 | df = pd.read_csv(f, sep=" ", names=headers) 12 | for col in headers[1:]: 13 | df[col] = df[col].apply(lambda x: x.split(':')[0]) 14 | df.to_csv(Path(f).stem + ".csv", index=False) 15 | del df 16 | gc.collect() -------------------------------------------------------------------------------- /MovieLens/README.md: -------------------------------------------------------------------------------- 1 | # MovieLens 2 | 3 | + [MovielensLatest_x1](./MovielensLatest_x1) 4 | + [Movielens1M_m1](./Movielens1M_m1) 5 | 6 | The MovieLens datasets are collected by GroupLens Research from the MovieLens web site (https://movielens.org) where movie rating data are made available. The datasets have been widely used in various research on recommender systems. 
7 | 8 | MovieLens datasets https://grouplens.org/datasets/movielens 9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # RecZoo Datasets 2 | 3 | + [CTR Prediction](#ctr-prediction) 4 | + [Matching](#matching) 5 | + [Reranking](#reranking) 6 | + [Multimodal](#multimodal) 7 | + [Multitask](#multitask) 8 | + [Multidomain](#multidomain) 9 | 10 | 11 | ## CTR Prediction 12 | 13 | | Dataset | Dataset ID | Domain | Use Cases | Download | Leaderboard | 14 | |:-----------|:--------------------|:------------------------|:-------------------- |:---------------------:|:---------------------:| 15 | | [Criteo](https://github.com/reczoo/Datasets/tree/main/Criteo) | [Criteo_x1](https://github.com/reczoo/Datasets/tree/main/Criteo/Criteo_x1) | Ads | Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Criteo_x1/resolve/main/Criteo_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/criteo_x1.html) | 16 | | | [Criteo_x2](https://github.com/reczoo/Datasets/tree/main/Criteo/Criteo_x2) | Ads | Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Criteo_x2/resolve/main/Criteo_x2.zip?download=true) | 17 | | | [Criteo_x4](https://github.com/reczoo/Datasets/tree/main/Criteo/Criteo_x4) | Ads |Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Criteo_x4/resolve/main/Criteo_x4.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/criteo_x4.html) | 18 | | [Avazu](https://github.com/reczoo/Datasets/tree/main/Avazu) | [Avazu_x1](https://github.com/reczoo/Datasets/tree/main/Avazu/Avazu_x1) | Ads |Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Avazu_x1/resolve/main/Avazu_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/avazu_x1.html) | 19 | | | [Avazu_x2](https://github.com/reczoo/Datasets/tree/main/Avazu/Avazu_x2) | Ads |Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Avazu_x2/resolve/main/Avazu_x2.zip?download=true) | 20 | | | [Avazu_x4](https://github.com/reczoo/Datasets/tree/main/Avazu/Avazu_x4) | Ads |Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Avazu_x4/resolve/main/Avazu_x4.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/avazu_x4.html) | 21 | | [KKBox](https://github.com/reczoo/Datasets/tree/main/KKBox) | [KKBox_x1](https://github.com/reczoo/Datasets/tree/main/KKBox/KKBox_x1) | Music | Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/KKBox_x1/resolve/main/KKBox_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/kkbox_x1.html) | 22 | | [Frappe](https://github.com/reczoo/Datasets/tree/main/Frappe) | [Frappe_x1](https://github.com/reczoo/Datasets/tree/main/Frappe/Frappe_x1) | Apps | Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/Frappe_x1/resolve/main/Frappe_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/frappe_x1.html) | 23 | | [MovieLens](https://github.com/reczoo/Datasets/tree/main/MovieLens) | [MovielensLatest_x1](https://github.com/reczoo/Datasets/tree/main/MovieLens/MovielensLatest_x1) | Movies | Feature interactions | 
[:link:](https://huggingface.co/datasets/reczoo/MovielensLatest_x1/resolve/main/MovielensLatest_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/movielenslatest_x1.html) | 24 | | [Taobao](https://github.com/reczoo/Datasets/tree/main/Taobao) | [TaobaoAd_x1](https://github.com/reczoo/Datasets/tree/main/Taobao/TaobaoAd_x1) | Ads | Sequential | [:link:](https://huggingface.co/datasets/reczoo/TaobaoAd_x1/resolve/main/TaobaoAd_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/taobaoad_x1.html) | 25 | | [Amazon](https://github.com/reczoo/Datasets/tree/main/Amazon) | [AmazonElectronics_x1](https://github.com/reczoo/Datasets/tree/main/Amazon/AmazonElectronics_x1) | Electronics | Sequential | [:link:](https://huggingface.co/datasets/reczoo/AmazonElectronics_x1/resolve/main/AmazonElectronics_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/amazonelectronics_x1.html) | 26 | | [iPinYou](https://github.com/reczoo/Datasets/tree/main/iPinYou) | [iPinYou_x1](https://github.com/reczoo/Datasets/tree/main/iPinYou/iPinYou_x1) | Ads | Feature interactions | [:link:](https://huggingface.co/datasets/reczoo/iPinYou_x1/resolve/main/iPinYou_x1.zip?download=true) | 27 | | [MicroVideo](https://github.com/reczoo/Datasets/tree/main/MicroVideo) | [MicroVideo1.7M_x1](https://github.com/reczoo/Datasets/tree/main/MicroVideo/MicroVideo1.7M_x1) | MicroVideo | Sequential, multimodal | [:link:](https://huggingface.co/datasets/reczoo/MicroVideo1.7M_x1/resolve/main/MicroVideo1.7M_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/microvideo1.7m_x1.html) | 28 | | [KuaiShou](https://github.com/reczoo/Datasets/tree/main/KuaiShou) | [KuaiVideo_x1](https://github.com/reczoo/Datasets/tree/main/KuaiShou/KuaiVideo_x1) | MicroVideo | Sequential, multimodal | [:link:](https://huggingface.co/datasets/reczoo/KuaiVideo_x1/resolve/main/KuaiVideo_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/kuaivideo_x1.html) | 29 | | [MIND](https://github.com/reczoo/Datasets/tree/main/MIND) | [MIND_small_x1](https://github.com/reczoo/Datasets/tree/main/MIND/MIND_small_x1) | News | Sequential, pretraining | [:link:](https://huggingface.co/datasets/reczoo/MIND_small_x1/resolve/main/MIND_small_x1.zip?download=true) | 30 | 31 | 32 | ## Matching 33 | 34 | | Dataset | Dataset ID | Domain | Use Cases | Download | Leaderboard | 35 | |:-------------------|:----------------------|:-----------------|:-------------|:----------------------:|:----------------------:| 36 | | [Amazon](https://github.com/reczoo/Datasets/tree/main/Amazon) | [AmazonBooks_m1](https://github.com/reczoo/Datasets/tree/main/Amazon/AmazonBooks_m1) | Books | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/AmazonBooks_m1/tree/main) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/Matching/leaderboard/amazonbooks_m1.html) | 37 | | | [AmazonCDs_m1](https://github.com/reczoo/Datasets/tree/main/Amazon/AmazonCDs_m1) | CDs | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/AmazonCDs_m1/tree/main) | 38 | | | [AmazonMovies_m1](https://github.com/reczoo/Datasets/tree/main/Amazon/AmazonMovies_m1) | Movies | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/AmazonMovies_m1/tree/main) | 39 | | | [AmazonBeauty_m1](https://github.com/reczoo/Datasets/tree/main/Amazon/AmazonBeauty_m1) | Beauty | CF, GNN | 
[:link:](https://huggingface.co/datasets/reczoo/AmazonBeauty_m1/tree/main) | 40 | | | [AmazonElectronics_m1](https://github.com/reczoo/Datasets/tree/main/Amazon/AmazonElectronics_m1) | Electronics | CF | [:link:](https://huggingface.co/datasets/reczoo/AmazonElectronics_m1/tree/main) | 41 | | [MovieLens](https://github.com/reczoo/Datasets/tree/main/MovieLens) | [MovieLens1M_m1](https://github.com/reczoo/Datasets/tree/main/Amazon/MovieLens1M_m1) | Movies | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/MovieLens1M_m1/tree/main) | 42 | | [Yelp](https://github.com/reczoo/Datasets/tree/main/Yelp) | [Yelp18_m1](https://github.com/reczoo/Datasets/tree/main/Yelp/Yelp18_m1) | Restaurants | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/Yelp18_m1/tree/main) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/Matching/leaderboard/yelp18_m1.html) | 43 | | [Gowalla](https://github.com/reczoo/Datasets/tree/main/Gowalla) | [Gowalla_m1](https://github.com/reczoo/Datasets/tree/main/Gowalla/Gowalla_m1) | POIs | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/Gowalla_m1/tree/main) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/Matching/leaderboard/gowalla_m1.html) | 44 | | [CiteULike](https://github.com/reczoo/Datasets/tree/main/CiteULike) | [CiteUlikeA_m1](https://github.com/reczoo/Datasets/tree/main/CiteULike/CiteUlikeA_m1) | Citation | CF, GNN | [:link:](https://huggingface.co/datasets/reczoo/CiteUlikeA_m1/tree/main) | 45 | 46 | 47 | ## Reranking 48 | TODO 49 | 50 | ## Multimodal 51 | 52 | | Dataset | Dataset ID | Domain | Use Cases | Download | Leaderboard | 53 | |:-----------|:--------------------|:------------------------|:-------------------- |:---------------------:|:---------------------:| 54 | | [MicroVideo](https://github.com/reczoo/Datasets/tree/main/MicroVideo) | [MicroVideo1.7M_x1](https://github.com/reczoo/Datasets/tree/main/MicroVideo/MicroVideo1.7M_x1) | MicroVideo | Sequential, multimodal | [:link:](https://huggingface.co/datasets/reczoo/MicroVideo1.7M_x1/resolve/main/MicroVideo1.7M_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/microvideo1.7m_x1.html) | 55 | | [KuaiShou](https://github.com/reczoo/Datasets/tree/main/KuaiShou) | [KuaiVideo_x1](https://github.com/reczoo/Datasets/tree/main/KuaiShou/KuaiVideo_x1) | MicroVideo | Sequential, multimodal | [:link:](https://huggingface.co/datasets/reczoo/KuaiVideo_x1/resolve/main/KuaiVideo_x1.zip?download=true) | [:arrow_upper_right:](https://openbenchmark.github.io/BARS/CTR/leaderboard/kuaivideo_x1.html) | 56 | 57 | 58 | ## Multitask 59 | TODO 60 | 61 | ## Multidomain 62 | TODO 63 | 64 | -------------------------------------------------------------------------------- /Taobao/TaobaoAd_x1/README.md: -------------------------------------------------------------------------------- 1 | # TaobaoAd_x1 2 | 3 | + **Dataset description:** 4 | 5 | Taobao is a dataset provided by Alibaba, which contains 8 days of ad click-through data (26 million records) that are randomly sampled from 1140000 users. By default, the first 7 days (i.e., 20170506-20170512) of samples are used as training samples, and the last day's samples (i.e., 20170513) are used as test samples. Meanwhile, the dataset also covers the shopping behavior of all users in the recent 22 days, including totally seven hundred million records. We follow the preprocessing steps that have been applied to [reproducing the DMR work](https://aistudio.baidu.com/aistudio/projectdetail/1805731). 
We note that a small portion (~5%) of the samples has been dropped during preprocessing due to missing user or item profiles. In this setting, we filter infrequent categorical features with the threshold min_category_count=10. We further set the maximal length of user behavior sequence to 50. 6 | 7 | The dataset statistics are summarized as follows: 8 | 9 | | Dataset Split | Total | #Train | #Validation | #Test | 10 | | :--------: | :-----: |:-----: | :----------: | :----: | 11 | | TaobaoAd_x1 | 25,029,426 | 21,929,911 | | 3,099,515 | 12 | 13 | + **Data format:** 14 | + user: User ID (int); 15 | + time_stamp: time stamp (Bigint, 1494032110 stands for 2017-05-06 08:55:10); 16 | + adgroup_id: adgroup ID (int); 17 | + pid: scenario; 18 | + noclk: 1 for not click, 0 for click; 19 | + clk: 1 for click, 0 for not click; 20 | 21 | ad_feature: 22 | + adgroup_id: Ad ID (int); 23 | + cate_id: category ID; 24 | + campaign_id: campaign ID; 25 | + brand: brand ID; 26 | + customer_id: Advertiser ID; 27 | + price: the price of the item 28 | 29 | user_profile: 30 | + userid: user ID; 31 | + cms_segid: Micro group ID; 32 | + cms_group_id: cms group_id; 33 | + final_gender_code: gender, 1 for male, 2 for female 34 | + age_level: age_level 35 | + pvalue_level: Consumption grade, 1: low, 2: mid, 3: high 36 | + shopping_level: Shopping depth, 1: shallow user, 2: moderate user, 3: deep user 37 | + occupation: whether the user is a college student, 1: yes, 0: no 38 | + new_user_class_level: City level 39 | 40 | raw_behavior_log: 41 | + nick: User ID (int); 42 | + time_stamp: time stamp (Bigint, 1494032110 stands for 2017-05-06 08:55:10); 43 | + btag: type of behavior, including ipv/cart/fav/buy; 44 | + cate: category ID (int); 45 | + brand: brand ID (int); 46 | 47 | + **Source:** https://tianchi.aliyun.com/dataset/dataDetail?dataId=56 48 | + **Download:** https://huggingface.co/datasets/reczoo/TaobaoAd_x1/tree/main 49 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 50 | 51 | + **Used by papers:** 52 | - Ze Lyu, Yu Dong, Chengfu Huo, Weijun Ren. [Deep Match to Rank Model for Personalized Click-Through Rate Prediction](https://ojs.aaai.org/index.php/AAAI/article/view/5346). In AAAI 2020. 53 | 54 | + **Check the md5sum for data integrity:** 55 | ```bash 56 | $ md5sum train.csv test.csv 57 | eaabfc8629f23519b04593e26c7522fc train.csv 58 | f5ae6197e52385496d46e2867c1c8da1 test.csv 59 | ``` 60 | -------------------------------------------------------------------------------- /Taobao/TaobaoAd_x1/convert_taobaoad_x1.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | The raw dataset is available at https://tianchi.aliyun.com/dataset/56 4 | The preprocessed dataset is used by the following work: 5 | Lyu et al., Deep Match to Rank Model for Personalized Click-Through Rate Prediction, AAAI 2020. 6 | The preprocessing steps follow the scripts at https://aistudio.baidu.com/aistudio/projectdetail/1805731 7 | The required data `dataset_full.zip` can be downloaded at https://aistudio.baidu.com/aistudio/datasetdetail/81892 8 | However, we note that the ID mapping of categorical features in `dataset_full.zip` has a known bug. 9 | Please refer to https://github.com/PaddlePaddle/PaddleRec/issues/821 10 | Thus, we suggest re-mapping the categorical IDs to new indices when using this dataset.
11 | """ 12 | 13 | import pandas as pd 14 | import hashlib 15 | 16 | 17 | train_path = "./work/train_sorted.csv" 18 | test_path = "./work/test.csv" 19 | 20 | data_parts = ["train", "test"] 21 | for part in data_parts: 22 | data_df = pd.read_csv(eval(part + '_path'), header=None, dtype=object) 23 | data_df.fillna("0", inplace=True) 24 | part_df = pd.DataFrame() 25 | part_df["clk"] = data_df.iloc[:, 266] 26 | part_df["btag_his"] = ["^".join(filter(lambda k: k != "0", x.tolist())) for x in data_df.iloc[:, 0:50].values] 27 | part_df["cate_his"] = ["^".join(filter(lambda k: k != "0", x.tolist())) for x in data_df.iloc[:, 50:100].values] 28 | part_df["brand_his"] = ["^".join(filter(lambda k: k != "0", x.tolist())) for x in data_df.iloc[:, 100:150].values] 29 | part_df["userid"] = data_df.iloc[:, 250] 30 | part_df["cms_segid"] = data_df.iloc[:, 251] 31 | part_df["cms_group_id"] = data_df.iloc[:, 252] 32 | part_df["final_gender_code"] = data_df.iloc[:, 253] 33 | part_df["age_level"] = data_df.iloc[:, 254] 34 | part_df["pvalue_level"] = data_df.iloc[:, 255] 35 | part_df["shopping_level"] = data_df.iloc[:, 256] 36 | part_df["occupation"] = data_df.iloc[:, 257] 37 | part_df["new_user_class_level"] = data_df.iloc[:, 258] 38 | part_df["adgroup_id"] = data_df.iloc[:, 259] 39 | part_df["cate_id"] = data_df.iloc[:, 260] 40 | part_df["campaign_id"] = data_df.iloc[:, 261] 41 | part_df["customer"] = data_df.iloc[:, 262] 42 | part_df["brand"] = data_df.iloc[:, 263] 43 | part_df["price"] = data_df.iloc[:, 264] 44 | part_df["pid"] = data_df.iloc[:, 265] 45 | part_df["btag"] = [1] * len(data_df) 46 | part_df.to_csv(part + ".csv", index=False) 47 | 48 | # Check md5sum for correctness 49 | assert("eaabfc8629f23519b04593e26c7522fc" == hashlib.md5(open('train.csv', 'r').read().encode('utf-8')).hexdigest()) 50 | assert("f5ae6197e52385496d46e2867c1c8da1" == hashlib.md5(open('test.csv', 'r').read().encode('utf-8')).hexdigest()) 51 | 52 | print("Reproducing data succeeded!") 53 | -------------------------------------------------------------------------------- /Yelp/Yelp18_m1/README.md: -------------------------------------------------------------------------------- 1 | # Yelp18_m1 2 | 3 | + **Dataset description:** 4 | 5 | The data statistics are summarized as follows: 6 | 7 | | Dataset ID | #Users | #Items | #Interactions | #Train | #Test | Density | 8 | | :-------: | :----: | :----: | :-----------: | :-------: | :-----: | :-----: | 9 | | Yelp18_m1 | 31,668 | 38,048 | 1,561,406 | 1,237,259 | 324,147 | 0.00130 | 10 | 11 | 12 | + **Data format:** 13 | user_id item1 item2 ... 14 | 15 | + **Source:** https://www.yelp.com/dataset 16 | + **Download:** https://huggingface.co/datasets/reczoo/Yelp18_m1/tree/main 17 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 18 | 19 | + **Used by papers:** 20 | - Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, Meng Wang. [LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation](https://arxiv.org/abs/2002.02126). In SIGIR 2020. 21 | - Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. [SimpleX: A Simple and Strong Baseline for Collaborative Filtering](https://arxiv.org/abs/2109.12613). In CIKM 2021. 22 | - Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. [UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation](https://arxiv.org/abs/2110.15114). In CIKM 2021. 
23 | 24 | + **Check the md5sum for data integrity:** 25 | ```bash 26 | $ md5sum *.txt 27 | 520fe559761ff2c654629201c807f353 item_list.txt 28 | 0d57d7399862c32152b045ec5d2698e7 test.txt 29 | 1b8b5d22a227e01d6de002c53d32b4c4 train.txt 30 | ae4f810cd6e827f10fc418753c7d92f9 user_list.txt 31 | ``` 32 | -------------------------------------------------------------------------------- /iFlytek/iFlyteckAds_x1/convert_iFlyteckAds_x1.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from collections import Counter 3 | import random 4 | 5 | train_file = "train_data.txt" 6 | test_file = "test_data.txt" 7 | 8 | columns = ['label'] + ["f{}".format(i + 1) for i in range(245)] 9 | train_data = pd.read_csv(train_file, header=None, names=columns, dtype=object, memory_map=True) 10 | train_data.drop([1613993, 1895025], axis=0, inplace=True) # drop abnormal rows 11 | num_train = len(train_data) 12 | test_data = pd.read_csv(test_file, header=None, names=columns[1:], dtype=object, memory_map=True) 13 | test_data[columns[0]] = [0] * len(test_data) 14 | num_test = len(test_data) 15 | all_data = pd.concat([train_data, test_data], sort=False).reset_index(drop=True) 16 | all_data["label"] = all_data["label"].astype(int) 17 | print("num_train, num_test:", num_train, num_test) 18 | 19 | for col in ["f1", "f2", "f3", "f4"]: 20 | col_dict = Counter(all_data[col]) 21 | vocab = dict(zip(col_dict.keys(), range(len(col_dict)))) 22 | all_data[col] = all_data[col].map(lambda x: vocab[x]) 23 | print(all_data.head()) 24 | 25 | train_df = all_data.iloc[0:num_train, :].reset_index(drop=True) 26 | test_df = all_data.iloc[num_train:, :] 27 | 28 | def get_statistics_by_slotid(df, fout): 29 | slotid_impressions = df.groupby(['f1'])['f1'].count() 30 | slotid_clicks = df.groupby(['f1'])['label'].sum() 31 | slotid_impressions.to_csv(fout + "_slotid_impressions.csv") 32 | slotid_clicks.to_csv(fout + "_slotid_clicks.csv") 33 | 34 | get_statistics_by_slotid(train_df, "train_all") 35 | get_statistics_by_slotid(test_df, "test") 36 | 37 | # train-validation-test splitting 38 | sample_index = list(range(num_train)) 39 | random.seed(2022) 40 | random.shuffle(sample_index) 41 | train_index = sample_index[0:int(num_train * 0.9)] 42 | valid_index = sample_index[int(num_train * 0.9):] 43 | valid_df = train_df.iloc[valid_index, :] 44 | train_df = train_df.iloc[train_index, :] 45 | train_df.to_csv("train.csv", index=False) 46 | valid_df.to_csv("valid.csv", index=False) 47 | test_df.to_csv("test.csv", index=False) 48 | print("train:valid:test samples:", len(train_df), len(valid_df), len(test_df)) 49 | 50 | get_statistics_by_slotid(train_df, "train") 51 | get_statistics_by_slotid(valid_df, "valid") 52 | print("All done.") 53 | -------------------------------------------------------------------------------- /iPinYou/iPinYou_x1/README.md: -------------------------------------------------------------------------------- 1 | # iPinYou_x1 2 | 3 | + **Dataset description:** 4 | 5 | The iPinYou Global Real-Time Bidding Algorithm Competition was organized by iPinYou from April 1st, 2013 to December 31st, 2013. The competition has been divided into three seasons. For each season, a training dataset is released to the competition participants, while the testing dataset is reserved by iPinYou.
The complete testing dataset is randomly divided into two parts: one part is the leaderboard testing dataset, used to score and rank the participating teams on the leaderboard, and the other part is reserved for the final offline evaluation. Each participant's last offline submission is evaluated on the reserved testing dataset to obtain the team's final offline score. This dataset contains the training datasets and leaderboard testing datasets of all three seasons. The reserved testing datasets are withheld by iPinYou. The training dataset includes a set of processed iPinYou DSP bidding, impression, click, and conversion logs. 6 | 7 | + **Source:** https://contest.ipinyou.com/ 8 | + **Download:** https://huggingface.co/datasets/reczoo/iPinYou_x1/tree/main 9 | + **RecZoo Datasets:** https://github.com/reczoo/Datasets 10 | 11 | + **Used by papers:** 12 | - Bin Liu, Niannan Xue, Huifeng Guo, Ruiming Tang, Stefanos Zafeiriou, Xiuqiang He, Zhenguo Li. [AutoGroup: Automatic Feature Grouping for Modelling Explicit High-Order Feature Interactions in CTR Prediction](https://dl.acm.org/doi/abs/10.1145/3397271.3401082). In SIGIR 2020. 13 | 14 | + **Check the md5sum for data integrity:** 15 | ```bash 16 | $ md5sum *.csv 17 | a94374868687794ff8c0c4d0b124a400 test.csv 18 | 9dd8979d265ab1ed7662ffd49fd73247 train.csv 19 | ``` 20 | -------------------------------------------------------------------------------- /tracking.md: -------------------------------------------------------------------------------- 1 | # Tracking Records 2 | 3 | We track dataset splits from the published papers in order to make the research results reproducible and reusable. We directly reuse the data splits or preprocessing steps if a paper has made the details publicly available. If not, we request the data splits by sending emails to the authors. 4 | 5 | - :question: The authors have not responded to our request for the data splits needed for reproduction. 6 | - :x: The data splits cannot be reproduced. 7 | - :heart: We hope everyone can join us in building reusable dataset splits! 8 | 9 | | Dataset Splits | Paper Title | 10 | |:-----------|:--------------------| 11 | | [Criteo_x1](https://github.com/reczoo/Datasets/tree/main/Criteo#Criteo_x1), [Avazu_x1](https://github.com/reczoo/Datasets/tree/main/Avazu#Avazu_x1), [Frappe_x1](https://github.com/reczoo/Datasets/tree/main/Frappe#frappe_x1), [MovielensLatest_x1](https://github.com/reczoo/Datasets/tree/main/MovieLens#movielenslatest_x1) | [**AAAI'20**] [Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions](https://ojs.aaai.org/index.php/AAAI/article/view/5768), Weiyu Cheng, Yanyan Shen, Linpeng Huang. | 12 | | [Criteo_x4](https://github.com/reczoo/Datasets/tree/main/Criteo#Criteo_x4), [Avazu_x4](https://github.com/reczoo/Datasets/tree/main/Avazu#Avazu_x4) | [**CIKM'19**] [AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/abs/1810.11921), Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, Jian Tang. | 13 | --------------------------------------------------------------------------------