├── README.md
├── api
│   ├── README.md
│   └── images
│       └── teams-access-tokens.png
├── baseline
│   ├── README.md
│   ├── model.py
│   ├── parser.py
│   ├── recommendation_worker.py
│   └── xgb.py
└── online-evaluation
    ├── README.md
    ├── baseline
    │   ├── Readme.md
    │   ├── model.py
    │   ├── online_schedule.py
    │   ├── parser.py
    │   └── recommendation_worker.py
    └── img
        └── timeline.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ACM RecSys Challenge 2017
2 | =========================
3 | 
4 | Pointers:
5 | 
6 | - [RecSys Challenge 2017](http://2017.recsyschallenge.com/)
7 |   + Details about the challenge: [http://2017.recsyschallenge.com](http://2017.recsyschallenge.com/)
8 |   + Submission System of the challenge: [https://recsys.xing.com](https://recsys.xing.com/)
9 |   + News: [@recsyschallenge](https://twitter.com/recsyschallenge)
10 |   + Paper: [Workshop Summary](https://www.overleaf.com/read/ftccwkttscxs)
11 | - [RecSys Challenge 2016](http://2016.recsyschallenge.com/)
12 | 
13 | 
--------------------------------------------------------------------------------
/api/README.md:
--------------------------------------------------------------------------------
1 | Submission System API (alpha version)
2 | =====================
3 | 
4 | The API allows teams to automate their recommender systems, in particular the process of
5 | downloading data (relevant for the online challenge) and uploading solutions.
6 | 
7 | Current list of endpoints:
8 | 
9 | - [Offline evaluation](#offline-evaluation)
10 |   + [GET /api/team](#get-apiteam)
11 |   + [POST /api/submission](#post-apisubmission)
12 | - [Online evaluation](#online-evaluation)
13 |   + [GET /api/online/data/status](#get-apionlinedatastatus)
14 |   + [GET /api/online/data/items](#get-apionlinedataitems)
15 |   + [GET /api/online/data/users](#get-apionlinedatausers)
16 |   + [GET /api/online/data/interactions](#get-apionlinedatainteractions)
17 |   + [POST /api/online/submission](#post-apionlinesubmission)
18 |   + [GET /api/online/submission](#get-apionlinesubmission)
19 | 
20 | ## Offline Evaluation
21 | 
22 | ### GET /api/team
23 | 
24 | Get team details.
25 | 
26 | #### Example request
27 | 
28 | ```
29 | curl -vv -XGET -H 'Authorization: Bearer RAtN...LTA1NjkyOGU5OTE5Mw==' 'https://recsys.xing.com/api/team'
30 | ```
31 | 
32 | Notes:
33 | 
34 | - `RAtN...LTA1NjkyOGU5OTE5Mw==` is the access token of the team. It can be generated on the [team details page (access token)](https://recsys.xing.com/team).
35 | 
36 | 
37 | #### Example response
38 | 
39 | ```javascript
40 | {
41 |   "name":"Data Rangers 2",
42 |   "remaining_submissions_today":17,
43 |   "submissions":[
44 |     {
45 |       "score":26100,
46 |       "rank":2,
47 |       "label":"test42 (time-decay)",
48 |       "submitted_at":"2017-03-11T00:15:34.000+01:00"
49 |     },
50 |     {
51 |       "score":10003,
52 |       "rank":14,
53 |       "label":"random testing",
54 |       "submitted_at":"2017-03-10T23:38:04.000+01:00"
55 |     },
56 |     ...
57 |   ]
58 | }
59 | ```
60 | 
61 | Notes:
62 | 
63 | - `name`: name of the team
64 | - `remaining_submissions_today`: number of submissions that the team can still do on the given day (CET timezone)
65 | - `submissions`: past submissions of the team
66 |   + `score`: the score that was achieved
67 |   + `rank`: the rank that the team achieved with the submission (at the time the submission was done)
68 |   + `label`: the (optional) label that the team passed when [submitting the solution](#post-apisubmission)
69 |   + `submitted_at`: timestamp when the solution was submitted
70 | - Response codes:
71 |   + `200` OK
72 |   + `401` Unauthorized (in case the access token is no longer valid or was not properly set in the header of the request)
73 | 
74 | ### POST /api/submission
75 | 
76 | Uploads a new solution for the team.
77 | 
78 | #### Example request
79 | 
80 | ```
81 | curl -vv -XPOST -H 'Authorization: Bearer RAtN...LTA1NjkyOGU5OTE5Mw==' 'https://recsys.xing.com/api/submission?label=test42%20(time-decay)' --data-binary @solution_file.csv
82 | ```
83 | 
84 | Notes:
85 | 
86 | - `RAtN...LTA1NjkyOGU5OTE5Mw==` is the access token of the team. It can be generated on the [team details page (access token)](https://recsys.xing.com/team).
87 | - `label`: optional label that should be assigned to the submission (won't be visible to other teams)
88 | - `solution_file.csv` is the actual solution file that should be uploaded. See: [Format instructions](https://recsys.xing.com/submission#instructions)
89 | 
90 | #### Example response
91 | 
92 | ```javascript
93 | {
94 |   "result": {
95 |     "score": 10004,
96 |     "rank": 13,
97 |     "label": "test42 (time-decay)",
98 |     "submitted_at": "2017-03-11T00:44:04.764+01:00"
99 |   },
100 |   "is_top_score": false,
101 |   "remaining_submissions_today": 18,
102 |   "lines_skipped": [1]
103 | }
104 | ```
105 | 
106 | Notes:
107 | 
108 | - `result`: the result that the uploaded solution achieved (relevant for the offline challenge)
109 |   + `score`: the score that was achieved
110 |   + `rank`: the rank that the score achieved with respect to the [current leaderboard](https://recsys.xing.com/leaders)
111 |   + `label`: the label that was assigned to the submission
112 |   + `submitted_at`: timestamp when the solution was submitted
113 | - `is_top_score`: boolean that indicates whether the current upload achieved an equal or higher score than the current best solution of the team
114 | - `remaining_submissions_today`: number of submissions that the team can still do on the given day (CET timezone)
115 | - `lines_skipped`: array of line numbers that were skipped / not processed
116 | - Response codes:
117 |   + `200` OK (currently, a 200 is also returned when no further submissions are possible according to `remaining_submissions_today`, i.e. check whether `remaining_submissions_today > 0` before submitting)
118 |   + `400` Bad Request (e.g. if the solution file could not be parsed properly)
119 |   + `401` Unauthorized (in case the access token is no longer valid or was not properly set in the header of the request)
120 | 
121 | 
122 | ## Online Evaluation
123 | 
124 | The APIs for the online evaluation (see also: [details about the online evaluation procedure](https://github.com/recsyschallenge/2017/tree/master/online-evaluation)) can only be used by those teams that are approved for the online challenge.
The API endpoints below again require an access token, which can be generated on the [team details page (access token)](https://recsys.xing.com/team) and is different from the token that was used for the offline challenge:
125 | 
126 | 
127 | 
128 | All requests need to pass the access token as part of the HTTP authorization header, e.g.:
129 | 
130 | ```
131 | curl -XGET -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/data/status'
132 | ```
133 | 
134 | `RGF0...A2OQ==` is the access token that can be (re-)generated on [recsys.xing.com/team](https://recsys.xing.com/team). The token is valid throughout the online challenge (as long as it is not regenerated by the owner of the team). If a team overloads the API then the token will be disabled.
135 | 
136 | ### GET /api/online/data/status
137 | 
138 | Once a day, a new dataset is released. We strongly encourage the teams to download the dataset only once per day. `GET /api/online/data/status` should be called to find out whether the new dataset is ready to be downloaded. The dataset will then contain:
139 | 
140 | - new items that should be pushed as recommendations to users (see: [GET /api/online/data/items](#get-apionlinedataitems))
141 | - new target users for your team (see: [GET /api/online/data/users](#get-apionlinedatausers))
142 | - updated interaction data (see: [GET /api/online/data/interactions](#get-apionlinedatainteractions))
143 | 
144 | 
145 | #### Example
146 | 
147 | Get status:
148 | 
149 | ```
150 | curl -XGET -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/data/status'
151 | ```
152 | 
153 | Response:
154 | 
155 | ```javascript
156 | {
157 |   "current": {
158 |     "num_items":10927,
159 |     "updated_at":"2017-03-29T04:20:23.000+02:00"
160 |   }
161 | }
162 | ```
163 | 
164 | Notes:
165 | 
166 | - Attributes:
167 |   + `num_items`: number of new items that can be recommended to the users for whom the team is responsible
168 |   + `updated_at`: timestamp when the data was exported so that it can be downloaded
169 | - Teams should download the data only once per day. `GET /api/online/data/status` can be polled e.g. every 5 minutes to find out when the new data is available (as soon as `updated_at` corresponds to the current date, teams can download the new datasets).
170 | 
171 | 
172 | ### GET /api/online/data/items
173 | 
174 | Download the new target items (including details) that teams can push as recommendations to their users. Those target items will be _new_ job postings that were typically published on xing.com within the last 24 hours.
175 | 
176 | #### Example
177 | 
178 | Download target items:
179 | 
180 | ```
181 | curl -XGET -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/data/items'
182 | ```
183 | 
184 | Response is a tab-separated text file:
185 | 
186 | ```
187 | id title career_level discipline_id industry_id country is_payed region latitude longitude employment tags created_at
188 | 30 3 15 9 de 0 7 50.8 8.9 1 3323880,4078788,1025029,88816,1032508,2922746,2282775,878709 1450216800
189 | 70 2994300,665762,901938,127655,3680343 3 5 18 de 0 1 48.7 9.0 1 127655,3680343,366525,2114599,901938,2994300,665762,427470,3284568,3175069,792161,3671787,1023516 1475791200
190 | ...
191 | ```
192 | 
193 | Notes:
194 | 
195 | - the format of the file corresponds to that of the _items.csv_ file that was used during the offline challenge
196 | - for a description of the columns, see: [Dataset / Items](http://2017.recsyschallenge.com/#dataset-items)
197 | 
198 | 
199 | ### GET /api/online/data/users
200 | 
201 | Get the list of target user IDs for the given day. Each day, the team will get a new list of target users to whom they are allowed to push the new items as recommendations. The target users do not overlap between the teams, i.e. each team will get a dedicated list of target users.
202 | 
203 | #### Example
204 | 
205 | Get target user IDs:
206 | 
207 | ```
208 | curl -XGET -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/data/users'
209 | ```
210 | 
211 | Example response:
212 | 
213 | ```
214 | 726
215 | 172
216 | 98
217 | 1009
218 | ...
219 | ```
220 | 
221 | Notes:
222 | 
223 | - each line features one target user ID
224 | - the target user IDs refer to the users that are part of the dataset that is released to the teams at the beginning of the online challenge
225 | 
226 | 
227 | ### GET /api/online/data/interactions
228 | 
229 | Download the updated interaction data. The tab-separated file (format see: [Dataset / Interactions](http://2017.recsyschallenge.com/#dataset-interactions)) will also contain the interaction data from the last day (for which a team may already have pushed recommendations).
230 | 
231 | #### Example
232 | 
233 | Example request:
234 | 
235 | ```
236 | curl -XGET -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/data/interactions'
237 | ```
238 | 
239 | Response is a tab-separated text file:
240 | 
241 | ```
242 | user_id item_id interaction_type created_at
243 | 2082156 80 1 1484299172
244 | 1934123 140 1 1486388563
245 | 1320213 240 1 1479409825
246 | 297303 310 1 1484817366
247 | 1635596 310 1 1486370081
248 | 857319 340 1 1485121421
249 | ```
250 | 
251 | Format: [Dataset / Interactions](http://2017.recsyschallenge.com/#dataset-interactions)
252 | 
253 | 
254 | ### POST /api/online/submission
255 | 
256 | Upload recommendations for the [target users](#get-apionlinedatausers) of the team. The format of the submission is the same as the format of the solutions during the offline challenge:
257 | 
258 | ```
259 | item_id user_ids
260 | 762817 46271, 18262, 81725, 18236, 49762, 61726
261 | 172 7612, 65182
262 | 67816 78161, 97615, 89010, 176241, 18651, 49785, 59872, 67
263 | ...
264 | ```
265 | 
266 | Notes:
267 | 
268 | - `item_id` and `user_ids` have to be separated by tab or whitespace (the first number is interpreted as the item ID and all following IDs are interpreted as the IDs of users/candidates to whom the item should be recommended).
269 |   + `item_id` has to be one item from the current items, see: [GET /api/online/data/items](#get-apionlinedataitems)
270 |   + `user_ids` have to be IDs of users that are part of the current target users of the team, see: [GET /api/online/data/users](#get-apionlinedatausers)
271 | - `user_ids` do not need to be sorted
272 | - For each item, we accept at maximum 250 users (we cut off after 250 users).
273 | - Important differences from the offline evaluation:
274 |   + each user may receive at maximum 1 recommendation per day.
275 |   + Hence, the submission file must not contain a user twice.
276 |   + We strongly encourage the teams to not spam users with recommendations (i.e. no best guesses, no randomness, etc.).
It is perfectly fine to not recommend items to all of the current target users.
277 |   + Background: users will receive push notifications once a recommendation is pushed to them and we do not want to bother the users more than once per day. You should thus wisely decide whether you want to push an item as a recommendation and, if so, which item.
278 | - `POST /api/online/submission` can be called more than once per day (at maximum 1000 times per day), i.e. recommendations can be submitted in smaller batches. However, the first recommendation that is pushed to a user _wins_, i.e. if a first file already contained a row such as `42 78,728,97` then the following submissions should not contain the user IDs `78,728,97` because those users already got their _recommendation of the day_ (additional recommendations to those users on the same day will be neglected).
279 | 
280 | #### Example
281 | 
282 | Example request:
283 | 
284 | ```
285 | curl -XPOST -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/submission' --data-binary @submission-2017-05-01.csv
286 | ```
287 | 
288 | Example response:
289 | 
290 | ```
291 | {"num_accepted_recos": 17827}
292 | ```
293 | 
294 | `num_accepted_recos` is the number of user-item pairs that were successfully accepted and sent to XING's push recommendation component.
295 | 
296 | ### GET /api/online/submission
297 | 
298 | Download the submission of the team for the current day. This endpoint mainly allows teams to cross-check which of the recommendations that were submitted via [POST /api/online/submission](#post-apionlinesubmission) were accepted and have been further processed by XING's push recommendation component.
299 | 
300 | #### Example
301 | 
302 | Example request:
303 | 
304 | ```
305 | curl -XGET -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/submission'
306 | ```
307 | 
308 | Example response:
309 | 
310 | ```
311 | 854173 52381
312 | 816205 71625, 17353, 881
313 | 701204 162, 91826
314 | 712783 928361
315 | 773903 1023
316 | 654400 837361
317 | ...
318 | ```
319 | 
--------------------------------------------------------------------------------
/api/images/teams-access-tokens.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/recsyschallenge/2017/0515d32417993b831915087a8a57da0f107557a5/api/images/teams-access-tokens.png
--------------------------------------------------------------------------------
/baseline/README.md:
--------------------------------------------------------------------------------
1 | Baseline
2 | ========
3 | 
4 | !!!!!! DELETE THE HEADER IN THE TARGET FILES !!!!!!!
5 | ========
6 | 
7 | This is the simple baseline that creates the sample_solution.csv file.
8 | The baseline system extracts features from interacted user-item pairs
9 | and "learns" if a user will interact positively with an item.
10 | The underlying learning algorithm is [XGBoost](https://xgboost.readthedocs.io/en/latest/),
11 | a tree ensembling method that most winning teams used last year.
12 | 
13 | The features are:
14 | + number of matches in title ids [Int]
15 | + discipline matches [0, 1]
16 | + career level matches [0, 1]
17 | + industry matches [0, 1]
18 | + country match [0, 1]
19 | + region match [0, 1]
20 | 
21 | Files:
22 | + The data model along with the feature extraction can be found in `model.py`.
23 | + Parsing the data is performed in `parser.py`.
24 | + A parallel prediction algorithm is given in `recommendation_worker.py`.
25 |   The recommendation worker iterates over the target items and users; whenever
26 |   there is at least one title match, a score is predicted for each candidate user.
27 | + The main training and testing is performed in `xgb.py`.
28 | 
29 | In order to build the baseline, XGBoost has to be installed with the Python bindings.
30 | 
--------------------------------------------------------------------------------
/baseline/model.py:
--------------------------------------------------------------------------------
1 | '''
2 | Modeling users, interactions and items from
3 | the recsys challenge 2017.
4 | 
5 | by Daniel Kohlsdorf
6 | '''
7 | 
8 | class User:
9 | 
10 |     def __init__(self, title, clevel, indus, disc, country, region):
11 |         self.title = title
12 |         self.clevel = clevel
13 |         self.indus = indus
14 |         self.disc = disc
15 |         self.country = country
16 |         self.region = region
17 | 
18 | class Item:
19 | 
20 |     def __init__(self, title, clevel, indus, disc, country, region):
21 |         self.title = title
22 |         self.clevel = clevel
23 |         self.indus = indus
24 |         self.disc = disc
25 |         self.country = country
26 |         self.region = region
27 | 
28 | class Interaction:
29 | 
30 |     def __init__(self, user, item, interaction_type):
31 |         self.user = user
32 |         self.item = item
33 |         self.interaction_type = interaction_type
34 | 
35 |     def title_match(self):
36 |         return float(len(set(self.user.title).intersection(set(self.item.title))))
37 | 
38 |     def clevel_match(self):
39 |         if self.user.clevel == self.item.clevel:
40 |             return 1.0
41 |         else:
42 |             return 0.0
43 | 
44 |     def indus_match(self):
45 |         if self.user.indus == self.item.indus:
46 |             return 1.0
47 |         else:
48 |             return 0.0
49 | 
50 |     def discipline_match(self):
51 |         if self.user.disc == self.item.disc:
52 |             return 2.0
53 |         else:
54 |             return 0.0
55 | 
56 |     def country_match(self):
57 |         if self.user.country == self.item.country:
58 |             return 1.0
59 |         else:
60 |             return 0.0
61 | 
62 |     def region_match(self):
63 |         if self.user.region == self.item.region:
64 |             return 1.0
65 |         else:
66 |             return 0.0
67 | 
68 |     def features(self):
69 |         return [
70 |             self.title_match(), self.clevel_match(), self.indus_match(),
71 |             self.discipline_match(), self.country_match(), self.region_match()
72 |         ]
73 | 
74 |     def label(self):
75 |         if self.interaction_type == 4:
76 |             return 0.0
77 |         else:
78 |             return 1.0
79 | 
80 | 
81 | 
--------------------------------------------------------------------------------
/baseline/parser.py:
--------------------------------------------------------------------------------
1 | '''
2 | Parsing the ACM Recsys Challenge 2017 data into interactions,
3 | items and user models.
4 | 
5 | by Daniel Kohlsdorf
6 | '''
7 | 
8 | from model import *
9 | 
10 | 
11 | def is_header(line):
12 |     return "recsyschallenge" in line
13 | 
14 | 
15 | def process_header(header):
16 |     x = {}
17 |     pos = 0
18 |     for name in header:
19 |         x[name.split(".")[1]] = pos
20 |         pos += 1
21 |     return x
22 | 
23 | 
24 | def select(from_file, where, toObject, index):
25 |     header = None
26 |     data = {}
27 |     i = 0
28 |     for line in open(from_file):
29 |         if is_header(line):
30 |             header = process_header(line.strip().split("\t"))
31 |         else:
32 |             cmp = line.strip().split("\t")
33 |             if where(cmp):
34 |                 obj = toObject(cmp, header)
35 |                 if obj != None:
36 |                     data[index(cmp)] = obj
37 |         i += 1
38 |         if i % 100000 == 0:
39 |             print("... reading line " + str(i) + " from file " + from_file)
40 |     return (header, data)
41 | 
42 | 
43 | def build_user(str_user, names):
44 |     return User(
45 |         [int(x) for x in str_user[names["jobroles"]].split(",") if len(x) > 0],
46 |         int(str_user[names["career_level"]]),
47 |         int(str_user[names["industry_id"]]),
48 |         int(str_user[names["discipline_id"]]),
49 |         str_user[names["country"]],
50 |         str_user[names["region"]]
51 |     )
52 | 
53 | 
54 | def build_item(str_item, names):
55 |     return Item(
56 |         [int(x) for x in str_item[names["title"]].split(",") if len(x) > 0],
57 |         int(str_item[names["career_level"]]),
58 |         int(str_item[names["industry_id"]]),
59 |         int(str_item[names["discipline_id"]]),
60 |         str_item[names["country"]],
61 |         str_item[names["region"]]
62 |     )
63 | 
64 | 
65 | class InteractionBuilder:
66 | 
67 |     def __init__(self, user_dict, item_dict):
68 |         self.user_dict = user_dict
69 |         self.item_dict = item_dict
70 | 
71 |     def build_interaction(self, str_inter, names):
72 |         if int(str_inter[names['item_id']]) in self.item_dict and int(str_inter[names['user_id']]) in self.user_dict:
73 |             return Interaction(
74 |                 self.user_dict[int(str_inter[names['user_id']])],
75 |                 self.item_dict[int(str_inter[names['item_id']])],
76 |                 int(str_inter[names["interaction_type"]])
77 |             )
78 |         else:
79 |             return None
80 | 
81 | 
82 | 
--------------------------------------------------------------------------------
/baseline/recommendation_worker.py:
--------------------------------------------------------------------------------
1 | '''
2 | Build recommendations based on trained XGBoost model
3 | 
4 | by Daniel Kohlsdorf
5 | '''
6 | 
7 | from model import *
8 | import xgboost as xgb
9 | import numpy as np
10 | 
11 | TH = 0.8
12 | 
13 | def classify_worker(item_ids, target_users, items, users, output_file, model):
14 |     with open(output_file, 'w') as fp:
15 |         pos = 0
16 |         average_score = 0.0
17 |         num_evaluated = 0.0
18 |         for i in item_ids:
19 |             data = []
20 |             ids = []
21 | 
22 |             # build the features of all (user, item) pairs for this item
23 |             for u in target_users:
24 |                 x = Interaction(users[u], items[i], -1)
25 |                 if x.title_match() > 0:
26 |                     f = x.features()
27 |                     data += [f]
28 |                     ids += [u]
29 | 
30 |             if len(data) > 0:
31 |                 # predictions from XGBoost
32 |                 dtest = xgb.DMatrix(np.array(data))
33 |                 ypred = model.predict(dtest)
34 | 
35 |                 # compute average score
36 |                 average_score += sum(ypred)
37 |                 num_evaluated += float(len(ypred))
38 | 
39 |                 # keep the users with a score above the given threshold, sorted by descending score
40 |                 user_ids = sorted(
41 |                     [
42 |                         (ids_j, ypred_j) for ypred_j, ids_j in zip(ypred, ids) if ypred_j > TH
43 |                     ],
44 |                     key = lambda x: -x[1]
45 |                 )[0:99]
46 | 
47 |                 # write the results to file: the item id followed by the comma-separated user ids
48 |                 if len(user_ids) > 0:
49 |                     item_id = str(i) + "\t"
50 |                     fp.write(item_id)
51 |                     for j in range(0, len(user_ids) - 1):
52 |                         user_id = str(user_ids[j][0]) + ","
53 |                         fp.write(user_id)
54 |                     fp.write(str(user_ids[-1][0]) + "\n")
55 |                     fp.flush()
56 | 
57 |             # Every 100 items print some stats
58 |             if pos % 100 == 0:
59 |                 try:
60 |                     score = str(average_score / num_evaluated)
61 |                 except ZeroDivisionError:
62 |                     score = 0
63 |                 percentageDown = str(pos / float(len(item_ids)))
64 |                 print(output_file + " " + percentageDown + " " + score)
65 |             pos += 1
--------------------------------------------------------------------------------
/baseline/xgb.py:
--------------------------------------------------------------------------------
1 | '''
2 | Baseline solution for the ACM Recsys Challenge 2017
3 | using XGBoost
4 | 
5 | by Daniel Kohlsdorf
6 | '''
7 | 
8 | import xgboost as xgb
9 | import numpy as np
10 | import multiprocessing
11 | 
12 | from model import *
13 | from parser import *
14 | from recommendation_worker import *
15 | import random
16 | 
17 | print(" --- Recsys Challenge 2017 Baseline --- ")
18 | 
19 | N_WORKERS = 5
20 | USERS_FILE = "users.csv"
21 | ITEMS_FILE = "items.csv"
22 | INTERACTIONS_FILE = "interactions.csv"
23 | TARGET_USERS = "targetUsers.csv"
24 | TARGET_ITEMS = "targetItems.csv"
25 | 
26 | 
27 | '''
28 |     1) Parse the challenge data and
29 |        exclude all impressions
30 | '''
31 | (header_users, users) = select(USERS_FILE, lambda x: True, build_user, lambda x: int(x[0]))
32 | (header_items, items) = select(ITEMS_FILE, lambda x: True, build_item, lambda x: int(x[0]))
33 | 
34 | builder = InteractionBuilder(users, items)
35 | (header_interactions, interactions) = select(
36 |     INTERACTIONS_FILE,
37 |     lambda x: x[2] != '0',
38 |     builder.build_interaction,
39 |     lambda x: (int(x[0]), int(x[1]))
40 | )
41 | 
42 | 
43 | '''
44 |     2) Build recsys training data
45 | '''
46 | data = np.array([interactions[key].features() for key in interactions.keys()])
47 | labels = np.array([interactions[key].label() for key in interactions.keys()])
48 | dataset = xgb.DMatrix(data, label=labels)
49 | dataset.save_binary("recsys2017.buffer")
50 | 
51 | 
52 | '''
53 |     3) Train XGBoost regression model with maximum tree depth of 2 and 25 trees
54 | '''
55 | evallist = [(dataset, 'train')]
56 | param = {'bst:max_depth': 2, 'bst:eta': 0.1, 'silent': 1, 'objective': 'reg:linear' }
57 | param['nthread'] = 4
58 | param['eval_metric'] = 'rmse'
59 | param['base_score'] = 0.0
60 | num_round = 25
61 | bst = xgb.train(param, dataset, num_round, evallist)
62 | bst.save_model('recsys2017.model')
63 | 
64 | 
65 | '''
66 |     4) Create target sets for items and users
67 | '''
68 | target_users = []
69 | for n, line in enumerate(open(TARGET_USERS)):
70 |     # there is a header in target_users in the dataset
71 |     if n == 0:
72 |         continue
73 |     target_users += [int(line.strip())]
74 | target_users = set(target_users)
75 | 
76 | target_items = []
77 | for line in open(TARGET_ITEMS):
78 |     target_items += [int(line.strip())]
79 | 
80 | 
81 | '''
82 |     5) Schedule classification
83 | '''
84 | bucket_size = (len(target_items) + N_WORKERS - 1) // N_WORKERS  # integer ceiling so that no target items are dropped
85 | start = 0
86 | jobs = []
87 | for i in range(0, N_WORKERS):
88 |     stop = int(min(len(target_items), start + bucket_size))
89 |     filename = "solution_" + str(i) + ".csv"
90 |     process = multiprocessing.Process(target = classify_worker, args=(target_items[start:stop], target_users, items, users, filename, bst))
91 |     jobs.append(process)
92 |     start = stop
93 | 
94 | for j in jobs:
95 |     j.start()
96 | 
97 | for j in jobs:
98 |     j.join()
99 | 
100 | 
--------------------------------------------------------------------------------
/online-evaluation/README.md:
--------------------------------------------------------------------------------
1 | Online Evaluation
2 | =====================
3 | 
4 | ![Recsys2017 Timeline](img/timeline.png)
5 | 
6 | The online evaluation is set up as follows. The goal of each team is the same as during the _offline challenge_: given a new item (job posting), identify those users that are interested in the job and that are at the same time also of interest to the recruiter who is associated with the posting.
7 | 
8 | - `X_0`: Some days before the online evaluation challenge starts, a new dataset will be released for the participating teams:
9 |   + users: teams will be presented with a set of users and their profile information at the beginning of the challenge (cf. [users.csv](http://2017.recsyschallenge.com/#dataset-users)).
10 |     - This set stays valid throughout the whole online evaluation period (until `X_end`).
11 |   + items: details about items that those users recently interacted with (cf. [items.csv](http://2017.recsyschallenge.com/#dataset-items)).
12 |   + interactions: the interactions that those users performed recently (cf. [interactions.csv](http://2017.recsyschallenge.com/#dataset-interactions))
13 | - Each day the teams then receive...
14 |   + a set of target users = those user IDs to whom the team can recommend new items (cf. [targetUsers.csv](http://2017.recsyschallenge.com/#dataset-targets))
15 |   + the new items which can be recommended to the target users. The format of the item description is the same as during the offline evaluation: [items.csv](http://2017.recsyschallenge.com/#dataset-items)
16 |   + updated interactions: the interactions that have been collected during the previous day (cf. [interactions.csv](http://2017.recsyschallenge.com/#dataset-interactions))
17 |     - The impressions/interactions will also include entries that have been triggered by other teams or by XING's search and recommendation systems.
18 | - `X_t-1` till `X_t`: Within these 24 hours, teams can submit their solution files (columns: `item_id`, `user_ids`). Here, the following restrictions hold:
19 |   + Teams will be allowed to
20 |     submit each user id from their target list at maximum one time (a minimal sketch of how to enforce this is shown right after this list).
21 |   + It is ok if the team chooses to not play out a recommendation for a given target user (notice that rolling out a recommendation that only receives negative feedback will result in a negative score, see: [Evaluation Metrics](http://2017.recsyschallenge.com/#evaluation)).
22 |   + If the team submits a posting older than 24 hours, a user
23 |     not in their target list or a user they already sent a recommendation to that
24 |     day, the system will ignore that part of the submission.
25 | - `X_w`: the score will be calculated on all item-user pairs a team submitted (given that the above mentioned restrictions were not violated).
26 |   + Submitted recommendations can be interacted with for one week.
27 |   + Afterwards, the interactions of a user with that item do not
28 |     count towards the final score.
29 | - `X_res`: The winning team is the one that achieved the highest
30 |   sum of their two best scoring weeks. The winner of the challenge will be announced one week after
31 |   the last submission slot.
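
The one-recommendation-per-user-per-day rule above is easy to get wrong when a team uploads its submission in several batches. The following is a minimal sketch (not part of the official challenge code) of how a team might deduplicate a daily submission before uploading; the names `build_daily_submission`, `scores`, and `already_submitted` are illustrative assumptions:

```python
from collections import defaultdict

def build_daily_submission(scores, target_users, target_items, already_submitted):
    """scores: dict mapping (user_id, item_id) -> predicted relevance score."""
    best = {}  # user_id -> (score, item_id): keep at most one item per user
    for (user_id, item_id), score in scores.items():
        if user_id not in target_users or item_id not in target_items:
            continue  # the server ignores users/items outside the current daily lists
        if user_id in already_submitted:
            continue  # the first recommendation of the day wins, so skip these users
        if user_id not in best or score > best[user_id][0]:
            best[user_id] = (score, item_id)

    # group the chosen (user, item) pairs by item: one line per item, users comma-separated
    by_item = defaultdict(list)
    for user_id, (_, item_id) in best.items():
        by_item[item_id].append(user_id)

    return "\n".join(
        str(item_id) + "\t" + ",".join(str(u) for u in user_ids)
        for item_id, user_ids in by_item.items()
    )

# usage sketch: POST the returned string to /api/online/submission (see the API README),
# then add the submitted user ids to `already_submitted` for the rest of the day.
```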
32 | 
33 | 
34 | ## Differences to Offline Evaluation
35 | 
36 | In contrast to the offline evaluation:
37 | 
38 | - there is a fixed window of one day in which the recommendations for the target users and target items have to be submitted
39 | - teams can only recommend to a user once per day
40 | - the final leaderboard score is not calculated over all submissions but over the two best weeks
41 | 
42 | 
43 | 
44 | 
45 | 
--------------------------------------------------------------------------------
/online-evaluation/baseline/Readme.md:
--------------------------------------------------------------------------------
1 | # Baseline Example for the Online Challenge
2 | 
3 | The baseline script will periodically check the
4 | RecSys online API for updates on the daily data,
5 | download the new data, compute a solution
6 | and then submit it to the system.
7 | 
8 | The frequency of checks is every 10 minutes.
9 | Once a solution is computed the script will not perform
10 | another pass that day. The example only computes
11 | the recommendations based on the XGBoost solution
12 | from the offline challenge's baseline, so you have to
13 | put your own solution in there.
14 | 
15 | The files 'parser.py' and 'model.py' are the same as
16 | for the offline challenge. The worker script 'recommendation_worker.py'
17 | is rewritten to match the online challenge.
18 | 
19 | The main script to execute and schedule a solution is 'online_schedule.py'.
20 | You need the following libraries:
21 | 
22 | + httplib2
23 | + xgboost
24 | + numpy
25 | 
26 | I tested this with Python 3.
27 | In order to run the test you have to grab the user data from:
28 | 
29 | https://recsys.xing.com/data/online
30 | 
31 | And put the users.csv into the data folder.
32 | 
33 | Train an XGBoost model using the offline challenge's scripts:
34 | https://github.com/recsyschallenge/2017/tree/master/baseline
35 | 
36 | And copy the XGBoost model to 'data/recsys2017.model'.
37 | 
38 | Then you will be set to go :)
39 | 
--------------------------------------------------------------------------------
/online-evaluation/baseline/model.py:
--------------------------------------------------------------------------------
1 | '''
2 | Modeling users, interactions and items from
3 | the recsys challenge 2017.
4 | 5 | by Daniel Kohlsdorf 6 | ''' 7 | 8 | class User: 9 | 10 | def __init__(self, title, clevel, indus, disc, country, region): 11 | self.title = title 12 | self.clevel = clevel 13 | self.indus = indus 14 | self.disc = disc 15 | self.country = country 16 | self.region = region 17 | 18 | class Item: 19 | 20 | def __init__(self, title, clevel, indus, disc, country, region): 21 | self.title = title 22 | self.clevel = clevel 23 | self.indus = indus 24 | self.disc = disc 25 | self.country = country 26 | self.region = region 27 | 28 | class Interaction: 29 | 30 | def __init__(self, user, item, interaction_type): 31 | self.user = user 32 | self.item = item 33 | self.interaction_type = interaction_type 34 | 35 | def title_match(self): 36 | return float(len(set(self.user.title).intersection(set(self.item.title)))) 37 | 38 | def clevel_match(self): 39 | if self.user.clevel == self.item.clevel: 40 | return 1.0 41 | else: 42 | return 0.0 43 | 44 | def indus_match(self): 45 | if self.user.indus == self.item.indus: 46 | return 1.0 47 | else: 48 | return 0.0 49 | 50 | def discipline_match(self): 51 | if self.user.disc == self.item.disc: 52 | return 2.0 53 | else: 54 | return 0.0 55 | 56 | def country_match(self): 57 | if self.user.country == self.item.country: 58 | return 1.0 59 | else: 60 | return 0.0 61 | 62 | def region_match(self): 63 | if self.user.region == self.item.region: 64 | return 1.0 65 | else: 66 | return 0.0 67 | 68 | def features(self): 69 | return [ 70 | self.title_match(), self.clevel_match(), self.indus_match(), 71 | self.discipline_match(), self.country_match(), self.region_match() 72 | ] 73 | 74 | def label(self): 75 | if self.interaction_type == 4: 76 | return 0.0 77 | else: 78 | return 1.0 79 | 80 | 81 | -------------------------------------------------------------------------------- /online-evaluation/baseline/online_schedule.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Online example 3 | 4 | Uses the offline mode to make predictions 5 | for the online challenge. 
6 | 7 | by Daniel Kohlsdorf 8 | ''' 9 | import time 10 | import httplib2 11 | import json 12 | from dateutil.parser import parse 13 | import datetime 14 | import parser 15 | from recommendation_worker import * 16 | 17 | TMP_ITEMS = "data/current_items.csv" 18 | TMP_SOLUTION = "data/current_solution.csv" 19 | 20 | MODEL = "data/recsys2017.model" # Model from offline training 21 | USERS_FILE = "data/users.csv" # Online user data 22 | 23 | TOKEN = "WElORyB...TgwNmYtMGJiZGYwOTNkZWY2" # your key 24 | SERVER = "http://recsys.xing.com" 25 | 26 | def header(token): 27 | return {"Authorization" : "Bearer " + token} 28 | 29 | def post_url(server): 30 | return server + "/api/online/submission" 31 | 32 | def status_url(server): 33 | return server + "/api/online/data/status" 34 | 35 | def users_url(server): 36 | return server + "/api/online/data/users" 37 | 38 | def items_url(server): 39 | return server + "/api/online/data/items" 40 | 41 | def get_stats(): 42 | http = httplib2.Http() 43 | content = http.request(status_url(SERVER), method="GET", headers=header(TOKEN))[1].decode("utf-8") 44 | response = json.loads(content) 45 | return parse(response['current']['updated_at']) 46 | 47 | def is_ready(): 48 | return get_stats().date() == datetime.date.today() 49 | 50 | def download_items(): 51 | http = httplib2.Http() 52 | content = http.request(items_url(SERVER), method="GET", headers=header(TOKEN))[1].decode("utf-8") 53 | fp = open(TMP_ITEMS, "w") 54 | fp.write(content) 55 | fp.close() 56 | return parser.select(TMP_ITEMS, lambda x: True, parser.build_item, lambda x: int(x[0])) 57 | 58 | def user_info(user_ids): 59 | return parser.select( 60 | USERS_FILE, 61 | lambda x: int(x[0]) in user_ids and "NULL" not in x, 62 | parser.build_user, 63 | lambda x: int(x[0]) 64 | ) 65 | 66 | def download_target_users(): 67 | http = httplib2.Http() 68 | content = http.request(users_url(SERVER), method="GET", headers=header(TOKEN))[1].decode("utf-8") 69 | user_ids = set([int(uid) for uid in content.split("\n") if len(uid) > 0]) 70 | return user_info(user_ids) 71 | 72 | def process(): 73 | (_, users) = download_target_users() 74 | (_, items) = download_items() 75 | target_users = list(users.keys()) 76 | target_items = list(items.keys()) 77 | filename = TMP_SOLUTION 78 | model = xgb.Booster({'nthread':1}) 79 | model.load_model(MODEL) 80 | classify_worker(target_items, target_users, items, users, filename, model) 81 | 82 | def submit(): 83 | http = httplib2.Http() 84 | filename = TMP_SOLUTION 85 | with open(filename, 'r') as content_file: 86 | content = content_file.read() 87 | response = http.request(post_url(SERVER), method="POST", body=content, 88 | headers=header(TOKEN) 89 | )[1].decode("utf-8") 90 | print("SUBMIT: " + filename + " " + response) 91 | 92 | if __name__ == "__main__": 93 | last_submit = None 94 | while True: 95 | if is_ready() and last_submit != datetime.date.today(): 96 | process() 97 | last_submit = datetime.date.today() 98 | submit() 99 | else: 100 | print("Not ready yet: " + str(datetime.date.today())) 101 | time.sleep(600) 102 | -------------------------------------------------------------------------------- /online-evaluation/baseline/parser.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Parsing the ACM Recsys Challenge 2017 data into interactions, 3 | items and user models. 
4 | 5 | by Daniel Kohlsdorf 6 | ''' 7 | 8 | from model import * 9 | 10 | def is_header(line): 11 | return "recsyschallenge" in line 12 | 13 | def process_header(header): 14 | x = {} 15 | pos = 0 16 | for name in header: 17 | x[name.split(".")[1]] = pos 18 | pos += 1 19 | return x 20 | 21 | def select(from_file, where, toObject, index): 22 | header = None 23 | data = {} 24 | i = 0 25 | for line in open(from_file): 26 | if is_header(line): 27 | header = process_header(line.strip().split("\t")) 28 | elif len(line.strip()) > 0 and header != None: 29 | cmp = line.strip().split("\t") 30 | if where(cmp) and len(cmp) == len(header): 31 | obj = toObject(cmp, header) 32 | if obj != None: 33 | data[index(cmp)] = obj 34 | i += 1 35 | if i % 100000 == 0: 36 | print("... reading line " + str(i) + " from file " + from_file) 37 | return(header, data) 38 | 39 | def build_user(str_user, names): 40 | return User( 41 | [int(x) for x in str_user[names["jobroles"]].split(",") if len(x) > 0], 42 | int(str_user[names["career_level"]]), 43 | int(str_user[names["industry_id"]]), 44 | int(str_user[names["discipline_id"]]), 45 | str_user[names["country"]], 46 | str_user[names["region"]] 47 | ) 48 | 49 | def build_item(str_item, names): 50 | return Item( 51 | [int(x) for x in str_item[names["title"]].split(",") if len(x) > 0], 52 | int(str_item[names["career_level"]]), 53 | int(str_item[names["industry_id"]]), 54 | int(str_item[names["discipline_id"]]), 55 | str_item[names["country"]], 56 | str_item[names["region"]] 57 | ) 58 | 59 | class InteractionBuilder: 60 | 61 | def __init__(self, user_dict, item_dict): 62 | self.user_dict = user_dict 63 | self.item_dict = item_dict 64 | 65 | def build_interaction(self, str_inter, names): 66 | if int(str_inter[names['item_id']]) in self.item_dict and int(str_inter[names['user_id']]) in self.user_dict: 67 | return Interaction( 68 | self.user_dict[int(str_inter[names['user_id']])], 69 | self.item_dict[int(str_inter[names['item_id']])], 70 | int(str_inter[names["interaction_type"]]) 71 | ) 72 | else: 73 | return None 74 | 75 | 76 | -------------------------------------------------------------------------------- /online-evaluation/baseline/recommendation_worker.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Build recommendations based on trained XGBoost model 3 | 4 | by Daniel Kohlsdorf 5 | ''' 6 | 7 | from model import * 8 | import xgboost as xgb 9 | import numpy as np 10 | 11 | TH = 0.9 12 | 13 | def classify_worker(item_ids, target_users, items, users, output_file, model): 14 | item_dict = {} 15 | with open(output_file, 'w') as fp: 16 | pos = 0 17 | average_score = 0.0 18 | num_evaluated = 0.0 19 | for u in target_users: 20 | data = [] 21 | ids = [] 22 | 23 | # build all (user, item) pair features based for this item 24 | for i in item_ids: 25 | x = Interaction(users[u], items[i], -1) 26 | if x.title_match() > 0: 27 | f = x.features() 28 | data += [f] 29 | ids += [i] 30 | 31 | if len(data) > 0: 32 | # predictions from XGBoost 33 | dtest = xgb.DMatrix(np.array(data)) 34 | ypred = model.predict(dtest) 35 | 36 | # compute average score 37 | average_score += sum(ypred) 38 | num_evaluated += float(len(ypred)) 39 | 40 | # use all items with a score above the given threshold and sort the result 41 | iids = sorted( 42 | [ 43 | (ids_j, ypred_j) for ypred_j, ids_j in zip(ypred, ids) if ypred_j > TH 44 | ], 45 | key = lambda x: -x[1] 46 | ) 47 | if(len(iids) > 0): 48 | iid = iids[0] 49 | if iid[0] not in item_dict: 50 | item_dict[iid[0]] = 
[]
51 |                     item_dict[iid[0]] += [u]
52 | 
53 |             # Every 10 users print some stats
54 |             if pos % 10 == 0:
55 |                 try:
56 |                     score = str(average_score / num_evaluated)
57 |                 except ZeroDivisionError:
58 |                     score = 0
59 |                 percentageDown = str(pos / float(len(target_users)))
60 |                 print(output_file + " " + str(percentageDown) + " " + str(score))
61 |             pos += 1
62 | 
63 |         for item_id in item_dict.keys():
64 |             if len(item_dict[item_id]) > 0:
65 |                 fp.write(str(item_id) + "\t")
66 |                 for user_id in item_dict[item_id][0:-1]:
67 |                     fp.write(str(user_id) + ",")
68 |                 fp.write(str(item_dict[item_id][-1]) + "\n")
69 |                 fp.flush()
70 | 
71 | 
--------------------------------------------------------------------------------
/online-evaluation/img/timeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/recsyschallenge/2017/0515d32417993b831915087a8a57da0f107557a5/online-evaluation/img/timeline.png
--------------------------------------------------------------------------------