├── README.md
├── api
│   ├── README.md
│   └── images
│       └── teams-access-tokens.png
├── baseline
│   ├── README.md
│   ├── model.py
│   ├── parser.py
│   ├── recommendation_worker.py
│   └── xgb.py
└── online-evaluation
    ├── README.md
    ├── baseline
    │   ├── Readme.md
    │   ├── model.py
    │   ├── online_schedule.py
    │   ├── parser.py
    │   └── recommendation_worker.py
    └── img
        └── timeline.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ACM RecSys Challenge 2017
2 | =========================
3 | 
4 | Pointers:
5 | 
6 | - [RecSys Challenge 2017](http://2017.recsyschallenge.com/)
7 |   + Details about the challenge: [http://2017.recsyschallenge.com](http://2017.recsyschallenge.com/)
8 |   + Submission System of the challenge: [https://recsys.xing.com](https://recsys.xing.com/)
9 |   + News: [@recsyschallenge](https://twitter.com/recsyschallenge)
10 |   + Paper: [Workshop Summary](https://www.overleaf.com/read/ftccwkttscxs)
11 | - [RecSys Challenge 2016](http://2016.recsyschallenge.com/)
12 | 
13 | 
--------------------------------------------------------------------------------
/api/README.md:
--------------------------------------------------------------------------------
1 | Submission System API (alpha version)
2 | =====================
3 | 
4 | The API allows teams to automate their recommender systems, in particular the process of
5 | downloading data (relevant for the online challenge) and uploading solutions.
6 | 
7 | Current list of endpoints:
8 | 
9 | - [Offline evaluation](#offline-evaluation)
10 |   + [GET /api/team](#get-apiteam)
11 |   + [POST /api/submission](#post-apisubmission)
12 | - [Online evaluation](#online-evaluation)
13 |   + [GET /api/online/data/status](#get-apionlinedatastatus)
14 |   + [GET /api/online/data/items](#get-apionlinedataitems)
15 |   + [GET /api/online/data/users](#get-apionlinedatausers)
16 |   + [GET /api/online/data/interactions](#get-apionlinedatainteractions)
17 |   + [POST /api/online/submission](#post-apionlinesubmission)
18 |   + [GET /api/online/submission](#get-apionlinesubmission)
19 | 
20 | ## Offline Evaluation
21 | 
22 | ### GET /api/team
23 | 
24 | Get team details.
25 | 
26 | #### Example request
27 | 
28 | ```
29 | curl -vv -XGET -H 'Authorization: Bearer RAtN...LTA1NjkyOGU5OTE5Mw==' 'https://recsys.xing.com/api/team'
30 | ```
31 | 
32 | Notes:
33 | 
34 | - `RAtN...LTA1NjkyOGU5OTE5Mw==` is the access token of the team. It can be generated on the [team details page (access token)](https://recsys.xing.com/team).
35 | 
36 | 
37 | #### Example response
38 | 
39 | ```javascript
40 | {
41 |   "name":"Data Rangers 2",
42 |   "remaining_submissions_today":17,
43 |   "submissions":[
44 |     {
45 |       "score":26100,
46 |       "rank":2,
47 |       "label":"test42 (time-decay)",
48 |       "submitted_at":"2017-03-11T00:15:34.000+01:00"
49 |     },
50 |     {
51 |       "score":10003,
52 |       "rank":14,
53 |       "label":"random testing",
54 |       "submitted_at":"2017-03-10T23:38:04.000+01:00"
55 |     },
56 |     ...
57 |   ]
58 | }
59 | ```
60 | 
61 | Notes:
62 | 
63 | - `name`: name of the team
64 | - `remaining_submissions_today`: number of submissions that the team can still do on the given day (CET timezone)
65 | - `submissions`: past submissions of the team
66 |   + `score`: the score that was achieved
67 |   + `rank`: the rank that the team achieved with the submission (at the time the submission was done)
68 |   + `label`: the (optional) label that the team passed when [submitting the solution](#post-apisubmission)
69 |   + `submitted_at`: timestamp when the solution was submitted
70 | - Response codes:
71 |   + `200` OK
72 |   + `401` Unauthorized (in case the access token is no longer valid or was not properly set in the header of the request)
73 | 
74 | ### POST /api/submission
75 | 
76 | Uploads a new solution for the team.
77 | 
78 | #### Example request
79 | 
80 | ```
81 | curl -vv -XPOST -H 'Authorization: Bearer RAtN...LTA1NjkyOGU5OTE5Mw==' 'https://recsys.xing.com/api/submission?label=test42%20(time-decay)' --data-binary @solution_file.csv
82 | ```
83 | 
84 | Notes:
85 | 
86 | - `RAtN...LTA1NjkyOGU5OTE5Mw==` is the access token of the team. It can be generated on the [team details page (access token)](https://recsys.xing.com/team).
87 | - `label`: optional label that should be assigned to the submission (won't be visible to other teams)
88 | - `solution_file.csv` is the actual solution file that should be uploaded. See: [Format instructions](https://recsys.xing.com/submission#instructions)
89 | 
90 | #### Example response
91 | 
92 | ```javascript
93 | {
94 |   "result": {
95 |     "score": 10004,
96 |     "rank": 13,
97 |     "label": "test42 (time-decay)",
98 |     "submitted_at": "2017-03-11T00:44:04.764+01:00"
99 |   },
100 |   "is_top_score": false,
101 |   "remaining_submissions_today": 18,
102 |   "lines_skipped": [1]
103 | }
104 | ```
105 | 
106 | Notes:
107 | 
108 | - `result`: the result that the uploaded solution achieved (relevant for the offline challenge)
109 |   + `score`: the score that was achieved
110 |   + `rank`: the rank that the score achieved with respect to the [current leaderboard](https://recsys.xing.com/leaders)
111 |   + `label`: the label that was assigned to the submission
112 |   + `submitted_at`: timestamp when the solution was submitted
113 | - `is_top_score`: boolean that indicates whether the current upload achieved an equal or higher score than the current best solution of the team
114 | - `remaining_submissions_today`: number of submissions that the team can still do on the given day (CET timezone)
115 | - `lines_skipped`: array of line numbers that were skipped / not processed
116 | - Response codes:
117 |   + `200` OK (currently, a 200 is also returned when no further submissions are possible according to `remaining_submissions_today`, i.e. check whether `remaining_submissions_today > 0` before submitting)
118 |   + `400` Bad Request (e.g. if the solution file could not be parsed properly)
119 |   + `401` Unauthorized (in case the access token is no longer valid or was not properly set in the header of the request)
120 | 
121 | 
122 | ## Online Evaluation
123 | 
124 | The APIs for the online evaluation (see also: [details about the online evaluation procedure](https://github.com/recsyschallenge/2017/tree/master/online-evaluation)) can only be used by those teams that are approved for the online challenge.
The API endpoints below again require an access token, which can be generated on the [team details page (access token)](https://recsys.xing.com/team) and is different from the token that was used for the offline challenge:
125 | 
126 | 
127 | 
128 | All requests need to pass the access token as part of the HTTP authorization header, e.g.:
129 | 
130 | ```
131 | curl -XGET -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/data/status'
132 | ```
133 | 
134 | `RGF0...A2OQ==` is the access token that can be (re-)generated on [recsys.xing.com/team](https://recsys.xing.com/team). The token is valid throughout the online challenge (as long as it is not regenerated by the owner of the team). If a team overloads the API then the token will be disabled.
135 | 
136 | ### GET /api/online/data/status
137 | 
138 | Once a day, a new dataset is released. We strongly encourage the teams to download the dataset only once per day. `GET /api/online/data/status` should be called to find out whether the new dataset is ready to be downloaded. The dataset will then contain:
139 | 
140 | - new items that should be pushed as recommendations to users (see: [GET /api/online/data/items](#get-apionlinedataitems))
141 | - new target users for your team (see: [GET /api/online/data/users](#get-apionlinedatausers))
142 | - updated interaction data (see: [GET /api/online/data/interactions](#get-apionlinedatainteractions))
143 | 
144 | 
145 | #### Example
146 | 
147 | Get status:
148 | 
149 | ```
150 | curl -XGET -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/data/status'
151 | ```
152 | 
153 | Response:
154 | 
155 | ```javascript
156 | {
157 |   "current": {
158 |     "num_items":10927,
159 |     "updated_at":"2017-03-29T04:20:23.000+02:00"
160 |   }
161 | }
162 | ```
163 | 
164 | Notes:
165 | 
166 | - Attributes:
167 |   + `num_items`: number of new items that can be recommended to the users for whom the team is responsible
168 |   + `updated_at`: timestamp when the data was exported so that it can be downloaded
169 | - Teams should download the data only once per day. `GET /api/online/data/status` can be polled e.g. every 5 minutes to find out when the new data is available (as soon as `updated_at` corresponds to the current date, teams can download the new datasets).
170 | 
171 | 
172 | ### GET /api/online/data/items
173 | 
174 | Download the new target items (including details) that teams can push as recommendations to their users. Those target items will be _new_ job postings that were typically published on xing.com within the last 24 hours.
175 | 
176 | #### Example
177 | 
178 | Download target items:
179 | 
180 | ```
181 | curl -XGET -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/data/items'
182 | ```
183 | 
184 | Response is a tab-separated text file:
185 | 
186 | ```
187 | id title career_level discipline_id industry_id country is_payed region latitude longitude employment tags created_at
188 | 30 3 15 9 de 0 7 50.8 8.9 1 3323880,4078788,1025029,88816,1032508,2922746,2282775,878709 1450216800
189 | 70 2994300,665762,901938,127655,3680343 3 5 18 de 0 1 48.7 9.0 1 127655,3680343,366525,2114599,901938,2994300,665762,427470,3284568,3175069,792161,3671787,1023516 1475791200
190 | ...
191 | ```
192 | 
193 | Notes:
194 | 
195 | - the format of the file corresponds to that of the _items.csv_ file that was used during the offline challenge
196 | - for a description of the columns, see: [Dataset / Items](http://2017.recsyschallenge.com/#dataset-items)
197 | 
198 | 
199 | ### GET /api/online/data/users
200 | 
201 | Get the list of target user IDs for the given day. Each day, the team will get a new list of target users to whom they are allowed to push the new items as recommendations. The target users do not overlap between the teams, i.e. each team will get a dedicated list of target users.
202 | 
203 | #### Example
204 | 
205 | Get target user IDs:
206 | 
207 | ```
208 | curl -XGET -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/data/users'
209 | ```
210 | 
211 | Example response:
212 | 
213 | ```
214 | 726
215 | 172
216 | 98
217 | 1009
218 | ...
219 | ```
220 | 
221 | Notes:
222 | 
223 | - each line features one target user ID
224 | - the target user IDs refer to the users that are part of the dataset that is released to the teams at the beginning of the online challenge
225 | 
226 | 
227 | ### GET /api/online/data/interactions
228 | 
229 | Download the updated interaction data. The tab-separated file (format see: [Dataset / Interactions](http://2017.recsyschallenge.com/#dataset-interactions)) will also contain the interaction data from the last day (for which a team may already have pushed recommendations).
230 | 
231 | #### Example
232 | 
233 | Example request:
234 | 
235 | ```
236 | curl -XGET -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/data/interactions'
237 | ```
238 | 
239 | Response is a tab-separated text file:
240 | 
241 | ```
242 | user_id item_id interaction_type created_at
243 | 2082156 80 1 1484299172
244 | 1934123 140 1 1486388563
245 | 1320213 240 1 1479409825
246 | 297303 310 1 1484817366
247 | 1635596 310 1 1486370081
248 | 857319 340 1 1485121421
249 | ```
250 | 
251 | Format: [Dataset / Interactions](http://2017.recsyschallenge.com/#dataset-interactions)
252 | 
253 | 
254 | ### POST /api/online/submission
255 | 
256 | Upload recommendations for the [target users](#get-apionlinedatausers) of the team. The format of the submission is the same as the format of the solutions during the offline challenge:
257 | 
258 | ```
259 | item_id user_ids
260 | 762817 46271, 18262, 81725, 18236, 49762, 61726
261 | 172 7612, 65182
262 | 67816 78161, 97615, 89010, 176241, 18651, 49785, 59872, 67
263 | ...
264 | ```
265 | 
266 | Notes:
267 | 
268 | - `item_id` and `user_ids` have to be separated by tab or whitespace (the first number is interpreted as the item ID and all following IDs are interpreted as the IDs of users/candidates to whom the item should be recommended).
269 |   + `item_id` has to be one item from the current items, see: [GET /api/online/data/items](#get-apionlinedataitems)
270 |   + `user_ids` have to be IDs of users that are part of the current target users of the team, see: [GET /api/online/data/users](#get-apionlinedatausers)
271 | - `user_ids` do not need to be sorted
272 | - For each item, we accept at maximum 250 users (we cut off after 250 users).
273 | - Important differences from the offline evaluation:
274 |   + each user may receive at maximum 1 recommendation per day.
275 |   + Hence, the submission file must not contain a user twice.
276 |   + We strongly encourage the teams to not spam users with recommendations (i.e. no best guesses, no randomness, etc.).
It is perfectly fine to not recommend items to all of the current target users.
277 |   + Background: users will receive push notifications once a recommendation is pushed to them and we do not want to bother the users more than once per day. You should thus wisely decide whether you want to push an item as a recommendation and, if so, which item.
278 | - `POST /api/online/submission` can be called more than once per day (at maximum 1000 times per day), i.e. recommendations can be submitted in smaller batches. However, the first recommendation that is pushed to a user _wins_, i.e. if a first file already contained a row such as `42 78,728,97` then the following submissions should not contain the user IDs `78,728,97` because those users already got their _recommendation of the day_ (additional recommendations to those users on the same day will be neglected).
279 | 
280 | #### Example
281 | 
282 | Example request:
283 | 
284 | ```
285 | curl -XPOST -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/submission' --data-binary @submission-2017-05-01.csv
286 | ```
287 | 
288 | Example response:
289 | 
290 | ```
291 | {"num_accepted_recos": 17827}
292 | ```
293 | 
294 | `num_accepted_recos` is the number of user-item pairs that were successfully accepted and sent to XING's push recommendation component.
295 | 
296 | ### GET /api/online/submission
297 | 
298 | Download the submission of the team for the current day. This endpoint mainly allows teams to cross-check which of the recommendations that were submitted via [POST /api/online/submission](#post-apionlinesubmission) were accepted and have been further processed by XING's push recommendation component.
299 | 
300 | #### Example
301 | 
302 | Example request:
303 | 
304 | ```
305 | curl -XGET -H 'Authorization: Bearer RGF0...A2OQ==' 'https://recsys.xing.com/api/online/submission'
306 | ```
307 | 
308 | Example response:
309 | 
310 | ```
311 | 854173 52381
312 | 816205 71625, 17353, 881
313 | 701204 162, 91826
314 | 712783 928361
315 | 773903 1023
316 | 654400 837361
317 | ...
318 | ```
319 | 
--------------------------------------------------------------------------------
/api/images/teams-access-tokens.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/recsyschallenge/2017/0515d32417993b831915087a8a57da0f107557a5/api/images/teams-access-tokens.png
--------------------------------------------------------------------------------
/baseline/README.md:
--------------------------------------------------------------------------------
1 | Baseline
2 | ========
3 | 
4 | !!!!!! DELETE THE HEADER IN THE TARGET FILES !!!!!!!
5 | ========
6 | 
7 | This is the simple baseline that creates the sample_solution.csv file.
8 | The baseline system extracts features from interacted user-item pairs
9 | and "learns" if a user will interact positively with an item.
10 | The underlying learning algorithm is [XGBoost](https://xgboost.readthedocs.io/en/latest/),
11 | a tree ensembling method that most winning teams used last year.
12 | 
13 | The features are:
14 | + number of matches in title ids [Int]
15 | + discipline matches [0, 1]
16 | + career level matches [0, 1]
17 | + industry matches [0, 1]
18 | + country match [0, 1]
19 | + region match [0, 1]
20 | 
21 | Files:
22 | + The data model along with the feature extraction can be found in `model.py`.
23 | + Parsing the data is performed in `parser.py`.
24 | + A parallel prediction algorithm is given in `recommendation_worker.py`.
25 |   The recommendation worker iterates over the target items and users; whenever
26 |   there is at least one title match, a score is predicted for each candidate user.
27 | + The main training and testing is performed in `xgb.py`.
28 | 
29 | In order to build the baseline, XGBoost has to be installed with the Python bindings.
30 | 
--------------------------------------------------------------------------------
/baseline/model.py:
--------------------------------------------------------------------------------
1 | '''
2 | Modeling users, interactions and items from
3 | the recsys challenge 2017.
4 | 
5 | by Daniel Kohlsdorf
6 | '''
7 | 
8 | class User:
9 | 
10 |     def __init__(self, title, clevel, indus, disc, country, region):
11 |         self.title = title
12 |         self.clevel = clevel
13 |         self.indus = indus
14 |         self.disc = disc
15 |         self.country = country
16 |         self.region = region
17 | 
18 | class Item:
19 | 
20 |     def __init__(self, title, clevel, indus, disc, country, region):
21 |         self.title = title
22 |         self.clevel = clevel
23 |         self.indus = indus
24 |         self.disc = disc
25 |         self.country = country
26 |         self.region = region
27 | 
28 | class Interaction:
29 | 
30 |     def __init__(self, user, item, interaction_type):
31 |         self.user = user
32 |         self.item = item
33 |         self.interaction_type = interaction_type
34 | 
35 |     def title_match(self):
36 |         return float(len(set(self.user.title).intersection(set(self.item.title))))
37 | 
38 |     def clevel_match(self):
39 |         if self.user.clevel == self.item.clevel:
40 |             return 1.0
41 |         else:
42 |             return 0.0
43 | 
44 |     def indus_match(self):
45 |         if self.user.indus == self.item.indus:
46 |             return 1.0
47 |         else:
48 |             return 0.0
49 | 
50 |     def discipline_match(self):
51 |         if self.user.disc == self.item.disc:
52 |             return 2.0
53 |         else:
54 |             return 0.0
55 | 
56 |     def country_match(self):
57 |         if self.user.country == self.item.country:
58 |             return 1.0
59 |         else:
60 |             return 0.0
61 | 
62 |     def region_match(self):
63 |         if self.user.region == self.item.region:
64 |             return 1.0
65 |         else:
66 |             return 0.0
67 | 
68 |     def features(self):
69 |         return [
70 |             self.title_match(), self.clevel_match(), self.indus_match(),
71 |             self.discipline_match(), self.country_match(), self.region_match()
72 |         ]
73 | 
74 |     def label(self):
75 |         if self.interaction_type == 4:
76 |             return 0.0
77 |         else:
78 |             return 1.0
79 | 
80 | 
81 | 
--------------------------------------------------------------------------------
/baseline/parser.py:
--------------------------------------------------------------------------------
1 | '''
2 | Parsing the ACM Recsys Challenge 2017 data into interactions,
3 | items and user models.
4 | 
5 | by Daniel Kohlsdorf
6 | '''
7 | 
8 | from model import *
9 | 
10 | 
11 | def is_header(line):
12 |     return "recsyschallenge" in line
13 | 
14 | 
15 | def process_header(header):
16 |     x = {}
17 |     pos = 0
18 |     for name in header:
19 |         x[name.split(".")[1]] = pos
20 |         pos += 1
21 |     return x
22 | 
23 | 
24 | def select(from_file, where, toObject, index):
25 |     header = None
26 |     data = {}
27 |     i = 0
28 |     for line in open(from_file):
29 |         if is_header(line):
30 |             header = process_header(line.strip().split("\t"))
31 |         else:
32 |             cmp = line.strip().split("\t")
33 |             if where(cmp):
34 |                 obj = toObject(cmp, header)
35 |                 if obj != None:
36 |                     data[index(cmp)] = obj
37 |         i += 1
38 |         if i % 100000 == 0:
39 |             print("... reading line " + str(i) + " from file " + from_file)
40 |     return (header, data)
41 | 
42 | 
43 | def build_user(str_user, names):
44 |     return User(
45 |         [int(x) for x in str_user[names["jobroles"]].split(",") if len(x) > 0],
46 |         int(str_user[names["career_level"]]),
47 |         int(str_user[names["industry_id"]]),
48 |         int(str_user[names["discipline_id"]]),
49 |         str_user[names["country"]],
50 |         str_user[names["region"]]
51 |     )
52 | 
53 | 
54 | def build_item(str_item, names):
55 |     return Item(
56 |         [int(x) for x in str_item[names["title"]].split(",") if len(x) > 0],
57 |         int(str_item[names["career_level"]]),
58 |         int(str_item[names["industry_id"]]),
59 |         int(str_item[names["discipline_id"]]),
60 |         str_item[names["country"]],
61 |         str_item[names["region"]]
62 |     )
63 | 
64 | 
65 | class InteractionBuilder:
66 | 
67 |     def __init__(self, user_dict, item_dict):
68 |         self.user_dict = user_dict
69 |         self.item_dict = item_dict
70 | 
71 |     def build_interaction(self, str_inter, names):
72 |         if int(str_inter[names['item_id']]) in self.item_dict and int(str_inter[names['user_id']]) in self.user_dict:
73 |             return Interaction(
74 |                 self.user_dict[int(str_inter[names['user_id']])],
75 |                 self.item_dict[int(str_inter[names['item_id']])],
76 |                 int(str_inter[names["interaction_type"]])
77 |             )
78 |         else:
79 |             return None
80 | 
81 | 
82 | 
--------------------------------------------------------------------------------
/baseline/recommendation_worker.py:
--------------------------------------------------------------------------------
1 | '''
2 | Build recommendations based on trained XGBoost model
3 | 
4 | by Daniel Kohlsdorf
5 | '''
6 | 
7 | from model import *
8 | import xgboost as xgb
9 | import numpy as np
10 | 
11 | TH = 0.8
12 | 
13 | def classify_worker(item_ids, target_users, items, users, output_file, model):
14 |     with open(output_file, 'w') as fp:
15 |         pos = 0
16 |         average_score = 0.0
17 |         num_evaluated = 0.0
18 |         for i in item_ids:
19 |             data = []
20 |             ids = []
21 | 
22 |             # build the features of all (user, item) pairs for this item
23 |             for u in target_users:
24 |                 x = Interaction(users[u], items[i], -1)
25 |                 if x.title_match() > 0:
26 |                     f = x.features()
27 |                     data += [f]
28 |                     ids += [u]
29 | 
30 |             if len(data) > 0:
31 |                 # predictions from XGBoost
32 |                 dtest = xgb.DMatrix(np.array(data))
33 |                 ypred = model.predict(dtest)
34 | 
35 |                 # compute average score
36 |                 average_score += sum(ypred)
37 |                 num_evaluated += float(len(ypred))
38 | 
39 |                 # keep the users with a score above the given threshold, sorted by descending score
40 |                 user_ids = sorted(
41 |                     [
42 |                         (ids_j, ypred_j) for ypred_j, ids_j in zip(ypred, ids) if ypred_j > TH
43 |                     ],
44 |                     key = lambda x: -x[1]
45 |                 )[0:99]
46 | 
47 |                 # write the results to file: the item id followed by the comma-separated user ids
48 |                 if len(user_ids) > 0:
49 |                     item_id = str(i) + "\t"
50 |                     fp.write(item_id)
51 |                     for j in range(0, len(user_ids) - 1):
52 |                         user_id = str(user_ids[j][0]) + ","
53 |                         fp.write(user_id)
54 |                     fp.write(str(user_ids[-1][0]) + "\n")
55 |                     fp.flush()
56 | 
57 |             # Every 100 items print some stats
58 |             if pos % 100 == 0:
59 |                 try:
60 |                     score = str(average_score / num_evaluated)
61 |                 except ZeroDivisionError:
62 |                     score = 0
63 |                 percentageDown = str(pos / float(len(item_ids)))
64 |                 print(output_file + " " + percentageDown + " " + score)
65 |             pos += 1
--------------------------------------------------------------------------------
/baseline/xgb.py:
--------------------------------------------------------------------------------
1 | '''
2 | Baseline solution for the ACM Recsys Challenge 2017
3 | using XGBoost
4 | 
5 | by Daniel Kohlsdorf
6 | '''
7 | 
8 | import xgboost as xgb
9 | import numpy as np
10 | import multiprocessing
11 | 
12 | from model import *
13 | from parser import *
14 | from recommendation_worker import *
15 | import random
16 | 
17 | print(" --- Recsys Challenge 2017 Baseline --- ")
18 | 
19 | N_WORKERS = 5
20 | USERS_FILE = "users.csv"
21 | ITEMS_FILE = "items.csv"
22 | INTERACTIONS_FILE = "interactions.csv"
23 | TARGET_USERS = "targetUsers.csv"
24 | TARGET_ITEMS = "targetItems.csv"
25 | 
26 | 
27 | '''
28 |     1) Parse the challenge data and
29 |        exclude all impressions
30 | '''
31 | (header_users, users) = select(USERS_FILE, lambda x: True, build_user, lambda x: int(x[0]))
32 | (header_items, items) = select(ITEMS_FILE, lambda x: True, build_item, lambda x: int(x[0]))
33 | 
34 | builder = InteractionBuilder(users, items)
35 | (header_interactions, interactions) = select(
36 |     INTERACTIONS_FILE,
37 |     lambda x: x[2] != '0',
38 |     builder.build_interaction,
39 |     lambda x: (int(x[0]), int(x[1]))
40 | )
41 | 
42 | 
43 | '''
44 |     2) Build recsys training data
45 | '''
46 | data = np.array([interactions[key].features() for key in interactions.keys()])
47 | labels = np.array([interactions[key].label() for key in interactions.keys()])
48 | dataset = xgb.DMatrix(data, label=labels)
49 | dataset.save_binary("recsys2017.buffer")
50 | 
51 | 
52 | '''
53 |     3) Train XGBoost regression model with maximum tree depth of 2 and 25 trees
54 | '''
55 | evallist = [(dataset, 'train')]
56 | param = {'bst:max_depth': 2, 'bst:eta': 0.1, 'silent': 1, 'objective': 'reg:linear' }
57 | param['nthread'] = 4
58 | param['eval_metric'] = 'rmse'
59 | param['base_score'] = 0.0
60 | num_round = 25
61 | bst = xgb.train(param, dataset, num_round, evallist)
62 | bst.save_model('recsys2017.model')
63 | 
64 | 
65 | '''
66 |     4) Create target sets for items and users
67 | '''
68 | target_users = []
69 | for n, line in enumerate(open(TARGET_USERS)):
70 |     # there is a header in target_users in the dataset
71 |     if n == 0:
72 |         continue
73 |     target_users += [int(line.strip())]
74 | target_users = set(target_users)
75 | 
76 | target_items = []
77 | for line in open(TARGET_ITEMS):
78 |     target_items += [int(line.strip())]
79 | 
80 | 
81 | '''
82 |     5) Schedule classification
83 | '''
84 | bucket_size = (len(target_items) + N_WORKERS - 1) // N_WORKERS  # integer ceiling so that no target items are dropped
85 | start = 0
86 | jobs = []
87 | for i in range(0, N_WORKERS):
88 |     stop = int(min(len(target_items), start + bucket_size))
89 |     filename = "solution_" + str(i) + ".csv"
90 |     process = multiprocessing.Process(target = classify_worker, args=(target_items[start:stop], target_users, items, users, filename, bst))
91 |     jobs.append(process)
92 |     start = stop
93 | 
94 | for j in jobs:
95 |     j.start()
96 | 
97 | for j in jobs:
98 |     j.join()
99 | 
100 | 
--------------------------------------------------------------------------------
/online-evaluation/README.md:
--------------------------------------------------------------------------------
1 | Online Evaluation
2 | =====================
3 | 
4 | ![Recsys2017 Timeline](img/timeline.png)
5 | 
6 | The online evaluation is set up as follows. The goal of each team is the same as during the _offline challenge_: given a new item (job posting), identify those users that are interested in the job and that are at the same time also of interest to the recruiter who is associated with the posting.
7 | 
8 | - `X_0`: Some days before the online evaluation challenge starts, a new dataset will be released for the participating teams:
9 |   + users: teams will be presented with a set of users and their profile information at the beginning of the challenge (cf. [users.csv](http://2017.recsyschallenge.com/#dataset-users)).
10 |     - This set stays valid throughout the whole online evaluation period (until `X_end`).
11 |   + items: details about items that those users recently interacted with (cf. [items.csv](http://2017.recsyschallenge.com/#dataset-items)).
12 |   + interactions: the interactions that those users performed recently (cf. [interactions.csv](http://2017.recsyschallenge.com/#dataset-interactions))
13 | - Each day the teams then receive...
14 |   + a set of target users = those user IDs to whom the team can recommend new items (cf. [targetUsers.csv](http://2017.recsyschallenge.com/#dataset-targets))
15 |   + the new items which can be recommended to the target users. The format of the item description is the same as during the offline evaluation: [items.csv](http://2017.recsyschallenge.com/#dataset-items)
16 |   + updated interactions: the interactions that have been collected during the previous day (cf. [interactions.csv](http://2017.recsyschallenge.com/#dataset-interactions))
17 |     - The impressions/interactions will also include entries that have been triggered by other teams or by XING's search and recommendation systems.
18 | - `X_t-1` till `X_t`: Within these 24 hours, teams can submit their solution files (columns: `item_id`, `user_ids`). Here, the following restrictions hold:
19 |   + Teams will be allowed to
20 |     submit each user id from their target list at maximum one time (a minimal sketch of how to enforce this is shown right after this list).
21 |   + It is ok if the team chooses to not play out a recommendation for a given target user (notice that rolling out a recommendation that only receives negative feedback will result in a negative score, see: [Evaluation Metrics](http://2017.recsyschallenge.com/#evaluation)).
22 |   + If the team submits a posting older than 24 hours, a user
23 |     not in their target list or a user they already sent a recommendation to that
24 |     day, the system will ignore that part of the submission.
25 | - `X_w`: the score will be calculated on all item-user pairs a team submitted (given that the above mentioned restrictions were not violated).
26 |   + Submitted recommendations can be interacted with for one week.
27 |   + Afterwards, the interactions of a user with that item do not
28 |     count towards the final score.
29 | - `X_res`: The winning team is the one that achieved the highest
30 |   sum of their two best scoring weeks. The winner of the challenge will be announced one week after
31 |   the last submission slot.
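
The one-recommendation-per-user-per-day rule above is easy to get wrong when a team uploads its submission in several batches. The following is a minimal sketch (not part of the official challenge code) of how a team might deduplicate a daily submission before uploading; the names `build_daily_submission`, `scores`, and `already_submitted` are illustrative assumptions:

```python
from collections import defaultdict

def build_daily_submission(scores, target_users, target_items, already_submitted):
    """scores: dict mapping (user_id, item_id) -> predicted relevance score."""
    best = {}  # user_id -> (score, item_id): keep at most one item per user
    for (user_id, item_id), score in scores.items():
        if user_id not in target_users or item_id not in target_items:
            continue  # the server ignores users/items outside the current daily lists
        if user_id in already_submitted:
            continue  # the first recommendation of the day wins, so skip these users
        if user_id not in best or score > best[user_id][0]:
            best[user_id] = (score, item_id)

    # group the chosen (user, item) pairs by item: one line per item, users comma-separated
    by_item = defaultdict(list)
    for user_id, (_, item_id) in best.items():
        by_item[item_id].append(user_id)

    return "\n".join(
        str(item_id) + "\t" + ",".join(str(u) for u in user_ids)
        for item_id, user_ids in by_item.items()
    )

# usage sketch: POST the returned string to /api/online/submission (see the API README),
# then add the submitted user ids to `already_submitted` for the rest of the day.
```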
32 | 
33 | 
34 | ## Differences to Offline Evaluation
35 | 
36 | In contrast to the offline evaluation:
37 | 
38 | - there is a fixed window of one day in which the recommendations for the target users and target items have to be submitted
39 | - teams can only recommend to a user once per day
40 | - the final leaderboard score is not calculated over all submissions but over the two best weeks
41 | 
42 | 
43 | 
44 | 
45 | 
--------------------------------------------------------------------------------
/online-evaluation/baseline/Readme.md:
--------------------------------------------------------------------------------
1 | # Baseline Example for the Online Challenge
2 | 
3 | The baseline script will periodically check the
4 | RecSys online API for updates on the daily data,
5 | download the new data, compute a solution
6 | and then submit it to the system.
7 | 
8 | The frequency of checks is every 10 minutes.
9 | Once a solution is computed the script will not perform
10 | another pass that day. The example only computes
11 | the recommendations based on the XGBoost solution
12 | from the offline challenge's baseline, so you have to
13 | put your own solution in there.
14 | 
15 | The files 'parser.py' and 'model.py' are the same as
16 | for the offline challenge. The worker script 'recommendation_worker.py'
17 | is rewritten to match the online challenge.
18 | 
19 | The main script to execute and schedule a solution is 'online_schedule.py'.
20 | You need the following libraries:
21 | 
22 | + httplib2
23 | + xgboost
24 | + numpy
25 | 
26 | I tested this with Python 3.
27 | In order to run the test you have to grab the user data from:
28 | 
29 | https://recsys.xing.com/data/online
30 | 
31 | And put the users.csv into the data folder.
32 | 
33 | Train an XGBoost model using the offline challenge's scripts:
34 | https://github.com/recsyschallenge/2017/tree/master/baseline
35 | 
36 | And copy the XGBoost model to 'data/recsys2017.model'.
37 | 
38 | Then you will be set to go :)
39 | 
--------------------------------------------------------------------------------
/online-evaluation/baseline/model.py:
--------------------------------------------------------------------------------
1 | '''
2 | Modeling users, interactions and items from
3 | the recsys challenge 2017.
4 | 5 | by Daniel Kohlsdorf 6 | ''' 7 | 8 | class User: 9 | 10 | def __init__(self, title, clevel, indus, disc, country, region): 11 | self.title = title 12 | self.clevel = clevel 13 | self.indus = indus 14 | self.disc = disc 15 | self.country = country 16 | self.region = region 17 | 18 | class Item: 19 | 20 | def __init__(self, title, clevel, indus, disc, country, region): 21 | self.title = title 22 | self.clevel = clevel 23 | self.indus = indus 24 | self.disc = disc 25 | self.country = country 26 | self.region = region 27 | 28 | class Interaction: 29 | 30 | def __init__(self, user, item, interaction_type): 31 | self.user = user 32 | self.item = item 33 | self.interaction_type = interaction_type 34 | 35 | def title_match(self): 36 | return float(len(set(self.user.title).intersection(set(self.item.title)))) 37 | 38 | def clevel_match(self): 39 | if self.user.clevel == self.item.clevel: 40 | return 1.0 41 | else: 42 | return 0.0 43 | 44 | def indus_match(self): 45 | if self.user.indus == self.item.indus: 46 | return 1.0 47 | else: 48 | return 0.0 49 | 50 | def discipline_match(self): 51 | if self.user.disc == self.item.disc: 52 | return 2.0 53 | else: 54 | return 0.0 55 | 56 | def country_match(self): 57 | if self.user.country == self.item.country: 58 | return 1.0 59 | else: 60 | return 0.0 61 | 62 | def region_match(self): 63 | if self.user.region == self.item.region: 64 | return 1.0 65 | else: 66 | return 0.0 67 | 68 | def features(self): 69 | return [ 70 | self.title_match(), self.clevel_match(), self.indus_match(), 71 | self.discipline_match(), self.country_match(), self.region_match() 72 | ] 73 | 74 | def label(self): 75 | if self.interaction_type == 4: 76 | return 0.0 77 | else: 78 | return 1.0 79 | 80 | 81 | -------------------------------------------------------------------------------- /online-evaluation/baseline/online_schedule.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Online example 3 | 4 | Uses the offline mode to make predictions 5 | for the online challenge. 
6 | 7 | by Daniel Kohlsdorf 8 | ''' 9 | import time 10 | import httplib2 11 | import json 12 | from dateutil.parser import parse 13 | import datetime 14 | import parser 15 | from recommendation_worker import * 16 | 17 | TMP_ITEMS = "data/current_items.csv" 18 | TMP_SOLUTION = "data/current_solution.csv" 19 | 20 | MODEL = "data/recsys2017.model" # Model from offline training 21 | USERS_FILE = "data/users.csv" # Online user data 22 | 23 | TOKEN = "WElORyB...TgwNmYtMGJiZGYwOTNkZWY2" # your key 24 | SERVER = "http://recsys.xing.com" 25 | 26 | def header(token): 27 | return {"Authorization" : "Bearer " + token} 28 | 29 | def post_url(server): 30 | return server + "/api/online/submission" 31 | 32 | def status_url(server): 33 | return server + "/api/online/data/status" 34 | 35 | def users_url(server): 36 | return server + "/api/online/data/users" 37 | 38 | def items_url(server): 39 | return server + "/api/online/data/items" 40 | 41 | def get_stats(): 42 | http = httplib2.Http() 43 | content = http.request(status_url(SERVER), method="GET", headers=header(TOKEN))[1].decode("utf-8") 44 | response = json.loads(content) 45 | return parse(response['current']['updated_at']) 46 | 47 | def is_ready(): 48 | return get_stats().date() == datetime.date.today() 49 | 50 | def download_items(): 51 | http = httplib2.Http() 52 | content = http.request(items_url(SERVER), method="GET", headers=header(TOKEN))[1].decode("utf-8") 53 | fp = open(TMP_ITEMS, "w") 54 | fp.write(content) 55 | fp.close() 56 | return parser.select(TMP_ITEMS, lambda x: True, parser.build_item, lambda x: int(x[0])) 57 | 58 | def user_info(user_ids): 59 | return parser.select( 60 | USERS_FILE, 61 | lambda x: int(x[0]) in user_ids and "NULL" not in x, 62 | parser.build_user, 63 | lambda x: int(x[0]) 64 | ) 65 | 66 | def download_target_users(): 67 | http = httplib2.Http() 68 | content = http.request(users_url(SERVER), method="GET", headers=header(TOKEN))[1].decode("utf-8") 69 | user_ids = set([int(uid) for uid in content.split("\n") if len(uid) > 0]) 70 | return user_info(user_ids) 71 | 72 | def process(): 73 | (_, users) = download_target_users() 74 | (_, items) = download_items() 75 | target_users = list(users.keys()) 76 | target_items = list(items.keys()) 77 | filename = TMP_SOLUTION 78 | model = xgb.Booster({'nthread':1}) 79 | model.load_model(MODEL) 80 | classify_worker(target_items, target_users, items, users, filename, model) 81 | 82 | def submit(): 83 | http = httplib2.Http() 84 | filename = TMP_SOLUTION 85 | with open(filename, 'r') as content_file: 86 | content = content_file.read() 87 | response = http.request(post_url(SERVER), method="POST", body=content, 88 | headers=header(TOKEN) 89 | )[1].decode("utf-8") 90 | print("SUBMIT: " + filename + " " + response) 91 | 92 | if __name__ == "__main__": 93 | last_submit = None 94 | while True: 95 | if is_ready() and last_submit != datetime.date.today(): 96 | process() 97 | last_submit = datetime.date.today() 98 | submit() 99 | else: 100 | print("Not ready yet: " + str(datetime.date.today())) 101 | time.sleep(600) 102 | -------------------------------------------------------------------------------- /online-evaluation/baseline/parser.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Parsing the ACM Recsys Challenge 2017 data into interactions, 3 | items and user models. 
4 | 5 | by Daniel Kohlsdorf 6 | ''' 7 | 8 | from model import * 9 | 10 | def is_header(line): 11 | return "recsyschallenge" in line 12 | 13 | def process_header(header): 14 | x = {} 15 | pos = 0 16 | for name in header: 17 | x[name.split(".")[1]] = pos 18 | pos += 1 19 | return x 20 | 21 | def select(from_file, where, toObject, index): 22 | header = None 23 | data = {} 24 | i = 0 25 | for line in open(from_file): 26 | if is_header(line): 27 | header = process_header(line.strip().split("\t")) 28 | elif len(line.strip()) > 0 and header != None: 29 | cmp = line.strip().split("\t") 30 | if where(cmp) and len(cmp) == len(header): 31 | obj = toObject(cmp, header) 32 | if obj != None: 33 | data[index(cmp)] = obj 34 | i += 1 35 | if i % 100000 == 0: 36 | print("... reading line " + str(i) + " from file " + from_file) 37 | return(header, data) 38 | 39 | def build_user(str_user, names): 40 | return User( 41 | [int(x) for x in str_user[names["jobroles"]].split(",") if len(x) > 0], 42 | int(str_user[names["career_level"]]), 43 | int(str_user[names["industry_id"]]), 44 | int(str_user[names["discipline_id"]]), 45 | str_user[names["country"]], 46 | str_user[names["region"]] 47 | ) 48 | 49 | def build_item(str_item, names): 50 | return Item( 51 | [int(x) for x in str_item[names["title"]].split(",") if len(x) > 0], 52 | int(str_item[names["career_level"]]), 53 | int(str_item[names["industry_id"]]), 54 | int(str_item[names["discipline_id"]]), 55 | str_item[names["country"]], 56 | str_item[names["region"]] 57 | ) 58 | 59 | class InteractionBuilder: 60 | 61 | def __init__(self, user_dict, item_dict): 62 | self.user_dict = user_dict 63 | self.item_dict = item_dict 64 | 65 | def build_interaction(self, str_inter, names): 66 | if int(str_inter[names['item_id']]) in self.item_dict and int(str_inter[names['user_id']]) in self.user_dict: 67 | return Interaction( 68 | self.user_dict[int(str_inter[names['user_id']])], 69 | self.item_dict[int(str_inter[names['item_id']])], 70 | int(str_inter[names["interaction_type"]]) 71 | ) 72 | else: 73 | return None 74 | 75 | 76 | -------------------------------------------------------------------------------- /online-evaluation/baseline/recommendation_worker.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Build recommendations based on trained XGBoost model 3 | 4 | by Daniel Kohlsdorf 5 | ''' 6 | 7 | from model import * 8 | import xgboost as xgb 9 | import numpy as np 10 | 11 | TH = 0.9 12 | 13 | def classify_worker(item_ids, target_users, items, users, output_file, model): 14 | item_dict = {} 15 | with open(output_file, 'w') as fp: 16 | pos = 0 17 | average_score = 0.0 18 | num_evaluated = 0.0 19 | for u in target_users: 20 | data = [] 21 | ids = [] 22 | 23 | # build all (user, item) pair features based for this item 24 | for i in item_ids: 25 | x = Interaction(users[u], items[i], -1) 26 | if x.title_match() > 0: 27 | f = x.features() 28 | data += [f] 29 | ids += [i] 30 | 31 | if len(data) > 0: 32 | # predictions from XGBoost 33 | dtest = xgb.DMatrix(np.array(data)) 34 | ypred = model.predict(dtest) 35 | 36 | # compute average score 37 | average_score += sum(ypred) 38 | num_evaluated += float(len(ypred)) 39 | 40 | # use all items with a score above the given threshold and sort the result 41 | iids = sorted( 42 | [ 43 | (ids_j, ypred_j) for ypred_j, ids_j in zip(ypred, ids) if ypred_j > TH 44 | ], 45 | key = lambda x: -x[1] 46 | ) 47 | if(len(iids) > 0): 48 | iid = iids[0] 49 | if iid[0] not in item_dict: 50 | item_dict[iid[0]] = 
[]
51 |                     item_dict[iid[0]] += [u]
52 | 
53 |             # Every 10 users print some stats
54 |             if pos % 10 == 0:
55 |                 try:
56 |                     score = str(average_score / num_evaluated)
57 |                 except ZeroDivisionError:
58 |                     score = 0
59 |                 percentageDown = str(pos / float(len(target_users)))
60 |                 print(output_file + " " + str(percentageDown) + " " + str(score))
61 |             pos += 1
62 | 
63 |         for item_id in item_dict.keys():
64 |             if len(item_dict[item_id]) > 0:
65 |                 fp.write(str(item_id) + "\t")
66 |                 for user_id in item_dict[item_id][0:-1]:
67 |                     fp.write(str(user_id) + ",")
68 |                 fp.write(str(item_dict[item_id][-1]) + "\n")
69 |                 fp.flush()
70 | 
71 | 
--------------------------------------------------------------------------------
/online-evaluation/img/timeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/recsyschallenge/2017/0515d32417993b831915087a8a57da0f107557a5/online-evaluation/img/timeline.png
--------------------------------------------------------------------------------