├── .gitignore ├── README.md ├── baseline ├── README.md ├── baseline.py └── create_input_files.py ├── data_extraction ├── Makefile ├── README.md ├── lists │ ├── cc-match.tsv │ ├── domains-official.txt │ ├── subreddits-official.txt │ ├── test-multiref.hashes │ ├── test-multiref.sets │ ├── test.hashes │ └── valid.hashes ├── requirements.txt ├── src │ ├── Makefile.official │ ├── Makefile.official.targets │ ├── commoncrawl.py │ ├── create_official_data.py │ └── ids2refs.py └── trial │ ├── README.md │ ├── lists │ ├── data-trial.txt │ ├── domains.txt │ └── subreddits.txt │ └── src │ ├── Makefile.trial │ ├── create_trial_data.py │ ├── create_trial_data.sh │ └── requirements.txt ├── doc ├── DSTC7_task2.pdf └── proposal.pdf └── evaluation ├── LICENSE.txt ├── README.md ├── dstc7-task2-human_eval.tgz ├── old ├── dstc7-task2-individual_judgments.tgz ├── dstc7-task2-individual_judgments.xlsx └── dstc7-task2-scores.xlsx └── src ├── .gitignore ├── 3rdparty └── .create ├── README.md ├── automatic-evaluation.sh ├── demo.py ├── demo ├── hyp.txt ├── ref0.txt └── ref1.txt ├── dstc.py ├── keys └── test.2k.txt ├── metrics.py ├── systems └── constant-baseline.txt ├── tokenizers.py └── util.py /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | tmp 3 | src/tmp 4 | data_extraction/logs 5 | data_extraction/reddit 6 | data_extraction/data-official 7 | data_extraction/data-official-test 8 | data_extraction/data-official-valid 9 | *.log 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DSTC7: End-to-End Conversation Modeling 2 | 3 | **[DSTC7](http://workshop.colips.org/dstc7/) has ended on January 27, 2019. This github project is still available 'as is', but we unfortunately no longer have time to maintain the code or to provide assistance with this project.** 4 | 5 | ## News 6 | * 10/29/2018: [Spreadsheet](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/evaluation/dstc7-task2-individual_judgments.xlsx) containing indivdual judgments used for human evaluation. 7 | * 10/23/2018 and 10/15/2018: Automatic and human evaluation [results](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/evaluation/dstc7-task2-scores.xlsx) posted. The code to reproduce the automatic evaluation and get the same scores can be found [here](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/evaluation/src/README.md). 8 | * 10/8/2018: Participants submitted system outputs. 9 | * 9/10/2018-10/8/2018: Evaluation phase, instructions [here](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/evaluation). 10 | * 7/11/2018: An [FAQ](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction#FAQ) section has been added to the data extraction page. 11 | * 7/1/2018: [Official training data](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction) is up. 12 | * 6/18/2018: [Trial data](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction/trial) is up. 13 | * 6/1/2018: [Task description](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/doc/DSTC7_task2.pdf) is up. 
* 6/1/2018: [Registration](https://docs.google.com/forms/d/e/1FAIpQLSf4aoCdtLsnFr_AKfp3tnTy4OUCITy5avcEEpUHJ9oZ5ZFvbg/viewform) for DSTC7 is now open.

## Registration

~~Please register [here]~~ Registration has now closed.

## Task
This [DSTC7](http://workshop.colips.org/dstc7/) track presents an end-to-end conversational modeling task, in which the goal is to generate conversational responses that go beyond trivial chitchat by injecting informative responses that are grounded in external knowledge. This task is distinct from what is commonly thought of as goal-oriented, task-oriented, or task-completion dialog in that there is no specific or predefined goal (e.g., booking a flight, or reserving a table at a restaurant). Instead, it targets human-like interactions where the underlying goal is often ill-defined or not known in advance, of the kind seen, for example, in work and other productive environments (e.g., brainstorming meetings) where people share information.

Please check this [description](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/doc/DSTC7_task2.pdf) for more details about the task, which follows our previous work ["A Knowledge-Grounded Neural Conversation Model"](https://arxiv.org/abs/1702.01932) and our original task [proposal](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/doc/proposal.pdf).

## Data
We extend the [knowledge-grounded](https://arxiv.org/abs/1702.01932) setting, with each system input consisting of two parts:
* Conversational data from Reddit.
* Contextually-relevant “facts”, taken from the website that started the (Reddit) conversation.

Please check the [data extraction](https://github.com/DSTC-MSR/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction) page for the input data pipeline. Note: we are providing scripts to extract the data from a Reddit [dump](http://files.pushshift.io/reddit/comments/), as we are unable to release the data directly ourselves.

## Evaluation
As described in the [task description](http://workshop.colips.org/dstc7/proposals/DSTC7-MSR_end2end.pdf) (Section 4), we will evaluate response quality using both automatic and human evaluations on two criteria:
* Appropriateness;
* Informativeness.

We will use automatic evaluation metrics such as BLEU and METEOR to provide a preliminary score for each submission prior to the human evaluation. Participants can also use these metrics for their own evaluations during the development phase. We will allow participants to submit multiple system outputs, with one system marked as “primary” for human evaluation. We will provide a BLEU scoring script to help participants decide which system they want to select as primary.

We will use crowdsourcing for human evaluation. For each response, we ask human judges whether it is (1) appropriate and (2) informative, on a scale from 1 to 5. The system with the best average Appropriateness and Informativeness scores will be declared the winner.

## Baseline
A standard seq2seq [baseline model](https://github.com/DSTC-MSR/DSTC7-End-to-End-Conversation-Modeling/tree/master/baseline) is provided in this repository.

## Timeline
|Phase|Dates|
| ------ | -------------- |
|1. Development Phase|June 1 – September 9|
|      1.1 Code (data extraction code, seq2seq baseline)|June 1|
|      1.2 "Trial" data made available|June 18|
|      1.3 Official training data made available|July 1|
|2. Evaluation Phase|September 10 – October 8|
|      2.1 Test data made available|September 10|
|      2.2 Participants submit their system outputs|October 8|
|3. Results are released|October|
|      3.1 Automatic scores (BLEU, etc.)|October 16|
|      3.2 Human evaluation|October 23|

## Organizers
* [Michel Galley](https://www.microsoft.com/en-us/research/people/mgalley/)
* [Chris Brockett](https://www.microsoft.com/en-us/research/people/chrisbkt/)
* [Sean Xiang Gao](https://www.linkedin.com/in/gxiang1228/)
* [Bill Dolan](https://www.microsoft.com/en-us/research/people/billdol/)
* [Jianfeng Gao](https://www.microsoft.com/en-us/research/people/jfgao/)

## Reference
If you submit any system to DSTC7-Task2 or publish any other work making use of the resources provided on this project, we ask you to cite the following task description paper:

```Michel Galley, Chris Brockett, Xiang Gao, Bill Dolan, Jianfeng Gao. End-to-End Conversation Modeling: DSTC7 Task 2 Description. In DSTC7 workshop (forthcoming).```

## Contact Information
* ~~For questions specific to Task 2, you can contact us at .~~ (No longer maintained.)
* ~~You can get the latest updates and participate in discussions on [DSTC mailing list](http://workshop.colips.org/dstc7/contact.html).~~

--------------------------------------------------------------------------------
/baseline/README.md:
--------------------------------------------------------------------------------

# Baseline Model
This is a baseline model for [DSTC task 2](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling). It is a GRU-based seq2seq generation system. Since it is a baseline, the model does not use grounding information ("facts"), attention, or beam search. It uses greedy decoding (unknown token disabled). The implementation is in Python, adapted from a Keras [tutorial](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html).

## Requirements
The scripts are tested on [Python 3.6.4](https://www.python.org/downloads/) with the following libraries:
* [Keras 2.1.6](https://keras.io/), which requires a backend library; we used [TensorFlow 1.8.0](https://www.tensorflow.org/)
* [numpy 1.14.0](http://www.numpy.org/)

## Input files
The trial data extraction script and instructions are available [here](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction). Based on the conversation text file, the following input files are generated with the command:
```
python create_input_files.py
```
|generated file|description|
|---|---|
|dict.txt|The vocab list. The line number is the `word id` (starting from 1) of the word on that line. The words are ordered by their frequency of appearance in the raw input file|
|source_num.txt|The list of source sentences where words are replaced by their `word id`|
|target_num.txt|The list of corresponding target sentences where words are replaced by their `word id`|

## Parameters
Key parameters can be specified in the `main()` function of [baseline.py](baseline.py).
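For reference, the relevant block at the top of `main()` looks like this (values reproduced from [baseline.py](baseline.py); edit them there to change the configuration):

```python
import os

# Hyperparameter block reproduced from main() in baseline.py (for reference).
token_embed_dim = 100   # length of the word embedding vector
rnn_units = 512         # hidden units per GRU cell
encoder_depth = 2       # stacked GRU cells in the encoder
decoder_depth = 2       # stacked GRU cells in the decoder
dropout_rate = 0.5
learning_rate = 1e-3
max_seq_len = 32

batch_size = 100
epochs = 10

# Input files produced by create_input_files.py (see "Input files" above).
path_source = os.path.join('official', 'source_num.txt')
path_target = os.path.join('official', 'target_num.txt')
path_vocab = os.path.join('official', 'dict.txt')
```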
We follow [DSTC 2016](https://github.com/dialogtekgeek/DSTC6-End-to-End-Conversation-Modeling/blob/master/ChatbotBaseline/egs/twitter/run.sh) for the hyperparameter value settings. 22 | 23 | |parameter|description|value| 24 | |---------|-------|-----| 25 | |`token_embed_dim` | length of word embedding vector |100| 26 | |`rnn_units`| number of hidden units of each GRU cell|512| 27 | |`encoder_depth`| number of GRU cells stacked in the encoder|2| 28 | |`decoder_depth`| number of GRU cells stacked in the decoder|2| 29 | |`dropout_rate`| dropout probability|0.5| 30 | |`max_num_token`| if not None, only use top `max_num_token` most frequent tokens|20000| 31 | |`max_seq_len`| tokens after the first `max_seq_len` tokens will be discarded |32| 32 | 33 | 34 | ## Run 35 | Use the command: 36 | ``` 37 | python baseline.py [mode] 38 | ``` 39 | where `mode` can be one of the following values 40 | 41 | |mode|description| 42 | |---------|-------| 43 | |`train` | train a new model. The trained model is saved after each epoch | 44 | |`continue` | load existing model and continue the training | 45 | |`eval`| evaluate the model on held-out data. Negative log likelihood loss is printed| 46 | |`interact`| explore the trained model interactively| 47 | -------------------------------------------------------------------------------- /baseline/baseline.py: -------------------------------------------------------------------------------- 1 | import os, random, sys, io 2 | import numpy as np 3 | from keras.models import Model 4 | from keras.layers import Input, GRU, Dense, Embedding, Dropout 5 | from keras.models import load_model 6 | from keras.optimizers import Adam 7 | 8 | """ 9 | a simple seq2seq model prepared as a baseline model for DSTC7 10 | https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling 11 | 12 | following Keras tutorial: 13 | https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html 14 | 15 | NOTE: 16 | * word-level, GRU-based 17 | * no attention mechanism 18 | * no beam search. 
greedy decoding, UNK disabled 19 | 20 | CONTACT: 21 | Sean Xiang Gao (xiag@microsoft.com) at Microsoft Research 22 | """ 23 | 24 | SOS_token = '_SOS_' 25 | EOS_token = '_EOS_' 26 | UNK_token = '_UNK_' 27 | 28 | 29 | def set_random_seed(seed=912): 30 | random.seed(seed) 31 | np.random.seed(seed) 32 | 33 | 34 | def makedirs(fld): 35 | if not os.path.exists(fld): 36 | os.makedirs(fld) 37 | 38 | class Dataset: 39 | 40 | """ 41 | assumptions of the data files 42 | * SOS and EOS are top 2 tokens 43 | * dictionary ordered by frequency 44 | """ 45 | 46 | def __init__(self, 47 | path_source, path_target, path_vocab, 48 | max_seq_len=32, 49 | test_split=0.2, # how many hold out as vali data 50 | read_txt=True, 51 | ): 52 | 53 | 54 | # load token dictionary 55 | 56 | self.index2token = {0: ''} 57 | self.token2index = {'': 0} 58 | self.max_seq_len = max_seq_len 59 | 60 | with io.open(path_vocab, encoding="utf-8") as f: 61 | lines = f.readlines() 62 | for i, line in enumerate(lines): 63 | token = line.strip('\n').strip() 64 | if len(token) == 0: 65 | break 66 | self.index2token[i + 1] = token 67 | self.token2index[token] = i + 1 68 | 69 | self.SOS = self.token2index[SOS_token] 70 | self.EOS = self.token2index[EOS_token] 71 | self.UNK = self.token2index[UNK_token] 72 | self.num_tokens = len(self.token2index) - 1 # not including 0-th (padding) 73 | print('num_tokens: %i'%self.num_tokens) 74 | 75 | if read_txt: 76 | self.read_txt(path_source, path_target, test_split) 77 | 78 | 79 | def read_txt(self, path_source, path_target, test_split): 80 | print('loading data from txt files...') 81 | # load source-target pairs, tokenized 82 | 83 | seqs = dict() 84 | for k, path in [('source', path_source), ('target', path_target)]: 85 | seqs[k] = [] 86 | with io.open(path, encoding="utf-8") as f: 87 | lines = f.readlines() 88 | for line in lines: 89 | seq = [] 90 | for c in line.strip('\n').strip().split(' '): 91 | i = int(c) 92 | if i <= self.num_tokens: # delete the "unkown" words 93 | seq.append(i) 94 | seqs[k].append(seq[-min(self.max_seq_len - 2, len(seq)):]) 95 | self.pairs = list(zip(seqs['source'], seqs['target'])) 96 | 97 | # train-test split 98 | 99 | np.random.shuffle(self.pairs) 100 | self.n_train = int(len(self.pairs) * (1. 
- test_split)) 101 | 102 | self.i_sample_range = { 103 | 'train': (0, self.n_train), 104 | 'test': (self.n_train, len(self.pairs)), 105 | } 106 | self.i_sample = dict() 107 | self.reset() 108 | 109 | 110 | def reset(self): 111 | for task in self.i_sample_range: 112 | self.i_sample[task] = self.i_sample_range[task][0] 113 | 114 | def all_loaded(self, task): 115 | return self.i_sample[task] == self.i_sample_range[task][1] 116 | 117 | def load_data(self, task, max_num_sample_loaded=None): 118 | 119 | i_sample = self.i_sample[task] 120 | if max_num_sample_loaded is None: 121 | max_num_sample_loaded = self.i_sample_range[task][1] - i_sample 122 | i_sample_next = min(i_sample + max_num_sample_loaded, self.i_sample_range[task][1]) 123 | num_samples = i_sample_next - i_sample 124 | self.i_sample[task] = i_sample_next 125 | 126 | print('building %s data from %i to %i'%(task, i_sample, i_sample_next)) 127 | 128 | encoder_input_data = np.zeros((num_samples, self.max_seq_len)) 129 | decoder_input_data = np.zeros((num_samples, self.max_seq_len)) 130 | decoder_target_data = np.zeros((num_samples, self.max_seq_len, self.num_tokens + 1)) # +1 as mask_zero 131 | 132 | source_texts = [] 133 | target_texts = [] 134 | 135 | for i in range(num_samples): 136 | 137 | seq_source, seq_target = self.pairs[i_sample + i] 138 | if not bool(seq_target) or not bool(seq_source): 139 | continue 140 | 141 | if seq_target[-1] != self.EOS: 142 | seq_target.append(self.EOS) 143 | 144 | source_texts.append(' '.join([self.index2token[j] for j in seq_source])) 145 | target_texts.append(' '.join([self.index2token[j] for j in seq_target])) 146 | 147 | for t, token_index in enumerate(seq_source): 148 | encoder_input_data[i, t] = token_index 149 | 150 | decoder_input_data[i, 0] = self.SOS 151 | for t, token_index in enumerate(seq_target): 152 | decoder_input_data[i, t + 1] = token_index 153 | decoder_target_data[i, t, token_index] = 1. 
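            # NOTE: decoder_target_data is decoder_input_data shifted left by one
            # time step and one-hot encoded over the vocabulary, i.e. the decoder
            # is trained with teacher forcing to predict the next token at each step.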
154 | 155 | return encoder_input_data, decoder_input_data, decoder_target_data, source_texts, target_texts 156 | 157 | 158 | 159 | 160 | class Seq2Seq: 161 | 162 | def __init__(self, 163 | dataset, model_dir, 164 | token_embed_dim, rnn_units, encoder_depth, decoder_depth, dropout_rate=0.2): 165 | 166 | self.token_embed_dim = token_embed_dim 167 | self.rnn_units = rnn_units 168 | self.encoder_depth = encoder_depth 169 | self.decoder_depth = decoder_depth 170 | self.dropout_rate = dropout_rate 171 | self.dataset = dataset 172 | 173 | makedirs(model_dir) 174 | self.model_dir = model_dir 175 | 176 | 177 | def load_models(self): 178 | self.build_model_train() 179 | self.model_train.load_weights(os.path.join(self.model_dir, 'model.h5')) 180 | self.build_model_test() 181 | 182 | 183 | def _stacked_rnn(self, rnns, inputs, initial_states=None): 184 | if initial_states is None: 185 | initial_states = [None] * len(rnns) 186 | outputs, state = rnns[0](inputs, initial_state=initial_states[0]) 187 | states = [state] 188 | for i in range(1, len(rnns)): 189 | outputs, state = rnns[i](outputs, initial_state=initial_states[i]) 190 | states.append(state) 191 | return outputs, states 192 | 193 | 194 | def build_model_train(self): 195 | 196 | # build layers 197 | embeding = Embedding( 198 | self.dataset.num_tokens + 1, # +1 as mask_zero 199 | self.token_embed_dim, mask_zero=True, 200 | name='embeding') 201 | 202 | encoder_inputs = Input(shape=(None,), name='encoder_inputs') 203 | encoder_rnns = [] 204 | for i in range(self.encoder_depth): 205 | encoder_rnns.append(GRU( 206 | self.rnn_units, 207 | return_state=True, 208 | return_sequences=True, 209 | name='encoder_rnn_%i'%i, 210 | )) 211 | 212 | decoder_inputs = Input(shape=(None,), name='decoder_inputs') 213 | decoder_rnns = [] 214 | for i in range(self.decoder_depth): 215 | decoder_rnns.append(GRU( 216 | self.rnn_units, 217 | return_state=True, 218 | return_sequences=True, 219 | name='decoder_rnn_%i'%i, 220 | )) 221 | 222 | decoder_softmax = Dense( 223 | self.dataset.num_tokens + 1, # +1 as mask_zero 224 | activation='softmax', name='decoder_softmax') 225 | 226 | # set connections: teacher forcing 227 | 228 | encoder_outputs, encoder_states = self._stacked_rnn( 229 | encoder_rnns, embeding(encoder_inputs)) 230 | 231 | decoder_outputs, decoder_states = self._stacked_rnn( 232 | decoder_rnns, embeding(decoder_inputs), [encoder_states[-1]] * self.decoder_depth) 233 | 234 | decoder_outputs = Dropout(self.dropout_rate)(decoder_outputs) 235 | decoder_outputs = decoder_softmax(decoder_outputs) 236 | self.model_train = Model( 237 | [encoder_inputs, decoder_inputs], # [input sentences, ground-truth target sentences], 238 | decoder_outputs) # shifted ground-truth sentences 239 | 240 | 241 | def build_model_test(self): 242 | 243 | # load/build layers 244 | 245 | names = ['embeding', 'decoder_softmax'] 246 | for i in range(self.encoder_depth): 247 | names.append('encoder_rnn_%i'%i) 248 | for i in range(self.decoder_depth): 249 | names.append('decoder_rnn_%i'%i) 250 | 251 | reused = dict() 252 | for name in names: 253 | reused[name] = self.model_train.get_layer(name) 254 | 255 | encoder_inputs = Input(shape=(None,), name='encoder_inputs') 256 | decoder_inputs = Input(shape=(None,), name='decoder_inputs') 257 | decoder_inital_states = [] 258 | for i in range(self.decoder_depth): 259 | decoder_inital_states.append(Input(shape=(self.rnn_units,), name="decoder_inital_state_%i"%i)) 260 | 261 | # set connections: autoregressive 262 | 263 | encoder_outputs, encoder_states = 
self._stacked_rnn( 264 | [reused['encoder_rnn_%i'%i] for i in range(self.encoder_depth)], 265 | reused['embeding'](encoder_inputs)) 266 | self.model_infer_encoder = Model(encoder_inputs, encoder_states[-1]) 267 | 268 | decoder_outputs, decoder_states = self._stacked_rnn( 269 | [reused['decoder_rnn_%i'%i] for i in range(self.decoder_depth)], 270 | reused['embeding'](decoder_inputs), 271 | decoder_inital_states) 272 | 273 | decoder_outputs = Dropout(self.dropout_rate)(decoder_outputs) 274 | decoder_outputs = reused['decoder_softmax'](decoder_outputs) 275 | self.model_infer_decoder = Model( 276 | [decoder_inputs] + decoder_inital_states, 277 | [decoder_outputs] + decoder_states) 278 | 279 | 280 | def save_model(self, name): 281 | path = os.path.join(self.model_dir, name) 282 | self.model_train.save_weights(path) 283 | print('saved to: '+path) 284 | 285 | 286 | def train(self, 287 | batch_size, epochs, 288 | batch_per_load=10, 289 | lr=0.001): 290 | 291 | 292 | self.model_train.compile(optimizer=Adam(lr=lr), loss='categorical_crossentropy') 293 | max_load = np.ceil(self.dataset.n_train/batch_size/batch_per_load) 294 | 295 | for epoch in range(epochs): 296 | load = 0 297 | self.dataset.reset() 298 | while not self.dataset.all_loaded('train'): 299 | load += 1 300 | print('\n***** Epoch %i/%i - load %.2f perc *****'%(epoch + 1, epochs, 100*load/max_load)) 301 | encoder_input_data, decoder_input_data, decoder_target_data, _, _ = self.dataset.load_data('train', batch_size * batch_per_load) 302 | 303 | self.model_train.fit( 304 | [encoder_input_data, decoder_input_data], 305 | decoder_target_data, 306 | batch_size=batch_size,) 307 | 308 | self.save_model('model_epoch%i.h5'%(epoch + 1)) 309 | self.save_model('model.h5') 310 | 311 | 312 | def evaluate(self, samples_per_load=640): 313 | 314 | self.model_train.compile(optimizer=Adam(lr=1e-3), loss='categorical_crossentropy') 315 | self.dataset.reset() 316 | sum_loss = 0. 
317 | sum_n = 0 318 | 319 | while not self.dataset.all_loaded('test'): 320 | encoder_input_data, decoder_input_data, decoder_target_data, _, _ = self.dataset.load_data('test', samples_per_load) 321 | 322 | print('evaluating') 323 | loss = self.model_train.evaluate( 324 | x=[encoder_input_data, decoder_input_data], 325 | y=decoder_target_data, 326 | verbose=0) 327 | 328 | n = encoder_input_data.shape[0] 329 | sum_loss += loss * n 330 | sum_n += n 331 | print('avg loss: %.2f'%(sum_loss/sum_n)) 332 | print('done') 333 | 334 | 335 | 336 | 337 | def _infer(self, source_seq_int): 338 | 339 | state = self.model_infer_encoder.predict(source_seq_int) 340 | prev_word = np.atleast_2d([self.dataset.SOS]) 341 | states = [state] * self.decoder_depth 342 | decoded_sentence = '' 343 | t = 0 344 | while True: 345 | 346 | out = self.model_infer_decoder.predict([prev_word] + states) 347 | tokens_proba = out[0].ravel() 348 | tokens_proba[self.dataset.UNK] = 0 # UNK disabled 349 | tokens_proba = tokens_proba/sum(tokens_proba) 350 | states = out[1:] 351 | sampled_token_index = np.argmax(tokens_proba) 352 | sampled_token = self.dataset.index2token[sampled_token_index] 353 | decoded_sentence += sampled_token+' ' 354 | 355 | t += 1 356 | if sampled_token_index == self.dataset.EOS or t > self.dataset.max_seq_len: 357 | break 358 | 359 | prev_word = np.atleast_2d([sampled_token_index]) 360 | 361 | return decoded_sentence 362 | 363 | 364 | def dialog(self, input_text): 365 | 366 | source_seq_int = [] 367 | for token in input_text.strip().strip('\n').split(' '): 368 | source_seq_int.append(self.dataset.token2index.get(token, self.dataset.UNK)) 369 | return self._infer(np.atleast_2d(source_seq_int)) 370 | 371 | 372 | def interact(self): 373 | while True: 374 | print('----- please input -----') 375 | input_text = input() 376 | if not bool(input_text): 377 | break 378 | print(self.dialog(input_text)) 379 | 380 | 381 | 382 | def main(mode): 383 | 384 | 385 | token_embed_dim = 100 386 | rnn_units = 512 387 | encoder_depth = 2 388 | decoder_depth = 2 389 | dropout_rate = 0.5 390 | learning_rate = 1e-3 391 | max_seq_len = 32 392 | 393 | batch_size = 100 394 | epochs = 10 395 | 396 | path_source = os.path.join('official','source_num.txt') 397 | path_target = os.path.join('official','target_num.txt') 398 | path_vocab = os.path.join('official','dict.txt') 399 | 400 | dataset = Dataset(path_source, path_target, path_vocab, max_seq_len=max_seq_len, read_txt=(mode!='interact')) 401 | model_dir = 'model' 402 | 403 | s2s = Seq2Seq(dataset, model_dir, 404 | token_embed_dim, rnn_units, encoder_depth, decoder_depth, dropout_rate) 405 | 406 | if mode == 'train': 407 | s2s.build_model_train() 408 | else: 409 | s2s.load_models() 410 | 411 | if mode in ['train', 'continue']: 412 | s2s.train(batch_size, epochs, lr=learning_rate) 413 | else: 414 | if mode == 'eval': 415 | s2s.build_model_test() 416 | s2s.evaluate() 417 | elif mode == 'interact': 418 | s2s.interact() 419 | 420 | 421 | if __name__ == '__main__': 422 | set_random_seed() 423 | mode = sys.argv[1] # one of [train, continue, eval, interact] 424 | main(mode) 425 | -------------------------------------------------------------------------------- /baseline/create_input_files.py: -------------------------------------------------------------------------------- 1 | import os, queue 2 | from baseline import SOS_token, EOS_token, UNK_token 3 | 4 | 5 | 6 | def main(path_txt, fld_out, 7 | max_vocab_size=2e4, 8 | ): 9 | 10 | 11 | with open(path_txt, encoding="utf-8") as f: 12 | lines = 
f.readlines() 13 | 14 | if not os.path.exists(fld_out): 15 | os.makedirs(fld_out) 16 | 17 | path = dict() 18 | for end in ['source', 'target']: 19 | path[end] = os.path.join(fld_out, '%s_num.txt'%end) 20 | path['dict'] = os.path.join(fld_out, 'dict.txt') 21 | 22 | for k in path: 23 | open(path[k], 'w') 24 | 25 | sources = [] 26 | targets = [] 27 | n = 0 28 | for line in lines: 29 | n += 1 30 | if n%1e5 == 0: 31 | print('checked %.2fM/%.2fM lines'%(n/1e6, len(lines)/1e6)) 32 | sub = line.split('\t') 33 | source = sub[-2] 34 | target = sub[-1] 35 | if source == 'START': # skip if source has nothing 36 | continue 37 | sources.append(source.strip().split()) 38 | targets.append(target.strip().split()) 39 | 40 | vocab = dict() 41 | for tokens in sources + targets: 42 | for token in tokens: 43 | if token not in vocab: 44 | vocab[token] = 0 45 | vocab[token] += 1 46 | 47 | pq = queue.PriorityQueue() 48 | for token in vocab: 49 | pq.put((-vocab[token], token)) 50 | 51 | ordered_tokens = [SOS_token, EOS_token, UNK_token] 52 | while not pq.empty(): 53 | freq, token = pq.get() 54 | ordered_tokens.append(token) 55 | if len(ordered_tokens) == max_vocab_size: 56 | break 57 | 58 | print('vocab size = %i'%len(ordered_tokens)) 59 | 60 | token2index = dict() 61 | for i, token in enumerate(ordered_tokens): 62 | token2index[token] = str(i + 1) # +1 as 0 is for padding 63 | with open(path['dict'], 'a', encoding="utf-8") as f: 64 | f.write('\n'.join(ordered_tokens)) 65 | 66 | nums = [] 67 | for tokens in sources: 68 | nums.append(' '.join([token2index.get(token, token2index[UNK_token]) for token in tokens])) 69 | with open(path['source'], 'a', encoding="utf-8") as f: 70 | f.write('\n'.join(nums)) 71 | 72 | nums = [] 73 | for tokens in targets: 74 | nums.append(' '.join([token2index.get(token, token2index[UNK_token]) for token in tokens])) 75 | with open(path['target'], 'a', encoding="utf-8") as f: 76 | f.write('\n'.join(nums)) 77 | 78 | 79 | 80 | if __name__ == '__main__': 81 | fld_out = 'official' 82 | path_txt = 'F:/DSTC/data-official/train.convos.txt' 83 | main(path_txt, fld_out) 84 | print('done!') 85 | 86 | 87 | 88 | 89 | 90 | 91 | -------------------------------------------------------------------------------- /data_extraction/Makefile: -------------------------------------------------------------------------------- 1 | .SECONDARY: 2 | 3 | all: train dev valid test refs test.refs 4 | data: train dev valid test 5 | 6 | refs: data-official-test/test.refs.txt 7 | test: data-official-test/test.convos.txt data-official-test/test.facts.txt 8 | valid: data-official-valid/valid.convos.txt data-official-valid/valid.facts.txt 9 | dev: data-official/dev.convos.txt data-official/dev.facts.txt 10 | train: data-official/train.convos.txt data-official/train.facts.txt 11 | 12 | WARGS=--tries=0 --retry-connrefused --waitretry=30 13 | 14 | include src/Makefile.official.targets 15 | 16 | data-official-test/test.refs.txt: $(OFFICIAL_TEST_REFS) 17 | cat $+ | sort | uniq > $@ 18 | 19 | data-official-test/test.convos.txt: $(OFFICIAL_TEST_CONVOS) 20 | cat $+ | sort | uniq > $@ 21 | 22 | data-official-test/test.facts.txt: $(OFFICIAL_TEST_FACTS) 23 | cat $+ > $@ 24 | 25 | data-official-valid/valid.convos.txt: $(OFFICIAL_VALID_CONVOS) 26 | cat $+ | sort | uniq > $@ 27 | 28 | data-official-valid/valid.facts.txt: $(OFFICIAL_VALID_FACTS) 29 | cat $+ > $@ 30 | 31 | data-official/dev.convos.txt: $(OFFICIAL_DEV_CONVOS) 32 | cat $+ > $@ 33 | 34 | data-official/dev.facts.txt: $(OFFICIAL_DEV_FACTS) 35 | cat $+ > $@ 36 | 37 | 
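# Concatenate the per-month training extractions (targets listed in
# src/Makefile.official.targets) into the final training files.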
data-official/train.convos.txt: $(OFFICIAL_TRAIN_CONVOS) 38 | cat $+ > $@ 39 | 40 | data-official/train.facts.txt: $(OFFICIAL_TRAIN_FACTS) 41 | cat $+ > $@ 42 | 43 | test.refs: data-official-test/test.refs.txt lists/test-multiref.sets 44 | cat $< | python src/ids2refs.py lists/test-multiref.sets > $@ 45 | 46 | data-official-test/%.refs.txt: data-official-test/%.pkl lists/test-multiref.hashes 47 | time python src/create_official_data.py --rsinput=reddit/RS_$(*F).zst --rcinput=reddit/RC_$(*F).zst --subreddit_filter=lists/subreddits-official.txt --domain_filter=lists/domains-official.txt --pickle=data-official-test/$(*F).pkl --facts=- --convos=data-official-test/$(*F).refs.txt --anchoronly=True --use_cc=True --test=lists/test-multiref.hashes > logs/$(*F)-refs.log 2> logs/$(*F)-refs.err 48 | 49 | ## Note: there is no point in removing the '--blind' flag in order to peek at the reference responses (gold), as the organizers will rely on different responses to compute BLEU, etc. 50 | data-official-test/%.facts.txt data-official-test/%.convos.txt data-official-test/%.pkl: reddit/RC_%.zst reddit/RS_%.zst data-official-test/.create lists/test.hashes 51 | time python src/create_official_data.py --rsinput=reddit/RS_$(*F).zst --rcinput=reddit/RC_$(*F).zst --subreddit_filter=lists/subreddits-official.txt --domain_filter=lists/domains-official.txt --pickle=data-official-test/$(*F).pkl --facts=data-official-test/$(*F).facts.txt --convos=data-official-test/$(*F).convos.txt --anchoronly=True --use_cc=True --test=lists/test.hashes --blind=True > logs/$(*F).log 2> logs/$(*F).err 52 | 53 | data-official-valid/%.facts.txt data-official-valid/%.convos.txt: reddit/RC_%.zst reddit/RS_%.zst data-official-valid/.create lists/valid.hashes 54 | time python src/create_official_data.py --rsinput=reddit/RS_$(*F).zst --rcinput=reddit/RC_$(*F).zst --subreddit_filter=lists/subreddits-official.txt --domain_filter=lists/domains-official.txt --pickle=data-official-valid/$(*F).pkl --facts=data-official-valid/$(*F).facts.txt --convos=data-official-valid/$(*F).convos.txt --anchoronly=True --use_cc=True --test=lists/valid.hashes > logs/$(*F).log 2> logs/$(*F).err 55 | 56 | data-official/%.facts.txt data-official/%.convos.txt: reddit/RC_%.zst reddit/RS_%.zst data-official/.create 57 | time python src/create_official_data.py --rsinput=reddit/RS_$(*F).zst --rcinput=reddit/RC_$(*F).zst --subreddit_filter=lists/subreddits-official.txt --domain_filter=lists/domains-official.txt --pickle=data-official/$(*F).pkl --facts=data-official/$(*F).facts.txt --convos=data-official/$(*F).convos.txt --anchoronly=True --use_cc=True > logs/$(*F).log 2> logs/$(*F).err 58 | 59 | data-official-test/.create: 60 | mkdir -p logs 61 | mkdir -p data-official-test 62 | touch $@ 63 | 64 | data-official-valid/.create: 65 | mkdir -p logs 66 | mkdir -p data-official-valid 67 | touch $@ 68 | 69 | data-official/.create: 70 | mkdir -p logs 71 | mkdir -p data-official 72 | touch $@ 73 | 74 | reddit/RS_%.zst: 75 | mkdir -p logs 76 | mkdir -p reddit 77 | wget $(WARGS) https://files.pushshift.io/reddit/submissions/RS_$(*F).zst -O reddit/RS_$(*F).zst -o logs/RS_$(*F).zst.log -c 78 | 79 | reddit/RC_%.zst: 80 | mkdir -p logs 81 | mkdir -p reddit 82 | wget $(WARGS) https://files.pushshift.io/reddit/comments/RC_$(*F).zst -O reddit/RC_$(*F).zst -o logs/RC_$(*F).zst.log -c 83 | -------------------------------------------------------------------------------- /data_extraction/README.md: -------------------------------------------------------------------------------- 1 | # Data 
Extraction for DSTC7: End-to-End Conversation Modeling

Task 2 uses conversational data extracted from Reddit. Each conversation in this setup is _grounded_: each conversation is about a specific web page that was linked at the start of the conversation. This page provides code to extract the data from a Reddit [dump](http://files.pushshift.io/reddit/comments/) and from [Common Crawl](http://commoncrawl.org/). The former provides the conversations, while the latter provides the grounding. We provide code instead of actual data, as we are unable to directly release this data.

(Note: the older and now obsolete setup to create the "trial" data can be found [here](https://github.com/DSTC-MSR/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction/trial).)

If you run into any problems creating the data, please check the [FAQ](https://github.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction#FAQ) section below.

## Requirements

This page assumes you are running a UNIX environment (Linux, macOS, etc.). If you are on Windows, please either use its Ubuntu subsystem (instructions [here](https://docs.microsoft.com/en-us/windows/wsl/install-win10)) or any third-party UNIX-like environment such as [Cygwin](https://www.cygwin.com/). Creating the data requires a fair amount of disk space to store the Reddit dump files locally, about 500 GB in total. You will also need the following programs:

* `Python 3.x`, with modules:
  * `nltk`
  * `beautifulsoup4`
  * `chardet`
  * `zstandard`
* `make`

To install the above Python modules, you can run:

```pip install -r requirements.txt```

Please also set `PYTHONIOENCODING=UTF-8` in your environment, e.g., by running this in bash:

```export PYTHONIOENCODING=UTF-8```

## Create official data (train and dev):

First, make sure that Python and all required modules are installed, for example by generating a subset of the data (January 2011):

```make data-official/2011-01.convos.txt```

If the target file is missing after running that command, that probably means Python or one of its modules is missing, is the wrong version, or is misconfigured. Please inspect the log files in `logs/*.log` or `logs/*.err` to find the cause of the problem (please refer to the "Requirements" section above, and see the quick import check sketched below).

If everything is set up properly, please run the following command to create the official training data:

```make -j4```

This will run the extraction pipeline with 4 processes. Depending on the number of cores on your machine, you might want to increase or decrease that number. This will take 2-5 days to run, depending on the number of processes selected. This will create two tab-separated (tsv) files, `data-official/train.convos.txt` and `data-official/train.facts.txt`, which respectively contain the conversational data and the grounded text ("facts"). This will also create two files for the dev set.

The data is generated from Reddit and the web, so some of it is noisy and occasionally contains offensive language. While we selected Reddit boards (i.e., "subreddits") and web domains that are mostly "safe for work", explicit and offensive language sometimes appears in the data, and we did not attempt to eliminate it further (for the sake of simplicity and reproducibility of our pipeline).
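A missing or misconfigured module is the most common cause of failure. The snippet below (a minimal, illustrative check, not part of the official pipeline) verifies that the modules from `requirements.txt` can be imported:

```python
# Quick sanity check (illustrative only): confirm the required modules import
# cleanly before starting the multi-day `make` run.
import importlib

for name in ("nltk", "bs4", "chardet", "zstandard"):  # bs4 = beautifulsoup4
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "version unknown"))
    except ImportError as err:
        print("MISSING:", name, "-", err)
```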
Note: if you set a large number of processes, the server hosting the Reddit data (`files.pushshift.io`) might complain about "too many open connections". If so, you might want to use the makefile to first create all `reddit/*.zst` files and only then run e.g. `make -j7`.

Generated data files (see data description below):
1. `train.convos.txt`: Conversations of the training set
2. `train.facts.txt`: Facts of the training set
3. `dev.convos.txt`: Conversations of the dev set
4. `dev.facts.txt`: Facts of the dev set

Role of each dataset:
1. TRAIN: do anything you want with this data (train, analyze, etc.);
2. DEV: you may tune model hyperparameters on this data, but you may NOT train any model directly on it;
3. TEST (official release Sept 10): the blind test set will provide facts and conversational contexts, but the gold responses will be hidden (replaced with `__UNDISCLOSED__`).

If you participate in the challenge and plan to send us system outputs, **please do NOT use any training data** other than the data provided on this page.

## Data description:

Each conversation in this dataset consists of a Reddit `submission` and the discussion-like `comments` that follow it. In this data, we restrict ourselves to submissions that provide a `URL` along with a `title` (see [example Reddit submission](https://www.reddit.com/r/todayilearned/comments/f2ruz/til_a_woman_fell_30000_feet_from_an_airplane_and/), which refers to [this web page](https://en.wikipedia.org/wiki/Vesna_Vulovi%C4%87)). The web page scraped from the URL provides grounding or context to the conversation, and is additional (non-conversational) input that models can condition on to produce responses that are more informative and contentful.

### Conversation file:

Each line of `train.convos.txt` and `dev.convos.txt` contains a Reddit response and its preceding conversational context. Long conversational contexts are truncated by keeping the last 100 words. The file contains 7 columns:

1. hash value (only for sanity check)
2. subreddit name
3. conversation ID
4. response score
5. dialogue turn number (e.g., "1" = start of the conversation, "2" = 2nd turn of a conversation)
6. conversational context, usually multiple turns (input of the model)
7. response (output of the model)

The conversational context may contain:
* EOS: special symbol indicating a turn transition
* START: special symbol indicating the start of the conversation

### Facts file:

Each line of `train.facts.txt` and `dev.facts.txt` contains a "fact": a sentence, paragraph, or other snippet of text relevant to the current conversation. Use conversation IDs to find the facts relevant to each conversation. Note: facts relevant to a given conversation are ordered as they appear on the original web page. The file contains 5 columns:

1. hash value (only for sanity check)
2. subreddit name
3. conversation ID
4. domain name
5. fact

A minimal sketch of how to join the two files by conversation ID is shown below.

To produce the facts relevant to each conversation, we extracted the text of the page using an html-to-text converter ([BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)), but kept the most important tags intact (`<title>`, `<h1>`-`<h6>`, `<p>`, etc.).
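Here is that sketch (illustrative only; it assumes the tab-separated column layouts listed above and is not part of the official pipeline):

```python
# Illustrative sketch: join *.convos.txt and *.facts.txt by conversation ID.
from collections import defaultdict

def load_facts(path):
    """Map conversation ID -> list of facts, kept in original page order."""
    facts = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            _hash, _subreddit, convo_id, _domain, fact = line.rstrip("\n").split("\t", 4)
            facts[convo_id].append(fact)
    return facts

def iter_convos(path, facts):
    """Yield (context, response, facts) triples, one per conversation turn."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            _hash, _subreddit, convo_id, _score, _turn, context, response = \
                line.rstrip("\n").split("\t", 6)
            yield context, response, facts.get(convo_id, [])

if __name__ == "__main__":
    facts = load_facts("data-official/train.facts.txt")
    for context, response, convo_facts in iter_convos("data-official/train.convos.txt", facts):
        pass  # feed (context, convo_facts) -> response into your model here
```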
As web formatting differs substantially from domain to domain and common tags like `<p>` may not be used in some domains, we decided to keep all the text of the original page (however, we do remove javascript and style code). As some of the fact data tend to be noisy, you may want restrict yourself to facts delimited by these tags. 89 | 90 | 91 | #### Labeled anchors 92 | 93 | A substantial number of URLs contain labeled achors, for example: 94 | 95 | ```http://en.wikipedia.org/wiki/John_Rhys-Davies#The_Lord_of_the_Rings_trilogy``` 96 | 97 | which here refers to the label `The_Lord_of_the_Rings_trilogy`. This information is preserved in the facts, and indicated with the tags `<anchor>` and `</anchor>`. As many web pages in this dataset are lengthy, anchors are probably useful information, as they indicate what text the model should likely attend to in order to produce a good response. 98 | 99 | ### Data statistics: 100 | 101 | | | Trial data | Train set | Dev set | 102 | | ---- | ---- | ---- | ---- | 103 | |# dialogue turns | 649,866 | 2,364,228 | - | 104 | |# facts | 4,320,438 | 15,180,800 | - | 105 | |# tagged facts (1) | 998,032 | 2,185,893 | - | 106 | 107 | (1): facts tagged with html markup (e.g., <title>) and therefore potentially important. 108 | 109 | ### Sample data: 110 | 111 | #### Sample conversation turn: 112 | 113 | ```<hash> \t todayilearned \t f2ruz \t 145 \t 2 \t START EOS til a woman fell 30,000 feet from an airplane and survived . \t the page states that a 2009 report found the plane only fell several hundred meters .``` 114 | 115 | Maps to: 116 | 117 | 1. hash value: ... 118 | 2. subreddit name: `TodayILearned` 119 | 3. conversation ID: `f2ruz` 120 | 4. response score: `145` 121 | 5. dialogue turn number: `2` 122 | 6. conversational context: `START EOS til a woman fell 30,000 feet from an airplane and survived .` 123 | 7. response: `the page states that a 2009 report found the plane only fell several hundred meters .` 124 | 125 | #### Sample fact: 126 | 127 | ```<hash> \t todayilearned \t f2ruz \t en.wikipedia.org \t <p> four years later , peter hornung-andersen and pavel theiner , two prague-based journalists , claimed that flight 367 had been mistaken for an enemy aircraft and shot down by the czechoslovak air force at an altitude of 800 metres ( 2,600 ft ) . </p>``` 128 | 129 | Maps to: 130 | 1. hash value: ... 131 | 2. subreddit name: `TodayILearned` 132 | 3. conversation ID: `f2ruz` 133 | 4. domain name: `en.wikipedia.org` 134 | 5. fact: `<p> four years later , peter hornung-andersen and pavel theiner , two prague-based journalists , claimed that flight 367 had been mistaken for an enemy aircraft and shot down by the czechoslovak air force at an altitude of 800 metres ( 2,600 ft ) . </p>` 135 | 136 | <a name="FAQ"></a> 137 | ## FAQ 138 | 139 | **Q:** `make` crashed and returned some non-zero code(s), e.g.: `make: *** [reddit/RC_2013-05.zst] Error 8` 140 | <br> 141 | **A:** It might be a temporary network connection problem. If you rerun the same `make` command, the scripts will resume data download and creation from where it left off. 142 | <br> 143 | **A:** Alternatively, it might be because you ran `make` with a large number of processes (> 4). The server hosting the Reddit data doesn't allow more than 4 simultaneous connections from the same IP address. 144 | 145 | **Q:** Creating the data with these scripts is inconvenient. Instead, could you please send us the data directly? 
146 | <br> 147 | **A:** The data (either Reddit or web) is not our own, so we are unable to redistribute it. The same situtation happended last year for [DSTC6 Task 2](https://github.com/dialogtekgeek/DSTC6-End-to-End-Conversation-Modeling), and the organizers then also provided scraping code that let participants construct the data themselves. 148 | 149 | **Q:** I have trouble running the download script because of Internet control in my country. 150 | <br> 151 | **A:** You might want to consider VPN solutions. Alternatively, you could try to team up with other people or team who live in a different country. Participants from different institutions can collaborate and submit system outputs together. 152 | 153 | -------------------------------------------------------------------------------- /data_extraction/lists/domains-official.txt: -------------------------------------------------------------------------------- 1 | 3news.co.nz 2 | 9news.com.au 3 | abc.net.au 4 | abcnews.go.com 5 | afr.com 6 | al.com 7 | aljazeera.com 8 | anandtech.com 9 | animenewsnetwork.com 10 | arcgames.com 11 | arstechnica.com 12 | au.news.yahoo.com 13 | autoblog.com 14 | baltimoresun.com 15 | basketball-reference.com 16 | bbc.co.uk 17 | bbc.com 18 | bestbuy.com 19 | bigstory.ap.org 20 | bleacherreport.com 21 | bleedingcool.com 22 | blog.chron.com 23 | blog.eu.playstation.com 24 | blog.sfgate.com 25 | blog.us.playstation.com 26 | bloomberg.com 27 | boardgamegeek.com 28 | boingboing.net 29 | bostonglobe.com 30 | brisbanetimes.com.au 31 | business.financialpost.com 32 | businessweek.com 33 | buzzfeed.com 34 | canberratimes.com.au 35 | cbc.ca 36 | charlotteobserver.com 37 | chicagotribune.com 38 | chron.com 39 | clarin.com 40 | cleveland.com 41 | cnet.com 42 | cnn.com 43 | collider.com 44 | comicbook.com 45 | comicbookmovie.com 46 | comingsoon.net 47 | cp24.com 48 | crash.net 49 | crunchyroll.com 50 | csmonitor.com 51 | dailymail.co.uk 52 | dailymotion.com 53 | dailyrecord.co.uk 54 | dallasnews.com 55 | deadline.com 56 | deadspin.com 57 | deviantart.com 58 | digitaltrends.com 59 | dnainfo.com 60 | economist.com 61 | edition.cnn.com 62 | en.m.wikipedia.org 63 | en.wikipedia.org 64 | engadget.com 65 | english.alarabiya.net 66 | escapistmagazine.com 67 | espn.go.com 68 | euractiv.com 69 | eurekalert.org 70 | euronews.com 71 | ew.com 72 | faz.net 73 | finance.yahoo.com 74 | fivethirtyeight.com 75 | flickeringmyth.com 76 | football-italia.net 77 | forbes.com 78 | foreignpolicy.com 79 | forum.xda-developers.com 80 | forums.robertsspaceindustries.com 81 | fourfourtwo.com 82 | foxnews.com 83 | foxsports.com 84 | ft.com 85 | gawker.com 86 | gizmodo.com 87 | globalnews.ca 88 | goal.com 89 | gq.com 90 | guardian.co.uk 91 | haaretz.com 92 | huffingtonpost.ca 93 | humblebundle.com 94 | hurriyetdailynews.com 95 | ibtimes.com 96 | ign.com 97 | independent.co.uk 98 | independent.ie 99 | indiegogo.com 100 | inquisitr.com 101 | io9.com 102 | irishexaminer.com 103 | irishtimes.com 104 | jalopnik.com 105 | japantimes.co.jp 106 | joystiq.com 107 | jsonline.com 108 | kansascity.com 109 | kickstarter.com 110 | latimes.com 111 | liverpoolecho.co.uk 112 | livescience.com 113 | macleans.ca 114 | majornelson.com 115 | manchestereveningnews.co.uk 116 | marca.com 117 | mashable.com 118 | mercurynews.com 119 | metro.co.uk 120 | miamiherald.com 121 | mirror.co.uk 122 | mlb.mlb.com 123 | mlive.com 124 | mlssoccer.com 125 | montrealgazette.com 126 | msn.com 127 | nasa.gov 128 | nature.com 129 | nba.com 130 | nbcnews.com 131 | 
news.nationalgeographic.com 132 | news.nationalpost.com 133 | news.yahoo.com 134 | newsarama.com 135 | newscientist.com 136 | newshub.co.nz 137 | newyorker.com 138 | nfl.com 139 | nhl.com 140 | nj.com 141 | nola.com 142 | npr.org 143 | nydailynews.com 144 | nypost.com 145 | oglobo.globo.com 146 | oregonlive.com 147 | orlandosentinel.com 148 | pbs.org 149 | pcgamer.com 150 | pcworld.com 151 | philly.com 152 | phillymag.com 153 | phys.org 154 | playstationlifestyle.net 155 | politico.com 156 | polygon.com 157 | popularmechanics.com 158 | post-gazette.com 159 | pro-football-reference.com 160 | profootballtalk.nbcsports.com 161 | qz.com 162 | redditgifts.com 163 | reuters.com 164 | robertsspaceindustries.com 165 | rockpapershotgun.com 166 | rollingstone.com 167 | rt.com 168 | sacbee.com 169 | sanfrancisco.cbslocal.com 170 | sbnation.com 171 | scotsman.com 172 | screenrant.com 173 | sfchronicle.com 174 | sfgate.com 175 | si.com 176 | skysports.com 177 | slate.com 178 | smashboards.com 179 | smh.com.au 180 | smithsonianmag.com 181 | snopes.com 182 | sports.yahoo.com 183 | sputniknews.com 184 | stackoverflow.com 185 | steamcommunity.com 186 | store.playstation.com 187 | sun-sentinel.com 188 | surrenderat20.net 189 | target.com 190 | techcrunch.com 191 | technologyreview.com 192 | telegraph.co.uk 193 | theage.com.au 194 | theatlantic.com 195 | thebiglead.com 196 | theglobeandmail.com 197 | theguardian.com 198 | thehill.com 199 | thejournal.ie 200 | themoscowtimes.com 201 | thenextweb.com 202 | theringer.com 203 | thestar.com 204 | thesun.co.uk 205 | theverge.com 206 | thewrap.com 207 | thinkprogress.org 208 | time.com 209 | timesofisrael.com 210 | tmz.com 211 | torontosun.com 212 | tsn.ca 213 | tvline.com 214 | uefa.com 215 | uk.businessinsider.com 216 | upi.com 217 | us.battle.net 218 | usatoday.com 219 | vancouversun.com 220 | variety.com 221 | vg247.com 222 | vulture.com 223 | walmart.com 224 | wikipedia.org 225 | wired.com 226 | zdnet.com 227 | -------------------------------------------------------------------------------- /data_extraction/lists/subreddits-official.txt: -------------------------------------------------------------------------------- 1 | 49ers 2 | amateurradio 3 | anime 4 | arrow 5 | asoiaf 6 | australia 7 | aviation 8 | badphilosophy 9 | baltimore 10 | battlefield_4 11 | belgium 12 | bengals 13 | bicycling 14 | bindingofisaac 15 | bjj 16 | boardgames 17 | bonnaroo 18 | borussiadortmund 19 | bostonceltics 20 | bourbon 21 | breakingbad 22 | buccaneers 23 | buffalobills 24 | business 25 | canada 26 | canucks 27 | cars 28 | chelseafc 29 | cincinnati 30 | classicalmusic 31 | climbing 32 | comicbooks 33 | comics 34 | cordcutters 35 | cowboys 36 | creepy 37 | dayz 38 | de 39 | diabetes 40 | diablo3 41 | disney 42 | doctorwho 43 | dogs 44 | dotamasterrace 45 | dvdcollection 46 | eagles 47 | electronicmusic 48 | energy 49 | entertainment 50 | eu4 51 | europe 52 | everymanshouldknow 53 | exjw 54 | falcons 55 | fivenightsatfreddys 56 | flying 57 | formula1 58 | frugalmalefashion 59 | ftm 60 | furry 61 | futurama 62 | gamegrumps 63 | gameofthrones 64 | gamernews 65 | germany 66 | guns 67 | hardware 68 | haskell 69 | hawks 70 | heroesofthestorm 71 | homestuck 72 | horror 73 | houston 74 | humor 75 | interestingasfuck 76 | ipad 77 | iphone 78 | ireland 79 | kansascity 80 | knitting 81 | korea 82 | lakers 83 | leafs 84 | leagueoflegends 85 | linguistics 86 | linux 87 | linux_gaming 88 | london 89 | lotr 90 | madmen 91 | malefashionadvice 92 | martialarts 93 | math 94 | medicine 95 | 
melbourne 96 | metalgearsolid 97 | miamidolphins 98 | minnesotavikings 99 | montreal 100 | motogp 101 | motorcycles 102 | movies 103 | nba 104 | newzealand 105 | nfl 106 | nhl 107 | nintendo 108 | nursing 109 | nyc 110 | nyjets 111 | oneplus 112 | orangecounty 113 | orlando 114 | ottawa 115 | panthers 116 | paradoxplaza 117 | patientgamers 118 | paydaytheheist 119 | perth 120 | philadelphia 121 | philosophy 122 | phish 123 | pinkfloyd 124 | pittsburgh 125 | progmetal 126 | programming 127 | ravens 128 | redsox 129 | roosterteeth 130 | rugbyunion 131 | running 132 | russia 133 | sanfrancisco 134 | science 135 | scifi 136 | seinfeld 137 | short 138 | shutupandtakemymoney 139 | skeptic 140 | smashbros 141 | soccer 142 | sports 143 | starcitizen 144 | sto 145 | subaru 146 | sydney 147 | sysadmin 148 | tea 149 | tech 150 | television 151 | tennis 152 | teslamotors 153 | texas 154 | tf2 155 | timberwolves 156 | titanfall 157 | todayilearned 158 | toronto 159 | torontoraptors 160 | unitedkingdom 161 | vegan 162 | vexillology 163 | vinyl 164 | visualnovels 165 | vita 166 | wargame 167 | warriors 168 | washingtondc 169 | web_design 170 | webdev 171 | windows 172 | windowsphone 173 | worldnews 174 | wow 175 | writing 176 | xbox360 177 | xboxone 178 | zen 179 | -------------------------------------------------------------------------------- /data_extraction/requirements.txt: -------------------------------------------------------------------------------- 1 | chardet==3.0.4 2 | nltk==3.4.1 3 | beautifulsoup4==4.6.0 4 | zstandard==0.18.0 5 | -------------------------------------------------------------------------------- /data_extraction/src/Makefile.official: -------------------------------------------------------------------------------- 1 | .SECONDARY: 2 | 3 | train: data-official/train.convos.txt data-official/train.facts.txt 4 | 5 | include src/Makefile.official.targets 6 | 7 | 8 | data-official/train.convos.txt: $(OFFICIAL_TRAIN_CONVOS) 9 | cat $+ > $@ 10 | 11 | data-official/train.facts.txt: $(OFFICIAL_TRAIN_FACTS) 12 | cat $+ > $@ 13 | 14 | data-official/%.facts.txt data-official/%.convos.txt: reddit/RC_%.bz2 reddit/RS_%.bz2 data-official/.create 15 | python src/create_official_data.py --rsinput=reddit/RS_$(*F).bz2 --rcinput=reddit/RC_$(*F).bz2 --subreddit_filter=lists/subreddits-official.txt --domain_filter=lists/domains-official.txt --pickle=data-official/$(*F).pkl --facts=data-official/$(*F).facts.txt --convos=data-official/$(*F).convos.txt --anchoronly=True --use_cc=True > logs-official/$(*F).log 2> logs-official/$(*F).err 16 | 17 | data-official/.create: 18 | mkdir -p logs 19 | mkdir -p data-official 20 | mkdir -p logs-official 21 | touch $@ 22 | 23 | reddit/RS_%.bz2: 24 | mkdir -p reddit 25 | wget https://files.pushshift.io/reddit/submissions/RS_$(*F).bz2 -O reddit/RS_$(*F).bz2 -o logs/RS_$(*F).bz2.log -c 26 | 27 | reddit/RC_%.bz2: 28 | mkdir -p reddit 29 | wget https://files.pushshift.io/reddit/comments/RC_$(*F).bz2 -O reddit/RC_$(*F).bz2 -o logs/RC_$(*F).bz2.log -c 30 | -------------------------------------------------------------------------------- /data_extraction/src/Makefile.official.targets: -------------------------------------------------------------------------------- 1 | OFFICIAL_TRAIN_CONVOS=data-official/2011-01.convos.txt data-official/2011-02.convos.txt data-official/2011-03.convos.txt data-official/2011-04.convos.txt data-official/2011-05.convos.txt data-official/2011-06.convos.txt data-official/2011-07.convos.txt data-official/2011-08.convos.txt 
data-official/2011-09.convos.txt data-official/2011-10.convos.txt data-official/2011-11.convos.txt data-official/2011-12.convos.txt data-official/2012-01.convos.txt data-official/2012-02.convos.txt data-official/2012-03.convos.txt data-official/2012-04.convos.txt data-official/2012-05.convos.txt data-official/2012-06.convos.txt data-official/2012-07.convos.txt data-official/2012-08.convos.txt data-official/2012-09.convos.txt data-official/2012-10.convos.txt data-official/2012-11.convos.txt data-official/2012-12.convos.txt data-official/2013-01.convos.txt data-official/2013-02.convos.txt data-official/2013-03.convos.txt data-official/2013-04.convos.txt data-official/2013-05.convos.txt data-official/2013-06.convos.txt data-official/2013-07.convos.txt data-official/2013-08.convos.txt data-official/2013-09.convos.txt data-official/2013-10.convos.txt data-official/2013-11.convos.txt data-official/2013-12.convos.txt data-official/2014-01.convos.txt data-official/2014-02.convos.txt data-official/2014-03.convos.txt data-official/2014-04.convos.txt data-official/2014-05.convos.txt data-official/2014-06.convos.txt data-official/2014-07.convos.txt data-official/2014-08.convos.txt data-official/2014-09.convos.txt data-official/2014-10.convos.txt data-official/2014-11.convos.txt data-official/2014-12.convos.txt data-official/2015-01.convos.txt data-official/2015-02.convos.txt data-official/2015-03.convos.txt data-official/2015-04.convos.txt data-official/2015-05.convos.txt data-official/2015-06.convos.txt data-official/2015-07.convos.txt data-official/2015-08.convos.txt data-official/2015-09.convos.txt data-official/2015-10.convos.txt data-official/2015-11.convos.txt data-official/2015-12.convos.txt data-official/2016-01.convos.txt data-official/2016-02.convos.txt data-official/2016-03.convos.txt data-official/2016-04.convos.txt data-official/2016-05.convos.txt data-official/2016-06.convos.txt data-official/2016-07.convos.txt data-official/2016-08.convos.txt data-official/2016-09.convos.txt data-official/2016-10.convos.txt data-official/2016-11.convos.txt data-official/2016-12.convos.txt 2 | 3 | OFFICIAL_TRAIN_FACTS=data-official/2011-01.facts.txt data-official/2011-02.facts.txt data-official/2011-03.facts.txt data-official/2011-04.facts.txt data-official/2011-05.facts.txt data-official/2011-06.facts.txt data-official/2011-07.facts.txt data-official/2011-08.facts.txt data-official/2011-09.facts.txt data-official/2011-10.facts.txt data-official/2011-11.facts.txt data-official/2011-12.facts.txt data-official/2012-01.facts.txt data-official/2012-02.facts.txt data-official/2012-03.facts.txt data-official/2012-04.facts.txt data-official/2012-05.facts.txt data-official/2012-06.facts.txt data-official/2012-07.facts.txt data-official/2012-08.facts.txt data-official/2012-09.facts.txt data-official/2012-10.facts.txt data-official/2012-11.facts.txt data-official/2012-12.facts.txt data-official/2013-01.facts.txt data-official/2013-02.facts.txt data-official/2013-03.facts.txt data-official/2013-04.facts.txt data-official/2013-05.facts.txt data-official/2013-06.facts.txt data-official/2013-07.facts.txt data-official/2013-08.facts.txt data-official/2013-09.facts.txt data-official/2013-10.facts.txt data-official/2013-11.facts.txt data-official/2013-12.facts.txt data-official/2014-01.facts.txt data-official/2014-02.facts.txt data-official/2014-03.facts.txt data-official/2014-04.facts.txt data-official/2014-05.facts.txt data-official/2014-06.facts.txt data-official/2014-07.facts.txt data-official/2014-08.facts.txt 
data-official/2014-09.facts.txt data-official/2014-10.facts.txt data-official/2014-11.facts.txt data-official/2014-12.facts.txt data-official/2015-01.facts.txt data-official/2015-02.facts.txt data-official/2015-03.facts.txt data-official/2015-04.facts.txt data-official/2015-05.facts.txt data-official/2015-06.facts.txt data-official/2015-07.facts.txt data-official/2015-08.facts.txt data-official/2015-09.facts.txt data-official/2015-10.facts.txt data-official/2015-11.facts.txt data-official/2015-12.facts.txt data-official/2016-01.facts.txt data-official/2016-02.facts.txt data-official/2016-03.facts.txt data-official/2016-04.facts.txt data-official/2016-05.facts.txt data-official/2016-06.facts.txt data-official/2016-07.facts.txt data-official/2016-08.facts.txt data-official/2016-09.facts.txt data-official/2016-10.facts.txt data-official/2016-11.facts.txt data-official/2016-12.facts.txt 4 | 5 | OFFICIAL_DEV_CONVOS=data-official/2017-01.convos.txt data-official/2017-02.convos.txt data-official/2017-03.convos.txt 6 | 7 | OFFICIAL_DEV_FACTS=data-official/2017-01.facts.txt data-official/2017-02.facts.txt data-official/2017-03.facts.txt 8 | 9 | # Note: validation == dev set, but validation is a subset used for automatic evaluation 10 | OFFICIAL_VALID_CONVOS=data-official-valid/2017-01.convos.txt data-official-valid/2017-02.convos.txt data-official-valid/2017-03.convos.txt 11 | 12 | OFFICIAL_VALID_FACTS=data-official-valid/2017-01.facts.txt data-official-valid/2017-02.facts.txt data-official-valid/2017-03.facts.txt 13 | 14 | OFFICIAL_TEST_CONVOS=data-official-test/2017-04.convos.txt data-official-test/2017-05.convos.txt data-official-test/2017-06.convos.txt data-official-test/2017-07.convos.txt data-official-test/2017-08.convos.txt data-official-test/2017-09.convos.txt data-official-test/2017-10.convos.txt data-official-test/2017-11.convos.txt 15 | 16 | OFFICIAL_TEST_FACTS=data-official-test/2017-04.facts.txt data-official-test/2017-05.facts.txt data-official-test/2017-06.facts.txt data-official-test/2017-07.facts.txt data-official-test/2017-08.facts.txt data-official-test/2017-09.facts.txt data-official-test/2017-10.facts.txt data-official-test/2017-11.facts.txt 17 | 18 | OFFICIAL_TEST_REFS=data-official-test/2017-04.refs.txt data-official-test/2017-05.refs.txt data-official-test/2017-06.refs.txt data-official-test/2017-07.refs.txt data-official-test/2017-08.refs.txt data-official-test/2017-09.refs.txt data-official-test/2017-10.refs.txt data-official-test/2017-11.refs.txt 19 | -------------------------------------------------------------------------------- /data_extraction/src/commoncrawl.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | """ 4 | commoncrawl.py 5 | Class to download web pages from Common Crawl, optionally specifying a target year and month. 
6 | Author: Michel Galley, Microsoft Research NLP Group (dstc7-task2@microsoft.com) 7 | """ 8 | 9 | import sys 10 | import time 11 | import gzip 12 | import json 13 | import urllib.parse 14 | import urllib.request 15 | import chardet 16 | from datetime import datetime 17 | from io import BytesIO 18 | 19 | class CommonCrawl: 20 | 21 | month_keys = [ '2018-05', '2018-04', '2018-03', '2018-02', '2018-01', '2017-12', '2017-11', '2017-10', '2017-09', '2017-08', '2017-07', '2017-06', '2017-05', '2017-04', '2017-03', '2017-02', '2017-01', '2016-12', '2016-10', '2016-09', '2016-08', '2016-07', '2016-06', '2016-05', '2016-04', '2016-02', '2015-11', '2015-09', '2015-08', '2015-07', '2015-06', '2015-05', '2015-04', '2015-03', '2015-02', '2015-01', '2014-12', '2014-11', '2014-10', '2014-09', '2014-08', '2014-07', '2014-04', '2014-03', '2014-02', '2013-09' ] 22 | month_ids = [ '2018-22', '2018-17', '2018-13', '2018-09', '2018-05', '2017-51', '2017-47', '2017-43', '2017-39', '2017-34', '2017-30', '2017-26', '2017-22', '2017-17', '2017-13', '2017-09', '2017-04', '2016-50', '2016-44', '2016-40', '2016-36', '2016-30', '2016-26', '2016-22', '2016-18', '2016-07', '2015-48', '2015-40', '2015-35', '2015-32', '2015-27', '2015-22', '2015-18', '2015-14', '2015-11', '2015-06', '2014-52', '2014-49', '2014-42', '2014-41', '2014-35', '2014-23', '2014-15', '2014-10', '2013-48', '2013-20' ] 23 | index_url_prefix = 'http://index.commoncrawl.org/CC-MAIN-' 24 | data_url = 'https://data.commoncrawl.org/' 25 | index_url_suffix = '%2F&output=json' 26 | error_internal_server = 500 27 | error_bad_gateway = 502 28 | error_service_unavailable = 503 29 | error_gateway_timeout = 504 30 | error_not_found = 404 31 | max_retry = 5 32 | retry_wait = 3 33 | 34 | def __init__(self, month_offset): 35 | self.month_keys_dic = dict([ (self.month_keys[i], i) for i in range(0, len(self.month_keys))]) 36 | self.month_offset = month_offset 37 | 38 | 39 | def get_key(self, url, month, year=None): 40 | if year == None: 41 | return f"{url}|{month}" 42 | return f"{url}|{year}-{month}" 43 | 44 | def get_match(self, cc_match, url, month_id): 45 | k = self.get_key(url, month_id) 46 | return cc_match[k] 47 | 48 | def _get_month_id(self, year, month): 49 | k = year + "-" + month 50 | iyear = int(year) 51 | imonth = int(month) 52 | if k in self.month_keys_dic.keys(): 53 | idx = self.month_keys_dic[k] 54 | elif (iyear == 2014 and imonth <= 2) or (iyear == 2013 and imonth >= 10): 55 | idx = self.month_keys_dic['2014-02'] 56 | elif int(iyear) <= 2013: 57 | idx = self.month_keys_dic['2013-09'] 58 | else: 59 | idx = 0 # >= 2018 60 | idx += self.month_offset 61 | return max(0, min(idx, len(self.month_keys)-1)) 62 | 63 | def download(self, url, year=None, month=None, backward=True, cc_match=None): 64 | """ 65 | Returns html from a url using Common Crawl (CC). 66 | url = identifies page to retrieve 67 | year = of the page 68 | month = month of the page 69 | backward = whether to search backward in time if page isn't found (if false, search forward) 70 | Returns (response, date), where response is the html as a string, and the date the page 71 | was originally retrieved (datetime object). 
72 | """ 73 | idx = 0 74 | if cc_match: 75 | old_date = f'{year}-{month}' 76 | year, month = self.get_match(cc_match, url, year + '-' + month).split('-') 77 | new_date = f'{year}-{month}' 78 | print(f'MATCH\t{url}\t{old_date}\t{new_date}') 79 | if year != None and month != None: 80 | idx = self._get_month_id(year, month) 81 | step = int(backward)*2-1 82 | retry = 0 83 | while 0 <= idx and idx < len(self.month_keys): 84 | month_id = self.month_ids[idx] 85 | iurl = self.index_url_prefix + month_id + '-index?url=' + urllib.parse.quote_plus(url) + self.index_url_suffix 86 | #print(" --> try: %s [%s]" % (url, self.month_keys[idx])) 87 | try: 88 | # Find page in index: 89 | u = urllib.request.urlopen(iurl) 90 | pages = [json.loads(x) for x in u.read().decode('utf-8').strip().split('\n')] 91 | page = pages[0] # To do: if get multiple pages, find closest match in time 92 | 93 | # Get data from warc file: 94 | offset, length = int(page['offset']), int(page['length']) 95 | offset_end = offset + length - 1 96 | dataurl = self.data_url + page['filename'] 97 | request = urllib.request.Request(dataurl) 98 | rangestr = 'bytes={}-{}'.format(offset, offset_end) 99 | request.add_header('Range', rangestr) 100 | u = urllib.request.urlopen(request) 101 | content = u.read() 102 | raw_data = BytesIO(content) 103 | f = gzip.GzipFile(fileobj=raw_data) 104 | 105 | data = f.read() 106 | enc = chardet.detect(data) 107 | els = data.decode(enc['encoding']).strip().split('\r\n\r\n', 2) 108 | if len(els) != 3: 109 | idx = idx + step 110 | else: 111 | warc, header, response = els 112 | date = datetime.strptime(page['timestamp'],'%Y%m%d%H%M%S') 113 | return response, self.month_keys[idx], date 114 | except UnicodeDecodeError: 115 | idx = idx + step 116 | except urllib.error.HTTPError as err: 117 | if err.code == self.error_service_unavailable or \ 118 | err.code == self.error_gateway_timeout or \ 119 | err.code == self.error_bad_gateway or \ 120 | err.code == self.error_internal_server: 121 | if retry >= self.max_retry: 122 | idx = idx + step 123 | retry = 0 124 | else: 125 | retry += 1 126 | time.sleep(self.retry_wait) 127 | msg = "Common Crawl error code %d, waiting %d seconds... 
(retry attempt %d/%d), url: %s" 128 | print(msg % (err.code, self.retry_wait, retry, self.max_retry, iurl), file=sys.stderr) 129 | sys.stderr.flush() 130 | elif err.code == self.error_not_found: 131 | idx = idx + step 132 | retry = 0 133 | else: 134 | idx = idx + step 135 | retry = 0 136 | print("Unexpected error code: %d" % err.code, file=sys.stderr) 137 | sys.stderr.flush() 138 | except Exception as err: 139 | print("Unexpected error: %s, waiting %d seconds" % (err, self.retry_wait), file=sys.stderr) 140 | sys.stderr.flush() 141 | time.sleep(self.retry_wait) 142 | return None, None, None 143 | 144 | if __name__== "__main__": 145 | cc = CommonCrawl(-2) 146 | if len(sys.argv) != 2 and len(sys.argv) != 4: 147 | print("Usage: python %s URL [YEAR] [MONTH]\n\nArguments YEAR and MONTH must have the following format: YYYY and MM" % sys.argv[0], file=sys.stderr) 148 | else: 149 | url = sys.argv[1] 150 | month = None 151 | year = None 152 | if len(sys.argv) == 4: 153 | year = sys.argv[2] 154 | month = sys.argv[3] 155 | html, month_id, date = cc.download(url, year, month) 156 | if html != None: 157 | print(html) 158 | print("<!-- Retrieved on: " + str(date) + " -->") 159 | -------------------------------------------------------------------------------- /data_extraction/src/create_official_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | """ 4 | create_official_data.py 5 | A script to extract: 6 | (1) conversations from Reddit file dumps (originally downloaded from https://files.pushshift.io/reddit/daily/) 7 | (2) grounded data ("facts") extracted from the web, respecting robots.txt 8 | """ 9 | 10 | import sys 11 | import io 12 | import time 13 | import os.path 14 | import re 15 | import argparse 16 | import traceback 17 | import json 18 | import zstandard as zstd 19 | import pickle 20 | import nltk 21 | import urllib.request 22 | import urllib.robotparser 23 | import hashlib 24 | 25 | from bs4 import BeautifulSoup 26 | from bs4.element import NavigableString 27 | from bs4.element import CData 28 | from multiprocessing import Pool 29 | from nltk.tokenize import TweetTokenizer 30 | from commoncrawl import CommonCrawl 31 | 32 | parser = argparse.ArgumentParser() 33 | parser.add_argument("--rsinput", help="Submission (RS) file to load.") 34 | parser.add_argument("--rcinput", help="Comments (RC) file to load.") 35 | parser.add_argument("--test", help="Hashes of test set convos.", default="") 36 | parser.add_argument("--facts", help="Facts file to create.") 37 | parser.add_argument("--convos", help="Convo file to create.") 38 | parser.add_argument("--pickle", help="Pickle that contains conversations and facts.", default="data.pkl") 39 | parser.add_argument("--subreddit_filter", help="List of subreddits (inoffensive, safe for work, etc.)") 40 | parser.add_argument("--domain_filter", help="Filter on subreddits and domains.") 41 | parser.add_argument("--nsubmissions", help="Number of submissions to process (< 0 means all)", default=-1, type=int) 42 | parser.add_argument("--min_fact_len", help="Minimum number of tokens in each fact (reduce noise in html).", default=0, type=int) 43 | parser.add_argument("--min_res_len", help="Min number of characters in response.", default=2, type=int) 44 | parser.add_argument("--max_res_len", help="Max number of characters in response.", default=280, type=int) 45 | parser.add_argument("--max_context_len", help="Max number of words in context.", default=200, type=int) 46 | 
parser.add_argument("--max_depth", help="Maximum length of conversation.", default=5, type=int) 47 | parser.add_argument("--mincomments", help="Minimum number of comments per submission.", default=10, type=int) 48 | parser.add_argument("--minscore", help="Minimum score of each comment.", default=1, type=int) 49 | parser.add_argument("--delay", help="Seconds of delay when crawling web pages", default=0, type=int) 50 | parser.add_argument("--tokenize", help="Whether to tokenize facts and conversations.", default=True, type=bool) 51 | parser.add_argument("--anchoronly", help="Filter out URLs with no named anchors.", default=False, type=bool) 52 | parser.add_argument("--use_robots_txt", help="Whether to respect robots.txt (disable this only if urls have been previously checked by other means!)", default=True, type=bool) 53 | parser.add_argument("--use_cc", help="Whether to download pages from Common Crawl.", default=False, type=bool) 54 | parser.add_argument("--cc_match", help="tsv file containing the precomputation of [url, submission month] -> [cc month]", default='lists/cc-match.tsv', type=str) 55 | parser.add_argument("--dryrun", help="Just collect stats about data; don't create any data.", default=False, type=bool) 56 | parser.add_argument("--blind", help="Don't print out responses.", default=False, type=bool) 57 | args = parser.parse_args() 58 | 59 | fields = [ "id", "subreddit", "score", "num_comments", "domain", "title", "url", "permalink" ] 60 | important_tags = ['title', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'p'] 61 | notext_tags = ['script', 'style'] 62 | deleted_str = '[deleted]' 63 | undisclosed_str = '__UNDISCLOSED__' 64 | max_window_size = 2**31 65 | 66 | batch_download_facts = False 67 | robotparsers = {} 68 | tokenizer = TweetTokenizer(preserve_case=False) 69 | if args.cc_match: 70 | cc = CommonCrawl(0) 71 | else: 72 | cc = CommonCrawl(-2) 73 | 74 | def get_subreddit(submission): 75 | return submission["subreddit"] 76 | def get_domain(submission): 77 | return submission["domain"] 78 | def get_url(submission): 79 | return submission["url"] 80 | def get_submission_text(submission): 81 | return submission["title"] 82 | def get_permalink(submission): 83 | return submission["permalink"] 84 | def get_submission_id(submission): 85 | return submission["id"] 86 | def get_comment_id(comment): 87 | return "t1_" + comment["id"] 88 | #return comment["name"] 89 | def get_parent_comment_id(comment): 90 | return comment["parent_id"] 91 | def get_text(comment): 92 | return comment["body"] 93 | def get_user(comment): 94 | return comment["author"] 95 | def get_score(comment): 96 | return comment["score"] 97 | def get_linked_submission_id(comment): 98 | return comment["link_id"].split("_")[1] 99 | 100 | def get_anchor(url): 101 | pos = url.find("#") 102 | if (pos > 0): 103 | label = url[pos+1:] 104 | label = label.strip() 105 | return label 106 | return "" 107 | 108 | def filter_submission(submission): 109 | """Determines whether to filter out this submission (over-18, deleted user, etc.).""" 110 | if submission["num_comments"] < args.mincomments: 111 | return True 112 | if "num_crossposts" in submission and submission["num_crossposts"] > 0: 113 | return True 114 | if "locked" in submission and submission["locked"]: 115 | return True 116 | if "over-18" in submission and submission["over_18"]: 117 | return True 118 | if "brand_safe" in submission and not submission["brand_safe"]: 119 | return True 120 | if submission["distinguished"] != None: 121 | return True 122 | if "subreddit_type" in 
submission: 123 | if submission["subreddit_type"] == "restricted": # filter only public 124 | return True 125 | if submission["subreddit_type"] == "archived": 126 | return True 127 | url = get_url(submission) 128 | domain = get_domain(submission) 129 | if domain.find("reddit.com") >= 0 or domain.find("twitter.com") >= 0 or domain.find("youtube.com") >= 0 or domain.find("youtube.com") >= 0 or domain.find("imgur.com") >= 0 or domain.find("flickr.com") >= 0 or domain.find("ebay.com") >= 0: 130 | return True 131 | if args.anchoronly and len(get_anchor(url)) <= 2: 132 | return True 133 | if url.find(" ") >= 0: 134 | return True 135 | if url.endswith("jpg") or url.endswith("gif") or url.endswith("png") or url.endswith("pdf"): 136 | return True 137 | return False 138 | 139 | def norm_article(t): 140 | """Minimalistic processing with linebreaking.""" 141 | t = re.sub("\s*\n+\s*","\n", t) 142 | t = re.sub(r'(</[pP]>)',r'\1\n', t) 143 | t = re.sub("[ \t]+"," ", t) 144 | t = t.strip() 145 | return t 146 | 147 | def norm_sentence(t): 148 | """Minimalistic processing: remove extra space characters.""" 149 | t = re.sub("[ \n\r\t]+", " ", t) 150 | t = t.strip() 151 | if args.tokenize: 152 | t = " ".join(tokenizer.tokenize(t)) 153 | t = t.replace('[ deleted ]','[deleted]'); 154 | return t 155 | 156 | def add_webpage(submission, year, month, cc_match=None): 157 | """Retrive sentences ('facts') from submission["url"]. """ 158 | if args.use_cc: 159 | return add_cc_webpage(submission, year, month, cc_match) 160 | return add_live_webpage(submission) 161 | 162 | def add_cc_webpage(submission, year, month, cc_match=None): 163 | url = get_url(submission) 164 | kstr = cc.get_key(url, month, year) 165 | if cc_match and kstr not in cc_match: 166 | print("Can't fetch (precomputed): [%s] submission month: [%s-%s]" % (url, year, month)) 167 | sys.stdout.flush() 168 | return None 169 | src, month_id, date = cc.download(url, year, month, False, cc_match=cc_match) 170 | sys.stdout.flush() 171 | if src == None: 172 | src, month_id, date = cc.download(url, year, month, True, cc_match=cc_match) 173 | sys.stdout.flush() 174 | if src == None: 175 | print("Can't fetch: [%s] submission month: [%s-%s]" % (url, year, month)) 176 | sys.stdout.flush() 177 | return None 178 | print("Fetching url: [%s] submission month: [%s-%s] -> [%s] commoncrawl date: [%s]" % (url, year, month, month_id, str(date))) 179 | print("FETCH\t%s\t%s-%s\t%s" % (url, year, month, month_id)) 180 | sys.stdout.flush() 181 | submission["source"] = src 182 | return submission 183 | 184 | def add_live_webpage(submission): 185 | url = get_url(submission) 186 | domain = get_domain(submission) 187 | try: 188 | if args.use_robots_txt: 189 | if args.delay > 0: 190 | time.sleep(args.delay) 191 | if domain in robotparsers.keys(): 192 | rp = robotparsers[domain] 193 | else: 194 | rp = urllib.robotparser.RobotFileParser() 195 | robotparsers[domain] = rp 196 | rurl = "http://" + domain + "/robots.txt" 197 | print("Fetching robots.txt: [%s]" % rurl) 198 | rp.set_url(rurl) 199 | rp.read() 200 | if not rp.can_fetch("*", url): 201 | print("Can't download url due to robots.txt: [%s] domain: [%s]" % (url, domain)) 202 | return None 203 | print("Fetching url: [%s] domain: [%s]" % (url, domain)) 204 | u = urllib.request.urlopen(url) 205 | src = u.read() 206 | submission["source"] = src 207 | return submission 208 | except urllib.error.HTTPError: 209 | return None 210 | except urllib.error.URLError: 211 | return None 212 | except UnicodeEncodeError: 213 | return None 214 | except: 
215 | traceback.print_exc() 216 | return None 217 | 218 | def add_webpages(submissions): 219 | """Use multithreading to retrieve multiple webpages at once.""" 220 | print("Downloading %d pages:" % len(submissions)) 221 | pool = Pool() 222 | submissions = pool.map(add_webpage, submissions) 223 | print("\nDone.") 224 | return [s for s in submissions if s is not None] 225 | 226 | def get_date(file_name): 227 | m = re.search(r'(\d\d\d\d)-(\d\d)', file_name) 228 | year = m.group(1) 229 | month = m.group(2) 230 | return year, month 231 | 232 | def get_submissions(rs_file, subreddit_file, domain_file, cc_match=None): 233 | """Return all submissions from a dump submission file rs_file (RS_*.zst), 234 | restricted to the subreddit+domain listed in filter_file.""" 235 | submissions = [] 236 | subreddit_dic = None 237 | domain_dic = None 238 | year, month = get_date(rs_file) 239 | if subreddit_file != None: 240 | with open(subreddit_file) as f: 241 | subreddit_dic = dict([ (el.strip(), 1) for el in f.readlines() ]) 242 | if domain_file != None: 243 | with open(domain_file) as f: 244 | domain_dic = dict([ (el.strip(), 1) for el in f.readlines() ]) 245 | 246 | with open(rs_file, 'rb') as fh: 247 | dctx = zstd.ZstdDecompressor(max_window_size=max_window_size) 248 | with dctx.stream_reader(fh) as reader: 249 | i = 0 250 | for line in io.TextIOWrapper(io.BufferedReader(reader), encoding='utf-8'): 251 | try: 252 | submission = json.loads(line) 253 | if not filter_submission(submission): 254 | subreddit = get_subreddit(submission) 255 | domain = get_domain(submission) 256 | scheck = subreddit_dic == None or subreddit in subreddit_dic 257 | dcheck = domain_dic == None or domain in domain_dic 258 | if scheck and dcheck: 259 | s = dict([ (f, submission[f]) for f in fields ]) 260 | print("keeping: subreddit=%s\tdomain=%s" % (subreddit, domain)) 261 | if args.dryrun: 262 | continue 263 | if not batch_download_facts: 264 | s = add_webpage(s, year, month, cc_match=cc_match) 265 | submissions.append(s) 266 | sys.stdout.flush() 267 | sys.stderr.flush() 268 | i += 1 269 | if i == args.nsubmissions: 270 | break 271 | else: 272 | print("skipping: subreddit=%s\tdomain=%s (%s %s)" % (subreddit, domain, scheck, dcheck)) 273 | pass 274 | except json.decoder.JSONDecodeError: 275 | pass 276 | except Exception: 277 | traceback.print_exc() 278 | pass 279 | if batch_download_facts: 280 | submissions = add_webpages(submissions) 281 | else: 282 | submissions = [s for s in submissions if s is not None] 283 | return dict([ (get_submission_id(s), s) for s in submissions ]) 284 | 285 | def get_comments(rc_file, submissions): 286 | """Return all conversation triples from rc_file (RC_*.bz2), 287 | restricted to given submissions.""" 288 | comments = {} 289 | with open(rc_file, 'rb') as fh: 290 | dctx = zstd.ZstdDecompressor(max_window_size=max_window_size) 291 | with dctx.stream_reader(fh) as reader: 292 | for line in io.TextIOWrapper(io.BufferedReader(reader), encoding='utf-8'): 293 | try: 294 | comment = json.loads(line) 295 | sid = get_linked_submission_id(comment) 296 | if sid in submissions.keys(): 297 | comments[get_comment_id(comment)] = comment 298 | except Exception: 299 | traceback.print_exc() 300 | pass 301 | return comments 302 | 303 | def load_data(cc_match=None): 304 | """Load data either from a pickle file if it exists, 305 | and otherwise from RC_* RS_* and directly from the web.""" 306 | if not os.path.isfile(args.pickle): 307 | submissions = get_submissions(args.rsinput, args.subreddit_filter, args.domain_filter, 
cc_match=cc_match) 308 | comments = get_comments(args.rcinput, submissions) 309 | with open(args.pickle, 'wb') as f: 310 | pickle.dump([submissions, comments], f, protocol=pickle.HIGHEST_PROTOCOL) 311 | else: 312 | with open(args.pickle, 'rb') as f: 313 | [submissions, comments] = pickle.load(f) 314 | return submissions, comments 315 | 316 | def insert_escaped_tags(tags, label=None): 317 | """For each tag in "tags", insert contextual tags (e.g., <p> </p>) as escaped text 318 | so that these tags are still there when html markup is stripped out.""" 319 | found = False 320 | for tag in tags: 321 | strs = list(tag.strings) 322 | if len(strs) > 0: 323 | if label != None: 324 | l = label 325 | else: 326 | l = tag.name 327 | strs[0].parent.insert(0, NavigableString("<"+l+">")) 328 | strs[-1].parent.append(NavigableString("</"+l+">")) 329 | found = True 330 | return found 331 | 332 | def save_facts(submissions, sids = None): 333 | subs = {} 334 | i = 0 335 | if args.facts == '-': 336 | return submissions 337 | with open(args.facts, 'wt', encoding="utf-8") as f: 338 | for id in sorted(submissions.keys()): 339 | s = submissions[id] 340 | url = get_url(s) 341 | label = get_anchor(url) 342 | print("Processing submission %s...\n\turl: %s\n\tanchor: %s\n\tpermalink: http://reddit.com%s" % (id, url, str(label), get_permalink(s))) 343 | subs[id] = s 344 | if sids == None or id in sids.keys(): 345 | b = BeautifulSoup(s["source"],'html.parser') 346 | # If there is any anchor in the url, locate it in the facts: 347 | if label != "": 348 | if not insert_escaped_tags(b.find_all(True, attrs={"id": label}), 'anchor'): 349 | print("\t(couldn't find anchor on page: %s)" % label) 350 | # Remove tags whose text we don't care about (javascript, etc.): 351 | for el in b(notext_tags): 352 | el.decompose() 353 | # Delete other unimportant tags, but keep the text: 354 | for tag in b.findAll(True): 355 | if tag.name not in important_tags: 356 | tag.append(' ') 357 | tag.replaceWithChildren() 358 | # All tags left are important (e.g., <p>) so add them to the text: 359 | insert_escaped_tags(b.find_all(True)) 360 | # Extract facts from html: 361 | t = b.get_text(" ") 362 | t = norm_article(t) 363 | facts = [] 364 | for sent in filter(None, t.split("\n")): 365 | if len(sent.split(" ")) >= args.min_fact_len: 366 | facts.append(norm_sentence(sent)) 367 | for fact in facts: 368 | out_str = "\t".join([get_subreddit(s), id, get_domain(s), fact]) 369 | hash_str = hashlib.sha224(out_str.encode("utf-8")).hexdigest() 370 | f.write(hash_str + "\t" + out_str + "\n") 371 | s["facts"] = facts 372 | i += 1 373 | if i == args.nsubmissions: 374 | break 375 | return subs 376 | 377 | def get_convo(id, submissions, comments, depth=args.max_depth): 378 | c = comments[id] 379 | pid = get_parent_comment_id(c) 380 | if pid in comments.keys() and depth > 0: 381 | els = get_convo(pid, submissions, comments, depth-1) 382 | else: 383 | s = submissions[get_linked_submission_id(c)] 384 | els = [ "START", norm_sentence(get_submission_text(s)) ] 385 | els.append(norm_sentence(get_text(c))) 386 | return els 387 | 388 | def save_tuple(f, subreddit, sid, pos, user, context, message, response, score, test_hashes): 389 | cwords = re.split("\s+", context) 390 | mwords = re.split("\s+", message) 391 | max_len = max(args.max_context_len, len(mwords)+1) 392 | if len(cwords) > max_len: 393 | ndel = len(cwords) - max_len 394 | del cwords[:ndel] 395 | context = "... 
" + " ".join(cwords) 396 | if len(response) <= args.max_res_len and len(response) >= args.min_res_len and response != deleted_str and user != deleted_str and response.find(">") < 0: 397 | if context.find(deleted_str) < 0: 398 | if score >= args.minscore: 399 | out_str = "\t".join([subreddit, sid, str(score), str(pos), context, response]) 400 | hash_str = hashlib.sha224(out_str.encode("utf-8")).hexdigest() 401 | if test_hashes == None or hash_str in test_hashes.keys(): 402 | if args.blind: 403 | ## Note: there is no point in removing the '--blind' flag in order to peek at the reference responses (gold), 404 | ## as the organizers will rely on different responses to compute BLEU, etc. 405 | out_str = "\t".join([subreddit, sid, str(score), str(pos), context, undisclosed_str]) 406 | f.write(hash_str + "\t" + out_str + "\n") 407 | 408 | def save_tuples(submissions, comments, test_hashes): 409 | has_firstturn = {} 410 | with open(args.convos, 'wt', encoding="utf-8") as f: 411 | for id in sorted(comments.keys()): 412 | comment = comments[id] 413 | user = get_user(comment) 414 | score = get_score(comment) 415 | sid = get_linked_submission_id(comment) 416 | if sid in submissions.keys(): 417 | s = submissions[sid] 418 | convo = get_convo(id, submissions, comments) 419 | pos = len(convo) - 1 420 | context = " EOS ".join(convo[:-1]) 421 | message = convo[-2] 422 | response = convo[-1] 423 | if len(convo) == 3 and not sid in has_firstturn.keys(): 424 | save_tuple(f, get_subreddit(s), sid, pos-1, "", convo[-3], convo[-3], message, 1, test_hashes) 425 | has_firstturn[sid] = 1 426 | save_tuple(f, get_subreddit(s), sid, pos, user, context, message, response, score, test_hashes) 427 | 428 | def read_test_hashes(hash_file): 429 | hashes = {} 430 | with open(hash_file, 'r') as f: 431 | for line in f: 432 | hash_str = line.rstrip() 433 | hashes[hash_str] = 1 434 | return hashes 435 | 436 | def load_cc_match(match_file): 437 | cc_match = {} 438 | with open(match_file, 'r') as f: 439 | for line in f: 440 | els = line.strip().split('\t') 441 | k = cc.get_key(els[0], els[1]) 442 | cc_match[k] = els[2] 443 | return cc_match 444 | 445 | if __name__== "__main__": 446 | test_hashes = None 447 | if args.test != "": 448 | test_hashes = read_test_hashes(args.test) 449 | cc_match = None 450 | if args.cc_match != '': 451 | cc_match = load_cc_match(args.cc_match) 452 | submissions, comments = load_data(cc_match=cc_match) 453 | submissions = save_facts(submissions) 454 | save_tuples(submissions, comments, test_hashes) 455 | -------------------------------------------------------------------------------- /data_extraction/src/ids2refs.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | 4 | refs = {} 5 | for line in sys.stdin: 6 | line = line.rstrip() 7 | els = line.split("\t") 8 | hashstr = els[0] 9 | response = els[6] 10 | refs[hashstr] = response 11 | 12 | with open(sys.argv[1]) as f: 13 | for line in f: 14 | line = line.rstrip() 15 | els = line.split("\t") 16 | sys.stdout.write(els[0]) 17 | for i in range(1,len(els)): 18 | p = els[i].split('|') 19 | score = p[0] 20 | hashstr = p[1] 21 | if hashstr in refs.keys(): 22 | sys.stdout.write("\t" + score + "|" + refs[hashstr]) 23 | else: 24 | print("WARNING: missing ref, automatic eval scores may differ: [%s]" % hashstr, file=sys.stderr) 25 | sys.stdout.write("\n") 26 | -------------------------------------------------------------------------------- /data_extraction/trial/README.md: 
-------------------------------------------------------------------------------- 1 | # Data Extraction for DSTC7: End-to-End Conversation Modeling 2 | 3 | Task 2 uses conversational data extracted from Reddit, along with the text of the link that started these conversations. This page provides scripts to extract the data from a Reddit [dump](http://files.pushshift.io/reddit/comments/), as we are unable to release the data directly ourselves. 4 | 5 | *Note: In the original proposal, we planned to use Twitter data (conversational data) and Foursquare (grounded data), but decided to use Reddit, owing to the volatility of Twitter data, as well as the technical difficulties of aligning Twitter content with data from other sources. Reddit provides an intuitive direct link to external data in the submissions that can be utilized for this task.* 6 | 7 | ## Requirements 8 | 9 | * `Python 3.x`, with modules: 10 | * `nltk` 11 | * `beautifulsoup4` 12 | * `make` 13 | * `wget` 14 | 15 | 16 | ## Create trial data: 17 | 18 | To create the trial data, please run: 19 | 20 | ```src/create_trial_data.sh``` 21 | 22 | This will create two tab-separated (tsv) files `data/trial.convos.txt` and `data/trial.facts.txt`, which respectively contain the conversational data and grounded text ("facts"). This requires about 20 GB of disk space. 23 | 24 | ### Notes: 25 | 26 | * **Web crawling**: The above script downloads grounding information directly from the web, but does respect the servers' `robots.txt` rules. The official version of the data (forthcoming) will extract that data from [Common Crawl](http://commoncrawl.org/), to ensure that all participants use exactly the same data, and to minimize the number of dead links. 27 | * **Data split**: The official data will be divided into train/dev/test, but the trial data isn't. 28 | * **Offensive language**: We restricted the data to subreddits that are generally inoffensive. However, even the most "well behaved" subreddits occasionally contain offensive and explicit language, and the trial version of the data does not attempt to remove it. 29 | 30 | ## Data description: 31 | 32 | Each conversation in this dataset consists of a Reddit `submission` and its following discussion-like `comments`. In this data, we restrict ourselves to submissions that provide a `URL` along with a `title` (see [example Reddit submission](https://www.reddit.com/r/todayilearned/comments/f2ruz/til_a_woman_fell_30000_feet_from_an_airplane_and/), which refers to [this web page](https://en.wikipedia.org/wiki/Vesna_Vulovi%C4%87)). The web page scraped from the URL provides grounding or context to the conversation, and is additional (non-conversational) input that models can condition on to produce responses that are more informative and contentful. 33 | 34 | ### Conversation file: 35 | 36 | Each line of `trial.convos.txt` contains a Reddit response and its preceding conversational context. Long conversational contexts are truncated by keeping the last 100 words. The file contains 5 columns: 37 | 38 | 1. subreddit name 39 | 2. conversation ID 40 | 3. response score 41 | 4. conversational context, usually multiple turns (input of the model) 42 | 5. response (output of the model) 43 | 44 | The conversational context may contain: 45 | * EOS: special symbol indicating a turn transition 46 | * START: special symbol indicating the start of the conversation 47 | 
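For illustration, here is a minimal sketch of one way to read this file (this is not part of the provided scripts; the field names and file path below are just examples matching the five tab-separated columns listed above):

```python
import csv

# Labels for the five tab-separated columns of trial.convos.txt, in the order listed above.
FIELDS = ["subreddit", "conversation_id", "score", "context", "response"]

with open("data/trial.convos.txt", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        convo = dict(zip(FIELDS, row))
        # Turns within the conversational context are separated by the special " EOS " symbol.
        turns = convo["context"].split(" EOS ")
        print(convo["subreddit"], convo["conversation_id"], len(turns), convo["response"][:40])
```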
48 | ### Facts file: 49 | 50 | Each line of `trial.facts.txt` contains a "fact": either a sentence, a paragraph, or another snippet of text relevant to the current conversation. Use conversation IDs to find the facts relevant to each conversation. Note: facts relevant to a given conversation are ordered as they appear on the original web page. The file contains 3 columns: 51 | 52 | 1. subreddit name 53 | 2. conversation ID 54 | 3. fact 55 | 56 | To produce the facts relevant to each conversation, we extracted the text of the page using an html-to-text converter ([BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)), but kept the most important tags intact (`<title>, <h1-6>, <p>, etc`). As web formatting differs substantially from domain to domain and common tags like `<p>` may not be used in some domains, we decided to keep all the text of the original page (however, we do remove javascript and style code). As some of the fact data tends to be noisy, you may want to restrict yourself to facts delimited by these tags. 57 | 58 | 59 | #### Labeled anchors 60 | 61 | A substantial number of URLs contain labeled anchors, for example: 62 | 63 | ```http://en.wikipedia.org/wiki/John_Rhys-Davies#The_Lord_of_the_Rings_trilogy``` 64 | 65 | which here refers to the label `The_Lord_of_the_Rings_trilogy`. This information is preserved in the facts, and indicated with the tags `<anchor>` and `</anchor>`. As many web pages in this dataset are lengthy, anchors are probably useful information, as they indicate what text the model should likely attend to in order to produce a good response. 66 | 67 | ### Data statistics: 68 | 69 | | | Trial data | Train set | Dev set | Test set | 70 | | ---- | ---- | ---- | ---- | ---- | 71 | |# dialogue turns | 649,866 | - | - | - | 72 | |# facts | 4,320,438 | - | - | - | 73 | |# tagged facts (1) | 998,032 | - | - | - | 74 | 75 | (1): facts tagged with html markup (e.g., <title>) and therefore potentially important. 76 | 77 | ### Sample data: 78 | 79 | #### Sample conversation turn (from trial.convos.txt): 80 | 81 | ```todayilearned \t f2ruz \t 145 \t START EOS til a woman fell 30,000 feet from an airplane and survived . \t the page states that a 2009 report found the plane only fell several hundred meters .``` 82 | 83 | Maps to: 84 | 1. subreddit name: `todayilearned` 85 | 2. conversation ID: `f2ruz` 86 | 3. response score: `145` 87 | 4. conversational context: `START EOS til a woman fell 30,000 feet from an airplane and survived .` 88 | 5. response: `the page states that a 2009 report found the plane only fell several hundred meters .` 89 | 90 | #### Sample fact (from trial.facts.txt): 91 | 92 | ```todayilearned \t f2ruz \t <p> four years later , peter hornung-andersen and pavel theiner , two prague-based journalists , claimed that flight 367 had been mistaken for an enemy aircraft and shot down by the czechoslovak air force at an altitude of 800 metres ( 2,600 ft ) . </p>``` 93 | 94 | Maps to: 95 | 1. subreddit name: `todayilearned` 96 | 2. conversation ID: `f2ruz` 97 | 3. fact: `<p> four years later , peter hornung-andersen and pavel theiner , two prague-based journalists , claimed that flight 367 had been mistaken for an enemy aircraft and shot down by the czechoslovak air force at an altitude of 800 metres ( 2,600 ft ) . 
</p>` 98 | -------------------------------------------------------------------------------- /data_extraction/trial/lists/data-trial.txt: -------------------------------------------------------------------------------- 1 | 2011-01 2 | 2011-02 3 | 2011-03 4 | 2011-04 5 | 2011-05 6 | 2011-06 7 | 2011-07 8 | 2011-08 9 | 2011-09 10 | 2011-10 11 | 2011-11 12 | 2011-12 13 | -------------------------------------------------------------------------------- /data_extraction/trial/lists/domains.txt: -------------------------------------------------------------------------------- 1 | abc.net.au 2 | androidpolice.com 3 | arstechnica.com 4 | bbc.co.uk 5 | brisbanetimes.com.au 6 | cbc.ca 7 | cnn.com 8 | en.m.wikipedia.org 9 | en.wikipedia.org 10 | engadget.com 11 | english.aljazeera.net 12 | espn.go.com 13 | eurogamer.net 14 | france24.com 15 | github.com 16 | gizmodo.com 17 | globalpost.com 18 | huffingtonpost.com 19 | mlb.mlb.com 20 | nasa.gov 21 | nba.com 22 | ncbi.nlm.nih.gov 23 | news.bbc.co.uk 24 | news.cnet.com 25 | news.yahoo.com 26 | nfl.com 27 | nhl.com 28 | nzherald.co.nz 29 | pbs.org 30 | phys.org 31 | physorg.com 32 | profootballtalk.nbcsports.com 33 | reuters.com 34 | smh.com.au 35 | snopes.com 36 | soccernet.espn.go.com 37 | space.com 38 | sports.espn.go.com 39 | sports.yahoo.com 40 | stuff.co.nz 41 | techdirt.com 42 | telegraph.co.uk 43 | theage.com.au 44 | thestar.com 45 | theverge.com 46 | uk.eurosport.yahoo.com 47 | uk.reuters.com 48 | wikipedia.org 49 | -------------------------------------------------------------------------------- /data_extraction/trial/lists/subreddits.txt: -------------------------------------------------------------------------------- 1 | android 2 | australia 3 | canada 4 | europe 5 | games 6 | ireland 7 | movies 8 | nba 9 | newzealand 10 | nfl 11 | programming 12 | science 13 | soccer 14 | todayilearned 15 | toronto 16 | unitedkingdom 17 | worldnews 18 | -------------------------------------------------------------------------------- /data_extraction/trial/src/Makefile.trial: -------------------------------------------------------------------------------- 1 | .SECONDARY: 2 | 3 | trial: data/trial.convos.txt data/trial.facts.txt 4 | 5 | TRIAL_CONVOS=data/2011-01.convos.txt data/2011-02.convos.txt data/2011-03.convos.txt data/2011-04.convos.txt data/2011-05.convos.txt data/2011-06.convos.txt data/2011-07.convos.txt data/2011-08.convos.txt data/2011-09.convos.txt data/2011-10.convos.txt data/2011-11.convos.txt data/2011-12.convos.txt 6 | 7 | TRIAL_FACTS=data/2011-01.facts.txt data/2011-02.facts.txt data/2011-03.facts.txt data/2011-04.facts.txt data/2011-05.facts.txt data/2011-06.facts.txt data/2011-07.facts.txt data/2011-08.facts.txt data/2011-09.facts.txt data/2011-10.facts.txt data/2011-11.facts.txt data/2011-12.facts.txt 8 | 9 | data/trial.convos.txt: $(TRIAL_CONVOS) 10 | cat $+ > data/trial.convos.txt 11 | 12 | data/trial.facts.txt: $(TRIAL_FACTS) 13 | cat $+ > data/trial.facts.txt 14 | 15 | data/%.facts.txt data/%.convos.txt: reddit/RC_%.bz2 reddit/RS_%.bz2 16 | python src/create_trial_data.py --rsinput=../reddit/RS_$(*F).bz2 --rcinput=../reddit/RC_$(*F).bz2 --subreddit_filter=lists/subreddits.txt --domain_filter=lists/domains.txt --pickle=data/$(*F).pkl --facts=data/$(*F).facts.txt --convos=data/$(*F).convos.txt > logs/$(*F).log 2> logs/$(*F).err 17 | 18 | reddit/RS_%.bz2: 19 | wget https://files.pushshift.io/reddit/submissions/RS_$(*F).bz2 -O ../reddit/RS_$(*F).bz2 -o logs/RS_$(*F).bz2.log -c 20 | 21 | reddit/RC_%.bz2: 22 | wget 
https://files.pushshift.io/reddit/comments/RC_$(*F).bz2 -O ../reddit/RC_$(*F).bz2 -o logs/RC_$(*F).bz2.log -c 23 | -------------------------------------------------------------------------------- /data_extraction/trial/src/create_trial_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | """ 4 | create_trial_data.py 5 | A script to turn extract: 6 | (1) conversation from Reddit file dumps (originally downloaded from https://files.pushshift.io/reddit/daily/) 7 | (2) grounded data ("facts") extracted from the web, respecting robots.txt 8 | Author: Michel Galley, Microsoft Research NLP Group (dstc7-task2@microsoft.com) 9 | """ 10 | 11 | import sys 12 | import time 13 | import os.path 14 | import re 15 | import argparse 16 | import traceback 17 | import json 18 | import bz2 19 | import pickle 20 | import nltk 21 | import urllib.request 22 | import urllib.robotparser 23 | 24 | from bs4 import BeautifulSoup 25 | from bs4.element import NavigableString 26 | from bs4.element import CData 27 | from multiprocessing import Pool 28 | from nltk.tokenize import TweetTokenizer 29 | 30 | parser = argparse.ArgumentParser() 31 | parser.add_argument("--rsinput", help="Submission (RS) file to load.") 32 | parser.add_argument("--rcinput", help="Comments (RC) file to load.") 33 | parser.add_argument("--facts", help="Facts file to create.") 34 | parser.add_argument("--convos", help="Convo file to create.") 35 | parser.add_argument("--pickle", help="Pickle that contains conversations and facts.", default="data.pkl") 36 | parser.add_argument("--subreddit_filter", help="List of subreddits (inoffensive, safe for work, etc.)") 37 | parser.add_argument("--domain_filter", help="Filter on subreddits and domains.") 38 | parser.add_argument("--nsubmissions", help="Number of submissions to process (< 0 means all)", default=-1, type=int) 39 | parser.add_argument("--min_fact_len", help="Minimum number of tokens in each fact (reduce noise in html).", default=0, type=int) 40 | parser.add_argument("--max_res_len", help="Max number of characters in response.", default=280, type=int) 41 | parser.add_argument("--max_context_len", help="Max number of words in context.", default=100, type=int) 42 | parser.add_argument("--max_depth", help="Maximum length of conversation.", default=5, type=int) 43 | parser.add_argument("--mincomments", help="Minimum number of comments per submission.", default=10, type=int) 44 | parser.add_argument("--delay", help="Seconds of delay when crawling web pages", default=1, type=int) 45 | parser.add_argument("--tokenize", help="Whether to tokenize facts and conversations.", default=True, type=bool) 46 | parser.add_argument("--dryrun", help="Just collect stats about data; don't create any data.", default=False, type=bool) 47 | args = parser.parse_args() 48 | 49 | fields = [ "id", "subreddit", "score", "num_comments", "domain", "title", "url", "permalink" ] 50 | important_tags = ['title', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'p'] 51 | notext_tags = ['script', 'style'] 52 | deleted_str = '[deleted]' 53 | 54 | batch_download_facts = False 55 | robotparsers = {} 56 | tokenizer = TweetTokenizer(preserve_case=False) 57 | 58 | def get_subreddit(submission): 59 | return submission["subreddit"] 60 | def get_domain(submission): 61 | return submission["domain"] 62 | def get_url(submission): 63 | return submission["url"] 64 | def get_submission_text(submission): 65 | return submission["title"] 66 | def 
get_permalink(submission): 67 | return submission["permalink"] 68 | def get_submission_id(submission): 69 | return submission["id"] 70 | def get_comment_id(comment): 71 | return comment["name"] 72 | def get_parent_comment_id(comment): 73 | return comment["parent_id"] 74 | def get_text(comment): 75 | return comment["body"] 76 | def get_user(comment): 77 | return comment["author"] 78 | def get_score(comment): 79 | return comment["score"] 80 | def get_linked_submission_id(comment): 81 | return comment["link_id"].split("_")[1] 82 | 83 | def filter_submission(submission): 84 | """Determines whether to filter out this submission (over-18, deleted user, etc.).""" 85 | if submission["num_comments"] < args.mincomments: 86 | return True 87 | if "num_crossposts" in submission and submission["num_crossposts"] > 0: 88 | return True 89 | if "locked" in submission and submission["locked"]: 90 | return True 91 | if "over-18" in submission and submission["over_18"]: 92 | return True 93 | if "brand_safe" in submission and not submission["brand_safe"]: 94 | return True 95 | if submission["distinguished"] != None: 96 | return True 97 | if "subreddit_type" in submission: 98 | if submission["subreddit_type"] == "restricted": # filter only public 99 | return True 100 | if submission["subreddit_type"] == "archived": 101 | return True 102 | url = get_url(submission) 103 | if url.find("reddit.com") >= 0 or url.find("twitter.com") >= 0: 104 | return True 105 | if url.find(" ") >= 0: 106 | return True 107 | if url.endswith("jpg") or url.endswith("gif") or url.endswith("png") or url.endswith("pdf"): 108 | return True 109 | return False 110 | 111 | def norm_article(t): 112 | """Minimalistic processing with linebreaking.""" 113 | t = re.sub("\s*\n+\s*","\n", t) 114 | t = re.sub(r'(</[pP]>)',r'\1\n', t) 115 | t = re.sub("[ \t]+"," ", t) 116 | t = t.strip() 117 | return t 118 | 119 | def norm_sentence(t): 120 | """Minimalistic processing: remove extra space characters.""" 121 | t = re.sub("[ \n\r\t]+", " ", t) 122 | t = t.strip() 123 | if args.tokenize: 124 | t = " ".join(tokenizer.tokenize(t)) 125 | t = t.replace('[ deleted ]','[deleted]'); 126 | return t 127 | 128 | def add_webpage(submission): 129 | """Retrive sentences ('facts') from submission["url"]. 130 | Note: For the final version, of the training/dev/test data, we will rely on 131 | the Common Crawl so that we can ensure the data is the same for each participant. 
132 | This will also speed up data creation.""" 133 | sys.stderr.flush() 134 | url = get_url(submission) 135 | domain = get_domain(submission) 136 | try: 137 | if args.delay > 0: 138 | time.sleep(args.delay) 139 | if domain in robotparsers.keys(): 140 | rp = robotparsers[domain] 141 | else: 142 | rp = urllib.robotparser.RobotFileParser() 143 | robotparsers[domain] = rp 144 | rurl = "http://" + domain + "/robots.txt" 145 | print("Fetching robots.txt: [%s]" % rurl, file=sys.stderr) 146 | rp.set_url(rurl) 147 | rp.read() 148 | if not rp.can_fetch("*", url): 149 | print("Can't download url due to robots.txt: [%s]" % url, file=sys.stderr) 150 | return None 151 | print("Fetching url: [%s]" % url, file=sys.stderr) 152 | u = urllib.request.urlopen(url) 153 | src = u.read() 154 | submission["source"] = src 155 | return submission 156 | except urllib.error.HTTPError: 157 | return None 158 | except urllib.error.URLError: 159 | return None 160 | except UnicodeEncodeError: 161 | return None 162 | except: 163 | traceback.print_exc() 164 | return None 165 | 166 | def add_webpages(submissions): 167 | """Use multithreading to retrieve multiple webpages at once.""" 168 | print("Downloading %d pages:" % len(submissions), file=sys.stderr) 169 | sys.stderr.flush() 170 | pool = Pool() 171 | submissions = pool.map(add_webpage, submissions) 172 | print("\nDone.", file=sys.stderr) 173 | sys.stderr.flush() 174 | return [s for s in submissions if s is not None] 175 | 176 | def get_submissions(rs_file, subreddit_file, domain_file): 177 | """Return all submissions from a dump submission file rs_file (RS_*.bz2), 178 | restricted to the subreddit+domain listed in filter_file.""" 179 | submissions = [] 180 | subreddit_dic = None 181 | domain_dic = None 182 | if subreddit_file != None: 183 | with open(subreddit_file) as f: 184 | subreddit_dic = dict([ (el.strip(), 1) for el in f.readlines() ]) 185 | if domain_file != None: 186 | with open(domain_file) as f: 187 | domain_dic = dict([ (el.strip(), 1) for el in f.readlines() ]) 188 | with bz2.open(rs_file, 'rt', encoding="utf-8") as f: 189 | i = 0 190 | for line in f: 191 | try: 192 | submission = json.loads(line) 193 | if not filter_submission(submission): 194 | subreddit = get_subreddit(submission) 195 | domain = get_domain(submission) 196 | scheck = subreddit_dic == None or subreddit in subreddit_dic 197 | dcheck = domain_dic == None or domain in domain_dic 198 | if scheck and dcheck: 199 | s = dict([ (f, submission[f]) for f in fields ]) 200 | print("keeping: subreddit=%s\tdomain=%s" % (subreddit, domain)) 201 | if args.dryrun: 202 | continue 203 | if not batch_download_facts: 204 | s = add_webpage(s) 205 | submissions.append(s) 206 | i += 1 207 | if i == args.nsubmissions: 208 | break 209 | else: 210 | print("skipping: subreddit=%s\tdomain=%s (%s %s)" % (subreddit, domain, scheck, dcheck)) 211 | pass 212 | 213 | except Exception: 214 | traceback.print_exc() 215 | pass 216 | if batch_download_facts: 217 | submissions = add_webpages(submissions) 218 | else: 219 | submissions = [s for s in submissions if s is not None] 220 | return dict([ (get_submission_id(s), s) for s in submissions ]) 221 | 222 | def get_comments(rc_file, submissions): 223 | """Return all conversation triples from rc_file (RC_*.bz2), 224 | restricted to given submissions.""" 225 | comments = {} 226 | with bz2.open(rc_file, 'rt', encoding="utf-8") as f: 227 | for line in f: 228 | try: 229 | comment = json.loads(line) 230 | sid = get_linked_submission_id(comment) 231 | if sid in submissions.keys(): 232 | 
comments[get_comment_id(comment)] = comment 233 | except Exception: 234 | traceback.print_exc() 235 | pass 236 | return comments 237 | 238 | def load_data(): 239 | """Load data either from a pickle file if it exists, 240 | and otherwise from RC_* RS_* and directly from the web.""" 241 | if not os.path.isfile(args.pickle): 242 | submissions = get_submissions(args.rsinput, args.subreddit_filter, args.domain_filter) 243 | comments = get_comments(args.rcinput, submissions) 244 | with open(args.pickle, 'wb') as f: 245 | pickle.dump([submissions, comments], f, protocol=pickle.HIGHEST_PROTOCOL) 246 | else: 247 | with open(args.pickle, 'rb') as f: 248 | [submissions, comments] = pickle.load(f) 249 | return submissions, comments 250 | 251 | def insert_escaped_tags(tags, label=None): 252 | """For each tag in "tags", insert contextual tags (e.g., <p> </p>) as escaped text 253 | so that these tags are still there when html markup is stripped out.""" 254 | found = False 255 | for tag in tags: 256 | strs = list(tag.strings) 257 | if len(strs) > 0: 258 | if label != None: 259 | l = label 260 | else: 261 | l = tag.name 262 | strs[0].parent.insert(0, NavigableString("<"+l+">")) 263 | strs[-1].parent.append(NavigableString("</"+l+">")) 264 | found = True 265 | return found 266 | 267 | def save_facts(submissions, sids = None): 268 | subs = {} 269 | i = 0 270 | with open(args.facts, 'wt', encoding="utf-8") as f: 271 | for id in sorted(submissions.keys()): 272 | s = submissions[id] 273 | url = get_url(s) 274 | label = "" 275 | pos = url.find("#") 276 | if (pos > 0): 277 | label = url[pos+1:] 278 | label = label.strip() 279 | print("Processing submission %s...\n\turl: %s\n\tanchor: %s\n\tpermalink: http://reddit.com%s" % (id, url, str(label), get_permalink(s)), file=sys.stderr) 280 | sys.stdout.flush() 281 | subs[id] = s 282 | if sids == None or id in sids.keys(): 283 | b = BeautifulSoup(s["source"],'html.parser') 284 | # If there is any anchor in the url, locate it in the facts: 285 | if label != "": 286 | if not insert_escaped_tags(b.find_all(True, attrs={"id": label}), 'anchor'): 287 | print("\t(couldn't find anchor on page: %s)" % label, file=sys.stderr) 288 | # Remove tags whose text we don't care about (javascript, etc.): 289 | for el in b(notext_tags): 290 | el.decompose() 291 | # Delete other unimportant tags, but keep the text: 292 | for tag in b.findAll(True): 293 | if tag.name not in important_tags: 294 | tag.append(' ') 295 | tag.replaceWithChildren() 296 | # All tags left are important (e.g., <p>) so add them to the text: 297 | insert_escaped_tags(b.find_all(True)) 298 | # Extract facts from html: 299 | t = b.get_text(" ") 300 | t = norm_article(t) 301 | facts = [] 302 | for sent in filter(None, t.split("\n")): 303 | if len(sent.split(" ")) >= args.min_fact_len: 304 | facts.append(norm_sentence(sent)) 305 | for fact in facts: 306 | f.write("\t".join([get_subreddit(s), id, fact]) + "\n") 307 | s["facts"] = facts 308 | i += 1 309 | if i == args.nsubmissions: 310 | break 311 | return subs 312 | 313 | def get_convo(id, submissions, comments, depth=args.max_depth): 314 | c = comments[id] 315 | pid = get_parent_comment_id(c) 316 | if pid in comments.keys() and depth > 0: 317 | els = get_convo(pid, submissions, comments, depth-1) 318 | else: 319 | s = submissions[get_linked_submission_id(c)] 320 | els = [ "START", norm_sentence(get_submission_text(s)) ] 321 | els.append(norm_sentence(get_text(c))) 322 | return els 323 | 324 | def save_triples(submissions, comments): 325 | sids = {} 326 | with 
open(args.convos, 'wt', encoding="utf-8") as f: 327 | for id in sorted(comments.keys()): 328 | comment = comments[id] 329 | sid = get_linked_submission_id(comment) 330 | if sid in submissions.keys(): 331 | convo = get_convo(id, submissions, comments) 332 | context = " EOS ".join(convo[:-1]) 333 | response = convo[-1] 334 | cwords = re.split("\s+", context) 335 | if len(cwords) > args.max_context_len: 336 | ndel = len(cwords) - args.max_context_len 337 | del cwords[:ndel] 338 | context = "... " + " ".join(cwords) 339 | s = submissions[sid] 340 | if len(response) <= args.max_res_len and response != deleted_str and get_user(comment) != deleted_str and response.find(">") < 0: 341 | if context.find(deleted_str) < 0: 342 | f.write("\t".join([get_subreddit(s), sid, str(get_score(comment)), context, response]) + "\n") 343 | sids[sid] = 1 344 | return sids 345 | 346 | if __name__== "__main__": 347 | submissions, comments = load_data() 348 | submissions = save_facts(submissions) 349 | save_triples(submissions, comments) 350 | -------------------------------------------------------------------------------- /data_extraction/trial/src/create_trial_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Creates trial data for DSTC7 Task 2 4 | 5 | mkdir reddit 6 | mkdir data 7 | mkdir logs 8 | 9 | TRIAL=`cat lists/data-trial.txt` 10 | 11 | for id in $TRIAL; do 12 | echo Downloading $id 13 | wget https://files.pushshift.io/reddit/submissions/RS_$id.bz2 -O ../reddit/RS_$id.bz2 -o logs/RS_$id.bz2.log -c 14 | wget https://files.pushshift.io/reddit/comments/RC_$id.bz2 -O ../reddit/RC_$id.bz2 -o logs/RC_$id.bz2.log -c 15 | python src/create_trial_data.py --rsinput=../reddit/RS_$id.bz2 --rcinput=../reddit/RC_$id.bz2 --subreddit_filter=lists/subreddits.txt --domain_filter=lists/domains.txt --pickle=data/$id.pkl --facts=data/$id.facts.txt --convos=data/$id.convos.txt > logs/$id.log 2> logs/$id.err 16 | done 17 | 18 | eval "cat data/{`echo $TRIAL | sed 's/ /,/g'`}.convos.txt" > data/trial.convos.txt 19 | eval "cat data/{`echo $TRIAL | sed 's/ /,/g'`}.facts.txt" > data/trial.facts.txt 20 | -------------------------------------------------------------------------------- /data_extraction/trial/src/requirements.txt: -------------------------------------------------------------------------------- 1 | nltk==3.4.1 2 | beautifulsoup4==4.6.0 3 | -------------------------------------------------------------------------------- /doc/DSTC7_task2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/doc/DSTC7_task2.pdf -------------------------------------------------------------------------------- /doc/proposal.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/doc/proposal.pdf -------------------------------------------------------------------------------- /evaluation/LICENSE.txt: -------------------------------------------------------------------------------- 1 | # Open Use of Data Agreement v1.0 2 | 3 | This is the Open Use of Data Agreement, Version 1.0 (the "O-UDA"). Capitalized terms are defined in Section 5. Data Provider and you agree as follows: 4 | 5 | 1. **Provision of the Data** 6 | 7 | 1.1. 
You may use, modify, and distribute the Data made available to you by the Data Provider under this O-UDA if you follow the O-UDA's terms. 8 | 9 | 1.2. Data Provider will not sue you or any Downstream Recipient for any claim arising out of the use, modification, or distribution of the Data provided you meet the terms of the O-UDA. 10 | 11 | 1.3 This O-UDA does not restrict your use, modification, or distribution of any portions of the Data that are in the public domain or that may be used, modified, or distributed under any other legal exception or limitation. 12 | 13 | 2. **No Restrictions on Use or Results** 14 | 15 | 2.1. The O-UDA does not impose any restriction with respect to: 16 | 17 | 2.1.1. the use or modification of Data; or 18 | 19 | 2.1.2. the use, modification, or distribution of Results. 20 | 21 | 3. **Redistribution of Data** 22 | 23 | 3.1. You may redistribute the Data under terms of your choice, so long as: 24 | 25 | 3.1.1. You include with any Data you redistribute all credit or attribution information that you received with the Data, and your terms require any Downstream Recipient to do the same; and 26 | 27 | 3.1.2. Your terms include a warranty disclaimer and limitation of liability for Upstream Data Providers at least as broad as those contained in Section 4.2 and 4.3 of the O-UDA. 28 | 29 | 4. **No Warranty, Limitation of Liability** 30 | 31 | 4.1. Data Provider does not represent or warrant that it has any rights whatsoever in the Data. 32 | 33 | 4.2. THE DATA IS PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 34 | 35 | 4.3. NEITHER DATA PROVIDER NOR ANY UPSTREAM DATA PROVIDER SHALL HAVE ANY LIABILITY FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING WITHOUT LIMITATION LOST PROFITS), HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE DATA OR RESULTS, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 36 | 37 | 5. **Definitions** 38 | 39 | 5.1. "Data" means the material you receive under the O-UDA in modified or unmodified form, but not including Results. 40 | 41 | 5.2. "Data Provider" means the source from which you receive the Data and with whom you enter into the O-UDA. 42 | 43 | 5.3. "Downstream Recipient" means any person or persons who receives the Data directly or indirectly from you in accordance with the O-UDA. 44 | 45 | 5.4. "Result" means anything that you develop or improve from your use of Data that does not include more than a de minimis portion of the Data on which the use is based. Results may include de minimis portions of the Data necessary to report on or explain use that has been conducted with the Data, such as figures in scientific papers, but do not include more. Artificial intelligence models trained on Data (and which do not include more than a de minimis portion of Data) are Results. 46 | 47 | 5.5. "Upstream Data Providers" means the source or sources from which the Data Provider directly or indirectly received, under the terms of the O-UDA, material that is included in the Data. 
48 | 49 | -------------------------------------------------------------------------------- /evaluation/README.md: -------------------------------------------------------------------------------- 1 | # Evaluation 2 | 3 | * Full evaluation results and data: [tgz](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/evaluation/dstc7-task2-human_eval.tgz) 4 | 5 | ## Important Dates 6 | * 9/10/2018: Task organizers release test data 7 | * 10/1/2018 (11:59pm PST): Deadline for sending system outputs to task organizers (email to <dstc7-task2@microsoft.com>) 8 | * 10/8/2018: Task organizers release automatic evaluation results (BLEU, METEOR, etc.) 9 | * 10/16/2018: Task organizers release human evaluation results 10 | * 11/5/2018: System descriptions due to [DSTC7](http://workshop.colips.org/dstc7/) organizers (coming soon) 11 | 12 | ## Create test data (estimated time: 1 week maximum) 13 | 14 | Please refer to the [data extraction page](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction) to create data. The scripts on that page have just been updated to create validation and test data, so please download the latest version of the code (e.g., with `git pull`). Then, you just need to run the following command: 15 | 16 | ```make -j4 valid test``` 17 | 18 | This will create the following four files: 19 | 20 | * Validation data: ``valid.convos.txt`` and ``valid.facts.txt`` 21 | * Test data: ``test.convos.txt`` and ``test.facts.txt`` 22 | 23 | These files are in exactly the same format as ``train.convos.txt`` and ``train.facts.txt`` already explained [here](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction). The only difference is that the ``response`` field of test.convos.txt has been replaced with the string ``__UNDISCLOSED__``. 24 | 25 | Notes: 26 | * The two validation files are optional and you can skip them if you want (e.g., no need to send us system outputs for them). We provide them so that you can run your own automatic evaluation (BLEU, etc.) by comparing the ``response`` field with your own system outputs. 27 | * Obviously, you may not use the validation or test data for training any of your models or systems. 28 | * Data creation should take about 1-4 days (depending on your internet connection, etc.). If you run into trouble creating the data or data extraction isn't complete by September 17, please contact us. 29 | 30 | ### Data statistics 31 | 32 | Number of conversational responses: 33 | * Validation (valid.convos.txt): 4542 lines 34 | * Test (test.convos.txt): 13440 lines 35 | 36 | Due to the way the data is created by querying Common Crawl, there may be small differences between your version of the data and our own. To make pairwise comparisons between the systems of each pair of participants, we will rely on the largest subset of the test set that is common to both participants. **However, if your file test.convos.txt contains fewer than 13,000 lines, this might indicate a problem, so please contact us immediately**. 37 | 38 | ## Create system outputs (estimated time: 2 weeks maximum) 39 | 40 | **By October 8th**, please send us a modification of ``test.convos.txt`` where ``__UNDISCLOSED__`` has been replaced by your own system output, i.e., a response generated from the query, subreddit, and conversation ID specified on the same line. On October 4th, we sent an email to registered participants with a URL to the system output submission site. If you registered but didn't receive such an email, please contact us at <dstc7-task2@microsoft.com>.
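For concreteness, a minimal sketch of this step in Python is shown below. The column layout (7 tab-separated fields with the response last, and the context assumed to be the next-to-last field) follows the format notes below, and `generate_response()` is a hypothetical placeholder for your own system:

```python
# Sketch only: fill in the last column of test.convos.txt with model outputs.
# Assumption: the response (__UNDISCLOSED__) is the last tab-separated field,
# and the conversational context is the next-to-last field.

def generate_response(context):
    # hypothetical stand-in for a real response generation system
    return "i do n't know ."

with open("test.convos.txt", encoding="utf-8") as fin, \
     open("system1.convos.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        cols = line.rstrip("\n").split("\t")
        cols[-1] = generate_response(cols[-2])  # replace __UNDISCLOSED__ only
        fout.write("\t".join(cols) + "\n")
```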
41 | 42 | In order for us to process these files automatically, please ensure the following about file format: 43 | * **Other than replacing ``__UNDISCLOSED__`` with your own output, the rest of the file should not be altered in any way.** That is, it needs to have exactly 7 columns on each line, and columns 1-6 should be exactly the same as the test.convos.txt file created by our scripts. These other columns are also important as they will help us sort out differences between the test sets of the different participants. 44 | * You may submit multiple systems, in which case we ask that you: (1) Please give a different name to each file (e.g., system1.convos.txt, system2.convos.txt, ...). (2) Please identify one of your systems as primary, as we may only be able to use one of them for human evaluation. 45 | * Please do not send us any other files (e.g., we do not need validation sets or facts files). 46 | 47 | *Before submitting, **please check the format of your files** and make sure they are as described above, as we will process submissions automatically for metric evaluation (and semi-automatically for human evaluation). Formatting your data correctly will minimize problems in the evaluation pipeline and the risk of getting lower metric scores. If you are not sure about the format, just ask.* 48 | 49 | If you have any questions, again please email us at <dstc7-task2@microsoft.com>. 50 | -------------------------------------------------------------------------------- /evaluation/dstc7-task2-human_eval.tgz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/evaluation/dstc7-task2-human_eval.tgz -------------------------------------------------------------------------------- /evaluation/old/dstc7-task2-individual_judgments.tgz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/evaluation/old/dstc7-task2-individual_judgments.tgz -------------------------------------------------------------------------------- /evaluation/old/dstc7-task2-individual_judgments.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/evaluation/old/dstc7-task2-individual_judgments.xlsx -------------------------------------------------------------------------------- /evaluation/old/dstc7-task2-scores.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/evaluation/old/dstc7-task2-scores.xlsx -------------------------------------------------------------------------------- /evaluation/src/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | temp 3 | __pycache__ 4 | -------------------------------------------------------------------------------- /evaluation/src/3rdparty/.create: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/evaluation/src/3rdparty/.create -------------------------------------------------------------------------------- /evaluation/src/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Automatic evaluation script for DSTC7 Task 2 3 | 4 | Steps: 5 | 1) Make sure you 'git pull' the latest changes (from October 15, 2018), including changes in ../../data_extraction. 6 | 2) cd to `../../data_extraction` and type make. This will create the multi-reference file used by the metrics (`../../data_extraction/test.refs`). 7 | 3) Install 3rd party software as instructed below (METEOR and mteval-v14c.pl). 8 | 4) Run the following command, where `[SUBMISSION]` is the submission file you want to evaluate (same format as the one you submitted on Oct 8): 9 | ``` 10 | python dstc.py -c [SUBMISSION] --refs ../../data_extraction/test.refs 11 | ``` 12 | 13 | Important: the results printed by dstc.py might differ slightly from the official results, if part of your test set failed to download. 14 | 15 | 16 | 17 | # What does it do? 18 | (Based on this [repo](https://github.com/golsun/NLP-tools) by [Sean Xiang Gao](https://www.linkedin.com/in/gxiang1228/)) 19 | 20 | * **evaluation**: calculate automated NLP metrics (BLEU, NIST, METEOR, entropy, etc.) 21 | ```python 22 | from metrics import nlp_metrics 23 | nist, bleu, meteor, entropy, diversity, avg_len = nlp_metrics( 24 | path_refs=["demo/ref0.txt", "demo/ref1.txt"], 25 | path_hyp="demo/hyp.txt") 26 | 27 | # nist = [1.8338, 2.0838, 2.1949, 2.1949] 28 | # bleu = [0.4667, 0.441, 0.4017, 0.3224] 29 | # meteor = 0.2832 30 | # entropy = [2.5232, 2.4849, 2.1972, 1.7918] 31 | # diversity = [0.8667, 1.000] 32 | # avg_len = 5.0000 33 | ``` 34 | * **tokenization**: clean strings and deal with punctuation, contractions, URLs, mentions, tags, etc. 35 | ```python 36 | from tokenizers import clean_str 37 | s = " I don't know:). how about this?https://github.com" 38 | clean_str(s) 39 | 40 | # i do n't know :) . how about this ? __url__ 41 | ``` 42 | 43 | # Requirements 44 | * Works fine for both Python 2.7 and 3.6 45 | * Please **download** the following 3rd-party packages and save them in a new folder `3rdparty`: 46 | * [**mteval-v14c.pl**](https://goo.gl/YUFajQ) to compute [NIST](http://www.mt-archive.info/HLT-2002-Doddington.pdf). You may need to install the following [perl](https://www.perl.org/get.html) modules (e.g. by `cpan install`): XML::Twig, Sort::Naturally and String::Util. 47 | * [**meteor-1.5**](http://www.cs.cmu.edu/~alavie/METEOR/download/meteor-1.5.tar.gz) to compute [METEOR](http://www.cs.cmu.edu/~alavie/METEOR/index.html). It requires [Java](https://www.java.com/en/download/help/download_options.xml). 48 | 49 | -------------------------------------------------------------------------------- /evaluation/src/automatic-evaluation.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Automatic evaluation script: 4 | 5 | # Name of any of the files submitted to DSTC7 Task2 6 | # (or any more recent file) 7 | SUBMISSION=systems/constant-baseline.txt 8 | 9 | # Make sure this file exists: 10 | REFS=../../data_extraction/test.refs 11 | 12 | if [ ! -f $REFS ]; then 13 | echo "Reference file not found. Please move to ../../data_extraction and type make."
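    # The reference file test.refs is produced by the data extraction pipeline;
    # per the README above, running the following should create it:
    #   cd ../../data_extraction && make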
14 | else 15 | python dstc.py -c $SUBMISSION --refs $REFS 16 | fi 17 | 18 | -------------------------------------------------------------------------------- /evaluation/src/demo.py: -------------------------------------------------------------------------------- 1 | from metrics import * 2 | from tokenizers import * 3 | 4 | # evaluation 5 | 6 | 7 | nist, bleu, meteor, entropy, diversity, avg_len = nlp_metrics( 8 | path_refs=['demo/ref0.txt', 'demo/ref1.txt'], 9 | path_hyp='demo/hyp.txt') 10 | 11 | print(nist) 12 | print(bleu) 13 | print(meteor) 14 | print(entropy) 15 | print(diversity) 16 | print(avg_len) 17 | 18 | # tokenization 19 | 20 | s = " I don't know:). how about this?https://github.com/golsun/deep-RL-time-series" 21 | print(clean_str(s)) 22 | -------------------------------------------------------------------------------- /evaluation/src/demo/hyp.txt: -------------------------------------------------------------------------------- 1 | i do n't know . 2 | he is a rocket scientist . 3 | i love it ! -------------------------------------------------------------------------------- /evaluation/src/demo/ref0.txt: -------------------------------------------------------------------------------- 1 | ok that 's fine 2 | he is a trader 3 | i love it ! -------------------------------------------------------------------------------- /evaluation/src/demo/ref1.txt: -------------------------------------------------------------------------------- 1 | well it 's ok 2 | he is an engineer 3 | i 'm not a fan -------------------------------------------------------------------------------- /evaluation/src/dstc.py: -------------------------------------------------------------------------------- 1 | # author: Xiang Gao @ Microsoft Research, Oct 2018 2 | # evaluate DSTC-task2 submissions. 
https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling 3 | 4 | from util import * 5 | from metrics import * 6 | from tokenizers import * 7 | 8 | def extract_cells(path_in, path_hash): 9 | keys = [line.strip('\n') for line in open(path_hash)] 10 | cells = dict() 11 | for line in open(path_in, encoding='utf-8'): 12 | c = line.strip('\n').split('\t') 13 | k = c[0] 14 | if k in keys: 15 | cells[k] = c[1:] 16 | return cells 17 | 18 | 19 | def extract_hyp_refs(raw_hyp, raw_ref, path_hash, fld_out, n_refs=6, clean=False, vshuman=-1): 20 | cells_hyp = extract_cells(raw_hyp, path_hash) 21 | cells_ref = extract_cells(raw_ref, path_hash) 22 | if not os.path.exists(fld_out): 23 | os.makedirs(fld_out) 24 | 25 | def _clean(s): 26 | if clean: 27 | return clean_str(s) 28 | else: 29 | return s 30 | 31 | keys = sorted(cells_hyp.keys()) 32 | with open(fld_out + '/hash.txt', 'w', encoding='utf-8') as f: 33 | f.write(unicode('\n'.join(keys))) 34 | 35 | lines = [_clean(cells_hyp[k][-1]) for k in keys] 36 | path_hyp = fld_out + '/hyp.txt' 37 | with open(path_hyp, 'w', encoding='utf-8') as f: 38 | f.write(unicode('\n'.join(lines))) 39 | 40 | lines = [] 41 | for _ in range(n_refs): 42 | lines.append([]) 43 | for k in keys: 44 | refs = cells_ref[k] 45 | for i in range(n_refs): 46 | idx = i % len(refs) 47 | if idx == vshuman: 48 | idx = (idx + 1) % len(refs) 49 | lines[i].append(_clean(refs[idx].split('|')[1])) 50 | 51 | path_refs = [] 52 | for i in range(n_refs): 53 | path_ref = fld_out + '/ref%i.txt'%i 54 | with open(path_ref, 'w', encoding='utf-8') as f: 55 | f.write(unicode('\n'.join(lines[i]))) 56 | path_refs.append(path_ref) 57 | 58 | return path_hyp, path_refs 59 | 60 | 61 | def eval_one_system(submitted, keys, multi_ref, n_refs=6, n_lines=None, clean=False, vshuman=-1, PRINT=True): 62 | 63 | print('evaluating %s' % submitted) 64 | 65 | fld_out = submitted.replace('.txt','') 66 | if clean: 67 | fld_out += '_cleaned' 68 | path_hyp, path_refs = extract_hyp_refs(submitted, multi_ref, keys, fld_out, n_refs, clean=clean, vshuman=vshuman) 69 | nist, bleu, meteor, entropy, div, avg_len = nlp_metrics(path_refs, path_hyp, fld_out, n_lines=n_lines) 70 | 71 | if n_lines is None: 72 | n_lines = len(open(path_hyp, encoding='utf-8').readlines()) 73 | 74 | if PRINT: 75 | print('n_lines = '+str(n_lines)) 76 | print('NIST = '+str(nist)) 77 | print('BLEU = '+str(bleu)) 78 | print('METEOR = '+str(meteor)) 79 | print('entropy = '+str(entropy)) 80 | print('diversity = ' + str(div)) 81 | print('avg_len = '+str(avg_len)) 82 | 83 | return [n_lines] + nist + bleu + [meteor] + entropy + div + [avg_len] 84 | 85 | 86 | def eval_all_systems(files, path_report, keys, multi_ref, n_refs=6, n_lines=None, clean=False, vshuman=False): 87 | # evaluate all systems (*.txt) in each folder `files` 88 | 89 | with open(path_report, 'w') as f: 90 | f.write('\t'.join( 91 | ['fname', 'n_lines'] + \ 92 | ['nist%i'%i for i in range(1, 4+1)] + \ 93 | ['bleu%i'%i for i in range(1, 4+1)] + \ 94 | ['meteor'] + \ 95 | ['entropy%i'%i for i in range(1, 4+1)] +\ 96 | ['div1','div2','avg_len'] 97 | ) + '\n') 98 | 99 | for fl in files: 100 | if fl.endswith('.txt'): 101 | submitted = fl 102 | results = eval_one_system(submitted, keys=keys, multi_ref=multi_ref, n_refs=n_refs, clean=clean, n_lines=n_lines, vshuman=vshuman, PRINT=False) 103 | with open(path_report, 'a') as f: 104 | f.write('\t'.join(map(str, [submitted] + results)) + '\n') 105 | else: 106 | for fname in os.listdir(fl): 107 | if fname.endswith('.txt'): 108 | submitted = fl + '/' + fname 
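                    # evaluate this submission file and append its metrics as one row of the report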
109 | results = eval_one_system(submitted, keys=keys, multi_ref=multi_ref, n_refs=n_refs, clean=clean, n_lines=n_lines, vshuman=vshuman, PRINT=False) 110 | with open(path_report, 'a') as f: 111 | f.write('\t'.join(map(str, [submitted] + results)) + '\n') 112 | 113 | print('report saved to: '+path_report, file=sys.stderr) 114 | 115 | 116 | if __name__ == '__main__': 117 | 118 | parser = argparse.ArgumentParser() 119 | parser.add_argument('submitted') # if 'all' or '*', eval all teams listed in dstc/teams.txt 120 | # elif endswith '.txt', eval this single file 121 | # else, eval all *.txt in folder `submitted_fld` 122 | 123 | parser.add_argument('--clean', '-c', action='store_true') # whether to clean ref and hyp before eval 124 | parser.add_argument('--n_lines', '-n', type=int, default=-1) # eval all lines (default) or top n_lines (e.g., for fast debugging) 125 | parser.add_argument('--n_refs', '-r', type=int, default=6) # number of references 126 | parser.add_argument('--vshuman', '-v', type=int, default='1') # when evaluating against human performance (N in refN.txt that should be removed) 127 | # in which case we need to remove human output from refs 128 | parser.add_argument('--refs', '-g', default='dstc/test.refs') 129 | parser.add_argument('--keys', '-k', default='keys/test.2k.txt') 130 | parser.add_argument('--teams', '-i', type=str, default='dstc/teams.txt') 131 | parser.add_argument('--report', '-o', type=str, default=None) 132 | args = parser.parse_args() 133 | print('Args: %s\n' % str(args), file=sys.stderr) 134 | 135 | if args.n_lines < 0: 136 | n_lines = None # eval all lines 137 | else: 138 | n_lines = args.n_lines # just eval top n_lines 139 | 140 | if args.submitted.endswith('.txt'): 141 | eval_one_system(args.submitted, keys=args.keys, multi_ref=args.refs, clean=args.clean, n_lines=n_lines, n_refs=args.n_refs, vshuman=args.vshuman) 142 | else: 143 | fname_report = 'report_ref%i'%args.n_refs 144 | if args.clean: 145 | fname_report += '_cleaned' 146 | fname_report += '.tsv' 147 | if args.submitted == 'all' or args.submitted == '*': 148 | files = ['dstc/' + line.strip('\n') for line in open(args.teams)] 149 | path_report = 'dstc/' + fname_report 150 | else: 151 | files = [args.submitted] 152 | path_report = args.submitted + '/' + fname_report 153 | if args.report != None: 154 | path_report = args.report 155 | eval_all_systems(files, path_report, keys=args.keys, multi_ref=args.refs, clean=args.clean, n_lines=n_lines, n_refs=args.n_refs, vshuman=args.vshuman) 156 | -------------------------------------------------------------------------------- /evaluation/src/metrics.py: -------------------------------------------------------------------------------- 1 | # author: Xiang Gao @ Microsoft Research, Oct 2018 2 | # compute NLP evaluation metrics 3 | 4 | import re 5 | from util import * 6 | from collections import defaultdict 7 | 8 | 9 | def calc_nist_bleu(path_refs, path_hyp, fld_out='temp', n_lines=None): 10 | # call mteval-v14c.pl 11 | # ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v14c.pl 12 | # you may need to cpan install XML:Twig Sort:Naturally String:Util 13 | 14 | makedirs(fld_out) 15 | 16 | if n_lines is None: 17 | n_lines = len(open(path_hyp, encoding='utf-8').readlines()) 18 | _write_xml([''], fld_out + '/src.xml', 'src', n_lines=n_lines) 19 | _write_xml([path_hyp], fld_out + '/hyp.xml', 'hyp', n_lines=n_lines) 20 | _write_xml(path_refs, fld_out + '/ref.xml', 'ref', n_lines=n_lines) 21 | 22 | time.sleep(1) 23 | cmd = [ 24 | 'perl','3rdparty/mteval-v14c.pl', 25 | '-s', 
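        # mteval-v14c.pl takes a source file (-s), a system output file (-t), and reference files (-r), all in the XML format built by _write_xml below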
'%s/src.xml'%fld_out, 26 | '-t', '%s/hyp.xml'%fld_out, 27 | '-r', '%s/ref.xml'%fld_out, 28 | ] 29 | process = subprocess.Popen(cmd, stdout=subprocess.PIPE) 30 | output, error = process.communicate() 31 | 32 | lines = output.decode().split('\n') 33 | try: 34 | nist = lines[-6].strip('\r').split()[1:5] 35 | bleu = lines[-4].strip('\r').split()[1:5] 36 | return [float(x) for x in nist], [float(x) for x in bleu] 37 | 38 | except Exception: 39 | print('mteval-v14c.pl returns unexpected message') 40 | print('cmd = '+str(cmd)) 41 | print(output.decode()) 42 | print(error.decode()) 43 | return [-1]*4, [-1]*4 44 | 45 | 46 | 47 | 48 | def calc_cum_bleu(path_refs, path_hyp): 49 | # call multi-bleu.pl 50 | # https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl 51 | # the 4-gram cum BLEU returned by this one should be very close to calc_nist_bleu 52 | # however multi-bleu.pl doesn't return cum BLEU of lower rank, so in nlp_metrics we preferr calc_nist_bleu 53 | # NOTE: this func doesn't support n_lines argument and output is not parsed yet 54 | 55 | process = subprocess.Popen( 56 | ['perl', '3rdparty/multi-bleu.perl'] + path_refs, 57 | stdout=subprocess.PIPE, 58 | stdin=subprocess.PIPE 59 | ) 60 | with open(path_hyp, encoding='utf-8') as f: 61 | lines = f.readlines() 62 | for line in lines: 63 | process.stdin.write(line.encode()) 64 | output, error = process.communicate() 65 | return output.decode() 66 | 67 | 68 | def calc_meteor(path_refs, path_hyp, fld_out='temp', n_lines=None, pretokenized=True): 69 | # Call METEOR code. 70 | # http://www.cs.cmu.edu/~alavie/METEOR/index.html 71 | 72 | makedirs(fld_out) 73 | path_merged_refs = fld_out + '/refs_merged.txt' 74 | _write_merged_refs(path_refs, path_merged_refs) 75 | 76 | cmd = [ 77 | 'java', '-Xmx1g', # heapsize of 1G to avoid OutOfMemoryError 78 | '-jar', '3rdparty/meteor-1.5/meteor-1.5.jar', 79 | path_hyp, path_merged_refs, 80 | '-r', '%i'%len(path_refs), # refCount 81 | '-l', 'en', '-norm' # also supports language: cz de es fr ar 82 | ] 83 | 84 | process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 85 | output, error = process.communicate() 86 | for line in output.decode().split('\n'): 87 | if "Final score:" in line: 88 | return float(line.split()[-1]) 89 | 90 | print('meteor-1.5.jar returns unexpected message') 91 | print("cmd = " + " ".join(cmd)) 92 | print(output.decode()) 93 | print(error.decode()) 94 | return -1 95 | 96 | 97 | def calc_entropy(path_hyp, n_lines=None): 98 | # based on Yizhe Zhang's code 99 | etp_score = [0.0,0.0,0.0,0.0] 100 | counter = [defaultdict(int),defaultdict(int),defaultdict(int),defaultdict(int)] 101 | i = 0 102 | for line in open(path_hyp, encoding='utf-8'): 103 | i += 1 104 | words = line.strip('\n').split() 105 | for n in range(4): 106 | for idx in range(len(words)-n): 107 | ngram = ' '.join(words[idx:idx+n+1]) 108 | counter[n][ngram] += 1 109 | if i == n_lines: 110 | break 111 | 112 | for n in range(4): 113 | total = sum(counter[n].values()) 114 | for v in counter[n].values(): 115 | etp_score[n] += - v /total * (np.log(v) - np.log(total)) 116 | 117 | return etp_score 118 | 119 | 120 | def calc_len(path, n_lines): 121 | l = [] 122 | for line in open(path, encoding='utf8'): 123 | l.append(len(line.strip('\n').split())) 124 | if len(l) == n_lines: 125 | break 126 | return np.mean(l) 127 | 128 | 129 | def calc_diversity(path_hyp): 130 | tokens = [0.0,0.0] 131 | types = [defaultdict(int),defaultdict(int)] 132 | for line in open(path_hyp, encoding='utf-8'): 133 | 
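        # distinct-1 / distinct-2: ratio of unique unigrams and bigrams to the total number generated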
words = line.strip('\n').split() 134 | for n in range(2): 135 | for idx in range(len(words)-n): 136 | ngram = ' '.join(words[idx:idx+n+1]) 137 | types[n][ngram] = 1 138 | tokens[n] += 1 139 | div1 = len(types[0].keys())/tokens[0] 140 | div2 = len(types[1].keys())/tokens[1] 141 | return [div1, div2] 142 | 143 | 144 | def nlp_metrics(path_refs, path_hyp, fld_out='temp', n_lines=None): 145 | nist, bleu = calc_nist_bleu(path_refs, path_hyp, fld_out, n_lines) 146 | meteor = calc_meteor(path_refs, path_hyp, fld_out, n_lines) 147 | entropy = calc_entropy(path_hyp, n_lines) 148 | div = calc_diversity(path_hyp) 149 | avg_len = calc_len(path_hyp, n_lines) 150 | return nist, bleu, meteor, entropy, div, avg_len 151 | 152 | 153 | def _write_merged_refs(paths_in, path_out, n_lines=None): 154 | # prepare merged ref file for meteor-1.5.jar (calc_meteor) 155 | # lines[i][j] is the ref from i-th ref set for the j-th query 156 | 157 | lines = [] 158 | for path_in in paths_in: 159 | lines.append([line.strip('\n') for line in open(path_in, encoding='utf-8')]) 160 | 161 | with open(path_out, 'w', encoding='utf-8') as f: 162 | for j in range(len(lines[0])): 163 | for i in range(len(paths_in)): 164 | f.write(unicode(lines[i][j]) + "\n") 165 | 166 | 167 | 168 | def _write_xml(paths_in, path_out, role, n_lines=None): 169 | # prepare .xml files for mteval-v14c.pl (calc_nist_bleu) 170 | # role = 'src', 'hyp' or 'ref' 171 | 172 | lines = [ 173 | '<?xml version="1.0" encoding="UTF-8"?>', 174 | '<!DOCTYPE mteval SYSTEM "">', 175 | '<!-- generated by https://github.com/golsun/NLP-tools -->', 176 | '<!-- from: %s -->'%paths_in, 177 | '<!-- as inputs for ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v14c.pl -->', 178 | '<mteval>', 179 | ] 180 | 181 | for i_in, path_in in enumerate(paths_in): 182 | 183 | # header ---- 184 | 185 | if role == 'src': 186 | lines.append('<srcset setid="unnamed" srclang="src">') 187 | set_ending = '</srcset>' 188 | elif role == 'hyp': 189 | lines.append('<tstset setid="unnamed" srclang="src" trglang="tgt" sysid="unnamed">') 190 | set_ending = '</tstset>' 191 | elif role == 'ref': 192 | lines.append('<refset setid="unnamed" srclang="src" trglang="tgt" refid="ref%i">'%i_in) 193 | set_ending = '</refset>' 194 | 195 | lines.append('<doc docid="unnamed" genre="unnamed">') 196 | 197 | # body ----- 198 | 199 | if role == 'src': 200 | body = [''] * n_lines 201 | else: 202 | with open(path_in, 'r', encoding='utf-8') as f: 203 | body = f.readlines() 204 | if n_lines is not None: 205 | body = body[:n_lines] 206 | for i in range(len(body)): 207 | line = body[i].strip('\n') 208 | line = line.replace('&',' ').replace('<',' ') # remove illegal xml char 209 | if len(line) == 0: 210 | line = '__empty__' 211 | lines.append('<p><seg id="%i"> %s </seg></p>'%(i + 1, line)) 212 | 213 | # ending ----- 214 | 215 | lines.append('</doc>') 216 | if role == 'src': 217 | lines.append('</srcset>') 218 | elif role == 'hyp': 219 | lines.append('</tstset>') 220 | elif role == 'ref': 221 | lines.append('</refset>') 222 | 223 | lines.append('</mteval>') 224 | with open(path_out, 'w', encoding='utf-8') as f: 225 | f.write(unicode('\n'.join(lines))) 226 | -------------------------------------------------------------------------------- /evaluation/src/tokenizers.py: -------------------------------------------------------------------------------- 1 | # author: Xiang Gao @ Microsoft Research, Oct 2018 2 | # clean and tokenize natural language text 3 | 4 | import re 5 | from util import * 6 | from nltk.tokenize import TweetTokenizer 7 
| 8 | def clean_str(txt): 9 | #print("in=[%s]" % txt) 10 | txt = txt.lower() 11 | txt = re.sub('^',' ', txt) 12 | txt = re.sub('$',' ', txt) 13 | 14 | # url and tag 15 | words = [] 16 | for word in txt.split(): 17 | i = word.find('http') 18 | if i >= 0: 19 | word = word[:i] + ' ' + '__url__' 20 | words.append(word.strip()) 21 | txt = ' '.join(words) 22 | 23 | # remove markdown URL 24 | txt = re.sub(r'\[([^\]]*)\] \( *__url__ *\)', r'\1', txt) 25 | 26 | # remove illegal char 27 | txt = re.sub('__url__','URL',txt) 28 | txt = re.sub(r"[^A-Za-z0-9():,.!?\"\']", " ", txt) 29 | txt = re.sub('URL','__url__',txt) 30 | 31 | # contraction 32 | add_space = ["'s", "'m", "'re", "n't", "'ll","'ve","'d","'em"] 33 | tokenizer = TweetTokenizer(preserve_case=False) 34 | txt = ' ' + ' '.join(tokenizer.tokenize(txt)) + ' ' 35 | txt = txt.replace(" won't ", " will n't ") 36 | txt = txt.replace(" can't ", " can n't ") 37 | for a in add_space: 38 | txt = txt.replace(a+' ', ' '+a+' ') 39 | 40 | txt = re.sub(r'^\s+', '', txt) 41 | txt = re.sub(r'\s+$', '', txt) 42 | txt = re.sub(r'\s+', ' ', txt) # remove extra spaces 43 | 44 | #print("out=[%s]" % txt) 45 | return txt 46 | 47 | 48 | if __name__ == '__main__': 49 | ss = [ 50 | " I don't know:). how about this?https://github.com/golsun/deep-RL-time-series", 51 | "please try [ GitHub ] ( https://github.com )", 52 | ] 53 | for s in ss: 54 | print(s) 55 | print(clean_str(s)) 56 | print() 57 | 58 | -------------------------------------------------------------------------------- /evaluation/src/util.py: -------------------------------------------------------------------------------- 1 | import os, time, subprocess, io, sys, re, argparse 2 | import numpy as np 3 | 4 | py_version = sys.version.split('.')[0] 5 | if py_version == '2': 6 | open = io.open 7 | else: 8 | unicode = str 9 | 10 | def makedirs(fld): 11 | if not os.path.exists(fld): 12 | os.makedirs(fld) 13 | 14 | 15 | def str2bool(s): 16 | # to avoid issue like this: https://stackoverflow.com/questions/15008758/parsing-boolean-values-with-argparse 17 | if s.lower() in ['t','true','1','y']: 18 | return True 19 | elif s.lower() in ['f','false','0','n']: 20 | return False 21 | else: 22 | raise ValueError --------------------------------------------------------------------------------