├── .gitignore ├── README.md ├── baseline ├── README.md ├── baseline.py └── create_input_files.py ├── data_extraction ├── Makefile ├── README.md ├── lists │ ├── cc-match.tsv │ ├── domains-official.txt │ ├── subreddits-official.txt │ ├── test-multiref.hashes │ ├── test-multiref.sets │ ├── test.hashes │ └── valid.hashes ├── requirements.txt ├── src │ ├── Makefile.official │ ├── Makefile.official.targets │ ├── commoncrawl.py │ ├── create_official_data.py │ └── ids2refs.py └── trial │ ├── README.md │ ├── lists │ ├── data-trial.txt │ ├── domains.txt │ └── subreddits.txt │ └── src │ ├── Makefile.trial │ ├── create_trial_data.py │ ├── create_trial_data.sh │ └── requirements.txt ├── doc ├── DSTC7_task2.pdf └── proposal.pdf └── evaluation ├── LICENSE.txt ├── README.md ├── dstc7-task2-human_eval.tgz ├── old ├── dstc7-task2-individual_judgments.tgz ├── dstc7-task2-individual_judgments.xlsx └── dstc7-task2-scores.xlsx └── src ├── .gitignore ├── 3rdparty └── .create ├── README.md ├── automatic-evaluation.sh ├── demo.py ├── demo ├── hyp.txt ├── ref0.txt └── ref1.txt ├── dstc.py ├── keys └── test.2k.txt ├── metrics.py ├── systems └── constant-baseline.txt ├── tokenizers.py └── util.py /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | tmp 3 | src/tmp 4 | data_extraction/logs 5 | data_extraction/reddit 6 | data_extraction/data-official 7 | data_extraction/data-official-test 8 | data_extraction/data-official-valid 9 | *.log 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DSTC7: End-to-End Conversation Modeling 2 | 3 | **[DSTC7](http://workshop.colips.org/dstc7/) has ended on January 27, 2019. This github project is still available 'as is', but we unfortunately no longer have time to maintain the code or to provide assistance with this project.** 4 | 5 | ## News 6 | * 10/29/2018: [Spreadsheet](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/evaluation/dstc7-task2-individual_judgments.xlsx) containing indivdual judgments used for human evaluation. 7 | * 10/23/2018 and 10/15/2018: Automatic and human evaluation [results](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/evaluation/dstc7-task2-scores.xlsx) posted. The code to reproduce the automatic evaluation and get the same scores can be found [here](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/evaluation/src/README.md). 8 | * 10/8/2018: Participants submitted system outputs. 9 | * 9/10/2018-10/8/2018: Evaluation phase, instructions [here](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/evaluation). 10 | * 7/11/2018: An [FAQ](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction#FAQ) section has been added to the data extraction page. 11 | * 7/1/2018: [Official training data](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction) is up. 12 | * 6/18/2018: [Trial data](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction/trial) is up. 13 | * 6/1/2018: [Task description](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/doc/DSTC7_task2.pdf) is up. 
* 6/1/2018: [Registration](https://docs.google.com/forms/d/e/1FAIpQLSf4aoCdtLsnFr_AKfp3tnTy4OUCITy5avcEEpUHJ9oZ5ZFvbg/viewform) for DSTC7 is now open.

## Registration

~~Please register [here]~~ Registration has now closed.

## Task
This [DSTC7](http://workshop.colips.org/dstc7/) track presents an end-to-end conversational modeling task, in which the goal is to generate conversational responses that go beyond trivial chitchat by injecting informative responses that are grounded in external knowledge. This task is distinct from what is commonly thought of as goal-oriented, task-oriented, or task-completion dialog in that there is no specific or predefined goal (e.g., booking a flight, or reserving a table at a restaurant). Instead, it targets human-like interactions where the underlying goal is often ill-defined or not known in advance, of the kind seen, for example, in work and other productive environments (e.g., brainstorming meetings) where people share information.

Please check this [description](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/doc/DSTC7_task2.pdf) for more details about the task, which follows our previous work ["A Knowledge-Grounded Neural Conversation Model"](https://arxiv.org/abs/1702.01932) and our original task [proposal](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/doc/proposal.pdf).

## Data
We extend the [knowledge-grounded](https://arxiv.org/abs/1702.01932) setting, with each system input consisting of two parts:
* Conversational data from Reddit.
* Contextually-relevant “facts”, taken from the website that started the (Reddit) conversation.

Please check the [data extraction](https://github.com/DSTC-MSR/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction) page for the input data pipeline. Note: we are providing scripts to extract the data from a Reddit [dump](http://files.pushshift.io/reddit/comments/), as we are unable to release the data directly ourselves.

## Evaluation
As described in the [task description](http://workshop.colips.org/dstc7/proposals/DSTC7-MSR_end2end.pdf) (Section 4), we will evaluate response quality using both automatic and human evaluations on two criteria:
* Appropriateness;
* Informativeness.

We will use automatic evaluation metrics such as BLEU and METEOR to provide a preliminary score for each submission prior to the human evaluation. Participants can also use these metrics for their own evaluations during the development phase. We will allow participants to submit multiple system outputs, with one system marked as “primary” for human evaluation. We will provide a BLEU scoring script to help participants decide which system they want to select as primary.

We will use crowdsourcing for human evaluation. For each response, we ask human judges whether it is (1) appropriate and (2) informative, on a scale from 1 to 5. The system with the best average Appropriateness and Informativeness scores will be declared the winner.

## Baseline
A standard seq2seq [baseline model](https://github.com/DSTC-MSR/DSTC7-End-to-End-Conversation-Modeling/tree/master/baseline) is provided in this repository.

## Timeline
|Phase|Dates|
| ------ | -------------- |
|1. Development Phase|June 1 – September 9|
|      1.1 Code (data extraction code, seq2seq baseline)|June 1|
|      1.2 "Trial" data made available|June 18|
|      1.3 Official training data made available|July 1|
|2. Evaluation Phase|September 10 – October 8|
|      2.1 Test data made available|September 10|
|      2.2 Participants submit their system outputs|October 8|
|3. Results are released|October|
|      3.1 Automatic scores (BLEU, etc.)|October 16|
|      3.2 Human evaluation|October 23|

## Organizers
* [Michel Galley](https://www.microsoft.com/en-us/research/people/mgalley/)
* [Chris Brockett](https://www.microsoft.com/en-us/research/people/chrisbkt/)
* [Sean Xiang Gao](https://www.linkedin.com/in/gxiang1228/)
* [Bill Dolan](https://www.microsoft.com/en-us/research/people/billdol/)
* [Jianfeng Gao](https://www.microsoft.com/en-us/research/people/jfgao/)

## Reference
If you submit any system to DSTC7-Task2 or publish any other work making use of the resources provided on this project, we ask you to cite the following task description paper:

```Michel Galley, Chris Brockett, Xiang Gao, Bill Dolan, Jianfeng Gao. End-to-End Conversation Modeling: DSTC7 Task 2 Description. In DSTC7 workshop (forthcoming).```

## Contact Information
* ~~For questions specific to Task 2, you can contact us at .~~ (No longer maintained.)
* ~~You can get the latest updates and participate in discussions on [DSTC mailing list](http://workshop.colips.org/dstc7/contact.html).~~

--------------------------------------------------------------------------------
/baseline/README.md:
--------------------------------------------------------------------------------

# Baseline Model
This is a baseline model for [DSTC task 2](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling). It is a GRU-based seq2seq generation system. Since it is a baseline, the model does not use grounding information ("facts"), attention, or beam search. It uses greedy decoding (unknown token disabled). The implementation is in Python, adapted from a Keras [tutorial](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html).

## Requirements
The scripts are tested on [Python 3.6.4](https://www.python.org/downloads/) with the following libraries:
* [Keras 2.1.6](https://keras.io/), which requires a backend library; we used [TensorFlow 1.8.0](https://www.tensorflow.org/)
* [numpy 1.14.0](http://www.numpy.org/)

## Input files
The trial data extraction script and instructions are available [here](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction). Based on the conversation text file, the following input files are generated with the command:
```
python create_input_files.py
```
|generated file|description|
|---|---|
|dict.txt|The vocab list. The line number is the `word id` (starting from 1) of the word on that line. The words are ordered by their frequency of appearance in the raw input file|
|source_num.txt|The list of source sentences where words are replaced by their `word id`|
|target_num.txt|The list of corresponding target sentences where words are replaced by their `word id`|

## Parameters
Key parameters can be specified in the `main()` function of [baseline.py](baseline.py).
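For reference, the relevant block at the top of `main()` looks like this (values reproduced from [baseline.py](baseline.py); edit them there to change the configuration):

```python
import os

# Hyperparameter block reproduced from main() in baseline.py (for reference).
token_embed_dim = 100   # length of the word embedding vector
rnn_units = 512         # hidden units per GRU cell
encoder_depth = 2       # stacked GRU cells in the encoder
decoder_depth = 2       # stacked GRU cells in the decoder
dropout_rate = 0.5
learning_rate = 1e-3
max_seq_len = 32

batch_size = 100
epochs = 10

# Input files produced by create_input_files.py (see "Input files" above).
path_source = os.path.join('official', 'source_num.txt')
path_target = os.path.join('official', 'target_num.txt')
path_vocab = os.path.join('official', 'dict.txt')
```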
We follow [DSTC 2016](https://github.com/dialogtekgeek/DSTC6-End-to-End-Conversation-Modeling/blob/master/ChatbotBaseline/egs/twitter/run.sh) for the hyperparameter value settings. 22 | 23 | |parameter|description|value| 24 | |---------|-------|-----| 25 | |`token_embed_dim` | length of word embedding vector |100| 26 | |`rnn_units`| number of hidden units of each GRU cell|512| 27 | |`encoder_depth`| number of GRU cells stacked in the encoder|2| 28 | |`decoder_depth`| number of GRU cells stacked in the decoder|2| 29 | |`dropout_rate`| dropout probability|0.5| 30 | |`max_num_token`| if not None, only use top `max_num_token` most frequent tokens|20000| 31 | |`max_seq_len`| tokens after the first `max_seq_len` tokens will be discarded |32| 32 | 33 | 34 | ## Run 35 | Use the command: 36 | ``` 37 | python baseline.py [mode] 38 | ``` 39 | where `mode` can be one of the following values 40 | 41 | |mode|description| 42 | |---------|-------| 43 | |`train` | train a new model. The trained model is saved after each epoch | 44 | |`continue` | load existing model and continue the training | 45 | |`eval`| evaluate the model on held-out data. Negative log likelihood loss is printed| 46 | |`interact`| explore the trained model interactively| 47 | -------------------------------------------------------------------------------- /baseline/baseline.py: -------------------------------------------------------------------------------- 1 | import os, random, sys, io 2 | import numpy as np 3 | from keras.models import Model 4 | from keras.layers import Input, GRU, Dense, Embedding, Dropout 5 | from keras.models import load_model 6 | from keras.optimizers import Adam 7 | 8 | """ 9 | a simple seq2seq model prepared as a baseline model for DSTC7 10 | https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling 11 | 12 | following Keras tutorial: 13 | https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html 14 | 15 | NOTE: 16 | * word-level, GRU-based 17 | * no attention mechanism 18 | * no beam search. 
greedy decoding, UNK disabled 19 | 20 | CONTACT: 21 | Sean Xiang Gao (xiag@microsoft.com) at Microsoft Research 22 | """ 23 | 24 | SOS_token = '_SOS_' 25 | EOS_token = '_EOS_' 26 | UNK_token = '_UNK_' 27 | 28 | 29 | def set_random_seed(seed=912): 30 | random.seed(seed) 31 | np.random.seed(seed) 32 | 33 | 34 | def makedirs(fld): 35 | if not os.path.exists(fld): 36 | os.makedirs(fld) 37 | 38 | class Dataset: 39 | 40 | """ 41 | assumptions of the data files 42 | * SOS and EOS are top 2 tokens 43 | * dictionary ordered by frequency 44 | """ 45 | 46 | def __init__(self, 47 | path_source, path_target, path_vocab, 48 | max_seq_len=32, 49 | test_split=0.2, # how many hold out as vali data 50 | read_txt=True, 51 | ): 52 | 53 | 54 | # load token dictionary 55 | 56 | self.index2token = {0: ''} 57 | self.token2index = {'': 0} 58 | self.max_seq_len = max_seq_len 59 | 60 | with io.open(path_vocab, encoding="utf-8") as f: 61 | lines = f.readlines() 62 | for i, line in enumerate(lines): 63 | token = line.strip('\n').strip() 64 | if len(token) == 0: 65 | break 66 | self.index2token[i + 1] = token 67 | self.token2index[token] = i + 1 68 | 69 | self.SOS = self.token2index[SOS_token] 70 | self.EOS = self.token2index[EOS_token] 71 | self.UNK = self.token2index[UNK_token] 72 | self.num_tokens = len(self.token2index) - 1 # not including 0-th (padding) 73 | print('num_tokens: %i'%self.num_tokens) 74 | 75 | if read_txt: 76 | self.read_txt(path_source, path_target, test_split) 77 | 78 | 79 | def read_txt(self, path_source, path_target, test_split): 80 | print('loading data from txt files...') 81 | # load source-target pairs, tokenized 82 | 83 | seqs = dict() 84 | for k, path in [('source', path_source), ('target', path_target)]: 85 | seqs[k] = [] 86 | with io.open(path, encoding="utf-8") as f: 87 | lines = f.readlines() 88 | for line in lines: 89 | seq = [] 90 | for c in line.strip('\n').strip().split(' '): 91 | i = int(c) 92 | if i <= self.num_tokens: # delete the "unkown" words 93 | seq.append(i) 94 | seqs[k].append(seq[-min(self.max_seq_len - 2, len(seq)):]) 95 | self.pairs = list(zip(seqs['source'], seqs['target'])) 96 | 97 | # train-test split 98 | 99 | np.random.shuffle(self.pairs) 100 | self.n_train = int(len(self.pairs) * (1. 
- test_split)) 101 | 102 | self.i_sample_range = { 103 | 'train': (0, self.n_train), 104 | 'test': (self.n_train, len(self.pairs)), 105 | } 106 | self.i_sample = dict() 107 | self.reset() 108 | 109 | 110 | def reset(self): 111 | for task in self.i_sample_range: 112 | self.i_sample[task] = self.i_sample_range[task][0] 113 | 114 | def all_loaded(self, task): 115 | return self.i_sample[task] == self.i_sample_range[task][1] 116 | 117 | def load_data(self, task, max_num_sample_loaded=None): 118 | 119 | i_sample = self.i_sample[task] 120 | if max_num_sample_loaded is None: 121 | max_num_sample_loaded = self.i_sample_range[task][1] - i_sample 122 | i_sample_next = min(i_sample + max_num_sample_loaded, self.i_sample_range[task][1]) 123 | num_samples = i_sample_next - i_sample 124 | self.i_sample[task] = i_sample_next 125 | 126 | print('building %s data from %i to %i'%(task, i_sample, i_sample_next)) 127 | 128 | encoder_input_data = np.zeros((num_samples, self.max_seq_len)) 129 | decoder_input_data = np.zeros((num_samples, self.max_seq_len)) 130 | decoder_target_data = np.zeros((num_samples, self.max_seq_len, self.num_tokens + 1)) # +1 as mask_zero 131 | 132 | source_texts = [] 133 | target_texts = [] 134 | 135 | for i in range(num_samples): 136 | 137 | seq_source, seq_target = self.pairs[i_sample + i] 138 | if not bool(seq_target) or not bool(seq_source): 139 | continue 140 | 141 | if seq_target[-1] != self.EOS: 142 | seq_target.append(self.EOS) 143 | 144 | source_texts.append(' '.join([self.index2token[j] for j in seq_source])) 145 | target_texts.append(' '.join([self.index2token[j] for j in seq_target])) 146 | 147 | for t, token_index in enumerate(seq_source): 148 | encoder_input_data[i, t] = token_index 149 | 150 | decoder_input_data[i, 0] = self.SOS 151 | for t, token_index in enumerate(seq_target): 152 | decoder_input_data[i, t + 1] = token_index 153 | decoder_target_data[i, t, token_index] = 1. 
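            # NOTE: decoder_target_data is decoder_input_data shifted left by one
            # time step and one-hot encoded over the vocabulary, i.e. the decoder
            # is trained with teacher forcing to predict the next token at each step.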
154 | 155 | return encoder_input_data, decoder_input_data, decoder_target_data, source_texts, target_texts 156 | 157 | 158 | 159 | 160 | class Seq2Seq: 161 | 162 | def __init__(self, 163 | dataset, model_dir, 164 | token_embed_dim, rnn_units, encoder_depth, decoder_depth, dropout_rate=0.2): 165 | 166 | self.token_embed_dim = token_embed_dim 167 | self.rnn_units = rnn_units 168 | self.encoder_depth = encoder_depth 169 | self.decoder_depth = decoder_depth 170 | self.dropout_rate = dropout_rate 171 | self.dataset = dataset 172 | 173 | makedirs(model_dir) 174 | self.model_dir = model_dir 175 | 176 | 177 | def load_models(self): 178 | self.build_model_train() 179 | self.model_train.load_weights(os.path.join(self.model_dir, 'model.h5')) 180 | self.build_model_test() 181 | 182 | 183 | def _stacked_rnn(self, rnns, inputs, initial_states=None): 184 | if initial_states is None: 185 | initial_states = [None] * len(rnns) 186 | outputs, state = rnns[0](inputs, initial_state=initial_states[0]) 187 | states = [state] 188 | for i in range(1, len(rnns)): 189 | outputs, state = rnns[i](outputs, initial_state=initial_states[i]) 190 | states.append(state) 191 | return outputs, states 192 | 193 | 194 | def build_model_train(self): 195 | 196 | # build layers 197 | embeding = Embedding( 198 | self.dataset.num_tokens + 1, # +1 as mask_zero 199 | self.token_embed_dim, mask_zero=True, 200 | name='embeding') 201 | 202 | encoder_inputs = Input(shape=(None,), name='encoder_inputs') 203 | encoder_rnns = [] 204 | for i in range(self.encoder_depth): 205 | encoder_rnns.append(GRU( 206 | self.rnn_units, 207 | return_state=True, 208 | return_sequences=True, 209 | name='encoder_rnn_%i'%i, 210 | )) 211 | 212 | decoder_inputs = Input(shape=(None,), name='decoder_inputs') 213 | decoder_rnns = [] 214 | for i in range(self.decoder_depth): 215 | decoder_rnns.append(GRU( 216 | self.rnn_units, 217 | return_state=True, 218 | return_sequences=True, 219 | name='decoder_rnn_%i'%i, 220 | )) 221 | 222 | decoder_softmax = Dense( 223 | self.dataset.num_tokens + 1, # +1 as mask_zero 224 | activation='softmax', name='decoder_softmax') 225 | 226 | # set connections: teacher forcing 227 | 228 | encoder_outputs, encoder_states = self._stacked_rnn( 229 | encoder_rnns, embeding(encoder_inputs)) 230 | 231 | decoder_outputs, decoder_states = self._stacked_rnn( 232 | decoder_rnns, embeding(decoder_inputs), [encoder_states[-1]] * self.decoder_depth) 233 | 234 | decoder_outputs = Dropout(self.dropout_rate)(decoder_outputs) 235 | decoder_outputs = decoder_softmax(decoder_outputs) 236 | self.model_train = Model( 237 | [encoder_inputs, decoder_inputs], # [input sentences, ground-truth target sentences], 238 | decoder_outputs) # shifted ground-truth sentences 239 | 240 | 241 | def build_model_test(self): 242 | 243 | # load/build layers 244 | 245 | names = ['embeding', 'decoder_softmax'] 246 | for i in range(self.encoder_depth): 247 | names.append('encoder_rnn_%i'%i) 248 | for i in range(self.decoder_depth): 249 | names.append('decoder_rnn_%i'%i) 250 | 251 | reused = dict() 252 | for name in names: 253 | reused[name] = self.model_train.get_layer(name) 254 | 255 | encoder_inputs = Input(shape=(None,), name='encoder_inputs') 256 | decoder_inputs = Input(shape=(None,), name='decoder_inputs') 257 | decoder_inital_states = [] 258 | for i in range(self.decoder_depth): 259 | decoder_inital_states.append(Input(shape=(self.rnn_units,), name="decoder_inital_state_%i"%i)) 260 | 261 | # set connections: autoregressive 262 | 263 | encoder_outputs, encoder_states = 
self._stacked_rnn( 264 | [reused['encoder_rnn_%i'%i] for i in range(self.encoder_depth)], 265 | reused['embeding'](encoder_inputs)) 266 | self.model_infer_encoder = Model(encoder_inputs, encoder_states[-1]) 267 | 268 | decoder_outputs, decoder_states = self._stacked_rnn( 269 | [reused['decoder_rnn_%i'%i] for i in range(self.decoder_depth)], 270 | reused['embeding'](decoder_inputs), 271 | decoder_inital_states) 272 | 273 | decoder_outputs = Dropout(self.dropout_rate)(decoder_outputs) 274 | decoder_outputs = reused['decoder_softmax'](decoder_outputs) 275 | self.model_infer_decoder = Model( 276 | [decoder_inputs] + decoder_inital_states, 277 | [decoder_outputs] + decoder_states) 278 | 279 | 280 | def save_model(self, name): 281 | path = os.path.join(self.model_dir, name) 282 | self.model_train.save_weights(path) 283 | print('saved to: '+path) 284 | 285 | 286 | def train(self, 287 | batch_size, epochs, 288 | batch_per_load=10, 289 | lr=0.001): 290 | 291 | 292 | self.model_train.compile(optimizer=Adam(lr=lr), loss='categorical_crossentropy') 293 | max_load = np.ceil(self.dataset.n_train/batch_size/batch_per_load) 294 | 295 | for epoch in range(epochs): 296 | load = 0 297 | self.dataset.reset() 298 | while not self.dataset.all_loaded('train'): 299 | load += 1 300 | print('\n***** Epoch %i/%i - load %.2f perc *****'%(epoch + 1, epochs, 100*load/max_load)) 301 | encoder_input_data, decoder_input_data, decoder_target_data, _, _ = self.dataset.load_data('train', batch_size * batch_per_load) 302 | 303 | self.model_train.fit( 304 | [encoder_input_data, decoder_input_data], 305 | decoder_target_data, 306 | batch_size=batch_size,) 307 | 308 | self.save_model('model_epoch%i.h5'%(epoch + 1)) 309 | self.save_model('model.h5') 310 | 311 | 312 | def evaluate(self, samples_per_load=640): 313 | 314 | self.model_train.compile(optimizer=Adam(lr=1e-3), loss='categorical_crossentropy') 315 | self.dataset.reset() 316 | sum_loss = 0. 
317 | sum_n = 0 318 | 319 | while not self.dataset.all_loaded('test'): 320 | encoder_input_data, decoder_input_data, decoder_target_data, _, _ = self.dataset.load_data('test', samples_per_load) 321 | 322 | print('evaluating') 323 | loss = self.model_train.evaluate( 324 | x=[encoder_input_data, decoder_input_data], 325 | y=decoder_target_data, 326 | verbose=0) 327 | 328 | n = encoder_input_data.shape[0] 329 | sum_loss += loss * n 330 | sum_n += n 331 | print('avg loss: %.2f'%(sum_loss/sum_n)) 332 | print('done') 333 | 334 | 335 | 336 | 337 | def _infer(self, source_seq_int): 338 | 339 | state = self.model_infer_encoder.predict(source_seq_int) 340 | prev_word = np.atleast_2d([self.dataset.SOS]) 341 | states = [state] * self.decoder_depth 342 | decoded_sentence = '' 343 | t = 0 344 | while True: 345 | 346 | out = self.model_infer_decoder.predict([prev_word] + states) 347 | tokens_proba = out[0].ravel() 348 | tokens_proba[self.dataset.UNK] = 0 # UNK disabled 349 | tokens_proba = tokens_proba/sum(tokens_proba) 350 | states = out[1:] 351 | sampled_token_index = np.argmax(tokens_proba) 352 | sampled_token = self.dataset.index2token[sampled_token_index] 353 | decoded_sentence += sampled_token+' ' 354 | 355 | t += 1 356 | if sampled_token_index == self.dataset.EOS or t > self.dataset.max_seq_len: 357 | break 358 | 359 | prev_word = np.atleast_2d([sampled_token_index]) 360 | 361 | return decoded_sentence 362 | 363 | 364 | def dialog(self, input_text): 365 | 366 | source_seq_int = [] 367 | for token in input_text.strip().strip('\n').split(' '): 368 | source_seq_int.append(self.dataset.token2index.get(token, self.dataset.UNK)) 369 | return self._infer(np.atleast_2d(source_seq_int)) 370 | 371 | 372 | def interact(self): 373 | while True: 374 | print('----- please input -----') 375 | input_text = input() 376 | if not bool(input_text): 377 | break 378 | print(self.dialog(input_text)) 379 | 380 | 381 | 382 | def main(mode): 383 | 384 | 385 | token_embed_dim = 100 386 | rnn_units = 512 387 | encoder_depth = 2 388 | decoder_depth = 2 389 | dropout_rate = 0.5 390 | learning_rate = 1e-3 391 | max_seq_len = 32 392 | 393 | batch_size = 100 394 | epochs = 10 395 | 396 | path_source = os.path.join('official','source_num.txt') 397 | path_target = os.path.join('official','target_num.txt') 398 | path_vocab = os.path.join('official','dict.txt') 399 | 400 | dataset = Dataset(path_source, path_target, path_vocab, max_seq_len=max_seq_len, read_txt=(mode!='interact')) 401 | model_dir = 'model' 402 | 403 | s2s = Seq2Seq(dataset, model_dir, 404 | token_embed_dim, rnn_units, encoder_depth, decoder_depth, dropout_rate) 405 | 406 | if mode == 'train': 407 | s2s.build_model_train() 408 | else: 409 | s2s.load_models() 410 | 411 | if mode in ['train', 'continue']: 412 | s2s.train(batch_size, epochs, lr=learning_rate) 413 | else: 414 | if mode == 'eval': 415 | s2s.build_model_test() 416 | s2s.evaluate() 417 | elif mode == 'interact': 418 | s2s.interact() 419 | 420 | 421 | if __name__ == '__main__': 422 | set_random_seed() 423 | mode = sys.argv[1] # one of [train, continue, eval, interact] 424 | main(mode) 425 | -------------------------------------------------------------------------------- /baseline/create_input_files.py: -------------------------------------------------------------------------------- 1 | import os, queue 2 | from baseline import SOS_token, EOS_token, UNK_token 3 | 4 | 5 | 6 | def main(path_txt, fld_out, 7 | max_vocab_size=2e4, 8 | ): 9 | 10 | 11 | with open(path_txt, encoding="utf-8") as f: 12 | lines = 
f.readlines() 13 | 14 | if not os.path.exists(fld_out): 15 | os.makedirs(fld_out) 16 | 17 | path = dict() 18 | for end in ['source', 'target']: 19 | path[end] = os.path.join(fld_out, '%s_num.txt'%end) 20 | path['dict'] = os.path.join(fld_out, 'dict.txt') 21 | 22 | for k in path: 23 | open(path[k], 'w') 24 | 25 | sources = [] 26 | targets = [] 27 | n = 0 28 | for line in lines: 29 | n += 1 30 | if n%1e5 == 0: 31 | print('checked %.2fM/%.2fM lines'%(n/1e6, len(lines)/1e6)) 32 | sub = line.split('\t') 33 | source = sub[-2] 34 | target = sub[-1] 35 | if source == 'START': # skip if source has nothing 36 | continue 37 | sources.append(source.strip().split()) 38 | targets.append(target.strip().split()) 39 | 40 | vocab = dict() 41 | for tokens in sources + targets: 42 | for token in tokens: 43 | if token not in vocab: 44 | vocab[token] = 0 45 | vocab[token] += 1 46 | 47 | pq = queue.PriorityQueue() 48 | for token in vocab: 49 | pq.put((-vocab[token], token)) 50 | 51 | ordered_tokens = [SOS_token, EOS_token, UNK_token] 52 | while not pq.empty(): 53 | freq, token = pq.get() 54 | ordered_tokens.append(token) 55 | if len(ordered_tokens) == max_vocab_size: 56 | break 57 | 58 | print('vocab size = %i'%len(ordered_tokens)) 59 | 60 | token2index = dict() 61 | for i, token in enumerate(ordered_tokens): 62 | token2index[token] = str(i + 1) # +1 as 0 is for padding 63 | with open(path['dict'], 'a', encoding="utf-8") as f: 64 | f.write('\n'.join(ordered_tokens)) 65 | 66 | nums = [] 67 | for tokens in sources: 68 | nums.append(' '.join([token2index.get(token, token2index[UNK_token]) for token in tokens])) 69 | with open(path['source'], 'a', encoding="utf-8") as f: 70 | f.write('\n'.join(nums)) 71 | 72 | nums = [] 73 | for tokens in targets: 74 | nums.append(' '.join([token2index.get(token, token2index[UNK_token]) for token in tokens])) 75 | with open(path['target'], 'a', encoding="utf-8") as f: 76 | f.write('\n'.join(nums)) 77 | 78 | 79 | 80 | if __name__ == '__main__': 81 | fld_out = 'official' 82 | path_txt = 'F:/DSTC/data-official/train.convos.txt' 83 | main(path_txt, fld_out) 84 | print('done!') 85 | 86 | 87 | 88 | 89 | 90 | 91 | -------------------------------------------------------------------------------- /data_extraction/Makefile: -------------------------------------------------------------------------------- 1 | .SECONDARY: 2 | 3 | all: train dev valid test refs test.refs 4 | data: train dev valid test 5 | 6 | refs: data-official-test/test.refs.txt 7 | test: data-official-test/test.convos.txt data-official-test/test.facts.txt 8 | valid: data-official-valid/valid.convos.txt data-official-valid/valid.facts.txt 9 | dev: data-official/dev.convos.txt data-official/dev.facts.txt 10 | train: data-official/train.convos.txt data-official/train.facts.txt 11 | 12 | WARGS=--tries=0 --retry-connrefused --waitretry=30 13 | 14 | include src/Makefile.official.targets 15 | 16 | data-official-test/test.refs.txt: $(OFFICIAL_TEST_REFS) 17 | cat $+ | sort | uniq > $@ 18 | 19 | data-official-test/test.convos.txt: $(OFFICIAL_TEST_CONVOS) 20 | cat $+ | sort | uniq > $@ 21 | 22 | data-official-test/test.facts.txt: $(OFFICIAL_TEST_FACTS) 23 | cat $+ > $@ 24 | 25 | data-official-valid/valid.convos.txt: $(OFFICIAL_VALID_CONVOS) 26 | cat $+ | sort | uniq > $@ 27 | 28 | data-official-valid/valid.facts.txt: $(OFFICIAL_VALID_FACTS) 29 | cat $+ > $@ 30 | 31 | data-official/dev.convos.txt: $(OFFICIAL_DEV_CONVOS) 32 | cat $+ > $@ 33 | 34 | data-official/dev.facts.txt: $(OFFICIAL_DEV_FACTS) 35 | cat $+ > $@ 36 | 37 | 
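# Concatenate the per-month training extractions (targets listed in
# src/Makefile.official.targets) into the final training files.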
data-official/train.convos.txt: $(OFFICIAL_TRAIN_CONVOS) 38 | cat $+ > $@ 39 | 40 | data-official/train.facts.txt: $(OFFICIAL_TRAIN_FACTS) 41 | cat $+ > $@ 42 | 43 | test.refs: data-official-test/test.refs.txt lists/test-multiref.sets 44 | cat $< | python src/ids2refs.py lists/test-multiref.sets > $@ 45 | 46 | data-official-test/%.refs.txt: data-official-test/%.pkl lists/test-multiref.hashes 47 | time python src/create_official_data.py --rsinput=reddit/RS_$(*F).zst --rcinput=reddit/RC_$(*F).zst --subreddit_filter=lists/subreddits-official.txt --domain_filter=lists/domains-official.txt --pickle=data-official-test/$(*F).pkl --facts=- --convos=data-official-test/$(*F).refs.txt --anchoronly=True --use_cc=True --test=lists/test-multiref.hashes > logs/$(*F)-refs.log 2> logs/$(*F)-refs.err 48 | 49 | ## Note: there is no point in removing the '--blind' flag in order to peek at the reference responses (gold), as the organizers will rely on different responses to compute BLEU, etc. 50 | data-official-test/%.facts.txt data-official-test/%.convos.txt data-official-test/%.pkl: reddit/RC_%.zst reddit/RS_%.zst data-official-test/.create lists/test.hashes 51 | time python src/create_official_data.py --rsinput=reddit/RS_$(*F).zst --rcinput=reddit/RC_$(*F).zst --subreddit_filter=lists/subreddits-official.txt --domain_filter=lists/domains-official.txt --pickle=data-official-test/$(*F).pkl --facts=data-official-test/$(*F).facts.txt --convos=data-official-test/$(*F).convos.txt --anchoronly=True --use_cc=True --test=lists/test.hashes --blind=True > logs/$(*F).log 2> logs/$(*F).err 52 | 53 | data-official-valid/%.facts.txt data-official-valid/%.convos.txt: reddit/RC_%.zst reddit/RS_%.zst data-official-valid/.create lists/valid.hashes 54 | time python src/create_official_data.py --rsinput=reddit/RS_$(*F).zst --rcinput=reddit/RC_$(*F).zst --subreddit_filter=lists/subreddits-official.txt --domain_filter=lists/domains-official.txt --pickle=data-official-valid/$(*F).pkl --facts=data-official-valid/$(*F).facts.txt --convos=data-official-valid/$(*F).convos.txt --anchoronly=True --use_cc=True --test=lists/valid.hashes > logs/$(*F).log 2> logs/$(*F).err 55 | 56 | data-official/%.facts.txt data-official/%.convos.txt: reddit/RC_%.zst reddit/RS_%.zst data-official/.create 57 | time python src/create_official_data.py --rsinput=reddit/RS_$(*F).zst --rcinput=reddit/RC_$(*F).zst --subreddit_filter=lists/subreddits-official.txt --domain_filter=lists/domains-official.txt --pickle=data-official/$(*F).pkl --facts=data-official/$(*F).facts.txt --convos=data-official/$(*F).convos.txt --anchoronly=True --use_cc=True > logs/$(*F).log 2> logs/$(*F).err 58 | 59 | data-official-test/.create: 60 | mkdir -p logs 61 | mkdir -p data-official-test 62 | touch $@ 63 | 64 | data-official-valid/.create: 65 | mkdir -p logs 66 | mkdir -p data-official-valid 67 | touch $@ 68 | 69 | data-official/.create: 70 | mkdir -p logs 71 | mkdir -p data-official 72 | touch $@ 73 | 74 | reddit/RS_%.zst: 75 | mkdir -p logs 76 | mkdir -p reddit 77 | wget $(WARGS) https://files.pushshift.io/reddit/submissions/RS_$(*F).zst -O reddit/RS_$(*F).zst -o logs/RS_$(*F).zst.log -c 78 | 79 | reddit/RC_%.zst: 80 | mkdir -p logs 81 | mkdir -p reddit 82 | wget $(WARGS) https://files.pushshift.io/reddit/comments/RC_$(*F).zst -O reddit/RC_$(*F).zst -o logs/RC_$(*F).zst.log -c 83 | -------------------------------------------------------------------------------- /data_extraction/README.md: -------------------------------------------------------------------------------- 1 | # Data 
Extraction for DSTC7: End-to-End Conversation Modeling

Task 2 uses conversational data extracted from Reddit. Each conversation in this setup is _grounded_: each conversation is about a specific web page that was linked at the start of the conversation. This page provides code to extract the data from a Reddit [dump](http://files.pushshift.io/reddit/comments/) and from [Common Crawl](http://commoncrawl.org/). The former provides the conversations, while the latter provides the grounding. We provide code instead of actual data, as we are unable to directly release this data.

(Note: the older and now obsolete setup to create the "trial" data can be found [here](https://github.com/DSTC-MSR/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction/trial).)

If you run into any problems creating the data, please check the [FAQ](https://github.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction#FAQ) section below.

## Requirements

This page assumes you are running a UNIX environment (Linux, macOS, etc.). If you are on Windows, please either use its Ubuntu subsystem (instructions [here](https://docs.microsoft.com/en-us/windows/wsl/install-win10)) or any third-party UNIX-like environment such as [Cygwin](https://www.cygwin.com/). Creating the data requires a fair amount of disk space to store the Reddit dump files locally, about 500 GB in total. You will also need the following programs:

* `Python 3.x`, with modules:
  * `nltk`
  * `beautifulsoup4`
  * `chardet`
  * `zstandard`
* `make`

To install the above Python modules, you can run:

```pip install -r requirements.txt```

Please also set `PYTHONIOENCODING=UTF-8` in your environment, e.g., by running this in bash:

```export PYTHONIOENCODING=UTF-8```

## Create official data (train and dev):

First, make sure that Python and all required modules are installed, for example by generating a subset of the data (January 2011):

```make data-official/2011-01.convos.txt```

If the target file is missing after running that command, that probably means Python or one of its modules is missing, is the wrong version, or is misconfigured. Please inspect the log files in `logs/*.log` or `logs/*.err` to find the cause of the problem (please refer to the "Requirements" section above, and see the quick import check sketched below).

If everything is set up properly, please run the following command to create the official training data:

```make -j4```

This will run the extraction pipeline with 4 processes. Depending on the number of cores on your machine, you might want to increase or decrease that number. This will take 2-5 days to run, depending on the number of processes selected. This will create two tab-separated (tsv) files, `data-official/train.convos.txt` and `data-official/train.facts.txt`, which respectively contain the conversational data and the grounded text ("facts"). This will also create two files for the dev set.

The data is generated from Reddit and the web, so some of it is noisy and occasionally contains offensive language. While we selected Reddit boards (i.e., "subreddits") and web domains that are mostly "safe for work", explicit and offensive language sometimes appears in the data, and we did not attempt to eliminate it further (for the sake of simplicity and reproducibility of our pipeline).
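A missing or misconfigured module is the most common cause of failure. The snippet below (a minimal, illustrative check, not part of the official pipeline) verifies that the modules from `requirements.txt` can be imported:

```python
# Quick sanity check (illustrative only): confirm the required modules import
# cleanly before starting the multi-day `make` run.
import importlib

for name in ("nltk", "bs4", "chardet", "zstandard"):  # bs4 = beautifulsoup4
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "version unknown"))
    except ImportError as err:
        print("MISSING:", name, "-", err)
```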
Note: if you set a large number of processes, the server hosting the Reddit data (`files.pushshift.io`) might complain about "too many open connections". If so, you might want to use the makefile to first create all `reddit/*.zst` files and only then run e.g. `make -j7`.

Generated data files (see data description below):
1. `train.convos.txt`: Conversations of the training set
2. `train.facts.txt`: Facts of the training set
3. `dev.convos.txt`: Conversations of the dev set
4. `dev.facts.txt`: Facts of the dev set

Role of each dataset:
1. TRAIN: do anything you want with this data (train, analyze, etc.);
2. DEV: you may tune model hyperparameters on this data, but you may NOT train any model directly on it;
3. TEST (official release Sept 10): the blind test set will provide facts and conversational contexts, but the gold responses will be hidden (replaced with `__UNDISCLOSED__`).

If you participate in the challenge and plan to send us system outputs, **please do NOT use any training data** other than the data provided on this page.

## Data description:

Each conversation in this dataset consists of a Reddit `submission` and the discussion-like `comments` that follow it. In this data, we restrict ourselves to submissions that provide a `URL` along with a `title` (see [example Reddit submission](https://www.reddit.com/r/todayilearned/comments/f2ruz/til_a_woman_fell_30000_feet_from_an_airplane_and/), which refers to [this web page](https://en.wikipedia.org/wiki/Vesna_Vulovi%C4%87)). The web page scraped from the URL provides grounding or context to the conversation, and is additional (non-conversational) input that models can condition on to produce responses that are more informative and contentful.

### Conversation file:

Each line of `train.convos.txt` and `dev.convos.txt` contains a Reddit response and its preceding conversational context. Long conversational contexts are truncated by keeping the last 100 words. The file contains 7 columns:

1. hash value (only for sanity check)
2. subreddit name
3. conversation ID
4. response score
5. dialogue turn number (e.g., "1" = start of the conversation, "2" = 2nd turn of a conversation)
6. conversational context, usually multiple turns (input of the model)
7. response (output of the model)

The conversational context may contain:
* EOS: special symbol indicating a turn transition
* START: special symbol indicating the start of the conversation

### Facts file:

Each line of `train.facts.txt` and `dev.facts.txt` contains a "fact": a sentence, paragraph, or other snippet of text relevant to the current conversation. Use conversation IDs to find the facts relevant to each conversation. Note: facts relevant to a given conversation are ordered as they appear on the original web page. The file contains 5 columns:

1. hash value (only for sanity check)
2. subreddit name
3. conversation ID
4. domain name
5. fact

A minimal sketch of how to join the two files by conversation ID is shown below.

To produce the facts relevant to each conversation, we extracted the text of the page using an html-to-text converter ([BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)), but kept the most important tags intact (`<title>`, `<h1>`-`<h6>`, `<p>`, etc.).
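Here is that sketch (illustrative only; it assumes the tab-separated column layouts listed above and is not part of the official pipeline):

```python
# Illustrative sketch: join *.convos.txt and *.facts.txt by conversation ID.
from collections import defaultdict

def load_facts(path):
    """Map conversation ID -> list of facts, kept in original page order."""
    facts = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            _hash, _subreddit, convo_id, _domain, fact = line.rstrip("\n").split("\t", 4)
            facts[convo_id].append(fact)
    return facts

def iter_convos(path, facts):
    """Yield (context, response, facts) triples, one per conversation turn."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            _hash, _subreddit, convo_id, _score, _turn, context, response = \
                line.rstrip("\n").split("\t", 6)
            yield context, response, facts.get(convo_id, [])

if __name__ == "__main__":
    facts = load_facts("data-official/train.facts.txt")
    for context, response, convo_facts in iter_convos("data-official/train.convos.txt", facts):
        pass  # feed (context, convo_facts) -> response into your model here
```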
As web formatting differs substantially from domain to domain and common tags like `<p>` may not be used in some domains, we decided to keep all the text of the original page (however, we do remove javascript and style code). As some of the fact data tend to be noisy, you may want restrict yourself to facts delimited by these tags. 89 | 90 | 91 | #### Labeled anchors 92 | 93 | A substantial number of URLs contain labeled achors, for example: 94 | 95 | ```http://en.wikipedia.org/wiki/John_Rhys-Davies#The_Lord_of_the_Rings_trilogy``` 96 | 97 | which here refers to the label `The_Lord_of_the_Rings_trilogy`. This information is preserved in the facts, and indicated with the tags `<anchor>` and `</anchor>`. As many web pages in this dataset are lengthy, anchors are probably useful information, as they indicate what text the model should likely attend to in order to produce a good response. 98 | 99 | ### Data statistics: 100 | 101 | | | Trial data | Train set | Dev set | 102 | | ---- | ---- | ---- | ---- | 103 | |# dialogue turns | 649,866 | 2,364,228 | - | 104 | |# facts | 4,320,438 | 15,180,800 | - | 105 | |# tagged facts (1) | 998,032 | 2,185,893 | - | 106 | 107 | (1): facts tagged with html markup (e.g., <title>) and therefore potentially important. 108 | 109 | ### Sample data: 110 | 111 | #### Sample conversation turn: 112 | 113 | ```<hash> \t todayilearned \t f2ruz \t 145 \t 2 \t START EOS til a woman fell 30,000 feet from an airplane and survived . \t the page states that a 2009 report found the plane only fell several hundred meters .``` 114 | 115 | Maps to: 116 | 117 | 1. hash value: ... 118 | 2. subreddit name: `TodayILearned` 119 | 3. conversation ID: `f2ruz` 120 | 4. response score: `145` 121 | 5. dialogue turn number: `2` 122 | 6. conversational context: `START EOS til a woman fell 30,000 feet from an airplane and survived .` 123 | 7. response: `the page states that a 2009 report found the plane only fell several hundred meters .` 124 | 125 | #### Sample fact: 126 | 127 | ```<hash> \t todayilearned \t f2ruz \t en.wikipedia.org \t <p> four years later , peter hornung-andersen and pavel theiner , two prague-based journalists , claimed that flight 367 had been mistaken for an enemy aircraft and shot down by the czechoslovak air force at an altitude of 800 metres ( 2,600 ft ) . </p>``` 128 | 129 | Maps to: 130 | 1. hash value: ... 131 | 2. subreddit name: `TodayILearned` 132 | 3. conversation ID: `f2ruz` 133 | 4. domain name: `en.wikipedia.org` 134 | 5. fact: `<p> four years later , peter hornung-andersen and pavel theiner , two prague-based journalists , claimed that flight 367 had been mistaken for an enemy aircraft and shot down by the czechoslovak air force at an altitude of 800 metres ( 2,600 ft ) . </p>` 135 | 136 | <a name="FAQ"></a> 137 | ## FAQ 138 | 139 | **Q:** `make` crashed and returned some non-zero code(s), e.g.: `make: *** [reddit/RC_2013-05.zst] Error 8` 140 | <br> 141 | **A:** It might be a temporary network connection problem. If you rerun the same `make` command, the scripts will resume data download and creation from where it left off. 142 | <br> 143 | **A:** Alternatively, it might be because you ran `make` with a large number of processes (> 4). The server hosting the Reddit data doesn't allow more than 4 simultaneous connections from the same IP address. 144 | 145 | **Q:** Creating the data with these scripts is inconvenient. Instead, could you please send us the data directly? 
146 | <br> 147 | **A:** The data (either Reddit or web) is not our own, so we are unable to redistribute it. The same situtation happended last year for [DSTC6 Task 2](https://github.com/dialogtekgeek/DSTC6-End-to-End-Conversation-Modeling), and the organizers then also provided scraping code that let participants construct the data themselves. 148 | 149 | **Q:** I have trouble running the download script because of Internet control in my country. 150 | <br> 151 | **A:** You might want to consider VPN solutions. Alternatively, you could try to team up with other people or team who live in a different country. Participants from different institutions can collaborate and submit system outputs together. 152 | 153 | -------------------------------------------------------------------------------- /data_extraction/lists/domains-official.txt: -------------------------------------------------------------------------------- 1 | 3news.co.nz 2 | 9news.com.au 3 | abc.net.au 4 | abcnews.go.com 5 | afr.com 6 | al.com 7 | aljazeera.com 8 | anandtech.com 9 | animenewsnetwork.com 10 | arcgames.com 11 | arstechnica.com 12 | au.news.yahoo.com 13 | autoblog.com 14 | baltimoresun.com 15 | basketball-reference.com 16 | bbc.co.uk 17 | bbc.com 18 | bestbuy.com 19 | bigstory.ap.org 20 | bleacherreport.com 21 | bleedingcool.com 22 | blog.chron.com 23 | blog.eu.playstation.com 24 | blog.sfgate.com 25 | blog.us.playstation.com 26 | bloomberg.com 27 | boardgamegeek.com 28 | boingboing.net 29 | bostonglobe.com 30 | brisbanetimes.com.au 31 | business.financialpost.com 32 | businessweek.com 33 | buzzfeed.com 34 | canberratimes.com.au 35 | cbc.ca 36 | charlotteobserver.com 37 | chicagotribune.com 38 | chron.com 39 | clarin.com 40 | cleveland.com 41 | cnet.com 42 | cnn.com 43 | collider.com 44 | comicbook.com 45 | comicbookmovie.com 46 | comingsoon.net 47 | cp24.com 48 | crash.net 49 | crunchyroll.com 50 | csmonitor.com 51 | dailymail.co.uk 52 | dailymotion.com 53 | dailyrecord.co.uk 54 | dallasnews.com 55 | deadline.com 56 | deadspin.com 57 | deviantart.com 58 | digitaltrends.com 59 | dnainfo.com 60 | economist.com 61 | edition.cnn.com 62 | en.m.wikipedia.org 63 | en.wikipedia.org 64 | engadget.com 65 | english.alarabiya.net 66 | escapistmagazine.com 67 | espn.go.com 68 | euractiv.com 69 | eurekalert.org 70 | euronews.com 71 | ew.com 72 | faz.net 73 | finance.yahoo.com 74 | fivethirtyeight.com 75 | flickeringmyth.com 76 | football-italia.net 77 | forbes.com 78 | foreignpolicy.com 79 | forum.xda-developers.com 80 | forums.robertsspaceindustries.com 81 | fourfourtwo.com 82 | foxnews.com 83 | foxsports.com 84 | ft.com 85 | gawker.com 86 | gizmodo.com 87 | globalnews.ca 88 | goal.com 89 | gq.com 90 | guardian.co.uk 91 | haaretz.com 92 | huffingtonpost.ca 93 | humblebundle.com 94 | hurriyetdailynews.com 95 | ibtimes.com 96 | ign.com 97 | independent.co.uk 98 | independent.ie 99 | indiegogo.com 100 | inquisitr.com 101 | io9.com 102 | irishexaminer.com 103 | irishtimes.com 104 | jalopnik.com 105 | japantimes.co.jp 106 | joystiq.com 107 | jsonline.com 108 | kansascity.com 109 | kickstarter.com 110 | latimes.com 111 | liverpoolecho.co.uk 112 | livescience.com 113 | macleans.ca 114 | majornelson.com 115 | manchestereveningnews.co.uk 116 | marca.com 117 | mashable.com 118 | mercurynews.com 119 | metro.co.uk 120 | miamiherald.com 121 | mirror.co.uk 122 | mlb.mlb.com 123 | mlive.com 124 | mlssoccer.com 125 | montrealgazette.com 126 | msn.com 127 | nasa.gov 128 | nature.com 129 | nba.com 130 | nbcnews.com 131 | 
news.nationalgeographic.com 132 | news.nationalpost.com 133 | news.yahoo.com 134 | newsarama.com 135 | newscientist.com 136 | newshub.co.nz 137 | newyorker.com 138 | nfl.com 139 | nhl.com 140 | nj.com 141 | nola.com 142 | npr.org 143 | nydailynews.com 144 | nypost.com 145 | oglobo.globo.com 146 | oregonlive.com 147 | orlandosentinel.com 148 | pbs.org 149 | pcgamer.com 150 | pcworld.com 151 | philly.com 152 | phillymag.com 153 | phys.org 154 | playstationlifestyle.net 155 | politico.com 156 | polygon.com 157 | popularmechanics.com 158 | post-gazette.com 159 | pro-football-reference.com 160 | profootballtalk.nbcsports.com 161 | qz.com 162 | redditgifts.com 163 | reuters.com 164 | robertsspaceindustries.com 165 | rockpapershotgun.com 166 | rollingstone.com 167 | rt.com 168 | sacbee.com 169 | sanfrancisco.cbslocal.com 170 | sbnation.com 171 | scotsman.com 172 | screenrant.com 173 | sfchronicle.com 174 | sfgate.com 175 | si.com 176 | skysports.com 177 | slate.com 178 | smashboards.com 179 | smh.com.au 180 | smithsonianmag.com 181 | snopes.com 182 | sports.yahoo.com 183 | sputniknews.com 184 | stackoverflow.com 185 | steamcommunity.com 186 | store.playstation.com 187 | sun-sentinel.com 188 | surrenderat20.net 189 | target.com 190 | techcrunch.com 191 | technologyreview.com 192 | telegraph.co.uk 193 | theage.com.au 194 | theatlantic.com 195 | thebiglead.com 196 | theglobeandmail.com 197 | theguardian.com 198 | thehill.com 199 | thejournal.ie 200 | themoscowtimes.com 201 | thenextweb.com 202 | theringer.com 203 | thestar.com 204 | thesun.co.uk 205 | theverge.com 206 | thewrap.com 207 | thinkprogress.org 208 | time.com 209 | timesofisrael.com 210 | tmz.com 211 | torontosun.com 212 | tsn.ca 213 | tvline.com 214 | uefa.com 215 | uk.businessinsider.com 216 | upi.com 217 | us.battle.net 218 | usatoday.com 219 | vancouversun.com 220 | variety.com 221 | vg247.com 222 | vulture.com 223 | walmart.com 224 | wikipedia.org 225 | wired.com 226 | zdnet.com 227 | -------------------------------------------------------------------------------- /data_extraction/lists/subreddits-official.txt: -------------------------------------------------------------------------------- 1 | 49ers 2 | amateurradio 3 | anime 4 | arrow 5 | asoiaf 6 | australia 7 | aviation 8 | badphilosophy 9 | baltimore 10 | battlefield_4 11 | belgium 12 | bengals 13 | bicycling 14 | bindingofisaac 15 | bjj 16 | boardgames 17 | bonnaroo 18 | borussiadortmund 19 | bostonceltics 20 | bourbon 21 | breakingbad 22 | buccaneers 23 | buffalobills 24 | business 25 | canada 26 | canucks 27 | cars 28 | chelseafc 29 | cincinnati 30 | classicalmusic 31 | climbing 32 | comicbooks 33 | comics 34 | cordcutters 35 | cowboys 36 | creepy 37 | dayz 38 | de 39 | diabetes 40 | diablo3 41 | disney 42 | doctorwho 43 | dogs 44 | dotamasterrace 45 | dvdcollection 46 | eagles 47 | electronicmusic 48 | energy 49 | entertainment 50 | eu4 51 | europe 52 | everymanshouldknow 53 | exjw 54 | falcons 55 | fivenightsatfreddys 56 | flying 57 | formula1 58 | frugalmalefashion 59 | ftm 60 | furry 61 | futurama 62 | gamegrumps 63 | gameofthrones 64 | gamernews 65 | germany 66 | guns 67 | hardware 68 | haskell 69 | hawks 70 | heroesofthestorm 71 | homestuck 72 | horror 73 | houston 74 | humor 75 | interestingasfuck 76 | ipad 77 | iphone 78 | ireland 79 | kansascity 80 | knitting 81 | korea 82 | lakers 83 | leafs 84 | leagueoflegends 85 | linguistics 86 | linux 87 | linux_gaming 88 | london 89 | lotr 90 | madmen 91 | malefashionadvice 92 | martialarts 93 | math 94 | medicine 95 | 
melbourne 96 | metalgearsolid 97 | miamidolphins 98 | minnesotavikings 99 | montreal 100 | motogp 101 | motorcycles 102 | movies 103 | nba 104 | newzealand 105 | nfl 106 | nhl 107 | nintendo 108 | nursing 109 | nyc 110 | nyjets 111 | oneplus 112 | orangecounty 113 | orlando 114 | ottawa 115 | panthers 116 | paradoxplaza 117 | patientgamers 118 | paydaytheheist 119 | perth 120 | philadelphia 121 | philosophy 122 | phish 123 | pinkfloyd 124 | pittsburgh 125 | progmetal 126 | programming 127 | ravens 128 | redsox 129 | roosterteeth 130 | rugbyunion 131 | running 132 | russia 133 | sanfrancisco 134 | science 135 | scifi 136 | seinfeld 137 | short 138 | shutupandtakemymoney 139 | skeptic 140 | smashbros 141 | soccer 142 | sports 143 | starcitizen 144 | sto 145 | subaru 146 | sydney 147 | sysadmin 148 | tea 149 | tech 150 | television 151 | tennis 152 | teslamotors 153 | texas 154 | tf2 155 | timberwolves 156 | titanfall 157 | todayilearned 158 | toronto 159 | torontoraptors 160 | unitedkingdom 161 | vegan 162 | vexillology 163 | vinyl 164 | visualnovels 165 | vita 166 | wargame 167 | warriors 168 | washingtondc 169 | web_design 170 | webdev 171 | windows 172 | windowsphone 173 | worldnews 174 | wow 175 | writing 176 | xbox360 177 | xboxone 178 | zen 179 | -------------------------------------------------------------------------------- /data_extraction/requirements.txt: -------------------------------------------------------------------------------- 1 | chardet==3.0.4 2 | nltk==3.4.1 3 | beautifulsoup4==4.6.0 4 | zstandard==0.18.0 5 | -------------------------------------------------------------------------------- /data_extraction/src/Makefile.official: -------------------------------------------------------------------------------- 1 | .SECONDARY: 2 | 3 | train: data-official/train.convos.txt data-official/train.facts.txt 4 | 5 | include src/Makefile.official.targets 6 | 7 | 8 | data-official/train.convos.txt: $(OFFICIAL_TRAIN_CONVOS) 9 | cat $+ > $@ 10 | 11 | data-official/train.facts.txt: $(OFFICIAL_TRAIN_FACTS) 12 | cat $+ > $@ 13 | 14 | data-official/%.facts.txt data-official/%.convos.txt: reddit/RC_%.bz2 reddit/RS_%.bz2 data-official/.create 15 | python src/create_official_data.py --rsinput=reddit/RS_$(*F).bz2 --rcinput=reddit/RC_$(*F).bz2 --subreddit_filter=lists/subreddits-official.txt --domain_filter=lists/domains-official.txt --pickle=data-official/$(*F).pkl --facts=data-official/$(*F).facts.txt --convos=data-official/$(*F).convos.txt --anchoronly=True --use_cc=True > logs-official/$(*F).log 2> logs-official/$(*F).err 16 | 17 | data-official/.create: 18 | mkdir -p logs 19 | mkdir -p data-official 20 | mkdir -p logs-official 21 | touch $@ 22 | 23 | reddit/RS_%.bz2: 24 | mkdir -p reddit 25 | wget https://files.pushshift.io/reddit/submissions/RS_$(*F).bz2 -O reddit/RS_$(*F).bz2 -o logs/RS_$(*F).bz2.log -c 26 | 27 | reddit/RC_%.bz2: 28 | mkdir -p reddit 29 | wget https://files.pushshift.io/reddit/comments/RC_$(*F).bz2 -O reddit/RC_$(*F).bz2 -o logs/RC_$(*F).bz2.log -c 30 | -------------------------------------------------------------------------------- /data_extraction/src/Makefile.official.targets: -------------------------------------------------------------------------------- 1 | OFFICIAL_TRAIN_CONVOS=data-official/2011-01.convos.txt data-official/2011-02.convos.txt data-official/2011-03.convos.txt data-official/2011-04.convos.txt data-official/2011-05.convos.txt data-official/2011-06.convos.txt data-official/2011-07.convos.txt data-official/2011-08.convos.txt 
data-official/2011-09.convos.txt data-official/2011-10.convos.txt data-official/2011-11.convos.txt data-official/2011-12.convos.txt data-official/2012-01.convos.txt data-official/2012-02.convos.txt data-official/2012-03.convos.txt data-official/2012-04.convos.txt data-official/2012-05.convos.txt data-official/2012-06.convos.txt data-official/2012-07.convos.txt data-official/2012-08.convos.txt data-official/2012-09.convos.txt data-official/2012-10.convos.txt data-official/2012-11.convos.txt data-official/2012-12.convos.txt data-official/2013-01.convos.txt data-official/2013-02.convos.txt data-official/2013-03.convos.txt data-official/2013-04.convos.txt data-official/2013-05.convos.txt data-official/2013-06.convos.txt data-official/2013-07.convos.txt data-official/2013-08.convos.txt data-official/2013-09.convos.txt data-official/2013-10.convos.txt data-official/2013-11.convos.txt data-official/2013-12.convos.txt data-official/2014-01.convos.txt data-official/2014-02.convos.txt data-official/2014-03.convos.txt data-official/2014-04.convos.txt data-official/2014-05.convos.txt data-official/2014-06.convos.txt data-official/2014-07.convos.txt data-official/2014-08.convos.txt data-official/2014-09.convos.txt data-official/2014-10.convos.txt data-official/2014-11.convos.txt data-official/2014-12.convos.txt data-official/2015-01.convos.txt data-official/2015-02.convos.txt data-official/2015-03.convos.txt data-official/2015-04.convos.txt data-official/2015-05.convos.txt data-official/2015-06.convos.txt data-official/2015-07.convos.txt data-official/2015-08.convos.txt data-official/2015-09.convos.txt data-official/2015-10.convos.txt data-official/2015-11.convos.txt data-official/2015-12.convos.txt data-official/2016-01.convos.txt data-official/2016-02.convos.txt data-official/2016-03.convos.txt data-official/2016-04.convos.txt data-official/2016-05.convos.txt data-official/2016-06.convos.txt data-official/2016-07.convos.txt data-official/2016-08.convos.txt data-official/2016-09.convos.txt data-official/2016-10.convos.txt data-official/2016-11.convos.txt data-official/2016-12.convos.txt 2 | 3 | OFFICIAL_TRAIN_FACTS=data-official/2011-01.facts.txt data-official/2011-02.facts.txt data-official/2011-03.facts.txt data-official/2011-04.facts.txt data-official/2011-05.facts.txt data-official/2011-06.facts.txt data-official/2011-07.facts.txt data-official/2011-08.facts.txt data-official/2011-09.facts.txt data-official/2011-10.facts.txt data-official/2011-11.facts.txt data-official/2011-12.facts.txt data-official/2012-01.facts.txt data-official/2012-02.facts.txt data-official/2012-03.facts.txt data-official/2012-04.facts.txt data-official/2012-05.facts.txt data-official/2012-06.facts.txt data-official/2012-07.facts.txt data-official/2012-08.facts.txt data-official/2012-09.facts.txt data-official/2012-10.facts.txt data-official/2012-11.facts.txt data-official/2012-12.facts.txt data-official/2013-01.facts.txt data-official/2013-02.facts.txt data-official/2013-03.facts.txt data-official/2013-04.facts.txt data-official/2013-05.facts.txt data-official/2013-06.facts.txt data-official/2013-07.facts.txt data-official/2013-08.facts.txt data-official/2013-09.facts.txt data-official/2013-10.facts.txt data-official/2013-11.facts.txt data-official/2013-12.facts.txt data-official/2014-01.facts.txt data-official/2014-02.facts.txt data-official/2014-03.facts.txt data-official/2014-04.facts.txt data-official/2014-05.facts.txt data-official/2014-06.facts.txt data-official/2014-07.facts.txt data-official/2014-08.facts.txt 
data-official/2014-09.facts.txt data-official/2014-10.facts.txt data-official/2014-11.facts.txt data-official/2014-12.facts.txt data-official/2015-01.facts.txt data-official/2015-02.facts.txt data-official/2015-03.facts.txt data-official/2015-04.facts.txt data-official/2015-05.facts.txt data-official/2015-06.facts.txt data-official/2015-07.facts.txt data-official/2015-08.facts.txt data-official/2015-09.facts.txt data-official/2015-10.facts.txt data-official/2015-11.facts.txt data-official/2015-12.facts.txt data-official/2016-01.facts.txt data-official/2016-02.facts.txt data-official/2016-03.facts.txt data-official/2016-04.facts.txt data-official/2016-05.facts.txt data-official/2016-06.facts.txt data-official/2016-07.facts.txt data-official/2016-08.facts.txt data-official/2016-09.facts.txt data-official/2016-10.facts.txt data-official/2016-11.facts.txt data-official/2016-12.facts.txt 4 | 5 | OFFICIAL_DEV_CONVOS=data-official/2017-01.convos.txt data-official/2017-02.convos.txt data-official/2017-03.convos.txt 6 | 7 | OFFICIAL_DEV_FACTS=data-official/2017-01.facts.txt data-official/2017-02.facts.txt data-official/2017-03.facts.txt 8 | 9 | # Note: validation == dev set, but validation is a subset used for automatic evaluation 10 | OFFICIAL_VALID_CONVOS=data-official-valid/2017-01.convos.txt data-official-valid/2017-02.convos.txt data-official-valid/2017-03.convos.txt 11 | 12 | OFFICIAL_VALID_FACTS=data-official-valid/2017-01.facts.txt data-official-valid/2017-02.facts.txt data-official-valid/2017-03.facts.txt 13 | 14 | OFFICIAL_TEST_CONVOS=data-official-test/2017-04.convos.txt data-official-test/2017-05.convos.txt data-official-test/2017-06.convos.txt data-official-test/2017-07.convos.txt data-official-test/2017-08.convos.txt data-official-test/2017-09.convos.txt data-official-test/2017-10.convos.txt data-official-test/2017-11.convos.txt 15 | 16 | OFFICIAL_TEST_FACTS=data-official-test/2017-04.facts.txt data-official-test/2017-05.facts.txt data-official-test/2017-06.facts.txt data-official-test/2017-07.facts.txt data-official-test/2017-08.facts.txt data-official-test/2017-09.facts.txt data-official-test/2017-10.facts.txt data-official-test/2017-11.facts.txt 17 | 18 | OFFICIAL_TEST_REFS=data-official-test/2017-04.refs.txt data-official-test/2017-05.refs.txt data-official-test/2017-06.refs.txt data-official-test/2017-07.refs.txt data-official-test/2017-08.refs.txt data-official-test/2017-09.refs.txt data-official-test/2017-10.refs.txt data-official-test/2017-11.refs.txt 19 | -------------------------------------------------------------------------------- /data_extraction/src/commoncrawl.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | """ 4 | commoncrawl.py 5 | Class to download web pages from Common Crawl, optionally specifying a target year and month. 
6 | Author: Michel Galley, Microsoft Research NLP Group (dstc7-task2@microsoft.com) 7 | """ 8 | 9 | import sys 10 | import time 11 | import gzip 12 | import json 13 | import urllib.parse 14 | import urllib.request 15 | import chardet 16 | from datetime import datetime 17 | from io import BytesIO 18 | 19 | class CommonCrawl: 20 | 21 | month_keys = [ '2018-05', '2018-04', '2018-03', '2018-02', '2018-01', '2017-12', '2017-11', '2017-10', '2017-09', '2017-08', '2017-07', '2017-06', '2017-05', '2017-04', '2017-03', '2017-02', '2017-01', '2016-12', '2016-10', '2016-09', '2016-08', '2016-07', '2016-06', '2016-05', '2016-04', '2016-02', '2015-11', '2015-09', '2015-08', '2015-07', '2015-06', '2015-05', '2015-04', '2015-03', '2015-02', '2015-01', '2014-12', '2014-11', '2014-10', '2014-09', '2014-08', '2014-07', '2014-04', '2014-03', '2014-02', '2013-09' ] 22 | month_ids = [ '2018-22', '2018-17', '2018-13', '2018-09', '2018-05', '2017-51', '2017-47', '2017-43', '2017-39', '2017-34', '2017-30', '2017-26', '2017-22', '2017-17', '2017-13', '2017-09', '2017-04', '2016-50', '2016-44', '2016-40', '2016-36', '2016-30', '2016-26', '2016-22', '2016-18', '2016-07', '2015-48', '2015-40', '2015-35', '2015-32', '2015-27', '2015-22', '2015-18', '2015-14', '2015-11', '2015-06', '2014-52', '2014-49', '2014-42', '2014-41', '2014-35', '2014-23', '2014-15', '2014-10', '2013-48', '2013-20' ] 23 | index_url_prefix = 'http://index.commoncrawl.org/CC-MAIN-' 24 | data_url = 'https://data.commoncrawl.org/' 25 | index_url_suffix = '%2F&output=json' 26 | error_internal_server = 500 27 | error_bad_gateway = 502 28 | error_service_unavailable = 503 29 | error_gateway_timeout = 504 30 | error_not_found = 404 31 | max_retry = 5 32 | retry_wait = 3 33 | 34 | def __init__(self, month_offset): 35 | self.month_keys_dic = dict([ (self.month_keys[i], i) for i in range(0, len(self.month_keys))]) 36 | self.month_offset = month_offset 37 | 38 | 39 | def get_key(self, url, month, year=None): 40 | if year == None: 41 | return f"{url}|{month}" 42 | return f"{url}|{year}-{month}" 43 | 44 | def get_match(self, cc_match, url, month_id): 45 | k = self.get_key(url, month_id) 46 | return cc_match[k] 47 | 48 | def _get_month_id(self, year, month): 49 | k = year + "-" + month 50 | iyear = int(year) 51 | imonth = int(month) 52 | if k in self.month_keys_dic.keys(): 53 | idx = self.month_keys_dic[k] 54 | elif (iyear == 2014 and imonth <= 2) or (iyear == 2013 and imonth >= 10): 55 | idx = self.month_keys_dic['2014-02'] 56 | elif int(iyear) <= 2013: 57 | idx = self.month_keys_dic['2013-09'] 58 | else: 59 | idx = 0 # >= 2018 60 | idx += self.month_offset 61 | return max(0, min(idx, len(self.month_keys)-1)) 62 | 63 | def download(self, url, year=None, month=None, backward=True, cc_match=None): 64 | """ 65 | Returns html from a url using Common Crawl (CC). 66 | url = identifies page to retrieve 67 | year = of the page 68 | month = month of the page 69 | backward = whether to search backward in time if page isn't found (if false, search forward) 70 | Returns (response, date), where response is the html as a string, and the date the page 71 | was originally retrieved (datetime object). 
72 | """ 73 | idx = 0 74 | if cc_match: 75 | old_date = f'{year}-{month}' 76 | year, month = self.get_match(cc_match, url, year + '-' + month).split('-') 77 | new_date = f'{year}-{month}' 78 | print(f'MATCH\t{url}\t{old_date}\t{new_date}') 79 | if year != None and month != None: 80 | idx = self._get_month_id(year, month) 81 | step = int(backward)*2-1 82 | retry = 0 83 | while 0 <= idx and idx < len(self.month_keys): 84 | month_id = self.month_ids[idx] 85 | iurl = self.index_url_prefix + month_id + '-index?url=' + urllib.parse.quote_plus(url) + self.index_url_suffix 86 | #print(" --> try: %s [%s]" % (url, self.month_keys[idx])) 87 | try: 88 | # Find page in index: 89 | u = urllib.request.urlopen(iurl) 90 | pages = [json.loads(x) for x in u.read().decode('utf-8').strip().split('\n')] 91 | page = pages[0] # To do: if get multiple pages, find closest match in time 92 | 93 | # Get data from warc file: 94 | offset, length = int(page['offset']), int(page['length']) 95 | offset_end = offset + length - 1 96 | dataurl = self.data_url + page['filename'] 97 | request = urllib.request.Request(dataurl) 98 | rangestr = 'bytes={}-{}'.format(offset, offset_end) 99 | request.add_header('Range', rangestr) 100 | u = urllib.request.urlopen(request) 101 | content = u.read() 102 | raw_data = BytesIO(content) 103 | f = gzip.GzipFile(fileobj=raw_data) 104 | 105 | data = f.read() 106 | enc = chardet.detect(data) 107 | els = data.decode(enc['encoding']).strip().split('\r\n\r\n', 2) 108 | if len(els) != 3: 109 | idx = idx + step 110 | else: 111 | warc, header, response = els 112 | date = datetime.strptime(page['timestamp'],'%Y%m%d%H%M%S') 113 | return response, self.month_keys[idx], date 114 | except UnicodeDecodeError: 115 | idx = idx + step 116 | except urllib.error.HTTPError as err: 117 | if err.code == self.error_service_unavailable or \ 118 | err.code == self.error_gateway_timeout or \ 119 | err.code == self.error_bad_gateway or \ 120 | err.code == self.error_internal_server: 121 | if retry >= self.max_retry: 122 | idx = idx + step 123 | retry = 0 124 | else: 125 | retry += 1 126 | time.sleep(self.retry_wait) 127 | msg = "Common Crawl error code %d, waiting %d seconds... 
(retry attempt %d/%d), url: %s" 128 | print(msg % (err.code, self.retry_wait, retry, self.max_retry, iurl), file=sys.stderr) 129 | sys.stderr.flush() 130 | elif err.code == self.error_not_found: 131 | idx = idx + step 132 | retry = 0 133 | else: 134 | idx = idx + step 135 | retry = 0 136 | print("Unexpected error code: %d" % err.code, file=sys.stderr) 137 | sys.stderr.flush() 138 | except Exception as err: 139 | print("Unexpected error: %s, waiting %d seconds" % (err, self.retry_wait), file=sys.stderr) 140 | sys.stderr.flush() 141 | time.sleep(self.retry_wait) 142 | return None, None, None 143 | 144 | if __name__== "__main__": 145 | cc = CommonCrawl(-2) 146 | if len(sys.argv) != 2 and len(sys.argv) != 4: 147 | print("Usage: python %s URL [YEAR] [MONTH]\n\nArguments YEAR and MONTH must have the following format: YYYY and MM" % sys.argv[0], file=sys.stderr) 148 | else: 149 | url = sys.argv[1] 150 | month = None 151 | year = None 152 | if len(sys.argv) == 4: 153 | year = sys.argv[2] 154 | month = sys.argv[3] 155 | html, month_id, date = cc.download(url, year, month) 156 | if html != None: 157 | print(html) 158 | print("<!-- Retrieved on: " + str(date) + " -->") 159 | -------------------------------------------------------------------------------- /data_extraction/src/create_official_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | """ 4 | create_official_data.py 5 | A script to extract: 6 | (1) conversations from Reddit file dumps (originally downloaded from https://files.pushshift.io/reddit/daily/) 7 | (2) grounded data ("facts") extracted from the web, respecting robots.txt 8 | """ 9 | 10 | import sys 11 | import io 12 | import time 13 | import os.path 14 | import re 15 | import argparse 16 | import traceback 17 | import json 18 | import zstandard as zstd 19 | import pickle 20 | import nltk 21 | import urllib.request 22 | import urllib.robotparser 23 | import hashlib 24 | 25 | from bs4 import BeautifulSoup 26 | from bs4.element import NavigableString 27 | from bs4.element import CData 28 | from multiprocessing import Pool 29 | from nltk.tokenize import TweetTokenizer 30 | from commoncrawl import CommonCrawl 31 | 32 | parser = argparse.ArgumentParser() 33 | parser.add_argument("--rsinput", help="Submission (RS) file to load.") 34 | parser.add_argument("--rcinput", help="Comments (RC) file to load.") 35 | parser.add_argument("--test", help="Hashes of test set convos.", default="") 36 | parser.add_argument("--facts", help="Facts file to create.") 37 | parser.add_argument("--convos", help="Convo file to create.") 38 | parser.add_argument("--pickle", help="Pickle that contains conversations and facts.", default="data.pkl") 39 | parser.add_argument("--subreddit_filter", help="List of subreddits (inoffensive, safe for work, etc.)") 40 | parser.add_argument("--domain_filter", help="Filter on subreddits and domains.") 41 | parser.add_argument("--nsubmissions", help="Number of submissions to process (< 0 means all)", default=-1, type=int) 42 | parser.add_argument("--min_fact_len", help="Minimum number of tokens in each fact (reduce noise in html).", default=0, type=int) 43 | parser.add_argument("--min_res_len", help="Min number of characters in response.", default=2, type=int) 44 | parser.add_argument("--max_res_len", help="Max number of characters in response.", default=280, type=int) 45 | parser.add_argument("--max_context_len", help="Max number of words in context.", default=200, type=int) 46 | 
parser.add_argument("--max_depth", help="Maximum length of conversation.", default=5, type=int) 47 | parser.add_argument("--mincomments", help="Minimum number of comments per submission.", default=10, type=int) 48 | parser.add_argument("--minscore", help="Minimum score of each comment.", default=1, type=int) 49 | parser.add_argument("--delay", help="Seconds of delay when crawling web pages", default=0, type=int) 50 | parser.add_argument("--tokenize", help="Whether to tokenize facts and conversations.", default=True, type=bool) 51 | parser.add_argument("--anchoronly", help="Filter out URLs with no named anchors.", default=False, type=bool) 52 | parser.add_argument("--use_robots_txt", help="Whether to respect robots.txt (disable this only if urls have been previously checked by other means!)", default=True, type=bool) 53 | parser.add_argument("--use_cc", help="Whether to download pages from Common Crawl.", default=False, type=bool) 54 | parser.add_argument("--cc_match", help="tsv file containing the precomputation of [url, submission month] -> [cc month]", default='lists/cc-match.tsv', type=str) 55 | parser.add_argument("--dryrun", help="Just collect stats about data; don't create any data.", default=False, type=bool) 56 | parser.add_argument("--blind", help="Don't print out responses.", default=False, type=bool) 57 | args = parser.parse_args() 58 | 59 | fields = [ "id", "subreddit", "score", "num_comments", "domain", "title", "url", "permalink" ] 60 | important_tags = ['title', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'p'] 61 | notext_tags = ['script', 'style'] 62 | deleted_str = '[deleted]' 63 | undisclosed_str = '__UNDISCLOSED__' 64 | max_window_size = 2**31 65 | 66 | batch_download_facts = False 67 | robotparsers = {} 68 | tokenizer = TweetTokenizer(preserve_case=False) 69 | if args.cc_match: 70 | cc = CommonCrawl(0) 71 | else: 72 | cc = CommonCrawl(-2) 73 | 74 | def get_subreddit(submission): 75 | return submission["subreddit"] 76 | def get_domain(submission): 77 | return submission["domain"] 78 | def get_url(submission): 79 | return submission["url"] 80 | def get_submission_text(submission): 81 | return submission["title"] 82 | def get_permalink(submission): 83 | return submission["permalink"] 84 | def get_submission_id(submission): 85 | return submission["id"] 86 | def get_comment_id(comment): 87 | return "t1_" + comment["id"] 88 | #return comment["name"] 89 | def get_parent_comment_id(comment): 90 | return comment["parent_id"] 91 | def get_text(comment): 92 | return comment["body"] 93 | def get_user(comment): 94 | return comment["author"] 95 | def get_score(comment): 96 | return comment["score"] 97 | def get_linked_submission_id(comment): 98 | return comment["link_id"].split("_")[1] 99 | 100 | def get_anchor(url): 101 | pos = url.find("#") 102 | if (pos > 0): 103 | label = url[pos+1:] 104 | label = label.strip() 105 | return label 106 | return "" 107 | 108 | def filter_submission(submission): 109 | """Determines whether to filter out this submission (over-18, deleted user, etc.).""" 110 | if submission["num_comments"] < args.mincomments: 111 | return True 112 | if "num_crossposts" in submission and submission["num_crossposts"] > 0: 113 | return True 114 | if "locked" in submission and submission["locked"]: 115 | return True 116 | if "over-18" in submission and submission["over_18"]: 117 | return True 118 | if "brand_safe" in submission and not submission["brand_safe"]: 119 | return True 120 | if submission["distinguished"] != None: 121 | return True 122 | if "subreddit_type" in 
submission: 123 | if submission["subreddit_type"] == "restricted": # filter only public 124 | return True 125 | if submission["subreddit_type"] == "archived": 126 | return True 127 | url = get_url(submission) 128 | domain = get_domain(submission) 129 | if domain.find("reddit.com") >= 0 or domain.find("twitter.com") >= 0 or domain.find("youtube.com") >= 0 or domain.find("youtube.com") >= 0 or domain.find("imgur.com") >= 0 or domain.find("flickr.com") >= 0 or domain.find("ebay.com") >= 0: 130 | return True 131 | if args.anchoronly and len(get_anchor(url)) <= 2: 132 | return True 133 | if url.find(" ") >= 0: 134 | return True 135 | if url.endswith("jpg") or url.endswith("gif") or url.endswith("png") or url.endswith("pdf"): 136 | return True 137 | return False 138 | 139 | def norm_article(t): 140 | """Minimalistic processing with linebreaking.""" 141 | t = re.sub("\s*\n+\s*","\n", t) 142 | t = re.sub(r'(</[pP]>)',r'\1\n', t) 143 | t = re.sub("[ \t]+"," ", t) 144 | t = t.strip() 145 | return t 146 | 147 | def norm_sentence(t): 148 | """Minimalistic processing: remove extra space characters.""" 149 | t = re.sub("[ \n\r\t]+", " ", t) 150 | t = t.strip() 151 | if args.tokenize: 152 | t = " ".join(tokenizer.tokenize(t)) 153 | t = t.replace('[ deleted ]','[deleted]'); 154 | return t 155 | 156 | def add_webpage(submission, year, month, cc_match=None): 157 | """Retrive sentences ('facts') from submission["url"]. """ 158 | if args.use_cc: 159 | return add_cc_webpage(submission, year, month, cc_match) 160 | return add_live_webpage(submission) 161 | 162 | def add_cc_webpage(submission, year, month, cc_match=None): 163 | url = get_url(submission) 164 | kstr = cc.get_key(url, month, year) 165 | if cc_match and kstr not in cc_match: 166 | print("Can't fetch (precomputed): [%s] submission month: [%s-%s]" % (url, year, month)) 167 | sys.stdout.flush() 168 | return None 169 | src, month_id, date = cc.download(url, year, month, False, cc_match=cc_match) 170 | sys.stdout.flush() 171 | if src == None: 172 | src, month_id, date = cc.download(url, year, month, True, cc_match=cc_match) 173 | sys.stdout.flush() 174 | if src == None: 175 | print("Can't fetch: [%s] submission month: [%s-%s]" % (url, year, month)) 176 | sys.stdout.flush() 177 | return None 178 | print("Fetching url: [%s] submission month: [%s-%s] -> [%s] commoncrawl date: [%s]" % (url, year, month, month_id, str(date))) 179 | print("FETCH\t%s\t%s-%s\t%s" % (url, year, month, month_id)) 180 | sys.stdout.flush() 181 | submission["source"] = src 182 | return submission 183 | 184 | def add_live_webpage(submission): 185 | url = get_url(submission) 186 | domain = get_domain(submission) 187 | try: 188 | if args.use_robots_txt: 189 | if args.delay > 0: 190 | time.sleep(args.delay) 191 | if domain in robotparsers.keys(): 192 | rp = robotparsers[domain] 193 | else: 194 | rp = urllib.robotparser.RobotFileParser() 195 | robotparsers[domain] = rp 196 | rurl = "http://" + domain + "/robots.txt" 197 | print("Fetching robots.txt: [%s]" % rurl) 198 | rp.set_url(rurl) 199 | rp.read() 200 | if not rp.can_fetch("*", url): 201 | print("Can't download url due to robots.txt: [%s] domain: [%s]" % (url, domain)) 202 | return None 203 | print("Fetching url: [%s] domain: [%s]" % (url, domain)) 204 | u = urllib.request.urlopen(url) 205 | src = u.read() 206 | submission["source"] = src 207 | return submission 208 | except urllib.error.HTTPError: 209 | return None 210 | except urllib.error.URLError: 211 | return None 212 | except UnicodeEncodeError: 213 | return None 214 | except: 
215 | traceback.print_exc() 216 | return None 217 | 218 | def add_webpages(submissions): 219 | """Use multithreading to retrieve multiple webpages at once.""" 220 | print("Downloading %d pages:" % len(submissions)) 221 | pool = Pool() 222 | submissions = pool.map(add_webpage, submissions) 223 | print("\nDone.") 224 | return [s for s in submissions if s is not None] 225 | 226 | def get_date(file_name): 227 | m = re.search(r'(\d\d\d\d)-(\d\d)', file_name) 228 | year = m.group(1) 229 | month = m.group(2) 230 | return year, month 231 | 232 | def get_submissions(rs_file, subreddit_file, domain_file, cc_match=None): 233 | """Return all submissions from a dump submission file rs_file (RS_*.zst), 234 | restricted to the subreddit+domain listed in filter_file.""" 235 | submissions = [] 236 | subreddit_dic = None 237 | domain_dic = None 238 | year, month = get_date(rs_file) 239 | if subreddit_file != None: 240 | with open(subreddit_file) as f: 241 | subreddit_dic = dict([ (el.strip(), 1) for el in f.readlines() ]) 242 | if domain_file != None: 243 | with open(domain_file) as f: 244 | domain_dic = dict([ (el.strip(), 1) for el in f.readlines() ]) 245 | 246 | with open(rs_file, 'rb') as fh: 247 | dctx = zstd.ZstdDecompressor(max_window_size=max_window_size) 248 | with dctx.stream_reader(fh) as reader: 249 | i = 0 250 | for line in io.TextIOWrapper(io.BufferedReader(reader), encoding='utf-8'): 251 | try: 252 | submission = json.loads(line) 253 | if not filter_submission(submission): 254 | subreddit = get_subreddit(submission) 255 | domain = get_domain(submission) 256 | scheck = subreddit_dic == None or subreddit in subreddit_dic 257 | dcheck = domain_dic == None or domain in domain_dic 258 | if scheck and dcheck: 259 | s = dict([ (f, submission[f]) for f in fields ]) 260 | print("keeping: subreddit=%s\tdomain=%s" % (subreddit, domain)) 261 | if args.dryrun: 262 | continue 263 | if not batch_download_facts: 264 | s = add_webpage(s, year, month, cc_match=cc_match) 265 | submissions.append(s) 266 | sys.stdout.flush() 267 | sys.stderr.flush() 268 | i += 1 269 | if i == args.nsubmissions: 270 | break 271 | else: 272 | print("skipping: subreddit=%s\tdomain=%s (%s %s)" % (subreddit, domain, scheck, dcheck)) 273 | pass 274 | except json.decoder.JSONDecodeError: 275 | pass 276 | except Exception: 277 | traceback.print_exc() 278 | pass 279 | if batch_download_facts: 280 | submissions = add_webpages(submissions) 281 | else: 282 | submissions = [s for s in submissions if s is not None] 283 | return dict([ (get_submission_id(s), s) for s in submissions ]) 284 | 285 | def get_comments(rc_file, submissions): 286 | """Return all conversation triples from rc_file (RC_*.bz2), 287 | restricted to given submissions.""" 288 | comments = {} 289 | with open(rc_file, 'rb') as fh: 290 | dctx = zstd.ZstdDecompressor(max_window_size=max_window_size) 291 | with dctx.stream_reader(fh) as reader: 292 | for line in io.TextIOWrapper(io.BufferedReader(reader), encoding='utf-8'): 293 | try: 294 | comment = json.loads(line) 295 | sid = get_linked_submission_id(comment) 296 | if sid in submissions.keys(): 297 | comments[get_comment_id(comment)] = comment 298 | except Exception: 299 | traceback.print_exc() 300 | pass 301 | return comments 302 | 303 | def load_data(cc_match=None): 304 | """Load data either from a pickle file if it exists, 305 | and otherwise from RC_* RS_* and directly from the web.""" 306 | if not os.path.isfile(args.pickle): 307 | submissions = get_submissions(args.rsinput, args.subreddit_filter, args.domain_filter, 
cc_match=cc_match) 308 | comments = get_comments(args.rcinput, submissions) 309 | with open(args.pickle, 'wb') as f: 310 | pickle.dump([submissions, comments], f, protocol=pickle.HIGHEST_PROTOCOL) 311 | else: 312 | with open(args.pickle, 'rb') as f: 313 | [submissions, comments] = pickle.load(f) 314 | return submissions, comments 315 | 316 | def insert_escaped_tags(tags, label=None): 317 | """For each tag in "tags", insert contextual tags (e.g., <p> </p>) as escaped text 318 | so that these tags are still there when html markup is stripped out.""" 319 | found = False 320 | for tag in tags: 321 | strs = list(tag.strings) 322 | if len(strs) > 0: 323 | if label != None: 324 | l = label 325 | else: 326 | l = tag.name 327 | strs[0].parent.insert(0, NavigableString("<"+l+">")) 328 | strs[-1].parent.append(NavigableString("</"+l+">")) 329 | found = True 330 | return found 331 | 332 | def save_facts(submissions, sids = None): 333 | subs = {} 334 | i = 0 335 | if args.facts == '-': 336 | return submissions 337 | with open(args.facts, 'wt', encoding="utf-8") as f: 338 | for id in sorted(submissions.keys()): 339 | s = submissions[id] 340 | url = get_url(s) 341 | label = get_anchor(url) 342 | print("Processing submission %s...\n\turl: %s\n\tanchor: %s\n\tpermalink: http://reddit.com%s" % (id, url, str(label), get_permalink(s))) 343 | subs[id] = s 344 | if sids == None or id in sids.keys(): 345 | b = BeautifulSoup(s["source"],'html.parser') 346 | # If there is any anchor in the url, locate it in the facts: 347 | if label != "": 348 | if not insert_escaped_tags(b.find_all(True, attrs={"id": label}), 'anchor'): 349 | print("\t(couldn't find anchor on page: %s)" % label) 350 | # Remove tags whose text we don't care about (javascript, etc.): 351 | for el in b(notext_tags): 352 | el.decompose() 353 | # Delete other unimportant tags, but keep the text: 354 | for tag in b.findAll(True): 355 | if tag.name not in important_tags: 356 | tag.append(' ') 357 | tag.replaceWithChildren() 358 | # All tags left are important (e.g., <p>) so add them to the text: 359 | insert_escaped_tags(b.find_all(True)) 360 | # Extract facts from html: 361 | t = b.get_text(" ") 362 | t = norm_article(t) 363 | facts = [] 364 | for sent in filter(None, t.split("\n")): 365 | if len(sent.split(" ")) >= args.min_fact_len: 366 | facts.append(norm_sentence(sent)) 367 | for fact in facts: 368 | out_str = "\t".join([get_subreddit(s), id, get_domain(s), fact]) 369 | hash_str = hashlib.sha224(out_str.encode("utf-8")).hexdigest() 370 | f.write(hash_str + "\t" + out_str + "\n") 371 | s["facts"] = facts 372 | i += 1 373 | if i == args.nsubmissions: 374 | break 375 | return subs 376 | 377 | def get_convo(id, submissions, comments, depth=args.max_depth): 378 | c = comments[id] 379 | pid = get_parent_comment_id(c) 380 | if pid in comments.keys() and depth > 0: 381 | els = get_convo(pid, submissions, comments, depth-1) 382 | else: 383 | s = submissions[get_linked_submission_id(c)] 384 | els = [ "START", norm_sentence(get_submission_text(s)) ] 385 | els.append(norm_sentence(get_text(c))) 386 | return els 387 | 388 | def save_tuple(f, subreddit, sid, pos, user, context, message, response, score, test_hashes): 389 | cwords = re.split("\s+", context) 390 | mwords = re.split("\s+", message) 391 | max_len = max(args.max_context_len, len(mwords)+1) 392 | if len(cwords) > max_len: 393 | ndel = len(cwords) - max_len 394 | del cwords[:ndel] 395 | context = "... 
" + " ".join(cwords) 396 | if len(response) <= args.max_res_len and len(response) >= args.min_res_len and response != deleted_str and user != deleted_str and response.find(">") < 0: 397 | if context.find(deleted_str) < 0: 398 | if score >= args.minscore: 399 | out_str = "\t".join([subreddit, sid, str(score), str(pos), context, response]) 400 | hash_str = hashlib.sha224(out_str.encode("utf-8")).hexdigest() 401 | if test_hashes == None or hash_str in test_hashes.keys(): 402 | if args.blind: 403 | ## Note: there is no point in removing the '--blind' flag in order to peek at the reference responses (gold), 404 | ## as the organizers will rely on different responses to compute BLEU, etc. 405 | out_str = "\t".join([subreddit, sid, str(score), str(pos), context, undisclosed_str]) 406 | f.write(hash_str + "\t" + out_str + "\n") 407 | 408 | def save_tuples(submissions, comments, test_hashes): 409 | has_firstturn = {} 410 | with open(args.convos, 'wt', encoding="utf-8") as f: 411 | for id in sorted(comments.keys()): 412 | comment = comments[id] 413 | user = get_user(comment) 414 | score = get_score(comment) 415 | sid = get_linked_submission_id(comment) 416 | if sid in submissions.keys(): 417 | s = submissions[sid] 418 | convo = get_convo(id, submissions, comments) 419 | pos = len(convo) - 1 420 | context = " EOS ".join(convo[:-1]) 421 | message = convo[-2] 422 | response = convo[-1] 423 | if len(convo) == 3 and not sid in has_firstturn.keys(): 424 | save_tuple(f, get_subreddit(s), sid, pos-1, "", convo[-3], convo[-3], message, 1, test_hashes) 425 | has_firstturn[sid] = 1 426 | save_tuple(f, get_subreddit(s), sid, pos, user, context, message, response, score, test_hashes) 427 | 428 | def read_test_hashes(hash_file): 429 | hashes = {} 430 | with open(hash_file, 'r') as f: 431 | for line in f: 432 | hash_str = line.rstrip() 433 | hashes[hash_str] = 1 434 | return hashes 435 | 436 | def load_cc_match(match_file): 437 | cc_match = {} 438 | with open(match_file, 'r') as f: 439 | for line in f: 440 | els = line.strip().split('\t') 441 | k = cc.get_key(els[0], els[1]) 442 | cc_match[k] = els[2] 443 | return cc_match 444 | 445 | if __name__== "__main__": 446 | test_hashes = None 447 | if args.test != "": 448 | test_hashes = read_test_hashes(args.test) 449 | cc_match = None 450 | if args.cc_match != '': 451 | cc_match = load_cc_match(args.cc_match) 452 | submissions, comments = load_data(cc_match=cc_match) 453 | submissions = save_facts(submissions) 454 | save_tuples(submissions, comments, test_hashes) 455 | -------------------------------------------------------------------------------- /data_extraction/src/ids2refs.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | 4 | refs = {} 5 | for line in sys.stdin: 6 | line = line.rstrip() 7 | els = line.split("\t") 8 | hashstr = els[0] 9 | response = els[6] 10 | refs[hashstr] = response 11 | 12 | with open(sys.argv[1]) as f: 13 | for line in f: 14 | line = line.rstrip() 15 | els = line.split("\t") 16 | sys.stdout.write(els[0]) 17 | for i in range(1,len(els)): 18 | p = els[i].split('|') 19 | score = p[0] 20 | hashstr = p[1] 21 | if hashstr in refs.keys(): 22 | sys.stdout.write("\t" + score + "|" + refs[hashstr]) 23 | else: 24 | print("WARNING: missing ref, automatic eval scores may differ: [%s]" % hashstr, file=sys.stderr) 25 | sys.stdout.write("\n") 26 | -------------------------------------------------------------------------------- /data_extraction/trial/README.md: 
-------------------------------------------------------------------------------- 1 | # Data Extraction for DSTC7: End-to-End Conversation Modeling 2 | 3 | Task 2 uses conversational data extracted from Reddit, along with the text of the link that started these conversations. This page provides scripts to extract the data from a Reddit [dump](http://files.pushshift.io/reddit/comments/), as we are unable to release the data directly ourselves. 4 | 5 | *Note: In the original proposal, we planned to use Twitter data (conversational data) and Foursquare (grounded data), but decided to use Reddit, owing to the volatility of Twitter data, as well as the technical difficulties of aligning Twitter content with data from other sources. Reddit provides an intuitive direct link to external data in the submissions that can be utilized for this task.* 6 | 7 | ## Requirements 8 | 9 | * `Python 3.x`, with modules: 10 | * `nltk` 11 | * `beautifulsoup4` 12 | * `make` 13 | * `wget` 14 | 15 | 16 | ## Create trial data: 17 | 18 | To create the trial data, please run: 19 | 20 | ```src/create_trial_data.sh``` 21 | 22 | This will create two tab-separated (tsv) files `data/trial.convos.txt` and `data/trial.facts.txt`, which respectively contain the conversational data and grounded text ("facts"). This requires about 20 GB of disk space. 23 | 24 | ### Notes: 25 | 26 | * **Web crawling**: The above script downloads grounding information directly from the web, but does respect the servers' `robots.txt` rules. The official version of the data (forthcoming) will extract that data from [Common Crawl](http://commoncrawl.org/), to ensure that all participants use exactly the same data, and to minimize the number of dead links. 27 | * **Data split**: The official data will be divided into train/dev/test, but the trial data isn't. 28 | * **Offensive language**: We restricted the data to subreddits that are generally inoffensive. However, even the most "well behaved" subreddits occasionally contain offensive and explicit language, and the trial version of the data does not attempt to remove it. 29 | 30 | ## Data description: 31 | 32 | Each conversation in this dataset consists of a Reddit `submission` and its following discussion-like `comments`. In this data, we restrict ourselves to submissions that provide a `URL` along with a `title` (see [example Reddit submission](https://www.reddit.com/r/todayilearned/comments/f2ruz/til_a_woman_fell_30000_feet_from_an_airplane_and/), which refers to [this web page](https://en.wikipedia.org/wiki/Vesna_Vulovi%C4%87)). The web page scraped from the URL provides grounding or context to the conversation, and is additional (non-conversational) input that models can condition on to produce responses that are more informative and contentful. 33 | 34 | ### Conversation file: 35 | 36 | Each line of `trial.convos.txt` contains a Reddit response and its preceding conversational context. Long conversational contexts are truncated by keeping the last 100 words. The file contains 5 columns: 37 | 38 | 1. subreddit name 39 | 2. conversation ID 40 | 3. response score 41 | 4. conversational context, usually multiple turns (input of the model) 42 | 5. response (output of the model) 43 | 44 | The conversational context may contain: 45 | * EOS: special symbol indicating a turn transition 46 | * START: special symbol indicating the start of the conversation 47 | 
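For illustration, here is a minimal sketch of one way to read this file (this is not part of the provided scripts; the field names and file path below are just examples matching the five tab-separated columns listed above):

```python
import csv

# Labels for the five tab-separated columns of trial.convos.txt, in the order listed above.
FIELDS = ["subreddit", "conversation_id", "score", "context", "response"]

with open("data/trial.convos.txt", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        convo = dict(zip(FIELDS, row))
        # Turns within the conversational context are separated by the special " EOS " symbol.
        turns = convo["context"].split(" EOS ")
        print(convo["subreddit"], convo["conversation_id"], len(turns), convo["response"][:40])
```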
48 | ### Facts file: 49 | 50 | Each line of `trial.facts.txt` contains a "fact": either a sentence, a paragraph, or another snippet of text relevant to the current conversation. Use conversation IDs to find the facts relevant to each conversation. Note: facts relevant to a given conversation are ordered as they appear on the original web page. The file contains 3 columns: 51 | 52 | 1. subreddit name 53 | 2. conversation ID 54 | 3. fact 55 | 56 | To produce the facts relevant to each conversation, we extracted the text of the page using an html-to-text converter ([BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)), but kept the most important tags intact (`<title>, <h1-6>, <p>, etc`). As web formatting differs substantially from domain to domain and common tags like `<p>` may not be used in some domains, we decided to keep all the text of the original page (however, we do remove javascript and style code). As some of the fact data tends to be noisy, you may want to restrict yourself to facts delimited by these tags. 57 | 58 | 59 | #### Labeled anchors 60 | 61 | A substantial number of URLs contain labeled anchors, for example: 62 | 63 | ```http://en.wikipedia.org/wiki/John_Rhys-Davies#The_Lord_of_the_Rings_trilogy``` 64 | 65 | which here refers to the label `The_Lord_of_the_Rings_trilogy`. This information is preserved in the facts, and indicated with the tags `<anchor>` and `</anchor>`. As many web pages in this dataset are lengthy, anchors are probably useful information, as they indicate what text the model should likely attend to in order to produce a good response. 66 | 67 | ### Data statistics: 68 | 69 | | | Trial data | Train set | Dev set | Test set | 70 | | ---- | ---- | ---- | ---- | ---- | 71 | |# dialogue turns | 649,866 | - | - | - | 72 | |# facts | 4,320,438 | - | - | - | 73 | |# tagged facts (1) | 998,032 | - | - | - | 74 | 75 | (1): facts tagged with html markup (e.g., <title>) and therefore potentially important. 76 | 77 | ### Sample data: 78 | 79 | #### Sample conversation turn (from trial.convos.txt): 80 | 81 | ```todayilearned \t f2ruz \t 145 \t START EOS til a woman fell 30,000 feet from an airplane and survived . \t the page states that a 2009 report found the plane only fell several hundred meters .``` 82 | 83 | Maps to: 84 | 1. subreddit name: `todayilearned` 85 | 2. conversation ID: `f2ruz` 86 | 3. response score: `145` 87 | 4. conversational context: `START EOS til a woman fell 30,000 feet from an airplane and survived .` 88 | 5. response: `the page states that a 2009 report found the plane only fell several hundred meters .` 89 | 90 | #### Sample fact (from trial.facts.txt): 91 | 92 | ```todayilearned \t f2ruz \t <p> four years later , peter hornung-andersen and pavel theiner , two prague-based journalists , claimed that flight 367 had been mistaken for an enemy aircraft and shot down by the czechoslovak air force at an altitude of 800 metres ( 2,600 ft ) . </p>``` 93 | 94 | Maps to: 95 | 1. subreddit name: `todayilearned` 96 | 2. conversation ID: `f2ruz` 97 | 3. fact: `<p> four years later , peter hornung-andersen and pavel theiner , two prague-based journalists , claimed that flight 367 had been mistaken for an enemy aircraft and shot down by the czechoslovak air force at an altitude of 800 metres ( 2,600 ft ) . 
</p>` 98 | -------------------------------------------------------------------------------- /data_extraction/trial/lists/data-trial.txt: -------------------------------------------------------------------------------- 1 | 2011-01 2 | 2011-02 3 | 2011-03 4 | 2011-04 5 | 2011-05 6 | 2011-06 7 | 2011-07 8 | 2011-08 9 | 2011-09 10 | 2011-10 11 | 2011-11 12 | 2011-12 13 | -------------------------------------------------------------------------------- /data_extraction/trial/lists/domains.txt: -------------------------------------------------------------------------------- 1 | abc.net.au 2 | androidpolice.com 3 | arstechnica.com 4 | bbc.co.uk 5 | brisbanetimes.com.au 6 | cbc.ca 7 | cnn.com 8 | en.m.wikipedia.org 9 | en.wikipedia.org 10 | engadget.com 11 | english.aljazeera.net 12 | espn.go.com 13 | eurogamer.net 14 | france24.com 15 | github.com 16 | gizmodo.com 17 | globalpost.com 18 | huffingtonpost.com 19 | mlb.mlb.com 20 | nasa.gov 21 | nba.com 22 | ncbi.nlm.nih.gov 23 | news.bbc.co.uk 24 | news.cnet.com 25 | news.yahoo.com 26 | nfl.com 27 | nhl.com 28 | nzherald.co.nz 29 | pbs.org 30 | phys.org 31 | physorg.com 32 | profootballtalk.nbcsports.com 33 | reuters.com 34 | smh.com.au 35 | snopes.com 36 | soccernet.espn.go.com 37 | space.com 38 | sports.espn.go.com 39 | sports.yahoo.com 40 | stuff.co.nz 41 | techdirt.com 42 | telegraph.co.uk 43 | theage.com.au 44 | thestar.com 45 | theverge.com 46 | uk.eurosport.yahoo.com 47 | uk.reuters.com 48 | wikipedia.org 49 | -------------------------------------------------------------------------------- /data_extraction/trial/lists/subreddits.txt: -------------------------------------------------------------------------------- 1 | android 2 | australia 3 | canada 4 | europe 5 | games 6 | ireland 7 | movies 8 | nba 9 | newzealand 10 | nfl 11 | programming 12 | science 13 | soccer 14 | todayilearned 15 | toronto 16 | unitedkingdom 17 | worldnews 18 | -------------------------------------------------------------------------------- /data_extraction/trial/src/Makefile.trial: -------------------------------------------------------------------------------- 1 | .SECONDARY: 2 | 3 | trial: data/trial.convos.txt data/trial.facts.txt 4 | 5 | TRIAL_CONVOS=data/2011-01.convos.txt data/2011-02.convos.txt data/2011-03.convos.txt data/2011-04.convos.txt data/2011-05.convos.txt data/2011-06.convos.txt data/2011-07.convos.txt data/2011-08.convos.txt data/2011-09.convos.txt data/2011-10.convos.txt data/2011-11.convos.txt data/2011-12.convos.txt 6 | 7 | TRIAL_FACTS=data/2011-01.facts.txt data/2011-02.facts.txt data/2011-03.facts.txt data/2011-04.facts.txt data/2011-05.facts.txt data/2011-06.facts.txt data/2011-07.facts.txt data/2011-08.facts.txt data/2011-09.facts.txt data/2011-10.facts.txt data/2011-11.facts.txt data/2011-12.facts.txt 8 | 9 | data/trial.convos.txt: $(TRIAL_CONVOS) 10 | cat $+ > data/trial.convos.txt 11 | 12 | data/trial.facts.txt: $(TRIAL_FACTS) 13 | cat $+ > data/trial.facts.txt 14 | 15 | data/%.facts.txt data/%.convos.txt: reddit/RC_%.bz2 reddit/RS_%.bz2 16 | python src/create_trial_data.py --rsinput=../reddit/RS_$(*F).bz2 --rcinput=../reddit/RC_$(*F).bz2 --subreddit_filter=lists/subreddits.txt --domain_filter=lists/domains.txt --pickle=data/$(*F).pkl --facts=data/$(*F).facts.txt --convos=data/$(*F).convos.txt > logs/$(*F).log 2> logs/$(*F).err 17 | 18 | reddit/RS_%.bz2: 19 | wget https://files.pushshift.io/reddit/submissions/RS_$(*F).bz2 -O ../reddit/RS_$(*F).bz2 -o logs/RS_$(*F).bz2.log -c 20 | 21 | reddit/RC_%.bz2: 22 | wget 
https://files.pushshift.io/reddit/comments/RC_$(*F).bz2 -O ../reddit/RC_$(*F).bz2 -o logs/RC_$(*F).bz2.log -c 23 | -------------------------------------------------------------------------------- /data_extraction/trial/src/create_trial_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | """ 4 | create_trial_data.py 5 | A script to turn extract: 6 | (1) conversation from Reddit file dumps (originally downloaded from https://files.pushshift.io/reddit/daily/) 7 | (2) grounded data ("facts") extracted from the web, respecting robots.txt 8 | Author: Michel Galley, Microsoft Research NLP Group (dstc7-task2@microsoft.com) 9 | """ 10 | 11 | import sys 12 | import time 13 | import os.path 14 | import re 15 | import argparse 16 | import traceback 17 | import json 18 | import bz2 19 | import pickle 20 | import nltk 21 | import urllib.request 22 | import urllib.robotparser 23 | 24 | from bs4 import BeautifulSoup 25 | from bs4.element import NavigableString 26 | from bs4.element import CData 27 | from multiprocessing import Pool 28 | from nltk.tokenize import TweetTokenizer 29 | 30 | parser = argparse.ArgumentParser() 31 | parser.add_argument("--rsinput", help="Submission (RS) file to load.") 32 | parser.add_argument("--rcinput", help="Comments (RC) file to load.") 33 | parser.add_argument("--facts", help="Facts file to create.") 34 | parser.add_argument("--convos", help="Convo file to create.") 35 | parser.add_argument("--pickle", help="Pickle that contains conversations and facts.", default="data.pkl") 36 | parser.add_argument("--subreddit_filter", help="List of subreddits (inoffensive, safe for work, etc.)") 37 | parser.add_argument("--domain_filter", help="Filter on subreddits and domains.") 38 | parser.add_argument("--nsubmissions", help="Number of submissions to process (< 0 means all)", default=-1, type=int) 39 | parser.add_argument("--min_fact_len", help="Minimum number of tokens in each fact (reduce noise in html).", default=0, type=int) 40 | parser.add_argument("--max_res_len", help="Max number of characters in response.", default=280, type=int) 41 | parser.add_argument("--max_context_len", help="Max number of words in context.", default=100, type=int) 42 | parser.add_argument("--max_depth", help="Maximum length of conversation.", default=5, type=int) 43 | parser.add_argument("--mincomments", help="Minimum number of comments per submission.", default=10, type=int) 44 | parser.add_argument("--delay", help="Seconds of delay when crawling web pages", default=1, type=int) 45 | parser.add_argument("--tokenize", help="Whether to tokenize facts and conversations.", default=True, type=bool) 46 | parser.add_argument("--dryrun", help="Just collect stats about data; don't create any data.", default=False, type=bool) 47 | args = parser.parse_args() 48 | 49 | fields = [ "id", "subreddit", "score", "num_comments", "domain", "title", "url", "permalink" ] 50 | important_tags = ['title', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'p'] 51 | notext_tags = ['script', 'style'] 52 | deleted_str = '[deleted]' 53 | 54 | batch_download_facts = False 55 | robotparsers = {} 56 | tokenizer = TweetTokenizer(preserve_case=False) 57 | 58 | def get_subreddit(submission): 59 | return submission["subreddit"] 60 | def get_domain(submission): 61 | return submission["domain"] 62 | def get_url(submission): 63 | return submission["url"] 64 | def get_submission_text(submission): 65 | return submission["title"] 66 | def 
get_permalink(submission): 67 | return submission["permalink"] 68 | def get_submission_id(submission): 69 | return submission["id"] 70 | def get_comment_id(comment): 71 | return comment["name"] 72 | def get_parent_comment_id(comment): 73 | return comment["parent_id"] 74 | def get_text(comment): 75 | return comment["body"] 76 | def get_user(comment): 77 | return comment["author"] 78 | def get_score(comment): 79 | return comment["score"] 80 | def get_linked_submission_id(comment): 81 | return comment["link_id"].split("_")[1] 82 | 83 | def filter_submission(submission): 84 | """Determines whether to filter out this submission (over-18, deleted user, etc.).""" 85 | if submission["num_comments"] < args.mincomments: 86 | return True 87 | if "num_crossposts" in submission and submission["num_crossposts"] > 0: 88 | return True 89 | if "locked" in submission and submission["locked"]: 90 | return True 91 | if "over-18" in submission and submission["over_18"]: 92 | return True 93 | if "brand_safe" in submission and not submission["brand_safe"]: 94 | return True 95 | if submission["distinguished"] != None: 96 | return True 97 | if "subreddit_type" in submission: 98 | if submission["subreddit_type"] == "restricted": # filter only public 99 | return True 100 | if submission["subreddit_type"] == "archived": 101 | return True 102 | url = get_url(submission) 103 | if url.find("reddit.com") >= 0 or url.find("twitter.com") >= 0: 104 | return True 105 | if url.find(" ") >= 0: 106 | return True 107 | if url.endswith("jpg") or url.endswith("gif") or url.endswith("png") or url.endswith("pdf"): 108 | return True 109 | return False 110 | 111 | def norm_article(t): 112 | """Minimalistic processing with linebreaking.""" 113 | t = re.sub("\s*\n+\s*","\n", t) 114 | t = re.sub(r'(</[pP]>)',r'\1\n', t) 115 | t = re.sub("[ \t]+"," ", t) 116 | t = t.strip() 117 | return t 118 | 119 | def norm_sentence(t): 120 | """Minimalistic processing: remove extra space characters.""" 121 | t = re.sub("[ \n\r\t]+", " ", t) 122 | t = t.strip() 123 | if args.tokenize: 124 | t = " ".join(tokenizer.tokenize(t)) 125 | t = t.replace('[ deleted ]','[deleted]'); 126 | return t 127 | 128 | def add_webpage(submission): 129 | """Retrive sentences ('facts') from submission["url"]. 130 | Note: For the final version, of the training/dev/test data, we will rely on 131 | the Common Crawl so that we can ensure the data is the same for each participant. 
132 | This will also speed up data creation.""" 133 | sys.stderr.flush() 134 | url = get_url(submission) 135 | domain = get_domain(submission) 136 | try: 137 | if args.delay > 0: 138 | time.sleep(args.delay) 139 | if domain in robotparsers.keys(): 140 | rp = robotparsers[domain] 141 | else: 142 | rp = urllib.robotparser.RobotFileParser() 143 | robotparsers[domain] = rp 144 | rurl = "http://" + domain + "/robots.txt" 145 | print("Fetching robots.txt: [%s]" % rurl, file=sys.stderr) 146 | rp.set_url(rurl) 147 | rp.read() 148 | if not rp.can_fetch("*", url): 149 | print("Can't download url due to robots.txt: [%s]" % url, file=sys.stderr) 150 | return None 151 | print("Fetching url: [%s]" % url, file=sys.stderr) 152 | u = urllib.request.urlopen(url) 153 | src = u.read() 154 | submission["source"] = src 155 | return submission 156 | except urllib.error.HTTPError: 157 | return None 158 | except urllib.error.URLError: 159 | return None 160 | except UnicodeEncodeError: 161 | return None 162 | except: 163 | traceback.print_exc() 164 | return None 165 | 166 | def add_webpages(submissions): 167 | """Use multithreading to retrieve multiple webpages at once.""" 168 | print("Downloading %d pages:" % len(submissions), file=sys.stderr) 169 | sys.stderr.flush() 170 | pool = Pool() 171 | submissions = pool.map(add_webpage, submissions) 172 | print("\nDone.", file=sys.stderr) 173 | sys.stderr.flush() 174 | return [s for s in submissions if s is not None] 175 | 176 | def get_submissions(rs_file, subreddit_file, domain_file): 177 | """Return all submissions from a dump submission file rs_file (RS_*.bz2), 178 | restricted to the subreddit+domain listed in filter_file.""" 179 | submissions = [] 180 | subreddit_dic = None 181 | domain_dic = None 182 | if subreddit_file != None: 183 | with open(subreddit_file) as f: 184 | subreddit_dic = dict([ (el.strip(), 1) for el in f.readlines() ]) 185 | if domain_file != None: 186 | with open(domain_file) as f: 187 | domain_dic = dict([ (el.strip(), 1) for el in f.readlines() ]) 188 | with bz2.open(rs_file, 'rt', encoding="utf-8") as f: 189 | i = 0 190 | for line in f: 191 | try: 192 | submission = json.loads(line) 193 | if not filter_submission(submission): 194 | subreddit = get_subreddit(submission) 195 | domain = get_domain(submission) 196 | scheck = subreddit_dic == None or subreddit in subreddit_dic 197 | dcheck = domain_dic == None or domain in domain_dic 198 | if scheck and dcheck: 199 | s = dict([ (f, submission[f]) for f in fields ]) 200 | print("keeping: subreddit=%s\tdomain=%s" % (subreddit, domain)) 201 | if args.dryrun: 202 | continue 203 | if not batch_download_facts: 204 | s = add_webpage(s) 205 | submissions.append(s) 206 | i += 1 207 | if i == args.nsubmissions: 208 | break 209 | else: 210 | print("skipping: subreddit=%s\tdomain=%s (%s %s)" % (subreddit, domain, scheck, dcheck)) 211 | pass 212 | 213 | except Exception: 214 | traceback.print_exc() 215 | pass 216 | if batch_download_facts: 217 | submissions = add_webpages(submissions) 218 | else: 219 | submissions = [s for s in submissions if s is not None] 220 | return dict([ (get_submission_id(s), s) for s in submissions ]) 221 | 222 | def get_comments(rc_file, submissions): 223 | """Return all conversation triples from rc_file (RC_*.bz2), 224 | restricted to given submissions.""" 225 | comments = {} 226 | with bz2.open(rc_file, 'rt', encoding="utf-8") as f: 227 | for line in f: 228 | try: 229 | comment = json.loads(line) 230 | sid = get_linked_submission_id(comment) 231 | if sid in submissions.keys(): 232 | 
comments[get_comment_id(comment)] = comment 233 | except Exception: 234 | traceback.print_exc() 235 | pass 236 | return comments 237 | 238 | def load_data(): 239 | """Load data either from a pickle file if it exists, 240 | and otherwise from RC_* RS_* and directly from the web.""" 241 | if not os.path.isfile(args.pickle): 242 | submissions = get_submissions(args.rsinput, args.subreddit_filter, args.domain_filter) 243 | comments = get_comments(args.rcinput, submissions) 244 | with open(args.pickle, 'wb') as f: 245 | pickle.dump([submissions, comments], f, protocol=pickle.HIGHEST_PROTOCOL) 246 | else: 247 | with open(args.pickle, 'rb') as f: 248 | [submissions, comments] = pickle.load(f) 249 | return submissions, comments 250 | 251 | def insert_escaped_tags(tags, label=None): 252 | """For each tag in "tags", insert contextual tags (e.g., <p> </p>) as escaped text 253 | so that these tags are still there when html markup is stripped out.""" 254 | found = False 255 | for tag in tags: 256 | strs = list(tag.strings) 257 | if len(strs) > 0: 258 | if label != None: 259 | l = label 260 | else: 261 | l = tag.name 262 | strs[0].parent.insert(0, NavigableString("<"+l+">")) 263 | strs[-1].parent.append(NavigableString("</"+l+">")) 264 | found = True 265 | return found 266 | 267 | def save_facts(submissions, sids = None): 268 | subs = {} 269 | i = 0 270 | with open(args.facts, 'wt', encoding="utf-8") as f: 271 | for id in sorted(submissions.keys()): 272 | s = submissions[id] 273 | url = get_url(s) 274 | label = "" 275 | pos = url.find("#") 276 | if (pos > 0): 277 | label = url[pos+1:] 278 | label = label.strip() 279 | print("Processing submission %s...\n\turl: %s\n\tanchor: %s\n\tpermalink: http://reddit.com%s" % (id, url, str(label), get_permalink(s)), file=sys.stderr) 280 | sys.stdout.flush() 281 | subs[id] = s 282 | if sids == None or id in sids.keys(): 283 | b = BeautifulSoup(s["source"],'html.parser') 284 | # If there is any anchor in the url, locate it in the facts: 285 | if label != "": 286 | if not insert_escaped_tags(b.find_all(True, attrs={"id": label}), 'anchor'): 287 | print("\t(couldn't find anchor on page: %s)" % label, file=sys.stderr) 288 | # Remove tags whose text we don't care about (javascript, etc.): 289 | for el in b(notext_tags): 290 | el.decompose() 291 | # Delete other unimportant tags, but keep the text: 292 | for tag in b.findAll(True): 293 | if tag.name not in important_tags: 294 | tag.append(' ') 295 | tag.replaceWithChildren() 296 | # All tags left are important (e.g., <p>) so add them to the text: 297 | insert_escaped_tags(b.find_all(True)) 298 | # Extract facts from html: 299 | t = b.get_text(" ") 300 | t = norm_article(t) 301 | facts = [] 302 | for sent in filter(None, t.split("\n")): 303 | if len(sent.split(" ")) >= args.min_fact_len: 304 | facts.append(norm_sentence(sent)) 305 | for fact in facts: 306 | f.write("\t".join([get_subreddit(s), id, fact]) + "\n") 307 | s["facts"] = facts 308 | i += 1 309 | if i == args.nsubmissions: 310 | break 311 | return subs 312 | 313 | def get_convo(id, submissions, comments, depth=args.max_depth): 314 | c = comments[id] 315 | pid = get_parent_comment_id(c) 316 | if pid in comments.keys() and depth > 0: 317 | els = get_convo(pid, submissions, comments, depth-1) 318 | else: 319 | s = submissions[get_linked_submission_id(c)] 320 | els = [ "START", norm_sentence(get_submission_text(s)) ] 321 | els.append(norm_sentence(get_text(c))) 322 | return els 323 | 324 | def save_triples(submissions, comments): 325 | sids = {} 326 | with 
open(args.convos, 'wt', encoding="utf-8") as f: 327 | for id in sorted(comments.keys()): 328 | comment = comments[id] 329 | sid = get_linked_submission_id(comment) 330 | if sid in submissions.keys(): 331 | convo = get_convo(id, submissions, comments) 332 | context = " EOS ".join(convo[:-1]) 333 | response = convo[-1] 334 | cwords = re.split("\s+", context) 335 | if len(cwords) > args.max_context_len: 336 | ndel = len(cwords) - args.max_context_len 337 | del cwords[:ndel] 338 | context = "... " + " ".join(cwords) 339 | s = submissions[sid] 340 | if len(response) <= args.max_res_len and response != deleted_str and get_user(comment) != deleted_str and response.find(">") < 0: 341 | if context.find(deleted_str) < 0: 342 | f.write("\t".join([get_subreddit(s), sid, str(get_score(comment)), context, response]) + "\n") 343 | sids[sid] = 1 344 | return sids 345 | 346 | if __name__== "__main__": 347 | submissions, comments = load_data() 348 | submissions = save_facts(submissions) 349 | save_triples(submissions, comments) 350 | -------------------------------------------------------------------------------- /data_extraction/trial/src/create_trial_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Creates trial data for DSTC7 Task 2 4 | 5 | mkdir reddit 6 | mkdir data 7 | mkdir logs 8 | 9 | TRIAL=`cat lists/data-trial.txt` 10 | 11 | for id in $TRIAL; do 12 | echo Downloading $id 13 | wget https://files.pushshift.io/reddit/submissions/RS_$id.bz2 -O ../reddit/RS_$id.bz2 -o logs/RS_$id.bz2.log -c 14 | wget https://files.pushshift.io/reddit/comments/RC_$id.bz2 -O ../reddit/RC_$id.bz2 -o logs/RC_$id.bz2.log -c 15 | python src/create_trial_data.py --rsinput=../reddit/RS_$id.bz2 --rcinput=../reddit/RC_$id.bz2 --subreddit_filter=lists/subreddits.txt --domain_filter=lists/domains.txt --pickle=data/$id.pkl --facts=data/$id.facts.txt --convos=data/$id.convos.txt > logs/$id.log 2> logs/$id.err 16 | done 17 | 18 | eval "cat data/{`echo $TRIAL | sed 's/ /,/g'`}.convos.txt" > data/trial.convos.txt 19 | eval "cat data/{`echo $TRIAL | sed 's/ /,/g'`}.facts.txt" > data/trial.facts.txt 20 | -------------------------------------------------------------------------------- /data_extraction/trial/src/requirements.txt: -------------------------------------------------------------------------------- 1 | nltk==3.4.1 2 | beautifulsoup4==4.6.0 3 | -------------------------------------------------------------------------------- /doc/DSTC7_task2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/doc/DSTC7_task2.pdf -------------------------------------------------------------------------------- /doc/proposal.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/doc/proposal.pdf -------------------------------------------------------------------------------- /evaluation/LICENSE.txt: -------------------------------------------------------------------------------- 1 | # Open Use of Data Agreement v1.0 2 | 3 | This is the Open Use of Data Agreement, Version 1.0 (the "O-UDA"). Capitalized terms are defined in Section 5. Data Provider and you agree as follows: 4 | 5 | 1. **Provision of the Data** 6 | 7 | 1.1. 
You may use, modify, and distribute the Data made available to you by the Data Provider under this O-UDA if you follow the O-UDA's terms. 8 | 9 | 1.2. Data Provider will not sue you or any Downstream Recipient for any claim arising out of the use, modification, or distribution of the Data provided you meet the terms of the O-UDA. 10 | 11 | 1.3 This O-UDA does not restrict your use, modification, or distribution of any portions of the Data that are in the public domain or that may be used, modified, or distributed under any other legal exception or limitation. 12 | 13 | 2. **No Restrictions on Use or Results** 14 | 15 | 2.1. The O-UDA does not impose any restriction with respect to: 16 | 17 | 2.1.1. the use or modification of Data; or 18 | 19 | 2.1.2. the use, modification, or distribution of Results. 20 | 21 | 3. **Redistribution of Data** 22 | 23 | 3.1. You may redistribute the Data under terms of your choice, so long as: 24 | 25 | 3.1.1. You include with any Data you redistribute all credit or attribution information that you received with the Data, and your terms require any Downstream Recipient to do the same; and 26 | 27 | 3.1.2. Your terms include a warranty disclaimer and limitation of liability for Upstream Data Providers at least as broad as those contained in Section 4.2 and 4.3 of the O-UDA. 28 | 29 | 4. **No Warranty, Limitation of Liability** 30 | 31 | 4.1. Data Provider does not represent or warrant that it has any rights whatsoever in the Data. 32 | 33 | 4.2. THE DATA IS PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 34 | 35 | 4.3. NEITHER DATA PROVIDER NOR ANY UPSTREAM DATA PROVIDER SHALL HAVE ANY LIABILITY FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING WITHOUT LIMITATION LOST PROFITS), HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE DATA OR RESULTS, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 36 | 37 | 5. **Definitions** 38 | 39 | 5.1. "Data" means the material you receive under the O-UDA in modified or unmodified form, but not including Results. 40 | 41 | 5.2. "Data Provider" means the source from which you receive the Data and with whom you enter into the O-UDA. 42 | 43 | 5.3. "Downstream Recipient" means any person or persons who receives the Data directly or indirectly from you in accordance with the O-UDA. 44 | 45 | 5.4. "Result" means anything that you develop or improve from your use of Data that does not include more than a de minimis portion of the Data on which the use is based. Results may include de minimis portions of the Data necessary to report on or explain use that has been conducted with the Data, such as figures in scientific papers, but do not include more. Artificial intelligence models trained on Data (and which do not include more than a de minimis portion of Data) are Results. 46 | 47 | 5.5. "Upstream Data Providers" means the source or sources from which the Data Provider directly or indirectly received, under the terms of the O-UDA, material that is included in the Data. 
48 | 49 | -------------------------------------------------------------------------------- /evaluation/README.md: -------------------------------------------------------------------------------- 1 | # Evaluation 2 | 3 | * Full evaluation results and data: [tgz](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/blob/master/evaluation/dstc7-task2-human_eval.tgz) 4 | 5 | ## Important Dates 6 | * 9/10/2018: Task organizers release test data 7 | * 10/1/2018 (11:59pm PST): Deadline for sending system outputs to task organizers (email to <dstc7-task2@microsoft.com>) 8 | * 10/8/2018: Task organizers release automatic evaluation results (BLEU, METEOR, etc.) 9 | * 10/16/2018: Task organizers release human evaluation results 10 | * 11/5/2018: System descriptions due to [DSTC7](http://workshop.colips.org/dstc7/) organizers (coming soon) 11 | 12 | ## Create test data (estimated time: 1 week maximum) 13 | 14 | Please refer to the [data extraction page](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction) to create data. The scripts on that page have just been updated to create validation and test data, so please download the latest version of the code (e.g., with `git pull`). Then, you just need to run the following command: 15 | 16 | ```make -j4 valid test``` 17 | 18 | This will create the following four files: 19 | 20 | * Validation data: ``valid.convos.txt`` and ``valid.facts.txt`` 21 | * Test data: ``test.convos.txt`` and ``test.facts.txt`` 22 | 23 | These files are in exactly the same format as ``train.convos.txt`` and ``train.facts.txt`` already explained [here](https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction). The only difference is that the ``response`` field of test.convos.txt has been replaced with the string ``__UNDISCLOSED__``. 24 | 25 | Notes: 26 | * The two validation files are optional and you can skip them if you want (e.g., no need to send us system outputs for them). We provide them so that you can run your own automatic evaluation (BLEU, etc.) by comparing the ``response`` field with your own system outputs. 27 | * Obviously, you may not use the validation or test data for training any of your models or systems. 28 | * Data creation should take about 1-4 days (depending on your internet connection, etc.). If you run into trouble creating the data or data extraction isn't complete by September 17, please contact us. 29 | 30 | ### Data statistics 31 | 32 | Number of conversational responses: 33 | * Validation (valid.convos.txt): 4542 lines 34 | * Test (test.convos.txt): 13440 lines 35 | 36 | Due to the way the data is created by querying Common Crawl, there may be small differences between your version of the data and our own. To make pairwise comparisons between the systems of each pair of participants, we will rely on the largest subset of the test set that is common to both participants. **However, if your file test.convos.txt contains fewer than 13,000 lines, this might indicate a problem, so please contact us immediately**. 37 | 38 | ## Create system outputs (estimated time: 2 weeks maximum) 39 | 40 | **By October 8th**, please send us a modification of ``test.convos.txt`` where ``__UNDISCLOSED__`` has been replaced by your own system output, i.e., a response generated from the query, subreddit, and conversation ID specified on the same line. On October 4th, we sent an email to registered participants with a URL to the system output submission site. If you registered but didn't receive such an email, please contact us at <dstc7-task2@microsoft.com>.
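For concreteness, a minimal sketch of this step in Python is shown below. The column layout (7 tab-separated fields with the response last, and the context assumed to be the next-to-last field) follows the format notes below, and `generate_response()` is a hypothetical placeholder for your own system:

```python
# Sketch only: fill in the last column of test.convos.txt with model outputs.
# Assumption: the response (__UNDISCLOSED__) is the last tab-separated field,
# and the conversational context is the next-to-last field.

def generate_response(context):
    # hypothetical stand-in for a real response generation system
    return "i do n't know ."

with open("test.convos.txt", encoding="utf-8") as fin, \
     open("system1.convos.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        cols = line.rstrip("\n").split("\t")
        cols[-1] = generate_response(cols[-2])  # replace __UNDISCLOSED__ only
        fout.write("\t".join(cols) + "\n")
```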
41 | 42 | In order for us to process these files automatically, please ensure the following about file format: 43 | * **Other than replacing ``__UNDISCLOSED__`` with your own output, the rest of the file should not be altered in any way.** That is, it needs to have exactly 7 columns on each line, and columns 1-6 should be exactly the same as the test.convos.txt file created by our scripts. These other columns are also important as they will help us sort out differences between the test sets of the different participants. 44 | * You may submit multiple systems, in which case we ask that you: (1) Please give a different name to each file (e.g., system1.convos.txt, system2.convos.txt, ...). (2) Please identify one of your systems as primary, as we may only be able to use one of them for human evaluation. 45 | * Please do not send us any other files (e.g., we do not need validation sets or facts files). 46 | 47 | *Before submitting, **please check the format of your files** and make sure they are as described above, as we will process submissions automatically for metric evaluation (and semi-automatically for human evaluation). Formatting your data correctly will minimize problems in the evaluation pipeline and the risk of getting lower metric scores. If you are not sure about the format, just ask.* 48 | 49 | If you have any questions, again please email us at <dstc7-task2@microsoft.com>. 50 | -------------------------------------------------------------------------------- /evaluation/dstc7-task2-human_eval.tgz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/evaluation/dstc7-task2-human_eval.tgz -------------------------------------------------------------------------------- /evaluation/old/dstc7-task2-individual_judgments.tgz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/evaluation/old/dstc7-task2-individual_judgments.tgz -------------------------------------------------------------------------------- /evaluation/old/dstc7-task2-individual_judgments.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/evaluation/old/dstc7-task2-individual_judgments.xlsx -------------------------------------------------------------------------------- /evaluation/old/dstc7-task2-scores.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/evaluation/old/dstc7-task2-scores.xlsx -------------------------------------------------------------------------------- /evaluation/src/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | temp 3 | __pycache__ 4 | -------------------------------------------------------------------------------- /evaluation/src/3rdparty/.create: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/e2b50e535e10fc07c55946c20571ced972c6c059/evaluation/src/3rdparty/.create -------------------------------------------------------------------------------- /evaluation/src/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Automatic evaluation script for DSTC7 Task 2 3 | 4 | Steps: 5 | 1) Make sure you 'git pull' the latest changes (from October 15, 2018), including changes in ../../data_extraction. 6 | 2) cd to `../../data_extraction` and type make. This will create the multi-reference file used by the metrics (`../../data_extraction/test.refs`). 7 | 3) Install 3rd party software as instructed below (METEOR and mteval-v14c.pl). 8 | 4) Run the following command, where `[SUBMISSION]` is the submission file you want to evaluate (same format as the one you submitted on Oct 8): 9 | ``` 10 | python dstc.py -c [SUBMISSION] --refs ../../data_extraction/test.refs 11 | ``` 12 | 13 | Important: the results printed by dstc.py might differ slightly from the official results, if part of your test set failed to download. 14 | 15 | 16 | 17 | # What does it do? 18 | (Based on this [repo](https://github.com/golsun/NLP-tools) by [Sean Xiang Gao](https://www.linkedin.com/in/gxiang1228/)) 19 | 20 | * **evaluation**: calculate automated NLP metrics (BLEU, NIST, METEOR, entropy, etc.) 21 | ```python 22 | from metrics import nlp_metrics 23 | nist, bleu, meteor, entropy, diversity, avg_len = nlp_metrics( 24 | path_refs=["demo/ref0.txt", "demo/ref1.txt"], 25 | path_hyp="demo/hyp.txt") 26 | 27 | # nist = [1.8338, 2.0838, 2.1949, 2.1949] 28 | # bleu = [0.4667, 0.441, 0.4017, 0.3224] 29 | # meteor = 0.2832 30 | # entropy = [2.5232, 2.4849, 2.1972, 1.7918] 31 | # diversity = [0.8667, 1.000] 32 | # avg_len = 5.0000 33 | ``` 34 | * **tokenization**: clean strings and deal with punctuation, contractions, URLs, mentions, tags, etc. 35 | ```python 36 | from tokenizers import clean_str 37 | s = " I don't know:). how about this?https://github.com" 38 | clean_str(s) 39 | 40 | # i do n't know :) . how about this ? __url__ 41 | ``` 42 | 43 | # Requirements 44 | * Works fine for both Python 2.7 and 3.6 45 | * Please **download** the following 3rd-party packages and save them in a new folder `3rdparty`: 46 | * [**mteval-v14c.pl**](https://goo.gl/YUFajQ) to compute [NIST](http://www.mt-archive.info/HLT-2002-Doddington.pdf). You may need to install the following [perl](https://www.perl.org/get.html) modules (e.g. by `cpan install`): XML::Twig, Sort::Naturally and String::Util. 47 | * [**meteor-1.5**](http://www.cs.cmu.edu/~alavie/METEOR/download/meteor-1.5.tar.gz) to compute [METEOR](http://www.cs.cmu.edu/~alavie/METEOR/index.html). It requires [Java](https://www.java.com/en/download/help/download_options.xml). 48 | 49 | -------------------------------------------------------------------------------- /evaluation/src/automatic-evaluation.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Automatic evaluation script: 4 | 5 | # Name of any of the files submitted to DSTC7 Task2 6 | # (or any more recent file) 7 | SUBMISSION=systems/constant-baseline.txt 8 | 9 | # Make sure this file exists: 10 | REFS=../../data_extraction/test.refs 11 | 12 | if [ ! -f $REFS ]; then 13 | echo "Reference file not found. Please move to ../../data_extraction and type make."
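    # The reference file test.refs is produced by the data extraction pipeline;
    # per the README above, running the following should create it:
    #   cd ../../data_extraction && make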
14 | else 15 | python dstc.py -c $SUBMISSION --refs $REFS 16 | fi 17 | 18 | -------------------------------------------------------------------------------- /evaluation/src/demo.py: -------------------------------------------------------------------------------- 1 | from metrics import * 2 | from tokenizers import * 3 | 4 | # evaluation 5 | 6 | 7 | nist, bleu, meteor, entropy, diversity, avg_len = nlp_metrics( 8 | path_refs=['demo/ref0.txt', 'demo/ref1.txt'], 9 | path_hyp='demo/hyp.txt') 10 | 11 | print(nist) 12 | print(bleu) 13 | print(meteor) 14 | print(entropy) 15 | print(diversity) 16 | print(avg_len) 17 | 18 | # tokenization 19 | 20 | s = " I don't know:). how about this?https://github.com/golsun/deep-RL-time-series" 21 | print(clean_str(s)) 22 | -------------------------------------------------------------------------------- /evaluation/src/demo/hyp.txt: -------------------------------------------------------------------------------- 1 | i do n't know . 2 | he is a rocket scientist . 3 | i love it ! -------------------------------------------------------------------------------- /evaluation/src/demo/ref0.txt: -------------------------------------------------------------------------------- 1 | ok that 's fine 2 | he is a trader 3 | i love it ! -------------------------------------------------------------------------------- /evaluation/src/demo/ref1.txt: -------------------------------------------------------------------------------- 1 | well it 's ok 2 | he is an engineer 3 | i 'm not a fan -------------------------------------------------------------------------------- /evaluation/src/dstc.py: -------------------------------------------------------------------------------- 1 | # author: Xiang Gao @ Microsoft Research, Oct 2018 2 | # evaluate DSTC-task2 submissions. 
https://github.com/DSTC-MSR-NLP/DSTC7-End-to-End-Conversation-Modeling 3 | 4 | from util import * 5 | from metrics import * 6 | from tokenizers import * 7 | 8 | def extract_cells(path_in, path_hash): 9 | keys = [line.strip('\n') for line in open(path_hash)] 10 | cells = dict() 11 | for line in open(path_in, encoding='utf-8'): 12 | c = line.strip('\n').split('\t') 13 | k = c[0] 14 | if k in keys: 15 | cells[k] = c[1:] 16 | return cells 17 | 18 | 19 | def extract_hyp_refs(raw_hyp, raw_ref, path_hash, fld_out, n_refs=6, clean=False, vshuman=-1): 20 | cells_hyp = extract_cells(raw_hyp, path_hash) 21 | cells_ref = extract_cells(raw_ref, path_hash) 22 | if not os.path.exists(fld_out): 23 | os.makedirs(fld_out) 24 | 25 | def _clean(s): 26 | if clean: 27 | return clean_str(s) 28 | else: 29 | return s 30 | 31 | keys = sorted(cells_hyp.keys()) 32 | with open(fld_out + '/hash.txt', 'w', encoding='utf-8') as f: 33 | f.write(unicode('\n'.join(keys))) 34 | 35 | lines = [_clean(cells_hyp[k][-1]) for k in keys] 36 | path_hyp = fld_out + '/hyp.txt' 37 | with open(path_hyp, 'w', encoding='utf-8') as f: 38 | f.write(unicode('\n'.join(lines))) 39 | 40 | lines = [] 41 | for _ in range(n_refs): 42 | lines.append([]) 43 | for k in keys: 44 | refs = cells_ref[k] 45 | for i in range(n_refs): 46 | idx = i % len(refs) 47 | if idx == vshuman: 48 | idx = (idx + 1) % len(refs) 49 | lines[i].append(_clean(refs[idx].split('|')[1])) 50 | 51 | path_refs = [] 52 | for i in range(n_refs): 53 | path_ref = fld_out + '/ref%i.txt'%i 54 | with open(path_ref, 'w', encoding='utf-8') as f: 55 | f.write(unicode('\n'.join(lines[i]))) 56 | path_refs.append(path_ref) 57 | 58 | return path_hyp, path_refs 59 | 60 | 61 | def eval_one_system(submitted, keys, multi_ref, n_refs=6, n_lines=None, clean=False, vshuman=-1, PRINT=True): 62 | 63 | print('evaluating %s' % submitted) 64 | 65 | fld_out = submitted.replace('.txt','') 66 | if clean: 67 | fld_out += '_cleaned' 68 | path_hyp, path_refs = extract_hyp_refs(submitted, multi_ref, keys, fld_out, n_refs, clean=clean, vshuman=vshuman) 69 | nist, bleu, meteor, entropy, div, avg_len = nlp_metrics(path_refs, path_hyp, fld_out, n_lines=n_lines) 70 | 71 | if n_lines is None: 72 | n_lines = len(open(path_hyp, encoding='utf-8').readlines()) 73 | 74 | if PRINT: 75 | print('n_lines = '+str(n_lines)) 76 | print('NIST = '+str(nist)) 77 | print('BLEU = '+str(bleu)) 78 | print('METEOR = '+str(meteor)) 79 | print('entropy = '+str(entropy)) 80 | print('diversity = ' + str(div)) 81 | print('avg_len = '+str(avg_len)) 82 | 83 | return [n_lines] + nist + bleu + [meteor] + entropy + div + [avg_len] 84 | 85 | 86 | def eval_all_systems(files, path_report, keys, multi_ref, n_refs=6, n_lines=None, clean=False, vshuman=False): 87 | # evaluate all systems (*.txt) in each folder `files` 88 | 89 | with open(path_report, 'w') as f: 90 | f.write('\t'.join( 91 | ['fname', 'n_lines'] + \ 92 | ['nist%i'%i for i in range(1, 4+1)] + \ 93 | ['bleu%i'%i for i in range(1, 4+1)] + \ 94 | ['meteor'] + \ 95 | ['entropy%i'%i for i in range(1, 4+1)] +\ 96 | ['div1','div2','avg_len'] 97 | ) + '\n') 98 | 99 | for fl in files: 100 | if fl.endswith('.txt'): 101 | submitted = fl 102 | results = eval_one_system(submitted, keys=keys, multi_ref=multi_ref, n_refs=n_refs, clean=clean, n_lines=n_lines, vshuman=vshuman, PRINT=False) 103 | with open(path_report, 'a') as f: 104 | f.write('\t'.join(map(str, [submitted] + results)) + '\n') 105 | else: 106 | for fname in os.listdir(fl): 107 | if fname.endswith('.txt'): 108 | submitted = fl + '/' + fname 
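                    # evaluate this submission file and append its metrics as one row of the report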
109 | results = eval_one_system(submitted, keys=keys, multi_ref=multi_ref, n_refs=n_refs, clean=clean, n_lines=n_lines, vshuman=vshuman, PRINT=False) 110 | with open(path_report, 'a') as f: 111 | f.write('\t'.join(map(str, [submitted] + results)) + '\n') 112 | 113 | print('report saved to: '+path_report, file=sys.stderr) 114 | 115 | 116 | if __name__ == '__main__': 117 | 118 | parser = argparse.ArgumentParser() 119 | parser.add_argument('submitted') # if 'all' or '*', eval all teams listed in dstc/teams.txt 120 | # elif endswith '.txt', eval this single file 121 | # else, eval all *.txt in folder `submitted_fld` 122 | 123 | parser.add_argument('--clean', '-c', action='store_true') # whether to clean ref and hyp before eval 124 | parser.add_argument('--n_lines', '-n', type=int, default=-1) # eval all lines (default) or top n_lines (e.g., for fast debugging) 125 | parser.add_argument('--n_refs', '-r', type=int, default=6) # number of references 126 | parser.add_argument('--vshuman', '-v', type=int, default='1') # when evaluating against human performance (N in refN.txt that should be removed) 127 | # in which case we need to remove human output from refs 128 | parser.add_argument('--refs', '-g', default='dstc/test.refs') 129 | parser.add_argument('--keys', '-k', default='keys/test.2k.txt') 130 | parser.add_argument('--teams', '-i', type=str, default='dstc/teams.txt') 131 | parser.add_argument('--report', '-o', type=str, default=None) 132 | args = parser.parse_args() 133 | print('Args: %s\n' % str(args), file=sys.stderr) 134 | 135 | if args.n_lines < 0: 136 | n_lines = None # eval all lines 137 | else: 138 | n_lines = args.n_lines # just eval top n_lines 139 | 140 | if args.submitted.endswith('.txt'): 141 | eval_one_system(args.submitted, keys=args.keys, multi_ref=args.refs, clean=args.clean, n_lines=n_lines, n_refs=args.n_refs, vshuman=args.vshuman) 142 | else: 143 | fname_report = 'report_ref%i'%args.n_refs 144 | if args.clean: 145 | fname_report += '_cleaned' 146 | fname_report += '.tsv' 147 | if args.submitted == 'all' or args.submitted == '*': 148 | files = ['dstc/' + line.strip('\n') for line in open(args.teams)] 149 | path_report = 'dstc/' + fname_report 150 | else: 151 | files = [args.submitted] 152 | path_report = args.submitted + '/' + fname_report 153 | if args.report != None: 154 | path_report = args.report 155 | eval_all_systems(files, path_report, keys=args.keys, multi_ref=args.refs, clean=args.clean, n_lines=n_lines, n_refs=args.n_refs, vshuman=args.vshuman) 156 | -------------------------------------------------------------------------------- /evaluation/src/metrics.py: -------------------------------------------------------------------------------- 1 | # author: Xiang Gao @ Microsoft Research, Oct 2018 2 | # compute NLP evaluation metrics 3 | 4 | import re 5 | from util import * 6 | from collections import defaultdict 7 | 8 | 9 | def calc_nist_bleu(path_refs, path_hyp, fld_out='temp', n_lines=None): 10 | # call mteval-v14c.pl 11 | # ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v14c.pl 12 | # you may need to cpan install XML:Twig Sort:Naturally String:Util 13 | 14 | makedirs(fld_out) 15 | 16 | if n_lines is None: 17 | n_lines = len(open(path_hyp, encoding='utf-8').readlines()) 18 | _write_xml([''], fld_out + '/src.xml', 'src', n_lines=n_lines) 19 | _write_xml([path_hyp], fld_out + '/hyp.xml', 'hyp', n_lines=n_lines) 20 | _write_xml(path_refs, fld_out + '/ref.xml', 'ref', n_lines=n_lines) 21 | 22 | time.sleep(1) 23 | cmd = [ 24 | 'perl','3rdparty/mteval-v14c.pl', 25 | '-s', 
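        # mteval-v14c.pl takes a source file (-s), a system output file (-t), and reference files (-r), all in the XML format built by _write_xml below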
'%s/src.xml'%fld_out, 26 | '-t', '%s/hyp.xml'%fld_out, 27 | '-r', '%s/ref.xml'%fld_out, 28 | ] 29 | process = subprocess.Popen(cmd, stdout=subprocess.PIPE) 30 | output, error = process.communicate() 31 | 32 | lines = output.decode().split('\n') 33 | try: 34 | nist = lines[-6].strip('\r').split()[1:5] 35 | bleu = lines[-4].strip('\r').split()[1:5] 36 | return [float(x) for x in nist], [float(x) for x in bleu] 37 | 38 | except Exception: 39 | print('mteval-v14c.pl returns unexpected message') 40 | print('cmd = '+str(cmd)) 41 | print(output.decode()) 42 | print(error.decode()) 43 | return [-1]*4, [-1]*4 44 | 45 | 46 | 47 | 48 | def calc_cum_bleu(path_refs, path_hyp): 49 | # call multi-bleu.pl 50 | # https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl 51 | # the 4-gram cum BLEU returned by this one should be very close to calc_nist_bleu 52 | # however multi-bleu.pl doesn't return cum BLEU of lower rank, so in nlp_metrics we preferr calc_nist_bleu 53 | # NOTE: this func doesn't support n_lines argument and output is not parsed yet 54 | 55 | process = subprocess.Popen( 56 | ['perl', '3rdparty/multi-bleu.perl'] + path_refs, 57 | stdout=subprocess.PIPE, 58 | stdin=subprocess.PIPE 59 | ) 60 | with open(path_hyp, encoding='utf-8') as f: 61 | lines = f.readlines() 62 | for line in lines: 63 | process.stdin.write(line.encode()) 64 | output, error = process.communicate() 65 | return output.decode() 66 | 67 | 68 | def calc_meteor(path_refs, path_hyp, fld_out='temp', n_lines=None, pretokenized=True): 69 | # Call METEOR code. 70 | # http://www.cs.cmu.edu/~alavie/METEOR/index.html 71 | 72 | makedirs(fld_out) 73 | path_merged_refs = fld_out + '/refs_merged.txt' 74 | _write_merged_refs(path_refs, path_merged_refs) 75 | 76 | cmd = [ 77 | 'java', '-Xmx1g', # heapsize of 1G to avoid OutOfMemoryError 78 | '-jar', '3rdparty/meteor-1.5/meteor-1.5.jar', 79 | path_hyp, path_merged_refs, 80 | '-r', '%i'%len(path_refs), # refCount 81 | '-l', 'en', '-norm' # also supports language: cz de es fr ar 82 | ] 83 | 84 | process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 85 | output, error = process.communicate() 86 | for line in output.decode().split('\n'): 87 | if "Final score:" in line: 88 | return float(line.split()[-1]) 89 | 90 | print('meteor-1.5.jar returns unexpected message') 91 | print("cmd = " + " ".join(cmd)) 92 | print(output.decode()) 93 | print(error.decode()) 94 | return -1 95 | 96 | 97 | def calc_entropy(path_hyp, n_lines=None): 98 | # based on Yizhe Zhang's code 99 | etp_score = [0.0,0.0,0.0,0.0] 100 | counter = [defaultdict(int),defaultdict(int),defaultdict(int),defaultdict(int)] 101 | i = 0 102 | for line in open(path_hyp, encoding='utf-8'): 103 | i += 1 104 | words = line.strip('\n').split() 105 | for n in range(4): 106 | for idx in range(len(words)-n): 107 | ngram = ' '.join(words[idx:idx+n+1]) 108 | counter[n][ngram] += 1 109 | if i == n_lines: 110 | break 111 | 112 | for n in range(4): 113 | total = sum(counter[n].values()) 114 | for v in counter[n].values(): 115 | etp_score[n] += - v /total * (np.log(v) - np.log(total)) 116 | 117 | return etp_score 118 | 119 | 120 | def calc_len(path, n_lines): 121 | l = [] 122 | for line in open(path, encoding='utf8'): 123 | l.append(len(line.strip('\n').split())) 124 | if len(l) == n_lines: 125 | break 126 | return np.mean(l) 127 | 128 | 129 | def calc_diversity(path_hyp): 130 | tokens = [0.0,0.0] 131 | types = [defaultdict(int),defaultdict(int)] 132 | for line in open(path_hyp, encoding='utf-8'): 133 | 
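        # distinct-1 / distinct-2: ratio of unique unigrams and bigrams to the total number generated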
words = line.strip('\n').split() 134 | for n in range(2): 135 | for idx in range(len(words)-n): 136 | ngram = ' '.join(words[idx:idx+n+1]) 137 | types[n][ngram] = 1 138 | tokens[n] += 1 139 | div1 = len(types[0].keys())/tokens[0] 140 | div2 = len(types[1].keys())/tokens[1] 141 | return [div1, div2] 142 | 143 | 144 | def nlp_metrics(path_refs, path_hyp, fld_out='temp', n_lines=None): 145 | nist, bleu = calc_nist_bleu(path_refs, path_hyp, fld_out, n_lines) 146 | meteor = calc_meteor(path_refs, path_hyp, fld_out, n_lines) 147 | entropy = calc_entropy(path_hyp, n_lines) 148 | div = calc_diversity(path_hyp) 149 | avg_len = calc_len(path_hyp, n_lines) 150 | return nist, bleu, meteor, entropy, div, avg_len 151 | 152 | 153 | def _write_merged_refs(paths_in, path_out, n_lines=None): 154 | # prepare merged ref file for meteor-1.5.jar (calc_meteor) 155 | # lines[i][j] is the ref from i-th ref set for the j-th query 156 | 157 | lines = [] 158 | for path_in in paths_in: 159 | lines.append([line.strip('\n') for line in open(path_in, encoding='utf-8')]) 160 | 161 | with open(path_out, 'w', encoding='utf-8') as f: 162 | for j in range(len(lines[0])): 163 | for i in range(len(paths_in)): 164 | f.write(unicode(lines[i][j]) + "\n") 165 | 166 | 167 | 168 | def _write_xml(paths_in, path_out, role, n_lines=None): 169 | # prepare .xml files for mteval-v14c.pl (calc_nist_bleu) 170 | # role = 'src', 'hyp' or 'ref' 171 | 172 | lines = [ 173 | '<?xml version="1.0" encoding="UTF-8"?>', 174 | '<!DOCTYPE mteval SYSTEM "">', 175 | '<!-- generated by https://github.com/golsun/NLP-tools -->', 176 | '<!-- from: %s -->'%paths_in, 177 | '<!-- as inputs for ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v14c.pl -->', 178 | '<mteval>', 179 | ] 180 | 181 | for i_in, path_in in enumerate(paths_in): 182 | 183 | # header ---- 184 | 185 | if role == 'src': 186 | lines.append('<srcset setid="unnamed" srclang="src">') 187 | set_ending = '</srcset>' 188 | elif role == 'hyp': 189 | lines.append('<tstset setid="unnamed" srclang="src" trglang="tgt" sysid="unnamed">') 190 | set_ending = '</tstset>' 191 | elif role == 'ref': 192 | lines.append('<refset setid="unnamed" srclang="src" trglang="tgt" refid="ref%i">'%i_in) 193 | set_ending = '</refset>' 194 | 195 | lines.append('<doc docid="unnamed" genre="unnamed">') 196 | 197 | # body ----- 198 | 199 | if role == 'src': 200 | body = [''] * n_lines 201 | else: 202 | with open(path_in, 'r', encoding='utf-8') as f: 203 | body = f.readlines() 204 | if n_lines is not None: 205 | body = body[:n_lines] 206 | for i in range(len(body)): 207 | line = body[i].strip('\n') 208 | line = line.replace('&',' ').replace('<',' ') # remove illegal xml char 209 | if len(line) == 0: 210 | line = '__empty__' 211 | lines.append('<p><seg id="%i"> %s </seg></p>'%(i + 1, line)) 212 | 213 | # ending ----- 214 | 215 | lines.append('</doc>') 216 | if role == 'src': 217 | lines.append('</srcset>') 218 | elif role == 'hyp': 219 | lines.append('</tstset>') 220 | elif role == 'ref': 221 | lines.append('</refset>') 222 | 223 | lines.append('</mteval>') 224 | with open(path_out, 'w', encoding='utf-8') as f: 225 | f.write(unicode('\n'.join(lines))) 226 | -------------------------------------------------------------------------------- /evaluation/src/tokenizers.py: -------------------------------------------------------------------------------- 1 | # author: Xiang Gao @ Microsoft Research, Oct 2018 2 | # clean and tokenize natural language text 3 | 4 | import re 5 | from util import * 6 | from nltk.tokenize import TweetTokenizer 7 
| 8 | def clean_str(txt): 9 | #print("in=[%s]" % txt) 10 | txt = txt.lower() 11 | txt = re.sub('^',' ', txt) 12 | txt = re.sub('$',' ', txt) 13 | 14 | # url and tag 15 | words = [] 16 | for word in txt.split(): 17 | i = word.find('http') 18 | if i >= 0: 19 | word = word[:i] + ' ' + '__url__' 20 | words.append(word.strip()) 21 | txt = ' '.join(words) 22 | 23 | # remove markdown URL 24 | txt = re.sub(r'\[([^\]]*)\] \( *__url__ *\)', r'\1', txt) 25 | 26 | # remove illegal char 27 | txt = re.sub('__url__','URL',txt) 28 | txt = re.sub(r"[^A-Za-z0-9():,.!?\"\']", " ", txt) 29 | txt = re.sub('URL','__url__',txt) 30 | 31 | # contraction 32 | add_space = ["'s", "'m", "'re", "n't", "'ll","'ve","'d","'em"] 33 | tokenizer = TweetTokenizer(preserve_case=False) 34 | txt = ' ' + ' '.join(tokenizer.tokenize(txt)) + ' ' 35 | txt = txt.replace(" won't ", " will n't ") 36 | txt = txt.replace(" can't ", " can n't ") 37 | for a in add_space: 38 | txt = txt.replace(a+' ', ' '+a+' ') 39 | 40 | txt = re.sub(r'^\s+', '', txt) 41 | txt = re.sub(r'\s+$', '', txt) 42 | txt = re.sub(r'\s+', ' ', txt) # remove extra spaces 43 | 44 | #print("out=[%s]" % txt) 45 | return txt 46 | 47 | 48 | if __name__ == '__main__': 49 | ss = [ 50 | " I don't know:). how about this?https://github.com/golsun/deep-RL-time-series", 51 | "please try [ GitHub ] ( https://github.com )", 52 | ] 53 | for s in ss: 54 | print(s) 55 | print(clean_str(s)) 56 | print() 57 | 58 | -------------------------------------------------------------------------------- /evaluation/src/util.py: -------------------------------------------------------------------------------- 1 | import os, time, subprocess, io, sys, re, argparse 2 | import numpy as np 3 | 4 | py_version = sys.version.split('.')[0] 5 | if py_version == '2': 6 | open = io.open 7 | else: 8 | unicode = str 9 | 10 | def makedirs(fld): 11 | if not os.path.exists(fld): 12 | os.makedirs(fld) 13 | 14 | 15 | def str2bool(s): 16 | # to avoid issue like this: https://stackoverflow.com/questions/15008758/parsing-boolean-values-with-argparse 17 | if s.lower() in ['t','true','1','y']: 18 | return True 19 | elif s.lower() in ['f','false','0','n']: 20 | return False 21 | else: 22 | raise ValueError --------------------------------------------------------------------------------