├── index_articles.py
├── convert-to-trec.py
├── README.md
└── terrier.properties
/index_articles.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import requests
3 |
4 | es_url = sys.argv[1]
5 | filename = sys.argv[2]
6 |
7 | doc_num = 1
8 | with open(filename) as f:
9 |     for line in f:
10 |         requests.post(url=es_url + '/articles/article', data=line, headers={'Content-Type': 'application/json'})
11 |         print('Added document:\t' + str(doc_num))
12 |         doc_num += 1
13 |
--------------------------------------------------------------------------------
/convert-to-trec.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import json
3 | import sys
4 | import getopt
5 |
6 |
7 | def usage():
8 |     print("usage: convert-to-trec.py -i inputfile -o outputfile\n\n"
9 |           " inputfile: the original JSONL file of the dataset\n"
10 |           " outputfile: the TREC-formatted output file\n\nCopyright (c) 2015 by Signal Media Ltd.")
11 |
12 | def main(argv):
13 |     inputfile = ''
14 |     outputfile = ''
15 |     try:
16 |         opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
17 |     except getopt.GetoptError:
18 |         usage()
19 |         sys.exit(2)
20 |     for opt, arg in opts:
21 |         if opt == '-h':
22 |             usage()
23 |             sys.exit()
24 |         elif opt in ("-i", "--ifile"):
25 |             inputfile = arg
26 |         elif opt in ("-o", "--ofile"):
27 |             outputfile = arg
28 |     if not inputfile:
29 |         usage()
30 |         sys.exit()
31 |     if not outputfile:
32 |         usage()
33 |         sys.exit()
34 |     with open(inputfile) as fin, open(outputfile, 'w') as fout:
35 |         for line in fin:
36 |             news_article = json.loads(line)
37 |             # wrap each article in the SGML tags that Terrier's TRECCollection expects
38 |             trecdoc = u"<DOC>\n"
39 |             trecdoc += u"<DOCNO>{}</DOCNO>\n".format(news_article["id"])
40 |             trecdoc += u"<SOURCE>{}</SOURCE>\n".format(news_article["source"])
41 |             trecdoc += u"<TITLE>{}</TITLE>\n".format(news_article["title"])
42 |             trecdoc += u"<MEDIATYPE>{}</MEDIATYPE>\n".format(news_article["media-type"])
43 |             trecdoc += u"<PUBLISHED>{}</PUBLISHED>\n".format(news_article["published"])
44 |             trecdoc += u"<TEXT>{}</TEXT>\n".format(news_article["content"])
45 |             trecdoc += u"</DOC>\n"
46 |             try:  # Python 2: write UTF-8 encoded bytes
47 |                 fout.write(trecdoc.encode('utf-8'))
48 |             except TypeError:  # Python 3: write the str directly
49 |                 fout.write(trecdoc)
50 |
51 | if __name__ == "__main__":
52 |     main(sys.argv[1:])
53 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Signal-1M-Tools
2 |
3 | ## What is the Signal 1M Dataset?
4 |
5 | The [Signal Media One-Million News Articles Dataset](http://research.signalmedia.co/newsir16/signal-dataset.html) by [Signal Media](http://signal.uk.com/) was released to facilitate research on news articles. It can be used for submissions to the NewsIR'16 workshop, but it is intended to serve the community for research on news retrieval in general.
6 |
7 | The articles in the dataset were originally collected by Moreover Technologies (one of Signal's content providers) from a variety of news sources over a period of one month (1-30 September 2015). The dataset contains 1 million articles that are mainly English, though it also includes non-English and multilingual articles. Sources range from major outlets, such as Reuters, to local news sources and blogs.
8 |
9 | ## Getting Started
10 |
11 | ### Downloading the dataset
12 |
13 | To obtain the dataset, follow the download link [here](http://research.signalmedia.co/newsir16/signal-dataset.html).
14 |
15 |
16 | ### Elasticsearch
17 |
18 | [Elasticsearch](https://www.elastic.co/products/elasticsearch) is a powerful distributed RESTful search engine that can be used to store and index large amounts of data. At Signal, we use Elasticsearch to handle most of our search requests.
19 |
20 | #### Installation
21 |
22 | 1. Download [Elasticsearch](https://www.elastic.co/downloads/elasticsearch) and unzip.
23 | 2. Run `bin/elasticsearch` on Unix or `bin/elasticsearch.bat` on Windows.
24 | 3. Run `curl -X GET http://localhost:9200/`
25 |
26 | At this point, Elasticsearch should be running locally on port `9200`. More information about Elasticsearch can be found at their [GitHub page](https://github.com/elastic/elasticsearch).
27 |
28 | We advise that you use a tool to interact with Elasticsearch. Here are a few good ones:
29 | * [Sense (Beta) Chrome Plugin](https://chrome.google.com/webstore/detail/sense-beta/lhjgkmllcaadmopgmanpapmpjgmfcfig?hl=en)
30 | * [Postman](https://www.getpostman.com/)
31 |
32 | #### Creating an index
33 |
34 | In order to store articles, you need to create an index. First, create an `articles` index:
35 |
36 | ```bash
37 | curl -X PUT 'http://localhost:9200/articles'
38 | ```
39 |
40 | or in Sense:
41 |
42 | ```javascript
43 | PUT articles
44 | ```
45 |
46 | #### Indexing the million articles
47 |
48 | To index the million articles into Elasticsearch using Python, first install [Requests](https://github.com/kennethreitz/requests/):
49 |
50 | ```bash
51 | pip install requests
52 | ```
53 |
54 | Then run:
55 |
56 | ```bash
57 | python index_articles.py http://localhost:9200 ./million.jsonl
58 | ```
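Posting one document per HTTP request, as `index_articles.py` does, is slow for a million articles. Elasticsearch's `_bulk` endpoint accepts many documents per request. Below is a minimal sketch of building a bulk request body; the `articles`/`article` index and type names follow the README above, the `bulk_payload` helper is hypothetical, and batching the million lines into chunks is left to the caller:

```python
import json

def bulk_payload(jsonl_lines, index="articles", doc_type="article"):
    """Build an Elasticsearch _bulk request body from JSONL article lines.

    Each document becomes an action line followed by its source line;
    the whole body must end with a newline, per the _bulk API.
    """
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    parts = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL file
        parts.append(action)
        parts.append(line)
    return "\n".join(parts) + "\n"

# Then POST each batch, e.g.:
# requests.post(es_url + "/_bulk", data=bulk_payload(batch),
#               headers={"Content-Type": "application/x-ndjson"})
```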
59 |
60 | #### Term frequencies
61 |
62 | The term and document frequencies are also available via the links below. These values were calculated after routine tokenisation and stop-word removal.
63 |
64 | * [Document frequencies](http://research.signalmedia.co/store/merged-df.edn)
65 |
66 | * [Inverse document frequencies](http://research.signalmedia.co/store/merged-idf.edn)
67 |
68 | * [Term frequencies](http://research.signalmedia.co/store/merged-df-totalf.edn)
69 |
70 | These files are in [edn format](https://github.com/edn-format/edn).
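If you prefer not to pull in a full edn parser, and assuming these files are flat edn maps from term to numeric count (an assumption about their shape, not verified here), a small regex-based reader is enough:

```python
import re

def read_edn_counts(text):
    """Parse a flat edn map of string keys to numbers,
    e.g. '{"signal" 1042, "news" 37}' -> {'signal': 1042.0, 'news': 37.0}.

    Assumes no nested structures inside the map; for anything richer,
    use a real edn parser instead.
    """
    pairs = re.findall(r'"((?:[^"\\]|\\.)*)"\s+([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)', text)
    return {key: float(value) for key, value in pairs}
```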
71 |
72 |
73 | ### TREC
74 |
75 | #### Signal-1M-Convert-To-TREC
76 | A script to convert the [Signal Media One-Million News Articles Dataset](http://research.signalmedia.co/newsir16/signal-dataset.html) to TREC format.
77 | The TREC format allows researchers to index the dataset using popular Information Retrieval platforms such as [Terrier](http://terrier.org).
78 |
79 | #### Running the script
80 | After obtaining the dataset through this form: http://goo.gl/forms/5i4KldoWIX, you can extract the JSONL file from the downloaded Gzip file.
81 | Then run the script like this:
82 |
83 | ```bash
84 | python convert-to-trec.py -i <inputfile> -o <outputfile>
85 | ```
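For reference, each JSONL article becomes one SGML-tagged TREC document. A sketch of the per-article conversion is below; the tag names are assumptions chosen to line up with the keys that the included terrier.properties indexes (DOCNO, SOURCE, TITLE, MEDIATYPE, PUBLISHED):

```python
def to_trec(article):
    """Wrap one parsed JSONL article dict in TREC SGML tags."""
    return (u"<DOC>\n"
            u"<DOCNO>{}</DOCNO>\n"
            u"<SOURCE>{}</SOURCE>\n"
            u"<TITLE>{}</TITLE>\n"
            u"<MEDIATYPE>{}</MEDIATYPE>\n"
            u"<PUBLISHED>{}</PUBLISHED>\n"
            u"<TEXT>{}</TEXT>\n"
            u"</DOC>\n").format(article["id"], article["source"],
                                article["title"], article["media-type"],
                                article["published"], article["content"])
```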
86 |
87 | #### Indexing the dataset with Terrier
88 | We recommend using the terrier.properties file included in this repository to index the dataset with Terrier.
89 | In your Terrier etc folder, add a text file "signal.spec" with one line containing the path to the file you created above (the TREC-formatted dataset).
90 |
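A collection spec is just a newline-terminated list of file paths, one per line. A minimal sketch for generating it (the `write_spec` helper is hypothetical, and any paths you pass are your own):

```python
import os

def write_spec(spec_path, trec_paths):
    """Write a Terrier collection.spec-style file: one absolute path per line."""
    with open(spec_path, "w") as f:
        for path in trec_paths:
            f.write(os.path.abspath(path) + "\n")
```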
--------------------------------------------------------------------------------
/terrier.properties:
--------------------------------------------------------------------------------
1 | #This file provides some examples of how to set up properties.
2 |
3 | #index path - where terrier stores its index
4 | #terrier.index.path=/usr/local/terrier/var/index/
5 |
6 | #etc path - terrier configuration files
7 | #terrier.etc=/usr/local/terrier/etc/
8 |
9 | #share path - files from the distribution terrier uses
10 | #terrier.share=/usr/local/terrier/share/
11 |
12 |
13 | ###################################################################
14 | #indexing settings:
15 | ###################################################################
16 |
17 | #collection.spec is where a Collection object expects to find information
18 | #about how it should be initialised. E.g. for a TREC collection, it's a file
19 | #containing the list of files it should index.
20 | collection.spec=etc/signal.spec
21 | #default collection.spec is etc/collection.spec
22 |
23 | #To record term positions (blocks) in the index, set
24 | #block.indexing=true
25 |
26 | #block indexing can be controlled using several properties:
27 | #blocks.size - how many terms each block spans (default 1, i.e. exact term positions)
28 | #blocks.max - the maximum number of blocks to record in a document
29 |
30 | #fields: Terrier can save the frequency of term occurrences in various tags
31 | #to specify the fields to note during indexing:
32 | FieldTags.process=TITLE,ELSE
33 | #note that ELSE is a special field, which is everything not in one of the other fields
34 |
35 | #which collection to use to index?
36 | #or, if you're indexing non-English collections:
37 | #trec.collection.class=TRECUTFCollection
38 |
39 | ###################################################################
40 | #TRECCollection specific properties
41 | ###################################################################
42 |
43 | #If you're using TRECCollection to index, you should specify the tags that should be indexed:
44 | #document tags specification
45 | #for processing the contents of
46 | #the documents, ignoring DOCHDR
47 | TrecDocTags.doctag=DOC
48 | TrecDocTags.idtag=DOCNO
49 | TrecDocTags.skip=DOCHDR
50 | #set to false if the tags can appear in mixed case, e.g. DOC and DoC
51 | TrecDocTags.casesensitive=true
52 |
53 |
54 | indexer.meta.forward.keys=docno,SOURCE,TITLE,MEDIATYPE,PUBLISHED
55 | indexer.meta.forward.keylens=40,200,500,500,20
56 |
57 | ###################################################################
58 | #Web Interface and Query-biased summarisation
59 | ###################################################################
60 |
61 | #TRECWebCollection extends the traditional TRECCollection by saving meta-data from the
62 | #document headers (url,crawldate,etc) as document properties.
63 | #trec.collection.class=TRECWebCollection
64 |
65 | #We want to save two additional tags as abstracts in the properties of each document, called title and body
66 | #TaggedDocument.abstracts=title,body
67 |
68 | #They come from the tags in the TRECCollection, in this case the WT2G collection. We will save the
69 | #TITLE tags in the title abstract. ELSE is a special case that saves everything that is not added to
70 | #an existing abstract (everything but the TITLE tag). In this case we are saving ELSE to the body abstract.
71 | #TaggedDocument.abstracts.tags=TITLE,ELSE
72 |
73 | #We also need to specify how many characters we should save in each field
74 | #TaggedDocument.abstracts.lengths=120,500
75 |
76 | #Next we need to tell the meta index to save all of these extra document properties
77 | #so that they are not thrown away. In addition to the docno which is always available,
78 | #and the title and body abstracts saved by TaggedDocument, we will also save the url and
79 | #crawldate which are added to each document by the TRECWebCollection class.
80 | #indexer.meta.forward.keys=docno,title,body,url,crawldate
81 |
82 | #Again, we need to specify how many characters long each meta index entry is. Note that the title and body
83 | #abstract lengths should be the same here as in TaggedDocument.abstracts.
84 | #indexer.meta.forward.keylens=32,120,500,140,35
85 |
86 | ###################################################################
87 | #Collections for WARC files
88 | ###################################################################
89 | #these collection classes have no properties to configure them
90 | #trec.collection.class=WARC09Collection
91 | #trec.collection.class=WARC018Collection
92 |
93 |
94 | ###################################################################
95 | #SimpleFileCollection specific properties
96 | ###################################################################
97 | #trec.collection.class=SimpleFileCollection
98 |
99 | #indexing.simplefilecollection.extensionsparsers - use this to define parsers for known file extensions
100 | #indexing.simplefilecollection.extensionsparsers=txt:FileDocument,text:FileDocument,tex:FileDocument,bib:FileDocument, pdf:PDFDocument,html:HTMLDocument,htm:HTMLDocument,xhtml:HTMLDocument, xml:HTMLDocument,doc:MSWordDocument,ppt:MSPowerpointDocument,xls:MSExcelDocument
101 |
102 | #indexing.simplefilecollection.defaultparser
103 | #if this is defined, then terrier will attempt to open any file it doesn't have an explicit parser for with the parser given
104 | #indexing.simplefilecollection.defaultparser=FileDocument
105 |
106 | ###################################################################
107 | #SimpleXMLCollection specific properties
108 | ###################################################################
109 | #trec.collection.class=SimpleXMLCollection
110 |
111 | #what tag defines the document
112 | xml.doctag=document
113 | #what tag defines the document number
114 | xml.idtag=docno
115 | #what tags hold text to be indexed
116 | xml.terms=text
117 | #will the text be in non-English?
118 | #string.use_utf=false
119 |
120 | #Terrier has two indexing strategies:
122 | #classical two-pass and single-pass. Different properties control each
122 |
123 | #####################
124 | #(a) classical two-pass indexing
125 | #####################
126 | #control the number of terms to invert at once (defined by # of pointers).
127 | invertedfile.processpointers=20000000
128 | #the default is lower for block indexing; uncomment the following if block.indexing=true
129 | #invertedfile.processpointers=2000000
130 | #reduce these if you experience OutOfMemory errors while inverting the direct file
131 |
132 |
133 | ####################
134 | #(b) single-pass indexing
135 | #####################
136 | #the properties about single-pass indexing are related to memory consumption:
137 | #
138 | #memory.reserved: the least amount of free memory (bytes) before a run is committed to disk. Set this too low and
139 | #an OutOfMemoryError will occur while saving the run. Set this too high and
140 | #more (smaller) runs will be saved, making indexing slower
141 | memory.reserved=50000000
142 | #memory.heap.usage: how much of the heap must be allocated (%) before a run can be committed to disk
143 | #set this too low and runs will be committed very often, before the java heap has reached full size.
144 | memory.heap.usage=0.85
145 | #docs.checks: how often to check memory consumption. Set this too high and an OutOfMemoryError may
146 | #occur between memory checks (i.e. {docs.checks} docs may fill more than {memory.reserved} bytes).
147 | docs.checks=20
148 |
149 | #####################################################
150 | #shared indexing and retrieval settings:
151 | #####################################################
152 | #stop-words file, relative paths are assumed to be in terrier.share
153 | stopwords.filename=stopword-list.txt
154 |
155 |
156 | #the processing stages a term goes through. Following is
157 | #the default setting:
158 | termpipelines=Stopwords,PorterStemmer
159 | #Using the above default setting, the system removes a
160 | #term if it is a stop-word and stems it otherwise. If you
161 | #want to keep all the stop-words and use a weak version of
162 | #Porter's stemmer, the setting should be as follows:
163 | #termpipelines=WeakPorterStemmer
164 |
165 | #The stemmers have changed since Terrier v2.0. If you wish to use the
166 | #older stemmers, use TRv2PorterStemmer and TRv2WeakPorterStemmer
167 |
168 | #If you wish to keep all terms unchanged, use
169 | #termpipelines=
170 |
171 |
172 | #####################################################
173 | # retrieval controls
174 | #####################################################
175 |
176 | #default controls for query expansion
177 | querying.postprocesses.order=QueryExpansion
178 | querying.postprocesses.controls=qe:QueryExpansion
179 |
180 | #default controls for the web-based interface. SimpleDecorate
181 | #is the simplest metadata decorator. For more control, see
182 | #Decorate.
183 | querying.postfilters.order=SimpleDecorate,SiteFilter,Scope
184 | querying.postfilters.controls=decorate:SimpleDecorate,site:SiteFilter,scope:Scope
185 |
186 | #default and allowed controls
187 | querying.default.controls=
188 | querying.allowed.controls=scope,qe,qemodel,start,end,site
189 | #start control - sets the first result to be retrieved (default to rank 0)
190 | #end control - sets the last result to be retrieved (defaults to 0 - display all results)
191 | #qe control - enable query expansion (needs a direct index)
192 | #qemodel control - specify the name of the model for QueryExpansion (default to Bo1, set automatically by TRECQuerying to value of trec.qe.model property)
193 | #site control - filter the retrieved results by the suffix of their URL's hostname
194 | #scope control - filter the retrieved results by the prefix of their docno
195 |
196 | #####################################################
197 | #Advanced Web-based Interface Controls
198 | #####################################################
199 |
200 | #an example of using the full Decorate class, which does text highlighting
201 | #and query-biased summarisation.
202 | #querying.postfilters.controls=decorate:org.terrier.querying.Decorate
203 | #querying.postfilters.order=org.terrier.querying.Decorate
204 |
205 | #here we turn decorate on, generate a summary from the body meta index entry and add emphasis
206 | #to the query terms in both the title and body fields. Note that summaries overwrite the original
207 | #field, while emphasis creates a new meta index entry of the form <field>_emph, e.g. title_emph
208 | #querying.default.controls=start:0,end:9,decorate:on,summaries:body,emphasis:title;body
209 | #querying.allowed.controls=scope,decorate,start,end
210 |
211 | #####################################################
212 | #batch retrieval
213 | #####################################################
214 |
215 | #TrecTerrier -r does batch querying, by using a QuerySource to find
216 | #a stream of queries to retrieve documents for
217 |
218 | #the list of query files to process are normally listed in the file etc/trec.topics.list
219 | #instead, you can override this by specifying a query file
220 | #trec.topics=/path/to/some/queries.gz
221 |
222 |
223 |
224 | #you can change what QuerySource is used.
225 | #the default QuerySource, TRECQuery, parses TREC notation topic files
226 | #trec.topics.parser=TRECQuery
227 |
228 | #TRECQuery can be configured as follows:
229 | #query tags specification
230 | #TrecQueryTags.doctag stands for the name of tag that marks
231 | #the start of a TREC topic.
232 | TrecQueryTags.doctag=TOP
233 | #TrecQueryTags.idtag stands for the name of tag that marks
234 | #the id of a TREC topic.
235 | TrecQueryTags.idtag=NUM
236 | #TrecQueryTags.process stands for the fields in a TREC
237 | #topic to process. TrecQueryTags.skip stands for the fields
238 | #in a TREC topic to ignore. Following is the default setting:
239 | TrecQueryTags.process=TOP,NUM,TITLE
240 | TrecQueryTags.skip=DESC,NARR
241 | #Using the above default setting, Terrier processes the titles
242 | #in a TREC topic, while ignoring descriptions and narratives. If
243 | #you want to run a querying process with description-only
244 | #queries, the setting should be:
245 | #TrecQueryTags.process=TOP,NUM,DESC
246 | #TrecQueryTags.skip=TITLE,NARR
247 | #Note that "TITLE" is in the list of tags to ignore.
248 |
249 | #Alternatively, topics may be in a single file
250 | #trec.topics.parser=SingleLineTRECQuery
251 | #the first token on each line is the query id
252 | SingleLineTRECQuery.queryid.exists=true
253 | #should periods be removed from the query stream (they break the query parser)
254 | #SingleLineTRECQuery.periods.allowed=false
255 |
256 |
257 | #batch results are written to var/results/
258 | #you can override this with
259 | #trec.results=/path/to/results/for/WT2G
260 | #files are written to this folder named using the query counter. However, this is a problem if multiple Terrier
261 | #instances are writing to the same folder; in that case use a random counter:
262 | #trec.querycounter.type=random
263 |
264 |
265 | #or just specify the exact filename for the results
266 | #trec.results.file=/path/to/myExperiment.res
267 |
268 | #you can override the format of the results file:
269 | #trec.querying.outputformat=my.package.some.other.OutputFormat
270 |
271 | #####################################################
272 | # query expansion during retrieval
273 | #####################################################
274 |
275 | #Terrier also provides the functionality of query
276 | #expansion (pseudo-relevance feedback). Using the following setting, the query
277 | #expansion component extracts the 10 most informative
278 | #terms from the 3 top-returned documents
279 | #in first-pass retrieval as the expanded query terms.
280 |
281 | expansion.documents=3
282 | expansion.terms=10
283 |
284 | #If the above two properties are not specified
285 | #in terrier.properties, Terrier will use the default
286 | #setting which is expansion.documents=3 and
287 | #expansion.terms=10.
288 |
289 | #Terrier applies a parameter-free query expansion mechanism
290 | #by default. An alternative approach is Rocchio's query
291 | #expansion. To enable the latter, set the property
292 | #parameter.free.expansion to false:
293 |
294 | #parameter.free.expansion=false
295 |
296 | #To set the parameter beta of Rocchio's query expansion,
297 | #set rocchio.beta=X in terrier.properties. For example:
298 |
299 | #rocchio.beta=0.8
300 |
301 | #If rocchio.beta is not specified in terrier.properties, Terrier
302 | #applies rocchio.beta=0.4 by default.
303 |
304 | #####################################################
305 | # field-based models
306 | #####################################################
307 | #Field based models take the frequency in each field
308 | #into account.
309 | #trec.model=BM25F
310 | #trec.model=PL2F
311 | #For these models, you need a length normalisation property
312 | #for each field, starting at 0. Note that field names (as specified by FieldTags.process)
313 | #are NOT used here.
314 | #c.0=1
315 | #c.1=1
316 | #etc
317 |
318 | #Similarly, each field needs a weight. Some models may implicitly have a restriction
319 | #on these weights (e.g. sum to 1)
320 | #w.0=1
321 | #w.1=1
322 | #etc
323 | #####################################################
324 | # term dependence models
325 | #####################################################
326 |
327 | #choose between DFR or Markov Random Fields dependence models
328 | #matching.dsms=DFRDependenceScoreModifier
329 | #matching.dsms=MRFDependenceScoreModifier
330 |
331 | #choose SD or FD (sequential or full dependence)
332 | proximity.dependency.type=SD
333 |
334 | #size of window: 2 is an exact phrase, 10 is a larger window
335 | proximity.ngram.length=2
336 | #weight of SD
337 | proximity.w_o=1
338 | #weight of FD
339 | proximity.w_u=1
340 |
341 | #for DFR, choose between pBiL and pBiL2
342 | #true for pBiL
343 | proximity.norm2=true
344 | #default length normalisation is 1
345 | proximity.norm2.c=1
346 |
347 | #for MRF
348 | #mu in the dirichlet formula.
349 | mrf.mu=4000
350 |
--------------------------------------------------------------------------------