├── index_articles.py ├── convert-to-trec.py ├── README.md └── terrier.properties /index_articles.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import requests 3 | 4 | es_url = sys.argv[1] 5 | filename = sys.argv[2] 6 | 7 | doc_num = 1 8 | with open(filename) as f: 9 | for line in f: 10 | requests.post(url=es_url + '/articles/article', data=line) 11 | print('Added document:\t' + str(doc_num)) 12 | doc_num += 1 13 | -------------------------------------------------------------------------------- /convert-to-trec.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import json 3 | import sys 4 | import getopt 5 | 6 | 7 | def usage(): 8 | print("usage: convert-to-trec.py -i inputfile -o outputfile \n\n"+ 9 | " inputfile: the original JSONL file of the dataset\n outputfile: the TREC formatted file to write\n\n"+ 10 | "Copyright (c) 2015 by Signal Media Ltd.") 11 | 12 | def main(argv): 13 | inputfile = '' 14 | outputfile = '' 15 | try: 16 | opts, args = getopt.getopt(argv,"hi:o:",["ifile=","ofile="]) 17 | except getopt.GetoptError: 18 | usage() 19 | sys.exit(2) 20 | for opt, arg in opts: 21 | if opt == '-h': 22 | usage() 23 | sys.exit() 24 | elif opt in ("-i", "--ifile"): 25 | inputfile = arg 26 | elif opt in ("-o", "--ofile"): 27 | outputfile = arg 28 | if not inputfile: 29 | usage() 30 | sys.exit() 31 | if not outputfile: 32 | usage() 33 | sys.exit() 34 | with open(inputfile) as fin, open(outputfile, 'w') as fout: 35 | for line in fin: 36 | news_article = json.loads(line) 37 | trecdoc = u"<DOC>\n" 38 | trecdoc += u"<DOCNO>{}</DOCNO>\n".format(news_article["id"]) 39 | trecdoc += u"<SOURCE>{}</SOURCE>\n".format(news_article["source"]) 40 | trecdoc += u"<MEDIATYPE>{}</MEDIATYPE>\n".format(news_article["media-type"]) 41 | trecdoc += u"<PUBLISHED>{}</PUBLISHED>\n".format(news_article["published"]) 42 | trecdoc += u"<TEXT>{}</TEXT>\n".format(news_article["content"]) 43 | trecdoc += u"</DOC>\n" 44 | try: # py27 45 | fout.write(trecdoc.encode('utf-8')) 46 | except TypeError: # py34 47 |
fout.write(trecdoc) 48 | 49 | if __name__ == "__main__": 50 | main(sys.argv[1:]) 51 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Signal-1M-Tools 2 | 3 | ## What is the Signal 1M Dataset? 4 | 5 | The [Signal Media One-Million News Articles Dataset](http://research.signalmedia.co/newsir16/signal-dataset.html) by [Signal Media](http://signal.uk.com/) was released to facilitate research on news articles. It can be used for submissions to the NewsIR'16 workshop, but it is intended to serve the community for research on news retrieval in general. 6 | 7 | The articles of the dataset were originally collected by Moreover Technologies (one of Signal's content providers) from a variety of news sources over a period of 1 month (1-30 September 2015). The dataset contains 1 million articles, mainly in English, but it also includes non-English and multi-lingual articles. Sources of these articles range from major ones, such as Reuters, to local news sources and blogs. 8 | 9 | ## Getting Started 10 | 11 | ### Downloading the dataset 12 | 13 | To obtain the dataset, follow the download link [here](http://research.signalmedia.co/newsir16/signal-dataset.html). 14 | 15 | 16 | ### Elasticsearch 17 | 18 | [Elasticsearch](https://www.elastic.co/products/elasticsearch) is a powerful distributed RESTful search engine that can be used to store and index large amounts of data. At Signal, we use Elasticsearch to handle most of our search requests. 19 | 20 | #### Installation 21 | 22 | 1. Download [Elasticsearch](https://www.elastic.co/downloads/elasticsearch) and unzip it. 23 | 2. Run `bin/elasticsearch` on Unix or `bin/elasticsearch.bat` on Windows. 24 | 3. Run `curl -X GET http://localhost:9200/` 25 | 26 | At this point, Elasticsearch should be running locally on port `9200`.
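The same check can be made from Python. This is a minimal sketch using only the standard library; the URL assumes the default local port, and `elasticsearch_is_up` is a helper name invented here for illustration:

```python
import json
import urllib.request
import urllib.error

def elasticsearch_is_up(url='http://localhost:9200/'):
    """Return True if an Elasticsearch node answers at `url`.

    A running node replies to the root endpoint with a small JSON
    document that includes, among other things, its cluster name.
    """
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            info = json.load(resp)
        return 'cluster_name' in info
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        return False

print(elasticsearch_is_up())
```

If this prints `False`, check that step 2 above completed without errors.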
More information about Elasticsearch can be found at their [GitHub page](https://github.com/elastic/elasticsearch). 27 | 28 | We advise that you use a tool to interact with Elasticsearch. Here are a few good ones: 29 | * [Sense (Beta) Chrome Plugin](https://chrome.google.com/webstore/detail/sense-beta/lhjgkmllcaadmopgmanpapmpjgmfcfig?hl=en) 30 | * [Postman](https://www.getpostman.com/) 31 | 32 | #### Creating an index 33 | 34 | In order to store articles, you need to create an index. First, create an `articles` index: 35 | 36 | ```bash 37 | curl -X PUT 'http://localhost:9200/articles' 38 | ``` 39 | 40 | or in Sense: 41 | 42 | ```javascript 43 | PUT articles 44 | ``` 45 | 46 | #### Indexing the million articles 47 | 48 | To index the million articles into Elasticsearch using Python, first install [Requests](https://github.com/kennethreitz/requests/): 49 | 50 | ```bash 51 | pip install requests 52 | ``` 53 | 54 | Then run: 55 | 56 | ```bash 57 | python index_articles.py http://localhost:9200 ./million.jsonl 58 | ``` 59 | 60 | #### Term frequencies 61 | 62 | The term and document frequencies are also available via the links below. These values were calculated after routine tokenisation and stop-word removal. 63 | 64 | * [Document frequencies](http://research.signalmedia.co/store/merged-df.edn) 65 | 66 | * [Inverse document frequencies](http://research.signalmedia.co/store/merged-idf.edn) 67 | 68 | * [Term frequencies](http://research.signalmedia.co/store/merged-df-totalf.edn) 69 | 70 | These files are in [edn format](https://github.com/edn-format/edn). 71 | 72 | 73 | ### TREC 74 | 75 | #### Signal-1M-Convert-To-TREC 76 | A script to convert the [Signal Media One-Million News Articles Dataset](http://research.signalmedia.co/newsir16/signal-dataset.html) to TREC format.
77 | The TREC format allows researchers to index the dataset using popular Information Retrieval platforms such as [Terrier](http://terrier.org). 78 | 79 | #### Running the script 80 | After obtaining the dataset through [this form](http://goo.gl/forms/5i4KldoWIX), you can extract the JSONL file from the downloaded Gzip file. 81 | Then run the script like this: 82 | 83 | ```bash 84 | python convert-to-trec.py -i <inputfile> -o <outputfile> 85 | ``` 86 | 87 | #### Indexing the dataset with Terrier 88 | We recommend using the terrier.properties file included in this repository to index the dataset with Terrier. 89 | In your Terrier etc folder, add a text file "signal.spec" with one line containing the path to the file you created above (the TREC formatted dataset). 90 | -------------------------------------------------------------------------------- /terrier.properties: -------------------------------------------------------------------------------- 1 | #This file provides some examples of how to set up properties. 2 | 3 | #index path - where terrier stores its index 4 | #terrier.index.path=/usr/local/terrier/var/index/ 5 | 6 | #etc path - terrier configuration files 7 | #terrier.etc=/usr/local/terrier/etc/ 8 | 9 | #share path - files from the distribution terrier uses 10 | #terrier.share=/usr/local/terrier/share/ 11 | 12 | 13 | ################################################################### 14 | #indexing settings: 15 | ################################################################### 16 | 17 | #collection.spec is where a Collection object expects to find information 18 | #about how it should be initialised. E.g. for a TREC collection, it's a file 19 | #containing the list of files it should index.
20 | collection.spec=etc/signal.spec 21 | #default collection.spec is etc/collection.spec 22 | 23 | #To record term positions (blocks) in the index, set 24 | #block.indexing=true 25 | 26 | #block indexing can be controlled using several properties: 27 | #blocks.size=1 - this property sets the accuracy of one block 28 | #blocks.max - how many blocks to record in a document 29 | 30 | #fields: Terrier can save the frequency of term occurrences in various tags 31 | #use FieldTags.process to specify the fields to record during indexing 32 | FieldTags.process=TITLE,ELSE 33 | #note that ELSE is a special field, which is everything not in one of the other fields 34 | 35 | #which Collection class should be used for indexing? 36 | #or if you're indexing non-English collections: 37 | #trec.collection.class=TRECUTFCollection 38 | 39 | ################################################################### 40 | #TRECCollection specific properties 41 | ################################################################### 42 | 43 | #If you're using TRECCollection to index, you should specify the tags that should be indexed: 44 | #document tags specification 45 | #for processing the contents of 46 | #the documents, ignoring DOCHDR 47 | TrecDocTags.doctag=DOC 48 | TrecDocTags.idtag=DOCNO 49 | TrecDocTags.skip=DOCHDR 50 | #set to false if the tags can be of various case, e.g. DOC and DoC 51 | TrecDocTags.casesensitive=true 52 | 53 | 54 | indexer.meta.forward.keys=docno,SOURCE,TITLE,MEDIATYPE,PUBLISHED 55 | indexer.meta.forward.keylens=40,200,500,500,20 56 | 57 | ################################################################### 58 | #Web Interface and Query-biased summarisation 59 | ################################################################### 60 | 61 | #TRECWebCollection extends the traditional TRECCollection by saving meta-data from the 62 | #document headers (url,crawldate,etc) as document properties.
#trec.collection.class=TRECWebCollection 64 | 65 | #We want to save two additional tags as abstracts in the properties of each document, called title and body 66 | #TaggedDocument.abstracts=title,body 67 | 68 | #They come from the tags in the TRECCollection, in this case the WT2G collection. We will save the 69 | #TITLE tags in the title abstract. ELSE is a special case that saves everything that is not added to 70 | #an existing abstract (everything but the TITLE tag). In this case we are saving ELSE to the body abstract. 71 | #TaggedDocument.abstracts.tags=TITLE,ELSE 72 | 73 | #We also need to specify how many characters we should save in each field 74 | #TaggedDocument.abstracts.lengths=120,500 75 | 76 | #Next we need to tell the meta index to save all of these extra document properties 77 | #so that they are not thrown away. In addition to the docno which is always available, 78 | #and the title and body abstracts saved by TaggedDocument, we will also save the url and 79 | #crawldate which are added to each document by the TRECWebCollection class. 80 | #indexer.meta.forward.keys=docno,title,body,url,crawldate 81 | 82 | #Again, we need to specify how many characters long each meta index entry is. Note that the title and body 83 | #abstract lengths should be the same here as in TaggedDocument.abstracts.
#indexer.meta.forward.keylens=32,120,500,140,35 85 | 86 | ################################################################### 87 | #Collections for WARC files 88 | ################################################################### 89 | #these collection classes have no properties to configure them 90 | #trec.collection.class=WARC09Collection 91 | #trec.collection.class=WARC018Collection 92 | 93 | 94 | ################################################################### 95 | #SimpleFileCollection specific properties 96 | ################################################################### 97 | #trec.collection.class=SimpleFileCollection 98 | 99 | #indexing.simplefilecollection.extensionsparsers - use this to define parsers for known file extensions 100 | #indexing.simplefilecollection.extensionsparsers=txt:FileDocument,text:FileDocument,tex:FileDocument,bib:FileDocument, pdf:PDFDocument,html:HTMLDocument,htm:HTMLDocument,xhtml:HTMLDocument, xml:HTMLDocument,doc:MSWordDocument,ppt:MSPowerpointDocument,xls:MSExcelDocument 101 | 102 | #indexing.simplefilecollection.defaultparser 103 | #if this is defined, terrier will attempt to open any file it doesn't have an explicit parser for using the given parser 104 | #indexing.simplefilecollection.defaultparser=FileDocument 105 | 106 | ################################################################### 107 | #SimpleXMLCollection specific properties 108 | ################################################################### 109 | #trec.collection.class=SimpleXMLCollection 110 | 111 | #what tag defines the document 112 | xml.doctag=document 113 | #what tag defines the document number 114 | xml.idtag=docno 115 | #what tags hold text to be indexed 116 | xml.terms=text 117 | #will the text be non-English? 118 | #string.use_utf=false 119 | 120 | #Terrier has two indexing strategies: 121 | #classical two-pass and single-pass.
Different properties control each. 122 | 123 | ##################### 124 | #(a) classical two-pass indexing 125 | ##################### 126 | #control the number of terms to invert at once (defined by # of pointers). 127 | invertedfile.processpointers=20000000 128 | #default is less for block indexing 129 | invertedfile.processpointers=2000000 130 | #reduce these if you experience OutOfMemory while inverting the direct file 131 | 132 | 133 | #################### 134 | #(b) single-pass indexing 135 | ##################### 136 | #the properties about single-pass indexing are related to memory consumption: 137 | # 138 | #memory.reserved: least amount of memory (bytes) before committing a run to disk. Set this too low and 139 | #an OutOfMemoryError will occur while saving the run. Set this too high and 140 | #more runs will be saved, making indexing slower 141 | memory.reserved=50000000 142 | #memory.heap.usage: how much of the heap must be allocated (%) before a run can be committed to disk 143 | #set this too low and runs will be committed very often, before the java heap has reached full size. 144 | memory.heap.usage=0.85 145 | #docs.checks: how often to check memory consumption. Set this too high and OutOfMemoryError may 146 | #occur between memory checks (i.e. {docs.checks} docs may fill {memory.reserved}+1 MB). 147 | docs.checks=20 148 | 149 | ##################################################### 150 | #shared indexing and retrieval settings: 151 | ##################################################### 152 | #stop-words file, relative paths are assumed to be in terrier.share 153 | stopwords.filename=stopword-list.txt 154 | 155 | 156 | #the processing stages a term goes through. Following is 157 | #the default setting: 158 | termpipelines=Stopwords,PorterStemmer 159 | #Using the above default setting, the system checks whether a 160 | #term is a stop-word and then stems it if it is not.
If you 161 | #want to keep all the stop-words and use a weak version of 162 | #Porter's stemmer, the setting should be as follows: 163 | #termpipelines=WeakPorterStemmer 164 | 165 | #The stemmers have changed since Terrier v2.0. If you wish to use the 166 | #older stemmers, use TRv2PorterStemmer and TRv2WeakPorterStemmer 167 | 168 | #If you wish to keep all terms unchanged, use 169 | #termpipelines= 170 | 171 | 172 | ##################################################### 173 | # retrieval controls 174 | ##################################################### 175 | 176 | #default controls for query expansion 177 | querying.postprocesses.order=QueryExpansion 178 | querying.postprocesses.controls=qe:QueryExpansion 179 | 180 | #default controls for the web-based interface. SimpleDecorate 181 | #is the simplest metadata decorator. For more control, see 182 | #Decorate. 183 | querying.postfilters.order=SimpleDecorate,SiteFilter,Scope 184 | querying.postfilters.controls=decorate:SimpleDecorate,site:SiteFilter,scope:Scope 185 | 186 | #default and allowed controls 187 | querying.default.controls= 188 | querying.allowed.controls=scope,qe,qemodel,start,end,site 189 | #start control - sets the first result to be retrieved (defaults to rank 0) 190 | #end control - sets the last result to be retrieved (defaults to 0 - display all results) 191 | #qe control - enable query expansion (needs a direct index) 192 | #qemodel control - specify the name of the model for QueryExpansion (defaults to Bo1, set automatically by TRECQuerying to the value of the trec.qe.model property) 193 | #site control - filter the retrieved results by the suffix of their URL's hostname 194 | #scope control - filter the retrieved results by the prefix of their docno 195 | 196 | ##################################################### 197 | #Advanced Web-based Interface Controls 198 | ##################################################### 199 | 200 | #an example of using the full Decorate class; it does text
highlighting 201 | #and query-biased summarisation. 202 | #querying.postfilters.controls=decorate:org.terrier.querying.Decorate 203 | #querying.postfilters.order=org.terrier.querying.Decorate 204 | 205 | #here we turn decorate on, generate a summary from the body meta index entry and add emphasis 206 | #to the query terms in both the title and body fields. Note that summaries overwrite the original 207 | #field, while emphasis creates a new meta index entry of the form fieldname_emph, e.g. title_emph 208 | #querying.default.controls=start:0,end:9,decorate:on,summaries:body,emphasis:title;body 209 | #querying.allowed.controls=scope,decorate,start,end 210 | 211 | ##################################################### 212 | #batch retrieval 213 | ##################################################### 214 | 215 | #TrecTerrier -r does batch querying, by using a QuerySource to find 216 | #a stream of queries to retrieve documents for 217 | 218 | #the list of query files to process is normally listed in the file etc/trec.topics.list 219 | #instead, you can override this by specifying a query file 220 | #trec.topics=/path/to/some/queries.gz 221 | 222 | 223 | 224 | #you can change what QuerySource is used. 225 | #the default QuerySource, TRECQuery, parses TREC notation topic files 226 | #trec.topics.parser=TRECQuery 227 | 228 | #TRECQuery can be configured as follows: 229 | #query tags specification 230 | #TrecQueryTags.doctag stands for the name of the tag that marks 231 | #the start of a TREC topic. 232 | TrecQueryTags.doctag=TOP 233 | #TrecQueryTags.idtag stands for the name of the tag that marks 234 | #the id of a TREC topic. 235 | TrecQueryTags.idtag=NUM 236 | #TrecQueryTags.process stands for the fields in a TREC 237 | #topic to process. TrecQueryTags.skip stands for the fields 238 | #in a TREC topic to ignore.
Following is the default setting: 239 | TrecQueryTags.process=TOP,NUM,TITLE 240 | TrecQueryTags.skip=DESC,NARR 241 | #Using the above default setting, Terrier processes the titles 242 | #in a TREC topic, while ignoring descriptions and narratives. If 243 | #you want to run a querying process with description-only 244 | #queries, the setting should be: 245 | #TrecQueryTags.process=TOP,NUM,DESC 246 | #TrecQueryTags.skip=TITLE,NARR 247 | #Note that "TITLE" is in the list of tags to ignore. 248 | 249 | #Alternatively, topics may be in a single file 250 | #trec.topics.parser=SingleLineTRECQuery 251 | #the first token on each line is the query id 252 | SingleLineTRECQuery.queryid.exists=true 253 | #should periods be removed from the query stream (they break the query parser) 254 | #SingleLineTRECQuery.periods.allowed=false 255 | 256 | 257 | #batch results are written to var/results/ 258 | #you can override this with 259 | #trec.results=/path/to/results/for/WT2G 260 | #files are written to this folder using the querycounter. However, this is a problem if multiple Terrier instances 261 | #are writing to the same folder 262 | #trec.querycounter.type=random 263 | 264 | 265 | #or just specify the exact filename for the results 266 | #trec.results.file=/path/to/myExperiment.res 267 | 268 | #you can override the format of the results file: 269 | #trec.querying.outputformat=my.package.some.other.OutputFormat 270 | 271 | ##################################################### 272 | # query expansion during retrieval 273 | ##################################################### 274 | 275 | #Terrier also provides the functionality of query 276 | #expansion (pseudo-relevance feedback). Using the following setting, the query 277 | #expansion component extracts the 10 most informative 278 | #terms from the 3 top-returned documents 279 | #in first-pass retrieval as the expanded query terms.
280 | 281 | expansion.documents=3 282 | expansion.terms=10 283 | 284 | #If the above two properties are not specified 285 | #in terrier.properties, Terrier will use the default 286 | #setting which is expansion.documents=3 and 287 | #expansion.terms=10. 288 | 289 | #Terrier applies a parameter-free query expansion mechanism 290 | #by default. An alternative approach is Rocchio's query 291 | #expansion. To enable the latter, the user should set the property 292 | #parameter.free.expansion to false: 293 | 294 | #parameter.free.expansion=false 295 | 296 | #To set the parameter beta of Rocchio's query expansion, 297 | #input rocchio.beta=X in terrier.properties. For example: 298 | 299 | #rocchio.beta=0.8 300 | 301 | #If rocchio.beta is not specified in terrier.properties, Terrier 302 | #applies rocchio.beta=0.4 by default. 303 | 304 | ##################################################### 305 | # field-based models 306 | ##################################################### 307 | #Field based models take the frequency in each field 308 | #into account. 309 | #trec.model=BM25F 310 | #trec.model=PL2F 311 | #For these models, you need a length normalisation property 312 | #for each field, starting at 0. Note that field names (as specified by FieldTags.process) 313 | #are NOT used here. 314 | #c.0=1 315 | #c.1=1 316 | #etc 317 | 318 | #Similarly, each field needs a weight. Some models may implicitly have a restriction 319 | #on these weights (e.g.
sum to 1) 320 | #w.0=1 321 | #w.1=1 322 | #etc 323 | ##################################################### 324 | # term dependence models 325 | ##################################################### 326 | 327 | #choose between the DFR and Markov Random Fields dependence models 328 | #matching.dsms=DFRDependenceScoreModifier 329 | #matching.dsms=MRFDependenceScoreModifier 330 | 331 | #choose SD or FD (sequential or full dependence) 332 | proximity.dependency.type=SD 333 | 334 | #size of window: 2 is an exact phrase, 10 is a larger window 335 | proximity.ngram.length=2 336 | #weight of SD 337 | proximity.w_o=1 338 | #weight of FD 339 | proximity.w_u=1 340 | 341 | #for DFR, choose between pBiL and pBiL2 342 | #true for pBiL 343 | proximity.norm2=true 344 | #default length normalisation is 1 345 | proximity.norm2.c=1 346 | 347 | #for MRF 348 | #mu in the Dirichlet formula. 349 | mrf.mu=4000 350 | --------------------------------------------------------------------------------
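To see what the conversion step produces without running the full script over the dataset, the tagging logic of convert-to-trec.py can be exercised as a standalone sketch. The tag names below follow the settings in terrier.properties (`TrecDocTags.doctag=DOC`, `TrecDocTags.idtag=DOCNO`, and the `SOURCE`/`MEDIATYPE`/`PUBLISHED` meta keys); `article_to_trec` and the sample article are invented for illustration:

```python
import json

def article_to_trec(line):
    """Render one JSONL article as a TREC-style tagged document."""
    a = json.loads(line)
    return (u"<DOC>\n"
            u"<DOCNO>{}</DOCNO>\n"
            u"<SOURCE>{}</SOURCE>\n"
            u"<MEDIATYPE>{}</MEDIATYPE>\n"
            u"<PUBLISHED>{}</PUBLISHED>\n"
            u"<TEXT>{}</TEXT>\n"
            u"</DOC>\n").format(a["id"], a["source"], a["media-type"],
                                a["published"], a["content"])

# A fabricated one-line article in the dataset's JSONL shape.
sample = json.dumps({"id": "x1", "source": "Example Wire",
                     "media-type": "News",
                     "published": "2015-09-01T00:00:00Z",
                     "content": "Hello world."})
print(article_to_trec(sample))
```

Each output document is wrapped in `DOC`/`DOCNO` tags, which is what the `TrecDocTags` settings above expect when Terrier parses the converted file.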