├── LICENSE ├── README.md ├── data ├── queries.json └── tagme_annotations.json ├── nordlys ├── __init__.py ├── config.py ├── elr │ ├── __init__.py │ ├── field_mapping.py │ ├── query_annot.py │ ├── retrieval_elr.py │ ├── scorer_elr.py │ └── top_fields.py └── retrieval │ ├── __init__.py │ ├── indexer.py │ ├── lucene_tools.py │ ├── results.py │ ├── retrieval.py │ └── scorer.py ├── qrels ├── qrels-INEX_LD.txt ├── qrels-ListSearch.txt ├── qrels-QALD.txt ├── qrels-SemSearch_ES.txt ├── qrels-v3.9.txt └── queries.txt ├── requirements.txt └── runs ├── default_params(Table5) ├── fsdm(default).treceval ├── fsdm_elr(default).treceval ├── lm_elr(default).treceval ├── prms_elr(default).treceval ├── sdm(default).treceval └── sdm_elr(default).treceval ├── fsdm.treceval ├── fsdm_elr.treceval ├── lm.treceval ├── lm_elr.treceval ├── mlm-all.treceval ├── mlm-all_elr.treceval ├── mlm-tc.treceval ├── mlm-tc_elr.treceval ├── prms.treceval ├── prms_elr.treceval ├── sdm.treceval └── sdm_elr.treceval /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 Faegheh Hasibi 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Entity Linking integrated Retrieval (ELR) 2 | 3 | This repository contains resources developed for the following paper: 4 | 5 | F. Hasibi, K. Balog, and S.E. Bratsberg. “Exploiting Entity Linking in Queries for Entity Retrieval”, 6 | In Proceedings of the ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR ’16), Newark, DE, USA, Sep 2016. 7 | 8 | See the [paper](http://hasibi.com/files/ictir2016-elr.pdf) and [presentation](http://www.slideshare.net/FaeghehHasibi/ictir2016-elr) for detailed information. 9 | 10 | The repository is structured as follows: 11 | 12 | - `nordlys/`: Code required for running entity retrieval methods. 13 | - `data/`: Query set and data required for running the code. 14 | - `qrels/`: Qrels files for the [DBpedia-entity test collection](http://krisztianbalog.com/resources/sigir-2013-dbpedia/) (version 3.9). 15 | - `runs/`: Run files reported in the paper. 16 | 17 | 18 | ## Usage 19 | 20 | Use the following command to run the code: 21 | 22 | ``` 23 | python -m nordlys.elr.retrieval_elr 24 | ``` 25 | This command produces retrieval results using the parameters recommended in the paper. 26 | For a description of the available parameters and how to set them, run `python -m nordlys.elr.retrieval_elr -h`. 27 | 28 | Python 2.7 is required to run the code. 29 | 30 | ## Code 31 | 32 | See the `nordlys/elr/scorer_elr.py` file for the actual implementation of the ELR framework and the baseline methods.
33 | 34 | 35 | ## Data 36 | 37 | The indices required for running this code are described in the paper. You can also contact the authors to get the indices. 38 | The following files under the `data` folder are also required for running the code: 39 | 40 | - `queries.json`: The DBpedia-entity queries, stopped as described in the paper. 41 | - `tagme_annotations.json`: Entity annotations of the queries obtained from the [TAGME API](https://tagme.d4science.org/tagme/). 42 | 43 | 44 | ## Citation 45 | 46 | If you use the resources presented in this repository, please cite: 47 | 48 | ``` 49 | @inproceedings{Hasibi:2016:ELR, 50 | author = {Hasibi, Faegheh and Balog, Krisztian and Bratsberg, Svein Erik}, 51 | title = {Exploiting Entity Linking in Queries for Entity Retrieval}, 52 | booktitle = {Proceedings of ACM SIGIR International Conference on the Theory of Information Retrieval}, 53 | series = {ICTIR '16}, 54 | year = {2016}, 55 | pages = {209--218}, 56 | publisher = {ACM}, 57 | DOI = {http://dx.doi.org/10.1145/2970398.2970406} 58 | } 59 | ``` 60 | 61 | ## Contact 62 | 63 | Should you have any questions, please contact Faegheh Hasibi.
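As a minimal sketch of how the query file described under Data might be inspected, the snippet below groups query IDs by their query-set prefix. It assumes only what is visible in `data/queries.json`: a flat JSON object mapping query IDs to query strings, with IDs of the form `<query set>-<number>`. The function names `load_queries` and `group_by_query_set` are hypothetical and are not part of nordlys.

```python
from __future__ import print_function

import json
from collections import defaultdict


def load_queries(path):
    """Read a {query_id: query_text} JSON file such as data/queries.json."""
    with open(path) as f:
        return json.load(f)


def group_by_query_set(queries):
    """Group query IDs by the prefix before the final hyphen (hypothetical helper)."""
    groups = defaultdict(list)
    for qid in sorted(queries):
        prefix = qid.rsplit("-", 1)[0]
        groups[prefix].append(qid)
    return dict(groups)


# A few entries copied from data/queries.json, so the sketch runs without the file.
sample = {
    "INEX_LD-2009039": "roman architecture",
    "INEX_XER-62": "Neil Gaiman novels",
    "QALD2_te-35": "developed Skype?",
    "SemSearch_ES-16": "brooklyn bridge",
    "SemSearch_ES-19": "carl lewis",
    "TREC_Entity-18": "Members of the band Jefferson Airplane.",
}

groups = group_by_query_set(sample)
for prefix in sorted(groups):
    print(prefix, len(groups[prefix]))
```

With the real file, `group_by_query_set(load_queries("data/queries.json"))` would give the per-set query counts, which may be useful when comparing against the per-set qrels files under `qrels/`.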
64 | -------------------------------------------------------------------------------- /data/queries.json: -------------------------------------------------------------------------------- 1 | { 2 | "INEX_LD-2009022": "Szechwan dish food cuisine", 3 | "INEX_LD-2009039": "roman architecture", 4 | "INEX_LD-2009053": "finland car industry manufacturer saab sisu", 5 | "INEX_LD-2009061": "france second world war normandy", 6 | "INEX_LD-2009062": "social network group selection", 7 | "INEX_LD-2009063": "D-Day normandy invasion", 8 | "INEX_LD-2009074": "web ranking scoring algorithm", 9 | "INEX_LD-2009096": "Eiffel", 10 | "INEX_LD-2009111": "europe solar power facility", 11 | "INEX_LD-2009115": "virtual museums", 12 | "INEX_LD-2010004": "Indian food", 13 | "INEX_LD-2010014": "composer museum", 14 | "INEX_LD-2010019": "gallo roman architecture in paris", 15 | "INEX_LD-2010020": "electricity source in France", 16 | "INEX_LD-2010037": "social network API", 17 | "INEX_LD-2010043": "List of films from the surrealist category", 18 | "INEX_LD-2010057": "Einstein Relativity theory", 19 | "INEX_LD-2010069": "summer flowers", 20 | "INEX_LD-2010100": "house concrete wood", 21 | "INEX_LD-2010106": "organic food advantages disadvantages", 22 | "INEX_LD-20120111": "vietnam war movie", 23 | "INEX_LD-20120112": "vietnam war facts", 24 | "INEX_LD-20120121": "vietnam food recipes", 25 | "INEX_LD-20120122": "vietnamese food blog", 26 | "INEX_LD-20120131": "vietnam travel national park", 27 | "INEX_LD-20120132": "vietnam travel airports", 28 | "INEX_LD-20120211": "guitar chord tuning", 29 | "INEX_LD-20120212": "guitar chord minor", 30 | "INEX_LD-20120221": "guitar classical flamenco", 31 | "INEX_LD-20120222": "guitar classical bach", 32 | "INEX_LD-20120231": "guitar origin Russia", 33 | "INEX_LD-20120232": "guitar origin blues", 34 | "INEX_LD-20120311": "tango culture movies", 35 | "INEX_LD-20120312": "tango culture countries", 36 | "INEX_LD-20120321": "tango music composers", 37 | 
"INEX_LD-20120322": "tango music instruments", 38 | "INEX_LD-20120331": "tango dance styles", 39 | "INEX_LD-20120332": "tango dance history", 40 | "INEX_LD-20120411": "bicycle sport races", 41 | "INEX_LD-20120412": "bicycle sport disciplines", 42 | "INEX_LD-20120421": "bicycle holiday towns", 43 | "INEX_LD-20120422": "bicycle holiday nature", 44 | "INEX_LD-20120431": "bicycle benefits health", 45 | "INEX_LD-20120432": "bicycle benefits environment", 46 | "INEX_LD-20120511": "female rock singers", 47 | "INEX_LD-20120512": "south korean girl groups", 48 | "INEX_LD-20120521": "electronic music genres", 49 | "INEX_LD-20120522": "digital music notation formats", 50 | "INEX_LD-20120531": "music conferences", 51 | "INEX_LD-20120532": "intellectual property rights lobby", 52 | "INEX_LD-2012301": "Niagara falls origin lake", 53 | "INEX_LD-2012303": "Valley fever fungal infection San Joaquin", 54 | "INEX_LD-2012305": "North Dakota's lowest river of another colour", 55 | "INEX_LD-2012307": "July, 1850 president died Millard Fillmore sworn following day", 56 | "INEX_LD-2012309": "residents small island city-state Malay Peninsula Chinese", 57 | "INEX_LD-2012311": "John Lennon Yoko Ono album Starting Over", 58 | "INEX_LD-2012313": "John Turturro 1991 Coen Brothers film", 59 | "INEX_LD-2012315": "Baguio Quezon City Manila official independence 1945", 60 | "INEX_LD-2012317": "daggeroso inclined to use a dagger novel Sons and Lovers", 61 | "INEX_LD-2012318": "Directed Bela Glen Glenda Bride Monster Plan 9 Outer Space", 62 | "INEX_LD-2012319": "1994 short story collection Alice Munro is Open", 63 | "INEX_LD-2012321": "Asian port state-city Sir Stamford Raffles", 64 | "INEX_LD-2012323": "Large glaciers island nation Langjokull Hofsjokull Vatnajokull", 65 | "INEX_LD-2012325": "successor James G. 
Blaine studied law", 66 | "INEX_LD-2012327": "Beloved author African-American Nobel Prize Literature", 67 | "INEX_LD-2012329": "Sweden Iceland currency", 68 | "INEX_LD-2012331": "Seoul Korea river name ethnic group China", 69 | "INEX_LD-2012333": "Prime minister Canada nicknamed Silver-Tongued Laurier longest unbroken term", 70 | "INEX_LD-2012335": "U.S. president authorise nuclear weapons against Japan", 71 | "INEX_LD-2012336": "1906 territory Papua island Australian", 72 | "INEX_LD-2012337": "Texas city Baylor University tornado 1953", 73 | "INEX_LD-2012339": "Nelson Mandela John Dube", 74 | "INEX_LD-2012341": "1997 Houston airport president", 75 | "INEX_LD-2012343": "The Heart of a Woman poet's autobiography", 76 | "INEX_LD-2012345": "Kennedy assassination governor of Texas seriously injured", 77 | "INEX_LD-2012347": "seat Florida country Dade", 78 | "INEX_LD-2012349": "Alexander Nevsky Cathedral Bulgarian city liberation Turks", 79 | "INEX_LD-2012351": "Indian Cuisine dish rice dhal vegetables roti papad", 80 | "INEX_LD-2012353": "country German language", 81 | "INEX_LD-2012354": "greatest guitarist", 82 | "INEX_LD-2012355": "England football player highest paid", 83 | "INEX_LD-2012357": "prima ballerina Bolshoi Theatre 1960", 84 | "INEX_LD-2012359": "Bob Ricker Executive Director the latest front group for the anti-gun movement", 85 | "INEX_LD-2012361": "most famous award winning actor singer", 86 | "INEX_LD-2012363": "American twins famous American professional tennis double players", 87 | "INEX_LD-2012365": "mathematician computer scientist MIT's six inaugural MacVicar Faculty Fellows", 88 | "INEX_LD-2012367": "invented telescope", 89 | "INEX_LD-2012369": "most famous civic-military airports", 90 | "INEX_LD-2012371": "most beautiful railway stations world cities located", 91 | "INEX_LD-2012372": "famous historical battlefields opponents fought", 92 | "INEX_LD-2012373": "birds cannot fly", 93 | "INEX_LD-2012375": "animals lay eggs mammals", 94 | 
"INEX_LD-2012377": "allegedly caused World War I", 95 | "INEX_LD-2012379": "pairs cities same language same longitude different countries", 96 | "INEX_LD-2012381": "movie directors directed a block buster", 97 | "INEX_LD-2012383": "famous computer scientists disappeared at sea", 98 | "INEX_LD-2012385": "famous politicians vegetarians", 99 | "INEX_LD-2012387": "famous river confluence dam constructed", 100 | "INEX_LD-2012389": "frequently visited sharks gulf Indian Ocean", 101 | "INEX_LD-2012390": "baseball player most homeruns national league", 102 | "INEX_XER-100": "Operating systems to Steve Jobs related", 103 | "INEX_XER-106": "Noble english person from the Hundred Years' War", 104 | "INEX_XER-108": "State capitals of the United States of America", 105 | "INEX_XER-109": "National capitals situated on islands", 106 | "INEX_XER-110": "Nobel Prize in Literature winners were also poets", 107 | "INEX_XER-113": "Formula 1 drivers that won the Monaco Grand Prix", 108 | "INEX_XER-114": "Formula one races in Europe", 109 | "INEX_XER-115": "Formula One World Constructors' Champions", 110 | "INEX_XER-116": "Italian nobel prize winners", 111 | "INEX_XER-117": "Musicians appeared in the Blues Brothers movies", 112 | "INEX_XER-118": "French car models in 1960's", 113 | "INEX_XER-119": "Swiss cantons they speak German", 114 | "INEX_XER-121": "US presidents since 1960", 115 | "INEX_XER-122": "Movies with eight or more Academy Awards", 116 | "INEX_XER-123": "FIFA world cup national team winners since 1974", 117 | "INEX_XER-124": "Novels that won the Booker Prize", 118 | "INEX_XER-125": "countries have won the FIFA world cup", 119 | "INEX_XER-126": "toy train manufacturers that are still in business", 120 | "INEX_XER-127": "german female politicians", 121 | "INEX_XER-128": "Bond girls", 122 | "INEX_XER-129": "Science fiction book written in the 1980", 123 | "INEX_XER-130": "Star Trek Captains", 124 | "INEX_XER-132": "living nordic classical composers", 125 | "INEX_XER-133": "EU 
countries", 126 | "INEX_XER-134": "record-breaking sprinters in male 100-meter sprints", 127 | "INEX_XER-135": "professional baseball team in Japan", 128 | "INEX_XER-136": "Japanese players in Major League Baseball", 129 | "INEX_XER-138": "National Parks East Coast Canada US", 130 | "INEX_XER-139": "Films directed by Akira Kurosawa", 131 | "INEX_XER-140": "Airports in Germany", 132 | "INEX_XER-141": "Universities in Catalunya", 133 | "INEX_XER-143": "Hanseatic league in Germany in the Netherlands Circle", 134 | "INEX_XER-144": "chess world champions", 135 | "INEX_XER-147": "Chemical elements that are named after people", 136 | "INEX_XER-60": "olympic classes dinghy sailing", 137 | "INEX_XER-62": "Neil Gaiman novels", 138 | "INEX_XER-63": "Hugo awarded best novels", 139 | "INEX_XER-64": "Alan Moore graphic novels adapted to film", 140 | "INEX_XER-65": "Pacific navigators Australia explorers", 141 | "INEX_XER-67": "Ferris and observation wheels", 142 | "INEX_XER-72": "films shot in Venice", 143 | "INEX_XER-73": "magazines about indie-music", 144 | "INEX_XER-74": "circus mammals", 145 | "INEX_XER-79": "Works by Charles Rennie Mackintosh", 146 | "INEX_XER-81": "Movies about English hooligans", 147 | "INEX_XER-86": "List of countries in World War Two", 148 | "INEX_XER-87": "Axis powers of World War II", 149 | "INEX_XER-88": "Nordic authors are known for children's literature", 150 | "INEX_XER-91": "Paul Auster novels", 151 | "INEX_XER-94": "Hybrid cars sold in Europe", 152 | "INEX_XER-95": "Tom Hanks movies he plays a leading role.", 153 | "INEX_XER-96": "Pure object-oriented programing languages", 154 | "INEX_XER-97": "Compilers that can compile both C and C++", 155 | "INEX_XER-98": "Makers of lawn tennis rackets", 156 | "INEX_XER-99": "Computer systems that have a recursive acronym for the name", 157 | "QALD2_te-1": "German cities have more than 250000 inhabitants?", 158 | "QALD2_te-100": "produces Orangina?", 159 | "QALD2_te-11": "is the Formula 1 race driver with 
the most races?", 160 | "QALD2_te-12": "all world heritage sites designated within the past five years.", 161 | "QALD2_te-13": "is the youngest player in the Premier League?", 162 | "QALD2_te-14": "all members of Prodigy.", 163 | "QALD2_te-15": "is the longest river?", 164 | "QALD2_te-17": "all cars that are produced in Germany.", 165 | "QALD2_te-19": "all people that were born in Vienna and died in Berlin.", 166 | "QALD2_te-2": "was the successor of John F. Kennedy?", 167 | "QALD2_te-21": "is the capital of Canada?", 168 | "QALD2_te-22": "is the governor of Texas?", 169 | "QALD2_te-24": "was the father of Queen Elizabeth II?", 170 | "QALD2_te-25": "U.S. state has been admitted latest?", 171 | "QALD2_te-27": "Sean Parnell is the governor of U.S. state?", 172 | "QALD2_te-28": "all movies directed by Francis Ford Coppola.", 173 | "QALD2_te-29": "all actors starring in movies directed by and starring William Shatner.", 174 | "QALD2_te-3": "is the mayor of Berlin?", 175 | "QALD2_te-31": "all current Methodist national leaders.", 176 | "QALD2_te-33": "all Australian nonprofit organizations.", 177 | "QALD2_te-34": "In military conflicts did Lawrence of Arabia participate?", 178 | "QALD2_te-35": "developed Skype?", 179 | "QALD2_te-39": "all companies in Munich.", 180 | "QALD2_te-40": "List all boardgames by GMT.", 181 | "QALD2_te-41": "founded Intel?", 182 | "QALD2_te-42": "is the husband of Amanda Palmer?", 183 | "QALD2_te-43": "all breeds of the German Shepherd dog.", 184 | "QALD2_te-44": "cities does the Weser flow through?", 185 | "QALD2_te-45": "countries are connected by the Rhine?", 186 | "QALD2_te-46": "professional surfers were born on the Philippines?", 187 | "QALD2_te-48": "In UK city are the headquarters of the MI6?", 188 | "QALD2_te-49": "other weapons did the designer of the Uzi develop?", 189 | "QALD2_te-5": "is the second highest mountain on Earth?", 190 | "QALD2_te-51": "all Frisian islands that belong to the Netherlands.", 191 | "QALD2_te-53": "is the 
ruling party in Lisbon?", 192 | "QALD2_te-55": "Greek goddesses dwelt on Mount Olympus?", 193 | "QALD2_te-57": "the Apollo 14 astronauts.", 194 | "QALD2_te-58": "is the time zone of Salt Lake City?", 195 | "QALD2_te-59": "U.S. states are in the same timezone as Utah?", 196 | "QALD2_te-6": "all professional skateboarders from Sweden.", 197 | "QALD2_te-60": "a list of all lakes in Denmark.", 198 | "QALD2_te-63": "all Argentine films.", 199 | "QALD2_te-64": "all launch pads operated by NASA.", 200 | "QALD2_te-65": "instruments did John Lennon play?", 201 | "QALD2_te-66": "ships were called after Benjamin Franklin?", 202 | "QALD2_te-67": "are the parents of the wife of Juan Carlos I?", 203 | "QALD2_te-72": "In U.S. state is Area 51 located?", 204 | "QALD2_te-75": "daughters of British earls died in the same place they were born in?", 205 | "QALD2_te-76": "List the children of Margaret Thatcher.", 206 | "QALD2_te-77": "was called Scarface?", 207 | "QALD2_te-8": "To countries does the Himalayan mountain system extend?", 208 | "QALD2_te-80": "all books by William Goldman with more than 300 pages.", 209 | "QALD2_te-81": "books by Kerouac were published by Viking Press?", 210 | "QALD2_te-82": "a list of all American inventions.", 211 | "QALD2_te-84": "created the comic Captain America?", 212 | "QALD2_te-86": "is the largest city in Australia?", 213 | "QALD2_te-87": "composed the music for Harold and Maude?", 214 | "QALD2_te-88": "films starring Clint Eastwood did he direct himself?", 215 | "QALD2_te-89": "In city was the former Dutch queen Juliana buried?", 216 | "QALD2_te-9": "a list of all trumpet players that were bandleaders.", 217 | "QALD2_te-90": "is the residence of the prime minister of Spain?", 218 | "QALD2_te-91": "U.S. 
State has the abbreviation MN?", 219 | "QALD2_te-92": "all songs from Bruce Springsteen released between 1980 and 1990.", 220 | "QALD2_te-93": "movies did Sam Raimi direct after Army of Darkness?", 221 | "QALD2_te-95": "wrote the lyrics for the Polish national anthem?", 222 | "QALD2_te-97": "painted The Storm on the Sea of Galilee?", 223 | "QALD2_te-98": "country does the creator of Miffy come from?", 224 | "QALD2_te-99": "For label did Elvis record his first album?", 225 | "QALD2_tr-1": "all female Russian astronauts.", 226 | "QALD2_tr-10": "In country does the Nile start?", 227 | "QALD2_tr-11": "countries have places with more than two caves?", 228 | "QALD2_tr-13": "classis does the Millepede belong to?", 229 | "QALD2_tr-15": "created Goofy?", 230 | "QALD2_tr-16": "the capitals of all countries in Africa.", 231 | "QALD2_tr-17": "all cities in New Jersey with more than 100000 inhabitants.", 232 | "QALD2_tr-18": "museum exhibits The Scream by Munch?", 233 | "QALD2_tr-21": "states border Illinois?", 234 | "QALD2_tr-22": "In country is the Limerick Lake?", 235 | "QALD2_tr-23": "television shows were created by Walt Disney?", 236 | "QALD2_tr-24": "mountain is the highest after the Annapurna?", 237 | "QALD2_tr-25": "In films directed by Garry Marshall was Julia Roberts starring?", 238 | "QALD2_tr-26": "bridges are of the same type as the Manhattan Bridge?", 239 | "QALD2_tr-28": "European countries have a constitutional monarchy?", 240 | "QALD2_tr-29": "awards did WikiLeaks win?", 241 | "QALD2_tr-3": "is the daughter of Bill Clinton married to?", 242 | "QALD2_tr-30": "state of the USA has the highest population density?", 243 | "QALD2_tr-31": "is the currency of the Czech Republic?", 244 | "QALD2_tr-32": "countries in the European Union adopted the Euro?", 245 | "QALD2_tr-34": "countries have more than two official languages?", 246 | "QALD2_tr-35": "is the owner of Universal Studios?", 247 | "QALD2_tr-36": "Through countries does the Yenisei river flow?", 248 | 
"QALD2_tr-38": "monarchs of the United Kingdom were married to a German?", 249 | "QALD2_tr-4": "river does the Brooklyn Bridge cross?", 250 | "QALD2_tr-40": "is the highest mountain in Australia?", 251 | "QALD2_tr-41": "all soccer clubs in Spain.", 252 | "QALD2_tr-42": "are the official languages of the Philippines?", 253 | "QALD2_tr-43": "is the mayor of New York City?", 254 | "QALD2_tr-44": "designed the Brooklyn Bridge?", 255 | "QALD2_tr-45": "telecommunications organizations are located in Belgium?", 256 | "QALD2_tr-47": "is the highest place of Karakoram?", 257 | "QALD2_tr-49": "all companies in the advertising industry.", 258 | "QALD2_tr-50": "did Bruce Carver die from?", 259 | "QALD2_tr-51": "all school types.", 260 | "QALD2_tr-52": "presidents were born in 1945?", 261 | "QALD2_tr-53": "all presidents of the United States.", 262 | "QALD2_tr-54": "was the wife of U.S. president Lincoln?", 263 | "QALD2_tr-55": "developed the video game World of Warcraft?", 264 | "QALD2_tr-57": "List all episodes of the first season of the HBO television series The Sopranos!", 265 | "QALD2_tr-58": "produced the most films?", 266 | "QALD2_tr-59": "all people with first name Jimmy.", 267 | "QALD2_tr-6": "did Abraham Lincoln die?", 268 | "QALD2_tr-61": "mountains are higher than the Nanga Parbat?", 269 | "QALD2_tr-62": "created Wikipedia?", 270 | "QALD2_tr-63": "all actors starring in Batman Begins.", 271 | "QALD2_tr-64": "software has been developed by organizations founded in California?", 272 | "QALD2_tr-65": "companies work in the aerospace industry as well as on nuclear reactor technology?", 273 | "QALD2_tr-68": "actors were born in Germany?", 274 | "QALD2_tr-69": "caves have more than 3 entrances?", 275 | "QALD2_tr-70": "all films produced by Hal Roach.", 276 | "QALD2_tr-71": "all video games published by Mean Hamster Software.", 277 | "QALD2_tr-72": "languages are spoken in Estonia?", 278 | "QALD2_tr-73": "owns Aldi?", 279 | "QALD2_tr-74": "capitals in Europe were host 
cities of the summer olympic games?", 280 | "QALD2_tr-75": "has been the 5th president of the United States of America?", 281 | "QALD2_tr-77": "music albums contain the song Last Christmas?", 282 | "QALD2_tr-78": "all books written by Danielle Steel.", 283 | "QALD2_tr-79": "airports are located in California, USA?", 284 | "QALD2_tr-8": "states of Germany are governed by the Social Democratic Party?", 285 | "QALD2_tr-80": "all Canadian Grunge record labels.", 286 | "QALD2_tr-81": "country has the most official languages?", 287 | "QALD2_tr-82": "In programming language is GIMP written?", 288 | "QALD2_tr-83": "produced films starring Natalie Portman?", 289 | "QALD2_tr-84": "all movies with Tom Cruise.", 290 | "QALD2_tr-85": "In films did Julia Roberts as well as Richard Gere play?", 291 | "QALD2_tr-86": "all female German chancellors.", 292 | "QALD2_tr-87": "wrote the book The pillars of the Earth?", 293 | "QALD2_tr-89": "all soccer clubs in the Premier League.", 294 | "QALD2_tr-9": "U.S. states possess gold minerals?", 295 | "QALD2_tr-91": "organizations were founded in 1950?", 296 | "QALD2_tr-92": "is the highest mountain?", 297 | "SemSearch_ES-1": "44 magnum hunting", 298 | "SemSearch_ES-10": "asheville north carolina", 299 | "SemSearch_ES-100": "YMCA Tampa", 300 | "SemSearch_ES-101": "ashley wagner", 301 | "SemSearch_ES-102": "beach flowers", 302 | "SemSearch_ES-103": "bounce city humble tx", 303 | "SemSearch_ES-104": "bourbonnais il", 304 | "SemSearch_ES-105": "cedar garden apartments", 305 | "SemSearch_ES-106": "chase masterson", 306 | "SemSearch_ES-107": "concord steel", 307 | "SemSearch_ES-108": "danielia cotton", 308 | "SemSearch_ES-109": "david hewlett", 309 | "SemSearch_ES-11": "austin powers", 310 | "SemSearch_ES-111": "eagle rock, ca", 311 | "SemSearch_ES-112": "espresso tv stands", 312 | "SemSearch_ES-114": "glenn frey", 313 | "SemSearch_ES-115": "goodwill of michigan", 314 | "SemSearch_ES-118": "iowa energy", 315 | "SemSearch_ES-119": "john elliott", 
316 | "SemSearch_ES-12": "austin texas", 317 | "SemSearch_ES-120": "lawrence general hospital", 318 | "SemSearch_ES-123": "michael zimmerman", 319 | "SemSearch_ES-124": "motorola bluetooth hs850", 320 | "SemSearch_ES-125": "nokia e73", 321 | "SemSearch_ES-127": "palm tungsten e2 handheld", 322 | "SemSearch_ES-128": "philadelphia neufchatel cheese", 323 | "SemSearch_ES-129": "pizza populous detroit mi", 324 | "SemSearch_ES-13": "banana paper making", 325 | "SemSearch_ES-130": "plymouth police department", 326 | "SemSearch_ES-131": "scpa san diego", 327 | "SemSearch_ES-132": "sealy mattress co", 328 | "SemSearch_ES-133": "sedona hiking trails", 329 | "SemSearch_ES-134": "skye woods", 330 | "SemSearch_ES-135": "spring shoes canada", 331 | "SemSearch_ES-136": "sri lanka government gazette", 332 | "SemSearch_ES-137": "steak express", 333 | "SemSearch_ES-138": "syracuse spca", 334 | "SemSearch_ES-139": "the big texan steak house", 335 | "SemSearch_ES-14": "ben franklin", 336 | "SemSearch_ES-140": "toledo bend realty", 337 | "SemSearch_ES-141": "ventura county court", 338 | "SemSearch_ES-142": "windsor hotel philadelphia", 339 | "SemSearch_ES-15": "bradley center", 340 | "SemSearch_ES-16": "brooklyn bridge", 341 | "SemSearch_ES-17": "butte montana", 342 | "SemSearch_ES-18": "canasta cards", 343 | "SemSearch_ES-19": "carl lewis", 344 | "SemSearch_ES-2": "B. F. 
Skinner", 345 | "SemSearch_ES-20": "carolina", 346 | "SemSearch_ES-21": "charles darwin", 347 | "SemSearch_ES-22": "city of charlotte", 348 | "SemSearch_ES-23": "city of virginia beach", 349 | "SemSearch_ES-24": "coastal carolina", 350 | "SemSearch_ES-25": "david suchet", 351 | "SemSearch_ES-26": "disney orlando", 352 | "SemSearch_ES-27": "earl may", 353 | "SemSearch_ES-28": "el salvador", 354 | "SemSearch_ES-29": "ellis college", 355 | "SemSearch_ES-3": "Bookwork", 356 | "SemSearch_ES-30": "eloan line of credit", 357 | "SemSearch_ES-31": "emery", 358 | "SemSearch_ES-32": "fitzgerald auto mall chambersburg pa", 359 | "SemSearch_ES-33": "harry potter", 360 | "SemSearch_ES-34": "harry potter movie", 361 | "SemSearch_ES-35": "hospice of cincinnati", 362 | "SemSearch_ES-36": "imdb batman returns", 363 | "SemSearch_ES-37": "jack johnson", 364 | "SemSearch_ES-38": "jack the ripper", 365 | "SemSearch_ES-39": "james caldwell high school", 366 | "SemSearch_ES-4": "NAACP Image Awards", 367 | "SemSearch_ES-40": "james clayton md", 368 | "SemSearch_ES-41": "joan of arc", 369 | "SemSearch_ES-42": "john maxwell", 370 | "SemSearch_ES-45": "keith urban", 371 | "SemSearch_ES-47": "king arthur", 372 | "SemSearch_ES-48": "la scala restaurant philadelphia", 373 | "SemSearch_ES-49": "laura bush", 374 | "SemSearch_ES-5": "Scott County", 375 | "SemSearch_ES-50": "laura steele bob and tom", 376 | "SemSearch_ES-51": "lexus of maplewood", 377 | "SemSearch_ES-52": "lincoln park", 378 | "SemSearch_ES-53": "lynchburg virginia", 379 | "SemSearch_ES-54": "marc anthony", 380 | "SemSearch_ES-55": "marcus theaters", 381 | "SemSearch_ES-56": "mario bros", 382 | "SemSearch_ES-57": "martin luther king", 383 | "SemSearch_ES-58": "mason ohio", 384 | "SemSearch_ES-59": "mercy hospital in des moines, ia", 385 | "SemSearch_ES-6": "air wisconsin", 386 | "SemSearch_ES-60": "michael douglas", 387 | "SemSearch_ES-61": "mr rourke fantasy island", 388 | "SemSearch_ES-63": "old winchester shotguns", 389 | 
"SemSearch_ES-64": "omeara ford", 390 | "SemSearch_ES-65": "orlando florida", 391 | "SemSearch_ES-66": "overeaters anonymous", 392 | "SemSearch_ES-67": "ovguide movies", 393 | "SemSearch_ES-68": "pierce county washington", 394 | "SemSearch_ES-69": "piosenki mp3", 395 | "SemSearch_ES-7": "airsoft glock", 396 | "SemSearch_ES-70": "radio italia online", 397 | "SemSearch_ES-71": "richmond virginia", 398 | "SemSearch_ES-72": "rock 103 memphis", 399 | "SemSearch_ES-73": "rowan university", 400 | "SemSearch_ES-74": "sacred heart u", 401 | "SemSearch_ES-75": "sagemont church houston tx", 402 | "SemSearch_ES-76": "san antonio", 403 | "SemSearch_ES-77": "savannah tech", 404 | "SemSearch_ES-78": "sharp pc", 405 | "SemSearch_ES-79": "shobana masala", 406 | "SemSearch_ES-8": "aloha sol", 407 | "SemSearch_ES-80": "sonny and cher", 408 | "SemSearch_ES-81": "south dakota state university", 409 | "SemSearch_ES-82": "st lucia", 410 | "SemSearch_ES-83": "st paul saints", 411 | "SemSearch_ES-84": "the dish danielle fishel", 412 | "SemSearch_ES-85": "the longest yard sale", 413 | "SemSearch_ES-86": "the morning call lehigh valley pa", 414 | "SemSearch_ES-87": "the quick lift", 415 | "SemSearch_ES-88": "thomas jefferson", 416 | "SemSearch_ES-89": "university of north dakota", 417 | "SemSearch_ES-9": "american embassy nairobi", 418 | "SemSearch_ES-90": "university of phoenix", 419 | "SemSearch_ES-91": "westminster abbey", 420 | "SemSearch_ES-93": "08 toyota tundra", 421 | "SemSearch_ES-94": "Hugh Downs", 422 | "SemSearch_ES-95": "MADRID", 423 | "SemSearch_ES-96": "New England Coffee", 424 | "SemSearch_ES-97": "PINK PANTHER 2", 425 | "SemSearch_ES-98": "University of Texas at Austin", 426 | "SemSearch_ES-99": "University of York", 427 | "SemSearch_LS-1": "Apollo astronauts walked on the Moon", 428 | "SemSearch_LS-10": "did nicole kidman have any siblings", 429 | "SemSearch_LS-11": "dioceses of the church of ireland", 430 | "SemSearch_LS-12": "first targets of the atomic bomb", 431 | 
"SemSearch_LS-13": "five great epics of Tamil literature", 432 | "SemSearch_LS-14": "gods dwelt on Mount Olympus", 433 | "SemSearch_LS-16": "hijackers in the September 11 attacks", 434 | "SemSearch_LS-17": "houses of the Russian parliament", 435 | "SemSearch_LS-18": "john lennon, parents", 436 | "SemSearch_LS-19": "kenya's captain in cricket", 437 | "SemSearch_LS-2": "Arab states of the Persian Gulf", 438 | "SemSearch_LS-20": "kublai khan siblings", 439 | "SemSearch_LS-21": "lilly allen parents", 440 | "SemSearch_LS-22": "major leagues in the united states", 441 | "SemSearch_LS-24": "matt berry tv series", 442 | "SemSearch_LS-25": "members of u2?", 443 | "SemSearch_LS-26": "movies starring erykah badu", 444 | "SemSearch_LS-29": "nations Portuguese is an official language", 445 | "SemSearch_LS-3": "astronauts landed on the Moon", 446 | "SemSearch_LS-30": "orders (or 'choirs') of angels", 447 | "SemSearch_LS-31": "permanent members of the UN Security Council", 448 | "SemSearch_LS-32": "presidents depicted on mount rushmore died of shooting", 449 | "SemSearch_LS-33": "provinces and territories of Canada", 450 | "SemSearch_LS-34": "ratt albums", 451 | "SemSearch_LS-35": "republics of the former Yugoslavia", 452 | "SemSearch_LS-36": "revolutionaries of 1959 in Cuba", 453 | "SemSearch_LS-37": "standard axioms of set theory", 454 | "SemSearch_LS-38": "states that border oklahoma", 455 | "SemSearch_LS-39": "ten ancient Greek city-kingdoms of Cyprus", 456 | "SemSearch_LS-4": "Axis powers of World War II", 457 | "SemSearch_LS-40": "the first 13 american states", 458 | "SemSearch_LS-41": "the four of the companions of the prophet", 459 | "SemSearch_LS-42": "twelve tribes or sons of Israel", 460 | "SemSearch_LS-43": "books did paul of tarsus write?", 461 | "SemSearch_LS-44": "languages do they speak in afghanistan", 462 | "SemSearch_LS-46": "the British monarch is also head of state", 463 | "SemSearch_LS-49": "invented the python programming language", 464 | "SemSearch_LS-5": 
"books of the Jewish canon", 465 | "SemSearch_LS-50": "wonders of the ancient world", 466 | "SemSearch_LS-6": "boroughs of New York City", 467 | "SemSearch_LS-7": "Branches of the US military", 468 | "SemSearch_LS-8": "continents in the world", 469 | "SemSearch_LS-9": "degrees of Eastern Orthodox monasticism", 470 | "TREC_Entity-1": "Carriers that Blackberry makes phones for.", 471 | "TREC_Entity-10": "Campuses of Indiana University.", 472 | "TREC_Entity-11": "Donors to the Home Depot Foundation.", 473 | "TREC_Entity-12": "Airlines that Air Canada has code share flights with.", 474 | "TREC_Entity-14": "Authors awarded an Anthony Award at Bouchercon in 2007.", 475 | "TREC_Entity-15": "Universities that are members of the SEC conference for football.", 476 | "TREC_Entity-16": "Sponsors of the Mancuso quilt festivals.", 477 | "TREC_Entity-17": "Chefs with a show on the Food Network.", 478 | "TREC_Entity-18": "Members of the band Jefferson Airplane.", 479 | "TREC_Entity-19": "Companies that John Hennessey serves on the board of.", 480 | "TREC_Entity-2": "Winners of the ACM Athena award.", 481 | "TREC_Entity-20": "Scotch whisky distilleries on the island of Islay.", 482 | "TREC_Entity-4": "Professional sports teams in Philadelphia.", 483 | "TREC_Entity-5": "Products of Medimmune, Inc.", 484 | "TREC_Entity-6": "Organizations that award Nobel prizes.", 485 | "TREC_Entity-7": "Airlines that currently use Boeing 747 planes.", 486 | "TREC_Entity-9": "Members of The Beaux Arts Trio." 487 | } -------------------------------------------------------------------------------- /nordlys/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | -------------------------------------------------------------------------------- /nordlys/config.py: -------------------------------------------------------------------------------- 1 | """ 2 | Global nordlys config. 

@author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no)
@author: Krisztian Balog (krisztian.balog@uis.no)
"""

from os import path

NORDLYS_DIR = path.dirname(path.abspath(__file__))
DATA_DIR = path.dirname(path.dirname(path.abspath(__file__))) + "/data"
OUTPUT_DIR = path.dirname(path.dirname(path.abspath(__file__))) + "/runs"

TERM_INDEX_DIR = "path/to/term/index"
URI_INDEX_DIR = "path/to/URI/index"
print "Term index:", TERM_INDEX_DIR
print "URI index:", URI_INDEX_DIR

QUERIES = DATA_DIR + "/queries.json"
ANNOTATIONS = DATA_DIR + "/tagme_annotations.json"

--------------------------------------------------------------------------------
/nordlys/elr/__init__.py:
--------------------------------------------------------------------------------
__author__ = 'faeghehhasibi'
--------------------------------------------------------------------------------
/nordlys/elr/field_mapping.py:
--------------------------------------------------------------------------------
"""
Computes PRMS field mapping probabilities.

@author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no)
"""

from __future__ import division
from pprint import PrettyPrinter

from nordlys.retrieval.scorer import ScorerPRMS
from nordlys.elr.top_fields import TopFields


class FieldMapping(object):
    DEBUG = 0
    MAPPING_DEBUG = 0

    def __init__(self, lucene_term, lucene_uri, n):
        self.lucene_term = lucene_term
        self.lucene_uri = lucene_uri
        self.n = n

    def map(self, query_annot, slop=None):
        """
        Computes PRMS field mapping probabilities for URIs, terms, ordered, and unordered phrases.

        :param query_annot: nordlys.elr.QueryAnnot
        :param slop: number of terms in between
        :return: interprets: {'uris': {uri:{field: prob, ..}, ..}, 'terms': {..}, 'ordered': {..}, 'unordered': {..}}
        """
        T, phrases, E = set(query_annot.T), set(query_annot.get_all_phrases()), set(query_annot.E.keys())
        field_mappings = {'uris': self.get_mapping_uris(E),
                          'terms': self.get_mapping_terms(T),
                          'ordered': self.get_mapping_phrases(phrases, 0, True)}
        print " ordered done!"
        if slop is not None:
            field_mappings['unordered'] = self.get_mapping_phrases(phrases, slop, False)
            print " unordered done!"
        print "==="
        return field_mappings

    def get_mapping_uris(self, uris):
        """
        Computes field mapping probability for URIs.

        :param uris: list of uris
        :return: Dictionary {uri: {field: weight, ..}, ..}
        """
        field_mappings = {}
        for uri in uris:
            top_fields = TopFields(self.lucene_uri).get_top_term(uri, self.n)
            scorer_prms = ScorerPRMS(self.lucene_uri, None, {'fields': top_fields})
            field_mappings[uri] = scorer_prms.get_mapping_prob(uri)
            if self.DEBUG:
                print uri
                PrettyPrinter(depth=4).pprint(sorted(field_mappings[uri].items(), key=lambda f: f[1], reverse=True))
        return field_mappings

    def get_mapping_terms(self, terms):
        """
        Computes PRMS field mapping probability for terms.

        :param terms: list of terms
        :return: Dictionary {term: {field: weight, ..}, ..}
        """
        field_mappings = {}
        top_fields = TopFields(self.lucene_term).get_top_index(self.n)
        for term in terms:
            scorer_prms = ScorerPRMS(self.lucene_term, None, {'fields': top_fields})
            field_mappings[term] = scorer_prms.get_mapping_prob(term)
            if self.DEBUG:
                print term
                PrettyPrinter(depth=4).pprint(sorted(field_mappings[term].items(), key=lambda f: f[1], reverse=True))
        return field_mappings

    def get_mapping_phrases(self, phrases, slop, ordered):
        """
        Computes PRMS field mapping probability for phrases.

        :param phrases: list of phrases
        :param ordered: if True, performs ordered search
        :param slop: number of terms between the terms of phrase
        :return: Dictionary {phrase: {field: weight, ..}, ..}
        """
        field_mappings = {}
        top_fields = TopFields(self.lucene_term).get_top_index(self.n)
        for phrase in phrases:
            coll_freqs = self.__get_coll_freqs(phrase, top_fields, slop, ordered)
            scorer_prms = ScorerPRMS(self.lucene_term, None, {'fields': top_fields})
            field_mappings[phrase] = scorer_prms.get_mapping_prob(phrase, coll_termfreq_fields=coll_freqs)
            if self.DEBUG:
                print phrase
                PrettyPrinter(depth=4).pprint(sorted(field_mappings[phrase].items(), key=lambda f: f[1], reverse=True))
        return field_mappings

    def __get_coll_freqs(self, phrase, fields, slop, ordered):
        """Gets collection term frequency for all fields."""
        coll_freqs = {}
        for f in fields:
            doc_phrase_freq = self.lucene_term.get_doc_phrase_freq(phrase, f, slop=slop, ordered=ordered)
            coll_freqs[f] = sum(doc_phrase_freq.values())
        return coll_freqs
--------------------------------------------------------------------------------
/nordlys/elr/query_annot.py:
--------------------------------------------------------------------------------
"""
Class for
query annotations in the json file.

@author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no)
"""
from nordlys.retrieval.lucene_tools import Lucene


class QueryAnnot(object):
    def __init__(self, annotations, score_th, qid=None):
        self.annotations = annotations
        self.score_th = score_th
        self.qid = qid
        self.__E = None
        self.__T = None
        self.__mentions = None

    @property
    def query(self):
        return self.annotations.get('query', None)

    @property
    def field_mappings(self):
        """Returns field mappings."""
        return self.annotations.get('field_mappings', {})

    @field_mappings.setter
    def field_mappings(self, value):
        if "field_mappings" not in self.annotations:
            self.annotations['field_mappings'] = {}
        self.annotations['field_mappings'].update(value)

    @property
    def E(self):
        """Returns annotated entities as a dict {uri: score}, keeping scores at or above the threshold."""
        if self.__E is None:
            self.__E = {}
            for interpretation in self.annotations['interpretations'].values():
                for annot in interpretation['annots'].values():
                    if float(annot['score']) >= self.score_th:
                        self.__E[annot['uri']] = annot['score']
        return self.__E

    @property
    def T(self):
        """Returns all query terms."""
        if self.__T is None:
            analyzed_query = Lucene.preprocess(self.query)
            self.__T = analyzed_query.split(" ")
        return self.__T

    @property
    def mentions(self):
        """Returns all mentions (among all annotations)."""
        if self.__mentions is None:
            self.__mentions = {}
            for interpretation in self.annotations['interpretations'].values():
                for mention, annot in interpretation['annots'].iteritems():
                    if float(annot['score']) >= self.score_th:
                        analyzed_phrase = Lucene.preprocess(mention)
                        if (analyzed_phrase is not None) and (analyzed_phrase.strip() != ""):
                            self.__mentions[analyzed_phrase] = annot['score']
        return
self.__mentions

    def get_all_phrases(self):
        """Returns phrases for the ordered part of the model. (bigram and n-gram of mentions)"""
        all_phrases = set()
        for s_t in self.mentions:
            if len(s_t.split(" ")) > 1:
                all_phrases.add(s_t)
        analyzed_query = Lucene.preprocess(self.query)
        query_terms = analyzed_query.split(" ")
        for i in range(0, len(query_terms)-1):
            bigram = " ".join([query_terms[i], query_terms[i+1]])
            all_phrases.add(bigram)
        return all_phrases

    def update(self, key, value):
        """Updates the annotation."""
        self.annotations[key] = value

--------------------------------------------------------------------------------
/nordlys/elr/retrieval_elr.py:
--------------------------------------------------------------------------------
"""
Class for entity retrieval

@author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no)
"""

import argparse
import json
import os

from nordlys.config import QUERIES, TERM_INDEX_DIR, URI_INDEX_DIR, OUTPUT_DIR, ANNOTATIONS
from nordlys.elr.query_annot import QueryAnnot
from nordlys.elr.scorer_elr import ScorerMRF
from nordlys.retrieval.lucene_tools import Lucene
from nordlys.retrieval.results import RetrievalResults
from nordlys.retrieval.retrieval import Retrieval


class RetrievalELR(Retrieval):
    def __init__(self, model, query_file, annot_file, el_th=None, lambd=None, n_fields=None):
        config = {'model': model,
                  'index_dir': TERM_INDEX_DIR,
                  'query_file': query_file,
                  'lambda': lambd,
                  'th': el_th,
                  'n_fields': n_fields,
                  'first_pass_num_docs': 1000,
                  'num_docs': 100,
                  'fields': None}

        # The run ID encodes the model name and any non-default parameters
        lambd_str = "_lambda" + "_".join([str(l) for l in lambd]) if lambd is not None else ""
        th_str = "_th" + str(el_th) if el_th is not None else ""
        fields_str = str(n_fields) if n_fields is not None else ""
        run_id = model + fields_str + th_str + lambd_str
        config['run_id'] = run_id
        config['output_file'] = OUTPUT_DIR + "/" + run_id + ".treceval"
        super(RetrievalELR, self).__init__(config)

        self.annot_file = annot_file

    def _load_query_annotations(self):
        """Loads the query annotation file."""
        with open(self.annot_file) as f:
            self.query_annotations = json.load(f)

    def _open_index(self):
        self.lucene_term = Lucene(TERM_INDEX_DIR)
        self.lucene_uri = Lucene(URI_INDEX_DIR)
        self.lucene_term.open_searcher()
        self.lucene_uri.open_searcher()

    def _close_index(self):
        self.lucene_term.close_reader()
        self.lucene_uri.close_reader()

    def _second_pass_scoring(self, res1, scorer):
        """
        Returns second-pass scoring of documents.

        :param res1: first pass results
        :param scorer: scorer object
        :return: RetrievalResults object
        """
        print "\tSecond pass scoring... "
        results = RetrievalResults()
        for doc_id, orig_score in res1.get_scores_sorted():
            score = scorer.score_doc(doc_id)
            results.append(doc_id, score)
        print "done"
        return results

    def retrieve(self, store_json=True):
        """Scores queries and outputs results."""
        self._open_index()
        self._load_queries()
        self._load_query_annotations()

        # init output file
        if os.path.exists(self.config['output_file']):
            os.remove(self.config['output_file'])
        out = open(self.config['output_file'], "w")
        print "Number of queries:", len(self.queries)

        for qid in sorted(self.queries):
            query = Lucene.preprocess(self.queries[qid])
            print "scoring [" + qid + "] " + query
            query_annot = QueryAnnot(self.query_annotations[qid], self.config['th'], qid=qid)

            # score documents
            res1 = self._first_pass_scoring(self.lucene_term, query)
            scorer = ScorerMRF.get_scorer(self.lucene_term, self.lucene_uri, self.config, query_annot)
            results =
self._second_pass_scoring(res1, scorer)

            # write results to output file
            results.write_trec_format(qid, self.config['run_id'], out, self.config['num_docs'])

        out.close()
        self._close_index()

        print "Output results: " + self.config['output_file']


def arg_parser():
    valid_models = ["lm", "mlm", "mlm-tc", "mlm-all", "prms", "sdm", "fsdm",
                    "lm_elr", "mlm_elr", "mlm-tc_elr", "mlm-all_elr", "prms_elr", "sdm_elr", "fsdm_elr"]
    parser = argparse.ArgumentParser()
    parser.add_argument("model", help="Model name", type=str, choices=valid_models)
    parser.add_argument("-q", "--queries", help="Query file", type=str, default=QUERIES)
    parser.add_argument("-a", "--annot", help="Annotation file (with field mappings)", type=str, default=ANNOTATIONS)
    parser.add_argument("-t", "--threshold", help="Entity linking threshold", type=float, default=0.1)
    parser.add_argument("-n", "--nfields", help="number of fields", type=int, default=10)
    parser.add_argument("-l", "--lambd", help="Lambdas, comma separated values for ", type=str)
    args = parser.parse_args()
    return args


def main(args):
    lambda_params = None
    if args.lambd is not None:
        lambdas = args.lambd.split(",")
        lambda_params = [float(l.strip()) for l in lambdas]

    RetrievalELR(args.model, args.queries, args.annot, el_th=args.threshold, lambd=lambda_params,
                 n_fields=args.nfields).retrieve()

if __name__ == '__main__':
    main(arg_parser())

--------------------------------------------------------------------------------
/nordlys/elr/scorer_elr.py:
--------------------------------------------------------------------------------
"""
ELR extension of MRF based models: LM, MLM, PRMS, SDM, and FSDM

@author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no)
"""

from __future__ import division

import math

from
nordlys.elr.field_mapping import FieldMapping
from nordlys.elr.top_fields import TopFields
from nordlys.retrieval.lucene_tools import Lucene
from nordlys.retrieval.scorer import ScorerLM


class ScorerMRF(object):
    DEBUG = 0

    TERM = "terms"
    ORDERED = "ordered"
    UNORDERED = "unordered"
    URI = "uris"
    SLOP = 6  # Window = 8

    def __init__(self, lucene_term, lucene_uri, params, query_annot):
        self.lucene_term = lucene_term
        self.lucene_uri = lucene_uri
        self.params = params
        self.query_annot = query_annot
        self.phrase_freq = {}

        self.scorer_lm_term = ScorerLM(self.lucene_term, None, {'smoothing_method': "dirichlet"})
        self.scorer_lm_uri = ScorerLM(self.lucene_uri, None, {})
        self.instance_list = []
        self.__n_fields = None
        self.__bigrams = None
        self.__mlm_all_mapping = None

    @property
    def n_fields(self):
        """Returns number of fields for fielded models."""
        if self.__n_fields is None:
            model = self.params['model']
            if ("prms" in model) or ("fsdm" in model) or ("mlm-all" in model):
                self.__n_fields = 10 if self.params['n_fields'] is None else self.params['n_fields']
        return self.__n_fields

    @property
    def bigrams(self):
        """Returns all query bigrams."""
        if self.__bigrams is None:
            self.__bigrams = []
            for i in range(0, len(self.query_annot.T)-1):
                bigram = " ".join([self.query_annot.T[i], self.query_annot.T[i+1]])
                self.__bigrams.append(bigram)
        return self.__bigrams

    @property
    def mlm_all_mapping(self):
        if self.__mlm_all_mapping is None:
            self.__mlm_all_mapping = {}
            fields = TopFields(self.lucene_term).get_top_index(self.n_fields)
            weight = 1.0 / len(fields)
            for field in fields:
                self.__mlm_all_mapping[field] = weight
        return self.__mlm_all_mapping

    @staticmethod
    def get_scorer(lucene_term, lucene_uri, params,
query_annot):
        """
        Returns Scorer object (Scorer factory).

        :param lucene_term: Lucene object for terms
        :param lucene_uri: Lucene object for uris
        :param params: dict with models parameters
        :param query_annot: query annotation with the mapping probabilities
        """
        model = params['model']
        lambd = params['lambda']
        print "\t" + model + " scoring ..."
        if (model == "lm") or (model == "prms") or (model == "mlm-all") or (model == "mlm-tc"):
            params['lambda'] = [1.0, 0.0, 0.0] if lambd is None else lambd
            return ScorerFSDM(lucene_term, lucene_uri, params, query_annot)
        elif (model == "sdm") or (model == "fsdm"):
            params['lambda'] = [0.8, 0.1, 0.1] if lambd is None else lambd
            return ScorerFSDM(lucene_term, lucene_uri, params, query_annot)
        elif (model == "lm_elr") or (model == "prms_elr") or (model == "mlm-tc_elr") or (model == "mlm-all_elr"):
            params['lambda'] = [0.9, 0.0, 0.0, 0.1] if lambd is None else lambd
            return ScorerELR(lucene_term, lucene_uri, params, query_annot)
        elif (model == "sdm_elr") or (model == "fsdm_elr"):
            params['lambda'] = [0.8, 0.05, 0.05, 0.1] if lambd is None else lambd
            return ScorerELR(lucene_term, lucene_uri, params, query_annot)
        else:
            raise Exception("Unknown model '" + model + "'")

    def get_field_weights(self, clique_type, c):
        """
        Returns field mappings

        :param clique_type: [TERM | ORDERED | UNORDERED | URI]
        :param c: str (term, phrase, or uri)
        :return: {field: prob}
        """
        model = self.params['model']
        if (model == "lm") or (model == "lm_elr") or (model == "sdm") or (model == "sdm_elr"):
            return {Lucene.FIELDNAME_CONTENTS: 1}
        elif (model == "prms") or (model == "prms_elr") or (model == "fsdm") or (model == "fsdm_elr"):
            return self.get_prms_mapping(clique_type)[c]
        elif (model == "mlm-tc") or (model == "mlm-tc_elr"):
            if clique_type == self.URI:
                return self.get_prms_mapping(clique_type)[c]
            else:
                return {'names': 0.2, 'contents': 0.8}
        elif (model == "mlm-all") or (model == "mlm-all_elr"):
            if clique_type == self.URI:
                return self.get_prms_mapping(clique_type)[c]
            else:
                return self.mlm_all_mapping

    def get_prms_mapping(self, clique_type):
        """
        Gets PRMS mapping probability for a clique type

        :param clique_type: [TERM | ORDERED | UNORDERED | URI]
        :return Dictionary {phrase: {field: weight, ..}, ..}
        """
        if clique_type not in self.query_annot.field_mappings:
            mapper = FieldMapping(self.lucene_term, self.lucene_uri, self.n_fields)
            if clique_type == self.TERM:
                self.query_annot.field_mappings = {clique_type: mapper.get_mapping_terms(set(self.query_annot.T))}
            elif clique_type == self.ORDERED:
                self.query_annot.field_mappings = {clique_type: mapper.get_mapping_phrases(set(self.bigrams), 0, True)}
            elif clique_type == self.UNORDERED:
                self.query_annot.field_mappings = {clique_type: mapper.get_mapping_phrases(set(self.bigrams),
                                                                                            self.SLOP, False)}
            elif clique_type == self.URI:
                self.query_annot.field_mappings = {clique_type: mapper.get_mapping_uris(set(self.query_annot.E))}
        return self.query_annot.field_mappings[clique_type]

    def set_phrase_freq(self, clique_type, c, fields):
        """Sets document and collection frequency for phrase."""
        if clique_type not in self.phrase_freq:
            self.phrase_freq[clique_type] = {}
        if c not in self.phrase_freq.get(clique_type, {}):
            self.phrase_freq[clique_type][c] = {}
            for f in fields:
                if clique_type == self.ORDERED:
                    doc_freq = self.lucene_term.get_doc_phrase_freq(c, f, 0, True)
                elif clique_type == self.UNORDERED:
                    doc_freq = self.lucene_term.get_doc_phrase_freq(c, f, self.SLOP, False)

                self.phrase_freq[clique_type][c][f] = doc_freq
                self.phrase_freq[clique_type][c][f]['coll_freq'] = sum(doc_freq.values())

    @staticmethod
    def normalize_el_scores(scores):
        """Normalizes entity linking scores so that they sum to 1."""
        normalized_scores = {}
        sum_score = sum(scores.values())
        for item, score in scores.iteritems():
            normalized_scores[item] = score / sum_score
        return normalized_scores

    def get_p_t_d(self, t, field_weights, doc_id):
        """
        p(t|d) = sum_{f in F} p(t|d_f) p(f|t)

        :param t: term
        :param field_weights: Dictionary {f: p_f_t, ...}
        :param doc_id: entity id
        :return p(t|d)
        """
        lucene_doc_id_t = self.lucene_term.get_lucene_document_id(doc_id)
        p_t_d = 0
        for f, p_f_t in field_weights.iteritems():
            if self.DEBUG:
                print "\tt:", t, "f:", f
            p_t_d_f = self.scorer_lm_term.get_term_prob(lucene_doc_id_t, f, t)
            p_t_d += p_t_d_f * p_f_t
            if self.DEBUG:
                print "\t\tp(t|d_f):", p_t_d_f, "p(f|t):", p_f_t, "p(t|d_f).p(f|t):", p_t_d_f * p_f_t
        if self.DEBUG:
            print "\tp(t|d):", p_t_d
        return p_t_d

    def get_p_o_d(self, o, field_weights, doc_id):
        """
        p(o|d) = sum_{f in F} p(o|d_f) p(f|o) for ordered search

        :param o: phrase (ordered search)
        :param field_weights: Dictionary {f: p_f_o, ...}
        :param doc_id: entity id
        :return p(o|d)
        """
        lucene_doc_id_t = self.lucene_term.get_lucene_document_id(doc_id)
        self.set_phrase_freq(self.ORDERED, o, field_weights)
        p_o_d = 0
        for f, p_f_o in field_weights.iteritems():
            if self.DEBUG:
                print "\to:", o, "f:", f
            tf_t_d_f = self.phrase_freq[self.ORDERED][o].get(f, {}).get(doc_id, 0)
            tf_t_C_f = self.phrase_freq[self.ORDERED][o].get(f, {}).get('coll_freq', 0)
            p_o_d_f = self.scorer_lm_term.get_term_prob(lucene_doc_id_t, f, o, tf_t_d_f=tf_t_d_f, tf_t_C_f=tf_t_C_f)
            p_o_d += p_o_d_f * p_f_o
            if self.DEBUG:
                print "\t\tp(o|d_f):", p_o_d_f, "p(f|o):", p_f_o, "p(o|d_f).p(f|o):", p_o_d_f * p_f_o
        if self.DEBUG:
            print "\tp(o|d):", p_o_d
        return p_o_d

    def get_p_u_d(self, u, field_weights, doc_id):
        """
        p(u|d) = sum_{f in F} p(u|d_f) p(f|u) for unordered search

        :param u: phrase (unordered search)
        :param field_weights: Dictionary {f: p_f_u, ...}
        :param doc_id: entity id
        :return p(u|d)
        """
        lucene_doc_id_t = self.lucene_term.get_lucene_document_id(doc_id)
        self.set_phrase_freq(self.UNORDERED, u, field_weights)
        p_u_d = 0
        for f, p_f_u in field_weights.iteritems():
            if self.DEBUG:
                print "\tu:", u, "f:", f
            tf_t_d_f = self.phrase_freq[self.UNORDERED][u].get(f, {}).get(doc_id, 0)
            tf_t_C_f = self.phrase_freq[self.UNORDERED][u].get(f, {}).get('coll_freq', 0)
            p_u_d_f = self.scorer_lm_term.get_term_prob(lucene_doc_id_t, f, u, tf_t_d_f=tf_t_d_f, tf_t_C_f=tf_t_C_f)
            p_u_d += p_u_d_f * p_f_u
            if self.DEBUG:
                print "\t\tp(u|d_f):", p_u_d_f, "p(f|u):", p_f_u, "p(u|d_f).p(f|u):", p_u_d_f * p_f_u
        if self.DEBUG:
            print "\tp(u|d):", p_u_d
        return p_u_d

    def get_p_e_d(self, e, field_weights, doc_id):
        """
        p(e|d) = sum_{f in F} p(e|d_f) p(f|e)

        :param e: entity URI
        :param field_weights: Dictionary {f: p_f_e, ...}
        :param doc_id: entity id
        :return p(e|d)
        """
        if self.DEBUG:
            print "\te:", e
        p_e_d = 0
        for f, p_f_e in field_weights.iteritems():
            p_e_d_f = self.__get_uri_prob(doc_id, f, e)
            p_e_d += p_e_d_f * p_f_e
            if self.DEBUG:
                print "\t\tp(e|d_f):", p_e_d_f, "p(f|e):", p_f_e, "p(e|d_f).p(f|e):", p_e_d_f * p_f_e
        if self.DEBUG:
            print "\tp(e|d):", p_e_d
        return p_e_d

    def __get_uri_prob(self, doc_id, field, e, lambd=0.1):
        """
        P(e|d_f) = (1 - lambda) tf(e, d_f) + lambda df(f, e) /
df(f)

        :param doc_id: document id
        :param field: field name
        :param e: entity uri
        :param lambd: smoothing parameter
        :return: P(e|d_f)
        """
        if self.DEBUG:
            print "\t\tf:", field
        lucene_doc_id_u = self.lucene_uri.get_lucene_document_id(doc_id)
        tf = self.scorer_lm_uri.get_tf(lucene_doc_id_u, field)
        tf_e_d_f = 1 if tf.get(e, 0) > 0 else 0  # binary: does the URI occur in this field?
        df_f_e = self.lucene_uri.get_doc_freq(e, field)
        df_f = self.lucene_uri.get_doc_count(field)
        p_e_d_f = ((1 - lambd) * tf_e_d_f) + (lambd * df_f_e / df_f)
        if self.DEBUG:
            print "\t\t\ttf(e,d_f):", tf_e_d_f, "df(f, e):", df_f_e, "df(f):", df_f, "P(e|d_f):", p_e_d_f
        return p_e_d_f


class ScorerFSDM(ScorerMRF):
    DEBUG_FSDM = 0

    def __init__(self, lucene_term, lucene_uri, params, query_annot):
        ScorerMRF.__init__(self, lucene_term, lucene_uri, params, query_annot)
        self.lambda_T = self.params['lambda'][0]
        self.lambda_O = self.params['lambda'][1]
        self.lambda_U = self.params['lambda'][2]
        self.T = self.query_annot.T

    def score_doc(self, doc_id):
        """
        P(q|d) = lambda_T sum_{t in T}P(t|d) + lambda_O sum_{o in O}P(o|d) + lambda_U sum_{u in U}P(u|d)
        P(t|d) = sum_{f in F} p(t|d_f) p(f|t)
        P(o|d) = sum_{f in F} p(o|d_f) p(f|o)
        P(u|d) = sum_{f in F} p(u|d_f) p(f|u)

        The sums are accumulated over log probabilities; zero probabilities are skipped.

        :param doc_id: document id
        :return: p(q|d)
        """
        if self.DEBUG_FSDM:
            print "Scoring doc ID=" + doc_id

        if self.lucene_term.get_lucene_document_id(doc_id) is None:
            return None

        p_T_d = 0
        if self.lambda_T != 0:
            for t in self.T:
                p_t_d = self.get_p_t_d(t, self.get_field_weights(self.TERM, t), doc_id)
                if p_t_d != 0:
                    p_T_d += math.log(p_t_d)

        p_O_d = 0
        if self.lambda_O != 0:
            for b in self.bigrams:
                p_o_d = self.get_p_o_d(b, self.get_field_weights(self.ORDERED, b), doc_id)
                if p_o_d != 0:
                    p_O_d += math.log(p_o_d)

        p_U_d = 0
        if self.lambda_U != 0:
            for b in self.bigrams:
                p_u_d = self.get_p_u_d(b, self.get_field_weights(self.UNORDERED, b), doc_id)
                if p_u_d != 0:
                    p_U_d += math.log(p_u_d)

        p_q_d = (self.lambda_T * p_T_d) + (self.lambda_O * p_O_d) + (self.lambda_U * p_U_d)
        if self.DEBUG_FSDM:
            print "\t\tP(T|d) = ", p_T_d, "P(O|d):", p_O_d, "p(U|d):", p_U_d, "P(q|d):", p_q_d

        return p_q_d


class ScorerELR(ScorerFSDM):
    DEBUG_ELR = 0

    def __init__(self, lucene_term, lucene_uri, params, query_annot):
        ScorerFSDM.__init__(self, lucene_term, lucene_uri, params, query_annot)
        self.lambda_E = self.params['lambda'][3]
        self.E = ScorerMRF.normalize_el_scores(self.query_annot.E)

    def score_doc(self, doc_id):
        """
        P(q|d) = lambda_T sum_{t}P(t|d) + lambda_O sum_{o}P(o|d) + lambda_U sum_{u}P(u|d) + lambda_E sum_{e}P(e|d)
        P(t|d) = sum_{f in F} p(t|d_f) p(f|t)
        P(o|d) = sum_{f in F} p(o|d_f) p(f|o)
        P(u|d) = sum_{f in F} p(u|d_f) p(f|u)
        P(e|d) = sum_{f in F} p(e|d_f) p(f|e)

        :param doc_id: document id
        :return: p(q|d)
        """
        if self.DEBUG_ELR:
            print "Scoring doc ID=" + doc_id

        if self.lucene_term.get_lucene_document_id(doc_id) is None:
            # print doc_id, self.lucene_term.get_lucene_document_id(doc_id)
            return None

        p_T_d = 0
        n_T = len(self.T)
        if self.lambda_T != 0:
            for t in self.T:
                p_t_d = self.get_p_t_d(t, self.get_field_weights(self.TERM, t), doc_id)
                if p_t_d != 0:
                    p_T_d += math.log(p_t_d) / n_T

        p_O_d = 0
        n_O = len(self.bigrams)
        if self.lambda_O != 0:
            for b in self.bigrams:
                p_o_d = self.get_p_o_d(b, self.get_field_weights(self.ORDERED, b), doc_id)
                if p_o_d != 0:
                    p_O_d += math.log(p_o_d) / n_O

        p_U_d = 0
        n_U =
len(self.bigrams)
        if self.lambda_U != 0:
            for b in self.bigrams:
                p_u_d = self.get_p_u_d(b, self.get_field_weights(self.UNORDERED, b), doc_id)
                if p_u_d != 0:
                    p_U_d += math.log(p_u_d) / n_U

        p_E_d = 0
        if self.lambda_E != 0:
            for e, score in self.E.iteritems():
                p_e_d = self.get_p_e_d(e, self.get_field_weights(self.URI, e), doc_id)
                if p_e_d != 0:
                    p_E_d += score * math.log(p_e_d)

        p_q_d = (self.lambda_T * p_T_d) + (self.lambda_O * p_O_d) + (self.lambda_U * p_U_d) + (self.lambda_E * p_E_d)
        if self.DEBUG_ELR:
            print "\t\tP(T|d) = ", p_T_d, "P(O|d):", p_O_d, "p(U|d):", p_U_d, "p(E|d):", p_E_d, "P(q|d):", p_q_d

        return p_q_d

--------------------------------------------------------------------------------
/nordlys/elr/top_fields.py:
--------------------------------------------------------------------------------
"""
This class returns top fields based on document frequency

@author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no)
"""

from nordlys.retrieval.lucene_tools import Lucene


class TopFields(object):
    DEBUG = 0

    def __init__(self, lucene):
        self.lucene = lucene
        self.__fields = None

    @property
    def fields(self):
        if self.__fields is None:
            self.__fields = set(self.lucene.get_fields())
        return self.__fields

    def get_top_index(self, n):
        """Return top-n fields with highest document frequency across the whole index"""
        doc_freq_field = {}
        for field in self.fields:
            if field == Lucene.FIELDNAME_ID:
                continue
            doc_freq_field[field] = self.lucene.get_doc_count(field)
        return self.__get_top_n(doc_freq_field, n)

    def get_top_term(self, term, n):
        """Returns top-n fields with highest document frequency for the given term."""
        doc_freq = {}
        if self.DEBUG:
            print "Term:[" + term + "]"
        for field in self.fields:
            df =
self.lucene.get_doc_freq(term, field)
            if df > 0:
                doc_freq[field] = df
        top_fields = self.__get_top_n(doc_freq, n)
        return top_fields

    def __get_top_n(self, fields_freq, n):
        """Sorts fields and returns top-n."""
        sorted_fields = sorted(fields_freq.items(), key=lambda item: (item[1], item[0]), reverse=True)
        top_fields = dict()
        i = 0
        for field, freq in sorted_fields:
            if i >= n:
                break
            i += 1
            top_fields[field] = freq
            if self.DEBUG:
                print "(" + field + ", " + str(freq) + ")",
        if self.DEBUG:
            print "\nNumber of fields:", len(top_fields), "\n"
        return top_fields

--------------------------------------------------------------------------------
/nordlys/retrieval/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hasibi/EntityLinkingRetrieval-ELR/b53d7bce81f8050dd5b7a96a8e8b99f0ed258ba6/nordlys/retrieval/__init__.py
--------------------------------------------------------------------------------
/nordlys/retrieval/indexer.py:
--------------------------------------------------------------------------------
"""
Creates a Lucene index for DBpedia from MongoDB.

- URI values are resolved using a simple heuristic
- fields are indexed as multi-valued
- catch-all fields are not indexed with positions, other fields are

--------------------------------------------------------------------------------------------------
NOTE: This code cannot be run as-is because it depends on the DBpedia Mongo collection.
Yet, this is the main code used for generating the indices and can be used as a reference.
To get the original indices, please contact the first author.
--------------------------------------------------------------------------------------------------

@author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no)
@author: Krisztian Balog (krisztian.balog@uis.no)
"""

import sys
from urllib import unquote
from pprint import pprint

from nordlys import config
from nordlys.entity.config import COLLECTION_DBPEDIA
from nordlys.entity.dbpedia.fields import Fields
from nordlys.storage.mongo import Mongo
from nordlys.retrieval.lucene_tools import Lucene


class MongoDBToLucene(object):
    def __init__(self, host=config.MONGO_HOST, db=config.MONGO_DB, collection=COLLECTION_DBPEDIA):
        self.mongo = Mongo(host, db, collection)
        self.contents = None

    def __resolve_uri(self, uri):
        """Resolves the URI using a simple heuristic."""
        uri = unquote(uri)  # decode percent encoding
        if uri.startswith("<") and uri.endswith(">"):
            # Part between last ':' and '>', and _ replaced with space.
            # Works fine for and
            return uri[uri.rfind(":") + 1:-1].replace("_", " ")
        else:
            return uri

    def __is_uri(self, value):
        """ Returns true if the value is uri. """
        if value.startswith(""):
            return True
        return False

    def __get_field_value(self, value, only_uris=False):
        """
        Converts mongoDB field value to indexable values by resolving URIs.
        It may be a string or a list and the return value is of the same data type.
54 | """ 55 | if type(value) is list: 56 | nval = [] # holds resolved values 57 | for v in value: 58 | if not only_uris: 59 | nval.append(Lucene.preprocess(self.__resolve_uri(v))) 60 | elif only_uris and self.__is_uri(v): 61 | nval.append(v) 62 | return nval 63 | else: 64 | if not only_uris: 65 | return Lucene.preprocess(self.__resolve_uri(value)) 66 | elif only_uris and self.__is_uri(value): 67 | return value 68 | # return self.__resolve_uri(value) if only_uris else value 69 | return None 70 | 71 | def __add_to_contents(self, field_name, field_value, field_type): 72 | """ 73 | Adds field to document contents. 74 | Field value can be a list, where each item is added separately (i.e., the field is multi-valued). 75 | """ 76 | if type(field_value) is list: 77 | for fv in field_value: 78 | self.__add_to_contents(field_name, fv, field_type) 79 | else: 80 | if len(field_value) > 0: # ignore empty fields 81 | self.contents.append({'field_name': field_name, 82 | 'field_value': field_value, 83 | 'field_type': field_type}) 84 | 85 | def build_index(self, index_config, only_uris=False, max_shingle_size=None): 86 | """Builds index. 
87 | 88 | :param index_config: index configuration 89 | """ 90 | lucene = Lucene(index_config['index_dir'], max_shingle_size) 91 | lucene.open_writer() # generated shingle analyzer if the param is not None 92 | 93 | fieldtype_tv = Lucene.FIELDTYPE_ID_TV if only_uris else Lucene.FIELDTYPE_TEXT_TV 94 | fieldtype_tvp = Lucene.FIELDTYPE_ID_TV if only_uris else Lucene.FIELDTYPE_TEXT_TVP 95 | fieldtype_id = Lucene.FIELDTYPE_ID_TV if only_uris else Lucene.FIELDTYPE_ID 96 | fieldtype_ntv = Lucene.FIELDTYPE_ID_TV if only_uris else Lucene.FIELDTYPE_TEXT_NTV 97 | 98 | # iterate through MongoDB contents 99 | i = 0 100 | for mdoc in self.mongo.find_all(): 101 | 102 | # this is just to speed up things a bit 103 | # we can skip the document right away if the ID does not start 104 | # with "": 174 | fields[f] = {'must_have': True, 'copy_to': ["names"]} 175 | elif (f == "") or (f == "!"): 176 | fields[f] = {'copy_to': ["names"]} 177 | elif (f == "") or (f == ""): 178 | fields[f] = {'copy_to': ["types"]} 179 | elif f == "": 180 | fields[f] = {'must_have': True} 181 | else: 182 | fields[f] = {} 183 | 184 | 185 | # Config of index7 186 | config_index7 = {'index_dir': "path/to/index", 187 | 'fields': fields, 188 | 'catchall_all': True, 189 | 'ignore': [""] # except these 190 | } 191 | 192 | # config of config7_only_uri; Similar to index7, but keeps only uris 193 | index_config7_only_uri = {'index_dir': "path/to/uri_only index", 194 | 'fields': fields, 195 | 'catchall_all': True, 196 | 'ignore': [""] # except these 197 | } 198 | 199 | pprint(config_index7) 200 | m2l = MongoDBToLucene() 201 | m2l.build_index(config_index7, only_uris=False) 202 | m2l.build_index(config_index7, only_uris=True) 203 | print "index build" + config_index7['index_dir'] 204 | 205 | if __name__ == "__main__": 206 | main(sys.argv[1:]) 207 | -------------------------------------------------------------------------------- /nordlys/retrieval/lucene_tools.py: 
-------------------------------------------------------------------------------- 1 | """ 2 | Tools for Lucene. 3 | All Lucene features should be accessed in nordlys through this class. 4 | 5 | - Lucene class for ensuring that the same version, analyzer, etc. 6 | are used across nordlys modules. Handles IndexReader, IndexWriter, etc. 7 | - Command line tools for checking indexed document content 8 | 9 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 10 | @author: Krisztian Balog (krisztian.balog@uis.no) 11 | """ 12 | import argparse 13 | import lucene 14 | from nordlys.retrieval.results import RetrievalResults 15 | from java.io import File 16 | from java.util import HashMap, TreeSet 17 | from java.io import StringReader 18 | from java.lang import StringBuilder 19 | from org.apache.lucene.analysis.tokenattributes import CharTermAttribute 20 | from org.apache.lucene.analysis.core import StopFilter 21 | from org.apache.lucene.analysis.core import StopAnalyzer 22 | from org.apache.lucene.analysis.standard import StandardTokenizer 23 | from org.apache.lucene.analysis.standard import StandardAnalyzer 24 | from org.apache.lucene.analysis.shingle import ShingleAnalyzerWrapper 25 | from org.apache.lucene.document import Document 26 | from org.apache.lucene.document import Field 27 | from org.apache.lucene.document import FieldType 28 | from org.apache.lucene.index import MultiFields 29 | from org.apache.lucene.index import IndexWriter 30 | from org.apache.lucene.index import IndexWriterConfig 31 | from org.apache.lucene.index import DirectoryReader 32 | from org.apache.lucene.index import Term 33 | from org.apache.lucene.index import TermContext 34 | from org.apache.lucene.queryparser.classic import QueryParser 35 | from org.apache.lucene.search import IndexSearcher 36 | from org.apache.lucene.search import BooleanClause 37 | from org.apache.lucene.search import TermQuery 38 | from org.apache.lucene.search import BooleanQuery 39 | from org.apache.lucene.search import 
PhraseQuery 40 | from org.apache.lucene.search.spans import SpanNearQuery 41 | from org.apache.lucene.search.spans import SpanTermQuery 42 | from org.apache.lucene.search import FieldValueFilter 43 | from org.apache.lucene.search.similarities import LMJelinekMercerSimilarity 44 | from org.apache.lucene.search.similarities import LMDirichletSimilarity 45 | from org.apache.lucene.store import SimpleFSDirectory 46 | from org.apache.lucene.util import BytesRefIterator 47 | from org.apache.lucene.util import Version 48 | from org.apache.lucene.index import SlowCompositeReaderWrapper 49 | 50 | # has java VM for Lucene been initialized 51 | lucene_vm_init = False 52 | 53 | 54 | class Lucene(object): 55 | 56 | # default fieldnames for id and contents 57 | FIELDNAME_ID = "id" 58 | FIELDNAME_CONTENTS = "contents" 59 | 60 | # internal fieldtypes 61 | # used as Enum, the actual values don't matter 62 | FIELDTYPE_ID = "id" 63 | FIELDTYPE_ID_TV = "id_tv" 64 | FIELDTYPE_TEXT = "text" 65 | FIELDTYPE_TEXT_TV = "text_tv" 66 | FIELDTYPE_TEXT_TVP = "text_tvp" 67 | FIELDTYPE_TEXT_NTV = "text_ntv" 68 | FIELDTYPE_TEXT_NTVP = "text_ntvp" 69 | 70 | def __init__(self, index_dir, max_shingle_size=None): 71 | global lucene_vm_init 72 | 73 | if not lucene_vm_init: 74 | lucene.initVM(vmargs=['-Djava.awt.headless=true']) 75 | lucene_vm_init = True 76 | self.dir = SimpleFSDirectory(File(index_dir)) 77 | self.max_shingle_size = max_shingle_size 78 | self.analyzer = None 79 | self.reader = None 80 | self.searcher = None 81 | self.writer = None 82 | self.ldf = None 83 | 84 | @staticmethod 85 | def get_version(): 86 | """Get Lucene version.""" 87 | return Version.LUCENE_48 88 | 89 | @staticmethod 90 | def preprocess(text): 91 | """Tokenize and stop the input text.""" 92 | ts = StandardTokenizer(Lucene.get_version(), StringReader(text.lower())) 93 | ts = StopFilter(Lucene.get_version(), ts, StopAnalyzer.ENGLISH_STOP_WORDS_SET) 94 | string_builder = StringBuilder() 95 | ts.reset() 96 | char_term_attr = 
ts.addAttribute(CharTermAttribute.class_) 97 | while ts.incrementToken(): 98 | if string_builder.length() > 0: 99 | string_builder.append(" ") 100 | string_builder.append(char_term_attr.toString()) 101 | return string_builder.toString() 102 | 103 | def get_analyzer(self): 104 | """Get analyzer.""" 105 | if self.analyzer is None: 106 | std_analyzer = StandardAnalyzer(Lucene.get_version()) 107 | if self.max_shingle_size is None: 108 | self.analyzer = std_analyzer 109 | else: 110 | self.analyzer = ShingleAnalyzerWrapper(std_analyzer, self.max_shingle_size) 111 | return self.analyzer 112 | 113 | def open_reader(self): 114 | """Open IndexReader.""" 115 | if self.reader is None: 116 | self.reader = DirectoryReader.open(self.dir) 117 | 118 | def get_reader(self): 119 | return self.reader 120 | 121 | def close_reader(self): 122 | """Close IndexReader.""" 123 | if self.reader is not None: 124 | self.reader.close() 125 | self.reader = None 126 | else: 127 | raise Exception("There is no open IndexReader to close") 128 | 129 | def open_searcher(self): 130 | """ 131 | Open IndexSearcher. Automatically opens an IndexReader too, 132 | if it is not already open. There is no close method for the 133 | searcher. 134 | """ 135 | if self.searcher is None: 136 | self.open_reader() 137 | self.searcher = IndexSearcher(self.reader) 138 | 139 | def get_searcher(self): 140 | """Returns index searcher (opens it if needed).""" 141 | self.open_searcher() 142 | return self.searcher 143 | 144 | def set_lm_similarity_jm(self, method="jm", smoothing_param=0.1): 145 | """ 146 | Set searcher to use LM similarity. 
147 | 148 | :param method: LM similarity ("jm" or "dirichlet") 149 | :param smoothing_param: smoothing parameter (lambda or mu) 150 | """ 151 | if method == "jm": 152 | similarity = LMJelinekMercerSimilarity(smoothing_param) 153 | elif method == "dirichlet": 154 | similarity = LMDirichletSimilarity(smoothing_param) 155 | else: 156 | raise Exception("Unknown method") 157 | 158 | if self.searcher is None: 159 | raise Exception("Searcher has not been created") 160 | self.searcher.setSimilarity(similarity) 161 | 162 | def open_writer(self): 163 | """Open IndexWriter.""" 164 | if self.writer is None: 165 | config = IndexWriterConfig(Lucene.get_version(), self.get_analyzer()) 166 | config.setOpenMode(IndexWriterConfig.OpenMode.CREATE) 167 | self.writer = IndexWriter(self.dir, config) 168 | else: 169 | raise Exception("IndexWriter is already open") 170 | 171 | def close_writer(self): 172 | """Close IndexWriter.""" 173 | if self.writer is not None: 174 | self.writer.close() 175 | self.writer = None 176 | else: 177 | raise Exception("There is no open IndexWriter to close") 178 | 179 | def add_document(self, contents): 180 | """ 181 | Adds a Lucene document with the specified contents to the index. 182 | See LuceneDocument.create_document() for the explanation of contents. 
183 | """ 184 | if self.ldf is None: # create a single LuceneDocument object that will be reused 185 | self.ldf = LuceneDocument() 186 | self.writer.addDocument(self.ldf.create_document(contents)) 187 | 188 | def get_lucene_document_id(self, doc_id): 189 | """Loads a document from a Lucene index based on its id.""" 190 | self.open_searcher() 191 | query = TermQuery(Term(self.FIELDNAME_ID, doc_id)) 192 | tophit = self.searcher.search(query, 1).scoreDocs 193 | if len(tophit) == 1: 194 | return tophit[0].doc 195 | else: 196 | return None 197 | 198 | def get_document_id(self, lucene_doc_id): 199 | """Gets lucene document id and returns the document id.""" 200 | self.open_reader() 201 | return self.reader.document(lucene_doc_id).get(self.FIELDNAME_ID) 202 | 203 | def print_document(self, lucene_doc_id, term_vect=False): 204 | """Prints document contents.""" 205 | if lucene_doc_id is None: 206 | print "Document is not found in the index." 207 | else: 208 | doc = self.reader.document(lucene_doc_id) 209 | print "Document ID (field '" + self.FIELDNAME_ID + "'): " + doc.get(self.FIELDNAME_ID) 210 | 211 | # first collect (unique) field names 212 | fields = [] 213 | for f in doc.getFields(): 214 | if f.name() != self.FIELDNAME_ID and f.name() not in fields: 215 | fields.append(f.name()) 216 | 217 | for fname in fields: 218 | print fname 219 | for fv in doc.getValues(fname): # printing (possibly multiple) field values 220 | print "\t" + fv 221 | # term vector 222 | if term_vect: 223 | print "-----" 224 | termfreqs = self.get_doc_termfreqs(lucene_doc_id, fname) 225 | for term in termfreqs: 226 | print term + " : " + str(termfreqs[term]) 227 | print "-----" 228 | 229 | def get_lucene_query(self, query, field=FIELDNAME_CONTENTS): 230 | """Creates Lucene query from keyword query.""" 231 | query = query.replace("(", "").replace(")", "").replace("!", "") 232 | return QueryParser(Lucene.get_version(), field, 233 | self.get_analyzer()).parse(query) 234 | 235 | def analyze_query(self, 
query, field=FIELDNAME_CONTENTS): 236 | """ 237 | Analyses the query and returns query terms. 238 | 239 | :param query: query 240 | :param field: field name 241 | :return: list of query terms 242 | """ 243 | qterms = [] # holds a list of analyzed query terms 244 | ts = self.get_analyzer().tokenStream(field, query) 245 | term = ts.addAttribute(CharTermAttribute.class_) 246 | ts.reset() 247 | while ts.incrementToken(): 248 | qterms.append(term.toString()) 249 | ts.end() 250 | ts.close() 251 | return qterms 252 | 253 | def get_id_lookup_query(self, id, field=None): 254 | """Creates Lucene query for searching by (external) document id.""" 255 | if field is None: 256 | field = self.FIELDNAME_ID 257 | return TermQuery(Term(field, id)) 258 | 259 | def get_and_query(self, queries): 260 | """Creates an AND Boolean query from multiple Lucene queries.""" 261 | # empty boolean query with Similarity.coord() disabled 262 | bq = BooleanQuery(False) 263 | for q in queries: 264 | bq.add(q, BooleanClause.Occur.MUST) 265 | return bq 266 | 267 | def get_or_query(self, queries): 268 | """Creates an OR Boolean query from multiple Lucene queries.""" 269 | # empty boolean query with Similarity.coord() disabled 270 | bq = BooleanQuery(False) 271 | for q in queries: 272 | bq.add(q, BooleanClause.Occur.SHOULD) 273 | return bq 274 | 275 | def get_phrase_query(self, query, field): 276 | """Creates phrase query for searching exact phrase.""" 277 | phq = PhraseQuery() 278 | for t in query.split(): 279 | phq.add(Term(field, t)) 280 | return phq 281 | 282 | def get_span_query(self, terms, field, slop, ordered=True): 283 | """ 284 | Creates near span query 285 | 286 | :param terms: list of terms 287 | :param field: field name 288 | :param slop: number of terms between the query terms 289 | :param ordered: If true, ordered search; otherwise unordered search 290 | :return: lucene span near query 291 | """ 292 | span_queries = [] 293 | for term in terms: 294 | 
span_queries.append(SpanTermQuery(Term(field, term))) 295 | span_near_query = SpanNearQuery(span_queries, slop, ordered) 296 | return span_near_query 297 | 298 | def get_doc_phrase_freq(self, phrase, field, slop, ordered): 299 | """ 300 | Returns the per-document frequency of a given phrase in a field. 301 | 302 | :param phrase: str 303 | :param field: field name 304 | :param slop: number of terms in between 305 | :param ordered: If true, term occurrences should be ordered 306 | :return: dictionary {doc: freq, ...} 307 | """ 308 | # creates span near query 309 | span_near_query = self.get_span_query(phrase.split(" "), field, slop=slop, ordered=ordered) 310 | 311 | # counts phrase occurrences per document 312 | self.open_searcher() 313 | index_reader_context = self.searcher.getTopReaderContext() 314 | term_contexts = HashMap() 315 | terms = TreeSet() 316 | span_near_query.extractTerms(terms) 317 | for term in terms: 318 | term_contexts.put(term, TermContext.build(index_reader_context, term)) 319 | leaves = index_reader_context.leaves() 320 | doc_phrase_freq = {} 321 | # iterates over all atomic readers 322 | for atomic_reader_context in leaves: 323 | bits = atomic_reader_context.reader().getLiveDocs() 324 | spans = span_near_query.getSpans(atomic_reader_context, bits, term_contexts) 325 | while spans.next(): 326 | lucene_doc_id = spans.doc() 327 | doc_id = atomic_reader_context.reader().document(lucene_doc_id).get(self.FIELDNAME_ID) 328 | if doc_id not in doc_phrase_freq: 329 | doc_phrase_freq[doc_id] = 1 330 | else: 331 | doc_phrase_freq[doc_id] += 1 332 | return doc_phrase_freq 333 | 334 | def get_id_filter(self): 335 | return FieldValueFilter(self.FIELDNAME_ID) 336 | 337 | def __to_retrieval_results(self, scoredocs, field_id=FIELDNAME_ID): 338 | """Converts Lucene scoreDocs results to RetrievalResults format.""" 339 | rr = RetrievalResults() 340 | if scoredocs is not None: 341 | for i in xrange(len(scoredocs)): 342 | score = scoredocs[i].score 343 | lucene_doc_id =
scoredocs[i].doc # internal doc_id 344 | doc_id = self.reader.document(lucene_doc_id).get(field_id) 345 | rr.append(doc_id, score, lucene_doc_id) 346 | return rr 347 | 348 | def score_query(self, query, field_content=FIELDNAME_CONTENTS, field_id=FIELDNAME_ID, num_docs=100): 349 | """Scores a given query and returns results as a RetrievalResults object.""" 350 | lucene_query = self.get_lucene_query(query, field_content) 351 | scoredocs = self.searcher.search(lucene_query, num_docs).scoreDocs 352 | return self.__to_retrieval_results(scoredocs, field_id) 353 | 354 | def num_docs(self): 355 | """Returns number of documents in the index.""" 356 | self.open_reader() 357 | return self.reader.numDocs() 358 | 359 | def num_fields(self): 360 | """Returns number of fields in the index.""" 361 | self.open_reader() 362 | atomic_reader = SlowCompositeReaderWrapper.wrap(self.reader) 363 | return atomic_reader.getFieldInfos().size() 364 | 365 | def get_fields(self): 366 | """Returns names of the fields in the index.""" 367 | fields = [] 368 | self.open_reader() 369 | atomic_reader = SlowCompositeReaderWrapper.wrap(self.reader) 370 | for fieldInfo in atomic_reader.getFieldInfos().iterator(): 371 | fields.append(fieldInfo.name) 372 | return fields 373 | 374 | def get_doc_termvector(self, lucene_doc_id, field): 375 | """Outputs the document term vector as a generator.""" 376 | terms = self.reader.getTermVector(lucene_doc_id, field) 377 | if terms: 378 | termenum = terms.iterator(None) 379 | for bytesref in BytesRefIterator.cast_(termenum): 380 | yield bytesref.utf8ToString(), termenum 381 | 382 | def get_doc_termfreqs(self, lucene_doc_id, field): 383 | """ 384 | Returns term frequencies for a given document field.
385 | 386 | :param lucene_doc_id: Lucene document ID 387 | :param field: document field 388 | :return dict: with terms 389 | """ 390 | termfreqs = {} 391 | for term, termenum in self.get_doc_termvector(lucene_doc_id, field): 392 | termfreqs[term] = int(termenum.totalTermFreq()) 393 | return termfreqs 394 | 395 | def get_doc_termfreqs_all_fields(self, lucene_doc_id): 396 | """ 397 | Returns term frequency for all fields in the given document. 398 | 399 | :param lucene_doc_id: Lucene document ID 400 | :return: dictionary {field: {term: freq, ...}, ...} 401 | """ 402 | doc_termfreqs = {} 403 | vectors = self.reader.getTermVectors(lucene_doc_id) 404 | if vectors: 405 | for field in vectors.iterator(): 406 | doc_termfreqs[field] = {} 407 | terms = vectors.terms(field) 408 | if terms: 409 | termenum = terms.iterator(None) 410 | for bytesref in BytesRefIterator.cast_(termenum): 411 | doc_termfreqs[field][bytesref.utf8ToString()] = int(termenum.totalTermFreq()) 412 | print doc_termfreqs[field] 413 | return doc_termfreqs 414 | 415 | def get_coll_termvector(self, field): 416 | """ Returns collection term vector for the given field.""" 417 | self.open_reader() 418 | fields = MultiFields.getFields(self.reader) 419 | if fields is not None: 420 | terms = fields.terms(field) 421 | if terms: 422 | termenum = terms.iterator(None) 423 | for bytesref in BytesRefIterator.cast_(termenum): 424 | yield bytesref.utf8ToString(), termenum 425 | 426 | def get_coll_termfreq(self, term, field): 427 | """ 428 | Returns collection term frequency for the given field. 429 | 430 | :param term: string 431 | :param field: string, document field 432 | :return: int 433 | """ 434 | self.open_reader() 435 | return self.reader.totalTermFreq(Term(field, term)) 436 | 437 | def get_doc_freq(self, term, field): 438 | """ 439 | Returns document frequency for the given term and field. 
440 | 441 | :param term: string, term 442 | :param field: string, document field 443 | :return: int 444 | """ 445 | self.open_reader() 446 | return self.reader.docFreq(Term(field, term)) 447 | 448 | def get_doc_count(self, field): 449 | """ 450 | Returns number of documents with at least one term for the given field. 451 | 452 | :param field: string, field name 453 | :return: int 454 | """ 455 | self.open_reader() 456 | return self.reader.getDocCount(field) 457 | 458 | def get_coll_length(self, field): 459 | """ 460 | Returns length of field in the collection. 461 | 462 | :param field: string, field name 463 | :return: int 464 | """ 465 | self.open_reader() 466 | return self.reader.getSumTotalTermFreq(field) 467 | 468 | def get_avg_len(self, field): 469 | """ 470 | Returns average length of a field in the collection. 471 | 472 | :param field: string, field name 473 | """ 474 | self.open_reader() 475 | n = self.reader.getDocCount(field) # number of documents with at least one term for this field 476 | len_all = self.reader.getSumTotalTermFreq(field) 477 | if n == 0: 478 | return 0 479 | else: 480 | return len_all / float(n) 481 | 482 | class LuceneDocument(object): 483 | """Internal representation of a Lucene document.""" 484 | 485 | def __init__(self): 486 | self.ldf = LuceneDocumentField() 487 | 488 | def create_document(self, contents): 489 | """Create a Lucene document from the specified contents. 
490 | Contents is a list of fields to be indexed, represented as a dictionary 491 | with keys 'field_name', 'field_type', and 'field_value'.""" 492 | doc = Document() 493 | for f in contents: 494 | doc.add(Field(f['field_name'], f['field_value'], 495 | self.ldf.get_field(f['field_type']))) 496 | return doc 497 | 498 | 499 | class LuceneDocumentField(object): 500 | """Internal handler class for possible field types.""" 501 | 502 | def __init__(self): 503 | """Init possible field types.""" 504 | 505 | # FIELD_ID: stored, indexed, non-tokenized 506 | self.field_id = FieldType() 507 | self.field_id.setIndexed(True) 508 | self.field_id.setStored(True) 509 | self.field_id.setTokenized(False) 510 | 511 | # FIELD_ID_TV: stored, indexed, not tokenized, with term vectors (without positions) 512 | # for storing IDs with term vector info 513 | self.field_id_tv = FieldType() 514 | self.field_id_tv.setIndexed(True) 515 | self.field_id_tv.setStored(True) 516 | self.field_id_tv.setTokenized(False) 517 | self.field_id_tv.setStoreTermVectors(True) 518 | 519 | # FIELD_TEXT: stored, indexed, tokenized, with positions 520 | self.field_text = FieldType() 521 | self.field_text.setIndexed(True) 522 | self.field_text.setStored(True) 523 | self.field_text.setTokenized(True) 524 | 525 | # FIELD_TEXT_TV: stored, indexed, tokenized, with term vectors (without positions) 526 | self.field_text_tv = FieldType() 527 | self.field_text_tv.setIndexed(True) 528 | self.field_text_tv.setStored(True) 529 | self.field_text_tv.setTokenized(True) 530 | self.field_text_tv.setStoreTermVectors(True) 531 | 532 | # FIELD_TEXT_TVP: stored, indexed, tokenized, with term vectors and positions 533 | # (but no character offsets) 534 | self.field_text_tvp = FieldType() 535 | self.field_text_tvp.setIndexed(True) 536 | self.field_text_tvp.setStored(True) 537 | self.field_text_tvp.setTokenized(True) 538 | self.field_text_tvp.setStoreTermVectors(True) 539 | self.field_text_tvp.setStoreTermVectorPositions(True) 540 | 541 | 
# FIELD_TEXT_NTV: not stored, indexed, tokenized, with term vectors (without positions) 542 | self.field_text_ntv = FieldType() 543 | self.field_text_ntv.setIndexed(True) 544 | self.field_text_ntv.setStored(False) 545 | self.field_text_ntv.setTokenized(True) 546 | self.field_text_ntv.setStoreTermVectors(True) 547 | 548 | # FIELD_TEXT_NTVP: not stored, indexed, tokenized, with term vectors and positions 549 | # (but no character offsets) 550 | self.field_text_ntvp = FieldType() 551 | self.field_text_ntvp.setIndexed(True) 552 | self.field_text_ntvp.setStored(False) 553 | self.field_text_ntvp.setTokenized(True) 554 | self.field_text_ntvp.setStoreTermVectors(True) 555 | self.field_text_ntvp.setStoreTermVectorPositions(True) 556 | 557 | def get_field(self, type): 558 | """Gets Lucene FieldType object for the corresponding internal FIELDTYPE_ value.""" 559 | if type == Lucene.FIELDTYPE_ID: 560 | return self.field_id 561 | elif type == Lucene.FIELDTYPE_ID_TV: 562 | return self.field_id_tv 563 | elif type == Lucene.FIELDTYPE_TEXT: 564 | return self.field_text 565 | elif type == Lucene.FIELDTYPE_TEXT_TV: 566 | return self.field_text_tv 567 | elif type == Lucene.FIELDTYPE_TEXT_TVP: 568 | return self.field_text_tvp 569 | elif type == Lucene.FIELDTYPE_TEXT_NTV: 570 | return self.field_text_ntv 571 | elif type == Lucene.FIELDTYPE_TEXT_NTVP: 572 | return self.field_text_ntvp 573 | else: 574 | raise Exception("Unknown field type") -------------------------------------------------------------------------------- /nordlys/retrieval/results.py: -------------------------------------------------------------------------------- 1 | """ 2 | Result list representation.
3 | 4 | - for each hit it holds score and both internal and external doc_ids 5 | 6 | @author: Krisztian Balog (krisztian.balog@uis.no) 7 | """ 8 | 9 | import operator 10 | 11 | 12 | class RetrievalResults(object): 13 | """Class for storing retrieval scores for a given query.""" 14 | def __init__(self): 15 | self.scores = {} 16 | # mapping from external to internal doc_ids -s 17 | self.doc_ids = {} 18 | 19 | def append(self, doc_id, score, doc_id_int=None): 20 | """Adds document to the result list""" 21 | self.scores[doc_id] = score 22 | if doc_id_int is not None: 23 | self.doc_ids[doc_id] = doc_id_int 24 | 25 | def increase(self, doc_id, score): 26 | """Increases the score of a document (adds it to the results list 27 | if it is not already there)""" 28 | if doc_id not in self.scores: 29 | self.scores[doc_id] = 0 30 | self.scores[doc_id] += score 31 | 32 | def num_docs(self): 33 | """Returns the number of documents in the result list.""" 34 | return len(self.scores) 35 | 36 | def get_scores_sorted(self): 37 | """Returns all results sorted by score""" 38 | return sorted(self.scores.iteritems(), key=operator.itemgetter(1), reverse=True) 39 | 40 | def get_doc_id_int(self, doc_id): 41 | """Returns internal doc_id for a given doc_id.""" 42 | if doc_id in self.doc_ids: 43 | return self.doc_ids[doc_id] 44 | return None 45 | 46 | def write_trec_format(self, query_id, run_id, out, max_rank=100): 47 | """Outputs results in TREC format""" 48 | rank = 1 49 | for doc_id, score in self.get_scores_sorted(): 50 | if rank <= max_rank: 51 | out.write(query_id + "\tQ0\t" + doc_id + "\t" + str(rank) + "\t" + str(score) + "\t" + run_id + "\n") 52 | rank += 1 53 | -------------------------------------------------------------------------------- /nordlys/retrieval/retrieval.py: -------------------------------------------------------------------------------- 1 | """ 2 | Console application for general-purpose retrieval. 
3 | 4 | first pass: get top N documents using Lucene's default retrieval method (based on the catch-all content field) 5 | second pass: perform (expensive) scoring of the top N documents using the Scorer class 6 | 7 | General config parameters: 8 | - index_dir: index directory 9 | - query_file: query file (JSON) 10 | - model: accepted values: lucene, lm, mlm, prms (default: lm) 11 | - output_file: output file name 12 | - output_format: (default: trec) -- not used yet 13 | - run_id: run id (only for "trec" output format) 14 | - num_docs: number of documents to return (default: 100) 15 | - field_id: id field to be returned (default: Lucene.FIELDNAME_ID) 16 | - first_pass_num_docs: number of documents in first-pass scoring (default: 10000) 17 | - first_pass_field: field used in first pass retrieval (default: Lucene.FIELDNAME_CONTENTS) 18 | 19 | Model-specific parameters: 20 | - smoothing_method: jm or dirichlet (lm and mlm, default: jm) 21 | - smoothing_param: value of lambda or mu (jm default: 0.1, dirichlet default: average field length) 22 | - field_weights: dict with fields and corresponding weights (only mlm) 23 | - field: field name for LM model 24 | - fields: fields for PRMS model 25 | 26 | 27 | @author: Krisztian Balog (krisztian.balog@uis.no) 28 | """ 29 | from datetime import datetime 30 | 31 | import sys 32 | import json 33 | import os 34 | from nordlys.retrieval.lucene_tools import Lucene 35 | from scorer import Scorer 36 | from results import RetrievalResults 37 | 38 | 39 | class Retrieval(object): 40 | def __init__(self, config): 41 | """ 42 | Loads config file, checks params, and sets default values.
43 | 44 | :param config: JSON config file or a dictionary 45 | """ 46 | # set configurations 47 | if type(config) == dict: 48 | self.config = config 49 | else: 50 | try: 51 | self.config = json.load(open(config)) 52 | except Exception, e: 53 | print "Error loading config file: ", e 54 | sys.exit(1) 55 | 56 | # check params and set default values 57 | try: 58 | if 'index_dir' not in self.config: 59 | raise Exception("index_dir is missing") 60 | if 'query_file' not in self.config: 61 | raise Exception("query_file is missing") 62 | if 'output_file' not in self.config: 63 | raise Exception("output_file is missing") 64 | if 'run_id' not in self.config: 65 | raise Exception("run_id is missing") 66 | if 'model' not in self.config: 67 | self.config['model'] = "lm" 68 | if 'num_docs' not in self.config: 69 | self.config['num_docs'] = 100 70 | if 'field_id' not in self.config: 71 | self.config['field_id'] = Lucene.FIELDNAME_ID 72 | if 'first_pass_num_docs' not in self.config: 73 | self.config['first_pass_num_docs'] = 10000 74 | if 'first_pass_field' not in self.config: 75 | self.config['first_pass_field'] = Lucene.FIELDNAME_CONTENTS 76 | 77 | # model specific params 78 | if self.config['model'] == "lm" or self.config['model'] == "mlm" or self.config['model'] == "prms": 79 | if 'smoothing_method' not in self.config: 80 | self.config['smoothing_method'] = "jm" 81 | # if 'smoothing_param' not in self.config: 82 | # self.config['smoothing_param'] = 0.1 83 | 84 | if self.config['model'] == "mlm": 85 | if 'field_weights' not in self.config: 86 | raise Exception("field_weights is missing") 87 | 88 | if self.config['model'] == "prms": 89 | if 'fields' not in self.config: 90 | raise Exception("fields is missing") 91 | 92 | except Exception, e: 93 | print "Error in config file: ", e 94 | sys.exit(1) 95 | 96 | def _open_index(self): 97 | self.lucene = Lucene(self.config['index_dir']) 98 | 99 | self.lucene.open_searcher() 100 | 101 | def _close_index(self): 102 | 
self.lucene.close_reader() 103 | 104 | def _load_queries(self): 105 | self.queries = json.load(open(self.config['query_file'])) 106 | 107 | def _first_pass_scoring(self, lucene, query): 108 | """ 109 | Returns first-pass scoring of documents. 110 | 111 | :param query: raw query 112 | :return RetrievalResults object 113 | """ 114 | print "\tFirst pass scoring... ", 115 | results = lucene.score_query(query, field_content=self.config['first_pass_field'], 116 | field_id=self.config['field_id'], 117 | num_docs=self.config['first_pass_num_docs']) 118 | print results.num_docs() 119 | return results 120 | 121 | def _second_pass_scoring(self, res1, scorer): 122 | """ 123 | Returns second-pass scoring of documents. 124 | 125 | :param res1: first pass results 126 | :return: RetrievalResults object 127 | """ 128 | print "\tSecond pass scoring... " 129 | results = RetrievalResults() 130 | for doc_id, orig_score in res1.get_scores_sorted(): 131 | doc_id_int = res1.get_doc_id_int(doc_id) 132 | score = scorer.score_doc(doc_id, doc_id_int) 133 | results.append(doc_id, score) 134 | print "done" 135 | return results 136 | 137 | def retrieve(self): 138 | """Scores queries and outputs results.""" 139 | s_t = datetime.now() # start time 140 | total_time = 0.0 141 | 142 | self._load_queries() 143 | self._open_index() 144 | 145 | # init output file 146 | if os.path.exists(self.config['output_file']): 147 | os.remove(self.config['output_file']) 148 | out = open(self.config['output_file'], "w") 149 | 150 | for query_id in sorted(self.queries): 151 | # query = Query.preprocess(self.queries[query_id]) 152 | query = Lucene.preprocess(self.queries[query_id]) 153 | print "scoring [" + query_id + "] " + query 154 | # first pass scoring 155 | res1 = self._first_pass_scoring(self.lucene, query) 156 | # second pass scoring (if needed) 157 | if self.config['model'] == "lucene": 158 | results = res1 159 | else: 160 | scorer = Scorer.get_scorer(self.config['model'], self.lucene, query, self.config) 161 
| results = self._second_pass_scoring(res1, scorer) 162 | # write results to output file 163 | results.write_trec_format(query_id, self.config['run_id'], out, self.config['num_docs']) 164 | 165 | # close output file 166 | out.close() 167 | # close index 168 | self._close_index() 169 | 170 | e_t = datetime.now() # end time 171 | diff = e_t - s_t 172 | total_time += diff.total_seconds() 173 | time_log = "Execution time(sec):\t" + str(total_time) + "\n" 174 | print time_log 175 | 176 | 177 | def print_usage(): 178 | print sys.argv[0] + " " 179 | sys.exit() 180 | 181 | 182 | def main(argv): 183 | if len(argv) < 1: 184 | print_usage() 185 | 186 | r = Retrieval(argv[0]) 187 | r.retrieve() 188 | 189 | 190 | if __name__ == '__main__': 191 | main(sys.argv[1:]) 192 | -------------------------------------------------------------------------------- /nordlys/retrieval/scorer.py: -------------------------------------------------------------------------------- 1 | """ 2 | Various retrieval models for scoring an individual document for a given query. 3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | @author: Krisztian Balog (krisztian.balog@uis.no) 6 | """ 7 | 8 | from __future__ import division 9 | import math 10 | from lucene_tools import Lucene 11 | 12 | 13 | class Scorer(object): 14 | """Base scorer class.""" 15 | 16 | SCORER_DEBUG = 0 17 | 18 | def __init__(self, lucene, query, params): 19 | self.lucene = lucene 20 | self.query = query 21 | self.params = params 22 | self.lucene.open_searcher() 23 | """ 24 | @todo consider the field for analysis 25 | """ 26 | # NOTE: The analyser might return terms that are not in the collection. 27 | # These terms are filtered out later in the score_doc functions. 28 | self.query_terms = lucene.analyze_query(self.query) if query is not None else None 29 | 30 | @staticmethod 31 | def get_scorer(model, lucene, query, params): 32 | """ 33 | Returns Scorer object (Scorer factory).
34 | 35 | :param model: accepted values: lm, mlm or prms 36 | :param lucene: Lucene object 37 | :param query: raw query (to be analyzed) 38 | :param params: dict with model parameters 39 | """ 40 | if model == "lm": 41 | print "\tLM scoring ... " 42 | return ScorerLM(lucene, query, params) 43 | elif model == "mlm": 44 | print "\tMLM scoring ..." 45 | return ScorerMLM(lucene, query, params) 46 | elif model == "prms": 47 | print "\tPRMS scoring ..." 48 | return ScorerPRMS(lucene, query, params) 49 | else: 50 | raise Exception("Unknown model '" + model + "'") 51 | 52 | 53 | class ScorerLM(Scorer): 54 | def __init__(self, lucene, query, params): 55 | super(ScorerLM, self).__init__(lucene, query, params) 56 | self.smoothing_method = params.get('smoothing_method', "jm").lower() 57 | if (self.smoothing_method != "jm") and (self.smoothing_method != "dirichlet"): 58 | raise Exception(self.params['smoothing_method'] + " smoothing method is not supported!") 59 | self.tf = {} 60 | 61 | @staticmethod 62 | def get_jm_prob(tf_t_d, len_d, tf_t_C, len_C, lambd): 63 | """ 64 | Computes JM-smoothed probability 65 | p(t|theta_d) = [(1-lambda) tf(t, d)/|d|] + [lambda tf(t, C)/|C|] 66 | 67 | :param tf_t_d: tf(t,d) 68 | :param len_d: |d| 69 | :param tf_t_C: tf(t,C) 70 | :param len_C: |C| = \sum_{d \in C} |d| 71 | :param lambd: \lambda 72 | :return: JM-smoothed probability p(t|theta_d) 73 | """ 74 | p_t_d = tf_t_d / len_d if len_d > 0 else 0 75 | p_t_C = tf_t_C / len_C if len_C > 0 else 0 76 | return (1 - lambd) * p_t_d + lambd * p_t_C 77 | 78 | @staticmethod 79 | def get_dirichlet_prob(tf_t_d, len_d, tf_t_C, len_C, mu): 80 | """ 81 | Computes Dirichlet-smoothed probability 82 | P(t|theta_d) = [tf(t, d) + mu P(t|C)] / [|d| + mu] 83 | 84 | :param tf_t_d: tf(t,d) 85 | :param len_d: |d| 86 | :param tf_t_C: tf(t,C) 87 | :param len_C: |C| = \sum_{d \in C} |d| 88 | :param mu: \mu 89 | :return: Dirichlet-smoothed probability P(t|theta_d) 90 | """ 91 | if mu == 0: # i.e.
field does not have any content in the collection 92 | return 0 93 | else: 94 | p_t_C = tf_t_C / len_C if len_C > 0 else 0 95 | return (tf_t_d + mu * p_t_C) / (len_d + mu) 96 | 97 | def get_tf(self, lucene_doc_id, field): 98 | if lucene_doc_id not in self.tf: 99 | self.tf[lucene_doc_id] = {} 100 | if field not in self.tf[lucene_doc_id]: 101 | self.tf[lucene_doc_id][field] = self.lucene.get_doc_termfreqs(lucene_doc_id, field) 102 | return self.tf[lucene_doc_id][field] 103 | 104 | def get_term_prob(self, lucene_doc_id, field, t, tf_t_d_f=None, tf_t_C_f=None): 105 | """ 106 | Returns probability of a given term for the given field. 107 | 108 | :param lucene_doc_id: internal Lucene document ID 109 | :param field: entity field name, e.g. 110 | :param t: term 111 | :return: P(t|d_f) 112 | """ 113 | # Gets term freqs for field of document 114 | tf = {} 115 | if lucene_doc_id is not None: 116 | tf = self.get_tf(lucene_doc_id, field) 117 | 118 | len_d_f = sum(tf.values()) 119 | len_C_f = self.lucene.get_coll_length(field) 120 | 121 | tf_t_d_f = tf.get(t, 0) if tf_t_d_f is None else tf_t_d_f 122 | tf_t_C_f = self.lucene.get_coll_termfreq(t, field) if tf_t_C_f is None else tf_t_C_f 123 | if self.SCORER_DEBUG: 124 | print "\t\tt=" + t + ", f=" + field 125 | print "\t\t\tDoc: tf(t,f)=" + str(tf_t_d_f) + "\t|f|=" + str(len_d_f) 126 | print "\t\t\tColl: tf(t,f)=" + str(tf_t_C_f) + "\t|f|=" + str(len_C_f) 127 | 128 | # JM smoothing: p(t|theta_d_f) = [(1-lambda) tf(t, d_f)/|d_f|] + [lambda tf(t, C_f)/|C_f|] 129 | if self.smoothing_method == "jm": 130 | lambd = self.params.get('smoothing_param', 0.1) 131 | p_t_d_f = self.get_jm_prob(tf_t_d_f, len_d_f, tf_t_C_f, len_C_f, lambd) 132 | if self.SCORER_DEBUG: 133 | print "\t\t\tJM smoothing:" 134 | print "\t\t\tDoc: p(t|theta_d_f)=", p_t_d_f 135 | # Dirichlet smoothing 136 | elif self.smoothing_method == "dirichlet": 137 | mu = self.params.get('smoothing_param', self.lucene.get_avg_len(field)) 138 | p_t_d_f = 
self.get_dirichlet_prob(tf_t_d_f, len_d_f, tf_t_C_f, len_C_f, mu) 139 | if self.SCORER_DEBUG: 140 | print "\t\t\tDirichlet smoothing:" 141 | print "\t\t\tmu:", mu 142 | print "\t\t\tDoc: p(t|theta_d_f)=", p_t_d_f 143 | return p_t_d_f 144 | 145 | def get_term_probs(self, lucene_doc_id, field): 146 | """ 147 | Returns probability of all query terms for the given field. 148 | 149 | :param lucene_doc_id: internal Lucene document ID 150 | :param field: entity field name, e.g. 151 | :return: dictionary of terms with their probabilities 152 | """ 153 | p_t_theta_d_f = {} 154 | for t in set(self.query_terms): 155 | p_t_theta_d_f[t] = self.get_term_prob(lucene_doc_id, field, t) 156 | return p_t_theta_d_f 157 | 158 | def score_doc(self, doc_id, lucene_doc_id=None): 159 | """ 160 | Scores the given document using LM. 161 | 162 | :param doc_id: document id 163 | :param lucene_doc_id: internal Lucene document ID 164 | :return float, LM score of document and query 165 | """ 166 | if self.SCORER_DEBUG: 167 | print "Scoring doc ID=" + doc_id 168 | 169 | if lucene_doc_id is None: 170 | lucene_doc_id = self.lucene.get_lucene_document_id(doc_id) 171 | 172 | field = self.params.get('field', Lucene.FIELDNAME_CONTENTS) 173 | 174 | p_t_theta_d = self.get_term_probs(lucene_doc_id, field) 175 | if sum(p_t_theta_d.values()) == 0: # none of query terms are in the field collection 176 | if self.SCORER_DEBUG: 177 | print "\t\tP(q|" + field + ") = None" 178 | return None 179 | # p(q|theta_d) = prod(p(t|theta_d)) ; we return log(p(q|theta_d)) 180 | p_q_theta_d = 0 181 | for t in self.query_terms: 182 | # Skips the term if it is not in the field collection 183 | if p_t_theta_d[t] == 0: 184 | continue 185 | if self.SCORER_DEBUG: 186 | print "\t\tP(" + t + "|" + field + ") = " + str(p_t_theta_d[t]) 187 | p_q_theta_d += math.log(p_t_theta_d[t]) 188 | if self.SCORER_DEBUG: 189 | print "\tP(d|q)=" + str(p_q_theta_d) 190 | return p_q_theta_d 191 | 192 | 193 | class ScorerMLM(ScorerLM): 194 | def 
__init__(self, lucene, query, params): 195 | super(ScorerMLM, self).__init__(lucene, query, params) 196 | 197 | def get_mlm_term_prob(self, lucene_doc_id, weights, t): 198 | """ 199 | Returns MLM probability for the given term and field-weights. 200 | 201 | :param lucene_doc_id: internal Lucene document ID 202 | :param weights: dictionary, {field: weights, ...} 203 | :param t: term 204 | :return: P(t|theta_d) 205 | """ 206 | # p(t|theta_d) = sum(mu_f * p(t|theta_d_f)) 207 | p_t_theta_d = 0 208 | for f, mu_f in weights.iteritems(): 209 | p_t_theta_d_f = self.get_term_prob(lucene_doc_id, f, t) 210 | p_t_theta_d += mu_f * p_t_theta_d_f 211 | if self.SCORER_DEBUG: 212 | print "\t\tP(t|theta_d)=" + str(p_t_theta_d) 213 | return p_t_theta_d 214 | 215 | def get_mlm_term_probs(self, lucene_doc_id, weights): 216 | """ 217 | Returns probability of all query terms for the given field weights. 218 | 219 | :param lucene_doc_id: internal Lucene document ID 220 | :param weights: dictionary, {field: weights, ...} 221 | :return: dictionary of terms with their probabilities 222 | """ 223 | p_t_theta_d = {} 224 | for t in set(self.query_terms): 225 | if self.SCORER_DEBUG: 226 | print "\tt=" + t 227 | p_t_theta_d[t] = self.get_mlm_term_prob(lucene_doc_id, weights, t) 228 | return p_t_theta_d 229 | 230 | def score_doc(self, doc_id, lucene_doc_id=None): 231 | """ 232 | Scores the given document using MLM model. 
233 | 234 | :param doc_id: document id 235 | :param lucene_doc_id: internal Lucene document ID 236 | :return float, MLM score of document and query 237 | """ 238 | if self.SCORER_DEBUG: 239 | print "Scoring doc ID=" + doc_id 240 | 241 | if lucene_doc_id is None: 242 | lucene_doc_id = self.lucene.get_lucene_document_id(doc_id) 243 | 244 | weights = self.params['field_weights'] 245 | 246 | p_t_theta_d = self.get_mlm_term_probs(lucene_doc_id, weights) 247 | # none of query terms are in the field collection 248 | if sum(p_t_theta_d.values()) == 0: 249 | if self.SCORER_DEBUG: 250 | print "\t\tP_mlm(q|theta_d) = None" 251 | return None 252 | # p(q|theta_d) = prod(p(t|theta_d)) ; we return log(p(q|theta_d)) 253 | p_q_theta_d = 0 254 | for t in self.query_terms: 255 | if p_t_theta_d[t] == 0: 256 | continue 257 | if self.SCORER_DEBUG: 258 | print "\t\tP_mlm(" + t + "|theta_d) = " + str(p_t_theta_d[t]) 259 | p_q_theta_d += math.log(p_t_theta_d[t]) 260 | 261 | return p_q_theta_d 262 | 263 | 264 | class ScorerPRMS(ScorerLM): 265 | def __init__(self, lucene, query, params): 266 | super(ScorerPRMS, self).__init__(lucene, query, params) 267 | self.fields = self.params['fields'] 268 | self.total_field_freq = None 269 | self.mapping_probs = None 270 | 271 | def score_doc(self, doc_id, lucene_doc_id=None): 272 | """ 273 | Scores the given document using PRMS model. 
274 | 275 | :param doc_id: document id 276 | :param lucene_doc_id: internal Lucene document ID 277 | :return: float, PRMS score of document and query 278 | """ 279 | if self.SCORER_DEBUG: 280 | print "Scoring doc ID=" + doc_id 281 | 282 | if lucene_doc_id is None: 283 | lucene_doc_id = self.lucene.get_lucene_document_id(doc_id) 284 | 285 | # gets mapping probs: p(f|t) 286 | p_f_t = self.get_mapping_probs() 287 | 288 | # gets term probs: p(t|theta_d_f) 289 | p_t_theta_d_f = {} 290 | for field in self.fields: 291 | p_t_theta_d_f[field] = self.get_term_probs(lucene_doc_id, field) 292 | # none of the query terms are in the field collection 293 | if sum([sum(p_t_theta_d_f[field].values()) for field in p_t_theta_d_f]) == 0: 294 | return None 295 | 296 | # p(q|theta_d) = prod(p(t|theta_d)) ; we return log(p(q|theta_d)) 297 | p_q_theta_d = 0 298 | for t in self.query_terms: 299 | if self.SCORER_DEBUG: 300 | print "\tt=" + t 301 | # p(t|theta_d) = sum(p(f|t) * p(t|theta_d_f)) 302 | p_t_theta_d = 0 303 | for f in self.fields: 304 | if f in p_f_t[t]: 305 | p_t_theta_d += p_f_t[t][f] * p_t_theta_d_f[f][t] 306 | if self.SCORER_DEBUG: 307 | print "\t\t\tf=" + f + ", p(f|t)=" + str(p_f_t[t][f]) + " P(t|theta_d,f)=" + str(p_t_theta_d_f[f][t]) 308 | 309 | if p_t_theta_d == 0: 310 | continue 311 | p_q_theta_d += math.log(p_t_theta_d) 312 | if self.SCORER_DEBUG: 313 | print "\t\tP(t|theta_d)=" + str(p_t_theta_d) 314 | return p_q_theta_d 315 | 316 | def get_mapping_probs(self): 317 | """Gets (cached) mapping probabilities for all query terms.""" 318 | if self.mapping_probs is None: 319 | self.mapping_probs = {} 320 | for t in set(self.query_terms): 321 | self.mapping_probs[t] = self.get_mapping_prob(t) 322 | return self.mapping_probs 323 | 324 | def get_mapping_prob(self, t, coll_termfreq_fields=None): 325 | """ 326 | Computes PRMS field mapping probability.
327 | p(f|t) = P(t|f)P(f) / sum_f'(P(t|C_{f'_c})P(f')) 328 | 329 | :param t: str 330 | :param coll_termfreq_fields: {field: freq, ...} 331 | :return Dictionary {field: prms_prob, ...} 332 | """ 333 | if coll_termfreq_fields is None: 334 | coll_termfreq_fields = {} 335 | for f in self.fields: 336 | coll_termfreq_fields[f] = self.lucene.get_coll_termfreq(t, f) 337 | 338 | # calculates numerators for all fields: P(t|f)P(f) 339 | numerators = {} 340 | for f in self.fields: 341 | p_t_f = coll_termfreq_fields[f] / self.lucene.get_coll_length(f) 342 | p_f = self.lucene.get_doc_count(f) / self.get_total_field_freq() 343 | p_f_t = p_t_f * p_f 344 | if p_f_t > 0: 345 | numerators[f] = p_f_t 346 | if self.SCORER_DEBUG: 347 | print "\tf= " + f, "t= " + t + " P(t|f)=" + str(p_t_f) + " P(f)=" + str(p_f) 348 | 349 | # calculates denominator: sum_f'(P(t|C_{f'_c})P(f')) 350 | denominator = sum(numerators.values()) 351 | 352 | mapping_probs = {} 353 | if denominator > 0: # if the term is present in the collection 354 | for f in numerators: 355 | mapping_probs[f] = numerators[f] / denominator 356 | if self.SCORER_DEBUG: 357 | print "\t\tf= " + f + " t= " + t + " p(f|t)= " + str(numerators[f]) + "/" + str(sum(numerators.values())) + \ 358 | " = " + str(mapping_probs[f]) 359 | return mapping_probs 360 | 361 | def get_total_field_freq(self): 362 | """Returns total occurrences of all fields""" 363 | if self.total_field_freq is None: 364 | total_field_freq = 0 365 | for f in self.fields: 366 | total_field_freq += self.lucene.get_doc_count(f) 367 | self.total_field_freq = total_field_freq 368 | return self.total_field_freq -------------------------------------------------------------------------------- /qrels/qrels-SemSearch_ES.txt: -------------------------------------------------------------------------------- 1 | SemSearch_ES-1 Q0 1 2 | SemSearch_ES-1 Q0 1 3 | SemSearch_ES-1 Q0 1 4 | SemSearch_ES-1 Q0 1 5 | SemSearch_ES-1 Q0 1 6 | SemSearch_ES-1 Q0 1 7 | SemSearch_ES-1 Q0 2 8 | 
SemSearch_ES-1 Q0 1 9 | SemSearch_ES-1 Q0 2 10 | SemSearch_ES-1 Q0 2 11 | SemSearch_ES-1 Q0 1 12 | SemSearch_ES-1 Q0 1 13 | SemSearch_ES-1 Q0 2 14 | SemSearch_ES-1 Q0 1 15 | SemSearch_ES-10 Q0 1 16 | SemSearch_ES-10 Q0 1 17 | SemSearch_ES-10 Q0 2 18 | SemSearch_ES-10 Q0 2 19 | SemSearch_ES-10 Q0 1 20 | SemSearch_ES-10 Q0 2 21 | SemSearch_ES-10 Q0 1 22 | SemSearch_ES-10 Q0 1 23 | SemSearch_ES-10 Q0 2 24 | SemSearch_ES-10 Q0 1 25 | SemSearch_ES-10 Q0 1 26 | SemSearch_ES-100 Q0 1 27 | SemSearch_ES-100 Q0 1 28 | SemSearch_ES-101 Q0 2 29 | SemSearch_ES-102 Q0 1 30 | SemSearch_ES-102 Q0 1 31 | SemSearch_ES-102 Q0 1 32 | SemSearch_ES-102 Q0 1 33 | SemSearch_ES-102 Q0 1 34 | SemSearch_ES-102 Q0 1 35 | SemSearch_ES-102 Q0 1 36 | SemSearch_ES-102 Q0 1 37 | SemSearch_ES-102 Q0 1 38 | SemSearch_ES-103 Q0 1 39 | SemSearch_ES-104 Q0 2 40 | SemSearch_ES-104 Q0 2 41 | SemSearch_ES-104 Q0 1 42 | SemSearch_ES-104 Q0 1 43 | SemSearch_ES-104 Q0 1 44 | SemSearch_ES-105 Q0 1 45 | SemSearch_ES-106 Q0 1 46 | SemSearch_ES-106 Q0 1 47 | SemSearch_ES-106 Q0 2 48 | SemSearch_ES-107 Q0 1 49 | SemSearch_ES-107 Q0 1 50 | SemSearch_ES-108 Q0 2 51 | SemSearch_ES-109 Q0 2 52 | SemSearch_ES-109 Q0 1 53 | SemSearch_ES-109 Q0 1 54 | SemSearch_ES-109 Q0 1 55 | SemSearch_ES-109 Q0 1 56 | SemSearch_ES-109 Q0 1 57 | SemSearch_ES-109 Q0 1 58 | SemSearch_ES-11 Q0 2 59 | SemSearch_ES-11 Q0 2 60 | SemSearch_ES-11 Q0 2 61 | SemSearch_ES-11 Q0 2 62 | SemSearch_ES-11 Q0 1 63 | SemSearch_ES-11 Q0 2 64 | SemSearch_ES-11 Q0 1 65 | SemSearch_ES-11 Q0 2 66 | SemSearch_ES-11 Q0 2 67 | SemSearch_ES-11 Q0 2 68 | SemSearch_ES-11 Q0 2 69 | SemSearch_ES-11 Q0 2 70 | SemSearch_ES-11 Q0 2 71 | SemSearch_ES-111 Q0 2 72 | SemSearch_ES-111 Q0 1 73 | SemSearch_ES-111 Q0 1 74 | SemSearch_ES-111 Q0 1 75 | SemSearch_ES-111 Q0 2 76 | SemSearch_ES-112 Q0 1 77 | SemSearch_ES-114 Q0 1 78 | SemSearch_ES-114 Q0 1 79 | SemSearch_ES-114 Q0 1 80 | SemSearch_ES-114 Q0 1 81 | SemSearch_ES-114 Q0 2 82 | SemSearch_ES-115 Q0 1 83 | 
SemSearch_ES-115 Q0 1 84 | SemSearch_ES-118 Q0 2 85 | SemSearch_ES-118 Q0 1 86 | SemSearch_ES-118 Q0 1 87 | SemSearch_ES-118 Q0 1 88 | SemSearch_ES-118 Q0 2 89 | SemSearch_ES-118 Q0 1 90 | SemSearch_ES-119 Q0 2 91 | SemSearch_ES-119 Q0 1 92 | SemSearch_ES-119 Q0 2 93 | SemSearch_ES-119 Q0 2 94 | SemSearch_ES-119 Q0 1 95 | SemSearch_ES-119 Q0 2 96 | SemSearch_ES-119 Q0 2 97 | SemSearch_ES-119 Q0 1 98 | SemSearch_ES-119 Q0 2 99 | SemSearch_ES-119 Q0 2 100 | SemSearch_ES-119 Q0 2 101 | SemSearch_ES-119 Q0 2 102 | SemSearch_ES-119 Q0 2 103 | SemSearch_ES-119 Q0 2 104 | SemSearch_ES-119 Q0 2 105 | SemSearch_ES-119 Q0 2 106 | SemSearch_ES-119 Q0 1 107 | SemSearch_ES-12 Q0 1 108 | SemSearch_ES-12 Q0 1 109 | SemSearch_ES-12 Q0 1 110 | SemSearch_ES-12 Q0 1 111 | SemSearch_ES-12 Q0 1 112 | SemSearch_ES-12 Q0 1 113 | SemSearch_ES-12 Q0 1 114 | SemSearch_ES-12 Q0 2 115 | SemSearch_ES-12 Q0 1 116 | SemSearch_ES-12 Q0 2 117 | SemSearch_ES-12 Q0 2 118 | SemSearch_ES-12 Q0 1 119 | SemSearch_ES-12 Q0 2 120 | SemSearch_ES-12 Q0 1 121 | SemSearch_ES-12 Q0 1 122 | SemSearch_ES-120 Q0 1 123 | SemSearch_ES-123 Q0 2 124 | SemSearch_ES-123 Q0 2 125 | SemSearch_ES-123 Q0 1 126 | SemSearch_ES-123 Q0 1 127 | SemSearch_ES-123 Q0 2 128 | SemSearch_ES-123 Q0 2 129 | SemSearch_ES-123 Q0 1 130 | SemSearch_ES-124 Q0 1 131 | SemSearch_ES-125 Q0 1 132 | SemSearch_ES-125 Q0 1 133 | SemSearch_ES-127 Q0 1 134 | SemSearch_ES-127 Q0 1 135 | SemSearch_ES-127 Q0 1 136 | SemSearch_ES-128 Q0 1 137 | SemSearch_ES-129 Q0 1 138 | SemSearch_ES-13 Q0 2 139 | SemSearch_ES-13 Q0 1 140 | SemSearch_ES-130 Q0 1 141 | SemSearch_ES-130 Q0 1 142 | SemSearch_ES-130 Q0 1 143 | SemSearch_ES-130 Q0 1 144 | SemSearch_ES-131 Q0 1 145 | SemSearch_ES-131 Q0 1 146 | SemSearch_ES-131 Q0 1 147 | SemSearch_ES-131 Q0 1 148 | SemSearch_ES-131 Q0 1 149 | SemSearch_ES-132 Q0 2 150 | SemSearch_ES-133 Q0 1 151 | SemSearch_ES-134 Q0 1 152 | SemSearch_ES-135 Q0 1 153 | SemSearch_ES-136 Q0 1 154 | SemSearch_ES-136 Q0 1 155 | SemSearch_ES-136 
Q0 1 156 | SemSearch_ES-136 Q0 1 157 | SemSearch_ES-136 Q0 1 158 | SemSearch_ES-136 Q0 1 159 | SemSearch_ES-136 Q0 1 160 | SemSearch_ES-136 Q0 1 161 | SemSearch_ES-137 Q0 1 162 | SemSearch_ES-138 Q0 1 163 | SemSearch_ES-138 Q0 1 164 | SemSearch_ES-138 Q0 1 165 | SemSearch_ES-139 Q0 1 166 | SemSearch_ES-139 Q0 1 167 | SemSearch_ES-14 Q0 2 168 | SemSearch_ES-14 Q0 2 169 | SemSearch_ES-14 Q0 2 170 | SemSearch_ES-14 Q0 1 171 | SemSearch_ES-14 Q0 2 172 | SemSearch_ES-14 Q0 1 173 | SemSearch_ES-14 Q0 1 174 | SemSearch_ES-14 Q0 2 175 | SemSearch_ES-14 Q0 1 176 | SemSearch_ES-14 Q0 2 177 | SemSearch_ES-14 Q0 1 178 | SemSearch_ES-14 Q0 2 179 | SemSearch_ES-14 Q0 2 180 | SemSearch_ES-140 Q0 1 181 | SemSearch_ES-140 Q0 1 182 | SemSearch_ES-141 Q0 1 183 | SemSearch_ES-141 Q0 1 184 | SemSearch_ES-141 Q0 1 185 | SemSearch_ES-141 Q0 1 186 | SemSearch_ES-142 Q0 1 187 | SemSearch_ES-142 Q0 1 188 | SemSearch_ES-142 Q0 1 189 | SemSearch_ES-15 Q0 2 190 | SemSearch_ES-15 Q0 1 191 | SemSearch_ES-15 Q0 1 192 | SemSearch_ES-15 Q0 1 193 | SemSearch_ES-16 Q0 1 194 | SemSearch_ES-16 Q0 2 195 | SemSearch_ES-16 Q0 1 196 | SemSearch_ES-16 Q0 2 197 | SemSearch_ES-16 Q0 2 198 | SemSearch_ES-16 Q0 1 199 | SemSearch_ES-16 Q0 1 200 | SemSearch_ES-16 Q0 1 201 | SemSearch_ES-16 Q0 1 202 | SemSearch_ES-16 Q0 1 203 | SemSearch_ES-16 Q0 1 204 | SemSearch_ES-16 Q0 2 205 | SemSearch_ES-16 Q0 1 206 | SemSearch_ES-16 Q0 1 207 | SemSearch_ES-16 Q0 1 208 | SemSearch_ES-17 Q0 1 209 | SemSearch_ES-17 Q0 1 210 | SemSearch_ES-17 Q0 1 211 | SemSearch_ES-17 Q0 2 212 | SemSearch_ES-17 Q0 1 213 | SemSearch_ES-17 Q0 2 214 | SemSearch_ES-17 Q0 1 215 | SemSearch_ES-17 Q0 2 216 | SemSearch_ES-17 Q0 2 217 | SemSearch_ES-17 Q0 1 218 | SemSearch_ES-17 Q0 2 219 | SemSearch_ES-17 Q0 1 220 | SemSearch_ES-17 Q0 1 221 | SemSearch_ES-18 Q0 2 222 | SemSearch_ES-18 Q0 2 223 | SemSearch_ES-18 Q0 1 224 | SemSearch_ES-19 Q0 1 225 | SemSearch_ES-19 Q0 1 226 | SemSearch_ES-19 Q0 2 227 | SemSearch_ES-19 Q0 1 228 | SemSearch_ES-19 Q0 2 229 
| SemSearch_ES-2 Q0 1 230 | SemSearch_ES-2 Q0 2 231 | SemSearch_ES-2 Q0 1 232 | SemSearch_ES-2 Q0 2 233 | SemSearch_ES-2 Q0 1 234 | SemSearch_ES-20 Q0 1 235 | SemSearch_ES-20 Q0 1 236 | SemSearch_ES-20 Q0 1 237 | SemSearch_ES-20 Q0 1 238 | SemSearch_ES-20 Q0 1 239 | SemSearch_ES-20 Q0 1 240 | SemSearch_ES-20 Q0 1 241 | SemSearch_ES-20 Q0 1 242 | SemSearch_ES-20 Q0 1 243 | SemSearch_ES-20 Q0 1 244 | SemSearch_ES-20 Q0 1 245 | SemSearch_ES-20 Q0 1 246 | SemSearch_ES-20 Q0 1 247 | SemSearch_ES-20 Q0 2 248 | SemSearch_ES-20 Q0 1 249 | SemSearch_ES-20 Q0 1 250 | SemSearch_ES-20 Q0 1 251 | SemSearch_ES-20 Q0 1 252 | SemSearch_ES-20 Q0 1 253 | SemSearch_ES-20 Q0 1 254 | SemSearch_ES-20 Q0 1 255 | SemSearch_ES-20 Q0 1 256 | SemSearch_ES-21 Q0 2 257 | SemSearch_ES-21 Q0 1 258 | SemSearch_ES-21 Q0 1 259 | SemSearch_ES-21 Q0 1 260 | SemSearch_ES-21 Q0 2 261 | SemSearch_ES-21 Q0 2 262 | SemSearch_ES-21 Q0 2 263 | SemSearch_ES-21 Q0 1 264 | SemSearch_ES-21 Q0 2 265 | SemSearch_ES-21 Q0 2 266 | SemSearch_ES-21 Q0 1 267 | SemSearch_ES-21 Q0 1 268 | SemSearch_ES-21 Q0 1 269 | SemSearch_ES-21 Q0 1 270 | SemSearch_ES-21 Q0 1 271 | SemSearch_ES-21 Q0 1 272 | SemSearch_ES-21 Q0 1 273 | SemSearch_ES-21 Q0 2 274 | SemSearch_ES-22 Q0 1 275 | SemSearch_ES-22 Q0 1 276 | SemSearch_ES-22 Q0 2 277 | SemSearch_ES-22 Q0 1 278 | SemSearch_ES-22 Q0 2 279 | SemSearch_ES-22 Q0 1 280 | SemSearch_ES-22 Q0 1 281 | SemSearch_ES-22 Q0 1 282 | SemSearch_ES-22 Q0 1 283 | SemSearch_ES-22 Q0 2 284 | SemSearch_ES-22 Q0 1 285 | SemSearch_ES-22 Q0 1 286 | SemSearch_ES-22 Q0 2 287 | SemSearch_ES-22 Q0 2 288 | SemSearch_ES-23 Q0 2 289 | SemSearch_ES-23 Q0 1 290 | SemSearch_ES-23 Q0 1 291 | SemSearch_ES-23 Q0 1 292 | SemSearch_ES-23 Q0 1 293 | SemSearch_ES-23 Q0 1 294 | SemSearch_ES-23 Q0 1 295 | SemSearch_ES-23 Q0 2 296 | SemSearch_ES-23 Q0 1 297 | SemSearch_ES-23 Q0 1 298 | SemSearch_ES-23 Q0 2 299 | SemSearch_ES-23 Q0 1 300 | SemSearch_ES-23 Q0 1 301 | SemSearch_ES-23 Q0 2 302 | SemSearch_ES-23 Q0 1 303 | 
SemSearch_ES-23 Q0 1 304 | SemSearch_ES-23 Q0 2 305 | SemSearch_ES-23 Q0 1 306 | SemSearch_ES-23 Q0 1 307 | SemSearch_ES-23 Q0 1 308 | SemSearch_ES-23 Q0 1 309 | SemSearch_ES-23 Q0 1 310 | SemSearch_ES-23 Q0 1 311 | SemSearch_ES-23 Q0 1 312 | SemSearch_ES-23 Q0 2 313 | SemSearch_ES-23 Q0 1 314 | SemSearch_ES-24 Q0 1 315 | SemSearch_ES-24 Q0 1 316 | SemSearch_ES-24 Q0 1 317 | SemSearch_ES-24 Q0 2 318 | SemSearch_ES-24 Q0 1 319 | SemSearch_ES-24 Q0 1 320 | SemSearch_ES-24 Q0 1 321 | SemSearch_ES-24 Q0 1 322 | SemSearch_ES-24 Q0 1 323 | SemSearch_ES-24 Q0 1 324 | SemSearch_ES-24 Q0 1 325 | SemSearch_ES-24 Q0 2 326 | SemSearch_ES-24 Q0 1 327 | SemSearch_ES-24 Q0 1 328 | SemSearch_ES-24 Q0 1 329 | SemSearch_ES-25 Q0 1 330 | SemSearch_ES-25 Q0 2 331 | SemSearch_ES-25 Q0 2 332 | SemSearch_ES-26 Q0 1 333 | SemSearch_ES-26 Q0 1 334 | SemSearch_ES-26 Q0 2 335 | SemSearch_ES-26 Q0 1 336 | SemSearch_ES-26 Q0 1 337 | SemSearch_ES-26 Q0 1 338 | SemSearch_ES-26 Q0 1 339 | SemSearch_ES-26 Q0 1 340 | SemSearch_ES-26 Q0 1 341 | SemSearch_ES-26 Q0 1 342 | SemSearch_ES-27 Q0 1 343 | SemSearch_ES-27 Q0 1 344 | SemSearch_ES-27 Q0 1 345 | SemSearch_ES-27 Q0 1 346 | SemSearch_ES-28 Q0 2 347 | SemSearch_ES-28 Q0 2 348 | SemSearch_ES-28 Q0 2 349 | SemSearch_ES-28 Q0 1 350 | SemSearch_ES-28 Q0 1 351 | SemSearch_ES-28 Q0 1 352 | SemSearch_ES-28 Q0 2 353 | SemSearch_ES-28 Q0 1 354 | SemSearch_ES-28 Q0 1 355 | SemSearch_ES-28 Q0 1 356 | SemSearch_ES-28 Q0 2 357 | SemSearch_ES-28 Q0 2 358 | SemSearch_ES-28 Q0 2 359 | SemSearch_ES-28 Q0 2 360 | SemSearch_ES-28 Q0 1 361 | SemSearch_ES-28 Q0 2 362 | SemSearch_ES-28 Q0 1 363 | SemSearch_ES-28 Q0 1 364 | SemSearch_ES-28 Q0 1 365 | SemSearch_ES-28 Q0 1 366 | SemSearch_ES-28 Q0 2 367 | SemSearch_ES-28 Q0 1 368 | SemSearch_ES-28 Q0 2 369 | SemSearch_ES-28 Q0 1 370 | SemSearch_ES-28 Q0 2 371 | SemSearch_ES-28 Q0 1 372 | SemSearch_ES-28 Q0 2 373 | SemSearch_ES-28 Q0 2 374 | SemSearch_ES-28 Q0 2 375 | SemSearch_ES-28 Q0 1 376 | SemSearch_ES-28 Q0 1 377 | 
SemSearch_ES-28 Q0 2 378 | SemSearch_ES-28 Q0 1 379 | SemSearch_ES-29 Q0 1 380 | SemSearch_ES-29 Q0 1 381 | SemSearch_ES-29 Q0 1 382 | SemSearch_ES-3 Q0 1 383 | SemSearch_ES-3 Q0 2 384 | SemSearch_ES-30 Q0 1 385 | SemSearch_ES-30 Q0 1 386 | SemSearch_ES-30 Q0 2 387 | SemSearch_ES-30 Q0 1 388 | SemSearch_ES-30 Q0 1 389 | SemSearch_ES-30 Q0 1 390 | SemSearch_ES-30 Q0 1 391 | SemSearch_ES-30 Q0 1 392 | SemSearch_ES-31 Q0 1 393 | SemSearch_ES-31 Q0 1 394 | SemSearch_ES-31 Q0 1 395 | SemSearch_ES-31 Q0 1 396 | SemSearch_ES-31 Q0 2 397 | SemSearch_ES-31 Q0 1 398 | SemSearch_ES-31 Q0 2 399 | SemSearch_ES-31 Q0 1 400 | SemSearch_ES-31 Q0 1 401 | SemSearch_ES-31 Q0 1 402 | SemSearch_ES-31 Q0 2 403 | SemSearch_ES-31 Q0 2 404 | SemSearch_ES-31 Q0 1 405 | SemSearch_ES-31 Q0 2 406 | SemSearch_ES-31 Q0 1 407 | SemSearch_ES-31 Q0 1 408 | SemSearch_ES-31 Q0 1 409 | SemSearch_ES-31 Q0 1 410 | SemSearch_ES-31 Q0 1 411 | SemSearch_ES-31 Q0 1 412 | SemSearch_ES-32 Q0 1 413 | SemSearch_ES-33 Q0 2 414 | SemSearch_ES-33 Q0 2 415 | SemSearch_ES-33 Q0 2 416 | SemSearch_ES-33 Q0 2 417 | SemSearch_ES-33 Q0 2 418 | SemSearch_ES-33 Q0 1 419 | SemSearch_ES-33 Q0 2 420 | SemSearch_ES-33 Q0 2 421 | SemSearch_ES-33 Q0 2 422 | SemSearch_ES-33 Q0 2 423 | SemSearch_ES-33 Q0 2 424 | SemSearch_ES-33 Q0 2 425 | SemSearch_ES-33 Q0 1 426 | SemSearch_ES-33 Q0 1 427 | SemSearch_ES-33 Q0 1 428 | SemSearch_ES-33 Q0 1 429 | SemSearch_ES-33 Q0 2 430 | SemSearch_ES-33 Q0 2 431 | SemSearch_ES-33 Q0 2 432 | SemSearch_ES-33 Q0 1 433 | SemSearch_ES-33 Q0 2 434 | SemSearch_ES-33 Q0 1 435 | SemSearch_ES-33 Q0 2 436 | SemSearch_ES-33 Q0 2 437 | SemSearch_ES-33 Q0 2 438 | SemSearch_ES-33 Q0 1 439 | SemSearch_ES-33 Q0 2 440 | SemSearch_ES-33 Q0 2 441 | SemSearch_ES-33 Q0 2 442 | SemSearch_ES-34 Q0 2 443 | SemSearch_ES-34 Q0 2 444 | SemSearch_ES-34 Q0 2 445 | SemSearch_ES-34 Q0 2 446 | SemSearch_ES-34 Q0 1 447 | SemSearch_ES-34 Q0 1 448 | SemSearch_ES-34 Q0 2 449 | SemSearch_ES-34 Q0 2 450 | SemSearch_ES-34 Q0 2 451 | 
SemSearch_ES-34 Q0 2 452 | SemSearch_ES-34 Q0 1 453 | SemSearch_ES-34 Q0 2 454 | SemSearch_ES-34 Q0 2 455 | SemSearch_ES-35 Q0 1 456 | SemSearch_ES-35 Q0 1 457 | SemSearch_ES-35 Q0 1 458 | SemSearch_ES-35 Q0 1 459 | SemSearch_ES-35 Q0 1 460 | SemSearch_ES-35 Q0 1 461 | SemSearch_ES-36 Q0 1 462 | SemSearch_ES-36 Q0 1 463 | SemSearch_ES-36 Q0 1 464 | SemSearch_ES-36 Q0 2 465 | SemSearch_ES-36 Q0 1 466 | SemSearch_ES-36 Q0 2 467 | SemSearch_ES-36 Q0 2 468 | SemSearch_ES-36 Q0 1 469 | SemSearch_ES-36 Q0 1 470 | SemSearch_ES-36 Q0 3 471 | SemSearch_ES-36 Q0 1 472 | SemSearch_ES-36 Q0 1 473 | SemSearch_ES-36 Q0 1 474 | SemSearch_ES-36 Q0 2 475 | SemSearch_ES-36 Q0 2 476 | SemSearch_ES-36 Q0 2 477 | SemSearch_ES-36 Q0 1 478 | SemSearch_ES-36 Q0 1 479 | SemSearch_ES-36 Q0 1 480 | SemSearch_ES-36 Q0 1 481 | SemSearch_ES-37 Q0 2 482 | SemSearch_ES-37 Q0 2 483 | SemSearch_ES-37 Q0 1 484 | SemSearch_ES-37 Q0 2 485 | SemSearch_ES-37 Q0 2 486 | SemSearch_ES-37 Q0 2 487 | SemSearch_ES-37 Q0 1 488 | SemSearch_ES-37 Q0 2 489 | SemSearch_ES-37 Q0 2 490 | SemSearch_ES-37 Q0 2 491 | SemSearch_ES-37 Q0 2 492 | SemSearch_ES-37 Q0 2 493 | SemSearch_ES-37 Q0 1 494 | SemSearch_ES-37 Q0 2 495 | SemSearch_ES-38 Q0 1 496 | SemSearch_ES-38 Q0 2 497 | SemSearch_ES-38 Q0 1 498 | SemSearch_ES-38 Q0 2 499 | SemSearch_ES-38 Q0 1 500 | SemSearch_ES-38 Q0 2 501 | SemSearch_ES-38 Q0 1 502 | SemSearch_ES-38 Q0 2 503 | SemSearch_ES-38 Q0 2 504 | SemSearch_ES-38 Q0 1 505 | SemSearch_ES-38 Q0 2 506 | SemSearch_ES-39 Q0 1 507 | SemSearch_ES-39 Q0 1 508 | SemSearch_ES-39 Q0 1 509 | SemSearch_ES-39 Q0 2 510 | SemSearch_ES-39 Q0 2 511 | SemSearch_ES-39 Q0 1 512 | SemSearch_ES-39 Q0 2 513 | SemSearch_ES-39 Q0 2 514 | SemSearch_ES-4 Q0 2 515 | SemSearch_ES-4 Q0 2 516 | SemSearch_ES-4 Q0 2 517 | SemSearch_ES-4 Q0 2 518 | SemSearch_ES-4 Q0 2 519 | SemSearch_ES-4 Q0 2 520 | SemSearch_ES-4 Q0 2 521 | SemSearch_ES-4 Q0 2 522 | SemSearch_ES-4 Q0 2 523 | SemSearch_ES-4 Q0 2 524 | SemSearch_ES-4 Q0 2 525 | 
SemSearch_ES-4 Q0 2 526 | SemSearch_ES-4 Q0 2 527 | SemSearch_ES-4 Q0 2 528 | SemSearch_ES-4 Q0 1 529 | SemSearch_ES-4 Q0 2 530 | SemSearch_ES-40 Q0 2 531 | SemSearch_ES-40 Q0 1 532 | SemSearch_ES-40 Q0 2 533 | SemSearch_ES-40 Q0 1 534 | SemSearch_ES-41 Q0 2 535 | SemSearch_ES-41 Q0 1 536 | SemSearch_ES-41 Q0 2 537 | SemSearch_ES-41 Q0 1 538 | SemSearch_ES-41 Q0 2 539 | SemSearch_ES-41 Q0 2 540 | SemSearch_ES-41 Q0 1 541 | SemSearch_ES-41 Q0 1 542 | SemSearch_ES-41 Q0 1 543 | SemSearch_ES-41 Q0 2 544 | SemSearch_ES-41 Q0 2 545 | SemSearch_ES-41 Q0 2 546 | SemSearch_ES-41 Q0 2 547 | SemSearch_ES-41 Q0 1 548 | SemSearch_ES-41 Q0 2 549 | SemSearch_ES-41 Q0 2 550 | SemSearch_ES-41 Q0 1 551 | SemSearch_ES-41 Q0 2 552 | SemSearch_ES-42 Q0 2 553 | SemSearch_ES-42 Q0 2 554 | SemSearch_ES-42 Q0 2 555 | SemSearch_ES-42 Q0 2 556 | SemSearch_ES-42 Q0 1 557 | SemSearch_ES-42 Q0 2 558 | SemSearch_ES-42 Q0 1 559 | SemSearch_ES-42 Q0 1 560 | SemSearch_ES-42 Q0 2 561 | SemSearch_ES-42 Q0 2 562 | SemSearch_ES-42 Q0 2 563 | SemSearch_ES-42 Q0 2 564 | SemSearch_ES-42 Q0 1 565 | SemSearch_ES-45 Q0 1 566 | SemSearch_ES-45 Q0 2 567 | SemSearch_ES-45 Q0 2 568 | SemSearch_ES-45 Q0 2 569 | SemSearch_ES-45 Q0 2 570 | SemSearch_ES-45 Q0 1 571 | SemSearch_ES-45 Q0 1 572 | SemSearch_ES-45 Q0 2 573 | SemSearch_ES-45 Q0 2 574 | SemSearch_ES-45 Q0 1 575 | SemSearch_ES-47 Q0 1 576 | SemSearch_ES-47 Q0 1 577 | SemSearch_ES-47 Q0 2 578 | SemSearch_ES-47 Q0 2 579 | SemSearch_ES-47 Q0 1 580 | SemSearch_ES-47 Q0 1 581 | SemSearch_ES-47 Q0 1 582 | SemSearch_ES-47 Q0 2 583 | SemSearch_ES-47 Q0 2 584 | SemSearch_ES-47 Q0 1 585 | SemSearch_ES-47 Q0 2 586 | SemSearch_ES-47 Q0 2 587 | SemSearch_ES-47 Q0 1 588 | SemSearch_ES-47 Q0 2 589 | SemSearch_ES-47 Q0 2 590 | SemSearch_ES-47 Q0 1 591 | SemSearch_ES-48 Q0 1 592 | SemSearch_ES-48 Q0 1 593 | SemSearch_ES-48 Q0 1 594 | SemSearch_ES-48 Q0 1 595 | SemSearch_ES-48 Q0 1 596 | SemSearch_ES-48 Q0 1 597 | SemSearch_ES-49 Q0 1 598 | SemSearch_ES-49 Q0 2 599 | 
SemSearch_ES-49 Q0 1 600 | SemSearch_ES-49 Q0 1 601 | SemSearch_ES-49 Q0 2 602 | SemSearch_ES-5 Q0 1 603 | SemSearch_ES-5 Q0 2 604 | SemSearch_ES-5 Q0 1 605 | SemSearch_ES-5 Q0 2 606 | SemSearch_ES-5 Q0 1 607 | SemSearch_ES-5 Q0 2 608 | SemSearch_ES-5 Q0 2 609 | SemSearch_ES-5 Q0 2 610 | SemSearch_ES-5 Q0 2 611 | SemSearch_ES-5 Q0 1 612 | SemSearch_ES-5 Q0 2 613 | SemSearch_ES-5 Q0 2 614 | SemSearch_ES-5 Q0 2 615 | SemSearch_ES-5 Q0 1 616 | SemSearch_ES-5 Q0 2 617 | SemSearch_ES-5 Q0 1 618 | SemSearch_ES-5 Q0 2 619 | SemSearch_ES-5 Q0 2 620 | SemSearch_ES-5 Q0 2 621 | SemSearch_ES-50 Q0 1 622 | SemSearch_ES-51 Q0 1 623 | SemSearch_ES-51 Q0 1 624 | SemSearch_ES-51 Q0 1 625 | SemSearch_ES-52 Q0 2 626 | SemSearch_ES-52 Q0 2 627 | SemSearch_ES-52 Q0 2 628 | SemSearch_ES-52 Q0 1 629 | SemSearch_ES-52 Q0 2 630 | SemSearch_ES-52 Q0 1 631 | SemSearch_ES-52 Q0 1 632 | SemSearch_ES-52 Q0 1 633 | SemSearch_ES-52 Q0 2 634 | SemSearch_ES-52 Q0 2 635 | SemSearch_ES-52 Q0 2 636 | SemSearch_ES-52 Q0 2 637 | SemSearch_ES-52 Q0 1 638 | SemSearch_ES-52 Q0 1 639 | SemSearch_ES-52 Q0 1 640 | SemSearch_ES-52 Q0 2 641 | SemSearch_ES-52 Q0 1 642 | SemSearch_ES-52 Q0 2 643 | SemSearch_ES-52 Q0 2 644 | SemSearch_ES-52 Q0 1 645 | SemSearch_ES-52 Q0 1 646 | SemSearch_ES-52 Q0 2 647 | SemSearch_ES-52 Q0 1 648 | SemSearch_ES-52 Q0 2 649 | SemSearch_ES-53 Q0 2 650 | SemSearch_ES-53 Q0 2 651 | SemSearch_ES-53 Q0 1 652 | SemSearch_ES-53 Q0 2 653 | SemSearch_ES-53 Q0 2 654 | SemSearch_ES-53 Q0 1 655 | SemSearch_ES-53 Q0 1 656 | SemSearch_ES-53 Q0 1 657 | SemSearch_ES-53 Q0 1 658 | SemSearch_ES-53 Q0 1 659 | SemSearch_ES-53 Q0 1 660 | SemSearch_ES-53 Q0 2 661 | SemSearch_ES-53 Q0 1 662 | SemSearch_ES-53 Q0 1 663 | SemSearch_ES-54 Q0 2 664 | SemSearch_ES-54 Q0 1 665 | SemSearch_ES-54 Q0 2 666 | SemSearch_ES-54 Q0 1 667 | SemSearch_ES-54 Q0 1 668 | SemSearch_ES-54 Q0 2 669 | SemSearch_ES-54 Q0 2 670 | SemSearch_ES-54 Q0 1 671 | SemSearch_ES-54 Q0 2 672 | SemSearch_ES-54 Q0 2 673 | SemSearch_ES-54 Q0 2 
674 | SemSearch_ES-54 Q0 1 675 | SemSearch_ES-55 Q0 2 676 | SemSearch_ES-55 Q0 1 677 | SemSearch_ES-55 Q0 1 678 | SemSearch_ES-55 Q0 2 679 | SemSearch_ES-56 Q0 2 680 | SemSearch_ES-56 Q0 2 681 | SemSearch_ES-56 Q0 2 682 | SemSearch_ES-56 Q0 2 683 | SemSearch_ES-56 Q0 2 684 | SemSearch_ES-56 Q0 2 685 | SemSearch_ES-56 Q0 2 686 | SemSearch_ES-56 Q0 2 687 | SemSearch_ES-56 Q0 2 688 | SemSearch_ES-56 Q0 2 689 | SemSearch_ES-56 Q0 2 690 | SemSearch_ES-56 Q0 2 691 | SemSearch_ES-56 Q0 2 692 | SemSearch_ES-56 Q0 1 693 | SemSearch_ES-56 Q0 1 694 | SemSearch_ES-56 Q0 2 695 | SemSearch_ES-56 Q0 2 696 | SemSearch_ES-56 Q0 2 697 | SemSearch_ES-56 Q0 2 698 | SemSearch_ES-56 Q0 2 699 | SemSearch_ES-57 Q0 1 700 | SemSearch_ES-57 Q0 1 701 | SemSearch_ES-57 Q0 1 702 | SemSearch_ES-57 Q0 2 703 | SemSearch_ES-57 Q0 1 704 | SemSearch_ES-57 Q0 2 705 | SemSearch_ES-57 Q0 2 706 | SemSearch_ES-57 Q0 1 707 | SemSearch_ES-57 Q0 2 708 | SemSearch_ES-57 Q0 1 709 | SemSearch_ES-57 Q0 2 710 | SemSearch_ES-57 Q0 2 711 | SemSearch_ES-57 Q0 1 712 | SemSearch_ES-57 Q0 1 713 | SemSearch_ES-57 Q0 1 714 | SemSearch_ES-57 Q0 1 715 | SemSearch_ES-57 Q0 2 716 | SemSearch_ES-58 Q0 1 717 | SemSearch_ES-58 Q0 2 718 | SemSearch_ES-58 Q0 1 719 | SemSearch_ES-58 Q0 1 720 | SemSearch_ES-58 Q0 1 721 | SemSearch_ES-58 Q0 1 722 | SemSearch_ES-58 Q0 1 723 | SemSearch_ES-58 Q0 1 724 | SemSearch_ES-58 Q0 1 725 | SemSearch_ES-58 Q0 1 726 | SemSearch_ES-59 Q0 1 727 | SemSearch_ES-59 Q0 1 728 | SemSearch_ES-59 Q0 1 729 | SemSearch_ES-59 Q0 1 730 | SemSearch_ES-6 Q0 1 731 | SemSearch_ES-6 Q0 2 732 | SemSearch_ES-6 Q0 1 733 | SemSearch_ES-6 Q0 1 734 | SemSearch_ES-6 Q0 1 735 | SemSearch_ES-6 Q0 2 736 | SemSearch_ES-6 Q0 1 737 | SemSearch_ES-6 Q0 2 738 | SemSearch_ES-60 Q0 2 739 | SemSearch_ES-60 Q0 1 740 | SemSearch_ES-60 Q0 2 741 | SemSearch_ES-60 Q0 2 742 | SemSearch_ES-60 Q0 1 743 | SemSearch_ES-61 Q0 1 744 | SemSearch_ES-61 Q0 1 745 | SemSearch_ES-63 Q0 1 746 | SemSearch_ES-63 Q0 1 747 | SemSearch_ES-63 Q0 1 748 | 
SemSearch_ES-63 Q0 2 749 | SemSearch_ES-63 Q0 2 750 | SemSearch_ES-64 Q0 1 751 | SemSearch_ES-64 Q0 1 752 | SemSearch_ES-64 Q0 1 753 | SemSearch_ES-65 Q0 1 754 | SemSearch_ES-65 Q0 1 755 | SemSearch_ES-65 Q0 1 756 | SemSearch_ES-65 Q0 1 757 | SemSearch_ES-65 Q0 1 758 | SemSearch_ES-65 Q0 1 759 | SemSearch_ES-65 Q0 1 760 | SemSearch_ES-65 Q0 1 761 | SemSearch_ES-65 Q0 1 762 | SemSearch_ES-65 Q0 2 763 | SemSearch_ES-65 Q0 1 764 | SemSearch_ES-65 Q0 1 765 | SemSearch_ES-65 Q0 1 766 | SemSearch_ES-66 Q0 1 767 | SemSearch_ES-66 Q0 2 768 | SemSearch_ES-66 Q0 2 769 | SemSearch_ES-67 Q0 1 770 | SemSearch_ES-67 Q0 1 771 | SemSearch_ES-67 Q0 1 772 | SemSearch_ES-67 Q0 1 773 | SemSearch_ES-67 Q0 1 774 | SemSearch_ES-68 Q0 2 775 | SemSearch_ES-68 Q0 1 776 | SemSearch_ES-68 Q0 1 777 | SemSearch_ES-68 Q0 1 778 | SemSearch_ES-68 Q0 2 779 | SemSearch_ES-68 Q0 2 780 | SemSearch_ES-68 Q0 1 781 | SemSearch_ES-68 Q0 1 782 | SemSearch_ES-68 Q0 2 783 | SemSearch_ES-68 Q0 1 784 | SemSearch_ES-68 Q0 1 785 | SemSearch_ES-68 Q0 1 786 | SemSearch_ES-69 Q0 1 787 | SemSearch_ES-69 Q0 1 788 | SemSearch_ES-69 Q0 1 789 | SemSearch_ES-69 Q0 1 790 | SemSearch_ES-7 Q0 1 791 | SemSearch_ES-7 Q0 1 792 | SemSearch_ES-7 Q0 1 793 | SemSearch_ES-70 Q0 1 794 | SemSearch_ES-70 Q0 1 795 | SemSearch_ES-70 Q0 1 796 | SemSearch_ES-70 Q0 1 797 | SemSearch_ES-70 Q0 1 798 | SemSearch_ES-71 Q0 2 799 | SemSearch_ES-71 Q0 1 800 | SemSearch_ES-71 Q0 1 801 | SemSearch_ES-71 Q0 1 802 | SemSearch_ES-71 Q0 1 803 | SemSearch_ES-71 Q0 2 804 | SemSearch_ES-71 Q0 1 805 | SemSearch_ES-71 Q0 1 806 | SemSearch_ES-71 Q0 2 807 | SemSearch_ES-71 Q0 2 808 | SemSearch_ES-71 Q0 2 809 | SemSearch_ES-71 Q0 2 810 | SemSearch_ES-71 Q0 2 811 | SemSearch_ES-71 Q0 1 812 | SemSearch_ES-71 Q0 2 813 | SemSearch_ES-71 Q0 1 814 | SemSearch_ES-71 Q0 1 815 | SemSearch_ES-71 Q0 2 816 | SemSearch_ES-71 Q0 2 817 | SemSearch_ES-72 Q0 2 818 | SemSearch_ES-72 Q0 1 819 | SemSearch_ES-72 Q0 2 820 | SemSearch_ES-72 Q0 1 821 | SemSearch_ES-72 Q0 2 822 | 
SemSearch_ES-72 Q0 1 823 | SemSearch_ES-72 Q0 1 824 | SemSearch_ES-72 Q0 1 825 | SemSearch_ES-73 Q0 1 826 | SemSearch_ES-73 Q0 1 827 | SemSearch_ES-73 Q0 2 828 | SemSearch_ES-74 Q0 1 829 | SemSearch_ES-74 Q0 1 830 | SemSearch_ES-74 Q0 1 831 | SemSearch_ES-74 Q0 1 832 | SemSearch_ES-74 Q0 1 833 | SemSearch_ES-74 Q0 1 834 | SemSearch_ES-74 Q0 1 835 | SemSearch_ES-74 Q0 2 836 | SemSearch_ES-74 Q0 2 837 | SemSearch_ES-74 Q0 2 838 | SemSearch_ES-74 Q0 1 839 | SemSearch_ES-74 Q0 1 840 | SemSearch_ES-74 Q0 1 841 | SemSearch_ES-74 Q0 1 842 | SemSearch_ES-74 Q0 1 843 | SemSearch_ES-74 Q0 1 844 | SemSearch_ES-74 Q0 1 845 | SemSearch_ES-74 Q0 1 846 | SemSearch_ES-74 Q0 2 847 | SemSearch_ES-75 Q0 1 848 | SemSearch_ES-76 Q0 1 849 | SemSearch_ES-76 Q0 2 850 | SemSearch_ES-76 Q0 1 851 | SemSearch_ES-76 Q0 1 852 | SemSearch_ES-76 Q0 1 853 | SemSearch_ES-76 Q0 1 854 | SemSearch_ES-76 Q0 1 855 | SemSearch_ES-76 Q0 1 856 | SemSearch_ES-76 Q0 2 857 | SemSearch_ES-76 Q0 2 858 | SemSearch_ES-76 Q0 2 859 | SemSearch_ES-76 Q0 2 860 | SemSearch_ES-76 Q0 1 861 | SemSearch_ES-76 Q0 2 862 | SemSearch_ES-76 Q0 1 863 | SemSearch_ES-76 Q0 2 864 | SemSearch_ES-76 Q0 1 865 | SemSearch_ES-76 Q0 2 866 | SemSearch_ES-76 Q0 2 867 | SemSearch_ES-76 Q0 1 868 | SemSearch_ES-77 Q0 1 869 | SemSearch_ES-77 Q0 1 870 | SemSearch_ES-77 Q0 1 871 | SemSearch_ES-77 Q0 1 872 | SemSearch_ES-77 Q0 2 873 | SemSearch_ES-77 Q0 1 874 | SemSearch_ES-78 Q0 1 875 | SemSearch_ES-78 Q0 2 876 | SemSearch_ES-78 Q0 2 877 | SemSearch_ES-78 Q0 2 878 | SemSearch_ES-78 Q0 2 879 | SemSearch_ES-78 Q0 2 880 | SemSearch_ES-78 Q0 2 881 | SemSearch_ES-78 Q0 1 882 | SemSearch_ES-78 Q0 2 883 | SemSearch_ES-78 Q0 2 884 | SemSearch_ES-79 Q0 2 885 | SemSearch_ES-79 Q0 1 886 | SemSearch_ES-79 Q0 1 887 | SemSearch_ES-8 Q0 1 888 | SemSearch_ES-80 Q0 1 889 | SemSearch_ES-80 Q0 1 890 | SemSearch_ES-80 Q0 2 891 | SemSearch_ES-80 Q0 2 892 | SemSearch_ES-80 Q0 1 893 | SemSearch_ES-80 Q0 2 894 | SemSearch_ES-80 Q0 2 895 | SemSearch_ES-80 Q0 1 896 | 
SemSearch_ES-80 Q0 1 897 | SemSearch_ES-80 Q0 1 898 | SemSearch_ES-80 Q0 1 899 | SemSearch_ES-80 Q0 1 900 | SemSearch_ES-80 Q0 1 901 | SemSearch_ES-80 Q0 2 902 | SemSearch_ES-80 Q0 1 903 | SemSearch_ES-81 Q0 1 904 | SemSearch_ES-81 Q0 1 905 | SemSearch_ES-81 Q0 2 906 | SemSearch_ES-81 Q0 2 907 | SemSearch_ES-81 Q0 2 908 | SemSearch_ES-81 Q0 2 909 | SemSearch_ES-81 Q0 2 910 | SemSearch_ES-81 Q0 1 911 | SemSearch_ES-81 Q0 1 912 | SemSearch_ES-81 Q0 1 913 | SemSearch_ES-81 Q0 1 914 | SemSearch_ES-81 Q0 2 915 | SemSearch_ES-81 Q0 2 916 | SemSearch_ES-81 Q0 1 917 | SemSearch_ES-81 Q0 1 918 | SemSearch_ES-81 Q0 1 919 | SemSearch_ES-82 Q0 2 920 | SemSearch_ES-82 Q0 2 921 | SemSearch_ES-82 Q0 1 922 | SemSearch_ES-82 Q0 1 923 | SemSearch_ES-82 Q0 1 924 | SemSearch_ES-82 Q0 1 925 | SemSearch_ES-82 Q0 1 926 | SemSearch_ES-82 Q0 1 927 | SemSearch_ES-82 Q0 1 928 | SemSearch_ES-82 Q0 1 929 | SemSearch_ES-82 Q0 1 930 | SemSearch_ES-82 Q0 2 931 | SemSearch_ES-82 Q0 1 932 | SemSearch_ES-82 Q0 1 933 | SemSearch_ES-82 Q0 1 934 | SemSearch_ES-82 Q0 1 935 | SemSearch_ES-82 Q0 1 936 | SemSearch_ES-82 Q0 1 937 | SemSearch_ES-82 Q0 2 938 | SemSearch_ES-82 Q0 1 939 | SemSearch_ES-83 Q0 1 940 | SemSearch_ES-83 Q0 2 941 | SemSearch_ES-83 Q0 1 942 | SemSearch_ES-83 Q0 1 943 | SemSearch_ES-83 Q0 1 944 | SemSearch_ES-83 Q0 1 945 | SemSearch_ES-83 Q0 1 946 | SemSearch_ES-83 Q0 1 947 | SemSearch_ES-84 Q0 1 948 | SemSearch_ES-84 Q0 2 949 | SemSearch_ES-84 Q0 1 950 | SemSearch_ES-85 Q0 1 951 | SemSearch_ES-85 Q0 1 952 | SemSearch_ES-85 Q0 1 953 | SemSearch_ES-86 Q0 2 954 | SemSearch_ES-86 Q0 1 955 | SemSearch_ES-86 Q0 2 956 | SemSearch_ES-86 Q0 2 957 | SemSearch_ES-86 Q0 1 958 | SemSearch_ES-86 Q0 1 959 | SemSearch_ES-86 Q0 1 960 | SemSearch_ES-86 Q0 1 961 | SemSearch_ES-86 Q0 1 962 | SemSearch_ES-87 Q0 1 963 | SemSearch_ES-88 Q0 1 964 | SemSearch_ES-88 Q0 2 965 | SemSearch_ES-88 Q0 1 966 | SemSearch_ES-88 Q0 1 967 | SemSearch_ES-88 Q0 1 968 | SemSearch_ES-88 Q0 1 969 | SemSearch_ES-88 Q0 2 970 | 
SemSearch_ES-88 Q0 2 971 | SemSearch_ES-88 Q0 1 972 | SemSearch_ES-88 Q0 1 973 | SemSearch_ES-88 Q0 1 974 | SemSearch_ES-88 Q0 1 975 | SemSearch_ES-88 Q0 2 976 | SemSearch_ES-88 Q0 1 977 | SemSearch_ES-88 Q0 1 978 | SemSearch_ES-88 Q0 1 979 | SemSearch_ES-88 Q0 1 980 | SemSearch_ES-88 Q0 1 981 | SemSearch_ES-88 Q0 1 982 | SemSearch_ES-88 Q0 1 983 | SemSearch_ES-89 Q0 1 984 | SemSearch_ES-89 Q0 2 985 | SemSearch_ES-89 Q0 1 986 | SemSearch_ES-89 Q0 2 987 | SemSearch_ES-89 Q0 1 988 | SemSearch_ES-89 Q0 2 989 | SemSearch_ES-89 Q0 1 990 | SemSearch_ES-89 Q0 2 991 | SemSearch_ES-89 Q0 1 992 | SemSearch_ES-89 Q0 1 993 | SemSearch_ES-89 Q0 2 994 | SemSearch_ES-89 Q0 1 995 | SemSearch_ES-89 Q0 1 996 | SemSearch_ES-89 Q0 2 997 | SemSearch_ES-89 Q0 1 998 | SemSearch_ES-9 Q0 1 999 | SemSearch_ES-9 Q0 1 1000 | SemSearch_ES-9 Q0 1 1001 | SemSearch_ES-9 Q0 1 1002 | SemSearch_ES-90 Q0 1 1003 | SemSearch_ES-90 Q0 2 1004 | SemSearch_ES-90 Q0 1 1005 | SemSearch_ES-90 Q0 1 1006 | SemSearch_ES-90 Q0 2 1007 | SemSearch_ES-91 Q0 1 1008 | SemSearch_ES-91 Q0 1 1009 | SemSearch_ES-91 Q0 1 1010 | SemSearch_ES-91 Q0 1 1011 | SemSearch_ES-91 Q0 1 1012 | SemSearch_ES-91 Q0 1 1013 | SemSearch_ES-91 Q0 2 1014 | SemSearch_ES-91 Q0 2 1015 | SemSearch_ES-91 Q0 1 1016 | SemSearch_ES-91 Q0 1 1017 | SemSearch_ES-91 Q0 2 1018 | SemSearch_ES-91 Q0 2 1019 | SemSearch_ES-91 Q0 1 1020 | SemSearch_ES-91 Q0 2 1021 | SemSearch_ES-91 Q0 1 1022 | SemSearch_ES-91 Q0 1 1023 | SemSearch_ES-91 Q0 1 1024 | SemSearch_ES-91 Q0 1 1025 | SemSearch_ES-91 Q0 2 1026 | SemSearch_ES-91 Q0 2 1027 | SemSearch_ES-93 Q0 2 1028 | SemSearch_ES-93 Q0 1 1029 | SemSearch_ES-93 Q0 1 1030 | SemSearch_ES-94 Q0 1 1031 | SemSearch_ES-94 Q0 2 1032 | SemSearch_ES-94 Q0 1 1033 | SemSearch_ES-94 Q0 1 1034 | SemSearch_ES-94 Q0 1 1035 | SemSearch_ES-95 Q0 1 1036 | SemSearch_ES-95 Q0 1 1037 | SemSearch_ES-95 Q0 1 1038 | SemSearch_ES-95 Q0 1 1039 | SemSearch_ES-95 Q0 2 1040 | SemSearch_ES-95 Q0 1 1041 | SemSearch_ES-95 Q0 1 1042 | SemSearch_ES-95 
Q0 1 1043 | SemSearch_ES-95 Q0 1 1044 | SemSearch_ES-95 Q0 1 1045 | SemSearch_ES-95 Q0 1 1046 | SemSearch_ES-95 Q0 1 1047 | SemSearch_ES-95 Q0 1 1048 | SemSearch_ES-95 Q0 1 1049 | SemSearch_ES-95 Q0 1 1050 | SemSearch_ES-95 Q0 1 1051 | SemSearch_ES-95 Q0 1 1052 | SemSearch_ES-95 Q0 1 1053 | SemSearch_ES-95 Q0 1 1054 | SemSearch_ES-95 Q0 1 1055 | SemSearch_ES-95 Q0 1 1056 | SemSearch_ES-95 Q0 1 1057 | SemSearch_ES-95 Q0 1 1058 | SemSearch_ES-95 Q0 1 1059 | SemSearch_ES-95 Q0 1 1060 | SemSearch_ES-96 Q0 1 1061 | SemSearch_ES-96 Q0 1 1062 | SemSearch_ES-96 Q0 1 1063 | SemSearch_ES-97 Q0 2 1064 | SemSearch_ES-97 Q0 1 1065 | SemSearch_ES-97 Q0 1 1066 | SemSearch_ES-97 Q0 1 1067 | SemSearch_ES-97 Q0 1 1068 | SemSearch_ES-97 Q0 1 1069 | SemSearch_ES-97 Q0 1 1070 | SemSearch_ES-97 Q0 1 1071 | SemSearch_ES-97 Q0 1 1072 | SemSearch_ES-97 Q0 1 1073 | SemSearch_ES-97 Q0 1 1074 | SemSearch_ES-97 Q0 1 1075 | SemSearch_ES-97 Q0 1 1076 | SemSearch_ES-97 Q0 2 1077 | SemSearch_ES-97 Q0 1 1078 | SemSearch_ES-97 Q0 1 1079 | SemSearch_ES-98 Q0 1 1080 | SemSearch_ES-98 Q0 1 1081 | SemSearch_ES-98 Q0 1 1082 | SemSearch_ES-98 Q0 1 1083 | SemSearch_ES-98 Q0 1 1084 | SemSearch_ES-98 Q0 1 1085 | SemSearch_ES-98 Q0 1 1086 | SemSearch_ES-98 Q0 1 1087 | SemSearch_ES-98 Q0 1 1088 | SemSearch_ES-98 Q0 1 1089 | SemSearch_ES-98 Q0 1 1090 | SemSearch_ES-98 Q0 1 1091 | SemSearch_ES-98 Q0 1 1092 | SemSearch_ES-98 Q0 2 1093 | SemSearch_ES-98 Q0 1 1094 | SemSearch_ES-98 Q0 1 1095 | SemSearch_ES-98 Q0 1 1096 | SemSearch_ES-98 Q0 1 1097 | SemSearch_ES-98 Q0 1 1098 | SemSearch_ES-98 Q0 1 1099 | SemSearch_ES-98 Q0 1 1100 | SemSearch_ES-98 Q0 1 1101 | SemSearch_ES-98 Q0 1 1102 | SemSearch_ES-98 Q0 2 1103 | SemSearch_ES-98 Q0 1 1104 | SemSearch_ES-98 Q0 1 1105 | SemSearch_ES-98 Q0 1 1106 | SemSearch_ES-99 Q0 2 1107 | SemSearch_ES-99 Q0 1 1108 | SemSearch_ES-99 Q0 2 1109 | SemSearch_ES-99 Q0 1 1110 | SemSearch_ES-99 Q0 1 1111 | SemSearch_ES-99 Q0 1 1112 | SemSearch_ES-99 Q0 1 1113 | SemSearch_ES-99 Q0 1 1114 | 
SemSearch_ES-99 Q0 1 1115 | SemSearch_ES-99 Q0 2 1116 | -------------------------------------------------------------------------------- /qrels/queries.txt: -------------------------------------------------------------------------------- 1 | INEX_LD-20120111 vietnam war movie 2 | INEX_LD-20120112 vietnam war facts 3 | INEX_LD-20120121 vietnam food recipes 4 | INEX_LD-20120122 vietnamese food blog 5 | INEX_LD-20120131 vietnam travel national park 6 | INEX_LD-20120132 vietnam travel airports 7 | INEX_LD-20120211 guitar chord tuning 8 | INEX_LD-20120212 guitar chord minor 9 | INEX_LD-20120221 guitar classical flamenco 10 | INEX_LD-20120222 guitar classical bach 11 | INEX_LD-20120231 guitar origin Russia 12 | INEX_LD-20120232 guitar origin blues 13 | INEX_LD-20120311 tango culture movies 14 | INEX_LD-20120312 tango culture countries 15 | INEX_LD-20120321 tango music composers 16 | INEX_LD-20120322 tango music instruments 17 | INEX_LD-20120331 tango dance styles 18 | INEX_LD-20120332 tango dance history 19 | INEX_LD-20120411 bicycle sport races 20 | INEX_LD-20120412 bicycle sport disciplines 21 | INEX_LD-20120421 bicycle holiday towns 22 | INEX_LD-20120422 bicycle holiday nature 23 | INEX_LD-20120431 bicycle benefits health 24 | INEX_LD-20120432 bicycle benefits environment 25 | INEX_LD-20120511 female rock singers 26 | INEX_LD-20120512 south korean girl groups 27 | INEX_LD-20120521 electronic music genres 28 | INEX_LD-20120522 digital music notation formats 29 | INEX_LD-20120531 music conferences 30 | INEX_LD-20120532 intellectual property rights lobby 31 | INEX_LD-2009022 Szechwan dish food cuisine 32 | INEX_LD-2009039 roman architecture 33 | INEX_LD-2009053 finland car industry manufacturer saab sisu 34 | INEX_LD-2009061 france second world war normandy 35 | INEX_LD-2009062 social network group selection 36 | INEX_LD-2009063 D-Day normandy invasion 37 | INEX_LD-2009074 web ranking scoring algorithm 38 | INEX_LD-2009096 Eiffel 39 | INEX_LD-2009111 europe solar power 
facility 40 | INEX_LD-2009115 virtual museums 41 | INEX_LD-2010004 Indian food 42 | INEX_LD-2010014 composer museum 43 | INEX_LD-2010019 gallo roman architecture in paris 44 | INEX_LD-2010020 electricity source in France 45 | INEX_LD-2010037 social network API 46 | INEX_LD-2010043 List of films from the surrealist category 47 | INEX_LD-2010057 Einstein Relativity theory 48 | INEX_LD-2010069 summer flowers 49 | INEX_LD-2010100 house concrete wood 50 | INEX_LD-2010106 organic food advantages disadvantages 51 | INEX_LD-2012301 Niagara falls origin lake 52 | INEX_LD-2012303 Valley fever fungal infection San Joaquin 53 | INEX_LD-2012305 North Dakota's lowest river of another colour 54 | INEX_LD-2012307 July, 1850 president died Millard Fillmore sworn following day 55 | INEX_LD-2012309 residents small island city-state Malay Peninsula Chinese 56 | INEX_LD-2012311 John Lennon Yoko Ono album Starting Over 57 | INEX_LD-2012313 John Turturro 1991 Coen Brothers film 58 | INEX_LD-2012315 Baguio Quezon City Manila official independence 1945 59 | INEX_LD-2012317 daggeroso inclined to use a dagger novel Sons and Lovers 60 | INEX_LD-2012318 Directed Bela Glen Glenda Bride Monster Plan 9 Outer Space 61 | INEX_LD-2012319 1994 short story collection Alice Munro is Open 62 | INEX_LD-2012321 Asian port state-city Sir Stamford Raffles 63 | INEX_LD-2012323 Large glaciers island nation Langjokull Hofsjokull Vatnajokull 64 | INEX_LD-2012325 successor James G. Blaine studied law 65 | INEX_LD-2012327 Beloved author African-American Nobel Prize Literature 66 | INEX_LD-2012329 Sweden Iceland currency 67 | INEX_LD-2012331 Seoul Korea river name ethnic group China 68 | INEX_LD-2012333 Prime minister Canada nicknamed Silver-Tongued Laurier longest unbroken term 69 | INEX_LD-2012335 U.S. 
president authorise nuclear weapons against Japan 70 | INEX_LD-2012336 1906 territory Papua island Australian 71 | INEX_LD-2012337 Texas city Baylor University tornado 1953 72 | INEX_LD-2012339 Nelson Mandela John Dube 73 | INEX_LD-2012341 1997 Houston airport president 74 | INEX_LD-2012343 The Heart of a Woman poet's autobiography 75 | INEX_LD-2012345 Kennedy assassination governor of Texas seriously injured 76 | INEX_LD-2012347 seat Florida country Dade 77 | INEX_LD-2012349 Alexander Nevsky Cathedral Bulgarian city liberation Turks 78 | INEX_LD-2012351 Indian Cuisine dish rice dhal vegetables roti papad 79 | INEX_LD-2012353 country German language 80 | INEX_LD-2012354 greatest guitarist 81 | INEX_LD-2012355 England football player highest paid 82 | INEX_LD-2012357 prima ballerina Bolshoi Theatre 1960 83 | INEX_LD-2012359 Bob Ricker Executive Director the latest front group for the anti-gun movement 84 | INEX_LD-2012361 most famous award winning actor singer 85 | INEX_LD-2012363 American twins famous American professional tennis double players 86 | INEX_LD-2012365 mathematician computer scientist MIT's six inaugural MacVicar Faculty Fellows 87 | INEX_LD-2012367 invented telescope 88 | INEX_LD-2012369 most famous civic-military airports 89 | INEX_LD-2012371 most beautiful railway stations world cities located 90 | INEX_LD-2012372 famous historical battlefields opponents fought 91 | INEX_LD-2012373 birds cannot fly 92 | INEX_LD-2012375 animals lay eggs mammals 93 | INEX_LD-2012377 allegedly caused World War I 94 | INEX_LD-2012379 pairs cities same language same longitude different countries 95 | INEX_LD-2012381 movie directors directed a block buster 96 | INEX_LD-2012383 famous computer scientists disappeared at sea 97 | INEX_LD-2012385 famous politicians vegetarians 98 | INEX_LD-2012387 famous river confluence dam constructed 99 | INEX_LD-2012389 frequently visited sharks gulf Indian Ocean 100 | INEX_LD-2012390 baseball player most homeruns national league 101 | 
INEX_XER-60 olympic classes dinghy sailing 102 | INEX_XER-62 Neil Gaiman novels 103 | INEX_XER-63 Hugo awarded best novels 104 | INEX_XER-64 Alan Moore graphic novels adapted to film 105 | INEX_XER-65 Pacific navigators Australia explorers 106 | INEX_XER-67 Ferris and observation wheels 107 | INEX_XER-72 films shot in Venice 108 | INEX_XER-73 magazines about indie-music 109 | INEX_XER-74 circus mammals 110 | INEX_XER-79 Works by Charles Rennie Mackintosh 111 | INEX_XER-81 Movies about English hooligans 112 | INEX_XER-86 List of countries in World War Two 113 | INEX_XER-87 Axis powers of World War II 114 | INEX_XER-88 Nordic authors who are known for children's literature 115 | INEX_XER-91 Paul Auster novels 116 | INEX_XER-94 Hybrid cars sold in Europe 117 | INEX_XER-95 Tom Hanks movies where he plays a leading role. 118 | INEX_XER-96 Pure object-oriented programing languages 119 | INEX_XER-97 Compilers that can compile both C and C++ 120 | INEX_XER-98 Makers of lawn tennis rackets 121 | INEX_XER-99 Computer systems that have a recursive acronym for the name 122 | INEX_XER-100 Operating systems to which Steve Jobs related 123 | INEX_XER-106 Noble english person from the Hundred Years' War 124 | INEX_XER-108 State capitals of the United States of America 125 | INEX_XER-109 National capitals situated on islands 126 | INEX_XER-110 Nobel Prize in Literature winners who were also poets 127 | INEX_XER-113 Formula 1 drivers that won the Monaco Grand Prix 128 | INEX_XER-114 Formula one races in Europe 129 | INEX_XER-115 Formula One World Constructors' Champions 130 | INEX_XER-116 Italian nobel prize winners 131 | INEX_XER-117 Musicians who appeared in the Blues Brothers movies 132 | INEX_XER-118 French car models in 1960's 133 | INEX_XER-119 Swiss cantons where they speak German 134 | INEX_XER-121 US presidents since 1960 135 | INEX_XER-122 Movies with eight or more Academy Awards 136 | INEX_XER-123 FIFA world cup national team winners since 1974 137 | INEX_XER-124 Novels 
that won the Booker Prize 138 | INEX_XER-125 countries which have won the FIFA world cup 139 | INEX_XER-126 toy train manufacturers that are still in business 140 | INEX_XER-127 german female politicians 141 | INEX_XER-128 Bond girls 142 | INEX_XER-129 Science fiction book written in the 1980 143 | INEX_XER-130 Star Trek Captains 144 | INEX_XER-132 living nordic classical composers 145 | INEX_XER-133 EU countries 146 | INEX_XER-134 record-breaking sprinters in male 100-meter sprints 147 | INEX_XER-135 professional baseball team in Japan 148 | INEX_XER-136 Japanese players in Major League Baseball 149 | INEX_XER-138 National Parks East Coast Canada US 150 | INEX_XER-139 Films directed by Akira Kurosawa 151 | INEX_XER-140 Airports in Germany 152 | INEX_XER-141 Universities in Catalunya 153 | INEX_XER-143 Hanseatic league in Germany in the Netherlands Circle 154 | INEX_XER-144 chess world champions 155 | INEX_XER-147 Chemical elements that are named after people 156 | QALD2_te-1 Which German cities have more than 250000 inhabitants? 157 | QALD2_te-2 Who was the successor of John F. Kennedy? 158 | QALD2_te-3 Who is the mayor of Berlin? 159 | QALD2_te-5 What is the second highest mountain on Earth? 160 | QALD2_te-6 Give me all professional skateboarders from Sweden. 161 | QALD2_te-8 To which countries does the Himalayan mountain system extend? 162 | QALD2_te-9 Give me a list of all trumpet players that were bandleaders. 163 | QALD2_te-11 Who is the Formula 1 race driver with the most races? 164 | QALD2_te-12 Give me all world heritage sites designated within the past five years. 165 | QALD2_te-13 Who is the youngest player in the Premier League? 166 | QALD2_te-14 Give me all members of Prodigy. 167 | QALD2_te-15 What is the longest river? 168 | QALD2_te-17 Give me all cars that are produced in Germany. 169 | QALD2_te-19 Give me all people that were born in Vienna and died in Berlin. 170 | QALD2_te-21 What is the capital of Canada? 
171 | QALD2_te-22 Who is the governor of Texas? 172 | QALD2_te-24 Who was the father of Queen Elizabeth II? 173 | QALD2_te-25 Which U.S. state has been admitted latest? 174 | QALD2_te-27 Sean Parnell is the governor of which U.S. state? 175 | QALD2_te-28 Give me all movies directed by Francis Ford Coppola. 176 | QALD2_te-29 Give me all actors starring in movies directed by and starring William Shatner. 177 | QALD2_te-31 Give me all current Methodist national leaders. 178 | QALD2_te-33 Give me all Australian nonprofit organizations. 179 | QALD2_te-34 In which military conflicts did Lawrence of Arabia participate? 180 | QALD2_te-35 Who developed Skype? 181 | QALD2_te-39 Give me all companies in Munich. 182 | QALD2_te-40 List all boardgames by GMT. 183 | QALD2_te-41 Who founded Intel? 184 | QALD2_te-42 Who is the husband of Amanda Palmer? 185 | QALD2_te-43 Give me all breeds of the German Shepherd dog. 186 | QALD2_te-44 Which cities does the Weser flow through? 187 | QALD2_te-45 Which countries are connected by the Rhine? 188 | QALD2_te-46 Which professional surfers were born on the Philippines? 189 | QALD2_te-48 In which UK city are the headquarters of the MI6? 190 | QALD2_te-49 Which other weapons did the designer of the Uzi develop? 191 | QALD2_te-51 Give me all Frisian islands that belong to the Netherlands. 192 | QALD2_te-53 What is the ruling party in Lisbon? 193 | QALD2_te-55 Which Greek goddesses dwelt on Mount Olympus? 194 | QALD2_te-57 Give me the Apollo 14 astronauts. 195 | QALD2_te-58 What is the time zone of Salt Lake City? 196 | QALD2_te-59 Which U.S. states are in the same timezone as Utah? 197 | QALD2_te-60 Give me a list of all lakes in Denmark. 198 | QALD2_te-63 Give me all Argentine films. 199 | QALD2_te-64 Give me all launch pads operated by NASA. 200 | QALD2_te-65 Which instruments did John Lennon play? 201 | QALD2_te-66 Which ships were called after Benjamin Franklin? 202 | QALD2_te-67 Who are the parents of the wife of Juan Carlos I? 
203 | QALD2_te-72 In which U.S. state is Area 51 located? 204 | QALD2_te-75 Which daughters of British earls died in the same place they were born in? 205 | QALD2_te-76 List the children of Margaret Thatcher. 206 | QALD2_te-77 Who was called Scarface? 207 | QALD2_te-80 Give me all books by William Goldman with more than 300 pages. 208 | QALD2_te-81 Which books by Kerouac were published by Viking Press? 209 | QALD2_te-82 Give me a list of all American inventions. 210 | QALD2_te-84 Who created the comic Captain America? 211 | QALD2_te-86 What is the largest city in Australia? 212 | QALD2_te-87 Who composed the music for Harold and Maude? 213 | QALD2_te-88 Which films starring Clint Eastwood did he direct himself? 214 | QALD2_te-89 In which city was the former Dutch queen Juliana buried? 215 | QALD2_te-90 Where is the residence of the prime minister of Spain? 216 | QALD2_te-91 Which U.S. State has the abbreviation MN? 217 | QALD2_te-92 Show me all songs from Bruce Springsteen released between 1980 and 1990. 218 | QALD2_te-93 Which movies did Sam Raimi direct after Army of Darkness? 219 | QALD2_te-95 Who wrote the lyrics for the Polish national anthem? 220 | QALD2_te-97 Who painted The Storm on the Sea of Galilee? 221 | QALD2_te-98 Which country does the creator of Miffy come from? 222 | QALD2_te-99 For which label did Elvis record his first album? 223 | QALD2_te-100 Who produces Orangina? 224 | QALD2_tr-1 Give me all female Russian astronauts. 225 | QALD2_tr-3 Who is the daughter of Bill Clinton married to? 226 | QALD2_tr-4 Which river does the Brooklyn Bridge cross? 227 | QALD2_tr-6 Where did Abraham Lincoln die? 228 | QALD2_tr-8 Which states of Germany are governed by the Social Democratic Party? 229 | QALD2_tr-9 Which U.S. states possess gold minerals? 230 | QALD2_tr-10 In which country does the Nile start? 231 | QALD2_tr-11 Which countries have places with more than two caves? 232 | QALD2_tr-13 Which classis does the Millepede belong to? 
233 | QALD2_tr-15 Who created Goofy? 234 | QALD2_tr-16 Give me the capitals of all countries in Africa. 235 | QALD2_tr-17 Give me all cities in New Jersey with more than 100000 inhabitants. 236 | QALD2_tr-18 Which museum exhibits The Scream by Munch? 237 | QALD2_tr-21 Which states border Illinois? 238 | QALD2_tr-22 In which country is the Limerick Lake? 239 | QALD2_tr-23 Which television shows were created by Walt Disney? 240 | QALD2_tr-24 Which mountain is the highest after the Annapurna? 241 | QALD2_tr-25 In which films directed by Garry Marshall was Julia Roberts starring? 242 | QALD2_tr-26 Which bridges are of the same type as the Manhattan Bridge? 243 | QALD2_tr-28 Which European countries have a constitutional monarchy? 244 | QALD2_tr-29 Which awards did WikiLeaks win? 245 | QALD2_tr-30 Which state of the USA has the highest population density? 246 | QALD2_tr-31 What is the currency of the Czech Republic? 247 | QALD2_tr-32 Which countries in the European Union adopted the Euro? 248 | QALD2_tr-34 Which countries have more than two official languages? 249 | QALD2_tr-35 Who is the owner of Universal Studios? 250 | QALD2_tr-36 Through which countries does the Yenisei river flow? 251 | QALD2_tr-38 Which monarchs of the United Kingdom were married to a German? 252 | QALD2_tr-40 What is the highest mountain in Australia? 253 | QALD2_tr-41 Give me all soccer clubs in Spain. 254 | QALD2_tr-42 What are the official languages of the Philippines? 255 | QALD2_tr-43 Who is the mayor of New York City? 256 | QALD2_tr-44 Who designed the Brooklyn Bridge? 257 | QALD2_tr-45 Which telecommunications organizations are located in Belgium? 258 | QALD2_tr-47 What is the highest place of Karakoram? 259 | QALD2_tr-49 Give me all companies in the advertising industry. 260 | QALD2_tr-50 What did Bruce Carver die from? 261 | QALD2_tr-51 Give me all school types. 262 | QALD2_tr-52 Which presidents were born in 1945? 263 | QALD2_tr-53 Give me all presidents of the United States. 
264 | QALD2_tr-54 Who was the wife of U.S. president Lincoln? 265 | QALD2_tr-55 Who developed the video game World of Warcraft? 266 | QALD2_tr-57 List all episodes of the first season of the HBO television series The Sopranos! 267 | QALD2_tr-58 Who produced the most films? 268 | QALD2_tr-59 Give me all people with first name Jimmy. 269 | QALD2_tr-61 Which mountains are higher than the Nanga Parbat? 270 | QALD2_tr-62 Who created Wikipedia? 271 | QALD2_tr-63 Give me all actors starring in Batman Begins. 272 | QALD2_tr-64 Which software has been developed by organizations founded in California? 273 | QALD2_tr-65 Which companies work in the aerospace industry as well as on nuclear reactor technology? 274 | QALD2_tr-68 Which actors were born in Germany? 275 | QALD2_tr-69 Which caves have more than 3 entrances? 276 | QALD2_tr-70 Give me all films produced by Hal Roach. 277 | QALD2_tr-71 Give me all video games published by Mean Hamster Software. 278 | QALD2_tr-72 Which languages are spoken in Estonia? 279 | QALD2_tr-73 Who owns Aldi? 280 | QALD2_tr-74 Which capitals in Europe were host cities of the summer olympic games? 281 | QALD2_tr-75 Who has been the 5th president of the United States of America? 282 | QALD2_tr-77 Which music albums contain the song Last Christmas? 283 | QALD2_tr-78 Give me all books written by Danielle Steel. 284 | QALD2_tr-79 Which airports are located in California, USA? 285 | QALD2_tr-80 Give me all Canadian Grunge record labels. 286 | QALD2_tr-81 Which country has the most official languages? 287 | QALD2_tr-82 In which programming language is GIMP written? 288 | QALD2_tr-83 Who produced films starring Natalie Portman? 289 | QALD2_tr-84 Give me all movies with Tom Cruise. 290 | QALD2_tr-85 In which films did Julia Roberts as well as Richard Gere play? 291 | QALD2_tr-86 Give me all female German chancellors. 292 | QALD2_tr-87 Who wrote the book The pillars of the Earth? 293 | QALD2_tr-89 Give me all soccer clubs in the Premier League. 
294 | QALD2_tr-91 Which organizations were founded in 1950?
295 | QALD2_tr-92 What is the highest mountain?
296 | SemSearch_ES-1 44 magnum hunting
297 | SemSearch_ES-2 B. F. Skinner
298 | SemSearch_ES-3 Bookwork
299 | SemSearch_ES-4 NAACP Image Awards
300 | SemSearch_ES-5 Scott County
301 | SemSearch_ES-6 air wisconsin
302 | SemSearch_ES-7 airsoft glock
303 | SemSearch_ES-8 aloha sol
304 | SemSearch_ES-9 american embassy nairobi
305 | SemSearch_ES-10 asheville north carolina
306 | SemSearch_ES-11 austin powers
307 | SemSearch_ES-12 austin texas
308 | SemSearch_ES-13 banana paper making
309 | SemSearch_ES-14 ben franklin
310 | SemSearch_ES-15 bradley center
311 | SemSearch_ES-16 brooklyn bridge
312 | SemSearch_ES-17 butte montana
313 | SemSearch_ES-18 canasta cards
314 | SemSearch_ES-19 carl lewis
315 | SemSearch_ES-20 carolina
316 | SemSearch_ES-21 charles darwin
317 | SemSearch_ES-22 city of charlotte
318 | SemSearch_ES-23 city of virginia beach
319 | SemSearch_ES-24 coastal carolina
320 | SemSearch_ES-25 david suchet
321 | SemSearch_ES-26 disney orlando
322 | SemSearch_ES-27 earl may
323 | SemSearch_ES-28 el salvador
324 | SemSearch_ES-29 ellis college
325 | SemSearch_ES-30 eloan line of credit
326 | SemSearch_ES-31 emery
327 | SemSearch_ES-32 fitzgerald auto mall chambersburg pa
328 | SemSearch_ES-33 harry potter
329 | SemSearch_ES-34 harry potter movie
330 | SemSearch_ES-35 hospice of cincinnati
331 | SemSearch_ES-36 imdb batman returns
332 | SemSearch_ES-37 jack johnson
333 | SemSearch_ES-38 jack the ripper
334 | SemSearch_ES-39 james caldwell high school
335 | SemSearch_ES-40 james clayton md
336 | SemSearch_ES-41 joan of arc
337 | SemSearch_ES-42 john maxwell
338 | SemSearch_ES-45 keith urban
339 | SemSearch_ES-47 king arthur
340 | SemSearch_ES-48 la scala restaurant philadelphia
341 | SemSearch_ES-49 laura bush
342 | SemSearch_ES-50 laura steele bob and tom
343 | SemSearch_ES-51 lexus of maplewood
344 | SemSearch_ES-52 lincoln park
345 | SemSearch_ES-53 lynchburg virginia
346 | SemSearch_ES-54 marc anthony
347 | SemSearch_ES-55 marcus theaters
348 | SemSearch_ES-56 mario bros
349 | SemSearch_ES-57 martin luther king
350 | SemSearch_ES-58 mason ohio
351 | SemSearch_ES-59 mercy hospital in des moines, ia
352 | SemSearch_ES-60 michael douglas
353 | SemSearch_ES-61 mr rourke fantasy island
354 | SemSearch_ES-63 old winchester shotguns
355 | SemSearch_ES-64 omeara ford
356 | SemSearch_ES-65 orlando florida
357 | SemSearch_ES-66 overeaters anonymous
358 | SemSearch_ES-67 ovguide movies
359 | SemSearch_ES-68 pierce county washington
360 | SemSearch_ES-69 piosenki mp3
361 | SemSearch_ES-70 radio italia online
362 | SemSearch_ES-71 richmond virginia
363 | SemSearch_ES-72 rock 103 memphis
364 | SemSearch_ES-73 rowan university
365 | SemSearch_ES-74 sacred heart u
366 | SemSearch_ES-75 sagemont church houston tx
367 | SemSearch_ES-76 san antonio
368 | SemSearch_ES-77 savannah tech
369 | SemSearch_ES-78 sharp pc
370 | SemSearch_ES-79 shobana masala
371 | SemSearch_ES-80 sonny and cher
372 | SemSearch_ES-81 south dakota state university
373 | SemSearch_ES-82 st lucia
374 | SemSearch_ES-83 st paul saints
375 | SemSearch_ES-84 the dish danielle fishel
376 | SemSearch_ES-85 the longest yard sale
377 | SemSearch_ES-86 the morning call lehigh valley pa
378 | SemSearch_ES-87 the quick lift
379 | SemSearch_ES-88 thomas jefferson
380 | SemSearch_ES-89 university of north dakota
381 | SemSearch_ES-90 university of phoenix
382 | SemSearch_ES-91 westminster abbey
383 | SemSearch_ES-93 08 toyota tundra
384 | SemSearch_ES-94 Hugh Downs
385 | SemSearch_ES-95 MADRID
386 | SemSearch_ES-96 New England Coffee
387 | SemSearch_ES-97 PINK PANTHER 2
388 | SemSearch_ES-98 University of Texas at Austin
389 | SemSearch_ES-99 University of York
390 | SemSearch_ES-100 YMCA Tampa
391 | SemSearch_ES-101 ashley wagner
392 | SemSearch_ES-102 beach flowers
393 | SemSearch_ES-103 bounce city humble tx
394 | SemSearch_ES-104 bourbonnais il
395 | SemSearch_ES-105 cedar garden apartments
396 | SemSearch_ES-106 chase masterson
397 | SemSearch_ES-107 concord steel
398 | SemSearch_ES-108 danielia cotton
399 | SemSearch_ES-109 david hewlett
400 | SemSearch_ES-111 eagle rock, ca
401 | SemSearch_ES-112 espresso tv stands
402 | SemSearch_ES-114 glenn frey
403 | SemSearch_ES-115 goodwill of michigan
404 | SemSearch_ES-118 iowa energy
405 | SemSearch_ES-119 john elliott
406 | SemSearch_ES-120 lawrence general hospital
407 | SemSearch_ES-123 michael zimmerman
408 | SemSearch_ES-124 motorola bluetooth hs850
409 | SemSearch_ES-125 nokia e73
410 | SemSearch_ES-127 palm tungsten e2 handheld
411 | SemSearch_ES-128 philadelphia neufchatel cheese
412 | SemSearch_ES-129 pizza populous detroit mi
413 | SemSearch_ES-130 plymouth police department
414 | SemSearch_ES-131 scpa san diego
415 | SemSearch_ES-132 sealy mattress co
416 | SemSearch_ES-133 sedona hiking trails
417 | SemSearch_ES-134 skye woods
418 | SemSearch_ES-135 spring shoes canada
419 | SemSearch_ES-136 sri lanka government gazette
420 | SemSearch_ES-137 steak express
421 | SemSearch_ES-138 syracuse spca
422 | SemSearch_ES-139 the big texan steak house
423 | SemSearch_ES-140 toledo bend realty
424 | SemSearch_ES-141 ventura county court
425 | SemSearch_ES-142 windsor hotel philadelphia
426 | SemSearch_LS-1 Apollo astronauts who walked on the Moon
427 | SemSearch_LS-2 Arab states of the Persian Gulf
428 | SemSearch_LS-3 astronauts who landed on the Moon
429 | SemSearch_LS-4 Axis powers of World War II
430 | SemSearch_LS-5 books of the Jewish canon
431 | SemSearch_LS-6 boroughs of New York City
432 | SemSearch_LS-7 Branches of the US military
433 | SemSearch_LS-8 continents in the world
434 | SemSearch_LS-9 degrees of Eastern Orthodox monasticism
435 | SemSearch_LS-10 did nicole kidman have any siblings
436 | SemSearch_LS-11 dioceses of the church of ireland
437 | SemSearch_LS-12 first targets of the atomic bomb
438 | SemSearch_LS-13 five great epics of Tamil literature
439 | SemSearch_LS-14 gods who dwelt on Mount Olympus
440 | SemSearch_LS-16 hijackers in the September 11 attacks
441 | SemSearch_LS-17 houses of the Russian parliament
442 | SemSearch_LS-18 john lennon, parents
443 | SemSearch_LS-19 kenya's captain in cricket
444 | SemSearch_LS-20 kublai khan siblings
445 | SemSearch_LS-21 lilly allen parents
446 | SemSearch_LS-22 major leagues in the united states
447 | SemSearch_LS-24 matt berry tv series
448 | SemSearch_LS-25 members of u2?
449 | SemSearch_LS-26 movies starring erykah badu
450 | SemSearch_LS-29 nations where Portuguese is an official language
451 | SemSearch_LS-30 orders (or 'choirs') of angels
452 | SemSearch_LS-31 permanent members of the UN Security Council
453 | SemSearch_LS-32 presidents depicted on mount rushmore who died of shooting
454 | SemSearch_LS-33 provinces and territories of Canada
455 | SemSearch_LS-34 ratt albums
456 | SemSearch_LS-35 republics of the former Yugoslavia
457 | SemSearch_LS-36 revolutionaries of 1959 in Cuba
458 | SemSearch_LS-37 standard axioms of set theory
459 | SemSearch_LS-38 states that border oklahoma
460 | SemSearch_LS-39 ten ancient Greek city-kingdoms of Cyprus
461 | SemSearch_LS-40 the first 13 american states
462 | SemSearch_LS-41 the four of the companions of the prophet
463 | SemSearch_LS-42 twelve tribes or sons of Israel
464 | SemSearch_LS-43 what books did paul of tarsus write?
465 | SemSearch_LS-44 what languages do they speak in afghanistan
466 | SemSearch_LS-46 where the British monarch is also head of state
467 | SemSearch_LS-49 who invented the python programming language
468 | SemSearch_LS-50 wonders of the ancient world
469 | TREC_Entity-1 Carriers that Blackberry makes phones for.
470 | TREC_Entity-2 Winners of the ACM Athena award.
471 | TREC_Entity-4 Professional sports teams in Philadelphia.
472 | TREC_Entity-5 Products of Medimmune, Inc.
473 | TREC_Entity-6 Organizations that award Nobel prizes.
474 | TREC_Entity-7 Airlines that currently use Boeing 747 planes.
475 | TREC_Entity-9 Members of The Beaux Arts Trio.
476 | TREC_Entity-10 Campuses of Indiana University.
477 | TREC_Entity-11 Donors to the Home Depot Foundation.
478 | TREC_Entity-12 Airlines that Air Canada has code share flights with.
479 | TREC_Entity-14 Authors awarded an Anthony Award at Bouchercon in 2007.
480 | TREC_Entity-15 Universities that are members of the SEC conference for football.
481 | TREC_Entity-16 Sponsors of the Mancuso quilt festivals.
482 | TREC_Entity-17 Chefs with a show on the Food Network.
483 | TREC_Entity-18 Members of the band Jefferson Airplane.
484 | TREC_Entity-19 Companies that John Hennessey serves on the board of.
485 | TREC_Entity-20 Scotch whisky distilleries on the island of Islay.
486 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pylucene>=4.10.1
--------------------------------------------------------------------------------