├── LICENSE ├── README.md ├── authors_comments.md ├── lib ├── edu.cmu.lti.wikipedia_redirect │ ├── .classpath │ ├── .project │ ├── .settings │ │ ├── org.eclipse.core.resources.prefs │ │ └── org.eclipse.jdt.core.prefs │ ├── .svn │ │ ├── entries │ │ ├── format │ │ ├── pristine │ │ │ ├── 29 │ │ │ │ └── 29fd2aaaf86816aafbbc105506741d18535ab692.svn-base │ │ │ ├── 30 │ │ │ │ └── 30792ab46c06afdce18ad48ab5759bdcf973ba1a.svn-base │ │ │ ├── 81 │ │ │ │ └── 8159a4c51d3239796584c79b7550d7c49972f4d8.svn-base │ │ │ ├── 84 │ │ │ │ └── 84bf19ebd162073a34792b87aa400129a6068865.svn-base │ │ │ ├── 01 │ │ │ │ └── 010314f8fd56597897f1bd210790af39ebe6b887.svn-base │ │ │ ├── 1a │ │ │ │ └── 1ade57daf5d7fd981e4ababdc006c8c3a02bcbe5.svn-base │ │ │ ├── 2d │ │ │ │ └── 2dd2033882f190c773a5bce39ed7b2362af4ad02.svn-base │ │ │ ├── 5d │ │ │ │ ├── 5d3212ecd32fd3d4b72c3e11820392d4eb0a7054.svn-base │ │ │ │ └── 5d601d1da850507fb57a9de2d042751af0c19eb0.svn-base │ │ │ ├── 6f │ │ │ │ └── 6f27f0a059481c81ea3674baef5f02dcdf93bc42.svn-base │ │ │ ├── b2 │ │ │ │ └── b2f1c3203bdbb9cbbc7b334f031504dcfa465b61.svn-base │ │ │ ├── c2 │ │ │ │ └── c2d8aecb47cbf4a0d2ebc3c5eb42630ab7999559.svn-base │ │ │ ├── f8 │ │ │ │ └── f8a0a6e41c412aed301f3bd515ea356240c9cc44.svn-base │ │ │ └── fb │ │ │ │ └── fbba7c603f44332bd67313432986b2f97da47014.svn-base │ │ └── wc.db │ ├── README.txt │ ├── launches │ │ ├── Demo.launch │ │ └── WikipediaRedirectExtractor.launch │ ├── src │ │ └── edu │ │ │ └── cmu │ │ │ └── lti │ │ │ └── wikipedia_redirect │ │ │ ├── Demo.java │ │ │ ├── IOUtil.java │ │ │ ├── WikipediaHypernym.java │ │ │ ├── WikipediaRedirect.java │ │ │ └── WikipediaRedirectExtractor.java │ └── test-data │ │ ├── sample-jawiki-latest-pages-articles.xml │ │ └── sample-res_cat_jawiki.txt └── wikiextractor-master-280915.zip ├── nordlys ├── __init__.py ├── config.py ├── storage │ ├── __init__.py │ ├── mongo.py │ └── surfaceforms.py ├── tagme │ ├── __init__.py │ ├── config.py │ ├── dexter_api.py │ ├── lucene_tools.py │ ├── mention.py │ ├── query.py │ ├── tagme.py │ ├── tagme_api.py │ └── test_coll.py └── wikipedia │ ├── __init__.py │ ├── anchor_extractor.py │ ├── annot_extractor.py │ ├── indexer.py │ ├── merge_sf.py │ ├── pageid_extractor.py │ └── utils.py ├── requirements.txt ├── run_scripts.sh ├── scripts ├── __init__.py ├── evaluator_annot.py ├── evaluator_disamb.py ├── evaluator_strict.py ├── evaluator_topics.py └── to_elq.py └── setup.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 Faegheh Hasibi 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# TAGME reproducibility

This repository contains resources developed within the following paper:

F. Hasibi, K. Balog, and S.E. Bratsberg. “On the reproducibility of the TAGME Entity Linking System”,
In Proceedings of the 38th European Conference on Information Retrieval (ECIR ’16), March 2016.

This study is an effort aimed at reproducing the results presented in the TAGME paper [1]. See the [paper](http://hasibi.com/files/ecir2016-tagme.pdf) and [presentation](http://www.slideshare.net/FaeghehHasibi/tagmerep) for detailed information.

We received invaluable comments from the TAGME authors about their system, and we have made these notes available [here](authors_comments.md).
These comments may inform future efforts related to the re-implementation of the TAGME system, as they cannot be found in the original paper.

This repository is structured as follows:

- `nordlys/`: Code required for running the entity linkers.
- `scripts/`: Evaluation scripts.
- `lib/`: Contains libraries.
- `run_scripts.sh`: A single script that runs all the scripts needed to produce the results of the paper.
- [authors_comments.md](authors_comments.md): Comments from the TAGME authors and notes about our experiments.

Other resources involved in this project are [data](http://hasibi.com/files/res/data.tar.gz), [qrels](http://hasibi.com/files/res/qrels.tar.gz), and [runs](http://hasibi.com/files/res/runs.tar.gz), which are described below.

**Note:** Before running the code (`run_scripts.sh`), please read the [setup](setup.md) file and build all the required resources.


## Data

The following data files can be downloaded from [here](http://hasibi.com/files/res/data.tar.gz):

- **Wiki-disamb30** and **Wiki-annot30**: The original datasets are published [here](http://acube.di.unipi.it/tagme-dataset/). We complement the snippets with numerical IDs, as IDs are not contained in the original datasets.
- **ERD-dev**: The dataset was originally published by the [ERD Challenge](http://web-ngram.research.microsoft.com/ERD2014); we use it in our generalizability experiments. The files related to this dataset are prefixed with `Trec_beta`.
- **Y-ERD**: This dataset was originally published in [2] and is available [here](http://bit.ly/ictir2015-elq). The dataset is used in our generalizability experiments.
- **Freebase snapshot**: A snapshot of Freebase containing only proper noun entities (e.g., people and locations) is made available by the ERD Challenge and is used for filtering entities in the generalizability experiments.


## Qrels

The qrel files can be downloaded from [here](http://hasibi.com/files/res/qrels.tar.gz). All qrels are tab-delimited and their format is as follows:

- **Wiki-disamb30** and **Wiki-annot30**: The columns represent: snippet ID, confidence score, Wikipedia URI, and Wikipedia page ID. The last column is not considered in the evaluation scripts.
- **ERD-dev** and **Y-ERD**: The columns represent: query ID, confidence score (always 1), and Wikipedia URI. The entities after the second column represent an interpretation set (entity set) of the query. (If a query has multiple interpretations, there are multiple lines with that query ID.)
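As an illustration of these two formats, the sketch below shows one way such qrel files could be read. It is based only on the column descriptions above; the function names are ours, and the repository's own loading code (under `nordlys/` and `scripts/`) may differ.

```python
def read_wiki_qrels(path):
    """Read a Wiki-disamb30/Wiki-annot30 style qrel file.

    Each tab-delimited line is assumed to hold: snippet ID, confidence score,
    Wikipedia URI, and Wikipedia page ID (the page ID is ignored here, as it
    is not used by the evaluation scripts).
    """
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            snippet_id, score, uri = cols[0], float(cols[1]), cols[2]
            qrels.setdefault(snippet_id, []).append((uri, score))
    return qrels


def read_erd_qrels(path):
    """Read an ERD-dev/Y-ERD style qrel file.

    Each line is assumed to hold a query ID, a confidence score (always 1),
    and the Wikipedia URIs forming one interpretation set; a query with
    several interpretations appears on several lines.
    """
    interpretations = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            query_id, uris = cols[0], cols[2:]
            interpretations.setdefault(query_id, []).append(set(uris))
    return interpretations
```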
## Runs

The run files can be downloaded from [here](http://hasibi.com/files/res/runs.tar.gz) and are categorized into two groups: reproducibility and generalizability.

- **Reproducibility**: The naming convention for these files is *XX_YY.txt*, where XX represents the dataset and YY is the name of the method. For each file, only the first four columns are considered for the evaluation: snippet ID, confidence score, Wikipedia URI, and mention.
- **Generalizability**: These files are named *XX_YY_ZZ.elq*, where XX is the dataset, YY is the name of the method, and ZZ is the entity linking threshold used for evaluation. The format of these files is similar to the corresponding qrel files.

## Citation

If you use the resources presented in this repository, please cite:

```
@inproceedings{Hasibi:2016:ORT,
  author = {Hasibi, Faegheh and Balog, Krisztian and Bratsberg, Svein Erik},
  title = {On the reproducibility of the TAGME Entity Linking System},
  booktitle = {Proceedings of the 38th European Conference on Information Retrieval},
  series = {ECIR '16},
  year = {2016},
  pages = {436--449},
  publisher = {Springer},
  DOI = {http://dx.doi.org/10.1007/978-3-319-30671-1_32}
}
```

## Contact

Should you have any questions, please contact Faegheh Hasibi at .

[1] P. Ferragina and U. Scaiella. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of CIKM '10, pages 1625–1628, 2010.

[2] F. Hasibi, K. Balog, and S. E. Bratsberg. Entity Linking in Queries: Tasks and Evaluation. In Proceedings of ICTIR ’15, pages 171–180, 2015.

--------------------------------------------------------------------------------
/authors_comments.md:
--------------------------------------------------------------------------------

# Authors' comments

In our study aimed at reproducing the results in [1], only parts of the results were reproducible.
Later on, the TAGME authors clarified some of the issues that surfaced.
We list these comments below, as they may inform future efforts related to the re-implementation of TAGME.

We also include some additional notes on our experiments, which clarify our reasoning behind certain decisions.

## TAGME authors' comments

*The comments below are taken from our personal communication with the TAGME authors. Most of these are direct quotes, but we made minor editorial changes and structured them by topic.*

**Important note:** In this section "we" refers to the TAGME authors; to avoid ambiguity, we will refer to our implementation as the "ECIR'16 implementation."

### Implementation:

- The TAGME paper [1] represents “version 1” of TAGME, while the source code and the TAGME API are “version 2”. In the second version, the epsilon value has been changed and the value of tau has been decreased.

- TAGME uses the Wikipedia page-to-page link records (enwiki-xxx-pagelinks.sql.gz), while the ECIR'16 implementation extracts links from the body of the pages (enwiki-xxx-pages-articles.xml.bz2). This affects the computation of relatedness, as the former source contains 20% more links than the latter. For example, the Wikipedia article [Jaguar](https://en.wikipedia.org/wiki/Jaguar) contains several links under the "Extant Carnivora species" section, which may not be found in the *.xml.bz2* file.

- TAGME version 2 uses a list of stop words to create alternative spots, and we add them to the list of available spots during the pre-processing phase. In other words, when a spot like "president of the united states" is found, two spots are created: (i) "president of the united states" and (ii) "president united states". Then these two spots are added to the anchor dictionary. However, TAGME does not perform any stop word removal during the parsing phase.

- In TAGME version 2 the parsing method for anchors starting with 'the', 'a', 'an' has changed. We ignore those prefixes and use only the remaining part. So version 2 can never find 'the firebrand' but only 'firebrand'.

- TAGME performs two extra filtering steps before the pruning step.
  * Filtering of mentions that are contained in a longer mention.
  * Filtering based on a link probability threshold. In version 1 this threshold was set to 0.02, but in version 2 it is set to 0.1. The code related to this filtering is line 87 of `TagmeParser.java` in the TAGME source code:
```
this.minLinkProb = TagmeConfig.get().getSetting(MODULE).getFloatParam(PARAM_MIN_LP, DEFAULT_MIN_LP);
```

### Evaluation:

- For the experiments in [1], we used only 1.4M out of the 2M snippets from WIKI-DISAMB30, as Weka could not load more than that into memory. From WIKI-ANNOT30 we used all snippets; the difference is merely a matter of approximation.

- The evaluation metrics used for end-to-end performance (topics and annot metrics) are micro-averaged.

- The evaluation metrics for the disambiguation phase are micro-averaged (prec = TP / (TP + FP), recall = TP / total number of test cases) and are computed as follows:
  1. annotate the fragment
  2. search for the mention in the result
  3. if you don't find it, ignore it
  4. if you find it:
     - if it is correct, increment the number of true positives
     - if it is not correct, increment the number of false positives
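To make this procedure concrete, the sketch below spells out the micro-averaged computation just described. It is an illustration added to these notes, not the TAGME authors' code and not the evaluation scripts shipped in `scripts/`; the shape of the test cases and of the annotator's output is an assumption.

```python
def disambiguation_micro_scores(test_cases, annotate):
    """Micro-averaged precision/recall for the disambiguation phase.

    `test_cases` is assumed to be a list of (fragment, mention, gold_entity)
    triples, and `annotate(fragment)` is assumed to return a dict mapping
    each linked mention to the entity chosen by the system.
    """
    tp, fp = 0, 0
    for fragment, mention, gold_entity in test_cases:
        result = annotate(fragment)          # 1. annotate the fragment
        if mention not in result:            # 2./3. mention not found: ignore it
            continue
        if result[mention] == gold_entity:   # 4. found and correct
            tp += 1
        else:                                # 4. found but linked to the wrong entity
            fp += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / len(test_cases) if test_cases else 0.0
    return precision, recall
```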
## Our additional comments on the TAGME authors' comments :)

- For the sake of reproducibility, we had to use the Wikipedia dump closest to the original experiments, that is, the dump from April 2010. The page-to-page link records for this dump are no longer available, and therefore we had to extract them from the body of the Wikipedia pages (enwiki-20100408-pages-articles.xml.bz2).
- The TAGME datasets (Wiki-annot and Wiki-disamb) contain page IDs, which have changed over time in Wikipedia. We addressed this issue as follows:
  * We converted the page IDs of the datasets to the corresponding page titles using the 2010 dump.
  * For all the experiments, we converted the page titles to URIs based on the [Wikipedia instructions](https://en.wikipedia.org/wiki/Wikipedia:Page_name#Spaces.2C_underscores_and_character_coding).

Using this method, the URIs used in our experiments are consistent with the TAGME datasets. Since the generalizability datasets were originally created based on DBpedia URIs, this results in minor differences (due to encoding). However, the differences are negligible and do not affect the overall conclusion.


```
[1] P. Ferragina and U. Scaiella.
TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of CIKM '10, pages 1625–1628, 2010. 60 | ``` 61 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.classpath: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.project: -------------------------------------------------------------------------------- 1 | 2 | 3 | edu.cmu.lti.wikipedia_redirect 4 | 5 | 6 | 7 | 8 | 9 | org.eclipse.jdt.core.javabuilder 10 | 11 | 12 | 13 | 14 | 15 | org.eclipse.jdt.core.javanature 16 | 17 | 18 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.settings/org.eclipse.core.resources.prefs: -------------------------------------------------------------------------------- 1 | #Sat Oct 08 00:05:51 EDT 2011 2 | eclipse.preferences.version=1 3 | encoding/=UTF-8 4 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.settings/org.eclipse.jdt.core.prefs: -------------------------------------------------------------------------------- 1 | #Tue Oct 11 01:41:50 EDT 2011 2 | eclipse.preferences.version=1 3 | org.eclipse.jdt.core.compiler.codegen.inlineJsrBytecode=enabled 4 | org.eclipse.jdt.core.compiler.codegen.targetPlatform=1.5 5 | org.eclipse.jdt.core.compiler.codegen.unusedLocal=preserve 6 | org.eclipse.jdt.core.compiler.compliance=1.5 7 | org.eclipse.jdt.core.compiler.debug.lineNumber=generate 8 | org.eclipse.jdt.core.compiler.debug.localVariable=generate 9 | org.eclipse.jdt.core.compiler.debug.sourceFile=generate 10 | org.eclipse.jdt.core.compiler.problem.assertIdentifier=error 11 | org.eclipse.jdt.core.compiler.problem.enumIdentifier=error 12 | org.eclipse.jdt.core.compiler.source=1.5 13 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/entries: -------------------------------------------------------------------------------- 1 | 12 2 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/format: -------------------------------------------------------------------------------- 1 | 12 2 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/01/010314f8fd56597897f1bd210790af39ebe6b887.svn-base: -------------------------------------------------------------------------------- 1 | See http://code.google.com/p/wikipedia-redirect -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/1a/1ade57daf5d7fd981e4ababdc006c8c3a02bcbe5.svn-base: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/29/29fd2aaaf86816aafbbc105506741d18535ab692.svn-base: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with 
the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.BufferedReader; 19 | import java.io.BufferedWriter; 20 | import java.io.File; 21 | import java.io.FileInputStream; 22 | import java.io.FileOutputStream; 23 | import java.io.FileReader; 24 | import java.io.InputStreamReader; 25 | import java.io.LineNumberReader; 26 | import java.io.ObjectInputStream; 27 | import java.io.ObjectOutputStream; 28 | import java.io.OutputStreamWriter; 29 | import java.util.AbstractMap; 30 | import java.util.ArrayList; 31 | import java.util.List; 32 | import java.util.Map.Entry; 33 | 34 | /** 35 | * Reads and writes wikipedia redirect data. 36 | * 37 | * @author Hideki Shima 38 | * 39 | */ 40 | public class IOUtil { 41 | 42 | /** 43 | * Save Wikipedia redirect data 44 | * 45 | * @param redirectData 46 | * map where key is original term and value is redirected term 47 | * @throws Exception 48 | */ 49 | public static void save( AbstractMap map ) throws Exception { 50 | File outputDir = new File("target"); 51 | if (!outputDir.exists()) { 52 | outputDir.mkdirs(); 53 | } 54 | WikipediaRedirect wr = new WikipediaRedirect( map ); 55 | saveText( wr, outputDir ); 56 | saveSerialized( wr, outputDir ); 57 | } 58 | 59 | /** 60 | * Save Wikipedia redirect data into tab separated text file 61 | * 62 | * @param redirectData 63 | * map where key is original term and value is redirected term 64 | * @throws Exception 65 | */ 66 | private static void saveText( WikipediaRedirect wr, File outputDir ) throws Exception { 67 | File txtFile = new File(outputDir, "wikipedia_redirect.txt"); 68 | FileOutputStream fosTxt = new FileOutputStream(txtFile); 69 | OutputStreamWriter osw = new OutputStreamWriter(fosTxt, "utf-8"); 70 | BufferedWriter bw = new BufferedWriter(osw); 71 | for ( Entry entry : wr.entrySet() ) { 72 | bw.write( entry.getKey()+"\t"+entry.getValue()+"\n" ); 73 | } 74 | bw.close(); 75 | osw.close(); 76 | fosTxt.close(); 77 | System.out.println("Saved redirect data in text format: "+txtFile.getAbsolutePath()); 78 | } 79 | 80 | /** 81 | * Save Wikipedia redirect data into serialized object 82 | * 83 | * @param redirectData 84 | * map where key is original term and value is redirected term 85 | * @throws Exception 86 | */ 87 | private static void saveSerialized( WikipediaRedirect wr, File outputDir ) throws Exception { 88 | File objFile = new File(outputDir, "wikipedia_redirect.ser"); 89 | FileOutputStream fosObj = new FileOutputStream(objFile); 90 | ObjectOutputStream outObject = new ObjectOutputStream(fosObj); 91 | outObject.writeObject(wr); 92 | outObject.close(); 93 | fosObj.close(); 94 | System.out.println("Serialized redirect data: "+objFile.getAbsolutePath()); 95 | } 96 | 97 | /** 98 | * Deserializes wikipedia redirect data 99 | * @param file 100 | * serialized object or tab-separated text 101 | * @return wikipedia redirect 102 | * @throws Exception 103 | */ 104 | public static WikipediaRedirect loadWikipediaRedirect( File f ) throws Exception { 105 | if (!f.exists() || f.isDirectory()) { 106 | System.err.println("File not found: 
"+f.getAbsolutePath()); 107 | System.exit(-1); 108 | } 109 | if ( f.getName().endsWith(".ser") ) { 110 | return loadWikipediaRedirectFromSerialized( f ); 111 | } else { 112 | //faster than above? 113 | return loadWikipediaRedirectFromText( f ); 114 | } 115 | } 116 | 117 | /** 118 | * Deserializes wikipedia redirect data from serialized object data 119 | * @param file 120 | * serialized object 121 | * @return wikipedia redirect 122 | * @throws Exception 123 | */ 124 | private static WikipediaRedirect loadWikipediaRedirectFromSerialized( File f ) throws Exception { 125 | WikipediaRedirect object; 126 | try { 127 | FileInputStream inFile = new FileInputStream(f); 128 | ObjectInputStream inObject = new ObjectInputStream(inFile); 129 | object = (WikipediaRedirect)inObject.readObject(); 130 | inObject.close(); 131 | inFile.close(); 132 | } catch (Exception e) { 133 | throw e; 134 | } 135 | return object; 136 | } 137 | 138 | /** 139 | * Deserializes wikipedia redirect data from tab-separated text file 140 | * @param file 141 | * tab-separated text 142 | * @return wikipedia redirect 143 | * @throws Exception 144 | */ 145 | private static WikipediaRedirect loadWikipediaRedirectFromText( File f ) throws Exception { 146 | int size = (int)countLineNumber(f); 147 | WikipediaRedirect wr = new WikipediaRedirect( size ); 148 | try { 149 | FileInputStream fis = new FileInputStream( f ); 150 | InputStreamReader isr = new InputStreamReader( fis ); 151 | BufferedReader br = new BufferedReader( isr ); 152 | String line = null; 153 | while ( (line = br.readLine()) != null ) { 154 | String[] elements = line.split("\t"); 155 | wr.put( elements[0], elements[1] ); 156 | } 157 | br.close(); 158 | isr.close(); 159 | fis.close(); 160 | } catch (Exception e) { 161 | throw e; 162 | } 163 | return wr; 164 | } 165 | 166 | /** 167 | * Loads tab separated data as an alternative way to load() method. 
168 | * Works for Wikipedia hypernym data generated by 169 | * NICT's "Hyponymy extraction tool" 170 | * 171 | * @param file 172 | * tab separated file that contains lines that look "word1[TAB]word2[BR]" 173 | * @return wikipedia redirect 174 | * @throws Exception 175 | */ 176 | public static WikipediaHypernym loadWikipediaHypernym( File f ) throws Exception { 177 | int size = (int)IOUtil.countLineNumber( f ); 178 | WikipediaHypernym object = new WikipediaHypernym( size ); 179 | try { 180 | FileInputStream inFile = new FileInputStream( f ); 181 | InputStreamReader isr = new InputStreamReader( inFile ); 182 | BufferedReader br = new BufferedReader( isr ); 183 | String line = null; 184 | while ( (line = br.readLine())!=null ) { 185 | String[] tokens = line.split("\t"); 186 | if (tokens.length<=1) { 187 | continue; 188 | } 189 | String key = tokens[0]; 190 | List targets = object.get(key); 191 | if ( targets==null ) { 192 | targets = new ArrayList(); 193 | } 194 | targets.add(tokens[1]); 195 | object.put(key, targets); 196 | } 197 | br.close(); 198 | isr.close(); 199 | inFile.close(); 200 | } catch (Exception e) { 201 | throw e; 202 | } 203 | return object; 204 | } 205 | 206 | /** 207 | * Count number of lines in a file in an efficient way 208 | * @param f 209 | * @return 210 | * @throws Exception 211 | */ 212 | public static long countLineNumber( File f ) throws Exception { 213 | LineNumberReader lnr = new LineNumberReader(new FileReader(f)); 214 | lnr.skip(Long.MAX_VALUE); 215 | int count = lnr.getLineNumber(); 216 | lnr.close(); 217 | return count; 218 | } 219 | } 220 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/2d/2dd2033882f190c773a5bce39ed7b2362af4ad02.svn-base: -------------------------------------------------------------------------------- 1 | #Sat Oct 08 00:05:51 EDT 2011 2 | eclipse.preferences.version=1 3 | encoding/=UTF-8 4 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/30/30792ab46c06afdce18ad48ab5759bdcf973ba1a.svn-base: -------------------------------------------------------------------------------- 1 | ACMフェロー アルフレッド・エイホ 0.288392 2 | ACMフェロー アンドリュー・タネンバウム 0.084127 3 | ACMフェロー エドムンド・クラーク 0.220679 4 | ACMフェロー グラディ・ブーチ -0.175180 5 | ACMフェロー ジャック・ドンガラ 0.427047 6 | ACMフェロー スティーブン・ボーン 0.220679 7 | ACMフェロー ダグラス・カマー 0.907805 8 | ACMフェロー ダン・ブリックリン 0.220679 9 | ACMフェロー ビャーネ・ストロヴストルップ 0.220679 10 | ACMフェロー ビル・グロップ 0.220679 11 | ACMフェロー ピーター・ノーヴィグ 0.233410 12 | ACMフェロー ボブ・フランクストン 0.241471 13 | ACMフェロー リチャード・ハミング 0.899804 14 | ACMフェロー 米澤明憲 0.143425 -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/5d/5d3212ecd32fd3d4b72c3e11820392d4eb0a7054.svn-base: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/5d/5d601d1da850507fb57a9de2d042751af0c19eb0.svn-base: -------------------------------------------------------------------------------- 1 | 2 | 3 | Wikipedia 4 | http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 5 | MediaWiki 1.17wmf1 6 | first-letter 7 | 8 | メディア 9 | 特別 10 | 11 | ノート 12 | 利用者 13 | 利用者‐会話 14 | Wikipedia 15 | Wikipedia‐ノート 16 | ファイル 17 | ファイル‐ノート 18 | MediaWiki 19 | MediaWiki‐ノート 20 | Template 21 | 
Template‐ノート 22 | Help 23 | Help‐ノート 24 | Category 25 | Category‐ノート 26 | Portal 27 | Portal‐ノート 28 | プロジェクト 29 | プロジェクト‐ノート 30 | 31 | 32 | 33 | Wikipedia:Sandbox 34 | 6 35 | 36 | 37 | 36654478 38 | 2011-03-06T16:16:58Z 39 | 40 | Y-dash 41 | 309126 42 | 43 | テストは[[Wikipedia:サンドボックス]]でお願いいたします。 / [[Special:Contributions/Kompek|Kompek]] ([[User talk:Kompek|会話]]) による ID:36654304 の版を[[H:RV|取り消し]] 44 | #REDIRECT [[Wikipedia:サンドボックス]] 45 | 46 | 47 | 48 | SandBox 49 | 26 50 | 51 | 52 | 6986090 53 | 2006-08-05T23:25:48Z 54 | 55 | Nevylax 56 | 38464 57 | 58 | #REDIRECT [[サンドボックス]] 59 | #REDIRECT [[サンドボックス]] 60 | 61 | 62 | 63 | HomePage 64 | 46 65 | 66 | 67 | 2168894 68 | 2005-03-22T13:49:43Z 69 | 70 | Hideyuki 71 | 9577 72 | 73 | #REDIRECT [[ホームページ]] 74 | #REDIRECT [[ホームページ]] 75 | 76 | 77 | 78 | Wikipedia:About 79 | 51 80 | 81 | 82 | 19962101 83 | 2008-05-31T01:12:46Z 84 | 85 | Kanjy 86 | 36859 87 | 88 | 89 | 2003-03-06T11:35:13Z Setu さん版 (#REDIRECT [[Wikipedia:ウィキペディアについて]]) に戻す 90 | #REDIRECT [[Wikipedia:ウィキペディアについて]] 91 | 92 | 93 | 94 | Wikipedia:How does one edit a page 95 | 85 96 | 97 | 98 | 13206183 99 | 2007-06-19T05:06:43Z 100 | 101 | Aotake 102 | 34929 103 | 104 | 105 | redirect target 106 | #REDIRECT [[Help:ページの編集]] 107 | 108 | 109 | 110 | ワールド・ミュージック 111 | 113 112 | 113 | 114 | 24277249 115 | 2009-02-07T16:20:20Z 116 | 117 | Point136 118 | 211299 119 | 120 | 121 | Bot: リダイレクト構文の修正 122 | #REDIRECT [[ワールドミュージック]] 123 | 124 | 125 | 126 | ネマティック相 127 | 127 128 | 129 | 130 | 24277255 131 | 2009-02-07T16:20:40Z 132 | 133 | Point136 134 | 211299 135 | 136 | 137 | Bot: リダイレクト構文の修正 138 | #REDIRECT [[ネマティック液晶]] 139 | 140 | 141 | 142 | スメクティック相 143 | 128 144 | 145 | 146 | 2168972 147 | 2004-01-07T09:45:14Z 148 | 149 | Yas 150 | 739 151 | 152 | 153 | #REDIRECT [[液晶]] 154 | #REDIRECT [[液晶]] 155 | 156 | 157 | 158 | ミュージシャン一覧 (個人) 159 | 143 160 | 161 | 162 | 35399365 163 | 2010-12-14T00:40:05Z 164 | 165 | Xqbot 166 | 273540 167 | 168 | 169 | ロボットによる: 二重リダイレクト修正 → [[音楽家の一覧]] 170 | #転送 [[音楽家の一覧]] 171 | 172 | 173 | 174 | 病名 175 | 176 176 | 177 | 178 | 17793766 179 | 2008-02-04T13:37:54Z 180 | 181 | U3002 182 | 66126 183 | 184 | 185 | 二重リダイレクト回避 186 | #REDIRECT [[病気の別名の一覧]] 187 | 188 | 189 | 190 | Wikipedia:Welcome, newcomers 191 | 216 192 | 193 | 194 | 10662242 195 | 2007-02-15T13:40:22Z 196 | 197 | Cave cattum 198 | 41235 199 | 200 | #REDIRECT [[Wikipedia:ウィキペディアへようこそ]] 201 | #REDIRECT [[Wikipedia:ウィキペディアへようこそ]] 202 | 203 | 204 | 205 | 黒人霊歌 206 | 260 207 | 208 | 209 | 22493441 210 | 2008-10-23T05:59:07Z 211 | 212 | Buzin Satuma Hayato 213 | 243768 214 | 215 | 216 | 黒人霊歌はスピリチュアル(音楽)だと思う 217 | #REDIRECT [[スピリチュアル#スピリチュアル(音楽)]] 218 | 219 | 220 | 221 | Wikipedia:漢字やスペルに注意 222 | 281 223 | 224 | 225 | 13451992 226 | 2007-07-02T14:15:39Z 227 | 228 | Cave cattum 229 | 41235 230 | 231 | [[WP:AES|←]][[Wikipedia:記事を執筆する]]へのリダイレクト 232 | #REDIRECT [[Wikipedia:記事を執筆する]] 233 | 234 | 235 | 236 | Wikipedia:他言語の使用は控えめに 237 | 283 238 | 239 | 240 | 15761853 241 | 2007-10-27T06:12:35Z 242 | 243 | Khhy 244 | 13490 245 | 246 | #他言語表記は控えめに 247 | #REDIRECT [[Wikipedia:素晴らしい記事を書くには#他言語表記は控えめに]] 248 | 249 | 250 | 251 | Wikipedia:日本語表記法 252 | 291 253 | 254 | 255 | 12840559 256 | 2007-05-30T14:59:17Z 257 | 258 | Aotake 259 | 34929 260 | 261 | [[Wikipedia:表記ガイド]]へ統合。 262 | #REDIRECT [[Wikipedia:表記ガイド]] 263 | 264 | 265 | 266 | Wikipedia:リダイレクトの使い方 267 | 308 268 | 269 | 270 | 2169127 271 | 2003-11-30T04:09:10Z 272 | 273 | 219.164.91.166 274 | 275 | #REDIRECT [[Wikipedia:リダイレクト]] 276 | #REDIRECT [[Wikipedia:リダイレクト]] 277 | 278 | 279 | 
280 | アルコール飲料 281 | 318 282 | 283 | 284 | 15885863 285 | 2007-11-02T12:42:36Z 286 | 287 | Balmung0731 288 | 99201 289 | 290 | [[酒]]へ統合 291 | #REDIRECT [[酒]] 292 | 293 | 294 | 295 | 地学 296 | 321 297 | 298 | 299 | 2169138 300 | 2003-09-26T11:33:29Z 301 | 302 | 133.11.230.18 303 | 304 | #REDIRECT [[地球科学]] 305 | 306 | 307 | 308 | Wikipedia:ノートページのレイアウト 309 | 323 310 | 311 | 312 | 36028694 313 | 2011-01-24T11:38:30Z 314 | 315 | Kurz 316 | 1601 317 | 318 | 319 | lk 320 | #REDIRECT [[Wikipedia:ノートページのガイドライン]] 321 | 322 | 323 | 324 | Wikipedia:ページを孤立させない 325 | 332 326 | 327 | 328 | 14492895 329 | 2007-08-27T16:20:04Z 330 | 331 | Cave cattum 332 | 41235 333 | 334 | #REDIRECT [[Wikipedia:記事どうしをつなぐ]] 335 | #REDIRECT [[Wikipedia:記事どうしをつなぐ]] 336 | 337 | 338 | 339 | 明石沢貴士 340 | 335 341 | 342 | 343 | 34308655 344 | 2010-10-03T18:47:08Z 345 | 346 | EmausBot 347 | 397108 348 | 349 | 350 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 あ行#.E6.98.8E.E7.9F.B3.E6.B2.A2.E8.B2.B4.E5.A3.AB]] 351 | #転送 [[プロジェクト:漫画家/日本の漫画家 あ行#.E6.98.8E.E7.9F.B3.E6.B2.A2.E8.B2.B4.E5.A3.AB]] 352 | 353 | 354 | 355 | ここまひ 356 | 341 357 | 358 | 359 | 34308404 360 | 2010-10-03T18:16:42Z 361 | 362 | EmausBot 363 | 397108 364 | 365 | 366 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 か行#.E3.81.93.E3.81.93.E3.81.BE.E3.81.B2]] 367 | #転送 [[プロジェクト:漫画家/日本の漫画家 か行#.E3.81.93.E3.81.93.E3.81.BE.E3.81.B2]] 368 | 369 | 370 | 371 | 吉冨昭仁 372 | 354 373 | 374 | 375 | 7856042 376 | 2006-09-23T16:20:44Z 377 | 378 | Mambo95 379 | 77516 380 | 381 | [[吉富昭仁]]へのリダイレクト 382 | #REDIRECT [[吉富昭仁]] 383 | 384 | 385 | 386 | 現在のイベント 387 | 356 388 | 389 | 390 | 12796821 391 | 2007-05-28T09:42:57Z 392 | 393 | Khhy 394 | 13490 395 | 396 | #REDIRECT [[Portal:最近の出来事]] 397 | #REDIRECT [[Portal:最近の出来事]] 398 | 399 | 400 | 401 | Wikipedia:項目名のつけ方 402 | 451 403 | 404 | 405 | 2169218 406 | 2003-02-03T20:03:50Z 407 | 408 | Tomos 409 | 10 410 | 411 | #REDIRECT[[Wikipedia:記事名の付け方]] 412 | 413 | 414 | 415 | 必要とされている記事 416 | 456 417 | 418 | 419 | 2169223 420 | 2004-04-16T16:59:03Z 421 | 422 | Listener 423 | 6379 424 | 425 | Wikipedia:執筆依頼, double redirect 426 | #REDIRECT [[Wikipedia:執筆依頼]] 427 | 428 | 429 | 430 | 東京を舞台にした漫画作品 431 | 465 432 | 433 | 434 | 39222184 435 | 2011-09-16T06:55:20Z 436 | 437 | リオネル 438 | 98816 439 | 440 | [[東京を舞台にした漫画・アニメ作品]]へ統合 441 | #REDIRECT [[東京を舞台にした漫画・アニメ作品]] 442 | 443 | 444 | 445 | 必要とされている画像 446 | 467 447 | 448 | 449 | 2169230 450 | 2004-03-20T05:01:58Z 451 | 452 | Michey.M-test 453 | 3537 454 | 455 | 456 | Wikipedia:画像提供依頼 457 | #REDIRECT [[Wikipedia:画像提供依頼]] 458 | 459 | 460 | 461 | 水縞とおる 462 | 471 463 | 464 | 465 | 34308685 466 | 2010-10-03T18:50:25Z 467 | 468 | EmausBot 469 | 397108 470 | 471 | 472 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 473 | #転送 [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 474 | 475 | 476 | 477 | 恋人は守護霊!? 
478 | 472 479 | 480 | 481 | 34308641 482 | 2010-10-03T18:45:13Z 483 | 484 | EmausBot 485 | 397108 486 | 487 | 488 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 489 | #転送 [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 490 | 491 | [[Category:漫画作品 こ|いひとはしゆこれい]] 492 | [[Category:月刊コミックNORA|こいひとはしゆこれい]] 493 | 494 | 495 | 496 | ユーゴスラビア改名 497 | 504 498 | 499 | 500 | 24277258 501 | 2009-02-07T16:21:01Z 502 | 503 | Point136 504 | 211299 505 | 506 | 507 | Bot: リダイレクト構文の修正 508 | #REDIRECT [[ユーゴスラビア]] 509 | 510 | 511 | 512 | あだちつよし 513 | 519 514 | 515 | 516 | 34308384 517 | 2010-10-03T18:14:37Z 518 | 519 | EmausBot 520 | 397108 521 | 522 | 523 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 あ行#.E3.81.82.E3.81.A0.E3.81.A1.E3.81.A4.E3.82.88.E3.81.97]] 524 | #転送 [[プロジェクト:漫画家/日本の漫画家 あ行#.E3.81.82.E3.81.A0.E3.81.A1.E3.81.A4.E3.82.88.E3.81.97]] 525 | 526 | 527 | 528 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/6f/6f27f0a059481c81ea3674baef5f02dcdf93bc42.svn-base: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.BufferedReader; 19 | import java.io.BufferedWriter; 20 | import java.io.File; 21 | import java.io.FileInputStream; 22 | import java.io.FileOutputStream; 23 | import java.io.InputStreamReader; 24 | import java.io.OutputStreamWriter; 25 | import java.util.regex.Matcher; 26 | import java.util.regex.Pattern; 27 | 28 | /** 29 | * Extracts wikipedia redirect information and serializes the data. 
30 | * 31 | * @author Hideki Shima 32 | * 33 | */ 34 | public class WikipediaRedirectExtractor { 35 | 36 | private static String titlePattern = " "; 37 | private static String redirectPattern = " <redirect"; 38 | private static String textPattern = " <text xml"; 39 | private static Pattern pRedirect = Pattern.compile( 40 | "#[ ]?[^ ]+[ ]?\\[\\[(.+?)\\]\\]", Pattern.CASE_INSENSITIVE); 41 | 42 | public void run( File inputFile, File outputFile ) throws Exception { 43 | int invalidCount = 0; 44 | long t0 = System.currentTimeMillis(); 45 | FileInputStream fis = new FileInputStream( inputFile ); 46 | // TreeMap<String,String> map = new HashMap<String,String>(); 47 | InputStreamReader isr = new InputStreamReader(fis, "utf-8"); 48 | BufferedReader br = new BufferedReader(isr); 49 | FileOutputStream fos = new FileOutputStream(outputFile); 50 | OutputStreamWriter osw = new OutputStreamWriter(fos, "utf-8"); 51 | BufferedWriter bw = new BufferedWriter(osw); 52 | 53 | int count = 0; 54 | String title = null; 55 | String text = null; 56 | String line = null; 57 | boolean isRedirect = false; 58 | boolean inText = false; 59 | while ((line=br.readLine())!=null) { 60 | if (line.startsWith(titlePattern)) { 61 | title = line; 62 | text = null; 63 | isRedirect = false; 64 | } 65 | if (line.startsWith(redirectPattern)) { 66 | isRedirect = true; 67 | } 68 | if (isRedirect && (line.startsWith(textPattern) || inText)) { 69 | Matcher m = pRedirect.matcher(line); // slow regex shouldn't be used until here. 70 | if (m.find()) { // make sure the current text field contains [[...]] 71 | text = line; 72 | try { 73 | title = cleanupTitle(title); 74 | String redirectedTitle = m.group(1); 75 | if ( isValidAlias(title, redirectedTitle) ) { 76 | bw.write( title+"\t"+redirectedTitle+"\n" ); 77 | count++; 78 | // map.put( title, redirectedTitle ); 79 | } else { 80 | invalidCount++; 81 | } 82 | } catch ( StringIndexOutOfBoundsException e ) { 83 | System.out.println("ERROR: cannot extract redirection from title = "+title+", text = "+text); 84 | e.printStackTrace(); 85 | } 86 | } else { // Very rare case 87 | inText = true; 88 | } 89 | } 90 | } 91 | br.close(); 92 | isr.close(); 93 | fis.close(); 94 | 95 | bw.close(); 96 | osw.close(); 97 | fos.close(); 98 | System.out.println("---- Wikipedia redirect extraction done ----"); 99 | long t1 = System.currentTimeMillis(); 100 | // IOUtil.save( map ); 101 | System.out.println("Discarded "+invalidCount+" redirects to wikipedia meta articles."); 102 | System.out.println("Extracted "+count+" redirects."); 103 | System.out.println("Saved output: "+outputFile.getAbsolutePath()); 104 | System.out.println("Done in "+((t1-t0)/1000)+" sec."); 105 | } 106 | 107 | private String cleanupTitle( String title ) { 108 | int end = title.indexOf(""); 109 | return end!=-1?title.substring(titlePattern.length(), end):title; 110 | } 111 | 112 | /** 113 | * Identifies if the redirection is valid. 114 | * Currently, we only check if the redirection is related to 115 | * a special Wikipedia page or not. 116 | * 117 | * TODO: write more rules to discard more invalid redirects. 
118 | * 119 | * @param title source title 120 | * @param redirectedTitle target title 121 | * @return validity 122 | */ 123 | private boolean isValidAlias( String title, String redirectedTitle ) { 124 | if ( title.startsWith("Wikipedia:") 125 | || title.startsWith("Template:") 126 | || title.startsWith("Portal:") 127 | || title.startsWith("List of ")) { 128 | return false; 129 | } 130 | return true; 131 | } 132 | 133 | public static void main(String[] args) throws Exception { 134 | if (args.length!=1) { 135 | System.err.println("ERROR: Please specify the path to the wikipedia article xml file as the argument."); 136 | System.err.println("Tips: enclose the path with double quotes if a space exists in the path."); 137 | return; 138 | } 139 | File inputFile = new File(args[0]); 140 | if (!inputFile.exists() || inputFile.isDirectory()) { 141 | System.err.println("ERROR: File not found at "+inputFile.getAbsolutePath()); 142 | return; 143 | } 144 | String prefix = inputFile.getName().replaceFirst("-.*", ""); 145 | File outputDir = new File("target"); 146 | if (!outputDir.exists()) { 147 | outputDir.mkdirs(); 148 | } 149 | File outputFile = new File(outputDir, prefix+"-redirect.txt"); 150 | new WikipediaRedirectExtractor().run( inputFile, outputFile ); 151 | } 152 | } 153 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/81/8159a4c51d3239796584c79b7550d7c49972f4d8.svn-base: -------------------------------------------------------------------------------- 1 | #Tue Oct 11 01:41:50 EDT 2011 2 | eclipse.preferences.version=1 3 | org.eclipse.jdt.core.compiler.codegen.inlineJsrBytecode=enabled 4 | org.eclipse.jdt.core.compiler.codegen.targetPlatform=1.5 5 | org.eclipse.jdt.core.compiler.codegen.unusedLocal=preserve 6 | org.eclipse.jdt.core.compiler.compliance=1.5 7 | org.eclipse.jdt.core.compiler.debug.lineNumber=generate 8 | org.eclipse.jdt.core.compiler.debug.localVariable=generate 9 | org.eclipse.jdt.core.compiler.debug.sourceFile=generate 10 | org.eclipse.jdt.core.compiler.problem.assertIdentifier=error 11 | org.eclipse.jdt.core.compiler.problem.enumIdentifier=error 12 | org.eclipse.jdt.core.compiler.source=1.5 13 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/84/84bf19ebd162073a34792b87aa400129a6068865.svn-base: -------------------------------------------------------------------------------- 1 | 2 | 3 | edu.cmu.lti.wikipedia_redirect 4 | 5 | 6 | 7 | 8 | 9 | org.eclipse.jdt.core.javabuilder 10 | 11 | 12 | 13 | 14 | 15 | org.eclipse.jdt.core.javanature 16 | 17 | 18 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/b2/b2f1c3203bdbb9cbbc7b334f031504dcfa465b61.svn-base: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. 
See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.File; 19 | import java.io.Serializable; 20 | import java.util.ArrayList; 21 | import java.util.HashMap; 22 | import java.util.List; 23 | 24 | /** 25 | * Represents the wikipedia hypernym data e.g. ones generated by 26 | * NICT's "Hyponymy extraction tool" 27 | * 28 | * @author Hideki Shima 29 | */ 30 | public class WikipediaHypernym extends HashMap> 31 | implements Serializable { 32 | 33 | private static final long serialVersionUID = 20111019L; 34 | 35 | public WikipediaHypernym( int size ) { 36 | // RAM (heap) efficient capacity setting 37 | super( size * 4 / 3 + 1 ); 38 | } 39 | 40 | public void load( File file ) throws Exception { 41 | WikipediaHypernym wh = IOUtil.loadWikipediaHypernym(file); 42 | for ( String key : wh.keySet() ) { 43 | List thisList = get(key); 44 | List newList = wh.get(key); 45 | if ( thisList != null ) { 46 | thisList.addAll( newList ); 47 | } else { 48 | thisList = new ArrayList( newList ); 49 | } 50 | put(key, thisList); 51 | } 52 | } 53 | 54 | } 55 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/c2/c2d8aecb47cbf4a0d2ebc3c5eb42630ab7999559.svn-base: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/f8/f8a0a6e41c412aed301f3bd515ea356240c9cc44.svn-base: -------------------------------------------------------------------------------- 1 | package edu.cmu.lti.wikipedia_redirect; 2 | import java.io.File; 3 | import java.util.Set; 4 | 5 | /** 6 | * Demo of what you can do with Wikipedia Redirect. 7 | * @author Hideki Shima 8 | */ 9 | public class Demo { 10 | private static String[] enSrcTerms = {"Bin Ladin", "William Henry Gates", 11 | "JFK", "The Steel City", "The City of Bridges", "Da burgh", "Hoagie", 12 | "Centre", "3.14"}; 13 | private static String[] jaSrcTerms = {"オサマビンラディン", "オサマ・ビンラーディン", 14 | "東日本大地震","東日本太平洋沖地震" ,"NACSIS", 15 | "ダイアモンド", "アボガド", "バイオリン", "平成12年", "3.14"}; 16 | private static String enTarget = "Bayesian network"; 17 | private static String jaTarget = "計算機科学"; 18 | 19 | public static void main(String[] args) throws Exception { 20 | // Initialization 21 | System.out.print("Loading Wikipedia Redirect ..."); 22 | long t0 = System.currentTimeMillis(); 23 | File inputFile = new File(args[0]); 24 | WikipediaRedirect wr = IOUtil.loadWikipediaRedirect(inputFile); 25 | boolean useJapaneseExample = inputFile.getName().substring(0, 2).equals("ja"); 26 | String[] srcTerms = useJapaneseExample ? jaSrcTerms : enSrcTerms; 27 | String target = useJapaneseExample ? jaTarget : enTarget; 28 | long t1 = System.currentTimeMillis(); 29 | System.out.println(" done in "+(t1-t0)/1000D+" sec.\n"); 30 | 31 | // Let's find a redirection given a source word. 32 | StringBuilder sb = new StringBuilder(); 33 | for ( String src : srcTerms ) { 34 | sb.append("redirect(\""+src+"\") = \""+wr.get(src)+"\"\n"); 35 | } 36 | long t2 = System.currentTimeMillis(); 37 | System.out.println(sb.toString()+"Done in "+(t2-t1)/1000D+" sec.\n--\n"); 38 | 39 | // Let's find which source words redirect to the given target word. 
40 | Set keys = wr.getKeysByValue(target); 41 | long t3 = System.currentTimeMillis(); 42 | System.out.println("All of the following redirect to \""+target+"\":\n"+keys); 43 | System.out.println("Done in "+(t3-t2)/1000D+" sec.\n"); 44 | } 45 | } 46 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/fb/fbba7c603f44332bd67313432986b2f97da47014.svn-base: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.Serializable; 19 | import java.util.HashMap; 20 | import java.util.LinkedHashSet; 21 | import java.util.Map; 22 | import java.util.Map.Entry; 23 | import java.util.Set; 24 | 25 | /** 26 | * Represents the wikipedia redirect data. 27 | * 28 | * Things you should know: key-value is one-to-many in Wikipedia Redirect. 29 | * Let's denote X -> Y when a source term X redirects to the term B. 30 | * X is unique in the entire Wikipedia Redirect data set, but Y is not. 31 | * In other words, there exists a Y such that X -> Y and X' -> Y. 32 | * 33 | * @author Hideki Shima 34 | * 35 | */ 36 | public class WikipediaRedirect extends HashMap 37 | implements Serializable { 38 | //Do we need case insensitive hash map? C.f. http://www.coderanch.com/t/385950/java/java/HashMap-key-case-insensitivity 39 | 40 | private static final long serialVersionUID = 20111008L; 41 | 42 | public WikipediaRedirect() { 43 | super(); 44 | } 45 | 46 | public WikipediaRedirect( int size ) { 47 | // RAM (heap) efficient capacity setting 48 | super( size * 4 / 3 + 1 ); 49 | } 50 | 51 | public WikipediaRedirect( Map map ) { 52 | super( map ); 53 | } 54 | 55 | /** 56 | * Get keys in the map such that the value equals to the given value. 57 | * 58 | * @param value 59 | * @return keys 60 | */ 61 | public Set getKeysByValue(String value) { 62 | Set results = new LinkedHashSet(); 63 | //Iterating through all items is slow. 64 | //TODO: use existing library for faster access e.g. guava. 
65 | for (Entry entry : entrySet()) { 66 | if (value.equals(entry.getValue())) { 67 | results.add(entry.getKey()); 68 | } 69 | } 70 | return results; 71 | } 72 | } 73 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/wc.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hasibi/TAGME-Reproducibility/d21ed0d826fc60a6e4caaa5ec7b6c39e16f7c6c6/lib/edu.cmu.lti.wikipedia_redirect/.svn/wc.db -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/README.txt: -------------------------------------------------------------------------------- 1 | See http://code.google.com/p/wikipedia-redirect -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/launches/Demo.launch: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/launches/WikipediaRedirectExtractor.launch: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/src/edu/cmu/lti/wikipedia_redirect/Demo.java: -------------------------------------------------------------------------------- 1 | package edu.cmu.lti.wikipedia_redirect; 2 | import java.io.File; 3 | import java.util.Set; 4 | 5 | /** 6 | * Demo of what you can do with Wikipedia Redirect. 7 | * @author Hideki Shima 8 | */ 9 | public class Demo { 10 | private static String[] enSrcTerms = {"Bin Ladin", "William Henry Gates", 11 | "JFK", "The Steel City", "The City of Bridges", "Da burgh", "Hoagie", 12 | "Centre", "3.14"}; 13 | private static String[] jaSrcTerms = {"オサマビンラディン", "オサマ・ビンラーディン", 14 | "東日本大地震","東日本太平洋沖地震" ,"NACSIS", 15 | "ダイアモンド", "アボガド", "バイオリン", "平成12年", "3.14"}; 16 | private static String enTarget = "Bayesian network"; 17 | private static String jaTarget = "計算機科学"; 18 | 19 | public static void main(String[] args) throws Exception { 20 | // Initialization 21 | System.out.print("Loading Wikipedia Redirect ..."); 22 | long t0 = System.currentTimeMillis(); 23 | File inputFile = new File(args[0]); 24 | WikipediaRedirect wr = IOUtil.loadWikipediaRedirect(inputFile); 25 | boolean useJapaneseExample = inputFile.getName().substring(0, 2).equals("ja"); 26 | String[] srcTerms = useJapaneseExample ? jaSrcTerms : enSrcTerms; 27 | String target = useJapaneseExample ? jaTarget : enTarget; 28 | long t1 = System.currentTimeMillis(); 29 | System.out.println(" done in "+(t1-t0)/1000D+" sec.\n"); 30 | 31 | // Let's find a redirection given a source word. 32 | StringBuilder sb = new StringBuilder(); 33 | for ( String src : srcTerms ) { 34 | sb.append("redirect(\""+src+"\") = \""+wr.get(src)+"\"\n"); 35 | } 36 | long t2 = System.currentTimeMillis(); 37 | System.out.println(sb.toString()+"Done in "+(t2-t1)/1000D+" sec.\n--\n"); 38 | 39 | // Let's find which source words redirect to the given target word. 
40 | Set keys = wr.getKeysByValue(target); 41 | long t3 = System.currentTimeMillis(); 42 | System.out.println("All of the following redirect to \""+target+"\":\n"+keys); 43 | System.out.println("Done in "+(t3-t2)/1000D+" sec.\n"); 44 | } 45 | } 46 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/src/edu/cmu/lti/wikipedia_redirect/IOUtil.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.BufferedReader; 19 | import java.io.BufferedWriter; 20 | import java.io.File; 21 | import java.io.FileInputStream; 22 | import java.io.FileOutputStream; 23 | import java.io.FileReader; 24 | import java.io.InputStreamReader; 25 | import java.io.LineNumberReader; 26 | import java.io.ObjectInputStream; 27 | import java.io.ObjectOutputStream; 28 | import java.io.OutputStreamWriter; 29 | import java.util.AbstractMap; 30 | import java.util.ArrayList; 31 | import java.util.List; 32 | import java.util.Map.Entry; 33 | 34 | /** 35 | * Reads and writes wikipedia redirect data. 
36 | * 37 | * @author Hideki Shima 38 | * 39 | */ 40 | public class IOUtil { 41 | 42 | /** 43 | * Save Wikipedia redirect data 44 | * 45 | * @param redirectData 46 | * map where key is original term and value is redirected term 47 | * @throws Exception 48 | */ 49 | public static void save( AbstractMap map ) throws Exception { 50 | File outputDir = new File("target"); 51 | if (!outputDir.exists()) { 52 | outputDir.mkdirs(); 53 | } 54 | WikipediaRedirect wr = new WikipediaRedirect( map ); 55 | saveText( wr, outputDir ); 56 | saveSerialized( wr, outputDir ); 57 | } 58 | 59 | /** 60 | * Save Wikipedia redirect data into tab separated text file 61 | * 62 | * @param redirectData 63 | * map where key is original term and value is redirected term 64 | * @throws Exception 65 | */ 66 | private static void saveText( WikipediaRedirect wr, File outputDir ) throws Exception { 67 | File txtFile = new File(outputDir, "wikipedia_redirect.txt"); 68 | FileOutputStream fosTxt = new FileOutputStream(txtFile); 69 | OutputStreamWriter osw = new OutputStreamWriter(fosTxt, "utf-8"); 70 | BufferedWriter bw = new BufferedWriter(osw); 71 | for ( Entry entry : wr.entrySet() ) { 72 | bw.write( entry.getKey()+"\t"+entry.getValue()+"\n" ); 73 | } 74 | bw.close(); 75 | osw.close(); 76 | fosTxt.close(); 77 | System.out.println("Saved redirect data in text format: "+txtFile.getAbsolutePath()); 78 | } 79 | 80 | /** 81 | * Save Wikipedia redirect data into serialized object 82 | * 83 | * @param redirectData 84 | * map where key is original term and value is redirected term 85 | * @throws Exception 86 | */ 87 | private static void saveSerialized( WikipediaRedirect wr, File outputDir ) throws Exception { 88 | File objFile = new File(outputDir, "wikipedia_redirect.ser"); 89 | FileOutputStream fosObj = new FileOutputStream(objFile); 90 | ObjectOutputStream outObject = new ObjectOutputStream(fosObj); 91 | outObject.writeObject(wr); 92 | outObject.close(); 93 | fosObj.close(); 94 | System.out.println("Serialized redirect data: "+objFile.getAbsolutePath()); 95 | } 96 | 97 | /** 98 | * Deserializes wikipedia redirect data 99 | * @param file 100 | * serialized object or tab-separated text 101 | * @return wikipedia redirect 102 | * @throws Exception 103 | */ 104 | public static WikipediaRedirect loadWikipediaRedirect( File f ) throws Exception { 105 | if (!f.exists() || f.isDirectory()) { 106 | System.err.println("File not found: "+f.getAbsolutePath()); 107 | System.exit(-1); 108 | } 109 | if ( f.getName().endsWith(".ser") ) { 110 | return loadWikipediaRedirectFromSerialized( f ); 111 | } else { 112 | //faster than above? 
113 | return loadWikipediaRedirectFromText( f ); 114 | } 115 | } 116 | 117 | /** 118 | * Deserializes wikipedia redirect data from serialized object data 119 | * @param file 120 | * serialized object 121 | * @return wikipedia redirect 122 | * @throws Exception 123 | */ 124 | private static WikipediaRedirect loadWikipediaRedirectFromSerialized( File f ) throws Exception { 125 | WikipediaRedirect object; 126 | try { 127 | FileInputStream inFile = new FileInputStream(f); 128 | ObjectInputStream inObject = new ObjectInputStream(inFile); 129 | object = (WikipediaRedirect)inObject.readObject(); 130 | inObject.close(); 131 | inFile.close(); 132 | } catch (Exception e) { 133 | throw e; 134 | } 135 | return object; 136 | } 137 | 138 | /** 139 | * Deserializes wikipedia redirect data from tab-separated text file 140 | * @param file 141 | * tab-separated text 142 | * @return wikipedia redirect 143 | * @throws Exception 144 | */ 145 | private static WikipediaRedirect loadWikipediaRedirectFromText( File f ) throws Exception { 146 | int size = (int)countLineNumber(f); 147 | WikipediaRedirect wr = new WikipediaRedirect( size ); 148 | try { 149 | FileInputStream fis = new FileInputStream( f ); 150 | InputStreamReader isr = new InputStreamReader( fis ); 151 | BufferedReader br = new BufferedReader( isr ); 152 | String line = null; 153 | while ( (line = br.readLine()) != null ) { 154 | String[] elements = line.split("\t"); 155 | wr.put( elements[0], elements[1] ); 156 | } 157 | br.close(); 158 | isr.close(); 159 | fis.close(); 160 | } catch (Exception e) { 161 | throw e; 162 | } 163 | return wr; 164 | } 165 | 166 | /** 167 | * Loads tab separated data as an alternative way to load() method. 168 | * Works for Wikipedia hypernym data generated by 169 | * NICT's "Hyponymy extraction tool" 170 | * 171 | * @param file 172 | * tab separated file that contains lines that look "word1[TAB]word2[BR]" 173 | * @return wikipedia redirect 174 | * @throws Exception 175 | */ 176 | public static WikipediaHypernym loadWikipediaHypernym( File f ) throws Exception { 177 | int size = (int)IOUtil.countLineNumber( f ); 178 | WikipediaHypernym object = new WikipediaHypernym( size ); 179 | try { 180 | FileInputStream inFile = new FileInputStream( f ); 181 | InputStreamReader isr = new InputStreamReader( inFile ); 182 | BufferedReader br = new BufferedReader( isr ); 183 | String line = null; 184 | while ( (line = br.readLine())!=null ) { 185 | String[] tokens = line.split("\t"); 186 | if (tokens.length<=1) { 187 | continue; 188 | } 189 | String key = tokens[0]; 190 | List targets = object.get(key); 191 | if ( targets==null ) { 192 | targets = new ArrayList(); 193 | } 194 | targets.add(tokens[1]); 195 | object.put(key, targets); 196 | } 197 | br.close(); 198 | isr.close(); 199 | inFile.close(); 200 | } catch (Exception e) { 201 | throw e; 202 | } 203 | return object; 204 | } 205 | 206 | /** 207 | * Count number of lines in a file in an efficient way 208 | * @param f 209 | * @return 210 | * @throws Exception 211 | */ 212 | public static long countLineNumber( File f ) throws Exception { 213 | LineNumberReader lnr = new LineNumberReader(new FileReader(f)); 214 | lnr.skip(Long.MAX_VALUE); 215 | int count = lnr.getLineNumber(); 216 | lnr.close(); 217 | return count; 218 | } 219 | } 220 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/src/edu/cmu/lti/wikipedia_redirect/WikipediaHypernym.java: 
-------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.File; 19 | import java.io.Serializable; 20 | import java.util.ArrayList; 21 | import java.util.HashMap; 22 | import java.util.List; 23 | 24 | /** 25 | * Represents the wikipedia hypernym data e.g. ones generated by 26 | * NICT's "Hyponymy extraction tool" 27 | * 28 | * @author Hideki Shima 29 | */ 30 | public class WikipediaHypernym extends HashMap> 31 | implements Serializable { 32 | 33 | private static final long serialVersionUID = 20111019L; 34 | 35 | public WikipediaHypernym( int size ) { 36 | // RAM (heap) efficient capacity setting 37 | super( size * 4 / 3 + 1 ); 38 | } 39 | 40 | public void load( File file ) throws Exception { 41 | WikipediaHypernym wh = IOUtil.loadWikipediaHypernym(file); 42 | for ( String key : wh.keySet() ) { 43 | List thisList = get(key); 44 | List newList = wh.get(key); 45 | if ( thisList != null ) { 46 | thisList.addAll( newList ); 47 | } else { 48 | thisList = new ArrayList( newList ); 49 | } 50 | put(key, thisList); 51 | } 52 | } 53 | 54 | } 55 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/src/edu/cmu/lti/wikipedia_redirect/WikipediaRedirect.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.Serializable; 19 | import java.util.HashMap; 20 | import java.util.LinkedHashSet; 21 | import java.util.Map; 22 | import java.util.Map.Entry; 23 | import java.util.Set; 24 | 25 | /** 26 | * Represents the wikipedia redirect data. 27 | * 28 | * Things you should know: key-value is one-to-many in Wikipedia Redirect. 29 | * Let's denote X -> Y when a source term X redirects to the term B. 30 | * X is unique in the entire Wikipedia Redirect data set, but Y is not. 31 | * In other words, there exists a Y such that X -> Y and X' -> Y. 32 | * 33 | * @author Hideki Shima 34 | * 35 | */ 36 | public class WikipediaRedirect extends HashMap 37 | implements Serializable { 38 | //Do we need case insensitive hash map? C.f. 
http://www.coderanch.com/t/385950/java/java/HashMap-key-case-insensitivity 39 | 40 | private static final long serialVersionUID = 20111008L; 41 | 42 | public WikipediaRedirect() { 43 | super(); 44 | } 45 | 46 | public WikipediaRedirect( int size ) { 47 | // RAM (heap) efficient capacity setting 48 | super( size * 4 / 3 + 1 ); 49 | } 50 | 51 | public WikipediaRedirect( Map map ) { 52 | super( map ); 53 | } 54 | 55 | /** 56 | * Get keys in the map such that the value equals to the given value. 57 | * 58 | * @param value 59 | * @return keys 60 | */ 61 | public Set getKeysByValue(String value) { 62 | Set results = new LinkedHashSet(); 63 | //Iterating through all items is slow. 64 | //TODO: use existing library for faster access e.g. guava. 65 | for (Entry entry : entrySet()) { 66 | if (value.equals(entry.getValue())) { 67 | results.add(entry.getKey()); 68 | } 69 | } 70 | return results; 71 | } 72 | } 73 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/src/edu/cmu/lti/wikipedia_redirect/WikipediaRedirectExtractor.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.BufferedReader; 19 | import java.io.BufferedWriter; 20 | import java.io.File; 21 | import java.io.FileInputStream; 22 | import java.io.FileOutputStream; 23 | import java.io.InputStreamReader; 24 | import java.io.OutputStreamWriter; 25 | import java.util.regex.Matcher; 26 | import java.util.regex.Pattern; 27 | 28 | /** 29 | * Extracts wikipedia redirect information and serializes the data. 
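 * Output format (one line per valid redirect): the source title, a tab, then the redirect
 * target, e.g. "HomePage<TAB>ホームページ" for the sample ja-wiki dump under test-data.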
30 | * 31 | * @author Hideki Shima 32 | * 33 | */ 34 | public class WikipediaRedirectExtractor { 35 | 36 | private static String titlePattern = " "; 37 | private static String redirectPattern = " <redirect"; 38 | private static String textPattern = " <text xml"; 39 | private static Pattern pRedirect = Pattern.compile( 40 | "#[ ]?[^ ]+[ ]?\\[\\[(.+?)\\]\\]", Pattern.CASE_INSENSITIVE); 41 | 42 | public void run( File inputFile, File outputFile ) throws Exception { 43 | int invalidCount = 0; 44 | long t0 = System.currentTimeMillis(); 45 | FileInputStream fis = new FileInputStream( inputFile ); 46 | // TreeMap<String,String> map = new HashMap<String,String>(); 47 | InputStreamReader isr = new InputStreamReader(fis, "utf-8"); 48 | BufferedReader br = new BufferedReader(isr); 49 | FileOutputStream fos = new FileOutputStream(outputFile); 50 | OutputStreamWriter osw = new OutputStreamWriter(fos, "utf-8"); 51 | BufferedWriter bw = new BufferedWriter(osw); 52 | 53 | int count = 0; 54 | String title = null; 55 | String text = null; 56 | String line = null; 57 | boolean isRedirect = false; 58 | boolean inText = false; 59 | while ((line=br.readLine())!=null) { 60 | if (line.startsWith(titlePattern)) { 61 | title = line; 62 | text = null; 63 | isRedirect = false; 64 | } 65 | if (line.startsWith(redirectPattern)) { 66 | isRedirect = true; 67 | } 68 | if (isRedirect && (line.startsWith(textPattern) || inText)) { 69 | Matcher m = pRedirect.matcher(line); // slow regex shouldn't be used until here. 70 | if (m.find()) { // make sure the current text field contains [[...]] 71 | text = line; 72 | try { 73 | title = cleanupTitle(title); 74 | String redirectedTitle = m.group(1); 75 | if ( isValidAlias(title, redirectedTitle) ) { 76 | bw.write( title+"\t"+redirectedTitle+"\n" ); 77 | count++; 78 | // map.put( title, redirectedTitle ); 79 | } else { 80 | invalidCount++; 81 | } 82 | } catch ( StringIndexOutOfBoundsException e ) { 83 | System.out.println("ERROR: cannot extract redirection from title = "+title+", text = "+text); 84 | e.printStackTrace(); 85 | } 86 | } else { // Very rare case 87 | inText = true; 88 | } 89 | } 90 | } 91 | br.close(); 92 | isr.close(); 93 | fis.close(); 94 | 95 | bw.close(); 96 | osw.close(); 97 | fos.close(); 98 | System.out.println("---- Wikipedia redirect extraction done ----"); 99 | long t1 = System.currentTimeMillis(); 100 | // IOUtil.save( map ); 101 | System.out.println("Discarded "+invalidCount+" redirects to wikipedia meta articles."); 102 | System.out.println("Extracted "+count+" redirects."); 103 | System.out.println("Saved output: "+outputFile.getAbsolutePath()); 104 | System.out.println("Done in "+((t1-t0)/1000)+" sec."); 105 | } 106 | 107 | private String cleanupTitle( String title ) { 108 | int end = title.indexOf(""); 109 | return end!=-1?title.substring(titlePattern.length(), end):title; 110 | } 111 | 112 | /** 113 | * Identifies if the redirection is valid. 114 | * Currently, we only check if the redirection is related to 115 | * a special Wikipedia page or not. 116 | * 117 | * TODO: write more rules to discard more invalid redirects. 
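 * (Only the "Wikipedia:", "Template:", "Portal:" and "List of " prefixes are checked below;
 * other meta namespaces such as "Category:", "Help:" or "File:" could plausibly be added.)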
118 | * 119 | * @param title source title 120 | * @param redirectedTitle target title 121 | * @return validity 122 | */ 123 | private boolean isValidAlias( String title, String redirectedTitle ) { 124 | if ( title.startsWith("Wikipedia:") 125 | || title.startsWith("Template:") 126 | || title.startsWith("Portal:") 127 | || title.startsWith("List of ")) { 128 | return false; 129 | } 130 | return true; 131 | } 132 | 133 | public static void main(String[] args) throws Exception { 134 | if (args.length!=1) { 135 | System.err.println("ERROR: Please specify the path to the wikipedia article xml file as the argument."); 136 | System.err.println("Tips: enclose the path with double quotes if a space exists in the path."); 137 | return; 138 | } 139 | File inputFile = new File(args[0]); 140 | if (!inputFile.exists() || inputFile.isDirectory()) { 141 | System.err.println("ERROR: File not found at "+inputFile.getAbsolutePath()); 142 | return; 143 | } 144 | String prefix = inputFile.getName().replaceFirst("-.*", ""); 145 | File outputDir = new File("target"); 146 | if (!outputDir.exists()) { 147 | outputDir.mkdirs(); 148 | } 149 | File outputFile = new File(outputDir, prefix+"-redirect.txt"); 150 | new WikipediaRedirectExtractor().run( inputFile, outputFile ); 151 | } 152 | } 153 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/test-data/sample-jawiki-latest-pages-articles.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | Wikipedia 4 | http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 5 | MediaWiki 1.17wmf1 6 | first-letter 7 | 8 | メディア 9 | 特別 10 | 11 | ノート 12 | 利用者 13 | 利用者‐会話 14 | Wikipedia 15 | Wikipedia‐ノート 16 | ファイル 17 | ファイル‐ノート 18 | MediaWiki 19 | MediaWiki‐ノート 20 | Template 21 | Template‐ノート 22 | Help 23 | Help‐ノート 24 | Category 25 | Category‐ノート 26 | Portal 27 | Portal‐ノート 28 | プロジェクト 29 | プロジェクト‐ノート 30 | 31 | 32 | 33 | Wikipedia:Sandbox 34 | 6 35 | 36 | 37 | 36654478 38 | 2011-03-06T16:16:58Z 39 | 40 | Y-dash 41 | 309126 42 | 43 | テストは[[Wikipedia:サンドボックス]]でお願いいたします。 / [[Special:Contributions/Kompek|Kompek]] ([[User talk:Kompek|会話]]) による ID:36654304 の版を[[H:RV|取り消し]] 44 | #REDIRECT [[Wikipedia:サンドボックス]] 45 | 46 | 47 | 48 | SandBox 49 | 26 50 | 51 | 52 | 6986090 53 | 2006-08-05T23:25:48Z 54 | 55 | Nevylax 56 | 38464 57 | 58 | #REDIRECT [[サンドボックス]] 59 | #REDIRECT [[サンドボックス]] 60 | 61 | 62 | 63 | HomePage 64 | 46 65 | 66 | 67 | 2168894 68 | 2005-03-22T13:49:43Z 69 | 70 | Hideyuki 71 | 9577 72 | 73 | #REDIRECT [[ホームページ]] 74 | #REDIRECT [[ホームページ]] 75 | 76 | 77 | 78 | Wikipedia:About 79 | 51 80 | 81 | 82 | 19962101 83 | 2008-05-31T01:12:46Z 84 | 85 | Kanjy 86 | 36859 87 | 88 | 89 | 2003-03-06T11:35:13Z Setu さん版 (#REDIRECT [[Wikipedia:ウィキペディアについて]]) に戻す 90 | #REDIRECT [[Wikipedia:ウィキペディアについて]] 91 | 92 | 93 | 94 | Wikipedia:How does one edit a page 95 | 85 96 | 97 | 98 | 13206183 99 | 2007-06-19T05:06:43Z 100 | 101 | Aotake 102 | 34929 103 | 104 | 105 | redirect target 106 | #REDIRECT [[Help:ページの編集]] 107 | 108 | 109 | 110 | ワールド・ミュージック 111 | 113 112 | 113 | 114 | 24277249 115 | 2009-02-07T16:20:20Z 116 | 117 | Point136 118 | 211299 119 | 120 | 121 | Bot: リダイレクト構文の修正 122 | #REDIRECT [[ワールドミュージック]] 123 | 124 | 125 | 126 | ネマティック相 127 | 127 128 | 129 | 130 | 24277255 131 | 2009-02-07T16:20:40Z 132 | 133 | Point136 134 | 211299 135 | 136 | 137 | Bot: リダイレクト構文の修正 138 | #REDIRECT [[ネマティック液晶]] 139 | 140 | 141 | 142 | スメクティック相 143 | 128 144 | 145 | 146 | 2168972 
147 | 2004-01-07T09:45:14Z 148 | 149 | Yas 150 | 739 151 | 152 | 153 | #REDIRECT [[液晶]] 154 | #REDIRECT [[液晶]] 155 | 156 | 157 | 158 | ミュージシャン一覧 (個人) 159 | 143 160 | 161 | 162 | 35399365 163 | 2010-12-14T00:40:05Z 164 | 165 | Xqbot 166 | 273540 167 | 168 | 169 | ロボットによる: 二重リダイレクト修正 → [[音楽家の一覧]] 170 | #転送 [[音楽家の一覧]] 171 | 172 | 173 | 174 | 病名 175 | 176 176 | 177 | 178 | 17793766 179 | 2008-02-04T13:37:54Z 180 | 181 | U3002 182 | 66126 183 | 184 | 185 | 二重リダイレクト回避 186 | #REDIRECT [[病気の別名の一覧]] 187 | 188 | 189 | 190 | Wikipedia:Welcome, newcomers 191 | 216 192 | 193 | 194 | 10662242 195 | 2007-02-15T13:40:22Z 196 | 197 | Cave cattum 198 | 41235 199 | 200 | #REDIRECT [[Wikipedia:ウィキペディアへようこそ]] 201 | #REDIRECT [[Wikipedia:ウィキペディアへようこそ]] 202 | 203 | 204 | 205 | 黒人霊歌 206 | 260 207 | 208 | 209 | 22493441 210 | 2008-10-23T05:59:07Z 211 | 212 | Buzin Satuma Hayato 213 | 243768 214 | 215 | 216 | 黒人霊歌はスピリチュアル(音楽)だと思う 217 | #REDIRECT [[スピリチュアル#スピリチュアル(音楽)]] 218 | 219 | 220 | 221 | Wikipedia:漢字やスペルに注意 222 | 281 223 | 224 | 225 | 13451992 226 | 2007-07-02T14:15:39Z 227 | 228 | Cave cattum 229 | 41235 230 | 231 | [[WP:AES|←]][[Wikipedia:記事を執筆する]]へのリダイレクト 232 | #REDIRECT [[Wikipedia:記事を執筆する]] 233 | 234 | 235 | 236 | Wikipedia:他言語の使用は控えめに 237 | 283 238 | 239 | 240 | 15761853 241 | 2007-10-27T06:12:35Z 242 | 243 | Khhy 244 | 13490 245 | 246 | #他言語表記は控えめに 247 | #REDIRECT [[Wikipedia:素晴らしい記事を書くには#他言語表記は控えめに]] 248 | 249 | 250 | 251 | Wikipedia:日本語表記法 252 | 291 253 | 254 | 255 | 12840559 256 | 2007-05-30T14:59:17Z 257 | 258 | Aotake 259 | 34929 260 | 261 | [[Wikipedia:表記ガイド]]へ統合。 262 | #REDIRECT [[Wikipedia:表記ガイド]] 263 | 264 | 265 | 266 | Wikipedia:リダイレクトの使い方 267 | 308 268 | 269 | 270 | 2169127 271 | 2003-11-30T04:09:10Z 272 | 273 | 219.164.91.166 274 | 275 | #REDIRECT [[Wikipedia:リダイレクト]] 276 | #REDIRECT [[Wikipedia:リダイレクト]] 277 | 278 | 279 | 280 | アルコール飲料 281 | 318 282 | 283 | 284 | 15885863 285 | 2007-11-02T12:42:36Z 286 | 287 | Balmung0731 288 | 99201 289 | 290 | [[酒]]へ統合 291 | #REDIRECT [[酒]] 292 | 293 | 294 | 295 | 地学 296 | 321 297 | 298 | 299 | 2169138 300 | 2003-09-26T11:33:29Z 301 | 302 | 133.11.230.18 303 | 304 | #REDIRECT [[地球科学]] 305 | 306 | 307 | 308 | Wikipedia:ノートページのレイアウト 309 | 323 310 | 311 | 312 | 36028694 313 | 2011-01-24T11:38:30Z 314 | 315 | Kurz 316 | 1601 317 | 318 | 319 | lk 320 | #REDIRECT [[Wikipedia:ノートページのガイドライン]] 321 | 322 | 323 | 324 | Wikipedia:ページを孤立させない 325 | 332 326 | 327 | 328 | 14492895 329 | 2007-08-27T16:20:04Z 330 | 331 | Cave cattum 332 | 41235 333 | 334 | #REDIRECT [[Wikipedia:記事どうしをつなぐ]] 335 | #REDIRECT [[Wikipedia:記事どうしをつなぐ]] 336 | 337 | 338 | 339 | 明石沢貴士 340 | 335 341 | 342 | 343 | 34308655 344 | 2010-10-03T18:47:08Z 345 | 346 | EmausBot 347 | 397108 348 | 349 | 350 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 あ行#.E6.98.8E.E7.9F.B3.E6.B2.A2.E8.B2.B4.E5.A3.AB]] 351 | #転送 [[プロジェクト:漫画家/日本の漫画家 あ行#.E6.98.8E.E7.9F.B3.E6.B2.A2.E8.B2.B4.E5.A3.AB]] 352 | 353 | 354 | 355 | ここまひ 356 | 341 357 | 358 | 359 | 34308404 360 | 2010-10-03T18:16:42Z 361 | 362 | EmausBot 363 | 397108 364 | 365 | 366 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 か行#.E3.81.93.E3.81.93.E3.81.BE.E3.81.B2]] 367 | #転送 [[プロジェクト:漫画家/日本の漫画家 か行#.E3.81.93.E3.81.93.E3.81.BE.E3.81.B2]] 368 | 369 | 370 | 371 | 吉冨昭仁 372 | 354 373 | 374 | 375 | 7856042 376 | 2006-09-23T16:20:44Z 377 | 378 | Mambo95 379 | 77516 380 | 381 | [[吉富昭仁]]へのリダイレクト 382 | #REDIRECT [[吉富昭仁]] 383 | 384 | 385 | 386 | 現在のイベント 387 | 356 388 | 389 | 390 | 12796821 391 | 2007-05-28T09:42:57Z 392 | 393 | Khhy 394 | 13490 395 | 396 | #REDIRECT [[Portal:最近の出来事]] 
397 | #REDIRECT [[Portal:最近の出来事]] 398 | 399 | 400 | 401 | Wikipedia:項目名のつけ方 402 | 451 403 | 404 | 405 | 2169218 406 | 2003-02-03T20:03:50Z 407 | 408 | Tomos 409 | 10 410 | 411 | #REDIRECT[[Wikipedia:記事名の付け方]] 412 | 413 | 414 | 415 | 必要とされている記事 416 | 456 417 | 418 | 419 | 2169223 420 | 2004-04-16T16:59:03Z 421 | 422 | Listener 423 | 6379 424 | 425 | Wikipedia:執筆依頼, double redirect 426 | #REDIRECT [[Wikipedia:執筆依頼]] 427 | 428 | 429 | 430 | 東京を舞台にした漫画作品 431 | 465 432 | 433 | 434 | 39222184 435 | 2011-09-16T06:55:20Z 436 | 437 | リオネル 438 | 98816 439 | 440 | [[東京を舞台にした漫画・アニメ作品]]へ統合 441 | #REDIRECT [[東京を舞台にした漫画・アニメ作品]] 442 | 443 | 444 | 445 | 必要とされている画像 446 | 467 447 | 448 | 449 | 2169230 450 | 2004-03-20T05:01:58Z 451 | 452 | Michey.M-test 453 | 3537 454 | 455 | 456 | Wikipedia:画像提供依頼 457 | #REDIRECT [[Wikipedia:画像提供依頼]] 458 | 459 | 460 | 461 | 水縞とおる 462 | 471 463 | 464 | 465 | 34308685 466 | 2010-10-03T18:50:25Z 467 | 468 | EmausBot 469 | 397108 470 | 471 | 472 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 473 | #転送 [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 474 | 475 | 476 | 477 | 恋人は守護霊!? 478 | 472 479 | 480 | 481 | 34308641 482 | 2010-10-03T18:45:13Z 483 | 484 | EmausBot 485 | 397108 486 | 487 | 488 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 489 | #転送 [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 490 | 491 | [[Category:漫画作品 こ|いひとはしゆこれい]] 492 | [[Category:月刊コミックNORA|こいひとはしゆこれい]] 493 | 494 | 495 | 496 | ユーゴスラビア改名 497 | 504 498 | 499 | 500 | 24277258 501 | 2009-02-07T16:21:01Z 502 | 503 | Point136 504 | 211299 505 | 506 | 507 | Bot: リダイレクト構文の修正 508 | #REDIRECT [[ユーゴスラビア]] 509 | 510 | 511 | 512 | あだちつよし 513 | 519 514 | 515 | 516 | 34308384 517 | 2010-10-03T18:14:37Z 518 | 519 | EmausBot 520 | 397108 521 | 522 | 523 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 あ行#.E3.81.82.E3.81.A0.E3.81.A1.E3.81.A4.E3.82.88.E3.81.97]] 524 | #転送 [[プロジェクト:漫画家/日本の漫画家 あ行#.E3.81.82.E3.81.A0.E3.81.A1.E3.81.A4.E3.82.88.E3.81.97]] 525 | 526 | 527 | 528 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/test-data/sample-res_cat_jawiki.txt: -------------------------------------------------------------------------------- 1 | ACMフェロー アルフレッド・エイホ 0.288392 2 | ACMフェロー アンドリュー・タネンバウム 0.084127 3 | ACMフェロー エドムンド・クラーク 0.220679 4 | ACMフェロー グラディ・ブーチ -0.175180 5 | ACMフェロー ジャック・ドンガラ 0.427047 6 | ACMフェロー スティーブン・ボーン 0.220679 7 | ACMフェロー ダグラス・カマー 0.907805 8 | ACMフェロー ダン・ブリックリン 0.220679 9 | ACMフェロー ビャーネ・ストロヴストルップ 0.220679 10 | ACMフェロー ビル・グロップ 0.220679 11 | ACMフェロー ピーター・ノーヴィグ 0.233410 12 | ACMフェロー ボブ・フランクストン 0.241471 13 | ACMフェロー リチャード・ハミング 0.899804 14 | ACMフェロー 米澤明憲 0.143425 -------------------------------------------------------------------------------- /lib/wikiextractor-master-280915.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hasibi/TAGME-Reproducibility/d21ed0d826fc60a6e4caaa5ec7b6c39e16f7c6c6/lib/wikiextractor-master-280915.zip -------------------------------------------------------------------------------- /nordlys/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | 3 | import sys 4 | 5 | # set default encoding to utf-8 6 | reload(sys) 7 | sys.setdefaultencoding("utf-8") 8 | -------------------------------------------------------------------------------- 
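A note on the Python code that follows: it targets Python 2 (print statements, dict.iteritems(), reload(sys)), and the package __init__ above forces UTF-8 as the process-wide default string encoding. The sketch below is illustrative only, not part of the repository, and shows the kind of implicit-decode error that hack avoids under Python 2:

    # -*- coding: utf-8 -*-
    # Illustrative only (Python 2); the variable names here are made up for the example.
    import sys

    title = "ホームページ"               # a UTF-8 byte string, e.g. a redirect target read from disk
    try:
        label = title + u" (redirect)"   # implicit decode of `title` with the ASCII default fails
    except UnicodeDecodeError:
        reload(sys)
        sys.setdefaultencoding("utf-8")  # same hack as in nordlys/__init__.py
        label = title + u" (redirect)"   # the implicit decode now uses UTF-8 and succeeds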
/nordlys/config.py: -------------------------------------------------------------------------------- 1 | """ 2 | Global nordlys config. 3 | 4 | @author: Krisztian Balog (krisztian.balog@uis.no) 5 | """ 6 | 7 | from os import path 8 | 9 | NORDLYS_DIR = path.dirname(path.abspath(__file__)) 10 | LIB_DIR = path.dirname(path.dirname(path.abspath(__file__))) + "/lib" 11 | DATA_DIR = path.dirname(path.dirname(path.abspath(__file__))) + "/data" 12 | OUTPUT_DIR = path.dirname(path.dirname(path.abspath(__file__))) + "/output" 13 | 14 | MONGO_DB = "nordlys" 15 | MONGO_HOST = "localhost" -------------------------------------------------------------------------------- /nordlys/storage/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hasibi/TAGME-Reproducibility/d21ed0d826fc60a6e4caaa5ec7b6c39e16f7c6c6/nordlys/storage/__init__.py -------------------------------------------------------------------------------- /nordlys/storage/mongo.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tools for working with MongoDB. 3 | 4 | @author: Krisztian Balog (krisztian.balog@uis.no) 5 | """ 6 | 7 | from pymongo import MongoClient 8 | 9 | 10 | class Mongo(object): 11 | """Manages the MongoDB connection and operations.""" 12 | ID_FIELD = "_id" 13 | 14 | def __init__(self, host, db, collection): 15 | self.client = MongoClient(host) 16 | self.db = self.client[db] 17 | self.collection = self.db[collection] 18 | self.db_name = db 19 | self.collection_name = collection 20 | print "Connected to " + self.db_name + "." + self.collection_name 21 | 22 | @staticmethod 23 | def escape(s): 24 | """Escapes string (to be used as key or fieldname). 25 | Replaces . and $ with their unicode eqivalents.""" 26 | return s.replace(".", "\u002e").replace("$", "\u0024") 27 | 28 | @staticmethod 29 | def unescape(s): 30 | """Unescapes string.""" 31 | return s.replace("\u002e", ".").replace("\u0024", "$") 32 | 33 | def find_by_id(self, doc_id): 34 | """Returns all document content for a given document id.""" 35 | return self.get_doc(self.collection.find_one({Mongo.ID_FIELD: self.escape(doc_id)})) 36 | 37 | def get_doc(self, mdoc): 38 | """Returns document contents with with keys and _id field unescaped.""" 39 | if mdoc is None: 40 | return None 41 | 42 | doc = {} 43 | for f in mdoc: 44 | if f == Mongo.ID_FIELD: 45 | doc[f] = self.unescape(mdoc[f]) 46 | else: 47 | doc[self.unescape(f)] = mdoc[f] 48 | 49 | return doc 50 | 51 | @staticmethod 52 | def print_doc(doc): 53 | print "_id: " + doc[Mongo.ID_FIELD] 54 | for key, value in doc.iteritems(): 55 | if key == Mongo.ID_FIELD: continue # ignore the id key 56 | if type(value) is list: 57 | print key + ":" 58 | for v in value: 59 | print "\t" + str(v) 60 | else: 61 | print key + ": " + str(value) -------------------------------------------------------------------------------- /nordlys/storage/surfaceforms.py: -------------------------------------------------------------------------------- 1 | """ 2 | Entity surface forms stored in MongoDB. 3 | 4 | The surface form is used as _id. The associated entities are stored in key-value format. 
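A stored document might look roughly like {"_id": "new york", "anchor": {"<wikipedia_uri>": 42, ...}}
(an illustrative, made-up entry: the per-source keys such as "anchor", "title" and "redirect" depend on
how the surface-form collection was built, and nordlys.tagme.mention reads them in this shape).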
5 | 6 | @author: Krisztian Balog (krisztian.balog@uis.no) 7 | """ 8 | 9 | from nordlys.config import MONGO_DB, MONGO_HOST 10 | from nordlys.storage.mongo import Mongo 11 | 12 | 13 | class SurfaceForms(object): 14 | 15 | def __init__(self, collection): 16 | self.collection = collection 17 | self.mongo = Mongo(MONGO_HOST, MONGO_DB, self.collection) 18 | 19 | def get(self, surface_form): 20 | """Returns all information associated with a surface form.""" 21 | 22 | # need to unescape the keys in the value part 23 | mdoc = self.mongo.find_by_id(surface_form) 24 | if mdoc is None: 25 | return None 26 | doc = {} 27 | for f in mdoc: 28 | if f != Mongo.ID_FIELD: 29 | doc[f] = {} 30 | for key, value in mdoc[f].iteritems(): 31 | doc[f][Mongo.unescape(key)] = value 32 | 33 | return doc -------------------------------------------------------------------------------- /nordlys/tagme/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hasibi/TAGME-Reproducibility/d21ed0d826fc60a6e4caaa5ec7b6c39e16f7c6c6/nordlys/tagme/__init__.py -------------------------------------------------------------------------------- /nordlys/tagme/config.py: -------------------------------------------------------------------------------- 1 | """ 2 | Configurations for tagme package. 3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | 7 | from nordlys.config import DATA_DIR 8 | from nordlys.storage.surfaceforms import SurfaceForms 9 | 10 | 11 | # Test collection files 12 | Y_ERD = DATA_DIR + "/Y-ERD.tsv" 13 | 14 | ERD_QUERY = DATA_DIR + "/Trec_beta.query.txt" 15 | ERD_ANNOTATION = DATA_DIR + "/Trec_beta.annotation.txt" 16 | 17 | WIKI_ANNOT30 = DATA_DIR + "/wiki-annot30" 18 | WIKI_ANNOT30_SNIPPET = DATA_DIR + "/wiki-annot30-snippet.txt" 19 | WIKI_ANNOT30_ANNOTATION = DATA_DIR + "/wiki-annot30-annotation.txt" 20 | 21 | WIKI_DISAMB30 = DATA_DIR + "/wiki-disamb30" 22 | WIKI_DISAMB30_SNIPPET = DATA_DIR + "/wiki-disamb30-snippet.txt" 23 | WIKI_DISAMB30_ANNOTATION = DATA_DIR + "/wiki-disamb30-annotation.txt" 24 | 25 | # Surface form dictionaries 26 | COLLECTION_SURFACEFORMS_WIKI = "surfaceforms_wiki_20100408" 27 | SF_WIKI = SurfaceForms(collection=COLLECTION_SURFACEFORMS_WIKI) 28 | 29 | 30 | INDEX_PATH = "/xxx/20100408-index" 31 | INDEX_ANNOT_PATH = "/xxx/20100408-index-annot/" -------------------------------------------------------------------------------- /nordlys/tagme/dexter_api.py: -------------------------------------------------------------------------------- 1 | """ 2 | Methods to annotate queries with TagMe API. 
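(More precisely, this module posts to the Dexter REST API at dexterdemo.isti.cnr.it with
"tagme" selected as the disambiguator; the TAGME API itself is wrapped in nordlys.tagme.tagme_api.)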
3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | 7 | import argparse 8 | 9 | import requests 10 | from nordlys.config import OUTPUT_DIR 11 | 12 | from nordlys.tagme.test_coll import read_tagme_queries, read_yerd_queries, read_erd_queries 13 | from nordlys.wikipedia.utils import WikipediaUtils 14 | from nordlys.tagme import config 15 | 16 | 17 | class DexterAPI(object): 18 | ANNOT_DEXTER_URI = "http://dexterdemo.isti.cnr.it:8080/dexter-webapp/api/rest/annotate?min-conf=0" 19 | DESC_DEXTER_URI = "http://dexterdemo.isti.cnr.it:8080/dexter-webapp/api/rest/get-desc" 20 | 21 | def __init__(self): 22 | self.id_title_dict = {} 23 | 24 | def ask_dexter_query(self, query): 25 | """Sends queries to Dexter Api.""" 26 | data = {'dsb': "tagme", 'n': "50", 'debug': "false", 'format': "text", 'text': query} 27 | res = requests.post(self.ANNOT_DEXTER_URI, data).json() 28 | res['query'] = query 29 | return res 30 | 31 | def ask_title(self, page_id): 32 | """Sends page id to the API and get the page title.""" 33 | if page_id not in self.id_title_dict: 34 | req = "?id=" + str(page_id) + "&title-only=true" 35 | res = requests.get(self.DESC_DEXTER_URI + req).json() 36 | title = res.get('title', "") 37 | wiki_uri = WikipediaUtils.wiki_title_to_uri(title.encode("utf-8")) 38 | self.id_title_dict[page_id] = wiki_uri 39 | return self.id_title_dict[page_id] 40 | 41 | def aks_dexter_queries(self, queries, out_file): 42 | """ 43 | Sends queries to Dexter Api and writes them in a json file. 44 | 45 | :param queries: dictionary {qid: query, ...} 46 | :param out_file: The file to write json output 47 | """ 48 | print "Getting resutls from Tagme ..." 49 | # responses = {} 50 | out_str = "" 51 | open(out_file, "w").close() 52 | out = open(out_file, "a") 53 | i = 0 54 | for qid in sorted(queries, key=lambda item: int(item) if item.isdigit() else item): 55 | query = queries[qid] 56 | print "[" + qid + "]", query 57 | tagme_res = self.ask_dexter_query(query) 58 | out_str += self.__to_str(qid, tagme_res) 59 | out.write(out_str) 60 | out_str = "" 61 | i += 1 62 | if i % 100 == 0: 63 | # out.write(out_str) 64 | print i, "th query processed ...." 
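                # the page-id -> title cache is printed and reset below every 100 queries,
                # presumably just to keep its memory footprint bounded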
65 | print "items ins the page-id cache:", len(self.id_title_dict) 66 | self.id_title_dict = {} 67 | # out_str = "" 68 | out.write(out_str) 69 | # json.dump(responses, open(out_file, "w"), indent=4, sort_keys=True) 70 | print "Dexter results: " + out_file 71 | # return responses 72 | 73 | def __to_str(self, qid, response): 74 | """ 75 | Output format: 76 | qid, score, wiki-uri, mention, page-id, start, end, linkProbability, linkFrequency, documentFrequency, 77 | entityFrequency, commonness 78 | 79 | :param qid: 80 | :param response: 81 | :return: 82 | """ 83 | none_str = "*NONE*" 84 | out_str = "" 85 | for annot in response['spots']: 86 | wiki_uri = self.ask_title(annot.get('entity', none_str)) 87 | if wiki_uri is None: 88 | continue 89 | qid_str = str(qid) + "\t" + str(annot.get('score', none_str)) + "\t" + wiki_uri + "\t" + \ 90 | annot.get('mention', none_str) + "\t" + str(annot.get('entity', none_str)) + "\t" + \ 91 | str(annot.get('start', none_str)) + "\t" + str(annot.get('end', none_str)) + "\t" + \ 92 | str(annot.get('linkProbability', none_str)) + "\t" + str(annot.get('linkFrequency', none_str)) + "\t" +\ 93 | str(annot.get('documentFrequency', none_str)) + "\t" + str(annot.get('entityFrequency', none_str)) + "\t" +\ 94 | str(annot.get('commonness', none_str)) + "\n" 95 | out_str += qid_str 96 | return out_str 97 | 98 | 99 | def main(): 100 | parser = argparse.ArgumentParser() 101 | parser.add_argument("-th", "--threshold", help="rho score threshold", type=float, default=0) 102 | parser.add_argument("-qid", help="annotates queries from this qid", type=str) 103 | parser.add_argument("-data", help="Data set name", choices=['y-erd', 'erd-dev', 'wiki-annot30', 'wiki-disamb30']) 104 | args = parser.parse_args() 105 | 106 | if args.data == "erd-dev": 107 | queries = read_erd_queries() 108 | elif args.data == "y-erd": 109 | queries = read_yerd_queries() 110 | elif args.data == "wiki-annot30": 111 | queries = read_tagme_queries(config.WIKI_ANNOT30_SNIPPET) 112 | elif args.data == "wiki-disamb30": 113 | queries = read_tagme_queries(config.WIKI_DISAMB30_SNIPPET) 114 | 115 | # asks tagMe and creates output file 116 | qid_str = "_" + args.qid if args.qid else "" 117 | out_file = OUTPUT_DIR + "/" + args.data + "_dexter" + qid_str + ".txt" 118 | tagme = DexterAPI() 119 | tagme.aks_dexter_queries(queries, out_file) 120 | 121 | 122 | if __name__ == '__main__': 123 | main() -------------------------------------------------------------------------------- /nordlys/tagme/lucene_tools.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tools for Lucene. 3 | All Lucene features should be accessed in nordlys through this class. 4 | 5 | - Lucene class for ensuring that the same version, analyzer, etc. 6 | are used across nordlys modules. Handles IndexReader, IndexWriter, etc. 
7 | - Command line tools for checking indexed document content 8 | 9 | @author: Krisztian Balog (krisztian.balog@uis.no) 10 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 11 | """ 12 | 13 | import argparse 14 | import lucene 15 | from java.io import File 16 | from org.apache.lucene.analysis.standard import StandardAnalyzer 17 | from org.apache.lucene.document import Document 18 | from org.apache.lucene.document import Field 19 | from org.apache.lucene.document import FieldType 20 | from org.apache.lucene.index import IndexWriter 21 | from org.apache.lucene.index import IndexWriterConfig 22 | from org.apache.lucene.index import DirectoryReader 23 | from org.apache.lucene.index import Term 24 | from org.apache.lucene.search import IndexSearcher 25 | from org.apache.lucene.search import BooleanClause 26 | from org.apache.lucene.search import TermQuery 27 | from org.apache.lucene.search import BooleanQuery 28 | from org.apache.lucene.search import PhraseQuery 29 | from org.apache.lucene.store import SimpleFSDirectory 30 | from org.apache.lucene.store import RAMDirectory 31 | from org.apache.lucene.util import Version 32 | from org.apache.lucene.store import IOContext 33 | 34 | # has java VM for Lucene been initialized 35 | lucene_vm_init = False 36 | 37 | 38 | class Lucene(object): 39 | 40 | # default fieldnames for id and contents 41 | FIELDNAME_ID = "id" 42 | FIELDNAME_CONTENTS = "contents" 43 | 44 | # internal fieldtypes 45 | # used as Enum, the actual values don't matter 46 | FIELDTYPE_ID = "id" 47 | FIELDTYPE_ID_TV = "id_tv" 48 | FIELDTYPE_TEXT = "text" 49 | FIELDTYPE_TEXT_TV = "text_tv" 50 | FIELDTYPE_TEXT_TVP = "text_tvp" 51 | 52 | def __init__(self, index_dir, use_ram=False, jvm_ram=None): 53 | global lucene_vm_init 54 | if not lucene_vm_init: 55 | if jvm_ram: 56 | # e.g. jvm_ram = "8g" 57 | print "Increased JVM ram" 58 | lucene.initVM(vmargs=['-Djava.awt.headless=true'], maxheap=jvm_ram) 59 | else: 60 | lucene.initVM(vmargs=['-Djava.awt.headless=true']) 61 | lucene_vm_init = True 62 | self.dir = SimpleFSDirectory(File(index_dir)) 63 | 64 | self.use_ram = use_ram 65 | if use_ram: 66 | print "Using ram directory..." 67 | self.ram_dir = RAMDirectory(self.dir, IOContext.DEFAULT) 68 | self.analyzer = None 69 | self.reader = None 70 | self.searcher = None 71 | self.writer = None 72 | self.ldf = None 73 | print "Connected to index " + index_dir 74 | 75 | def get_version(self): 76 | """Get Lucene version.""" 77 | return Version.LUCENE_48 78 | 79 | def get_analyzer(self): 80 | """Get analyzer.""" 81 | if self.analyzer is None: 82 | self.analyzer = StandardAnalyzer(self.get_version()) 83 | return self.analyzer 84 | 85 | def open_reader(self): 86 | """Open IndexReader.""" 87 | if self.reader is None: 88 | if self.use_ram: 89 | print "reading from ram directory ..." 90 | self.reader = DirectoryReader.open(self.ram_dir) 91 | else: 92 | self.reader = DirectoryReader.open(self.dir) 93 | 94 | def get_reader(self): 95 | return self.reader 96 | 97 | def close_reader(self): 98 | """Close IndexReader.""" 99 | if self.reader is not None: 100 | self.reader.close() 101 | self.reader = None 102 | else: 103 | raise Exception("There is no open IndexReader to close") 104 | 105 | def open_searcher(self): 106 | """ 107 | Open IndexSearcher. Automatically opens an IndexReader too, 108 | if it is not already open. There is no close method for the 109 | searcher. 
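        (Closing the underlying reader via close_reader() also invalidates this searcher.)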
110 | """ 111 | if self.searcher is None: 112 | self.open_reader() 113 | self.searcher = IndexSearcher(self.reader) 114 | 115 | def get_searcher(self): 116 | """Returns index searcher (opens it if needed).""" 117 | self.open_searcher() 118 | return self.searcher 119 | 120 | def open_writer(self): 121 | """Open IndexWriter.""" 122 | if self.writer is None: 123 | config = IndexWriterConfig(self.get_version(), self.get_analyzer()) 124 | config.setOpenMode(IndexWriterConfig.OpenMode.CREATE) 125 | self.writer = IndexWriter(self.dir, config) 126 | else: 127 | raise Exception("IndexWriter is already open") 128 | 129 | def close_writer(self): 130 | """Close IndexWriter.""" 131 | if self.writer is not None: 132 | self.writer.close() 133 | self.writer = None 134 | else: 135 | raise Exception("There is no open IndexWriter to close") 136 | 137 | def add_document(self, contents): 138 | """ 139 | Adds a Lucene document with the specified contents to the index. 140 | See LuceneDocument.create_document() for the explanation of contents. 141 | """ 142 | if self.ldf is None: # create a single LuceneDocument object that will be reused 143 | self.ldf = LuceneDocument() 144 | self.writer.addDocument(self.ldf.create_document(contents)) 145 | 146 | def get_lucene_document_id(self, doc_id): 147 | """Loads a document from a Lucene index based on its id.""" 148 | self.open_searcher() 149 | query = TermQuery(Term(self.FIELDNAME_ID, doc_id)) 150 | tophit = self.searcher.search(query, 1).scoreDocs 151 | if len(tophit) == 1: 152 | return tophit[0].doc 153 | else: 154 | return None 155 | 156 | def get_document_id(self, lucene_doc_id): 157 | """Gets lucene document id and returns the document id.""" 158 | self.open_reader() 159 | return self.reader.document(lucene_doc_id).get(self.FIELDNAME_ID) 160 | 161 | def get_id_lookup_query(self, id, field=None): 162 | """Creates Lucene query for searching by (external) document id """ 163 | if field is None: 164 | field = self.FIELDNAME_ID 165 | return TermQuery(Term(field, id)) 166 | 167 | def get_and_query(self, queries): 168 | """Creates an AND Boolean query from multiple Lucene queries """ 169 | # empty boolean query with Similarity.coord() disabled 170 | bq = BooleanQuery(False) 171 | for q in queries: 172 | bq.add(q, BooleanClause.Occur.MUST) 173 | return bq 174 | 175 | def get_or_query(self, queries): 176 | """Creates an OR Boolean query from multiple Lucene queries """ 177 | # empty boolean query with Similarity.coord() disabled 178 | bq = BooleanQuery(False) 179 | for q in queries: 180 | bq.add(q, BooleanClause.Occur.SHOULD) 181 | return bq 182 | 183 | def get_phrase_query(self, query, field): 184 | """Creates phrase query for searching exact phrase.""" 185 | phq = PhraseQuery() 186 | for t in query.split(): 187 | phq.add(Term(field, t)) 188 | return phq 189 | 190 | def num_docs(self): 191 | """Returns number of documents in the index.""" 192 | self.open_reader() 193 | return self.reader.numDocs() 194 | 195 | 196 | class LuceneDocument(object): 197 | """Internal representation of a Lucene document""" 198 | 199 | def __init__(self): 200 | self.ldf = LuceneDocumentField() 201 | 202 | def create_document(self, contents): 203 | """Create a Lucene document from the specified contents. 
204 | Contents is a list of fields to be indexed, represented as a dictionary 205 | with keys 'field_name', 'field_type', and 'field_value'.""" 206 | doc = Document() 207 | for f in contents: 208 | doc.add(Field(f['field_name'], f['field_value'], 209 | self.ldf.get_field(f['field_type']))) 210 | return doc 211 | 212 | 213 | class LuceneDocumentField(object): 214 | """Internal handler class for possible field types""" 215 | 216 | def __init__(self): 217 | """Init possible field types""" 218 | 219 | # FIELD_ID: stored, indexed, non-tokenized 220 | self.field_id = FieldType() 221 | self.field_id.setIndexed(True) 222 | self.field_id.setStored(True) 223 | self.field_id.setTokenized(False) 224 | 225 | # FIELD_ID_TV: stored, indexed, not tokenized, with term vectors (without positions) 226 | # for storing IDs with term vector info 227 | self.field_id_tv = FieldType() 228 | self.field_id_tv.setIndexed(True) 229 | self.field_id_tv.setStored(True) 230 | self.field_id_tv.setTokenized(False) 231 | self.field_id_tv.setStoreTermVectors(True) 232 | 233 | # FIELD_TEXT: stored, indexed, tokenized, with positions 234 | self.field_text = FieldType() 235 | self.field_text.setIndexed(True) 236 | self.field_text.setStored(True) 237 | self.field_text.setTokenized(True) 238 | 239 | # FIELD_TEXT_TV: stored, indexed, tokenized, with term vectors (without positions) 240 | self.field_text_tv = FieldType() 241 | self.field_text_tv.setIndexed(True) 242 | self.field_text_tv.setStored(True) 243 | self.field_text_tv.setTokenized(True) 244 | self.field_text_tv.setStoreTermVectors(True) 245 | 246 | # FIELD_TEXT_TVP: stored, indexed, tokenized, with term vectors and positions 247 | # (but no character offsets) 248 | self.field_text_tvp = FieldType() 249 | self.field_text_tvp.setIndexed(True) 250 | self.field_text_tvp.setStored(True) 251 | self.field_text_tvp.setTokenized(True) 252 | self.field_text_tvp.setStoreTermVectors(True) 253 | self.field_text_tvp.setStoreTermVectorPositions(True) 254 | 255 | def get_field(self, type): 256 | """Get Lucene FieldType object for the corresponding internal FIELDTYPE_ value""" 257 | if type == Lucene.FIELDTYPE_ID: 258 | return self.field_id 259 | elif type == Lucene.FIELDTYPE_ID_TV: 260 | return self.field_id_tv 261 | elif type == Lucene.FIELDTYPE_TEXT: 262 | return self.field_text 263 | elif type == Lucene.FIELDTYPE_TEXT_TV: 264 | return self.field_text_tv 265 | elif type == Lucene.FIELDTYPE_TEXT_TVP: 266 | return self.field_text_tvp 267 | else: 268 | raise Exception("Unknown field type") 269 | 270 | 271 | def main(): 272 | parser = argparse.ArgumentParser() 273 | parser.add_argument("-i", "--index", help="index directory", type=str) 274 | args = parser.parse_args() 275 | 276 | index_dir = args.index 277 | print "Index: " + index_dir + "\n" 278 | 279 | l = Lucene(index_dir, jvm_ram="8g") 280 | pq = l.get_phrase_query("originally used", "contents") 281 | 282 | l.open_searcher() 283 | tophit = l.searcher.search(pq, 1).scoreDocs 284 | print tophit[0] 285 | 286 | if __name__ == '__main__': 287 | main() -------------------------------------------------------------------------------- /nordlys/tagme/mention.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tools for processing mentions: 3 | - Finds candidate entities 4 | - Calculates commonness 5 | 6 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 7 | """ 8 | 9 | from nordlys.tagme.config import SF_WIKI 10 | 11 | 12 | class Mention(object): 13 | 14 | def __init__(self, text): 15 | self.text = 
text.lower() 16 | self.__matched_ens = None # all entities matching a mention (from all sources) 17 | self.__wiki_occurrences = None 18 | 19 | @property 20 | def matched_ens(self): 21 | return self.__gen_matched_ens() 22 | 23 | @property 24 | def wiki_occurrences(self): 25 | return self.__calc_wiki_occurrences() 26 | 27 | def __gen_matched_ens(self): 28 | """Gets all entities matching the n-gram""" 29 | if self.__matched_ens is None: 30 | matches = SF_WIKI.get(self.text) 31 | matched_ens = matches if matches is not None else {} 32 | self.__matched_ens = matched_ens 33 | return self.__matched_ens 34 | 35 | def __calc_wiki_occurrences(self): 36 | """Calculates the denominator for commonness (for Wiki annotations).""" 37 | if self.__wiki_occurrences is None: 38 | self.__wiki_occurrences = 0 39 | for en, occ in self.matched_ens.get('anchor', {}).iteritems(): 40 | self.__wiki_occurrences += occ 41 | return self.__wiki_occurrences 42 | 43 | def get_men_candidate_ens(self, commonness_th): 44 | """ 45 | Gets candidate entities for the given n-gram. 46 | 47 | :param commonness_th: commonness threshold 48 | :return: dictionary {Wiki_uri: commonness, ..} 49 | """ 50 | candidate_entities = {} 51 | wiki_matches = self.get_wiki_matches(commonness_th) 52 | candidate_entities.update(wiki_matches) 53 | return candidate_entities 54 | 55 | def get_wiki_matches(self, commonness_th): 56 | """ 57 | Gets entity matches from Wikipedia anchors (with dbpedia uris). 58 | 59 | :param commonness_th: float, Commonness threshold 60 | :return: Dictionary {Wiki_uri: commonness, ...} 61 | 62 | """ 63 | if commonness_th is None: 64 | commonness_th = 0 65 | 66 | wiki_matches = {} 67 | # calculates commonness for each entity and filter the ones below the commonness threshold. 68 | for wiki_uri in self.matched_ens.get("anchor", {}): 69 | cmn = self.calc_commonness(wiki_uri) 70 | if cmn >= commonness_th: 71 | wiki_matches[wiki_uri] = cmn 72 | 73 | sources = ["title", "title-nv", "redirect"] 74 | for source in sources: 75 | for wiki_uri in self.matched_ens.get(source, {}): 76 | if wiki_uri not in wiki_matches: 77 | cmn = self.calc_commonness(wiki_uri) 78 | wiki_matches[wiki_uri] = cmn 79 | return wiki_matches 80 | 81 | def calc_commonness(self, en_uri): 82 | """ 83 | Calculates commonness for the given entity: 84 | (times mention is linked) / (times mention linked to entity) 85 | - Returns zero if the entity is not linked by the mention. 86 | 87 | :param en_uri: Wikipedia uri 88 | :return Commonness 89 | """ 90 | if not en_uri.startswith(" 6): 57 | continue 58 | link_prob = self.__get_link_prob(mention) 59 | if link_prob < self.link_prob_th: 60 | continue 61 | # These mentions will be kept 62 | self.link_probs[ngram] = link_prob 63 | # Filters entities by cmn threshold 0.001; this was only in TAGME source code and speeds up the process. 
64 | # TAGME source code: it.acubelab.tagme.anchor (lines 279-284) 65 | ens[ngram] = mention.get_men_candidate_ens(0.001) 66 | 67 | # filters containment mentions (based on paper) 68 | candidate_entities = {} 69 | sorted_mentions = sorted(ens.keys(), key=lambda item: len(item.split())) # sorts by mention length 70 | for i in range(0, len(sorted_mentions)): 71 | m_i = sorted_mentions[i] 72 | ignore_m_i = False 73 | for j in range(i+1, len(sorted_mentions)): 74 | m_j = sorted_mentions[j] 75 | if (m_i in m_j) and (self.link_probs[m_i] < self.link_probs[m_j]): 76 | ignore_m_i = True 77 | break 78 | if not ignore_m_i: 79 | candidate_entities[m_i] = ens[m_i] 80 | return candidate_entities 81 | 82 | def disambiguate(self, candidate_entities): 83 | """ 84 | Performs disambiguation and link each mention to a single entity. 85 | 86 | :param candidate_entities: {men:{en:cmn, ...}, ...} 87 | :return: disambiguated entities {men:en, ...} 88 | """ 89 | # Gets the relevance score 90 | rel_scores = {} 91 | for m_i in candidate_entities.keys(): 92 | if self.DEBUG: 93 | print "********************", m_i, "********************" 94 | rel_scores[m_i] = {} 95 | for e_m_i in candidate_entities[m_i].keys(): 96 | if self.DEBUG: 97 | print "-- ", e_m_i 98 | rel_scores[m_i][e_m_i] = 0 99 | for m_j in candidate_entities.keys(): # all other mentions 100 | if (m_i == m_j) or (len(candidate_entities[m_j].keys()) == 0): 101 | continue 102 | vote_e_m_j = self.__get_vote(e_m_i, candidate_entities[m_j]) 103 | rel_scores[m_i][e_m_i] += vote_e_m_j 104 | if self.DEBUG: 105 | print m_j, vote_e_m_j 106 | 107 | # pruning uncommon entities (based on the paper) 108 | self.rel_scores = {} 109 | for m_i in rel_scores: 110 | for e_m_i in rel_scores[m_i]: 111 | cmn = candidate_entities[m_i][e_m_i] 112 | if cmn >= self.cmn_th: 113 | if m_i not in self.rel_scores: 114 | self.rel_scores[m_i] = {} 115 | self.rel_scores[m_i][e_m_i] = rel_scores[m_i][e_m_i] 116 | 117 | # DT pruning 118 | disamb_ens = {} 119 | for m_i in self.rel_scores: 120 | if len(self.rel_scores[m_i].keys()) == 0: 121 | continue 122 | top_k_ens = self.__get_top_k(m_i) 123 | best_cmn = 0 124 | best_en = None 125 | for en in top_k_ens: 126 | cmn = candidate_entities[m_i][en] 127 | if cmn >= best_cmn: 128 | best_en = en 129 | best_cmn = cmn 130 | disamb_ens[m_i] = best_en 131 | 132 | return disamb_ens 133 | 134 | def prune(self, dismab_ens): 135 | """ 136 | Performs AVG pruning. 137 | 138 | :param dismab_ens: {men: en, ... } 139 | :return: {men: (en, score), ...} 140 | """ 141 | linked_ens = {} 142 | for men, en in dismab_ens.iteritems(): 143 | coh_score = self.__get_coherence_score(men, en, dismab_ens) 144 | rho_score = (self.link_probs[men] + coh_score) / 2.0 145 | if rho_score >= self.rho_th: 146 | linked_ens[men] = (en, rho_score) 147 | return linked_ens 148 | 149 | def __get_link_prob(self, mention): 150 | """ 151 | Gets link probability for the given mention. 152 | Here, in fact, we are computing key-phraseness. 
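        Roughly, paraphrasing the code below for the "wiki" surface-form source:
        link_prob(m) = wiki_occurrences(m) / phrase_freq(m), where phrase_freq(m) is the
        number of hits for m as a phrase query on the entity index and wiki_occurrences(m)
        is how often m occurs as anchor text in Wikipedia.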
153 | """ 154 | 155 | pq = ENTITY_INDEX.get_phrase_query(mention.text, Lucene.FIELDNAME_CONTENTS) 156 | mention_freq = ENTITY_INDEX.searcher.search(pq, 1).totalHits 157 | if mention_freq == 0: 158 | return 0 159 | if self.sf_source == "wiki": 160 | link_prob = mention.wiki_occurrences / float(mention_freq) 161 | # This is TAGME implementation, from source code: 162 | # link_prob = float(mention.wiki_occurrences) / max(mention_freq, mention.wiki_occurrences) 163 | elif self.sf_source == "facc": 164 | link_prob = mention.facc_occurrences / float(mention_freq) 165 | return link_prob 166 | 167 | def __get_vote(self, entity, men_cand_ens): 168 | """ 169 | vote_e = sum_e_i(mw_rel(e, e_i) * cmn(e_i)) / i 170 | 171 | :param entity: en 172 | :param men_cand_ens: {en: cmn, ...} 173 | :return: voting score 174 | """ 175 | entity = entity if self.sf_source == "wiki" else entity[0] 176 | vote = 0 177 | for e_i, cmn in men_cand_ens.iteritems(): 178 | e_i = e_i if self.sf_source == "wiki" else e_i[0] 179 | mw_rel = self.__get_mw_rel(entity, e_i) 180 | # print "\t", e_i, "cmn:", cmn, "mw_rel:", mw_rel 181 | vote += cmn * mw_rel 182 | vote /= float(len(men_cand_ens)) 183 | return vote 184 | 185 | def __get_mw_rel(self, e1, e2): 186 | """ 187 | Calculates Milne & Witten relatedness for two entities. 188 | This implementation is based on Dexter implementation (which is similar to TAGME implementation). 189 | - Dexter implementation: https://github.com/dexter/dexter/blob/master/dexter-core/src/main/java/it/cnr/isti/hpc/dexter/relatedness/MilneRelatedness.java 190 | - TAGME: it.acubelab.tagme.preprocessing.graphs.OnTheFlyArrayMeasure 191 | """ 192 | if e1 == e2: # to speed-up 193 | return 1.0 194 | en_uris = tuple(sorted({e1, e2})) 195 | ens_in_links = [self.__get_in_links([en_uri]) for en_uri in en_uris] 196 | if min(ens_in_links) == 0: 197 | return 0 198 | conj = self.__get_in_links(en_uris) 199 | if conj == 0: 200 | return 0 201 | numerator = math.log(max(ens_in_links)) - math.log(conj) 202 | denominator = math.log(ANNOT_INDEX.num_docs()) - math.log(min(ens_in_links)) 203 | rel = 1 - (numerator / denominator) 204 | if rel < 0: 205 | return 0 206 | return rel 207 | 208 | def __get_in_links(self, en_uris): 209 | """ 210 | returns "and" occurrences of entities in the corpus. 
211 | 212 | :param en_uris: list of dbp_uris 213 | """ 214 | en_uris = tuple(sorted(set(en_uris))) 215 | if en_uris in self.in_links: 216 | return self.in_links[en_uris] 217 | 218 | term_queries = [] 219 | for en_uri in en_uris: 220 | term_queries.append(ANNOT_INDEX.get_id_lookup_query(en_uri, Lucene.FIELDNAME_CONTENTS)) 221 | and_query = ANNOT_INDEX.get_and_query(term_queries) 222 | self.in_links[en_uris] = ANNOT_INDEX.searcher.search(and_query, 1).totalHits 223 | return self.in_links[en_uris] 224 | 225 | def __get_coherence_score(self, men, en, dismab_ens): 226 | """ 227 | coherence_score = sum_e_i(rel(e_i, en)) / len(ens) - 1 228 | 229 | :param en: entity 230 | :param dismab_ens: {men: (dbp_uri, fb_id), ....} 231 | """ 232 | coh_score = 0 233 | for m_i, e_i in dismab_ens.iteritems(): 234 | if m_i == men: 235 | continue 236 | coh_score += self.__get_mw_rel(e_i, en) 237 | coh_score = coh_score / float(len(dismab_ens.keys()) - 1) if len(dismab_ens.keys()) - 1 != 0 else 0 238 | return coh_score 239 | 240 | def __get_top_k(self, mention): 241 | """Returns top-k percent of the entities based on rel score.""" 242 | k = int(round(len(self.rel_scores[mention].keys()) * self.k_th)) 243 | k = 1 if k == 0 else k 244 | sorted_rel_scores = sorted(self.rel_scores[mention].items(), key=lambda item: item[1], reverse=True) 245 | top_k_ens = [] 246 | count = 1 247 | prev_rel_score = sorted_rel_scores[0][1] 248 | for en, rel_score in sorted_rel_scores: 249 | if rel_score != prev_rel_score: 250 | count += 1 251 | if count > k: 252 | break 253 | top_k_ens.append(en) 254 | prev_rel_score = rel_score 255 | return top_k_ens 256 | 257 | 258 | def main(): 259 | parser = argparse.ArgumentParser() 260 | parser.add_argument("-th", "--threshold", help="score threshold", type=float, default=0) 261 | parser.add_argument("-data", help="Data set name", choices=['y-erd', 'erd-dev', 'wiki-annot30', 'wiki-disamb30']) 262 | args = parser.parse_args() 263 | 264 | if args.data == "erd-dev": 265 | queries = test_coll.read_erd_queries() 266 | elif args.data == "y-erd": 267 | queries = test_coll.read_yerd_queries() 268 | elif args.data == "wiki-annot30": 269 | queries = test_coll.read_tagme_queries(config.WIKI_ANNOT30_SNIPPET) 270 | elif args.data == "wiki-disamb30": 271 | queries = test_coll.read_tagme_queries(config.WIKI_DISAMB30_SNIPPET) 272 | 273 | out_file_name = OUTPUT_DIR + "/" + args.data + "_tagme_wiki10.txt" 274 | open(out_file_name, "w").close() 275 | out_file = open(out_file_name, "a") 276 | 277 | # process the queries 278 | for qid, query in sorted(queries.items(), key=lambda item: int(item[0]) if item[0].isdigit() else item[0]): 279 | print "[" + qid + "]", query 280 | tagme = Tagme(Query(qid, query), args.threshold) 281 | print " parsing ..." 282 | cand_ens = tagme.parse() 283 | print " disambiguation ..." 284 | disamb_ens = tagme.disambiguate(cand_ens) 285 | print " pruning ..." 286 | linked_ens = tagme.prune(disamb_ens) 287 | 288 | out_str = "" 289 | for men, (en, score) in linked_ens.iteritems(): 290 | out_str += str(qid) + "\t" + str(score) + "\t" + en + "\t" + men + "\tpage-id" + "\n" 291 | print out_str, "-----------\n" 292 | out_file.write(out_str) 293 | 294 | print "output:", out_file_name 295 | 296 | 297 | if __name__ == "__main__": 298 | main() -------------------------------------------------------------------------------- /nordlys/tagme/tagme_api.py: -------------------------------------------------------------------------------- 1 | """ 2 | Methods to annotate queries with TagMe API. 
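Note: an access key from the TAGME authors is required; main() below currently holds the
placeholder key "XXX", which has to be replaced before the script can be run.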
3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | 7 | import argparse 8 | import requests 9 | from nordlys.config import OUTPUT_DIR 10 | from nordlys.tagme import config 11 | from nordlys.tagme.test_coll import read_erd_queries, read_yerd_queries, read_tagme_queries 12 | from nordlys.wikipedia.utils import WikipediaUtils 13 | 14 | 15 | class TagmeAPI(object): 16 | TAGME_URI = "http://tagme.di.unipi.it/tag" 17 | NONE = "*NONE*" 18 | 19 | def __init__(self, key): 20 | self.key = key 21 | 22 | def ask_tagme_query(self, query): 23 | """Sends queries to Tagme Api.""" 24 | data = {'key': self.key, 'lang': "en", 'text': query} 25 | res = requests.post(self.TAGME_URI, data).json() 26 | res['query'] = query 27 | return res 28 | 29 | def aks_tagme_queries(self, queries, out_file): 30 | """ 31 | Sends queries to Tagme Api and writes them in a json file. 32 | 33 | :param queries: dictionary {qid: query, ...} 34 | :param out_file: The file to write the output 35 | """ 36 | print "Getting results from Tagme ..." 37 | out_str = "" 38 | open(out_file, "w").close() 39 | out = open(out_file, "a") 40 | i = 0 41 | for qid in sorted(queries): 42 | query = queries[qid] 43 | print "[" + qid + "]", query 44 | tagme_res = self.ask_tagme_query(query) 45 | out_str += self.__to_str(qid, tagme_res) 46 | # responses[qid] = tagme_res 47 | i += 1 48 | if i % 1000 == 0: 49 | out.write(out_str) 50 | print "until qid:", qid 51 | out_str = "" 52 | out.write(out_str) 53 | print "TagMe results: " + out_file 54 | 55 | def __to_str(self, qid, response): 56 | """ 57 | Output format: 58 | qid, score, wiki-uri, mention, page-id, start, end 59 | """ 60 | out_str = "" 61 | for annot in response['annotations']: 62 | title = str(annot.get('title', TagmeAPI.NONE)) 63 | page_id = str(annot.get('id', TagmeAPI.NONE)) 64 | wiki_uri = self.__get_uri_from_title(title) 65 | out_str += str(qid) + "\t" + str(annot.get('rho', TagmeAPI.NONE)) + "\t" + wiki_uri + "\t" + \ 66 | annot.get('spot', TagmeAPI.NONE) + "\t" + page_id + "\t" + \ 67 | str(annot.get('start', TagmeAPI.NONE)) + "\t" + str(annot.get('end', TagmeAPI.NONE)) + "\n" 68 | return out_str 69 | 70 | def __get_uri_from_title(self, title): 71 | return WikipediaUtils.wiki_title_to_uri(title) if title != TagmeAPI.NONE else TagmeAPI.NONE 72 | 73 | 74 | def main(): 75 | key = "XXX" # To be taken from TAGME authors 76 | 77 | parser = argparse.ArgumentParser() 78 | parser.add_argument("-data", help="Data set name", choices=['y-erd', 'erd-dev', 'wiki-annot30', 'wiki-disamb30']) 79 | args = parser.parse_args() 80 | 81 | if args.data == "erd-dev": 82 | queries = read_erd_queries() 83 | elif args.data == "y-erd": 84 | queries = read_yerd_queries() 85 | elif args.data == "wiki-annot30": 86 | queries = read_tagme_queries(config.WIKI_ANNOT30_SNIPPET) 87 | elif args.data == "wiki-disamb30": 88 | queries = read_tagme_queries(config.WIKI_DISAMB30_SNIPPET) 89 | 90 | # Asks TAGME and creates json file 91 | out_file = OUTPUT_DIR + "/" + args.data + "_tagmeAPI" + ".txt" 92 | tagme = TagmeAPI(key) 93 | tagme.aks_tagme_queries(queries, out_file) 94 | 95 | if __name__ == '__main__': 96 | main() -------------------------------------------------------------------------------- /nordlys/tagme/test_coll.py: -------------------------------------------------------------------------------- 1 | """ 2 | Reads queries from test collections 3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | 7 | import csv 8 | from nordlys.tagme import config 9 | 10 | 11 | def 
read_yerd_queries(y_erd_file=config.Y_ERD): 12 | """ 13 | Reads queries from Erd query file. 14 | 15 | :return dictionary {query_id : query_content} 16 | """ 17 | queries = {} 18 | with open(y_erd_file, 'rb') as y_erd: 19 | reader = csv.DictReader(y_erd, delimiter="\t", quoting=csv.QUOTE_NONE) 20 | 21 | for line in reader: 22 | qid = line['qid'] 23 | query = line['query'] 24 | queries[qid] = query.strip() 25 | print "Number of queries:", len(queries) 26 | return queries 27 | 28 | 29 | def read_erd_queries(erd_q_file=config.ERD_QUERY): 30 | """ 31 | Reads queries from Erd query file. 32 | 33 | :return dictionary {qid : query} 34 | """ 35 | queries = {} 36 | q_file = open(erd_q_file, "r") 37 | for line in q_file: 38 | line = line.split("\t") 39 | query_id = line[0].strip() 40 | query = line[-1].strip() 41 | queries[query_id] = query 42 | q_file.close() 43 | print "Number of queries:", len(queries) 44 | return queries 45 | 46 | 47 | def read_tagme_queries(dataset_file): 48 | """ 49 | Reads queries from snippet file. 50 | 51 | :return dictionary {qid : query} 52 | """ 53 | queries = {} 54 | q_file = open(dataset_file, "r") 55 | for line in q_file: 56 | line = line.strip().split("\t") 57 | query_id = line[0].strip() 58 | query = line[1].strip() 59 | queries[query_id] = query 60 | q_file.close() 61 | print "Number of queries:", len(queries) 62 | return queries -------------------------------------------------------------------------------- /nordlys/wikipedia/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'faeghehhasibi' 2 | -------------------------------------------------------------------------------- /nordlys/wikipedia/anchor_extractor.py: -------------------------------------------------------------------------------- 1 | """ 2 | Creates a single anchor file for all entity-linking annotations. 3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | import argparse 7 | import os 8 | 9 | 10 | def merge_anchors(basedir, outfile): 11 | """Writes all annotations into a single file.""" 12 | open(outfile, "w").close() 13 | out = open(outfile, "a") 14 | i = 0 15 | for path, dirs, files in os.walk(basedir): 16 | for fn in sorted(files): 17 | if fn.endswith(".tsv"): 18 | with open(os.path.join(path, fn)) as in_file: 19 | out.write(in_file.read()) 20 | i += 1 21 | if i % 100 == 0: 22 | print i, "th file is added!" 23 | print "file:", os.path.join(path, fn) 24 | 25 | 26 | def count_anchors(anchor_file, out_file): 27 | """Counts the number of occurrences anchor-entity pairs""" 28 | sf_dict = {} 29 | in_file = open(anchor_file) 30 | i = 0 31 | for line in in_file: 32 | i += 1 33 | cols = line.strip().split("\t") 34 | if (len(cols) < 4) or (cols[2].strip().lower() == ""): 35 | continue 36 | sf = cols[2].strip().lower() 37 | en = cols[3].strip() 38 | if sf not in sf_dict: 39 | sf_dict[sf] = {} 40 | if en not in sf_dict[sf]: 41 | sf_dict[sf][en] = 1 42 | else: 43 | sf_dict[sf][en] += 1 44 | if i % 1000000 == 0: 45 | print i, "th line processed!" 
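    # The aggregated counts are then written out as one "surface_form<TAB>entity<TAB>count"
    # line per (surface form, entity) pair: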
46 | 47 | out_str = "" 48 | for sf, en_counts in sf_dict.iteritems(): 49 | for en, count in en_counts.iteritems(): 50 | out_str += sf + "\t" + en + "\t" + str(count) + "\n" 51 | out = open(out_file, "w") 52 | out.write(out_str) 53 | out.close() 54 | 55 | 56 | def main(): 57 | # Builds anchor file 58 | parser = argparse.ArgumentParser() 59 | parser.add_argument("-inputdir", help="Path to directory to read from") 60 | parser.add_argument("-outputdir", help="Path to write the annotations (.tsv files)") 61 | args = parser.parse_args() 62 | 63 | merge_anchors(args.inputdir, args.outputdir + "/anchors.txt") 64 | count_anchors(args.outputdir + "/anchors.txt", args.outputdir + "/anchors_count.txt") 65 | 66 | if __name__ == "__main__": 67 | main() 68 | -------------------------------------------------------------------------------- /nordlys/wikipedia/annot_extractor.py: -------------------------------------------------------------------------------- 1 | """ 2 | Extracts annotations from processed Wikipedia articles and writes them in .tsv files. 3 | 4 | author: Faegheh Hasibi 5 | """ 6 | 7 | import argparse 8 | import os 9 | import re 10 | from datetime import datetime 11 | 12 | tagRE = re.compile(r'(.*?)(<(/?\w+)[^>]*>)(?:([^<]*)(<.*?>)?)?') 13 | idRE = re.compile(r'id="([0-9]+)"') 14 | titleRE = re.compile(r'title="(.*)"') 15 | linkRE = re.compile(r'href="(.*)"') 16 | 17 | 18 | def process_file(wiki_file, out_file): 19 | """ 20 | Extracts annotations from XML annotated file and write annotations. 21 | output file format: page_id title mention linked_en 22 | 23 | :param wiki_file: XML file containing multiple articles. 24 | :param out_file: Name of tsv file. 25 | """ 26 | print "Processing " + wiki_file + " ...", 27 | open(out_file, "w").close() 28 | out = open(out_file, "a") 29 | f = open(wiki_file, "r") 30 | annots = [] 31 | doc_id, doc_title = None, None 32 | for line in f: 33 | # Writes annotations and reset variables 34 | if re.search(r'', line): 35 | out.write("".join(annots)) 36 | annots = [] 37 | doc_id, doc_title = None, None 38 | for m in tagRE.finditer(line): 39 | if not m: 40 | continue 41 | tag = m.group(3) 42 | if tag == "doc": 43 | doc_id = idRE.search(m.group(2)) 44 | doc_title = titleRE.search(m.group(2)) 45 | if (not doc_id) or (not doc_title): 46 | print "\nINFO: doc id or title not found in " + wiki_file, 47 | continue 48 | if tag == "a": 49 | mention = m.group(4) 50 | link = linkRE.search(m.group(2)) 51 | if (not link) or (doc_id is None) or (doc_title is None): 52 | print "\nINFO: link not found in " + wiki_file, 53 | continue 54 | annot = doc_id.group(1) + "\t" + doc_title.group(1) + "\t" + mention + "\t" + link.group(1) + "\n" 55 | annots.append(annot) 56 | print " --> output in " + out_file 57 | 58 | 59 | def add_dir(base_in_dir, base_out_dir): 60 | """Adds FACC annotations from a directory recursively.""" 61 | for path, dirs, _ in os.walk(base_in_dir): 62 | for dir in sorted(dirs): 63 | s_t = datetime.now() # start time 64 | total_time = 0.0 65 | for _, _, files in os.walk(os.path.join(base_in_dir, dir)): 66 | for fn in files: 67 | out_dir = base_out_dir + dir 68 | if not os.path.exists(out_dir): 69 | os.makedirs(out_dir) 70 | out_file = os.path.join(out_dir, "an_" + fn + ".tsv") 71 | in_file = os.path.join(base_in_dir + dir, fn) 72 | process_file(in_file, out_file) 73 | 74 | e_t = datetime.now() # end time 75 | diff = e_t - s_t 76 | total_time += diff.total_seconds() 77 | print "[processing time for bunch " + dir + " (min)]:", total_time/60 78 | 79 | 80 | def main(): 81 
| parser = argparse.ArgumentParser() 82 | parser.add_argument("-inputdir", help="Path to directory to read from") 83 | parser.add_argument("-outputdir", help="Path to write the annotations (.tsv files)") 84 | args = parser.parse_args() 85 | 86 | add_dir(args.inputdir, args.outputdir) 87 | 88 | if __name__ == "__main__": 89 | main() 90 | -------------------------------------------------------------------------------- /nordlys/wikipedia/indexer.py: -------------------------------------------------------------------------------- 1 | """ 2 | Creates a Lucene index for Wikipedia articles. 3 | 4 | - A single field index is created. 5 | - disambiguation and list pages are ignored. 6 | - wiki page annotations are ignored and only mentions are kept. 7 | 8 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 9 | """ 10 | import argparse 11 | import os 12 | import re 13 | from urllib import unquote 14 | from nordlys.wikipedia.utils import WikipediaUtils 15 | from nordlys.tagme.lucene_tools import Lucene 16 | 17 | 18 | class Indexer(object): 19 | tagRE = re.compile(r'(.*?)(<(/?\w+)[^>]*>)(?:([^<]*)(<.*?>)?)?') 20 | idRE = re.compile(r'id="([0-9]+)"') 21 | titleRE = re.compile(r'title="(.*)"') 22 | linkRE = re.compile(r'href="(.*)"') 23 | 24 | def __init__(self, annot_only): 25 | self.annot_only = annot_only 26 | self.contents = None 27 | self.lucene = None 28 | 29 | def __add_to_contents(self, field_name, field_value, field_type): 30 | """ 31 | Adds field to document contents. 32 | Field value can be a list, where each item is added separately (i.e., the field is multi-valued). 33 | """ 34 | if type(field_value) is list: 35 | for fv in field_value: 36 | self.__add_to_contents(field_name, fv, field_type) 37 | else: 38 | if len(field_value) > 0: # ignore empty fields 39 | self.contents.append({'field_name': field_name, 40 | 'field_value': field_value, 41 | 'field_type': field_type}) 42 | 43 | def index_file(self, file_name): 44 | """ 45 | Adds one file to the index. 46 | 47 | :param file_name: file to be indexed 48 | """ 49 | self.contents = [] 50 | article_text = "" 51 | article_annots = [] # for annot-only index 52 | 53 | f = open(file_name, "r") 54 | for line in f: 55 | line = line.replace("#redirect", "") 56 | # ------ Reaches the end tag for an article --------- 57 | if re.search(r'', line): 58 | # ignores null titles 59 | if wiki_uri is None: 60 | print "\tINFO: Null Wikipedia title!" 61 | # ignores disambiguation pages 62 | elif (wiki_uri.endswith("(disambiguation)>")) or \ 63 | ((len(article_text) < 200) and ("may refer to:" in article_text)): 64 | print "\tINFO: disambiguation page " + wiki_uri + " ignored!" 65 | # ignores list pages 66 | elif (wiki_uri.startswith(" --collection surfaceforms_wiki_YYYYMMDD --file --jsonArray 6 | 7 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 8 | """ 9 | import argparse 10 | 11 | import os 12 | import json 13 | from urllib import unquote 14 | from nordlys.storage.mongo import Mongo 15 | from nordlys.wikipedia.utils import WikipediaUtils 16 | 17 | 18 | class Merger(object): 19 | 20 | def __init__(self): 21 | self.all_sfs = {} 22 | 23 | def merge_all(self, titles_file, redirects_file, anchors_file, out_file): 24 | self.add_anchors(anchors_file) 25 | self.add_titles(titles_file) 26 | self.add_redirects(redirects_file) 27 | 28 | # Converting all surface forms to mongo format 29 | print "Converting to mongodb format ..." 
30 | sf_mongo_entries = [] 31 | i = 0 32 | for sf, en_sources in self.all_sfs.iteritems(): 33 | escaped_sf = Mongo.escape(sf) 34 | entry = {"_id": escaped_sf} 35 | for source, en in en_sources.iteritems(): 36 | entry[source] = en 37 | sf_mongo_entries.append(entry) 38 | i += 1 39 | if i % 1000000 == 0: 40 | print "processes", i, "the surface form" 41 | print "writing to json file ..." 42 | json.dump(sf_mongo_entries, open(out_file, "w"), indent=4, sort_keys=True) 43 | 44 | def __add_to_dict(self, sf, pred, en, count=1): 45 | if sf not in self.all_sfs: 46 | self.all_sfs[sf] = {} 47 | if pred not in self.all_sfs[sf]: 48 | self.all_sfs[sf][pred] = {en: count} 49 | else: 50 | self.all_sfs[sf][pred][en] = count 51 | 52 | # ============== ANCHORS ============== 53 | 54 | def add_anchors(self, anchor_file): 55 | print "Adding anchors ..." 56 | i = 0 57 | infile = open(anchor_file, "r") 58 | for line in infile: 59 | # print line 60 | cols = line.strip().split("\t") 61 | sf = cols[0].strip() 62 | count = int(cols[2]) 63 | wiki_uri = WikipediaUtils.wiki_title_to_uri(unquote(cols[1].strip())) 64 | self.__add_to_dict(sf, "anchor", wiki_uri, count) 65 | i += 1 66 | if i % 1000000 == 0: 67 | print "Processed", i, "th anchor!" 68 | 69 | # ============== REDIRECTS ============== 70 | 71 | def add_redirects(self, redirect_file): 72 | """Adds redirect pages to the surface form dictionary.""" 73 | print "Adding redirects ..." 74 | redirects = open(redirect_file, "r") 75 | count = 0 76 | for line in redirects: 77 | cols = line.strip().split("\t") 78 | sf = cols[0].strip().lower() 79 | wiki_uri = WikipediaUtils.wiki_title_to_uri(cols[1].strip()) 80 | # print sf, wiki_uri 81 | self.__add_to_dict(sf, "redirect", wiki_uri) 82 | count += 1 83 | if count % 1000000 == 0: 84 | print "Processed ", count, "th redirects." 85 | 86 | # ============== TITLES ============== 87 | 88 | def add_titles(self, title_file): 89 | """Adds titles and title name variants to the surface form dictionary.""" 90 | print "Adding titles ..." 91 | redirects = open(title_file, "r") 92 | count = 0 93 | for line in redirects: 94 | cols = line.strip().split("\t") 95 | title = unquote(cols[1].strip()) 96 | wiki_uri = WikipediaUtils.wiki_title_to_uri(title) 97 | self.__add_to_dict(title.lower(), "title", wiki_uri) 98 | title_nv = self.__title_nv(title) 99 | if (title_nv != title) and (title_nv.strip() != ""): 100 | self.__add_to_dict(title_nv.lower(), "title-nv", wiki_uri) 101 | count += 1 102 | if count % 1000000 == 0: 103 | print "Processed ", count, "th titles." 
104 | 105 | @staticmethod 106 | def __title_nv(title): 107 | """Removes all letters after "(" and "," from page title.""" 108 | p_pos = title.find("(") 109 | title_nv = title[:p_pos] if p_pos != -1 else title 110 | c_pos = title.find(",") 111 | title_nv = title[:c_pos] if c_pos != -1 else title_nv 112 | return title_nv.strip() 113 | 114 | 115 | def main(): 116 | parser = argparse.ArgumentParser() 117 | parser.add_argument("-anchors", help="Path to anchor file") 118 | parser.add_argument("-redirects", help="Path to redirect file") 119 | parser.add_argument("-titles", help="Path to page-title file") 120 | parser.add_argument("-outputdir", help="Path to output directory") 121 | args = parser.parse_args() 122 | 123 | 124 | # Merges titles, redirects, and anchors 125 | merger = Merger() 126 | merger.merge_all(args.titles, args.redirects, args.anchors, args.outputdir + "/sf_dict_mongo.json") 127 | 128 | if __name__ == "__main__": 129 | main() -------------------------------------------------------------------------------- /nordlys/wikipedia/pageid_extractor.py: -------------------------------------------------------------------------------- 1 | """ 2 | Extracts page id and titles from Wikipedia dump and writes them into a single file 3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | 7 | import argparse 8 | import os 9 | import re 10 | 11 | 12 | tagRE = re.compile(r'(.*?)(<(/?\w+)[^>]*>)(?:([^<]*)(<.*?>)?)?') 13 | idRE = re.compile(r'id="([0-9]+)"') 14 | titleRE = re.compile(r'title="(.*)"') 15 | 16 | 17 | def read_file(file_name): 18 | """Extracts page ids and titles from a single file.""" 19 | out_str = "" 20 | f = open(file_name, "r") 21 | 22 | for line in f: 23 | for m in tagRE.finditer(line): 24 | if not m: 25 | continue 26 | tag = m.group(3) 27 | if tag == "doc": 28 | doc_id = idRE.search(m.group(2)) 29 | doc_title = titleRE.search(m.group(2)) 30 | if (not doc_id) or (not doc_title): 31 | print "\nINFO: doc id or title not found in " + file_name, 32 | continue 33 | out_str += doc_id.group(1) + "\t" + doc_title.group(1) + "\n" 34 | break 35 | return out_str 36 | 37 | 38 | def read_files(basedir, output_file): 39 | """Extracts page id and titles to a single file.""" 40 | open(output_file, "w").close() 41 | out_file = open(output_file, "a") 42 | for path, dirs, _ in os.walk(basedir): 43 | for dir in sorted(dirs): 44 | for _, _, files in os.walk(os.path.join(basedir, dir)): 45 | for fn in sorted(files): 46 | print "parsing ", os.path.join(basedir + dir, fn), "..." 47 | out_str = read_file(os.path.join(basedir + dir, fn)) 48 | out_file.write(out_str) 49 | 50 | 51 | def main(): 52 | parser = argparse.ArgumentParser() 53 | parser.add_argument("-inputdir", help="Path to directory to read from") 54 | parser.add_argument("-output", help="Path to write the annotations (.tsv files)") 55 | args = parser.parse_args() 56 | 57 | read_files(args.inputdir, args.output + "/page-id-titles.txt") 58 | print "All page ids are added" 59 | 60 | 61 | if __name__ == "__main__": 62 | main() 63 | -------------------------------------------------------------------------------- /nordlys/wikipedia/utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Wikipedia utils. 
3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | 7 | from urllib import quote 8 | 9 | 10 | class WikipediaUtils(object): 11 | mongo = None 12 | 13 | @staticmethod 14 | def wiki_title_to_uri(title): 15 | """ 16 | Converts wiki page title to wiki_uri 17 | based on https://en.wikipedia.org/wiki/Wikipedia:Page_name#Spaces.2C_underscores_and_character_coding 18 | encoding based on http://dbpedia.org/services-resources/uri-encoding 19 | """ 20 | if title: 21 | wiki_uri = "" 22 | return wiki_uri 23 | else: 24 | return None 25 | 26 | @staticmethod 27 | def wiki_uri_to_dbp_uri(wiki_uri): 28 | """Converts Wikipedia uri to DBpedia URI.""" 29 | return wiki_uri.replace("=2.7 3 | sphinx-bootstrap-theme>=0.4.0 4 | sphinxcontrib-httpdomain>=1.2.1 5 | lxml>=2.3.2 6 | beautifulsoup4>=4.3.2 7 | numpy>=1.8.1 8 | nltk>=2.0.4 9 | -------------------------------------------------------------------------------- /run_scripts.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | # NOTES: 4 | # 1. Before running this script, download the `data` folder from `http://hasibi.com/files/res/data.tar.gz` 5 | # and put it under the main repository directory (i.e., tagme-rep/data) 6 | # 2. For all of the experiments we get all linked entities by setting the threshold to 0. 7 | # Later, in the evaluation scripts we filter out the entities below a certain threshold. 8 | 9 | # =============== 10 | # Reproducibility 11 | # =============== 12 | 13 | # TAGME-API - Wiki-Disamb30 14 | python -m nordlys.tagme.tagme_api -data wiki-disamb30 15 | python -m scripts.evaluator_disamb qrels/qrels_wiki-disamb30.txt output/wiki-disamb30_tagmeAPI.txt 16 | 17 | # TAGME-API - Wiki-Annot30 18 | python -m nordlys.tagme.tagme_api -data wiki-annot30 19 | python -m scripts.evaluator_annot qrels/qrels_wiki-annot30.txt output/wiki-annot30_tagmeAPI.txt 0.2 20 | python -m scripts.evaluator_topics qrels/qrels_wiki-annot30.txt output/wiki-annot30_tagmeAPI.txt 0.2 21 | 22 | # TAGME-our(wiki10) 23 | python -m nordlys.tagme.tagme -data wiki-annot30 24 | python -m scripts.evaluator_annot qrels/qrels_wiki-annot30.txt output/wiki-annot30_tagme_wiki10.txt 0.2 25 | python -m scripts.evaluator_topics qrels/qrels_wiki-annot30.txt output/wiki-annot30_tagme_wiki10.txt 0.2 26 | 27 | # Dexter 28 | python -m nordlys.tagme.dexter_api -data wiki-annot30 29 | python -m scripts.evaluator_annot qrels/qrels_wiki-annot30.txt output/wiki-annot30_dexter.txt 0.2 30 | python -m scripts.evaluator_topics qrels/qrels_wiki-annot30.txt output/wiki-annot30_dexter.txt 0.2 31 | 32 | 33 | 34 | # ================ 35 | # Generalizability 36 | # ================ 37 | 38 | # TAGME API - ERD-dev 39 | python -m nordlys.tagme.tagme_api -data erd-dev 40 | python -m scripts.to_elq output/erd-dev_tagmeAPI.txt 0.1 41 | python -m scripts.evaluator_strict qrels/qrels_erd-dev.txt output/erd-dev_tagmeAPI_0.1.elq 42 | 43 | # TAGME API - Y-ERD 44 | python -m nordlys.tagme.tagme_api -data y-erd 45 | python -m scripts.to_elq output/y-erd_tagmeAPI.txt 0.1 46 | python -m scripts.evaluator_strict qrels/qrels_y-erd.txt output/y-erd_tagmeAPI_0.1.elq 47 | 48 | #TAGME-wp10 - ERD-dev 49 | python -m nordlys.tagme.tagme -data erd-dev 50 | python -m scripts.to_elq output/erd-dev_tagme_wiki10.txt 0.1 51 | python -m scripts.evaluator_strict qrels/qrels_erd-dev.txt output/erd-dev_tagme_wiki10_0.1.elq 52 | 53 | #TAGME-wp10 - Y-ERD 54 | python -m nordlys.tagme.tagme -data y-erd 55 | python -m scripts.to_elq output/y-erd_tagme_wiki10.txt 
0.1 56 | python -m scripts.evaluator_strict qrels/qrels_y-erd.txt output/y-erd_tagme_wiki10_0.1.elq 57 | 58 | #TAGME-wp12 - ERD-dev 59 | python -m nordlys.tagme.tagme -data erd-dev 60 | python -m scripts.to_elq output/erd-dev_tagme_wiki12.txt 0.1 61 | python -m scripts.evaluator_strict qrels/qrels_erd-dev.txt output/erd-dev_tagme_wiki12_0.1.elq 62 | 63 | #TAGME-wp12 - Y-ERD 64 | python -m nordlys.tagme.tagme -data y-erd 65 | python -m scripts.to_elq output/y-erd_tagme_wiki12.txt 0.1 66 | python -m scripts.evaluator_strict qrels/qrels_y-erd.txt output/y-erd_tagme_wiki12_0.1.elq 67 | 68 | # Dexter - ERD-dev 69 | python -m nordlys.tagme.dexter_api -data erd-dev 70 | python -m scripts.to_elq output/erd-dev_dexter.txt 0.1 71 | python -m scripts.evaluator_strict qrels/qrels_erd-dev.txt output/erd-dev_dexter_0.1.elq 72 | 73 | # Dexter - Y-ERD 74 | python -m nordlys.tagme.dexter_api -data y-erd 75 | python -m scripts.to_elq output/y-erd_dexter.txt 0.1 76 | python -m scripts.evaluator_strict qrels/qrels_y-erd.txt output/y-erd_dexter_0.1.elq 77 | -------------------------------------------------------------------------------- /scripts/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hasibi/TAGME-Reproducibility/d21ed0d826fc60a6e4caaa5ec7b6c39e16f7c6c6/scripts/__init__.py -------------------------------------------------------------------------------- /scripts/evaluator_annot.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script computes annotation (Annot) metrics for the end-to-end performance. 3 | Precision and recall are macro-averaged. 4 | Matching condition: entities should match and mentions should be equal or contained in each other. 5 | 6 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 7 | """ 8 | 9 | from __future__ import division 10 | import sys 11 | from collections import defaultdict 12 | 13 | 14 | class EvaluatorAnnot(object): 15 | def __init__(self, qrels, results, score_th, null_qrels=None): 16 | self.qrels_dict = self.__group_by_queries(qrels) 17 | self.results_dict = self.__group_by_queries(results, res=True, score_th=score_th) 18 | self.null_qrels = self.__group_by_queries(null_qrels) if null_qrels else None 19 | 20 | @staticmethod 21 | def __group_by_queries(file_lines, res=False, score_th=None): 22 | """ 23 | Groups the lines by query id. 24 | 25 | :param file_lines: list of lines [[qid, score, en_id, mention, page_id], ...] 26 | :return: {qid: {(men0, en0), (men1, en01), ..}, ..}; 27 | """ 28 | grouped_inters = defaultdict(set) 29 | for cols in file_lines: 30 | if len(cols) > 2: 31 | if res and (float(cols[1]) < score_th): 32 | continue 33 | grouped_inters[cols[0]].add((cols[3].lower(), cols[2].lower())) 34 | return grouped_inters 35 | 36 | def rm_nulls_from_res(self): 37 | """ 38 | Removes mentions that are not linked to an entity in the qrel. 39 | There are some entities in the qrel with "*NONE*" as id. We remove the related mentions from the result file. 40 | Null entities are generated due to the inconsistency between the TAGME Wikipedia dump (2009) and our dump (2010). 41 | """ 42 | print "Removing mentions with null entities ..." 43 | new_results_dict = defaultdict(set) 44 | for qid in self.results_dict: 45 | # easy case: the query does not have any null entity.
46 | if qid not in set(self.null_qrels.keys()): 47 | new_results_dict[qid] = self.results_dict[qid] 48 | continue 49 | 50 | qrel_null_mentions = [item[0] for item in self.null_qrels[qid]] 51 | # check null mentions with results mentions 52 | for men, en in self.results_dict[qid]: 53 | is_null = False 54 | for qrel_null_men in qrel_null_mentions: 55 | # results mention does not match null qrel mention 56 | if mention_match(qrel_null_men, men): 57 | is_null = True 58 | break 59 | 60 | if not is_null: 61 | new_results_dict[qid].add((men, en)) 62 | self.results_dict = new_results_dict 63 | 64 | def eval(self, eval_query_func): 65 | """ 66 | Evaluates all queries and calculates total precision, recall and F1 (macro averaging). 67 | 68 | :param eval_query_func: A function that takes qrel and results for a query and returns evaluation metrics 69 | :return Total precision, recall, and F1 for all queries 70 | """ 71 | self.rm_nulls_from_res() 72 | queries_eval = {} 73 | total_prec, total_rec, total_f = 0, 0, 0 74 | for qid in sorted(self.qrels_dict): 75 | queries_eval[qid] = eval_query_func(self.qrels_dict[qid], self.results_dict.get(qid, {})) 76 | 77 | total_prec += queries_eval[qid]['prec'] 78 | total_rec += queries_eval[qid]['rec'] 79 | 80 | n = len(self.qrels_dict) # number of queries 81 | total_prec /= n 82 | total_rec /= n 83 | total_f = 2 * total_prec * total_rec / (total_prec + total_rec) 84 | 85 | log = "\n----------------" + "\nEvaluation results:\n" + \ 86 | "Prec: " + str(round(total_prec, 4)) + "\n" +\ 87 | "Rec: " + str(round(total_rec, 4)) + "\n" + \ 88 | "F1: " + str(round(total_f, 4)) + "\n" + \ 89 | "all: " + str(round(total_prec, 4)) + ", " + str(round(total_rec, 4)) + ", " + str(round(total_f, 4)) 90 | print log 91 | metrics = {'prec': total_prec, 'rec': total_rec, 'f': total_f} 92 | return metrics 93 | 94 | 95 | def erd_eval_query(query_qrels, query_results): 96 | """ 97 | Evaluates a single query. 98 | 99 | :param query_qrels: Query interpretations from Qrel [{en1, en2, ..}, ..] 100 | :param query_results: Query interpretations from result file [{en1, en2, ..}, ..] 101 | :return: precision, recall, and F1 for a query 102 | """ 103 | tp = 0 # correct 104 | fn = 0 # missed 105 | fp = 0 # incorrectly returned 106 | 107 | # ----- Query has at least an interpretation set. ----- 108 | # Iterate over qrels to calculate TP and FN 109 | for qrel_item in query_qrels: 110 | if find_item(qrel_item, query_results): 111 | tp += 1 112 | else: 113 | fn += 1 114 | # Iterate over results to calculate FP 115 | for res_item in query_results: 116 | if not find_item(res_item, query_qrels): # Finds the result in the qrels 117 | fp += 1 118 | 119 | prec = tp / (tp+fp) if tp+fp != 0 else 0 120 | rec = tp / (tp+fn) if tp+fn != 0 else 0 121 | f = (2 * prec * rec) / (prec + rec) if prec + rec != 0 else 0 122 | metrics = {'prec': prec, 'rec': rec, 'f': f} 123 | return metrics 124 | 125 | 126 | def find_item(item_to_find, items_list): 127 | """ 128 | Returns True if an item is found in the item list. 129 | 130 | :param item_to_find: item to be found 131 | :param items_list: list of items to search in 132 | :return boolean 133 | """ 134 | is_found = False 135 | 136 | for item in items_list: 137 | if (item[1] == item_to_find[1]) and mention_match(item[0], item_to_find[0]): 138 | is_found = True 139 | return is_found 140 | 141 | 142 | def mention_match(mention1, mention2): 143 | """ 144 | Checks if two mentions matches each other. 145 | Matching condition: One of the mentions is sub-string of the other one. 
146 | """ 147 | match = ((mention1 in mention2) or (mention2 in mention1)) 148 | return match 149 | 150 | 151 | def parse_file(file_name, res=False): 152 | """ 153 | Parses file and returns the positive instances for each query. 154 | 155 | :param file_name: Name of file to be parsed 156 | :return lists of lines [[qid, label, en_id, ...], ...], lines with null entities are separated 157 | """ 158 | null_lines = [] 159 | file_lines = [] 160 | efile = open(file_name, "r") 161 | for line in efile.readlines(): 162 | if line.strip() == "": 163 | continue 164 | cols = line.strip().split("\t") 165 | if (not res) and (cols[2].strip() == "*NONE*"): 166 | null_lines.append(cols) 167 | else: 168 | file_lines.append(cols) 169 | return file_lines, null_lines 170 | 171 | 172 | def main(args): 173 | if len(args) < 2: 174 | print "\tUsage: " 175 | exit(0) 176 | print "parsing qrel ..." 177 | qrels, null_qrels = parse_file(args[0]) # here qrel does not contain null entities 178 | print "parsing results ..." 179 | results = parse_file(args[1], res=True)[0] 180 | print "evaluating ..." 181 | evaluator = EvaluatorAnnot(qrels, results, float(args[2]), null_qrels=null_qrels) 182 | evaluator.eval(erd_eval_query) 183 | 184 | if __name__ == '__main__': 185 | main(sys.argv[1:]) 186 | -------------------------------------------------------------------------------- /scripts/evaluator_disamb.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script computes evaluation metrics for the disambiguation phase. 3 | For each query, if the ground truth entity is found in the results, both precision and recall are set to 1; 4 | otherwise to 0. 5 | 6 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 7 | """ 8 | 9 | from __future__ import division 10 | import sys 11 | from collections import defaultdict 12 | 13 | 14 | class EvaluatorDisamb(object): 15 | 16 | def __init__(self, qrels, results, null_qrels=None): 17 | self.qrels_dict = self.__group_by_queries(qrels) 18 | self.results_dict = self.__group_by_queries(results) 19 | self.null_qrels = self.__group_by_queries(null_qrels) if null_qrels else None 20 | 21 | @staticmethod 22 | def __group_by_queries(file_lines): 23 | """ 24 | Groups the lines by query id. 25 | 26 | :param file_lines: list of lines [[qid, score, wiki_uri, mention, page_id], ...] 27 | :return: {qid: {(men0, en0), (men1, en01), ..}, ..}; 28 | """ 29 | grouped_inters = defaultdict(set) 30 | for cols in file_lines: 31 | if len(cols) > 2: 32 | grouped_inters[cols[0]].add((cols[3].lower(), cols[2])) 33 | return grouped_inters 34 | 35 | def eval(self, eval_query_func): 36 | """ 37 | Evaluates all queries and calculates total precision, recall and F1 (macro averaging).
38 | 39 | :param eval_query_func: A function that takes qrel and results for a query and returns evaluation metrics 40 | :return Total precision, recall, and F1 for all queries 41 | """ 42 | queries_eval = {} 43 | total_prec, total_rec, total_f = 0, 0, 0 44 | for qid in set(sorted(self.qrels_dict)): 45 | queries_eval[qid] = eval_query_func(self.qrels_dict[qid], self.results_dict.get(qid, {})) 46 | total_prec += queries_eval[qid]['prec'] 47 | total_rec += queries_eval[qid]['rec'] 48 | 49 | n = len(self.qrels_dict) # number of queries 50 | total_prec /= n 51 | total_rec /= n 52 | total_f = (2 * total_prec * total_rec) / (total_prec + total_rec) 53 | 54 | log = "\n----------------" + "\nEvaluation results:\n" + \ 55 | "Prec: " + str(round(total_prec, 4)) + "\n" +\ 56 | "Rec: " + str(round(total_rec, 4)) + "\n" + \ 57 | "F1: " + str(round(total_f, 4)) + "\n" + \ 58 | "all: " + str(round(total_prec, 4)) + ", " + str(round(total_rec, 4)) + ", " + str(round(total_f, 4)) 59 | print log 60 | metrics = {'prec': total_prec, 'rec': total_rec, 'f': total_f} 61 | return metrics 62 | 63 | 64 | def erd_eval_query(query_qrels, query_results): 65 | """ 66 | Evaluates a single query. 67 | 68 | :param query_qrels: Query interpretations from Qrel [{en1, en2, ..}, ..] 69 | :param query_results: Query interpretations from result file [{en1, en2, ..}, ..] 70 | :return: precision, recall, and F1 for a query 71 | """ 72 | prec, rec = 0, 0 73 | 74 | # ----- Query has at least an interpretation set. ----- 75 | # Iterate over qrels to calculate TP and FN 76 | for qrel_item in query_qrels: 77 | if find_item(qrel_item, query_results): 78 | prec += 1 79 | rec += 1 80 | 81 | prec /= len(query_qrels) 82 | rec /= len(query_qrels) 83 | f = (2 * prec * rec) / (prec + rec) if prec + rec != 0 else 0 84 | metrics = {'prec': prec, 'rec': rec, 'f': f} 85 | return metrics 86 | 87 | 88 | def find_item(item_to_find, items_list): 89 | """ 90 | Returns True if an item is found in the item list. 91 | 92 | :param item_to_find: item to be found 93 | :param items_list: list of items to search in 94 | :return boolean 95 | """ 96 | is_found = False 97 | 98 | for item in items_list: 99 | if item[1] == item_to_find[1]: 100 | is_found = True 101 | return is_found 102 | 103 | 104 | def parse_file(file_name, res=False): 105 | """ 106 | Parses file and returns the positive instances for each query. 107 | 108 | :param file_name: Name of file to be parsed 109 | :return list of lines [[qid, label, en_id, ...], ...] 110 | """ 111 | null_lines = [] 112 | file_lines = [] 113 | efile = open(file_name, "r") 114 | for line in efile.readlines(): 115 | if line.strip() == "": 116 | continue 117 | cols = line.strip().split("\t") 118 | if (not res) and (cols[2].strip() == "*NONE*"): 119 | null_lines.append(cols) 120 | else: 121 | file_lines.append(cols) 122 | return file_lines, null_lines 123 | 124 | 125 | def main(args): 126 | if len(args) < 2: 127 | print "\tUsage: " 128 | exit(0) 129 | print "parsing qrel ..." 130 | qrels, null_qrels = parse_file(args[0]) # here qrel does not contain null entities 131 | print "parsing results ..." 132 | results = parse_file(args[1], res=True)[0] 133 | print "evaluating ..." 
134 | evaluator = EvaluatorDisamb(qrels, results, null_qrels=null_qrels) 135 | evaluator.eval(erd_eval_query) 136 | 137 | 138 | if __name__ == '__main__': 139 | main(sys.argv[1:]) 140 | -------------------------------------------------------------------------------- /scripts/evaluator_strict.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script evaluates query interpretations based on the strict evaluation metrics; 3 | macro averaging of precision, recall and F-measure. 4 | 5 | For detailed information see: 6 | F. Hasibi, K. Balog, and S. E. Bratsberg. "Entity Linking in Queries: Tasks and Evaluation", 7 | In Proceedings of ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '15), Sep 2015. 8 | DOI: http://dx.doi.org/10.1145/2808194.2809473 9 | 10 | Usage: 11 | python evaluation_erd.py 12 | e.g. 13 | python evaluation_erd.py qrels_sets_ERD-dev.txt ERD-dev_MLMcg-GIF.txt 14 | 15 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 16 | """ 17 | 18 | from __future__ import division 19 | import sys 20 | from collections import defaultdict 21 | 22 | 23 | class Evaluator(object): 24 | 25 | def __init__(self, qrels, results): 26 | self.qrels_dict = self.__group_by_queries(qrels) 27 | self.results_dict = self.__group_by_queries(results) 28 | qid_overlap = set(self.qrels_dict.keys()) & set(self.results_dict.keys()) 29 | if len(qid_overlap) == 0: 30 | print "ERR: Query mismatch between qrel and result file!" 31 | exit(0) 32 | 33 | @staticmethod 34 | def __group_by_queries(file_lines): 35 | """ 36 | Groups the lines by query id. 37 | 38 | :param file_lines: list of lines [[qid, label, en_id, ...], ...] 39 | :return: {qid: [iset0, iset1, ..], ..}; isets are sets of entity ids 40 | """ 41 | grouped_inters = defaultdict(list) 42 | for cols in file_lines: 43 | if len(cols) > 2: 44 | grouped_inters[cols[0]].append(set(cols[2:])) 45 | elif cols[0] not in grouped_inters: 46 | grouped_inters[cols[0]] = [] 47 | 48 | # check that identical interpretations are not assigned to a query 49 | for qid, interprets in grouped_inters.iteritems(): 50 | q_interprets = set() 51 | for inter in interprets: 52 | if tuple(sorted(inter)) in q_interprets: 53 | print "Err: Identical interpretations for query [" + qid + "]!" 54 | exit(0) 55 | else: 56 | q_interprets.add(tuple(sorted(inter))) 57 | return grouped_inters 58 | 59 | def eval(self, eval_query_func): 60 | """ 61 | Evaluates all queries and calculates total precision, recall and F1 (macro averaging). 
62 | 63 | :param eval_query_func: A function that takes qrel and results for a query and returns evaluation metrics 64 | :return Total precision, recall, and F1 for all queries 65 | """ 66 | queries_eval = {} 67 | total_prec, total_rec, total_f = 0, 0, 0 68 | for qid in sorted(self.qrels_dict): 69 | queries_eval[qid] = eval_query_func(self.qrels_dict[qid], self.results_dict.get(qid, [])) 70 | total_prec += queries_eval[qid]['prec'] 71 | total_rec += queries_eval[qid]['rec'] 72 | n = len(self.qrels_dict) # number of queries 73 | total_prec /= n 74 | total_rec /= n 75 | total_f = (2 * total_rec * total_prec) / (total_rec + total_prec) if total_prec + total_rec != 0 else 0 76 | 77 | log = "\n----------------" + "\nEvaluation results:\n" + \ 78 | "Prec: " + str(round(total_prec, 4)) + "\n" +\ 79 | "Rec: " + str(round(total_rec, 4)) + "\n" + \ 80 | "F1: " + str(round(total_f, 4)) + "\n" + \ 81 | "all: " + str(round(total_prec, 4)) + ", " + str(round(total_rec, 4)) + ", " + str(round(total_f, 4)) 82 | print log 83 | metrics = {'prec': total_prec, 'rec': total_rec, 'f': total_f} 84 | return metrics 85 | 86 | 87 | def erd_eval_query(query_qrels, query_results): 88 | """ 89 | Evaluates a single query. 90 | 91 | :param query_qrels: Query interpretations from Qrel [{en1, en2, ..}, ..] 92 | :param query_results: Query interpretations from result file [{en1, en2, ..}, ..] 93 | :return: precision, recall, and F1 for a query 94 | """ 95 | tp = 0 # correct 96 | fn = 0 # missed 97 | fp = 0 # incorrectly returned 98 | 99 | # ----- Query has no interpretation set. ------ 100 | if len(query_qrels) == 0: 101 | if len(query_results) == 0: 102 | return {'prec': 1, 'rec': 1, 'f': 1} 103 | return {'prec': 0, 'rec': 0, 'f': 0} 104 | 105 | # ----- Query has at least an interpretation set. ----- 106 | # Iterate over qrels to calculate TP and FN 107 | for qrel_item in query_qrels: 108 | if find_item(qrel_item, query_results): 109 | tp += 1 110 | else: 111 | fn += 1 112 | # Iterate over results to calculate FP 113 | for res_item in query_results: 114 | if not find_item(res_item, query_qrels): # Finds the result in the qrels 115 | fp += 1 116 | 117 | prec = tp / (tp+fp) if tp+fp != 0 else 0 118 | rec = tp / (tp+fn) if tp+fn != 0 else 0 119 | metrics = {'prec': prec, 'rec': rec} 120 | return metrics 121 | 122 | 123 | def find_item(item_to_find, items_list): 124 | """ 125 | Returns True if an item is found in the item list. 126 | 127 | :param item_to_find: item to be found 128 | :param items_list: list of items to search in 129 | :return boolean 130 | """ 131 | is_found = False 132 | 133 | item_to_find = set([en.lower() for en in item_to_find]) 134 | 135 | for item in items_list: 136 | item = set([en.lower() for en in item]) 137 | if item == item_to_find: 138 | is_found = True 139 | return is_found 140 | 141 | 142 | def parse_file(file_name): 143 | """ 144 | Parses file and returns the positive instances for each query. 145 | 146 | :param file_name: Name of file to be parsed 147 | :return list of lines [[qid, label, en_id, ...], ...] 
148 | """ 149 | file_lines = [] 150 | efile = open(file_name, "r") 151 | for line in efile.readlines(): 152 | if line.strip() == "": 153 | continue 154 | cols = line.strip().split("\t") 155 | file_lines.append(cols) 156 | return file_lines 157 | 158 | 159 | def main(args): 160 | if len(args) < 2: 161 | print "\tUsage: [qrel_file] [result_file]" 162 | exit(0) 163 | qrels = parse_file(args[0]) 164 | results = parse_file(args[1]) 165 | evaluator = Evaluator(qrels, results) 166 | evaluator.eval(erd_eval_query) 167 | 168 | if __name__ == '__main__': 169 | main(sys.argv[1:]) 170 | -------------------------------------------------------------------------------- /scripts/evaluator_topics.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script computes Topic metrics for the end-to-end performance. 3 | Precision and recall are micro-averaged. 4 | Matching condition: only entities should match. 5 | 6 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 7 | """ 8 | 9 | from __future__ import division 10 | import sys 11 | from collections import defaultdict 12 | 13 | 14 | class EvaluatorTopics(object): 15 | 16 | def __init__(self, qrels, results, null_qrels=None, score_th=0): 17 | self.qrels_dict = self.__group_by_queries(qrels) 18 | self.results_dict = self.__group_by_queries(results, score_th=score_th) 19 | self.null_qrels = self.__group_by_queries(null_qrels) if null_qrels else None 20 | self.score_th = score_th 21 | 22 | @staticmethod 23 | def __group_by_queries(file_lines, score_th=None): 24 | """ 25 | Groups the lines by query id. 26 | 27 | :param file_lines: list of lines [[qid, score, en_id, mention, page_id], ...] 28 | :return: {qid: {(men0, en0), (men1, en01), ..}, ..}; 29 | """ 30 | grouped_inters = defaultdict(set) 31 | for cols in file_lines: 32 | if len(cols) > 2: 33 | if score_th and (float(cols[1]) < score_th): 34 | continue 35 | grouped_inters[cols[0]].add((cols[3].lower(), cols[2].lower())) 36 | return grouped_inters 37 | 38 | def rm_nulls_res(self): 39 | """ 40 | Removes mentions that not linked to an entity in the qrel. 41 | There are some entities in the qrel with "*NONE*" as id. We remove the related mentions from the result file. 42 | Null entities are generated due to the inconsistency between TAGME Wikipedia dump (2009) and our dump (2010). 43 | """ 44 | print "Removing mentions with null entities ..." 45 | new_results_dict = defaultdict(set) 46 | for qid in self.results_dict: 47 | # easy case 48 | if qid not in set(self.null_qrels.keys()): 49 | new_results_dict[qid] = self.results_dict[qid] 50 | continue 51 | 52 | qrel_null_mentions = [item[0] for item in self.null_qrels[qid]] 53 | # check null mentions with results mentions 54 | for men, en in self.results_dict[qid]: 55 | is_null = False 56 | for qrel_null_men in qrel_null_mentions: 57 | # results mention does not match null qrel mention 58 | if mention_match(qrel_null_men, men): 59 | is_null = True 60 | break 61 | 62 | if not is_null: 63 | new_results_dict[qid].add((men, en)) 64 | # else: 65 | # print qid, men, en, "QREL mention:", qrel_null_men, "-*-" 66 | self.results_dict = new_results_dict 67 | 68 | def eval(self, eval_query_func): 69 | """ 70 | Evaluates all queries and calculates total precision, recall and F1 (macro averaging). 
71 | 72 | :param eval_query_func: A function that takes qrel and results for a query and returns evaluation metrics 73 | :return Total precision, recall, and F1 for all queries 74 | """ 75 | self.rm_nulls_res() 76 | print "comparing results ..." 77 | queries_eval = {} 78 | total_tp, total_fp, total_fn = 0, 0, 0 79 | for qid in sorted(self.qrels_dict): 80 | queries_eval[qid] = eval_query_func(self.qrels_dict[qid], self.results_dict.get(qid, {})) 81 | 82 | total_tp += queries_eval[qid]['tp'] 83 | total_fp += queries_eval[qid]['fp'] 84 | total_fn += queries_eval[qid]['fn'] 85 | 86 | total_prec = total_tp / (total_tp + total_fp) 87 | total_rec = total_tp / (total_tp + total_fn) 88 | total_f = 2 * total_prec * total_rec / (total_prec + total_rec) 89 | 90 | log = "\n----------------" + "\nEvaluation results:\n" + \ 91 | "Prec: " + str(round(total_prec, 4)) + "\n" +\ 92 | "Rec: " + str(round(total_rec, 4)) + "\n" + \ 93 | "F1: " + str(round(total_f, 4)) + "\n" + \ 94 | "all: " + str(round(total_prec, 4)) + ", " + str(round(total_rec, 4)) + ", " + str(round(total_f, 4)) 95 | print log 96 | metrics = {'prec': total_prec, 'rec': total_rec, 'f': total_f} 97 | return metrics 98 | 99 | 100 | def erd_eval_query(query_qrels, query_results): 101 | """ 102 | Evaluates a single query. 103 | 104 | :param query_qrels: Query interpretations from Qrel [{en1, en2, ..}, ..] 105 | :param query_results: Query interpretations from result file [{en1, en2, ..}, ..] 106 | :return: precision, recall, and F1 for a query 107 | """ 108 | tp = 0 # correct 109 | fn = 0 # missed 110 | fp = 0 # incorrectly returned 111 | 112 | # ----- Query has at least an interpretation set. ----- 113 | # Iterate over qrels to calculate TP and FN 114 | results_ens = [item[1] for item in query_results] 115 | qrel_ens = [item[1] for item in query_qrels] 116 | for qrel_item in qrel_ens: 117 | if find_item(qrel_item, results_ens): 118 | tp += 1 119 | else: 120 | fn += 1 121 | # Iterate over results to calculate FP 122 | for res_item in results_ens: 123 | if not find_item(res_item, qrel_ens): # Finds the result in the qrels 124 | fp += 1 125 | 126 | stats = {'tp': tp, 'fp': fp, 'fn': fn} 127 | return stats 128 | 129 | 130 | def find_item(item_to_find, items_list): 131 | """ 132 | Returns True if an item is found in the item list. 133 | 134 | :param item_to_find: item to be found 135 | :param items_list: list of items to search in 136 | :return boolean 137 | """ 138 | is_found = False 139 | for item in items_list: 140 | if item == item_to_find: 141 | is_found = True 142 | return is_found 143 | 144 | 145 | def mention_match(mention1, mention2): 146 | """ 147 | Checks if two mentions matches each other. 148 | Matching condition: One of the mentions is sub-string of the other one. 149 | """ 150 | match = ((mention1 in mention2) or (mention2 in mention1)) 151 | return match 152 | 153 | 154 | def parse_file(file_name, res=False): 155 | """ 156 | Parses file and returns the positive instances for each query. 157 | 158 | :param file_name: Name of file to be parsed 159 | :return list of lines [[qid, score, en_id, mention, ...], ...] 
160 | """ 161 | null_lines = [] 162 | file_lines = [] 163 | infile = open(file_name, "r") 164 | for line in infile.readlines(): 165 | if line.strip() == "": 166 | continue 167 | cols = line.strip().split("\t") 168 | if (not res) and (cols[2].strip() == "*NONE*"): 169 | null_lines.append(cols) 170 | else: 171 | file_lines.append(cols) 172 | return file_lines, null_lines 173 | 174 | 175 | def main(args): 176 | if len(args) < 2: 177 | print "\tUsage: " 178 | exit(0) 179 | print "parsing qrel ..." 180 | qrels, null_qrels = parse_file(args[0]) # here qrel does not contain null entities 181 | print "parsing results ..." 182 | results = parse_file(args[1])[0] 183 | print "evaluating ..." 184 | score_th = 0 if len(args) == 2 else float(args[2]) 185 | evaluator = EvaluatorTopics(qrels, results, null_qrels=null_qrels, score_th=score_th) 186 | evaluator.eval(erd_eval_query) 187 | 188 | if __name__ == '__main__': 189 | main(sys.argv[1:]) 190 | -------------------------------------------------------------------------------- /scripts/to_elq.py: -------------------------------------------------------------------------------- 1 | """ 2 | Converts the results to ELQ format: 3 | - Filters general concept entities (keeps only proper noun entities) 4 | - Creates ELQ format file 5 | 6 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 7 | """ 8 | 9 | import sys 10 | from collections import defaultdict 11 | from nordlys.config import DATA_DIR 12 | from nordlys.wikipedia.utils import WikipediaUtils 13 | 14 | 15 | def load_kb(): 16 | """Loads Freebase snapshot of proper noun entities.""" 17 | print "Loading knowledge base snapshot ..." 18 | __fb_dbp_file = open(DATA_DIR + "/fb_dbp_snapshot.txt", "r") 19 | global KB_SNP_DBP 20 | for line in __fb_dbp_file: 21 | cols = line.strip().split("\t") 22 | KB_SNP_DBP.add(cols[1]) 23 | __fb_dbp_file.close() 24 | 25 | KB_SNP_DBP = set() 26 | 27 | 28 | def read_file(input_file, score_th): 29 | lines = [] 30 | with open(input_file, "r") as input: 31 | for line in input: 32 | cols = line.strip().split("\t") 33 | if float(cols[1]) < score_th: 34 | continue 35 | lines.append(cols) 36 | return lines 37 | 38 | 39 | def filter_general_ens(lines): 40 | """Returns tab-separated lines: qid score en men fb_id""" 41 | filtered_annots = [] 42 | for line in lines: 43 | dbp_uri = WikipediaUtils.wiki_uri_to_dbp_uri(line[2]) 44 | if dbp_uri in KB_SNP_DBP: # check fb is in the KB snapshot 45 | filtered_annots.append(line) 46 | return filtered_annots 47 | 48 | 49 | def to_inter_sets(lines): 50 | """Groups linked entities and interpretation set.""" 51 | group_by_qid = defaultdict(set) 52 | for cols in lines: 53 | group_by_qid[cols[0]].add(cols[2]) 54 | return group_by_qid 55 | 56 | 57 | def main(args): 58 | if len(args) < 2: 59 | print "USAGE: " 60 | exit(0) 61 | load_kb() 62 | lines = read_file(args[0], float(args[1])) 63 | filtered_annots = filter_general_ens(lines) 64 | inter_sets = to_inter_sets(filtered_annots) 65 | 66 | out_str = "" 67 | for qid in sorted(inter_sets.keys()): 68 | ens = inter_sets[qid] 69 | out_str += qid + "\t1\t" + "\t".join(ens) + "\n" 70 | out_file = args[0][:args[0].rfind(".")] + "_" + str(args[1]) + ".elq" 71 | open(out_file, "w").write(out_str) 72 | print "Output file:", out_file 73 | 74 | 75 | if __name__ == "__main__": 76 | main(sys.argv[1:]) -------------------------------------------------------------------------------- /setup.md: -------------------------------------------------------------------------------- 1 | # Setup In order to set up and run our 
implementation of TAGME, you need to install [PyLucene](https://lucene.apache.org/pylucene/) and the packages listed in the ``requirements.txt`` file. Once the required packages are installed, you need to have the resources required for running the code, which are: (i) a surface form dictionary and (ii) indices for a Wikipedia dump. You can directly ask the authors of [1] to provide you with these resources or build them using the following steps: 1. Downloading a Wikipedia dump 2. Preprocessing the dump 3. Building indices 4. Building a surface form dictionary 5. Setting the config file Below we describe each of these steps. ## 1. Downloading a Wikipedia dump Our TAGME implementation is built from a Wikipedia dump, i.e., an ``enwiki-YYYYMMDD-pages-articles.xml.bz2`` file that can be downloaded from [here](http://dumps.wikimedia.org/enwiki/). For the experiments in [1], we used the dumps from *20100408* and *20120502*, which are available upon request. ## 2. Preprocessing the dump The [Wikipedia Extractor](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) tool is used for preprocessing the Wikipedia dump. The version of the extractor used for our experiments is available under `lib/wikiextractor-master`. The following command is executed to pre-process the dumps. Note that the `-l` option is necessary, as it preserves the links. ``` python tagme-rep/lib/wikiextractor-master/WikiExtractor.py -o path/to/output/folder -l path/to/enwiki-YYYYMMDD-pages-articles.xml.bz2 ``` We assume that the resulting files are stored under the `preprocessed-YYYYMMDD` folder. ## 3. Building indices Two types of indices are built from the Wikipedia dumps: - **YYYYMMDD-index**: Index of Wikipedia articles (with resolved URIs). - **YYYYMMDD-index-annot**: Index containing only Wikipedia annotations. This index is used to compute relatedness between entities. Run the following commands to build these indices: - ``python -m nordlys.wikipedia.indexer -i preprocessed-YYYYMMDD/ -o YYYYMMDD-index/`` - ``python -m nordlys.wikipedia.indexer -a -i preprocessed-YYYYMMDD/ -o YYYYMMDD-index-annot/`` We note that the following pages are excluded from the indices: - **List pages**: Wikipedia URIs starting with ". ``` [1] F. Hasibi, K. Balog, and S.E. Bratsberg. “On the reproducibility of the TAGME Entity Linking System”, In proceedings of 38th European Conference on Information Retrieval (ECIR ’16), Padova, Italy, March 2016. ``` --------------------------------------------------------------------------------
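
A note on the result files consumed by the evaluation scripts above. The TAGME API wrapper (`nordlys.tagme.tagme_api`) writes one tab-separated line per annotation (qid, rho score, Wikipedia URI, mention, page id, start, end), and the evaluators keep only the leading columns and drop annotations whose score falls below the given threshold. The snippet below is a minimal, illustrative sketch of that filtering step; it is not part of the repository, and the sample lines, URIs, and scores are invented.

```python
from __future__ import print_function
from collections import defaultdict

# Hypothetical result lines in the format written by the API wrapper:
# qid <TAB> score <TAB> wiki-uri <TAB> mention <TAB> page-id
SAMPLE_RESULTS = [
    "q1\t0.35\t<wikipedia:Barack_Obama>\tobama\t534366",
    "q1\t0.05\t<wikipedia:Political_party>\tparty\t23040",
]


def group_by_query(lines, score_th):
    """Keeps (mention, entity) pairs per query, dropping low-confidence annotations
    (mirrors how the evaluator scripts apply the score threshold)."""
    grouped = defaultdict(set)
    for line in lines:
        cols = line.strip().split("\t")
        if len(cols) > 2 and float(cols[1]) >= score_th:
            grouped[cols[0]].add((cols[3].lower(), cols[2].lower()))
    return grouped


if __name__ == "__main__":
    # Only the first annotation survives the 0.2 threshold
    print(group_by_query(SAMPLE_RESULTS, score_th=0.2))
```

This matches the note at the top of `run_scripts.sh`: the linkers are run with a threshold of 0, and the filtering by a specific threshold happens later, inside the evaluation scripts.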
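
Similarly, the surface form dictionary built in step 4 (`nordlys.wikipedia.merge_sf`) stores, for each surface form, the entities it refers to grouped by source (anchor, title, title name variant, redirect), with anchor counts attached. The snippet below only illustrates the entry layout suggested by `Merger.merge_all`; the example surface form, counts, and URI style are invented for illustration.

```python
from __future__ import print_function
import json

# Invented example of one entry in sf_dict_mongo.json, following the layout
# produced by Merger.merge_all: {"_id": surface_form, source: {entity: count}}
entry = {
    "_id": "new york",
    "anchor": {"<wikipedia:New_York_City>": 12345, "<wikipedia:New_York>": 6789},
    "redirect": {"<wikipedia:New_York>": 1},
    "title": {"<wikipedia:New_York>": 1},
}

print(json.dumps(entry, indent=4, sort_keys=True))
```

Entries of this shape are what the `mongoimport` command mentioned in the `merge_sf` docstring loads into the `surfaceforms_wiki_YYYYMMDD` collection.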