├── LICENSE ├── README.md ├── authors_comments.md ├── lib ├── edu.cmu.lti.wikipedia_redirect │ ├── .classpath │ ├── .project │ ├── .settings │ │ ├── org.eclipse.core.resources.prefs │ │ └── org.eclipse.jdt.core.prefs │ ├── .svn │ │ ├── entries │ │ ├── format │ │ ├── pristine │ │ │ ├── 29 │ │ │ │ └── 29fd2aaaf86816aafbbc105506741d18535ab692.svn-base │ │ │ ├── 30 │ │ │ │ └── 30792ab46c06afdce18ad48ab5759bdcf973ba1a.svn-base │ │ │ ├── 81 │ │ │ │ └── 8159a4c51d3239796584c79b7550d7c49972f4d8.svn-base │ │ │ ├── 84 │ │ │ │ └── 84bf19ebd162073a34792b87aa400129a6068865.svn-base │ │ │ ├── 01 │ │ │ │ └── 010314f8fd56597897f1bd210790af39ebe6b887.svn-base │ │ │ ├── 1a │ │ │ │ └── 1ade57daf5d7fd981e4ababdc006c8c3a02bcbe5.svn-base │ │ │ ├── 2d │ │ │ │ └── 2dd2033882f190c773a5bce39ed7b2362af4ad02.svn-base │ │ │ ├── 5d │ │ │ │ ├── 5d3212ecd32fd3d4b72c3e11820392d4eb0a7054.svn-base │ │ │ │ └── 5d601d1da850507fb57a9de2d042751af0c19eb0.svn-base │ │ │ ├── 6f │ │ │ │ └── 6f27f0a059481c81ea3674baef5f02dcdf93bc42.svn-base │ │ │ ├── b2 │ │ │ │ └── b2f1c3203bdbb9cbbc7b334f031504dcfa465b61.svn-base │ │ │ ├── c2 │ │ │ │ └── c2d8aecb47cbf4a0d2ebc3c5eb42630ab7999559.svn-base │ │ │ ├── f8 │ │ │ │ └── f8a0a6e41c412aed301f3bd515ea356240c9cc44.svn-base │ │ │ └── fb │ │ │ │ └── fbba7c603f44332bd67313432986b2f97da47014.svn-base │ │ └── wc.db │ ├── README.txt │ ├── launches │ │ ├── Demo.launch │ │ └── WikipediaRedirectExtractor.launch │ ├── src │ │ └── edu │ │ │ └── cmu │ │ │ └── lti │ │ │ └── wikipedia_redirect │ │ │ ├── Demo.java │ │ │ ├── IOUtil.java │ │ │ ├── WikipediaHypernym.java │ │ │ ├── WikipediaRedirect.java │ │ │ └── WikipediaRedirectExtractor.java │ └── test-data │ │ ├── sample-jawiki-latest-pages-articles.xml │ │ └── sample-res_cat_jawiki.txt └── wikiextractor-master-280915.zip ├── nordlys ├── __init__.py ├── config.py ├── storage │ ├── __init__.py │ ├── mongo.py │ └── surfaceforms.py ├── tagme │ ├── __init__.py │ ├── config.py │ ├── dexter_api.py │ ├── lucene_tools.py │ ├── mention.py │ ├── query.py │ ├── tagme.py │ ├── tagme_api.py │ └── test_coll.py └── wikipedia │ ├── __init__.py │ ├── anchor_extractor.py │ ├── annot_extractor.py │ ├── indexer.py │ ├── merge_sf.py │ ├── pageid_extractor.py │ └── utils.py ├── requirements.txt ├── run_scripts.sh ├── scripts ├── __init__.py ├── evaluator_annot.py ├── evaluator_disamb.py ├── evaluator_strict.py ├── evaluator_topics.py └── to_elq.py └── setup.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 Faegheh Hasibi 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# TAGME reproducibility

This repository contains resources developed within the following paper:

F. Hasibi, K. Balog, and S.E. Bratsberg. “On the reproducibility of the TAGME Entity Linking System”,
In Proceedings of the 38th European Conference on Information Retrieval (ECIR ’16), March 2016.

This study is an effort aimed at reproducing the results presented in the TAGME paper [1]. See the [paper](http://hasibi.com/files/ecir2016-tagme.pdf) and [presentation](http://www.slideshare.net/FaeghehHasibi/tagmerep) for detailed information.

We received invaluable comments from the TAGME authors about their system, and we have made these notes available [here](authors_comments.md).
These comments may inform future efforts related to the re-implementation of the TAGME system, as they cannot be found in the original paper.

This repository is structured as follows:

- `nordlys/`: Code required for running the entity linkers.
- `scripts/`: Evaluation scripts.
- `lib/`: Contains libraries.
- `run_scripts.sh`: A single script that runs all the scripts needed to produce the results of the paper.
- [authors_comments.md](authors_comments.md): Comments from the TAGME authors and notes about our experiments.

Other resources involved in this project are [data](http://hasibi.com/files/res/data.tar.gz), [qrels](http://hasibi.com/files/res/qrels.tar.gz), and [runs](http://hasibi.com/files/res/runs.tar.gz), which are described below.

**Note:** Before running the code (`run_scripts.sh`), please read the [setup](setup.md) file and build all the required resources.


## Data

The following data files can be downloaded from [here](http://hasibi.com/files/res/data.tar.gz):

- **Wiki-disamb30** and **Wiki-annot30**: The original datasets are published [here](http://acube.di.unipi.it/tagme-dataset/). We complement the snippets with numerical IDs, as IDs are not contained in the original datasets.
- **ERD-dev**: The dataset was originally published by the [ERD Challenge](http://web-ngram.research.microsoft.com/ERD2014); we use it in our generalizability experiments. The files related to this dataset are prefixed with `Trec_beta`.
- **Y-ERD**: This dataset was originally published in [2] and is available [here](http://bit.ly/ictir2015-elq). The dataset is used in our generalizability experiments.
- **Freebase snapshot**: A snapshot of Freebase containing only proper noun entities (e.g., people and locations) is made available by the ERD Challenge and is used for filtering entities in the generalizability experiments.


## Qrels

The qrel files can be downloaded from [here](http://hasibi.com/files/res/qrels.tar.gz). All qrels are tab-delimited and their format is as follows:

- **Wiki-disamb30** and **Wiki-annot30**: The columns represent: snippet ID, confidence score, Wikipedia URI, and Wikipedia page ID. The last column is not considered in the evaluation scripts.
- **ERD-dev** and **Y-ERD**: The columns represent: query ID, confidence score (always 1), and Wikipedia URI. The entities after the second column represent an interpretation set (entity set) of the query. (If a query has multiple interpretations, there are multiple lines with that query ID.)
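As an illustration of these two formats, the sketch below shows one way such qrel files could be read. It is based only on the column descriptions above; the function names are ours, and the repository's own loading code (under `nordlys/` and `scripts/`) may differ.

```python
def read_wiki_qrels(path):
    """Read a Wiki-disamb30/Wiki-annot30 style qrel file.

    Each tab-delimited line is assumed to hold: snippet ID, confidence score,
    Wikipedia URI, and Wikipedia page ID (the page ID is ignored here, as it
    is not used by the evaluation scripts).
    """
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            snippet_id, score, uri = cols[0], float(cols[1]), cols[2]
            qrels.setdefault(snippet_id, []).append((uri, score))
    return qrels


def read_erd_qrels(path):
    """Read an ERD-dev/Y-ERD style qrel file.

    Each line is assumed to hold a query ID, a confidence score (always 1),
    and the Wikipedia URIs forming one interpretation set; a query with
    several interpretations appears on several lines.
    """
    interpretations = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            query_id, uris = cols[0], cols[2:]
            interpretations.setdefault(query_id, []).append(set(uris))
    return interpretations
```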
## Runs

The run files can be downloaded from [here](http://hasibi.com/files/res/runs.tar.gz) and are categorized into two groups: reproducibility and generalizability.

- **Reproducibility**: The naming convention for these files is *XX_YY.txt*, where XX represents the dataset and YY is the name of the method. For each file, only the first four columns are considered for the evaluation: snippet ID, confidence score, Wikipedia URI, and mention.
- **Generalizability**: These files are named *XX_YY_ZZ.elq*, where XX is the dataset, YY is the name of the method, and ZZ is the entity linking threshold used for evaluation. The format of these files is similar to the corresponding qrel files.

## Citation

If you use the resources presented in this repository, please cite:

```
@inproceedings{Hasibi:2016:ORT,
  author = {Hasibi, Faegheh and Balog, Krisztian and Bratsberg, Svein Erik},
  title = {On the reproducibility of the TAGME Entity Linking System},
  booktitle = {Proceedings of the 38th European Conference on Information Retrieval},
  series = {ECIR '16},
  year = {2016},
  pages = {436--449},
  publisher = {Springer},
  DOI = {http://dx.doi.org/10.1007/978-3-319-30671-1_32}
}
```

## Contact

Should you have any questions, please contact Faegheh Hasibi at .

[1] P. Ferragina and U. Scaiella. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of CIKM '10, pages 1625–1628, 2010.

[2] F. Hasibi, K. Balog, and S. E. Bratsberg. Entity Linking in Queries: Tasks and Evaluation. In Proceedings of ICTIR ’15, pages 171–180, 2015.

--------------------------------------------------------------------------------
/authors_comments.md:
--------------------------------------------------------------------------------

# Authors' comments

In our study aimed at reproducing the results in [1], only parts of the results were reproducible.
Later on, the TAGME authors clarified some of the issues that surfaced.
We list these comments below, as they may inform future efforts related to the re-implementation of TAGME.

We also include some additional notes on our experiments, which clarify our reasoning behind certain decisions.

## TAGME authors' comments

*The comments below are taken from our personal communication with the TAGME authors. Most of these are direct quotes, but we made minor editorial changes and structured them by topic.*

**Important note:** In this section "we" refers to the TAGME authors; to avoid ambiguity, we will refer to our implementation as the "ECIR'16 implementation."

### Implementation:

- The TAGME paper [1] represents “version 1” of TAGME, while the source code and the TAGME API are “version 2”. In the second version, the epsilon value has been changed and the value of tau has been decreased.

- TAGME uses the Wikipedia page-to-page link records (enwiki-xxx-pagelinks.sql.gz), while the ECIR'16 implementation extracts links from the body of the pages (enwiki-xxx-pages-articles.xml.bz2). This affects the computation of relatedness, as the former source contains 20% more links than the latter. For example, the Wikipedia article [Jaguar](https://en.wikipedia.org/wiki/Jaguar) contains several links under the "Extant Carnivora species" section, which may not be found in the *.xml.bz2* file.

- TAGME version 2 uses a list of stop words to create alternative spots, and we add them to the list of available spots during the pre-processing phase. In other words, when a spot like "president of the united states" is found, two spots are created: (i) "president of the united states" and (ii) "president united states". Then these two spots are added to the anchor dictionary. However, TAGME does not perform any stop word removal during the parsing phase.

- In TAGME version 2 the parsing method for anchors starting with 'the', 'a', 'an' has changed. We ignore those prefixes and use only the remaining part. So version 2 can never find 'the firebrand' but only 'firebrand'.

- TAGME performs two extra filtering steps before the pruning step.
  * Filtering of mentions that are contained in a longer mention.
  * Filtering based on a link probability threshold. In version 1 this threshold was set to 0.02, but in version 2 it is set to 0.1. The code related to this filtering is line 87 of `TagmeParser.java` in the TAGME source code:
```
this.minLinkProb = TagmeConfig.get().getSetting(MODULE).getFloatParam(PARAM_MIN_LP, DEFAULT_MIN_LP);
```

### Evaluation:

- For the experiments in [1], we used only 1.4M out of the 2M snippets from WIKI-DISAMB30, as Weka could not load more than that into memory. From WIKI-ANNOT30 we used all snippets; the difference is merely a matter of approximation.

- The evaluation metrics used for end-to-end performance (topics and annot metrics) are micro-averaged.

- The evaluation metrics for the disambiguation phase are micro-averaged (prec = TP / (TP + FP), recall = TP / total number of test cases) and are computed as follows:
  1. annotate the fragment
  2. search for the mention in the result
  3. if you don't find it, ignore it
  4. if you find it:
     - if it is correct, increment the number of true positives
     - if it is not correct, increment the number of false positives
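To make this procedure concrete, the sketch below spells out the micro-averaged computation just described. It is an illustration added to these notes, not the TAGME authors' code and not the evaluation scripts shipped in `scripts/`; the shape of the test cases and of the annotator's output is an assumption.

```python
def disambiguation_micro_scores(test_cases, annotate):
    """Micro-averaged precision/recall for the disambiguation phase.

    `test_cases` is assumed to be a list of (fragment, mention, gold_entity)
    triples, and `annotate(fragment)` is assumed to return a dict mapping
    each linked mention to the entity chosen by the system.
    """
    tp, fp = 0, 0
    for fragment, mention, gold_entity in test_cases:
        result = annotate(fragment)          # 1. annotate the fragment
        if mention not in result:            # 2./3. mention not found: ignore it
            continue
        if result[mention] == gold_entity:   # 4. found and correct
            tp += 1
        else:                                # 4. found but linked to the wrong entity
            fp += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / len(test_cases) if test_cases else 0.0
    return precision, recall
```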
## Our additional comments on the TAGME authors' comments :)

- For the sake of reproducibility, we had to use the Wikipedia dump closest to the original experiments, that is, the dump from April 2010. The page-to-page link records for this dump are no longer available, and therefore we had to extract them from the body of the Wikipedia pages (enwiki-20100408-pages-articles.xml.bz2).
- The TAGME datasets (Wiki-annot and Wiki-disamb) contain page IDs, which have changed over time in Wikipedia. We addressed this issue as follows:
  * We converted the page IDs of the datasets to the corresponding page titles using the 2010 dump.
  * For all the experiments, we converted the page titles to URIs based on the [Wikipedia instructions](https://en.wikipedia.org/wiki/Wikipedia:Page_name#Spaces.2C_underscores_and_character_coding).

Using this method, the URIs used in our experiments are consistent with the TAGME datasets. Since the generalizability datasets were originally created based on DBpedia URIs, this results in minor differences (due to encoding). However, the differences are negligible and do not affect the overall conclusion.


```
[1] P. Ferragina and U. Scaiella.
TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of CIKM '10, pages 1625–1628, 2010. 60 | ``` 61 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.classpath: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.project: -------------------------------------------------------------------------------- 1 | 2 | 3 | edu.cmu.lti.wikipedia_redirect 4 | 5 | 6 | 7 | 8 | 9 | org.eclipse.jdt.core.javabuilder 10 | 11 | 12 | 13 | 14 | 15 | org.eclipse.jdt.core.javanature 16 | 17 | 18 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.settings/org.eclipse.core.resources.prefs: -------------------------------------------------------------------------------- 1 | #Sat Oct 08 00:05:51 EDT 2011 2 | eclipse.preferences.version=1 3 | encoding/=UTF-8 4 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.settings/org.eclipse.jdt.core.prefs: -------------------------------------------------------------------------------- 1 | #Tue Oct 11 01:41:50 EDT 2011 2 | eclipse.preferences.version=1 3 | org.eclipse.jdt.core.compiler.codegen.inlineJsrBytecode=enabled 4 | org.eclipse.jdt.core.compiler.codegen.targetPlatform=1.5 5 | org.eclipse.jdt.core.compiler.codegen.unusedLocal=preserve 6 | org.eclipse.jdt.core.compiler.compliance=1.5 7 | org.eclipse.jdt.core.compiler.debug.lineNumber=generate 8 | org.eclipse.jdt.core.compiler.debug.localVariable=generate 9 | org.eclipse.jdt.core.compiler.debug.sourceFile=generate 10 | org.eclipse.jdt.core.compiler.problem.assertIdentifier=error 11 | org.eclipse.jdt.core.compiler.problem.enumIdentifier=error 12 | org.eclipse.jdt.core.compiler.source=1.5 13 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/entries: -------------------------------------------------------------------------------- 1 | 12 2 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/format: -------------------------------------------------------------------------------- 1 | 12 2 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/01/010314f8fd56597897f1bd210790af39ebe6b887.svn-base: -------------------------------------------------------------------------------- 1 | See http://code.google.com/p/wikipedia-redirect -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/1a/1ade57daf5d7fd981e4ababdc006c8c3a02bcbe5.svn-base: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/29/29fd2aaaf86816aafbbc105506741d18535ab692.svn-base: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with 
the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.BufferedReader; 19 | import java.io.BufferedWriter; 20 | import java.io.File; 21 | import java.io.FileInputStream; 22 | import java.io.FileOutputStream; 23 | import java.io.FileReader; 24 | import java.io.InputStreamReader; 25 | import java.io.LineNumberReader; 26 | import java.io.ObjectInputStream; 27 | import java.io.ObjectOutputStream; 28 | import java.io.OutputStreamWriter; 29 | import java.util.AbstractMap; 30 | import java.util.ArrayList; 31 | import java.util.List; 32 | import java.util.Map.Entry; 33 | 34 | /** 35 | * Reads and writes wikipedia redirect data. 36 | * 37 | * @author Hideki Shima 38 | * 39 | */ 40 | public class IOUtil { 41 | 42 | /** 43 | * Save Wikipedia redirect data 44 | * 45 | * @param redirectData 46 | * map where key is original term and value is redirected term 47 | * @throws Exception 48 | */ 49 | public static void save( AbstractMap map ) throws Exception { 50 | File outputDir = new File("target"); 51 | if (!outputDir.exists()) { 52 | outputDir.mkdirs(); 53 | } 54 | WikipediaRedirect wr = new WikipediaRedirect( map ); 55 | saveText( wr, outputDir ); 56 | saveSerialized( wr, outputDir ); 57 | } 58 | 59 | /** 60 | * Save Wikipedia redirect data into tab separated text file 61 | * 62 | * @param redirectData 63 | * map where key is original term and value is redirected term 64 | * @throws Exception 65 | */ 66 | private static void saveText( WikipediaRedirect wr, File outputDir ) throws Exception { 67 | File txtFile = new File(outputDir, "wikipedia_redirect.txt"); 68 | FileOutputStream fosTxt = new FileOutputStream(txtFile); 69 | OutputStreamWriter osw = new OutputStreamWriter(fosTxt, "utf-8"); 70 | BufferedWriter bw = new BufferedWriter(osw); 71 | for ( Entry entry : wr.entrySet() ) { 72 | bw.write( entry.getKey()+"\t"+entry.getValue()+"\n" ); 73 | } 74 | bw.close(); 75 | osw.close(); 76 | fosTxt.close(); 77 | System.out.println("Saved redirect data in text format: "+txtFile.getAbsolutePath()); 78 | } 79 | 80 | /** 81 | * Save Wikipedia redirect data into serialized object 82 | * 83 | * @param redirectData 84 | * map where key is original term and value is redirected term 85 | * @throws Exception 86 | */ 87 | private static void saveSerialized( WikipediaRedirect wr, File outputDir ) throws Exception { 88 | File objFile = new File(outputDir, "wikipedia_redirect.ser"); 89 | FileOutputStream fosObj = new FileOutputStream(objFile); 90 | ObjectOutputStream outObject = new ObjectOutputStream(fosObj); 91 | outObject.writeObject(wr); 92 | outObject.close(); 93 | fosObj.close(); 94 | System.out.println("Serialized redirect data: "+objFile.getAbsolutePath()); 95 | } 96 | 97 | /** 98 | * Deserializes wikipedia redirect data 99 | * @param file 100 | * serialized object or tab-separated text 101 | * @return wikipedia redirect 102 | * @throws Exception 103 | */ 104 | public static WikipediaRedirect loadWikipediaRedirect( File f ) throws Exception { 105 | if (!f.exists() || f.isDirectory()) { 106 | System.err.println("File not found: 
"+f.getAbsolutePath()); 107 | System.exit(-1); 108 | } 109 | if ( f.getName().endsWith(".ser") ) { 110 | return loadWikipediaRedirectFromSerialized( f ); 111 | } else { 112 | //faster than above? 113 | return loadWikipediaRedirectFromText( f ); 114 | } 115 | } 116 | 117 | /** 118 | * Deserializes wikipedia redirect data from serialized object data 119 | * @param file 120 | * serialized object 121 | * @return wikipedia redirect 122 | * @throws Exception 123 | */ 124 | private static WikipediaRedirect loadWikipediaRedirectFromSerialized( File f ) throws Exception { 125 | WikipediaRedirect object; 126 | try { 127 | FileInputStream inFile = new FileInputStream(f); 128 | ObjectInputStream inObject = new ObjectInputStream(inFile); 129 | object = (WikipediaRedirect)inObject.readObject(); 130 | inObject.close(); 131 | inFile.close(); 132 | } catch (Exception e) { 133 | throw e; 134 | } 135 | return object; 136 | } 137 | 138 | /** 139 | * Deserializes wikipedia redirect data from tab-separated text file 140 | * @param file 141 | * tab-separated text 142 | * @return wikipedia redirect 143 | * @throws Exception 144 | */ 145 | private static WikipediaRedirect loadWikipediaRedirectFromText( File f ) throws Exception { 146 | int size = (int)countLineNumber(f); 147 | WikipediaRedirect wr = new WikipediaRedirect( size ); 148 | try { 149 | FileInputStream fis = new FileInputStream( f ); 150 | InputStreamReader isr = new InputStreamReader( fis ); 151 | BufferedReader br = new BufferedReader( isr ); 152 | String line = null; 153 | while ( (line = br.readLine()) != null ) { 154 | String[] elements = line.split("\t"); 155 | wr.put( elements[0], elements[1] ); 156 | } 157 | br.close(); 158 | isr.close(); 159 | fis.close(); 160 | } catch (Exception e) { 161 | throw e; 162 | } 163 | return wr; 164 | } 165 | 166 | /** 167 | * Loads tab separated data as an alternative way to load() method. 
168 | * Works for Wikipedia hypernym data generated by 169 | * NICT's "Hyponymy extraction tool" 170 | * 171 | * @param file 172 | * tab separated file that contains lines that look "word1[TAB]word2[BR]" 173 | * @return wikipedia redirect 174 | * @throws Exception 175 | */ 176 | public static WikipediaHypernym loadWikipediaHypernym( File f ) throws Exception { 177 | int size = (int)IOUtil.countLineNumber( f ); 178 | WikipediaHypernym object = new WikipediaHypernym( size ); 179 | try { 180 | FileInputStream inFile = new FileInputStream( f ); 181 | InputStreamReader isr = new InputStreamReader( inFile ); 182 | BufferedReader br = new BufferedReader( isr ); 183 | String line = null; 184 | while ( (line = br.readLine())!=null ) { 185 | String[] tokens = line.split("\t"); 186 | if (tokens.length<=1) { 187 | continue; 188 | } 189 | String key = tokens[0]; 190 | List targets = object.get(key); 191 | if ( targets==null ) { 192 | targets = new ArrayList(); 193 | } 194 | targets.add(tokens[1]); 195 | object.put(key, targets); 196 | } 197 | br.close(); 198 | isr.close(); 199 | inFile.close(); 200 | } catch (Exception e) { 201 | throw e; 202 | } 203 | return object; 204 | } 205 | 206 | /** 207 | * Count number of lines in a file in an efficient way 208 | * @param f 209 | * @return 210 | * @throws Exception 211 | */ 212 | public static long countLineNumber( File f ) throws Exception { 213 | LineNumberReader lnr = new LineNumberReader(new FileReader(f)); 214 | lnr.skip(Long.MAX_VALUE); 215 | int count = lnr.getLineNumber(); 216 | lnr.close(); 217 | return count; 218 | } 219 | } 220 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/2d/2dd2033882f190c773a5bce39ed7b2362af4ad02.svn-base: -------------------------------------------------------------------------------- 1 | #Sat Oct 08 00:05:51 EDT 2011 2 | eclipse.preferences.version=1 3 | encoding/=UTF-8 4 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/30/30792ab46c06afdce18ad48ab5759bdcf973ba1a.svn-base: -------------------------------------------------------------------------------- 1 | ACMフェロー アルフレッド・エイホ 0.288392 2 | ACMフェロー アンドリュー・タネンバウム 0.084127 3 | ACMフェロー エドムンド・クラーク 0.220679 4 | ACMフェロー グラディ・ブーチ -0.175180 5 | ACMフェロー ジャック・ドンガラ 0.427047 6 | ACMフェロー スティーブン・ボーン 0.220679 7 | ACMフェロー ダグラス・カマー 0.907805 8 | ACMフェロー ダン・ブリックリン 0.220679 9 | ACMフェロー ビャーネ・ストロヴストルップ 0.220679 10 | ACMフェロー ビル・グロップ 0.220679 11 | ACMフェロー ピーター・ノーヴィグ 0.233410 12 | ACMフェロー ボブ・フランクストン 0.241471 13 | ACMフェロー リチャード・ハミング 0.899804 14 | ACMフェロー 米澤明憲 0.143425 -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/5d/5d3212ecd32fd3d4b72c3e11820392d4eb0a7054.svn-base: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/5d/5d601d1da850507fb57a9de2d042751af0c19eb0.svn-base: -------------------------------------------------------------------------------- 1 | 2 | 3 | Wikipedia 4 | http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 5 | MediaWiki 1.17wmf1 6 | first-letter 7 | 8 | メディア 9 | 特別 10 | 11 | ノート 12 | 利用者 13 | 利用者‐会話 14 | Wikipedia 15 | Wikipedia‐ノート 16 | ファイル 17 | ファイル‐ノート 18 | MediaWiki 19 | MediaWiki‐ノート 20 | Template 21 | 
Template‐ノート 22 | Help 23 | Help‐ノート 24 | Category 25 | Category‐ノート 26 | Portal 27 | Portal‐ノート 28 | プロジェクト 29 | プロジェクト‐ノート 30 | 31 | 32 | 33 | Wikipedia:Sandbox 34 | 6 35 | 36 | 37 | 36654478 38 | 2011-03-06T16:16:58Z 39 | 40 | Y-dash 41 | 309126 42 | 43 | テストは[[Wikipedia:サンドボックス]]でお願いいたします。 / [[Special:Contributions/Kompek|Kompek]] ([[User talk:Kompek|会話]]) による ID:36654304 の版を[[H:RV|取り消し]] 44 | #REDIRECT [[Wikipedia:サンドボックス]] 45 | 46 | 47 | 48 | SandBox 49 | 26 50 | 51 | 52 | 6986090 53 | 2006-08-05T23:25:48Z 54 | 55 | Nevylax 56 | 38464 57 | 58 | #REDIRECT [[サンドボックス]] 59 | #REDIRECT [[サンドボックス]] 60 | 61 | 62 | 63 | HomePage 64 | 46 65 | 66 | 67 | 2168894 68 | 2005-03-22T13:49:43Z 69 | 70 | Hideyuki 71 | 9577 72 | 73 | #REDIRECT [[ホームページ]] 74 | #REDIRECT [[ホームページ]] 75 | 76 | 77 | 78 | Wikipedia:About 79 | 51 80 | 81 | 82 | 19962101 83 | 2008-05-31T01:12:46Z 84 | 85 | Kanjy 86 | 36859 87 | 88 | 89 | 2003-03-06T11:35:13Z Setu さん版 (#REDIRECT [[Wikipedia:ウィキペディアについて]]) に戻す 90 | #REDIRECT [[Wikipedia:ウィキペディアについて]] 91 | 92 | 93 | 94 | Wikipedia:How does one edit a page 95 | 85 96 | 97 | 98 | 13206183 99 | 2007-06-19T05:06:43Z 100 | 101 | Aotake 102 | 34929 103 | 104 | 105 | redirect target 106 | #REDIRECT [[Help:ページの編集]] 107 | 108 | 109 | 110 | ワールド・ミュージック 111 | 113 112 | 113 | 114 | 24277249 115 | 2009-02-07T16:20:20Z 116 | 117 | Point136 118 | 211299 119 | 120 | 121 | Bot: リダイレクト構文の修正 122 | #REDIRECT [[ワールドミュージック]] 123 | 124 | 125 | 126 | ネマティック相 127 | 127 128 | 129 | 130 | 24277255 131 | 2009-02-07T16:20:40Z 132 | 133 | Point136 134 | 211299 135 | 136 | 137 | Bot: リダイレクト構文の修正 138 | #REDIRECT [[ネマティック液晶]] 139 | 140 | 141 | 142 | スメクティック相 143 | 128 144 | 145 | 146 | 2168972 147 | 2004-01-07T09:45:14Z 148 | 149 | Yas 150 | 739 151 | 152 | 153 | #REDIRECT [[液晶]] 154 | #REDIRECT [[液晶]] 155 | 156 | 157 | 158 | ミュージシャン一覧 (個人) 159 | 143 160 | 161 | 162 | 35399365 163 | 2010-12-14T00:40:05Z 164 | 165 | Xqbot 166 | 273540 167 | 168 | 169 | ロボットによる: 二重リダイレクト修正 → [[音楽家の一覧]] 170 | #転送 [[音楽家の一覧]] 171 | 172 | 173 | 174 | 病名 175 | 176 176 | 177 | 178 | 17793766 179 | 2008-02-04T13:37:54Z 180 | 181 | U3002 182 | 66126 183 | 184 | 185 | 二重リダイレクト回避 186 | #REDIRECT [[病気の別名の一覧]] 187 | 188 | 189 | 190 | Wikipedia:Welcome, newcomers 191 | 216 192 | 193 | 194 | 10662242 195 | 2007-02-15T13:40:22Z 196 | 197 | Cave cattum 198 | 41235 199 | 200 | #REDIRECT [[Wikipedia:ウィキペディアへようこそ]] 201 | #REDIRECT [[Wikipedia:ウィキペディアへようこそ]] 202 | 203 | 204 | 205 | 黒人霊歌 206 | 260 207 | 208 | 209 | 22493441 210 | 2008-10-23T05:59:07Z 211 | 212 | Buzin Satuma Hayato 213 | 243768 214 | 215 | 216 | 黒人霊歌はスピリチュアル(音楽)だと思う 217 | #REDIRECT [[スピリチュアル#スピリチュアル(音楽)]] 218 | 219 | 220 | 221 | Wikipedia:漢字やスペルに注意 222 | 281 223 | 224 | 225 | 13451992 226 | 2007-07-02T14:15:39Z 227 | 228 | Cave cattum 229 | 41235 230 | 231 | [[WP:AES|←]][[Wikipedia:記事を執筆する]]へのリダイレクト 232 | #REDIRECT [[Wikipedia:記事を執筆する]] 233 | 234 | 235 | 236 | Wikipedia:他言語の使用は控えめに 237 | 283 238 | 239 | 240 | 15761853 241 | 2007-10-27T06:12:35Z 242 | 243 | Khhy 244 | 13490 245 | 246 | #他言語表記は控えめに 247 | #REDIRECT [[Wikipedia:素晴らしい記事を書くには#他言語表記は控えめに]] 248 | 249 | 250 | 251 | Wikipedia:日本語表記法 252 | 291 253 | 254 | 255 | 12840559 256 | 2007-05-30T14:59:17Z 257 | 258 | Aotake 259 | 34929 260 | 261 | [[Wikipedia:表記ガイド]]へ統合。 262 | #REDIRECT [[Wikipedia:表記ガイド]] 263 | 264 | 265 | 266 | Wikipedia:リダイレクトの使い方 267 | 308 268 | 269 | 270 | 2169127 271 | 2003-11-30T04:09:10Z 272 | 273 | 219.164.91.166 274 | 275 | #REDIRECT [[Wikipedia:リダイレクト]] 276 | #REDIRECT [[Wikipedia:リダイレクト]] 277 | 278 | 279 | 
280 | アルコール飲料 281 | 318 282 | 283 | 284 | 15885863 285 | 2007-11-02T12:42:36Z 286 | 287 | Balmung0731 288 | 99201 289 | 290 | [[酒]]へ統合 291 | #REDIRECT [[酒]] 292 | 293 | 294 | 295 | 地学 296 | 321 297 | 298 | 299 | 2169138 300 | 2003-09-26T11:33:29Z 301 | 302 | 133.11.230.18 303 | 304 | #REDIRECT [[地球科学]] 305 | 306 | 307 | 308 | Wikipedia:ノートページのレイアウト 309 | 323 310 | 311 | 312 | 36028694 313 | 2011-01-24T11:38:30Z 314 | 315 | Kurz 316 | 1601 317 | 318 | 319 | lk 320 | #REDIRECT [[Wikipedia:ノートページのガイドライン]] 321 | 322 | 323 | 324 | Wikipedia:ページを孤立させない 325 | 332 326 | 327 | 328 | 14492895 329 | 2007-08-27T16:20:04Z 330 | 331 | Cave cattum 332 | 41235 333 | 334 | #REDIRECT [[Wikipedia:記事どうしをつなぐ]] 335 | #REDIRECT [[Wikipedia:記事どうしをつなぐ]] 336 | 337 | 338 | 339 | 明石沢貴士 340 | 335 341 | 342 | 343 | 34308655 344 | 2010-10-03T18:47:08Z 345 | 346 | EmausBot 347 | 397108 348 | 349 | 350 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 あ行#.E6.98.8E.E7.9F.B3.E6.B2.A2.E8.B2.B4.E5.A3.AB]] 351 | #転送 [[プロジェクト:漫画家/日本の漫画家 あ行#.E6.98.8E.E7.9F.B3.E6.B2.A2.E8.B2.B4.E5.A3.AB]] 352 | 353 | 354 | 355 | ここまひ 356 | 341 357 | 358 | 359 | 34308404 360 | 2010-10-03T18:16:42Z 361 | 362 | EmausBot 363 | 397108 364 | 365 | 366 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 か行#.E3.81.93.E3.81.93.E3.81.BE.E3.81.B2]] 367 | #転送 [[プロジェクト:漫画家/日本の漫画家 か行#.E3.81.93.E3.81.93.E3.81.BE.E3.81.B2]] 368 | 369 | 370 | 371 | 吉冨昭仁 372 | 354 373 | 374 | 375 | 7856042 376 | 2006-09-23T16:20:44Z 377 | 378 | Mambo95 379 | 77516 380 | 381 | [[吉富昭仁]]へのリダイレクト 382 | #REDIRECT [[吉富昭仁]] 383 | 384 | 385 | 386 | 現在のイベント 387 | 356 388 | 389 | 390 | 12796821 391 | 2007-05-28T09:42:57Z 392 | 393 | Khhy 394 | 13490 395 | 396 | #REDIRECT [[Portal:最近の出来事]] 397 | #REDIRECT [[Portal:最近の出来事]] 398 | 399 | 400 | 401 | Wikipedia:項目名のつけ方 402 | 451 403 | 404 | 405 | 2169218 406 | 2003-02-03T20:03:50Z 407 | 408 | Tomos 409 | 10 410 | 411 | #REDIRECT[[Wikipedia:記事名の付け方]] 412 | 413 | 414 | 415 | 必要とされている記事 416 | 456 417 | 418 | 419 | 2169223 420 | 2004-04-16T16:59:03Z 421 | 422 | Listener 423 | 6379 424 | 425 | Wikipedia:執筆依頼, double redirect 426 | #REDIRECT [[Wikipedia:執筆依頼]] 427 | 428 | 429 | 430 | 東京を舞台にした漫画作品 431 | 465 432 | 433 | 434 | 39222184 435 | 2011-09-16T06:55:20Z 436 | 437 | リオネル 438 | 98816 439 | 440 | [[東京を舞台にした漫画・アニメ作品]]へ統合 441 | #REDIRECT [[東京を舞台にした漫画・アニメ作品]] 442 | 443 | 444 | 445 | 必要とされている画像 446 | 467 447 | 448 | 449 | 2169230 450 | 2004-03-20T05:01:58Z 451 | 452 | Michey.M-test 453 | 3537 454 | 455 | 456 | Wikipedia:画像提供依頼 457 | #REDIRECT [[Wikipedia:画像提供依頼]] 458 | 459 | 460 | 461 | 水縞とおる 462 | 471 463 | 464 | 465 | 34308685 466 | 2010-10-03T18:50:25Z 467 | 468 | EmausBot 469 | 397108 470 | 471 | 472 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 473 | #転送 [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 474 | 475 | 476 | 477 | 恋人は守護霊!? 
478 | 472 479 | 480 | 481 | 34308641 482 | 2010-10-03T18:45:13Z 483 | 484 | EmausBot 485 | 397108 486 | 487 | 488 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 489 | #転送 [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 490 | 491 | [[Category:漫画作品 こ|いひとはしゆこれい]] 492 | [[Category:月刊コミックNORA|こいひとはしゆこれい]] 493 | 494 | 495 | 496 | ユーゴスラビア改名 497 | 504 498 | 499 | 500 | 24277258 501 | 2009-02-07T16:21:01Z 502 | 503 | Point136 504 | 211299 505 | 506 | 507 | Bot: リダイレクト構文の修正 508 | #REDIRECT [[ユーゴスラビア]] 509 | 510 | 511 | 512 | あだちつよし 513 | 519 514 | 515 | 516 | 34308384 517 | 2010-10-03T18:14:37Z 518 | 519 | EmausBot 520 | 397108 521 | 522 | 523 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 あ行#.E3.81.82.E3.81.A0.E3.81.A1.E3.81.A4.E3.82.88.E3.81.97]] 524 | #転送 [[プロジェクト:漫画家/日本の漫画家 あ行#.E3.81.82.E3.81.A0.E3.81.A1.E3.81.A4.E3.82.88.E3.81.97]] 525 | 526 | 527 | 528 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/6f/6f27f0a059481c81ea3674baef5f02dcdf93bc42.svn-base: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.BufferedReader; 19 | import java.io.BufferedWriter; 20 | import java.io.File; 21 | import java.io.FileInputStream; 22 | import java.io.FileOutputStream; 23 | import java.io.InputStreamReader; 24 | import java.io.OutputStreamWriter; 25 | import java.util.regex.Matcher; 26 | import java.util.regex.Pattern; 27 | 28 | /** 29 | * Extracts wikipedia redirect information and serializes the data. 
30 | * 31 | * @author Hideki Shima 32 | * 33 | */ 34 | public class WikipediaRedirectExtractor { 35 | 36 | private static String titlePattern = " "; 37 | private static String redirectPattern = " <redirect"; 38 | private static String textPattern = " <text xml"; 39 | private static Pattern pRedirect = Pattern.compile( 40 | "#[ ]?[^ ]+[ ]?\\[\\[(.+?)\\]\\]", Pattern.CASE_INSENSITIVE); 41 | 42 | public void run( File inputFile, File outputFile ) throws Exception { 43 | int invalidCount = 0; 44 | long t0 = System.currentTimeMillis(); 45 | FileInputStream fis = new FileInputStream( inputFile ); 46 | // TreeMap<String,String> map = new HashMap<String,String>(); 47 | InputStreamReader isr = new InputStreamReader(fis, "utf-8"); 48 | BufferedReader br = new BufferedReader(isr); 49 | FileOutputStream fos = new FileOutputStream(outputFile); 50 | OutputStreamWriter osw = new OutputStreamWriter(fos, "utf-8"); 51 | BufferedWriter bw = new BufferedWriter(osw); 52 | 53 | int count = 0; 54 | String title = null; 55 | String text = null; 56 | String line = null; 57 | boolean isRedirect = false; 58 | boolean inText = false; 59 | while ((line=br.readLine())!=null) { 60 | if (line.startsWith(titlePattern)) { 61 | title = line; 62 | text = null; 63 | isRedirect = false; 64 | } 65 | if (line.startsWith(redirectPattern)) { 66 | isRedirect = true; 67 | } 68 | if (isRedirect && (line.startsWith(textPattern) || inText)) { 69 | Matcher m = pRedirect.matcher(line); // slow regex shouldn't be used until here. 70 | if (m.find()) { // make sure the current text field contains [[...]] 71 | text = line; 72 | try { 73 | title = cleanupTitle(title); 74 | String redirectedTitle = m.group(1); 75 | if ( isValidAlias(title, redirectedTitle) ) { 76 | bw.write( title+"\t"+redirectedTitle+"\n" ); 77 | count++; 78 | // map.put( title, redirectedTitle ); 79 | } else { 80 | invalidCount++; 81 | } 82 | } catch ( StringIndexOutOfBoundsException e ) { 83 | System.out.println("ERROR: cannot extract redirection from title = "+title+", text = "+text); 84 | e.printStackTrace(); 85 | } 86 | } else { // Very rare case 87 | inText = true; 88 | } 89 | } 90 | } 91 | br.close(); 92 | isr.close(); 93 | fis.close(); 94 | 95 | bw.close(); 96 | osw.close(); 97 | fos.close(); 98 | System.out.println("---- Wikipedia redirect extraction done ----"); 99 | long t1 = System.currentTimeMillis(); 100 | // IOUtil.save( map ); 101 | System.out.println("Discarded "+invalidCount+" redirects to wikipedia meta articles."); 102 | System.out.println("Extracted "+count+" redirects."); 103 | System.out.println("Saved output: "+outputFile.getAbsolutePath()); 104 | System.out.println("Done in "+((t1-t0)/1000)+" sec."); 105 | } 106 | 107 | private String cleanupTitle( String title ) { 108 | int end = title.indexOf(""); 109 | return end!=-1?title.substring(titlePattern.length(), end):title; 110 | } 111 | 112 | /** 113 | * Identifies if the redirection is valid. 114 | * Currently, we only check if the redirection is related to 115 | * a special Wikipedia page or not. 116 | * 117 | * TODO: write more rules to discard more invalid redirects. 
118 | * 119 | * @param title source title 120 | * @param redirectedTitle target title 121 | * @return validity 122 | */ 123 | private boolean isValidAlias( String title, String redirectedTitle ) { 124 | if ( title.startsWith("Wikipedia:") 125 | || title.startsWith("Template:") 126 | || title.startsWith("Portal:") 127 | || title.startsWith("List of ")) { 128 | return false; 129 | } 130 | return true; 131 | } 132 | 133 | public static void main(String[] args) throws Exception { 134 | if (args.length!=1) { 135 | System.err.println("ERROR: Please specify the path to the wikipedia article xml file as the argument."); 136 | System.err.println("Tips: enclose the path with double quotes if a space exists in the path."); 137 | return; 138 | } 139 | File inputFile = new File(args[0]); 140 | if (!inputFile.exists() || inputFile.isDirectory()) { 141 | System.err.println("ERROR: File not found at "+inputFile.getAbsolutePath()); 142 | return; 143 | } 144 | String prefix = inputFile.getName().replaceFirst("-.*", ""); 145 | File outputDir = new File("target"); 146 | if (!outputDir.exists()) { 147 | outputDir.mkdirs(); 148 | } 149 | File outputFile = new File(outputDir, prefix+"-redirect.txt"); 150 | new WikipediaRedirectExtractor().run( inputFile, outputFile ); 151 | } 152 | } 153 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/81/8159a4c51d3239796584c79b7550d7c49972f4d8.svn-base: -------------------------------------------------------------------------------- 1 | #Tue Oct 11 01:41:50 EDT 2011 2 | eclipse.preferences.version=1 3 | org.eclipse.jdt.core.compiler.codegen.inlineJsrBytecode=enabled 4 | org.eclipse.jdt.core.compiler.codegen.targetPlatform=1.5 5 | org.eclipse.jdt.core.compiler.codegen.unusedLocal=preserve 6 | org.eclipse.jdt.core.compiler.compliance=1.5 7 | org.eclipse.jdt.core.compiler.debug.lineNumber=generate 8 | org.eclipse.jdt.core.compiler.debug.localVariable=generate 9 | org.eclipse.jdt.core.compiler.debug.sourceFile=generate 10 | org.eclipse.jdt.core.compiler.problem.assertIdentifier=error 11 | org.eclipse.jdt.core.compiler.problem.enumIdentifier=error 12 | org.eclipse.jdt.core.compiler.source=1.5 13 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/84/84bf19ebd162073a34792b87aa400129a6068865.svn-base: -------------------------------------------------------------------------------- 1 | 2 | 3 | edu.cmu.lti.wikipedia_redirect 4 | 5 | 6 | 7 | 8 | 9 | org.eclipse.jdt.core.javabuilder 10 | 11 | 12 | 13 | 14 | 15 | org.eclipse.jdt.core.javanature 16 | 17 | 18 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/b2/b2f1c3203bdbb9cbbc7b334f031504dcfa465b61.svn-base: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. 
See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.File; 19 | import java.io.Serializable; 20 | import java.util.ArrayList; 21 | import java.util.HashMap; 22 | import java.util.List; 23 | 24 | /** 25 | * Represents the wikipedia hypernym data e.g. ones generated by 26 | * NICT's "Hyponymy extraction tool" 27 | * 28 | * @author Hideki Shima 29 | */ 30 | public class WikipediaHypernym extends HashMap> 31 | implements Serializable { 32 | 33 | private static final long serialVersionUID = 20111019L; 34 | 35 | public WikipediaHypernym( int size ) { 36 | // RAM (heap) efficient capacity setting 37 | super( size * 4 / 3 + 1 ); 38 | } 39 | 40 | public void load( File file ) throws Exception { 41 | WikipediaHypernym wh = IOUtil.loadWikipediaHypernym(file); 42 | for ( String key : wh.keySet() ) { 43 | List thisList = get(key); 44 | List newList = wh.get(key); 45 | if ( thisList != null ) { 46 | thisList.addAll( newList ); 47 | } else { 48 | thisList = new ArrayList( newList ); 49 | } 50 | put(key, thisList); 51 | } 52 | } 53 | 54 | } 55 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/c2/c2d8aecb47cbf4a0d2ebc3c5eb42630ab7999559.svn-base: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/f8/f8a0a6e41c412aed301f3bd515ea356240c9cc44.svn-base: -------------------------------------------------------------------------------- 1 | package edu.cmu.lti.wikipedia_redirect; 2 | import java.io.File; 3 | import java.util.Set; 4 | 5 | /** 6 | * Demo of what you can do with Wikipedia Redirect. 7 | * @author Hideki Shima 8 | */ 9 | public class Demo { 10 | private static String[] enSrcTerms = {"Bin Ladin", "William Henry Gates", 11 | "JFK", "The Steel City", "The City of Bridges", "Da burgh", "Hoagie", 12 | "Centre", "3.14"}; 13 | private static String[] jaSrcTerms = {"オサマビンラディン", "オサマ・ビンラーディン", 14 | "東日本大地震","東日本太平洋沖地震" ,"NACSIS", 15 | "ダイアモンド", "アボガド", "バイオリン", "平成12年", "3.14"}; 16 | private static String enTarget = "Bayesian network"; 17 | private static String jaTarget = "計算機科学"; 18 | 19 | public static void main(String[] args) throws Exception { 20 | // Initialization 21 | System.out.print("Loading Wikipedia Redirect ..."); 22 | long t0 = System.currentTimeMillis(); 23 | File inputFile = new File(args[0]); 24 | WikipediaRedirect wr = IOUtil.loadWikipediaRedirect(inputFile); 25 | boolean useJapaneseExample = inputFile.getName().substring(0, 2).equals("ja"); 26 | String[] srcTerms = useJapaneseExample ? jaSrcTerms : enSrcTerms; 27 | String target = useJapaneseExample ? jaTarget : enTarget; 28 | long t1 = System.currentTimeMillis(); 29 | System.out.println(" done in "+(t1-t0)/1000D+" sec.\n"); 30 | 31 | // Let's find a redirection given a source word. 32 | StringBuilder sb = new StringBuilder(); 33 | for ( String src : srcTerms ) { 34 | sb.append("redirect(\""+src+"\") = \""+wr.get(src)+"\"\n"); 35 | } 36 | long t2 = System.currentTimeMillis(); 37 | System.out.println(sb.toString()+"Done in "+(t2-t1)/1000D+" sec.\n--\n"); 38 | 39 | // Let's find which source words redirect to the given target word. 
40 | Set keys = wr.getKeysByValue(target); 41 | long t3 = System.currentTimeMillis(); 42 | System.out.println("All of the following redirect to \""+target+"\":\n"+keys); 43 | System.out.println("Done in "+(t3-t2)/1000D+" sec.\n"); 44 | } 45 | } 46 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/pristine/fb/fbba7c603f44332bd67313432986b2f97da47014.svn-base: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.Serializable; 19 | import java.util.HashMap; 20 | import java.util.LinkedHashSet; 21 | import java.util.Map; 22 | import java.util.Map.Entry; 23 | import java.util.Set; 24 | 25 | /** 26 | * Represents the wikipedia redirect data. 27 | * 28 | * Things you should know: key-value is one-to-many in Wikipedia Redirect. 29 | * Let's denote X -> Y when a source term X redirects to the term B. 30 | * X is unique in the entire Wikipedia Redirect data set, but Y is not. 31 | * In other words, there exists a Y such that X -> Y and X' -> Y. 32 | * 33 | * @author Hideki Shima 34 | * 35 | */ 36 | public class WikipediaRedirect extends HashMap 37 | implements Serializable { 38 | //Do we need case insensitive hash map? C.f. http://www.coderanch.com/t/385950/java/java/HashMap-key-case-insensitivity 39 | 40 | private static final long serialVersionUID = 20111008L; 41 | 42 | public WikipediaRedirect() { 43 | super(); 44 | } 45 | 46 | public WikipediaRedirect( int size ) { 47 | // RAM (heap) efficient capacity setting 48 | super( size * 4 / 3 + 1 ); 49 | } 50 | 51 | public WikipediaRedirect( Map map ) { 52 | super( map ); 53 | } 54 | 55 | /** 56 | * Get keys in the map such that the value equals to the given value. 57 | * 58 | * @param value 59 | * @return keys 60 | */ 61 | public Set getKeysByValue(String value) { 62 | Set results = new LinkedHashSet(); 63 | //Iterating through all items is slow. 64 | //TODO: use existing library for faster access e.g. guava. 
65 | for (Entry entry : entrySet()) { 66 | if (value.equals(entry.getValue())) { 67 | results.add(entry.getKey()); 68 | } 69 | } 70 | return results; 71 | } 72 | } 73 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/.svn/wc.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hasibi/TAGME-Reproducibility/d21ed0d826fc60a6e4caaa5ec7b6c39e16f7c6c6/lib/edu.cmu.lti.wikipedia_redirect/.svn/wc.db -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/README.txt: -------------------------------------------------------------------------------- 1 | See http://code.google.com/p/wikipedia-redirect -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/launches/Demo.launch: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/launches/WikipediaRedirectExtractor.launch: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/src/edu/cmu/lti/wikipedia_redirect/Demo.java: -------------------------------------------------------------------------------- 1 | package edu.cmu.lti.wikipedia_redirect; 2 | import java.io.File; 3 | import java.util.Set; 4 | 5 | /** 6 | * Demo of what you can do with Wikipedia Redirect. 7 | * @author Hideki Shima 8 | */ 9 | public class Demo { 10 | private static String[] enSrcTerms = {"Bin Ladin", "William Henry Gates", 11 | "JFK", "The Steel City", "The City of Bridges", "Da burgh", "Hoagie", 12 | "Centre", "3.14"}; 13 | private static String[] jaSrcTerms = {"オサマビンラディン", "オサマ・ビンラーディン", 14 | "東日本大地震","東日本太平洋沖地震" ,"NACSIS", 15 | "ダイアモンド", "アボガド", "バイオリン", "平成12年", "3.14"}; 16 | private static String enTarget = "Bayesian network"; 17 | private static String jaTarget = "計算機科学"; 18 | 19 | public static void main(String[] args) throws Exception { 20 | // Initialization 21 | System.out.print("Loading Wikipedia Redirect ..."); 22 | long t0 = System.currentTimeMillis(); 23 | File inputFile = new File(args[0]); 24 | WikipediaRedirect wr = IOUtil.loadWikipediaRedirect(inputFile); 25 | boolean useJapaneseExample = inputFile.getName().substring(0, 2).equals("ja"); 26 | String[] srcTerms = useJapaneseExample ? jaSrcTerms : enSrcTerms; 27 | String target = useJapaneseExample ? jaTarget : enTarget; 28 | long t1 = System.currentTimeMillis(); 29 | System.out.println(" done in "+(t1-t0)/1000D+" sec.\n"); 30 | 31 | // Let's find a redirection given a source word. 32 | StringBuilder sb = new StringBuilder(); 33 | for ( String src : srcTerms ) { 34 | sb.append("redirect(\""+src+"\") = \""+wr.get(src)+"\"\n"); 35 | } 36 | long t2 = System.currentTimeMillis(); 37 | System.out.println(sb.toString()+"Done in "+(t2-t1)/1000D+" sec.\n--\n"); 38 | 39 | // Let's find which source words redirect to the given target word. 
40 | Set keys = wr.getKeysByValue(target); 41 | long t3 = System.currentTimeMillis(); 42 | System.out.println("All of the following redirect to \""+target+"\":\n"+keys); 43 | System.out.println("Done in "+(t3-t2)/1000D+" sec.\n"); 44 | } 45 | } 46 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/src/edu/cmu/lti/wikipedia_redirect/IOUtil.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.BufferedReader; 19 | import java.io.BufferedWriter; 20 | import java.io.File; 21 | import java.io.FileInputStream; 22 | import java.io.FileOutputStream; 23 | import java.io.FileReader; 24 | import java.io.InputStreamReader; 25 | import java.io.LineNumberReader; 26 | import java.io.ObjectInputStream; 27 | import java.io.ObjectOutputStream; 28 | import java.io.OutputStreamWriter; 29 | import java.util.AbstractMap; 30 | import java.util.ArrayList; 31 | import java.util.List; 32 | import java.util.Map.Entry; 33 | 34 | /** 35 | * Reads and writes wikipedia redirect data. 
36 | * 37 | * @author Hideki Shima 38 | * 39 | */ 40 | public class IOUtil { 41 | 42 | /** 43 | * Save Wikipedia redirect data 44 | * 45 | * @param redirectData 46 | * map where key is original term and value is redirected term 47 | * @throws Exception 48 | */ 49 | public static void save( AbstractMap map ) throws Exception { 50 | File outputDir = new File("target"); 51 | if (!outputDir.exists()) { 52 | outputDir.mkdirs(); 53 | } 54 | WikipediaRedirect wr = new WikipediaRedirect( map ); 55 | saveText( wr, outputDir ); 56 | saveSerialized( wr, outputDir ); 57 | } 58 | 59 | /** 60 | * Save Wikipedia redirect data into tab separated text file 61 | * 62 | * @param redirectData 63 | * map where key is original term and value is redirected term 64 | * @throws Exception 65 | */ 66 | private static void saveText( WikipediaRedirect wr, File outputDir ) throws Exception { 67 | File txtFile = new File(outputDir, "wikipedia_redirect.txt"); 68 | FileOutputStream fosTxt = new FileOutputStream(txtFile); 69 | OutputStreamWriter osw = new OutputStreamWriter(fosTxt, "utf-8"); 70 | BufferedWriter bw = new BufferedWriter(osw); 71 | for ( Entry entry : wr.entrySet() ) { 72 | bw.write( entry.getKey()+"\t"+entry.getValue()+"\n" ); 73 | } 74 | bw.close(); 75 | osw.close(); 76 | fosTxt.close(); 77 | System.out.println("Saved redirect data in text format: "+txtFile.getAbsolutePath()); 78 | } 79 | 80 | /** 81 | * Save Wikipedia redirect data into serialized object 82 | * 83 | * @param redirectData 84 | * map where key is original term and value is redirected term 85 | * @throws Exception 86 | */ 87 | private static void saveSerialized( WikipediaRedirect wr, File outputDir ) throws Exception { 88 | File objFile = new File(outputDir, "wikipedia_redirect.ser"); 89 | FileOutputStream fosObj = new FileOutputStream(objFile); 90 | ObjectOutputStream outObject = new ObjectOutputStream(fosObj); 91 | outObject.writeObject(wr); 92 | outObject.close(); 93 | fosObj.close(); 94 | System.out.println("Serialized redirect data: "+objFile.getAbsolutePath()); 95 | } 96 | 97 | /** 98 | * Deserializes wikipedia redirect data 99 | * @param file 100 | * serialized object or tab-separated text 101 | * @return wikipedia redirect 102 | * @throws Exception 103 | */ 104 | public static WikipediaRedirect loadWikipediaRedirect( File f ) throws Exception { 105 | if (!f.exists() || f.isDirectory()) { 106 | System.err.println("File not found: "+f.getAbsolutePath()); 107 | System.exit(-1); 108 | } 109 | if ( f.getName().endsWith(".ser") ) { 110 | return loadWikipediaRedirectFromSerialized( f ); 111 | } else { 112 | //faster than above? 
113 | return loadWikipediaRedirectFromText( f ); 114 | } 115 | } 116 | 117 | /** 118 | * Deserializes wikipedia redirect data from serialized object data 119 | * @param file 120 | * serialized object 121 | * @return wikipedia redirect 122 | * @throws Exception 123 | */ 124 | private static WikipediaRedirect loadWikipediaRedirectFromSerialized( File f ) throws Exception { 125 | WikipediaRedirect object; 126 | try { 127 | FileInputStream inFile = new FileInputStream(f); 128 | ObjectInputStream inObject = new ObjectInputStream(inFile); 129 | object = (WikipediaRedirect)inObject.readObject(); 130 | inObject.close(); 131 | inFile.close(); 132 | } catch (Exception e) { 133 | throw e; 134 | } 135 | return object; 136 | } 137 | 138 | /** 139 | * Deserializes wikipedia redirect data from tab-separated text file 140 | * @param file 141 | * tab-separated text 142 | * @return wikipedia redirect 143 | * @throws Exception 144 | */ 145 | private static WikipediaRedirect loadWikipediaRedirectFromText( File f ) throws Exception { 146 | int size = (int)countLineNumber(f); 147 | WikipediaRedirect wr = new WikipediaRedirect( size ); 148 | try { 149 | FileInputStream fis = new FileInputStream( f ); 150 | InputStreamReader isr = new InputStreamReader( fis ); 151 | BufferedReader br = new BufferedReader( isr ); 152 | String line = null; 153 | while ( (line = br.readLine()) != null ) { 154 | String[] elements = line.split("\t"); 155 | wr.put( elements[0], elements[1] ); 156 | } 157 | br.close(); 158 | isr.close(); 159 | fis.close(); 160 | } catch (Exception e) { 161 | throw e; 162 | } 163 | return wr; 164 | } 165 | 166 | /** 167 | * Loads tab separated data as an alternative way to load() method. 168 | * Works for Wikipedia hypernym data generated by 169 | * NICT's "Hyponymy extraction tool" 170 | * 171 | * @param file 172 | * tab separated file that contains lines that look "word1[TAB]word2[BR]" 173 | * @return wikipedia redirect 174 | * @throws Exception 175 | */ 176 | public static WikipediaHypernym loadWikipediaHypernym( File f ) throws Exception { 177 | int size = (int)IOUtil.countLineNumber( f ); 178 | WikipediaHypernym object = new WikipediaHypernym( size ); 179 | try { 180 | FileInputStream inFile = new FileInputStream( f ); 181 | InputStreamReader isr = new InputStreamReader( inFile ); 182 | BufferedReader br = new BufferedReader( isr ); 183 | String line = null; 184 | while ( (line = br.readLine())!=null ) { 185 | String[] tokens = line.split("\t"); 186 | if (tokens.length<=1) { 187 | continue; 188 | } 189 | String key = tokens[0]; 190 | List targets = object.get(key); 191 | if ( targets==null ) { 192 | targets = new ArrayList(); 193 | } 194 | targets.add(tokens[1]); 195 | object.put(key, targets); 196 | } 197 | br.close(); 198 | isr.close(); 199 | inFile.close(); 200 | } catch (Exception e) { 201 | throw e; 202 | } 203 | return object; 204 | } 205 | 206 | /** 207 | * Count number of lines in a file in an efficient way 208 | * @param f 209 | * @return 210 | * @throws Exception 211 | */ 212 | public static long countLineNumber( File f ) throws Exception { 213 | LineNumberReader lnr = new LineNumberReader(new FileReader(f)); 214 | lnr.skip(Long.MAX_VALUE); 215 | int count = lnr.getLineNumber(); 216 | lnr.close(); 217 | return count; 218 | } 219 | } 220 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/src/edu/cmu/lti/wikipedia_redirect/WikipediaHypernym.java: 
-------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.File; 19 | import java.io.Serializable; 20 | import java.util.ArrayList; 21 | import java.util.HashMap; 22 | import java.util.List; 23 | 24 | /** 25 | * Represents the wikipedia hypernym data e.g. ones generated by 26 | * NICT's "Hyponymy extraction tool" 27 | * 28 | * @author Hideki Shima 29 | */ 30 | public class WikipediaHypernym extends HashMap> 31 | implements Serializable { 32 | 33 | private static final long serialVersionUID = 20111019L; 34 | 35 | public WikipediaHypernym( int size ) { 36 | // RAM (heap) efficient capacity setting 37 | super( size * 4 / 3 + 1 ); 38 | } 39 | 40 | public void load( File file ) throws Exception { 41 | WikipediaHypernym wh = IOUtil.loadWikipediaHypernym(file); 42 | for ( String key : wh.keySet() ) { 43 | List thisList = get(key); 44 | List newList = wh.get(key); 45 | if ( thisList != null ) { 46 | thisList.addAll( newList ); 47 | } else { 48 | thisList = new ArrayList( newList ); 49 | } 50 | put(key, thisList); 51 | } 52 | } 53 | 54 | } 55 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/src/edu/cmu/lti/wikipedia_redirect/WikipediaRedirect.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.Serializable; 19 | import java.util.HashMap; 20 | import java.util.LinkedHashSet; 21 | import java.util.Map; 22 | import java.util.Map.Entry; 23 | import java.util.Set; 24 | 25 | /** 26 | * Represents the wikipedia redirect data. 27 | * 28 | * Things you should know: key-value is one-to-many in Wikipedia Redirect. 29 | * Let's denote X -> Y when a source term X redirects to the term B. 30 | * X is unique in the entire Wikipedia Redirect data set, but Y is not. 31 | * In other words, there exists a Y such that X -> Y and X' -> Y. 32 | * 33 | * @author Hideki Shima 34 | * 35 | */ 36 | public class WikipediaRedirect extends HashMap 37 | implements Serializable { 38 | //Do we need case insensitive hash map? C.f. 
http://www.coderanch.com/t/385950/java/java/HashMap-key-case-insensitivity 39 | 40 | private static final long serialVersionUID = 20111008L; 41 | 42 | public WikipediaRedirect() { 43 | super(); 44 | } 45 | 46 | public WikipediaRedirect( int size ) { 47 | // RAM (heap) efficient capacity setting 48 | super( size * 4 / 3 + 1 ); 49 | } 50 | 51 | public WikipediaRedirect( Map map ) { 52 | super( map ); 53 | } 54 | 55 | /** 56 | * Get keys in the map such that the value equals to the given value. 57 | * 58 | * @param value 59 | * @return keys 60 | */ 61 | public Set getKeysByValue(String value) { 62 | Set results = new LinkedHashSet(); 63 | //Iterating through all items is slow. 64 | //TODO: use existing library for faster access e.g. guava. 65 | for (Entry entry : entrySet()) { 66 | if (value.equals(entry.getValue())) { 67 | results.add(entry.getKey()); 68 | } 69 | } 70 | return results; 71 | } 72 | } 73 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/src/edu/cmu/lti/wikipedia_redirect/WikipediaRedirectExtractor.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Carnegie Mellon University 3 | * Licensed under the Apache License, Version 2.0 (the "License"); 4 | * you may not use this file except in compliance with the License. 5 | * You may obtain a copy of the License at 6 | * 7 | * http://www.apache.org/licenses/LICENSE-2.0 8 | * 9 | * Unless required by applicable law or agreed to in writing, 10 | * software distributed under the License is distributed on an 11 | * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 12 | * KIND, either express or implied. See the License for the 13 | * specific language governing permissions and limitations 14 | * under the License. 15 | */ 16 | package edu.cmu.lti.wikipedia_redirect; 17 | 18 | import java.io.BufferedReader; 19 | import java.io.BufferedWriter; 20 | import java.io.File; 21 | import java.io.FileInputStream; 22 | import java.io.FileOutputStream; 23 | import java.io.InputStreamReader; 24 | import java.io.OutputStreamWriter; 25 | import java.util.regex.Matcher; 26 | import java.util.regex.Pattern; 27 | 28 | /** 29 | * Extracts wikipedia redirect information and serializes the data. 
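 * Output format (one line per valid redirect): the source title, a tab, then the redirect
 * target, e.g. "HomePage<TAB>ホームページ" for the sample ja-wiki dump under test-data.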
30 | * 31 | * @author Hideki Shima 32 | * 33 | */ 34 | public class WikipediaRedirectExtractor { 35 | 36 | private static String titlePattern = " "; 37 | private static String redirectPattern = " <redirect"; 38 | private static String textPattern = " <text xml"; 39 | private static Pattern pRedirect = Pattern.compile( 40 | "#[ ]?[^ ]+[ ]?\\[\\[(.+?)\\]\\]", Pattern.CASE_INSENSITIVE); 41 | 42 | public void run( File inputFile, File outputFile ) throws Exception { 43 | int invalidCount = 0; 44 | long t0 = System.currentTimeMillis(); 45 | FileInputStream fis = new FileInputStream( inputFile ); 46 | // TreeMap<String,String> map = new HashMap<String,String>(); 47 | InputStreamReader isr = new InputStreamReader(fis, "utf-8"); 48 | BufferedReader br = new BufferedReader(isr); 49 | FileOutputStream fos = new FileOutputStream(outputFile); 50 | OutputStreamWriter osw = new OutputStreamWriter(fos, "utf-8"); 51 | BufferedWriter bw = new BufferedWriter(osw); 52 | 53 | int count = 0; 54 | String title = null; 55 | String text = null; 56 | String line = null; 57 | boolean isRedirect = false; 58 | boolean inText = false; 59 | while ((line=br.readLine())!=null) { 60 | if (line.startsWith(titlePattern)) { 61 | title = line; 62 | text = null; 63 | isRedirect = false; 64 | } 65 | if (line.startsWith(redirectPattern)) { 66 | isRedirect = true; 67 | } 68 | if (isRedirect && (line.startsWith(textPattern) || inText)) { 69 | Matcher m = pRedirect.matcher(line); // slow regex shouldn't be used until here. 70 | if (m.find()) { // make sure the current text field contains [[...]] 71 | text = line; 72 | try { 73 | title = cleanupTitle(title); 74 | String redirectedTitle = m.group(1); 75 | if ( isValidAlias(title, redirectedTitle) ) { 76 | bw.write( title+"\t"+redirectedTitle+"\n" ); 77 | count++; 78 | // map.put( title, redirectedTitle ); 79 | } else { 80 | invalidCount++; 81 | } 82 | } catch ( StringIndexOutOfBoundsException e ) { 83 | System.out.println("ERROR: cannot extract redirection from title = "+title+", text = "+text); 84 | e.printStackTrace(); 85 | } 86 | } else { // Very rare case 87 | inText = true; 88 | } 89 | } 90 | } 91 | br.close(); 92 | isr.close(); 93 | fis.close(); 94 | 95 | bw.close(); 96 | osw.close(); 97 | fos.close(); 98 | System.out.println("---- Wikipedia redirect extraction done ----"); 99 | long t1 = System.currentTimeMillis(); 100 | // IOUtil.save( map ); 101 | System.out.println("Discarded "+invalidCount+" redirects to wikipedia meta articles."); 102 | System.out.println("Extracted "+count+" redirects."); 103 | System.out.println("Saved output: "+outputFile.getAbsolutePath()); 104 | System.out.println("Done in "+((t1-t0)/1000)+" sec."); 105 | } 106 | 107 | private String cleanupTitle( String title ) { 108 | int end = title.indexOf(""); 109 | return end!=-1?title.substring(titlePattern.length(), end):title; 110 | } 111 | 112 | /** 113 | * Identifies if the redirection is valid. 114 | * Currently, we only check if the redirection is related to 115 | * a special Wikipedia page or not. 116 | * 117 | * TODO: write more rules to discard more invalid redirects. 
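 * (Only the "Wikipedia:", "Template:", "Portal:" and "List of " prefixes are checked below;
 * other meta namespaces such as "Category:", "Help:" or "File:" could plausibly be added.)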
118 | * 119 | * @param title source title 120 | * @param redirectedTitle target title 121 | * @return validity 122 | */ 123 | private boolean isValidAlias( String title, String redirectedTitle ) { 124 | if ( title.startsWith("Wikipedia:") 125 | || title.startsWith("Template:") 126 | || title.startsWith("Portal:") 127 | || title.startsWith("List of ")) { 128 | return false; 129 | } 130 | return true; 131 | } 132 | 133 | public static void main(String[] args) throws Exception { 134 | if (args.length!=1) { 135 | System.err.println("ERROR: Please specify the path to the wikipedia article xml file as the argument."); 136 | System.err.println("Tips: enclose the path with double quotes if a space exists in the path."); 137 | return; 138 | } 139 | File inputFile = new File(args[0]); 140 | if (!inputFile.exists() || inputFile.isDirectory()) { 141 | System.err.println("ERROR: File not found at "+inputFile.getAbsolutePath()); 142 | return; 143 | } 144 | String prefix = inputFile.getName().replaceFirst("-.*", ""); 145 | File outputDir = new File("target"); 146 | if (!outputDir.exists()) { 147 | outputDir.mkdirs(); 148 | } 149 | File outputFile = new File(outputDir, prefix+"-redirect.txt"); 150 | new WikipediaRedirectExtractor().run( inputFile, outputFile ); 151 | } 152 | } 153 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/test-data/sample-jawiki-latest-pages-articles.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | Wikipedia 4 | http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 5 | MediaWiki 1.17wmf1 6 | first-letter 7 | 8 | メディア 9 | 特別 10 | 11 | ノート 12 | 利用者 13 | 利用者‐会話 14 | Wikipedia 15 | Wikipedia‐ノート 16 | ファイル 17 | ファイル‐ノート 18 | MediaWiki 19 | MediaWiki‐ノート 20 | Template 21 | Template‐ノート 22 | Help 23 | Help‐ノート 24 | Category 25 | Category‐ノート 26 | Portal 27 | Portal‐ノート 28 | プロジェクト 29 | プロジェクト‐ノート 30 | 31 | 32 | 33 | Wikipedia:Sandbox 34 | 6 35 | 36 | 37 | 36654478 38 | 2011-03-06T16:16:58Z 39 | 40 | Y-dash 41 | 309126 42 | 43 | テストは[[Wikipedia:サンドボックス]]でお願いいたします。 / [[Special:Contributions/Kompek|Kompek]] ([[User talk:Kompek|会話]]) による ID:36654304 の版を[[H:RV|取り消し]] 44 | #REDIRECT [[Wikipedia:サンドボックス]] 45 | 46 | 47 | 48 | SandBox 49 | 26 50 | 51 | 52 | 6986090 53 | 2006-08-05T23:25:48Z 54 | 55 | Nevylax 56 | 38464 57 | 58 | #REDIRECT [[サンドボックス]] 59 | #REDIRECT [[サンドボックス]] 60 | 61 | 62 | 63 | HomePage 64 | 46 65 | 66 | 67 | 2168894 68 | 2005-03-22T13:49:43Z 69 | 70 | Hideyuki 71 | 9577 72 | 73 | #REDIRECT [[ホームページ]] 74 | #REDIRECT [[ホームページ]] 75 | 76 | 77 | 78 | Wikipedia:About 79 | 51 80 | 81 | 82 | 19962101 83 | 2008-05-31T01:12:46Z 84 | 85 | Kanjy 86 | 36859 87 | 88 | 89 | 2003-03-06T11:35:13Z Setu さん版 (#REDIRECT [[Wikipedia:ウィキペディアについて]]) に戻す 90 | #REDIRECT [[Wikipedia:ウィキペディアについて]] 91 | 92 | 93 | 94 | Wikipedia:How does one edit a page 95 | 85 96 | 97 | 98 | 13206183 99 | 2007-06-19T05:06:43Z 100 | 101 | Aotake 102 | 34929 103 | 104 | 105 | redirect target 106 | #REDIRECT [[Help:ページの編集]] 107 | 108 | 109 | 110 | ワールド・ミュージック 111 | 113 112 | 113 | 114 | 24277249 115 | 2009-02-07T16:20:20Z 116 | 117 | Point136 118 | 211299 119 | 120 | 121 | Bot: リダイレクト構文の修正 122 | #REDIRECT [[ワールドミュージック]] 123 | 124 | 125 | 126 | ネマティック相 127 | 127 128 | 129 | 130 | 24277255 131 | 2009-02-07T16:20:40Z 132 | 133 | Point136 134 | 211299 135 | 136 | 137 | Bot: リダイレクト構文の修正 138 | #REDIRECT [[ネマティック液晶]] 139 | 140 | 141 | 142 | スメクティック相 143 | 128 144 | 145 | 146 | 2168972 
147 | 2004-01-07T09:45:14Z 148 | 149 | Yas 150 | 739 151 | 152 | 153 | #REDIRECT [[液晶]] 154 | #REDIRECT [[液晶]] 155 | 156 | 157 | 158 | ミュージシャン一覧 (個人) 159 | 143 160 | 161 | 162 | 35399365 163 | 2010-12-14T00:40:05Z 164 | 165 | Xqbot 166 | 273540 167 | 168 | 169 | ロボットによる: 二重リダイレクト修正 → [[音楽家の一覧]] 170 | #転送 [[音楽家の一覧]] 171 | 172 | 173 | 174 | 病名 175 | 176 176 | 177 | 178 | 17793766 179 | 2008-02-04T13:37:54Z 180 | 181 | U3002 182 | 66126 183 | 184 | 185 | 二重リダイレクト回避 186 | #REDIRECT [[病気の別名の一覧]] 187 | 188 | 189 | 190 | Wikipedia:Welcome, newcomers 191 | 216 192 | 193 | 194 | 10662242 195 | 2007-02-15T13:40:22Z 196 | 197 | Cave cattum 198 | 41235 199 | 200 | #REDIRECT [[Wikipedia:ウィキペディアへようこそ]] 201 | #REDIRECT [[Wikipedia:ウィキペディアへようこそ]] 202 | 203 | 204 | 205 | 黒人霊歌 206 | 260 207 | 208 | 209 | 22493441 210 | 2008-10-23T05:59:07Z 211 | 212 | Buzin Satuma Hayato 213 | 243768 214 | 215 | 216 | 黒人霊歌はスピリチュアル(音楽)だと思う 217 | #REDIRECT [[スピリチュアル#スピリチュアル(音楽)]] 218 | 219 | 220 | 221 | Wikipedia:漢字やスペルに注意 222 | 281 223 | 224 | 225 | 13451992 226 | 2007-07-02T14:15:39Z 227 | 228 | Cave cattum 229 | 41235 230 | 231 | [[WP:AES|←]][[Wikipedia:記事を執筆する]]へのリダイレクト 232 | #REDIRECT [[Wikipedia:記事を執筆する]] 233 | 234 | 235 | 236 | Wikipedia:他言語の使用は控えめに 237 | 283 238 | 239 | 240 | 15761853 241 | 2007-10-27T06:12:35Z 242 | 243 | Khhy 244 | 13490 245 | 246 | #他言語表記は控えめに 247 | #REDIRECT [[Wikipedia:素晴らしい記事を書くには#他言語表記は控えめに]] 248 | 249 | 250 | 251 | Wikipedia:日本語表記法 252 | 291 253 | 254 | 255 | 12840559 256 | 2007-05-30T14:59:17Z 257 | 258 | Aotake 259 | 34929 260 | 261 | [[Wikipedia:表記ガイド]]へ統合。 262 | #REDIRECT [[Wikipedia:表記ガイド]] 263 | 264 | 265 | 266 | Wikipedia:リダイレクトの使い方 267 | 308 268 | 269 | 270 | 2169127 271 | 2003-11-30T04:09:10Z 272 | 273 | 219.164.91.166 274 | 275 | #REDIRECT [[Wikipedia:リダイレクト]] 276 | #REDIRECT [[Wikipedia:リダイレクト]] 277 | 278 | 279 | 280 | アルコール飲料 281 | 318 282 | 283 | 284 | 15885863 285 | 2007-11-02T12:42:36Z 286 | 287 | Balmung0731 288 | 99201 289 | 290 | [[酒]]へ統合 291 | #REDIRECT [[酒]] 292 | 293 | 294 | 295 | 地学 296 | 321 297 | 298 | 299 | 2169138 300 | 2003-09-26T11:33:29Z 301 | 302 | 133.11.230.18 303 | 304 | #REDIRECT [[地球科学]] 305 | 306 | 307 | 308 | Wikipedia:ノートページのレイアウト 309 | 323 310 | 311 | 312 | 36028694 313 | 2011-01-24T11:38:30Z 314 | 315 | Kurz 316 | 1601 317 | 318 | 319 | lk 320 | #REDIRECT [[Wikipedia:ノートページのガイドライン]] 321 | 322 | 323 | 324 | Wikipedia:ページを孤立させない 325 | 332 326 | 327 | 328 | 14492895 329 | 2007-08-27T16:20:04Z 330 | 331 | Cave cattum 332 | 41235 333 | 334 | #REDIRECT [[Wikipedia:記事どうしをつなぐ]] 335 | #REDIRECT [[Wikipedia:記事どうしをつなぐ]] 336 | 337 | 338 | 339 | 明石沢貴士 340 | 335 341 | 342 | 343 | 34308655 344 | 2010-10-03T18:47:08Z 345 | 346 | EmausBot 347 | 397108 348 | 349 | 350 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 あ行#.E6.98.8E.E7.9F.B3.E6.B2.A2.E8.B2.B4.E5.A3.AB]] 351 | #転送 [[プロジェクト:漫画家/日本の漫画家 あ行#.E6.98.8E.E7.9F.B3.E6.B2.A2.E8.B2.B4.E5.A3.AB]] 352 | 353 | 354 | 355 | ここまひ 356 | 341 357 | 358 | 359 | 34308404 360 | 2010-10-03T18:16:42Z 361 | 362 | EmausBot 363 | 397108 364 | 365 | 366 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 か行#.E3.81.93.E3.81.93.E3.81.BE.E3.81.B2]] 367 | #転送 [[プロジェクト:漫画家/日本の漫画家 か行#.E3.81.93.E3.81.93.E3.81.BE.E3.81.B2]] 368 | 369 | 370 | 371 | 吉冨昭仁 372 | 354 373 | 374 | 375 | 7856042 376 | 2006-09-23T16:20:44Z 377 | 378 | Mambo95 379 | 77516 380 | 381 | [[吉富昭仁]]へのリダイレクト 382 | #REDIRECT [[吉富昭仁]] 383 | 384 | 385 | 386 | 現在のイベント 387 | 356 388 | 389 | 390 | 12796821 391 | 2007-05-28T09:42:57Z 392 | 393 | Khhy 394 | 13490 395 | 396 | #REDIRECT [[Portal:最近の出来事]] 
397 | #REDIRECT [[Portal:最近の出来事]] 398 | 399 | 400 | 401 | Wikipedia:項目名のつけ方 402 | 451 403 | 404 | 405 | 2169218 406 | 2003-02-03T20:03:50Z 407 | 408 | Tomos 409 | 10 410 | 411 | #REDIRECT[[Wikipedia:記事名の付け方]] 412 | 413 | 414 | 415 | 必要とされている記事 416 | 456 417 | 418 | 419 | 2169223 420 | 2004-04-16T16:59:03Z 421 | 422 | Listener 423 | 6379 424 | 425 | Wikipedia:執筆依頼, double redirect 426 | #REDIRECT [[Wikipedia:執筆依頼]] 427 | 428 | 429 | 430 | 東京を舞台にした漫画作品 431 | 465 432 | 433 | 434 | 39222184 435 | 2011-09-16T06:55:20Z 436 | 437 | リオネル 438 | 98816 439 | 440 | [[東京を舞台にした漫画・アニメ作品]]へ統合 441 | #REDIRECT [[東京を舞台にした漫画・アニメ作品]] 442 | 443 | 444 | 445 | 必要とされている画像 446 | 467 447 | 448 | 449 | 2169230 450 | 2004-03-20T05:01:58Z 451 | 452 | Michey.M-test 453 | 3537 454 | 455 | 456 | Wikipedia:画像提供依頼 457 | #REDIRECT [[Wikipedia:画像提供依頼]] 458 | 459 | 460 | 461 | 水縞とおる 462 | 471 463 | 464 | 465 | 34308685 466 | 2010-10-03T18:50:25Z 467 | 468 | EmausBot 469 | 397108 470 | 471 | 472 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 473 | #転送 [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 474 | 475 | 476 | 477 | 恋人は守護霊!? 478 | 472 479 | 480 | 481 | 34308641 482 | 2010-10-03T18:45:13Z 483 | 484 | EmausBot 485 | 397108 486 | 487 | 488 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 489 | #転送 [[プロジェクト:漫画家/日本の漫画家 ま行#.E6.B0.B4.E7.B8.9E.E3.81.A8.E3.81.8A.E3.82.8B]] 490 | 491 | [[Category:漫画作品 こ|いひとはしゆこれい]] 492 | [[Category:月刊コミックNORA|こいひとはしゆこれい]] 493 | 494 | 495 | 496 | ユーゴスラビア改名 497 | 504 498 | 499 | 500 | 24277258 501 | 2009-02-07T16:21:01Z 502 | 503 | Point136 504 | 211299 505 | 506 | 507 | Bot: リダイレクト構文の修正 508 | #REDIRECT [[ユーゴスラビア]] 509 | 510 | 511 | 512 | あだちつよし 513 | 519 514 | 515 | 516 | 34308384 517 | 2010-10-03T18:14:37Z 518 | 519 | EmausBot 520 | 397108 521 | 522 | 523 | ロボットによる: 二重リダイレクト修正 → [[プロジェクト:漫画家/日本の漫画家 あ行#.E3.81.82.E3.81.A0.E3.81.A1.E3.81.A4.E3.82.88.E3.81.97]] 524 | #転送 [[プロジェクト:漫画家/日本の漫画家 あ行#.E3.81.82.E3.81.A0.E3.81.A1.E3.81.A4.E3.82.88.E3.81.97]] 525 | 526 | 527 | 528 | -------------------------------------------------------------------------------- /lib/edu.cmu.lti.wikipedia_redirect/test-data/sample-res_cat_jawiki.txt: -------------------------------------------------------------------------------- 1 | ACMフェロー アルフレッド・エイホ 0.288392 2 | ACMフェロー アンドリュー・タネンバウム 0.084127 3 | ACMフェロー エドムンド・クラーク 0.220679 4 | ACMフェロー グラディ・ブーチ -0.175180 5 | ACMフェロー ジャック・ドンガラ 0.427047 6 | ACMフェロー スティーブン・ボーン 0.220679 7 | ACMフェロー ダグラス・カマー 0.907805 8 | ACMフェロー ダン・ブリックリン 0.220679 9 | ACMフェロー ビャーネ・ストロヴストルップ 0.220679 10 | ACMフェロー ビル・グロップ 0.220679 11 | ACMフェロー ピーター・ノーヴィグ 0.233410 12 | ACMフェロー ボブ・フランクストン 0.241471 13 | ACMフェロー リチャード・ハミング 0.899804 14 | ACMフェロー 米澤明憲 0.143425 -------------------------------------------------------------------------------- /lib/wikiextractor-master-280915.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hasibi/TAGME-Reproducibility/d21ed0d826fc60a6e4caaa5ec7b6c39e16f7c6c6/lib/wikiextractor-master-280915.zip -------------------------------------------------------------------------------- /nordlys/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | 3 | import sys 4 | 5 | # set default encoding to utf-8 6 | reload(sys) 7 | sys.setdefaultencoding("utf-8") 8 | -------------------------------------------------------------------------------- 
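A note on the Python code that follows: it targets Python 2 (print statements, dict.iteritems(), reload(sys)), and the package __init__ above forces UTF-8 as the process-wide default string encoding. The sketch below is illustrative only, not part of the repository, and shows the kind of implicit-decode error that hack avoids under Python 2:

    # -*- coding: utf-8 -*-
    # Illustrative only (Python 2); the variable names here are made up for the example.
    import sys

    title = "ホームページ"               # a UTF-8 byte string, e.g. a redirect target read from disk
    try:
        label = title + u" (redirect)"   # implicit decode of `title` with the ASCII default fails
    except UnicodeDecodeError:
        reload(sys)
        sys.setdefaultencoding("utf-8")  # same hack as in nordlys/__init__.py
        label = title + u" (redirect)"   # the implicit decode now uses UTF-8 and succeeds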
/nordlys/config.py: -------------------------------------------------------------------------------- 1 | """ 2 | Global nordlys config. 3 | 4 | @author: Krisztian Balog (krisztian.balog@uis.no) 5 | """ 6 | 7 | from os import path 8 | 9 | NORDLYS_DIR = path.dirname(path.abspath(__file__)) 10 | LIB_DIR = path.dirname(path.dirname(path.abspath(__file__))) + "/lib" 11 | DATA_DIR = path.dirname(path.dirname(path.abspath(__file__))) + "/data" 12 | OUTPUT_DIR = path.dirname(path.dirname(path.abspath(__file__))) + "/output" 13 | 14 | MONGO_DB = "nordlys" 15 | MONGO_HOST = "localhost" -------------------------------------------------------------------------------- /nordlys/storage/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hasibi/TAGME-Reproducibility/d21ed0d826fc60a6e4caaa5ec7b6c39e16f7c6c6/nordlys/storage/__init__.py -------------------------------------------------------------------------------- /nordlys/storage/mongo.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tools for working with MongoDB. 3 | 4 | @author: Krisztian Balog (krisztian.balog@uis.no) 5 | """ 6 | 7 | from pymongo import MongoClient 8 | 9 | 10 | class Mongo(object): 11 | """Manages the MongoDB connection and operations.""" 12 | ID_FIELD = "_id" 13 | 14 | def __init__(self, host, db, collection): 15 | self.client = MongoClient(host) 16 | self.db = self.client[db] 17 | self.collection = self.db[collection] 18 | self.db_name = db 19 | self.collection_name = collection 20 | print "Connected to " + self.db_name + "." + self.collection_name 21 | 22 | @staticmethod 23 | def escape(s): 24 | """Escapes string (to be used as key or fieldname). 25 | Replaces . and $ with their unicode eqivalents.""" 26 | return s.replace(".", "\u002e").replace("$", "\u0024") 27 | 28 | @staticmethod 29 | def unescape(s): 30 | """Unescapes string.""" 31 | return s.replace("\u002e", ".").replace("\u0024", "$") 32 | 33 | def find_by_id(self, doc_id): 34 | """Returns all document content for a given document id.""" 35 | return self.get_doc(self.collection.find_one({Mongo.ID_FIELD: self.escape(doc_id)})) 36 | 37 | def get_doc(self, mdoc): 38 | """Returns document contents with with keys and _id field unescaped.""" 39 | if mdoc is None: 40 | return None 41 | 42 | doc = {} 43 | for f in mdoc: 44 | if f == Mongo.ID_FIELD: 45 | doc[f] = self.unescape(mdoc[f]) 46 | else: 47 | doc[self.unescape(f)] = mdoc[f] 48 | 49 | return doc 50 | 51 | @staticmethod 52 | def print_doc(doc): 53 | print "_id: " + doc[Mongo.ID_FIELD] 54 | for key, value in doc.iteritems(): 55 | if key == Mongo.ID_FIELD: continue # ignore the id key 56 | if type(value) is list: 57 | print key + ":" 58 | for v in value: 59 | print "\t" + str(v) 60 | else: 61 | print key + ": " + str(value) -------------------------------------------------------------------------------- /nordlys/storage/surfaceforms.py: -------------------------------------------------------------------------------- 1 | """ 2 | Entity surface forms stored in MongoDB. 3 | 4 | The surface form is used as _id. The associated entities are stored in key-value format. 
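A stored document might look roughly like {"_id": "new york", "anchor": {"<wikipedia_uri>": 42, ...}}
(an illustrative, made-up entry: the per-source keys such as "anchor", "title" and "redirect" depend on
how the surface-form collection was built, and nordlys.tagme.mention reads them in this shape).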
5 | 6 | @author: Krisztian Balog (krisztian.balog@uis.no) 7 | """ 8 | 9 | from nordlys.config import MONGO_DB, MONGO_HOST 10 | from nordlys.storage.mongo import Mongo 11 | 12 | 13 | class SurfaceForms(object): 14 | 15 | def __init__(self, collection): 16 | self.collection = collection 17 | self.mongo = Mongo(MONGO_HOST, MONGO_DB, self.collection) 18 | 19 | def get(self, surface_form): 20 | """Returns all information associated with a surface form.""" 21 | 22 | # need to unescape the keys in the value part 23 | mdoc = self.mongo.find_by_id(surface_form) 24 | if mdoc is None: 25 | return None 26 | doc = {} 27 | for f in mdoc: 28 | if f != Mongo.ID_FIELD: 29 | doc[f] = {} 30 | for key, value in mdoc[f].iteritems(): 31 | doc[f][Mongo.unescape(key)] = value 32 | 33 | return doc -------------------------------------------------------------------------------- /nordlys/tagme/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hasibi/TAGME-Reproducibility/d21ed0d826fc60a6e4caaa5ec7b6c39e16f7c6c6/nordlys/tagme/__init__.py -------------------------------------------------------------------------------- /nordlys/tagme/config.py: -------------------------------------------------------------------------------- 1 | """ 2 | Configurations for tagme package. 3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | 7 | from nordlys.config import DATA_DIR 8 | from nordlys.storage.surfaceforms import SurfaceForms 9 | 10 | 11 | # Test collection files 12 | Y_ERD = DATA_DIR + "/Y-ERD.tsv" 13 | 14 | ERD_QUERY = DATA_DIR + "/Trec_beta.query.txt" 15 | ERD_ANNOTATION = DATA_DIR + "/Trec_beta.annotation.txt" 16 | 17 | WIKI_ANNOT30 = DATA_DIR + "/wiki-annot30" 18 | WIKI_ANNOT30_SNIPPET = DATA_DIR + "/wiki-annot30-snippet.txt" 19 | WIKI_ANNOT30_ANNOTATION = DATA_DIR + "/wiki-annot30-annotation.txt" 20 | 21 | WIKI_DISAMB30 = DATA_DIR + "/wiki-disamb30" 22 | WIKI_DISAMB30_SNIPPET = DATA_DIR + "/wiki-disamb30-snippet.txt" 23 | WIKI_DISAMB30_ANNOTATION = DATA_DIR + "/wiki-disamb30-annotation.txt" 24 | 25 | # Surface form dictionaries 26 | COLLECTION_SURFACEFORMS_WIKI = "surfaceforms_wiki_20100408" 27 | SF_WIKI = SurfaceForms(collection=COLLECTION_SURFACEFORMS_WIKI) 28 | 29 | 30 | INDEX_PATH = "/xxx/20100408-index" 31 | INDEX_ANNOT_PATH = "/xxx/20100408-index-annot/" -------------------------------------------------------------------------------- /nordlys/tagme/dexter_api.py: -------------------------------------------------------------------------------- 1 | """ 2 | Methods to annotate queries with TagMe API. 
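(More precisely, this module posts to the Dexter REST API at dexterdemo.isti.cnr.it with
"tagme" selected as the disambiguator; the TAGME API itself is wrapped in nordlys.tagme.tagme_api.)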
3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | 7 | import argparse 8 | 9 | import requests 10 | from nordlys.config import OUTPUT_DIR 11 | 12 | from nordlys.tagme.test_coll import read_tagme_queries, read_yerd_queries, read_erd_queries 13 | from nordlys.wikipedia.utils import WikipediaUtils 14 | from nordlys.tagme import config 15 | 16 | 17 | class DexterAPI(object): 18 | ANNOT_DEXTER_URI = "http://dexterdemo.isti.cnr.it:8080/dexter-webapp/api/rest/annotate?min-conf=0" 19 | DESC_DEXTER_URI = "http://dexterdemo.isti.cnr.it:8080/dexter-webapp/api/rest/get-desc" 20 | 21 | def __init__(self): 22 | self.id_title_dict = {} 23 | 24 | def ask_dexter_query(self, query): 25 | """Sends queries to Dexter Api.""" 26 | data = {'dsb': "tagme", 'n': "50", 'debug': "false", 'format': "text", 'text': query} 27 | res = requests.post(self.ANNOT_DEXTER_URI, data).json() 28 | res['query'] = query 29 | return res 30 | 31 | def ask_title(self, page_id): 32 | """Sends page id to the API and get the page title.""" 33 | if page_id not in self.id_title_dict: 34 | req = "?id=" + str(page_id) + "&title-only=true" 35 | res = requests.get(self.DESC_DEXTER_URI + req).json() 36 | title = res.get('title', "") 37 | wiki_uri = WikipediaUtils.wiki_title_to_uri(title.encode("utf-8")) 38 | self.id_title_dict[page_id] = wiki_uri 39 | return self.id_title_dict[page_id] 40 | 41 | def aks_dexter_queries(self, queries, out_file): 42 | """ 43 | Sends queries to Dexter Api and writes them in a json file. 44 | 45 | :param queries: dictionary {qid: query, ...} 46 | :param out_file: The file to write json output 47 | """ 48 | print "Getting resutls from Tagme ..." 49 | # responses = {} 50 | out_str = "" 51 | open(out_file, "w").close() 52 | out = open(out_file, "a") 53 | i = 0 54 | for qid in sorted(queries, key=lambda item: int(item) if item.isdigit() else item): 55 | query = queries[qid] 56 | print "[" + qid + "]", query 57 | tagme_res = self.ask_dexter_query(query) 58 | out_str += self.__to_str(qid, tagme_res) 59 | out.write(out_str) 60 | out_str = "" 61 | i += 1 62 | if i % 100 == 0: 63 | # out.write(out_str) 64 | print i, "th query processed ...." 
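                # the page-id -> title cache is printed and reset below every 100 queries,
                # presumably just to keep its memory footprint bounded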
65 | print "items ins the page-id cache:", len(self.id_title_dict) 66 | self.id_title_dict = {} 67 | # out_str = "" 68 | out.write(out_str) 69 | # json.dump(responses, open(out_file, "w"), indent=4, sort_keys=True) 70 | print "Dexter results: " + out_file 71 | # return responses 72 | 73 | def __to_str(self, qid, response): 74 | """ 75 | Output format: 76 | qid, score, wiki-uri, mention, page-id, start, end, linkProbability, linkFrequency, documentFrequency, 77 | entityFrequency, commonness 78 | 79 | :param qid: 80 | :param response: 81 | :return: 82 | """ 83 | none_str = "*NONE*" 84 | out_str = "" 85 | for annot in response['spots']: 86 | wiki_uri = self.ask_title(annot.get('entity', none_str)) 87 | if wiki_uri is None: 88 | continue 89 | qid_str = str(qid) + "\t" + str(annot.get('score', none_str)) + "\t" + wiki_uri + "\t" + \ 90 | annot.get('mention', none_str) + "\t" + str(annot.get('entity', none_str)) + "\t" + \ 91 | str(annot.get('start', none_str)) + "\t" + str(annot.get('end', none_str)) + "\t" + \ 92 | str(annot.get('linkProbability', none_str)) + "\t" + str(annot.get('linkFrequency', none_str)) + "\t" +\ 93 | str(annot.get('documentFrequency', none_str)) + "\t" + str(annot.get('entityFrequency', none_str)) + "\t" +\ 94 | str(annot.get('commonness', none_str)) + "\n" 95 | out_str += qid_str 96 | return out_str 97 | 98 | 99 | def main(): 100 | parser = argparse.ArgumentParser() 101 | parser.add_argument("-th", "--threshold", help="rho score threshold", type=float, default=0) 102 | parser.add_argument("-qid", help="annotates queries from this qid", type=str) 103 | parser.add_argument("-data", help="Data set name", choices=['y-erd', 'erd-dev', 'wiki-annot30', 'wiki-disamb30']) 104 | args = parser.parse_args() 105 | 106 | if args.data == "erd-dev": 107 | queries = read_erd_queries() 108 | elif args.data == "y-erd": 109 | queries = read_yerd_queries() 110 | elif args.data == "wiki-annot30": 111 | queries = read_tagme_queries(config.WIKI_ANNOT30_SNIPPET) 112 | elif args.data == "wiki-disamb30": 113 | queries = read_tagme_queries(config.WIKI_DISAMB30_SNIPPET) 114 | 115 | # asks tagMe and creates output file 116 | qid_str = "_" + args.qid if args.qid else "" 117 | out_file = OUTPUT_DIR + "/" + args.data + "_dexter" + qid_str + ".txt" 118 | tagme = DexterAPI() 119 | tagme.aks_dexter_queries(queries, out_file) 120 | 121 | 122 | if __name__ == '__main__': 123 | main() -------------------------------------------------------------------------------- /nordlys/tagme/lucene_tools.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tools for Lucene. 3 | All Lucene features should be accessed in nordlys through this class. 4 | 5 | - Lucene class for ensuring that the same version, analyzer, etc. 6 | are used across nordlys modules. Handles IndexReader, IndexWriter, etc. 
7 | - Command line tools for checking indexed document content 8 | 9 | @author: Krisztian Balog (krisztian.balog@uis.no) 10 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 11 | """ 12 | 13 | import argparse 14 | import lucene 15 | from java.io import File 16 | from org.apache.lucene.analysis.standard import StandardAnalyzer 17 | from org.apache.lucene.document import Document 18 | from org.apache.lucene.document import Field 19 | from org.apache.lucene.document import FieldType 20 | from org.apache.lucene.index import IndexWriter 21 | from org.apache.lucene.index import IndexWriterConfig 22 | from org.apache.lucene.index import DirectoryReader 23 | from org.apache.lucene.index import Term 24 | from org.apache.lucene.search import IndexSearcher 25 | from org.apache.lucene.search import BooleanClause 26 | from org.apache.lucene.search import TermQuery 27 | from org.apache.lucene.search import BooleanQuery 28 | from org.apache.lucene.search import PhraseQuery 29 | from org.apache.lucene.store import SimpleFSDirectory 30 | from org.apache.lucene.store import RAMDirectory 31 | from org.apache.lucene.util import Version 32 | from org.apache.lucene.store import IOContext 33 | 34 | # has java VM for Lucene been initialized 35 | lucene_vm_init = False 36 | 37 | 38 | class Lucene(object): 39 | 40 | # default fieldnames for id and contents 41 | FIELDNAME_ID = "id" 42 | FIELDNAME_CONTENTS = "contents" 43 | 44 | # internal fieldtypes 45 | # used as Enum, the actual values don't matter 46 | FIELDTYPE_ID = "id" 47 | FIELDTYPE_ID_TV = "id_tv" 48 | FIELDTYPE_TEXT = "text" 49 | FIELDTYPE_TEXT_TV = "text_tv" 50 | FIELDTYPE_TEXT_TVP = "text_tvp" 51 | 52 | def __init__(self, index_dir, use_ram=False, jvm_ram=None): 53 | global lucene_vm_init 54 | if not lucene_vm_init: 55 | if jvm_ram: 56 | # e.g. jvm_ram = "8g" 57 | print "Increased JVM ram" 58 | lucene.initVM(vmargs=['-Djava.awt.headless=true'], maxheap=jvm_ram) 59 | else: 60 | lucene.initVM(vmargs=['-Djava.awt.headless=true']) 61 | lucene_vm_init = True 62 | self.dir = SimpleFSDirectory(File(index_dir)) 63 | 64 | self.use_ram = use_ram 65 | if use_ram: 66 | print "Using ram directory..." 67 | self.ram_dir = RAMDirectory(self.dir, IOContext.DEFAULT) 68 | self.analyzer = None 69 | self.reader = None 70 | self.searcher = None 71 | self.writer = None 72 | self.ldf = None 73 | print "Connected to index " + index_dir 74 | 75 | def get_version(self): 76 | """Get Lucene version.""" 77 | return Version.LUCENE_48 78 | 79 | def get_analyzer(self): 80 | """Get analyzer.""" 81 | if self.analyzer is None: 82 | self.analyzer = StandardAnalyzer(self.get_version()) 83 | return self.analyzer 84 | 85 | def open_reader(self): 86 | """Open IndexReader.""" 87 | if self.reader is None: 88 | if self.use_ram: 89 | print "reading from ram directory ..." 90 | self.reader = DirectoryReader.open(self.ram_dir) 91 | else: 92 | self.reader = DirectoryReader.open(self.dir) 93 | 94 | def get_reader(self): 95 | return self.reader 96 | 97 | def close_reader(self): 98 | """Close IndexReader.""" 99 | if self.reader is not None: 100 | self.reader.close() 101 | self.reader = None 102 | else: 103 | raise Exception("There is no open IndexReader to close") 104 | 105 | def open_searcher(self): 106 | """ 107 | Open IndexSearcher. Automatically opens an IndexReader too, 108 | if it is not already open. There is no close method for the 109 | searcher. 
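        (Closing the underlying reader via close_reader() also invalidates this searcher.)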
110 | """ 111 | if self.searcher is None: 112 | self.open_reader() 113 | self.searcher = IndexSearcher(self.reader) 114 | 115 | def get_searcher(self): 116 | """Returns index searcher (opens it if needed).""" 117 | self.open_searcher() 118 | return self.searcher 119 | 120 | def open_writer(self): 121 | """Open IndexWriter.""" 122 | if self.writer is None: 123 | config = IndexWriterConfig(self.get_version(), self.get_analyzer()) 124 | config.setOpenMode(IndexWriterConfig.OpenMode.CREATE) 125 | self.writer = IndexWriter(self.dir, config) 126 | else: 127 | raise Exception("IndexWriter is already open") 128 | 129 | def close_writer(self): 130 | """Close IndexWriter.""" 131 | if self.writer is not None: 132 | self.writer.close() 133 | self.writer = None 134 | else: 135 | raise Exception("There is no open IndexWriter to close") 136 | 137 | def add_document(self, contents): 138 | """ 139 | Adds a Lucene document with the specified contents to the index. 140 | See LuceneDocument.create_document() for the explanation of contents. 141 | """ 142 | if self.ldf is None: # create a single LuceneDocument object that will be reused 143 | self.ldf = LuceneDocument() 144 | self.writer.addDocument(self.ldf.create_document(contents)) 145 | 146 | def get_lucene_document_id(self, doc_id): 147 | """Loads a document from a Lucene index based on its id.""" 148 | self.open_searcher() 149 | query = TermQuery(Term(self.FIELDNAME_ID, doc_id)) 150 | tophit = self.searcher.search(query, 1).scoreDocs 151 | if len(tophit) == 1: 152 | return tophit[0].doc 153 | else: 154 | return None 155 | 156 | def get_document_id(self, lucene_doc_id): 157 | """Gets lucene document id and returns the document id.""" 158 | self.open_reader() 159 | return self.reader.document(lucene_doc_id).get(self.FIELDNAME_ID) 160 | 161 | def get_id_lookup_query(self, id, field=None): 162 | """Creates Lucene query for searching by (external) document id """ 163 | if field is None: 164 | field = self.FIELDNAME_ID 165 | return TermQuery(Term(field, id)) 166 | 167 | def get_and_query(self, queries): 168 | """Creates an AND Boolean query from multiple Lucene queries """ 169 | # empty boolean query with Similarity.coord() disabled 170 | bq = BooleanQuery(False) 171 | for q in queries: 172 | bq.add(q, BooleanClause.Occur.MUST) 173 | return bq 174 | 175 | def get_or_query(self, queries): 176 | """Creates an OR Boolean query from multiple Lucene queries """ 177 | # empty boolean query with Similarity.coord() disabled 178 | bq = BooleanQuery(False) 179 | for q in queries: 180 | bq.add(q, BooleanClause.Occur.SHOULD) 181 | return bq 182 | 183 | def get_phrase_query(self, query, field): 184 | """Creates phrase query for searching exact phrase.""" 185 | phq = PhraseQuery() 186 | for t in query.split(): 187 | phq.add(Term(field, t)) 188 | return phq 189 | 190 | def num_docs(self): 191 | """Returns number of documents in the index.""" 192 | self.open_reader() 193 | return self.reader.numDocs() 194 | 195 | 196 | class LuceneDocument(object): 197 | """Internal representation of a Lucene document""" 198 | 199 | def __init__(self): 200 | self.ldf = LuceneDocumentField() 201 | 202 | def create_document(self, contents): 203 | """Create a Lucene document from the specified contents. 
204 | Contents is a list of fields to be indexed, represented as a dictionary 205 | with keys 'field_name', 'field_type', and 'field_value'.""" 206 | doc = Document() 207 | for f in contents: 208 | doc.add(Field(f['field_name'], f['field_value'], 209 | self.ldf.get_field(f['field_type']))) 210 | return doc 211 | 212 | 213 | class LuceneDocumentField(object): 214 | """Internal handler class for possible field types""" 215 | 216 | def __init__(self): 217 | """Init possible field types""" 218 | 219 | # FIELD_ID: stored, indexed, non-tokenized 220 | self.field_id = FieldType() 221 | self.field_id.setIndexed(True) 222 | self.field_id.setStored(True) 223 | self.field_id.setTokenized(False) 224 | 225 | # FIELD_ID_TV: stored, indexed, not tokenized, with term vectors (without positions) 226 | # for storing IDs with term vector info 227 | self.field_id_tv = FieldType() 228 | self.field_id_tv.setIndexed(True) 229 | self.field_id_tv.setStored(True) 230 | self.field_id_tv.setTokenized(False) 231 | self.field_id_tv.setStoreTermVectors(True) 232 | 233 | # FIELD_TEXT: stored, indexed, tokenized, with positions 234 | self.field_text = FieldType() 235 | self.field_text.setIndexed(True) 236 | self.field_text.setStored(True) 237 | self.field_text.setTokenized(True) 238 | 239 | # FIELD_TEXT_TV: stored, indexed, tokenized, with term vectors (without positions) 240 | self.field_text_tv = FieldType() 241 | self.field_text_tv.setIndexed(True) 242 | self.field_text_tv.setStored(True) 243 | self.field_text_tv.setTokenized(True) 244 | self.field_text_tv.setStoreTermVectors(True) 245 | 246 | # FIELD_TEXT_TVP: stored, indexed, tokenized, with term vectors and positions 247 | # (but no character offsets) 248 | self.field_text_tvp = FieldType() 249 | self.field_text_tvp.setIndexed(True) 250 | self.field_text_tvp.setStored(True) 251 | self.field_text_tvp.setTokenized(True) 252 | self.field_text_tvp.setStoreTermVectors(True) 253 | self.field_text_tvp.setStoreTermVectorPositions(True) 254 | 255 | def get_field(self, type): 256 | """Get Lucene FieldType object for the corresponding internal FIELDTYPE_ value""" 257 | if type == Lucene.FIELDTYPE_ID: 258 | return self.field_id 259 | elif type == Lucene.FIELDTYPE_ID_TV: 260 | return self.field_id_tv 261 | elif type == Lucene.FIELDTYPE_TEXT: 262 | return self.field_text 263 | elif type == Lucene.FIELDTYPE_TEXT_TV: 264 | return self.field_text_tv 265 | elif type == Lucene.FIELDTYPE_TEXT_TVP: 266 | return self.field_text_tvp 267 | else: 268 | raise Exception("Unknown field type") 269 | 270 | 271 | def main(): 272 | parser = argparse.ArgumentParser() 273 | parser.add_argument("-i", "--index", help="index directory", type=str) 274 | args = parser.parse_args() 275 | 276 | index_dir = args.index 277 | print "Index: " + index_dir + "\n" 278 | 279 | l = Lucene(index_dir, jvm_ram="8g") 280 | pq = l.get_phrase_query("originally used", "contents") 281 | 282 | l.open_searcher() 283 | tophit = l.searcher.search(pq, 1).scoreDocs 284 | print tophit[0] 285 | 286 | if __name__ == '__main__': 287 | main() -------------------------------------------------------------------------------- /nordlys/tagme/mention.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tools for processing mentions: 3 | - Finds candidate entities 4 | - Calculates commonness 5 | 6 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 7 | """ 8 | 9 | from nordlys.tagme.config import SF_WIKI 10 | 11 | 12 | class Mention(object): 13 | 14 | def __init__(self, text): 15 | self.text = 
text.lower() 16 | self.__matched_ens = None # all entities matching a mention (from all sources) 17 | self.__wiki_occurrences = None 18 | 19 | @property 20 | def matched_ens(self): 21 | return self.__gen_matched_ens() 22 | 23 | @property 24 | def wiki_occurrences(self): 25 | return self.__calc_wiki_occurrences() 26 | 27 | def __gen_matched_ens(self): 28 | """Gets all entities matching the n-gram""" 29 | if self.__matched_ens is None: 30 | matches = SF_WIKI.get(self.text) 31 | matched_ens = matches if matches is not None else {} 32 | self.__matched_ens = matched_ens 33 | return self.__matched_ens 34 | 35 | def __calc_wiki_occurrences(self): 36 | """Calculates the denominator for commonness (for Wiki annotations).""" 37 | if self.__wiki_occurrences is None: 38 | self.__wiki_occurrences = 0 39 | for en, occ in self.matched_ens.get('anchor', {}).iteritems(): 40 | self.__wiki_occurrences += occ 41 | return self.__wiki_occurrences 42 | 43 | def get_men_candidate_ens(self, commonness_th): 44 | """ 45 | Gets candidate entities for the given n-gram. 46 | 47 | :param commonness_th: commonness threshold 48 | :return: dictionary {Wiki_uri: commonness, ..} 49 | """ 50 | candidate_entities = {} 51 | wiki_matches = self.get_wiki_matches(commonness_th) 52 | candidate_entities.update(wiki_matches) 53 | return candidate_entities 54 | 55 | def get_wiki_matches(self, commonness_th): 56 | """ 57 | Gets entity matches from Wikipedia anchors (with dbpedia uris). 58 | 59 | :param commonness_th: float, Commonness threshold 60 | :return: Dictionary {Wiki_uri: commonness, ...} 61 | 62 | """ 63 | if commonness_th is None: 64 | commonness_th = 0 65 | 66 | wiki_matches = {} 67 | # calculates commonness for each entity and filter the ones below the commonness threshold. 68 | for wiki_uri in self.matched_ens.get("anchor", {}): 69 | cmn = self.calc_commonness(wiki_uri) 70 | if cmn >= commonness_th: 71 | wiki_matches[wiki_uri] = cmn 72 | 73 | sources = ["title", "title-nv", "redirect"] 74 | for source in sources: 75 | for wiki_uri in self.matched_ens.get(source, {}): 76 | if wiki_uri not in wiki_matches: 77 | cmn = self.calc_commonness(wiki_uri) 78 | wiki_matches[wiki_uri] = cmn 79 | return wiki_matches 80 | 81 | def calc_commonness(self, en_uri): 82 | """ 83 | Calculates commonness for the given entity: 84 | (times mention is linked) / (times mention linked to entity) 85 | - Returns zero if the entity is not linked by the mention. 86 | 87 | :param en_uri: Wikipedia uri 88 | :return Commonness 89 | """ 90 | if not en_uri.startswith(" 6): 57 | continue 58 | link_prob = self.__get_link_prob(mention) 59 | if link_prob < self.link_prob_th: 60 | continue 61 | # These mentions will be kept 62 | self.link_probs[ngram] = link_prob 63 | # Filters entities by cmn threshold 0.001; this was only in TAGME source code and speeds up the process. 
64 | # TAGME source code: it.acubelab.tagme.anchor (lines 279-284) 65 | ens[ngram] = mention.get_men_candidate_ens(0.001) 66 | 67 | # filters containment mentions (based on paper) 68 | candidate_entities = {} 69 | sorted_mentions = sorted(ens.keys(), key=lambda item: len(item.split())) # sorts by mention length 70 | for i in range(0, len(sorted_mentions)): 71 | m_i = sorted_mentions[i] 72 | ignore_m_i = False 73 | for j in range(i+1, len(sorted_mentions)): 74 | m_j = sorted_mentions[j] 75 | if (m_i in m_j) and (self.link_probs[m_i] < self.link_probs[m_j]): 76 | ignore_m_i = True 77 | break 78 | if not ignore_m_i: 79 | candidate_entities[m_i] = ens[m_i] 80 | return candidate_entities 81 | 82 | def disambiguate(self, candidate_entities): 83 | """ 84 | Performs disambiguation and link each mention to a single entity. 85 | 86 | :param candidate_entities: {men:{en:cmn, ...}, ...} 87 | :return: disambiguated entities {men:en, ...} 88 | """ 89 | # Gets the relevance score 90 | rel_scores = {} 91 | for m_i in candidate_entities.keys(): 92 | if self.DEBUG: 93 | print "********************", m_i, "********************" 94 | rel_scores[m_i] = {} 95 | for e_m_i in candidate_entities[m_i].keys(): 96 | if self.DEBUG: 97 | print "-- ", e_m_i 98 | rel_scores[m_i][e_m_i] = 0 99 | for m_j in candidate_entities.keys(): # all other mentions 100 | if (m_i == m_j) or (len(candidate_entities[m_j].keys()) == 0): 101 | continue 102 | vote_e_m_j = self.__get_vote(e_m_i, candidate_entities[m_j]) 103 | rel_scores[m_i][e_m_i] += vote_e_m_j 104 | if self.DEBUG: 105 | print m_j, vote_e_m_j 106 | 107 | # pruning uncommon entities (based on the paper) 108 | self.rel_scores = {} 109 | for m_i in rel_scores: 110 | for e_m_i in rel_scores[m_i]: 111 | cmn = candidate_entities[m_i][e_m_i] 112 | if cmn >= self.cmn_th: 113 | if m_i not in self.rel_scores: 114 | self.rel_scores[m_i] = {} 115 | self.rel_scores[m_i][e_m_i] = rel_scores[m_i][e_m_i] 116 | 117 | # DT pruning 118 | disamb_ens = {} 119 | for m_i in self.rel_scores: 120 | if len(self.rel_scores[m_i].keys()) == 0: 121 | continue 122 | top_k_ens = self.__get_top_k(m_i) 123 | best_cmn = 0 124 | best_en = None 125 | for en in top_k_ens: 126 | cmn = candidate_entities[m_i][en] 127 | if cmn >= best_cmn: 128 | best_en = en 129 | best_cmn = cmn 130 | disamb_ens[m_i] = best_en 131 | 132 | return disamb_ens 133 | 134 | def prune(self, dismab_ens): 135 | """ 136 | Performs AVG pruning. 137 | 138 | :param dismab_ens: {men: en, ... } 139 | :return: {men: (en, score), ...} 140 | """ 141 | linked_ens = {} 142 | for men, en in dismab_ens.iteritems(): 143 | coh_score = self.__get_coherence_score(men, en, dismab_ens) 144 | rho_score = (self.link_probs[men] + coh_score) / 2.0 145 | if rho_score >= self.rho_th: 146 | linked_ens[men] = (en, rho_score) 147 | return linked_ens 148 | 149 | def __get_link_prob(self, mention): 150 | """ 151 | Gets link probability for the given mention. 152 | Here, in fact, we are computing key-phraseness. 
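        Roughly, paraphrasing the code below for the "wiki" surface-form source:
        link_prob(m) = wiki_occurrences(m) / phrase_freq(m), where phrase_freq(m) is the
        number of hits for m as a phrase query on the entity index and wiki_occurrences(m)
        is how often m occurs as anchor text in Wikipedia.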
153 | """ 154 | 155 | pq = ENTITY_INDEX.get_phrase_query(mention.text, Lucene.FIELDNAME_CONTENTS) 156 | mention_freq = ENTITY_INDEX.searcher.search(pq, 1).totalHits 157 | if mention_freq == 0: 158 | return 0 159 | if self.sf_source == "wiki": 160 | link_prob = mention.wiki_occurrences / float(mention_freq) 161 | # This is TAGME implementation, from source code: 162 | # link_prob = float(mention.wiki_occurrences) / max(mention_freq, mention.wiki_occurrences) 163 | elif self.sf_source == "facc": 164 | link_prob = mention.facc_occurrences / float(mention_freq) 165 | return link_prob 166 | 167 | def __get_vote(self, entity, men_cand_ens): 168 | """ 169 | vote_e = sum_e_i(mw_rel(e, e_i) * cmn(e_i)) / i 170 | 171 | :param entity: en 172 | :param men_cand_ens: {en: cmn, ...} 173 | :return: voting score 174 | """ 175 | entity = entity if self.sf_source == "wiki" else entity[0] 176 | vote = 0 177 | for e_i, cmn in men_cand_ens.iteritems(): 178 | e_i = e_i if self.sf_source == "wiki" else e_i[0] 179 | mw_rel = self.__get_mw_rel(entity, e_i) 180 | # print "\t", e_i, "cmn:", cmn, "mw_rel:", mw_rel 181 | vote += cmn * mw_rel 182 | vote /= float(len(men_cand_ens)) 183 | return vote 184 | 185 | def __get_mw_rel(self, e1, e2): 186 | """ 187 | Calculates Milne & Witten relatedness for two entities. 188 | This implementation is based on Dexter implementation (which is similar to TAGME implementation). 189 | - Dexter implementation: https://github.com/dexter/dexter/blob/master/dexter-core/src/main/java/it/cnr/isti/hpc/dexter/relatedness/MilneRelatedness.java 190 | - TAGME: it.acubelab.tagme.preprocessing.graphs.OnTheFlyArrayMeasure 191 | """ 192 | if e1 == e2: # to speed-up 193 | return 1.0 194 | en_uris = tuple(sorted({e1, e2})) 195 | ens_in_links = [self.__get_in_links([en_uri]) for en_uri in en_uris] 196 | if min(ens_in_links) == 0: 197 | return 0 198 | conj = self.__get_in_links(en_uris) 199 | if conj == 0: 200 | return 0 201 | numerator = math.log(max(ens_in_links)) - math.log(conj) 202 | denominator = math.log(ANNOT_INDEX.num_docs()) - math.log(min(ens_in_links)) 203 | rel = 1 - (numerator / denominator) 204 | if rel < 0: 205 | return 0 206 | return rel 207 | 208 | def __get_in_links(self, en_uris): 209 | """ 210 | returns "and" occurrences of entities in the corpus. 
211 | 212 | :param en_uris: list of dbp_uris 213 | """ 214 | en_uris = tuple(sorted(set(en_uris))) 215 | if en_uris in self.in_links: 216 | return self.in_links[en_uris] 217 | 218 | term_queries = [] 219 | for en_uri in en_uris: 220 | term_queries.append(ANNOT_INDEX.get_id_lookup_query(en_uri, Lucene.FIELDNAME_CONTENTS)) 221 | and_query = ANNOT_INDEX.get_and_query(term_queries) 222 | self.in_links[en_uris] = ANNOT_INDEX.searcher.search(and_query, 1).totalHits 223 | return self.in_links[en_uris] 224 | 225 | def __get_coherence_score(self, men, en, dismab_ens): 226 | """ 227 | coherence_score = sum_e_i(rel(e_i, en)) / len(ens) - 1 228 | 229 | :param en: entity 230 | :param dismab_ens: {men: (dbp_uri, fb_id), ....} 231 | """ 232 | coh_score = 0 233 | for m_i, e_i in dismab_ens.iteritems(): 234 | if m_i == men: 235 | continue 236 | coh_score += self.__get_mw_rel(e_i, en) 237 | coh_score = coh_score / float(len(dismab_ens.keys()) - 1) if len(dismab_ens.keys()) - 1 != 0 else 0 238 | return coh_score 239 | 240 | def __get_top_k(self, mention): 241 | """Returns top-k percent of the entities based on rel score.""" 242 | k = int(round(len(self.rel_scores[mention].keys()) * self.k_th)) 243 | k = 1 if k == 0 else k 244 | sorted_rel_scores = sorted(self.rel_scores[mention].items(), key=lambda item: item[1], reverse=True) 245 | top_k_ens = [] 246 | count = 1 247 | prev_rel_score = sorted_rel_scores[0][1] 248 | for en, rel_score in sorted_rel_scores: 249 | if rel_score != prev_rel_score: 250 | count += 1 251 | if count > k: 252 | break 253 | top_k_ens.append(en) 254 | prev_rel_score = rel_score 255 | return top_k_ens 256 | 257 | 258 | def main(): 259 | parser = argparse.ArgumentParser() 260 | parser.add_argument("-th", "--threshold", help="score threshold", type=float, default=0) 261 | parser.add_argument("-data", help="Data set name", choices=['y-erd', 'erd-dev', 'wiki-annot30', 'wiki-disamb30']) 262 | args = parser.parse_args() 263 | 264 | if args.data == "erd-dev": 265 | queries = test_coll.read_erd_queries() 266 | elif args.data == "y-erd": 267 | queries = test_coll.read_yerd_queries() 268 | elif args.data == "wiki-annot30": 269 | queries = test_coll.read_tagme_queries(config.WIKI_ANNOT30_SNIPPET) 270 | elif args.data == "wiki-disamb30": 271 | queries = test_coll.read_tagme_queries(config.WIKI_DISAMB30_SNIPPET) 272 | 273 | out_file_name = OUTPUT_DIR + "/" + args.data + "_tagme_wiki10.txt" 274 | open(out_file_name, "w").close() 275 | out_file = open(out_file_name, "a") 276 | 277 | # process the queries 278 | for qid, query in sorted(queries.items(), key=lambda item: int(item[0]) if item[0].isdigit() else item[0]): 279 | print "[" + qid + "]", query 280 | tagme = Tagme(Query(qid, query), args.threshold) 281 | print " parsing ..." 282 | cand_ens = tagme.parse() 283 | print " disambiguation ..." 284 | disamb_ens = tagme.disambiguate(cand_ens) 285 | print " pruning ..." 286 | linked_ens = tagme.prune(disamb_ens) 287 | 288 | out_str = "" 289 | for men, (en, score) in linked_ens.iteritems(): 290 | out_str += str(qid) + "\t" + str(score) + "\t" + en + "\t" + men + "\tpage-id" + "\n" 291 | print out_str, "-----------\n" 292 | out_file.write(out_str) 293 | 294 | print "output:", out_file_name 295 | 296 | 297 | if __name__ == "__main__": 298 | main() -------------------------------------------------------------------------------- /nordlys/tagme/tagme_api.py: -------------------------------------------------------------------------------- 1 | """ 2 | Methods to annotate queries with TagMe API. 
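Note: an access key from the TAGME authors is required; main() below currently holds the
placeholder key "XXX", which has to be replaced before the script can be run.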
3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | 7 | import argparse 8 | import requests 9 | from nordlys.config import OUTPUT_DIR 10 | from nordlys.tagme import config 11 | from nordlys.tagme.test_coll import read_erd_queries, read_yerd_queries, read_tagme_queries 12 | from nordlys.wikipedia.utils import WikipediaUtils 13 | 14 | 15 | class TagmeAPI(object): 16 | TAGME_URI = "http://tagme.di.unipi.it/tag" 17 | NONE = "*NONE*" 18 | 19 | def __init__(self, key): 20 | self.key = key 21 | 22 | def ask_tagme_query(self, query): 23 | """Sends queries to Tagme Api.""" 24 | data = {'key': self.key, 'lang': "en", 'text': query} 25 | res = requests.post(self.TAGME_URI, data).json() 26 | res['query'] = query 27 | return res 28 | 29 | def aks_tagme_queries(self, queries, out_file): 30 | """ 31 | Sends queries to Tagme Api and writes them in a json file. 32 | 33 | :param queries: dictionary {qid: query, ...} 34 | :param out_file: The file to write the output 35 | """ 36 | print "Getting results from Tagme ..." 37 | out_str = "" 38 | open(out_file, "w").close() 39 | out = open(out_file, "a") 40 | i = 0 41 | for qid in sorted(queries): 42 | query = queries[qid] 43 | print "[" + qid + "]", query 44 | tagme_res = self.ask_tagme_query(query) 45 | out_str += self.__to_str(qid, tagme_res) 46 | # responses[qid] = tagme_res 47 | i += 1 48 | if i % 1000 == 0: 49 | out.write(out_str) 50 | print "until qid:", qid 51 | out_str = "" 52 | out.write(out_str) 53 | print "TagMe results: " + out_file 54 | 55 | def __to_str(self, qid, response): 56 | """ 57 | Output format: 58 | qid, score, wiki-uri, mention, page-id, start, end 59 | """ 60 | out_str = "" 61 | for annot in response['annotations']: 62 | title = str(annot.get('title', TagmeAPI.NONE)) 63 | page_id = str(annot.get('id', TagmeAPI.NONE)) 64 | wiki_uri = self.__get_uri_from_title(title) 65 | out_str += str(qid) + "\t" + str(annot.get('rho', TagmeAPI.NONE)) + "\t" + wiki_uri + "\t" + \ 66 | annot.get('spot', TagmeAPI.NONE) + "\t" + page_id + "\t" + \ 67 | str(annot.get('start', TagmeAPI.NONE)) + "\t" + str(annot.get('end', TagmeAPI.NONE)) + "\n" 68 | return out_str 69 | 70 | def __get_uri_from_title(self, title): 71 | return WikipediaUtils.wiki_title_to_uri(title) if title != TagmeAPI.NONE else TagmeAPI.NONE 72 | 73 | 74 | def main(): 75 | key = "XXX" # To be taken from TAGME authors 76 | 77 | parser = argparse.ArgumentParser() 78 | parser.add_argument("-data", help="Data set name", choices=['y-erd', 'erd-dev', 'wiki-annot30', 'wiki-disamb30']) 79 | args = parser.parse_args() 80 | 81 | if args.data == "erd-dev": 82 | queries = read_erd_queries() 83 | elif args.data == "y-erd": 84 | queries = read_yerd_queries() 85 | elif args.data == "wiki-annot30": 86 | queries = read_tagme_queries(config.WIKI_ANNOT30_SNIPPET) 87 | elif args.data == "wiki-disamb30": 88 | queries = read_tagme_queries(config.WIKI_DISAMB30_SNIPPET) 89 | 90 | # Asks TAGME and creates json file 91 | out_file = OUTPUT_DIR + "/" + args.data + "_tagmeAPI" + ".txt" 92 | tagme = TagmeAPI(key) 93 | tagme.aks_tagme_queries(queries, out_file) 94 | 95 | if __name__ == '__main__': 96 | main() -------------------------------------------------------------------------------- /nordlys/tagme/test_coll.py: -------------------------------------------------------------------------------- 1 | """ 2 | Reads queries from test collections 3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | 7 | import csv 8 | from nordlys.tagme import config 9 | 10 | 11 | def 
read_yerd_queries(y_erd_file=config.Y_ERD): 12 | """ 13 | Reads queries from Erd query file. 14 | 15 | :return dictionary {query_id : query_content} 16 | """ 17 | queries = {} 18 | with open(y_erd_file, 'rb') as y_erd: 19 | reader = csv.DictReader(y_erd, delimiter="\t", quoting=csv.QUOTE_NONE) 20 | 21 | for line in reader: 22 | qid = line['qid'] 23 | query = line['query'] 24 | queries[qid] = query.strip() 25 | print "Number of queries:", len(queries) 26 | return queries 27 | 28 | 29 | def read_erd_queries(erd_q_file=config.ERD_QUERY): 30 | """ 31 | Reads queries from Erd query file. 32 | 33 | :return dictionary {qid : query} 34 | """ 35 | queries = {} 36 | q_file = open(erd_q_file, "r") 37 | for line in q_file: 38 | line = line.split("\t") 39 | query_id = line[0].strip() 40 | query = line[-1].strip() 41 | queries[query_id] = query 42 | q_file.close() 43 | print "Number of queries:", len(queries) 44 | return queries 45 | 46 | 47 | def read_tagme_queries(dataset_file): 48 | """ 49 | Reads queries from snippet file. 50 | 51 | :return dictionary {qid : query} 52 | """ 53 | queries = {} 54 | q_file = open(dataset_file, "r") 55 | for line in q_file: 56 | line = line.strip().split("\t") 57 | query_id = line[0].strip() 58 | query = line[1].strip() 59 | queries[query_id] = query 60 | q_file.close() 61 | print "Number of queries:", len(queries) 62 | return queries -------------------------------------------------------------------------------- /nordlys/wikipedia/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'faeghehhasibi' 2 | -------------------------------------------------------------------------------- /nordlys/wikipedia/anchor_extractor.py: -------------------------------------------------------------------------------- 1 | """ 2 | Creates a single anchor file for all entity-linking annotations. 3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | import argparse 7 | import os 8 | 9 | 10 | def merge_anchors(basedir, outfile): 11 | """Writes all annotations into a single file.""" 12 | open(outfile, "w").close() 13 | out = open(outfile, "a") 14 | i = 0 15 | for path, dirs, files in os.walk(basedir): 16 | for fn in sorted(files): 17 | if fn.endswith(".tsv"): 18 | with open(os.path.join(path, fn)) as in_file: 19 | out.write(in_file.read()) 20 | i += 1 21 | if i % 100 == 0: 22 | print i, "th file is added!" 23 | print "file:", os.path.join(path, fn) 24 | 25 | 26 | def count_anchors(anchor_file, out_file): 27 | """Counts the number of occurrences anchor-entity pairs""" 28 | sf_dict = {} 29 | in_file = open(anchor_file) 30 | i = 0 31 | for line in in_file: 32 | i += 1 33 | cols = line.strip().split("\t") 34 | if (len(cols) < 4) or (cols[2].strip().lower() == ""): 35 | continue 36 | sf = cols[2].strip().lower() 37 | en = cols[3].strip() 38 | if sf not in sf_dict: 39 | sf_dict[sf] = {} 40 | if en not in sf_dict[sf]: 41 | sf_dict[sf][en] = 1 42 | else: 43 | sf_dict[sf][en] += 1 44 | if i % 1000000 == 0: 45 | print i, "th line processed!" 
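    # The aggregated counts are then written out as one "surface_form<TAB>entity<TAB>count"
    # line per (surface form, entity) pair: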
46 | 47 | out_str = "" 48 | for sf, en_counts in sf_dict.iteritems(): 49 | for en, count in en_counts.iteritems(): 50 | out_str += sf + "\t" + en + "\t" + str(count) + "\n" 51 | out = open(out_file, "w") 52 | out.write(out_str) 53 | out.close() 54 | 55 | 56 | def main(): 57 | # Builds anchor file 58 | parser = argparse.ArgumentParser() 59 | parser.add_argument("-inputdir", help="Path to directory to read from") 60 | parser.add_argument("-outputdir", help="Path to write the annotations (.tsv files)") 61 | args = parser.parse_args() 62 | 63 | merge_anchors(args.inputdir, args.outputdir + "/anchors.txt") 64 | count_anchors(args.outputdir + "/anchors.txt", args.outputdir + "/anchors_count.txt") 65 | 66 | if __name__ == "__main__": 67 | main() 68 | -------------------------------------------------------------------------------- /nordlys/wikipedia/annot_extractor.py: -------------------------------------------------------------------------------- 1 | """ 2 | Extracts annotations from processed Wikipedia articles and writes them in .tsv files. 3 | 4 | author: Faegheh Hasibi 5 | """ 6 | 7 | import argparse 8 | import os 9 | import re 10 | from datetime import datetime 11 | 12 | tagRE = re.compile(r'(.*?)(<(/?\w+)[^>]*>)(?:([^<]*)(<.*?>)?)?') 13 | idRE = re.compile(r'id="([0-9]+)"') 14 | titleRE = re.compile(r'title="(.*)"') 15 | linkRE = re.compile(r'href="(.*)"') 16 | 17 | 18 | def process_file(wiki_file, out_file): 19 | """ 20 | Extracts annotations from XML annotated file and write annotations. 21 | output file format: page_id title mention linked_en 22 | 23 | :param wiki_file: XML file containing multiple articles. 24 | :param out_file: Name of tsv file. 25 | """ 26 | print "Processing " + wiki_file + " ...", 27 | open(out_file, "w").close() 28 | out = open(out_file, "a") 29 | f = open(wiki_file, "r") 30 | annots = [] 31 | doc_id, doc_title = None, None 32 | for line in f: 33 | # Writes annotations and reset variables 34 | if re.search(r'', line): 35 | out.write("".join(annots)) 36 | annots = [] 37 | doc_id, doc_title = None, None 38 | for m in tagRE.finditer(line): 39 | if not m: 40 | continue 41 | tag = m.group(3) 42 | if tag == "doc": 43 | doc_id = idRE.search(m.group(2)) 44 | doc_title = titleRE.search(m.group(2)) 45 | if (not doc_id) or (not doc_title): 46 | print "\nINFO: doc id or title not found in " + wiki_file, 47 | continue 48 | if tag == "a": 49 | mention = m.group(4) 50 | link = linkRE.search(m.group(2)) 51 | if (not link) or (doc_id is None) or (doc_title is None): 52 | print "\nINFO: link not found in " + wiki_file, 53 | continue 54 | annot = doc_id.group(1) + "\t" + doc_title.group(1) + "\t" + mention + "\t" + link.group(1) + "\n" 55 | annots.append(annot) 56 | print " --> output in " + out_file 57 | 58 | 59 | def add_dir(base_in_dir, base_out_dir): 60 | """Adds FACC annotations from a directory recursively.""" 61 | for path, dirs, _ in os.walk(base_in_dir): 62 | for dir in sorted(dirs): 63 | s_t = datetime.now() # start time 64 | total_time = 0.0 65 | for _, _, files in os.walk(os.path.join(base_in_dir, dir)): 66 | for fn in files: 67 | out_dir = base_out_dir + dir 68 | if not os.path.exists(out_dir): 69 | os.makedirs(out_dir) 70 | out_file = os.path.join(out_dir, "an_" + fn + ".tsv") 71 | in_file = os.path.join(base_in_dir + dir, fn) 72 | process_file(in_file, out_file) 73 | 74 | e_t = datetime.now() # end time 75 | diff = e_t - s_t 76 | total_time += diff.total_seconds() 77 | print "[processing time for bunch " + dir + " (min)]:", total_time/60 78 | 79 | 80 | def main(): 81 
| parser = argparse.ArgumentParser() 82 | parser.add_argument("-inputdir", help="Path to directory to read from") 83 | parser.add_argument("-outputdir", help="Path to write the annotations (.tsv files)") 84 | args = parser.parse_args() 85 | 86 | add_dir(args.inputdir, args.outputdir) 87 | 88 | if __name__ == "__main__": 89 | main() 90 | -------------------------------------------------------------------------------- /nordlys/wikipedia/indexer.py: -------------------------------------------------------------------------------- 1 | """ 2 | Creates a Lucene index for Wikipedia articles. 3 | 4 | - A single field index is created. 5 | - disambiguation and list pages are ignored. 6 | - wiki page annotations are ignored and only mentions are kept. 7 | 8 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 9 | """ 10 | import argparse 11 | import os 12 | import re 13 | from urllib import unquote 14 | from nordlys.wikipedia.utils import WikipediaUtils 15 | from nordlys.tagme.lucene_tools import Lucene 16 | 17 | 18 | class Indexer(object): 19 | tagRE = re.compile(r'(.*?)(<(/?\w+)[^>]*>)(?:([^<]*)(<.*?>)?)?') 20 | idRE = re.compile(r'id="([0-9]+)"') 21 | titleRE = re.compile(r'title="(.*)"') 22 | linkRE = re.compile(r'href="(.*)"') 23 | 24 | def __init__(self, annot_only): 25 | self.annot_only = annot_only 26 | self.contents = None 27 | self.lucene = None 28 | 29 | def __add_to_contents(self, field_name, field_value, field_type): 30 | """ 31 | Adds field to document contents. 32 | Field value can be a list, where each item is added separately (i.e., the field is multi-valued). 33 | """ 34 | if type(field_value) is list: 35 | for fv in field_value: 36 | self.__add_to_contents(field_name, fv, field_type) 37 | else: 38 | if len(field_value) > 0: # ignore empty fields 39 | self.contents.append({'field_name': field_name, 40 | 'field_value': field_value, 41 | 'field_type': field_type}) 42 | 43 | def index_file(self, file_name): 44 | """ 45 | Adds one file to the index. 46 | 47 | :param file_name: file to be indexed 48 | """ 49 | self.contents = [] 50 | article_text = "" 51 | article_annots = [] # for annot-only index 52 | 53 | f = open(file_name, "r") 54 | for line in f: 55 | line = line.replace("#redirect", "") 56 | # ------ Reaches the end tag for an article --------- 57 | if re.search(r'', line): 58 | # ignores null titles 59 | if wiki_uri is None: 60 | print "\tINFO: Null Wikipedia title!" 61 | # ignores disambiguation pages 62 | elif (wiki_uri.endswith("(disambiguation)>")) or \ 63 | ((len(article_text) < 200) and ("may refer to:" in article_text)): 64 | print "\tINFO: disambiguation page " + wiki_uri + " ignored!" 65 | # ignores list pages 66 | elif (wiki_uri.startswith(" --collection surfaceforms_wiki_YYYYMMDD --file --jsonArray 6 | 7 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 8 | """ 9 | import argparse 10 | 11 | import os 12 | import json 13 | from urllib import unquote 14 | from nordlys.storage.mongo import Mongo 15 | from nordlys.wikipedia.utils import WikipediaUtils 16 | 17 | 18 | class Merger(object): 19 | 20 | def __init__(self): 21 | self.all_sfs = {} 22 | 23 | def merge_all(self, titles_file, redirects_file, anchors_file, out_file): 24 | self.add_anchors(anchors_file) 25 | self.add_titles(titles_file) 26 | self.add_redirects(redirects_file) 27 | 28 | # Converting all surface forms to mongo format 29 | print "Converting to mongodb format ..." 
30 | sf_mongo_entries = [] 31 | i = 0 32 | for sf, en_sources in self.all_sfs.iteritems(): 33 | escaped_sf = Mongo.escape(sf) 34 | entry = {"_id": escaped_sf} 35 | for source, en in en_sources.iteritems(): 36 | entry[source] = en 37 | sf_mongo_entries.append(entry) 38 | i += 1 39 | if i % 1000000 == 0: 40 | print "processes", i, "the surface form" 41 | print "writing to json file ..." 42 | json.dump(sf_mongo_entries, open(out_file, "w"), indent=4, sort_keys=True) 43 | 44 | def __add_to_dict(self, sf, pred, en, count=1): 45 | if sf not in self.all_sfs: 46 | self.all_sfs[sf] = {} 47 | if pred not in self.all_sfs[sf]: 48 | self.all_sfs[sf][pred] = {en: count} 49 | else: 50 | self.all_sfs[sf][pred][en] = count 51 | 52 | # ============== ANCHORS ============== 53 | 54 | def add_anchors(self, anchor_file): 55 | print "Adding anchors ..." 56 | i = 0 57 | infile = open(anchor_file, "r") 58 | for line in infile: 59 | # print line 60 | cols = line.strip().split("\t") 61 | sf = cols[0].strip() 62 | count = int(cols[2]) 63 | wiki_uri = WikipediaUtils.wiki_title_to_uri(unquote(cols[1].strip())) 64 | self.__add_to_dict(sf, "anchor", wiki_uri, count) 65 | i += 1 66 | if i % 1000000 == 0: 67 | print "Processed", i, "th anchor!" 68 | 69 | # ============== REDIRECTS ============== 70 | 71 | def add_redirects(self, redirect_file): 72 | """Adds redirect pages to the surface form dictionary.""" 73 | print "Adding redirects ..." 74 | redirects = open(redirect_file, "r") 75 | count = 0 76 | for line in redirects: 77 | cols = line.strip().split("\t") 78 | sf = cols[0].strip().lower() 79 | wiki_uri = WikipediaUtils.wiki_title_to_uri(cols[1].strip()) 80 | # print sf, wiki_uri 81 | self.__add_to_dict(sf, "redirect", wiki_uri) 82 | count += 1 83 | if count % 1000000 == 0: 84 | print "Processed ", count, "th redirects." 85 | 86 | # ============== TITLES ============== 87 | 88 | def add_titles(self, title_file): 89 | """Adds titles and title name variants to the surface form dictionary.""" 90 | print "Adding titles ..." 91 | redirects = open(title_file, "r") 92 | count = 0 93 | for line in redirects: 94 | cols = line.strip().split("\t") 95 | title = unquote(cols[1].strip()) 96 | wiki_uri = WikipediaUtils.wiki_title_to_uri(title) 97 | self.__add_to_dict(title.lower(), "title", wiki_uri) 98 | title_nv = self.__title_nv(title) 99 | if (title_nv != title) and (title_nv.strip() != ""): 100 | self.__add_to_dict(title_nv.lower(), "title-nv", wiki_uri) 101 | count += 1 102 | if count % 1000000 == 0: 103 | print "Processed ", count, "th titles." 
104 | 105 | @staticmethod 106 | def __title_nv(title): 107 | """Removes all letters after "(" and "," from page title.""" 108 | p_pos = title.find("(") 109 | title_nv = title[:p_pos] if p_pos != -1 else title 110 | c_pos = title.find(",") 111 | title_nv = title[:c_pos] if c_pos != -1 else title_nv 112 | return title_nv.strip() 113 | 114 | 115 | def main(): 116 | parser = argparse.ArgumentParser() 117 | parser.add_argument("-anchors", help="Path to anchor file") 118 | parser.add_argument("-redirects", help="Path to redirect file") 119 | parser.add_argument("-titles", help="Path to page-title file") 120 | parser.add_argument("-outputdir", help="Path to output directory") 121 | args = parser.parse_args() 122 | 123 | 124 | # Merges titles, redirects, and anchors 125 | merger = Merger() 126 | merger.merge_all(args.titles, args.redirects, args.anchors, args.outputdir + "/sf_dict_mongo.json") 127 | 128 | if __name__ == "__main__": 129 | main() -------------------------------------------------------------------------------- /nordlys/wikipedia/pageid_extractor.py: -------------------------------------------------------------------------------- 1 | """ 2 | Extracts page id and titles from Wikipedia dump and writes them into a single file 3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | 7 | import argparse 8 | import os 9 | import re 10 | 11 | 12 | tagRE = re.compile(r'(.*?)(<(/?\w+)[^>]*>)(?:([^<]*)(<.*?>)?)?') 13 | idRE = re.compile(r'id="([0-9]+)"') 14 | titleRE = re.compile(r'title="(.*)"') 15 | 16 | 17 | def read_file(file_name): 18 | """Extracts page ids and titles from a single file.""" 19 | out_str = "" 20 | f = open(file_name, "r") 21 | 22 | for line in f: 23 | for m in tagRE.finditer(line): 24 | if not m: 25 | continue 26 | tag = m.group(3) 27 | if tag == "doc": 28 | doc_id = idRE.search(m.group(2)) 29 | doc_title = titleRE.search(m.group(2)) 30 | if (not doc_id) or (not doc_title): 31 | print "\nINFO: doc id or title not found in " + file_name, 32 | continue 33 | out_str += doc_id.group(1) + "\t" + doc_title.group(1) + "\n" 34 | break 35 | return out_str 36 | 37 | 38 | def read_files(basedir, output_file): 39 | """Extracts page id and titles to a single file.""" 40 | open(output_file, "w").close() 41 | out_file = open(output_file, "a") 42 | for path, dirs, _ in os.walk(basedir): 43 | for dir in sorted(dirs): 44 | for _, _, files in os.walk(os.path.join(basedir, dir)): 45 | for fn in sorted(files): 46 | print "parsing ", os.path.join(basedir + dir, fn), "..." 47 | out_str = read_file(os.path.join(basedir + dir, fn)) 48 | out_file.write(out_str) 49 | 50 | 51 | def main(): 52 | parser = argparse.ArgumentParser() 53 | parser.add_argument("-inputdir", help="Path to directory to read from") 54 | parser.add_argument("-output", help="Path to write the annotations (.tsv files)") 55 | args = parser.parse_args() 56 | 57 | read_files(args.inputdir, args.output + "/page-id-titles.txt") 58 | print "All page ids are added" 59 | 60 | 61 | if __name__ == "__main__": 62 | main() 63 | -------------------------------------------------------------------------------- /nordlys/wikipedia/utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Wikipedia utils. 
3 | 4 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 5 | """ 6 | 7 | from urllib import quote 8 | 9 | 10 | class WikipediaUtils(object): 11 | mongo = None 12 | 13 | @staticmethod 14 | def wiki_title_to_uri(title): 15 | """ 16 | Converts wiki page title to wiki_uri 17 | based on https://en.wikipedia.org/wiki/Wikipedia:Page_name#Spaces.2C_underscores_and_character_coding 18 | encoding based on http://dbpedia.org/services-resources/uri-encoding 19 | """ 20 | if title: 21 | wiki_uri = "" 22 | return wiki_uri 23 | else: 24 | return None 25 | 26 | @staticmethod 27 | def wiki_uri_to_dbp_uri(wiki_uri): 28 | """Converts Wikipedia uri to DBpedia URI.""" 29 | return wiki_uri.replace("=2.7 3 | sphinx-bootstrap-theme>=0.4.0 4 | sphinxcontrib-httpdomain>=1.2.1 5 | lxml>=2.3.2 6 | beautifulsoup4>=4.3.2 7 | numpy>=1.8.1 8 | nltk>=2.0.4 9 | -------------------------------------------------------------------------------- /run_scripts.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | # NOTES: 4 | # 1. Before running this script, download the `data` folder from `http://hasibi.com/files/res/data.tar.gz` 5 | # and put it under the main repository directory (i.e., tagme-rep/data) 6 | # 2. For all of the experiments we get all linked entities by setting the threshold to 0. 7 | # Later, in the evaluation scripts we filter out the entities below a certain threshold. 8 | 9 | # =============== 10 | # Reproducibility 11 | # =============== 12 | 13 | # TAGME-API - Wiki-Disamb30 14 | python -m nordlys.tagme.tagme_api -data wiki-disamb30 15 | python -m scripts.evaluator_disamb qrels/qrels_wiki-disamb30.txt output/wiki-disamb30_tagmeAPI.txt 16 | 17 | # TAGME-API - Wiki-Annot30 18 | python -m nordlys.tagme.tagme_api -data wiki-annot30 19 | python -m scripts.evaluator_annot qrels/qrels_wiki-annot30.txt output/wiki-annot30_tagmeAPI.txt 0.2 20 | python -m scripts.evaluator_topics qrels/qrels_wiki-annot30.txt output/wiki-annot30_tagmeAPI.txt 0.2 21 | 22 | # TAGME-our(wiki10) 23 | python -m nordlys.tagme.tagme -data wiki-annot30 24 | python -m scripts.evaluator_annot qrels/qrels_wiki-annot30.txt output/wiki-annot30_tagme_wiki10.txt 0.2 25 | python -m scripts.evaluator_topics qrels/qrels_wiki-annot30.txt output/wiki-annot30_tagme_wiki10.txt 0.2 26 | 27 | # Dexter 28 | python -m nordlys.tagme.dexter_api -data wiki-annot30 29 | python -m scripts.evaluator_annot qrels/qrels_wiki-annot30.txt output/wiki-annot30_dexter.txt 0.2 30 | python -m scripts.evaluator_topics qrels/qrels_wiki-annot30.txt output/wiki-annot30_dexter.txt 0.2 31 | 32 | 33 | 34 | # ================ 35 | # Generalizability 36 | # ================ 37 | 38 | # TAGME API - ERD-dev 39 | python -m nordlys.tagme.tagme_api -data erd-dev 40 | python -m scripts.to_elq output/erd-dev_tagmeAPI.txt 0.1 41 | python -m scripts.evaluator_strict qrels/qrels_erd-dev.txt output/erd-dev_tagmeAPI_0.1.elq 42 | 43 | # TAGME API - Y-ERD 44 | python -m nordlys.tagme.tagme_api -data y-erd 45 | python -m scripts.to_elq output/y-erd_tagmeAPI.txt 0.1 46 | python -m scripts.evaluator_strict qrels/qrels_y-erd.txt output/y-erd_tagmeAPI_0.1.elq 47 | 48 | #TAGME-wp10 - ERD-dev 49 | python -m nordlys.tagme.tagme -data erd-dev 50 | python -m scripts.to_elq output/erd-dev_tagme_wiki10.txt 0.1 51 | python -m scripts.evaluator_strict qrels/qrels_erd-dev.txt output/erd-dev_tagme_wiki10_0.1.elq 52 | 53 | #TAGME-wp10 - Y-ERD 54 | python -m nordlys.tagme.tagme -data y-erd 55 | python -m scripts.to_elq output/y-erd_tagme_wiki10.txt 
0.1 56 | python -m scripts.evaluator_strict qrels/qrels_y-erd.txt output/y-erd_tagme_wiki10_0.1.elq 57 | 58 | #TAGME-wp12 - ERD-dev 59 | python -m nordlys.tagme.tagme -data erd-dev 60 | python -m scripts.to_elq output/erd-dev_tagme_wiki12.txt 0.1 61 | python -m scripts.evaluator_strict qrels/qrels_erd-dev.txt output/erd-dev_tagme_wiki12_0.1.elq 62 | 63 | #TAGME-wp12 - Y-ERD 64 | python -m nordlys.tagme.tagme -data y-erd 65 | python -m scripts.to_elq output/y-erd_tagme_wiki12.txt 0.1 66 | python -m scripts.evaluator_strict qrels/qrels_y-erd.txt output/y-erd_tagme_wiki12_0.1.elq 67 | 68 | # Dexter - ERD-dev 69 | python -m nordlys.tagme.dexter_api -data erd-dev 70 | python -m scripts.to_elq output/erd-dev_dexter.txt 0.1 71 | python -m scripts.evaluator_strict qrels/qrels_erd-dev.txt output/erd-dev_dexter_0.1.elq 72 | 73 | # Dexter - Y-ERD 74 | python -m nordlys.tagme.dexter_api -data y-erd 75 | python -m scripts.to_elq output/y-erd_dexter.txt 0.1 76 | python -m scripts.evaluator_strict qrels/qrels_y-erd.txt output/y-erd_dexter_0.1.elq 77 | -------------------------------------------------------------------------------- /scripts/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hasibi/TAGME-Reproducibility/d21ed0d826fc60a6e4caaa5ec7b6c39e16f7c6c6/scripts/__init__.py -------------------------------------------------------------------------------- /scripts/evaluator_annot.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script computes annotation (Annot) metrics for the end-to-end performance. 3 | Precision and recall are macro-averaged. 4 | Matching condition: entities should match and mentions should be equal or contained in each other. 5 | 6 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 7 | """ 8 | 9 | from __future__ import division 10 | import sys 11 | from collections import defaultdict 12 | 13 | 14 | class EvaluatorAnnot(object): 15 | def __init__(self, qrels, results, score_th, null_qrels=None): 16 | self.qrels_dict = self.__group_by_queries(qrels) 17 | self.results_dict = self.__group_by_queries(results, res=True, score_th=score_th) 18 | self.null_qrels = self.__group_by_queries(null_qrels) if null_qrels else None 19 | 20 | @staticmethod 21 | def __group_by_queries(file_lines, res=False, score_th=None): 22 | """ 23 | Groups the lines by query id. 24 | 25 | :param file_lines: list of lines [[qid, score, en_id, mention, page_id], ...] 26 | :return: {qid: {(men0, en0), (men1, en01), ..}, ..}; 27 | """ 28 | grouped_inters = defaultdict(set) 29 | for cols in file_lines: 30 | if len(cols) > 2: 31 | if res and (float(cols[1]) < score_th): 32 | continue 33 | grouped_inters[cols[0]].add((cols[3].lower(), cols[2].lower())) 34 | return grouped_inters 35 | 36 | def rm_nulls_from_res(self): 37 | """ 38 | Removes mentions that are not linked to an entity in the qrel. 39 | There are some entities in the qrel with "*NONE*" as id. We remove the related mentions from the result file. 40 | Null entities are generated due to the inconsistency between the TAGME Wikipedia dump (2009) and our dump (2010). 41 | """ 42 | print "Removing mentions with null entities ..." 43 | new_results_dict = defaultdict(set) 44 | for qid in self.results_dict: 45 | # easy case: the query does not have any null entity.
46 | if qid not in set(self.null_qrels.keys()): 47 | new_results_dict[qid] = self.results_dict[qid] 48 | continue 49 | 50 | qrel_null_mentions = [item[0] for item in self.null_qrels[qid]] 51 | # check null mentions with results mentions 52 | for men, en in self.results_dict[qid]: 53 | is_null = False 54 | for qrel_null_men in qrel_null_mentions: 55 | # results mention does not match null qrel mention 56 | if mention_match(qrel_null_men, men): 57 | is_null = True 58 | break 59 | 60 | if not is_null: 61 | new_results_dict[qid].add((men, en)) 62 | self.results_dict = new_results_dict 63 | 64 | def eval(self, eval_query_func): 65 | """ 66 | Evaluates all queries and calculates total precision, recall and F1 (macro averaging). 67 | 68 | :param eval_query_func: A function that takes qrel and results for a query and returns evaluation metrics 69 | :return Total precision, recall, and F1 for all queries 70 | """ 71 | self.rm_nulls_from_res() 72 | queries_eval = {} 73 | total_prec, total_rec, total_f = 0, 0, 0 74 | for qid in sorted(self.qrels_dict): 75 | queries_eval[qid] = eval_query_func(self.qrels_dict[qid], self.results_dict.get(qid, {})) 76 | 77 | total_prec += queries_eval[qid]['prec'] 78 | total_rec += queries_eval[qid]['rec'] 79 | 80 | n = len(self.qrels_dict) # number of queries 81 | total_prec /= n 82 | total_rec /= n 83 | total_f = 2 * total_prec * total_rec / (total_prec + total_rec) 84 | 85 | log = "\n----------------" + "\nEvaluation results:\n" + \ 86 | "Prec: " + str(round(total_prec, 4)) + "\n" +\ 87 | "Rec: " + str(round(total_rec, 4)) + "\n" + \ 88 | "F1: " + str(round(total_f, 4)) + "\n" + \ 89 | "all: " + str(round(total_prec, 4)) + ", " + str(round(total_rec, 4)) + ", " + str(round(total_f, 4)) 90 | print log 91 | metrics = {'prec': total_prec, 'rec': total_rec, 'f': total_f} 92 | return metrics 93 | 94 | 95 | def erd_eval_query(query_qrels, query_results): 96 | """ 97 | Evaluates a single query. 98 | 99 | :param query_qrels: Query interpretations from Qrel [{en1, en2, ..}, ..] 100 | :param query_results: Query interpretations from result file [{en1, en2, ..}, ..] 101 | :return: precision, recall, and F1 for a query 102 | """ 103 | tp = 0 # correct 104 | fn = 0 # missed 105 | fp = 0 # incorrectly returned 106 | 107 | # ----- Query has at least an interpretation set. ----- 108 | # Iterate over qrels to calculate TP and FN 109 | for qrel_item in query_qrels: 110 | if find_item(qrel_item, query_results): 111 | tp += 1 112 | else: 113 | fn += 1 114 | # Iterate over results to calculate FP 115 | for res_item in query_results: 116 | if not find_item(res_item, query_qrels): # Finds the result in the qrels 117 | fp += 1 118 | 119 | prec = tp / (tp+fp) if tp+fp != 0 else 0 120 | rec = tp / (tp+fn) if tp+fn != 0 else 0 121 | f = (2 * prec * rec) / (prec + rec) if prec + rec != 0 else 0 122 | metrics = {'prec': prec, 'rec': rec, 'f': f} 123 | return metrics 124 | 125 | 126 | def find_item(item_to_find, items_list): 127 | """ 128 | Returns True if an item is found in the item list. 129 | 130 | :param item_to_find: item to be found 131 | :param items_list: list of items to search in 132 | :return boolean 133 | """ 134 | is_found = False 135 | 136 | for item in items_list: 137 | if (item[1] == item_to_find[1]) and mention_match(item[0], item_to_find[0]): 138 | is_found = True 139 | return is_found 140 | 141 | 142 | def mention_match(mention1, mention2): 143 | """ 144 | Checks if two mentions matches each other. 145 | Matching condition: One of the mentions is sub-string of the other one. 
146 | """ 147 | match = ((mention1 in mention2) or (mention2 in mention1)) 148 | return match 149 | 150 | 151 | def parse_file(file_name, res=False): 152 | """ 153 | Parses file and returns the positive instances for each query. 154 | 155 | :param file_name: Name of file to be parsed 156 | :return lists of lines [[qid, label, en_id, ...], ...], lines with null entities are separated 157 | """ 158 | null_lines = [] 159 | file_lines = [] 160 | efile = open(file_name, "r") 161 | for line in efile.readlines(): 162 | if line.strip() == "": 163 | continue 164 | cols = line.strip().split("\t") 165 | if (not res) and (cols[2].strip() == "*NONE*"): 166 | null_lines.append(cols) 167 | else: 168 | file_lines.append(cols) 169 | return file_lines, null_lines 170 | 171 | 172 | def main(args): 173 | if len(args) < 2: 174 | print "\tUsage: " 175 | exit(0) 176 | print "parsing qrel ..." 177 | qrels, null_qrels = parse_file(args[0]) # here qrel does not contain null entities 178 | print "parsing results ..." 179 | results = parse_file(args[1], res=True)[0] 180 | print "evaluating ..." 181 | evaluator = EvaluatorAnnot(qrels, results, float(args[2]), null_qrels=null_qrels) 182 | evaluator.eval(erd_eval_query) 183 | 184 | if __name__ == '__main__': 185 | main(sys.argv[1:]) 186 | -------------------------------------------------------------------------------- /scripts/evaluator_disamb.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script computes evaluation metrics for the disambiguation phase. 3 | For each query, if the ground truth entity is found in the results, both precision and recall are set to 1; 4 | otherwise to 0. 5 | 6 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 7 | """ 8 | 9 | from __future__ import division 10 | import sys 11 | from collections import defaultdict 12 | 13 | 14 | class EvaluatorDisamb(object): 15 | 16 | def __init__(self, qrels, results, null_qrels=None): 17 | self.qrels_dict = self.__group_by_queries(qrels) 18 | self.results_dict = self.__group_by_queries(results) 19 | self.null_qrels = self.__group_by_queries(null_qrels) if null_qrels else None 20 | 21 | @staticmethod 22 | def __group_by_queries(file_lines): 23 | """ 24 | Groups the lines by query id. 25 | 26 | :param file_lines: list of lines [[qid, score, wiki_uri, mention, page_id], ...] 27 | :return: {qid: {(men0, en0), (men1, en01), ..}, ..}; 28 | """ 29 | grouped_inters = defaultdict(set) 30 | for cols in file_lines: 31 | if len(cols) > 2: 32 | grouped_inters[cols[0]].add((cols[3].lower(), cols[2])) 33 | return grouped_inters 34 | 35 | def eval(self, eval_query_func): 36 | """ 37 | Evaluates all queries and calculates total precision, recall and F1 (macro averaging).
38 | 39 | :param eval_query_func: A function that takes qrel and results for a query and returns evaluation metrics 40 | :return Total precision, recall, and F1 for all queries 41 | """ 42 | queries_eval = {} 43 | total_prec, total_rec, total_f = 0, 0, 0 44 | for qid in set(sorted(self.qrels_dict)): 45 | queries_eval[qid] = eval_query_func(self.qrels_dict[qid], self.results_dict.get(qid, {})) 46 | total_prec += queries_eval[qid]['prec'] 47 | total_rec += queries_eval[qid]['rec'] 48 | 49 | n = len(self.qrels_dict) # number of queries 50 | total_prec /= n 51 | total_rec /= n 52 | total_f = (2 * total_prec * total_rec) / (total_prec + total_rec) 53 | 54 | log = "\n----------------" + "\nEvaluation results:\n" + \ 55 | "Prec: " + str(round(total_prec, 4)) + "\n" +\ 56 | "Rec: " + str(round(total_rec, 4)) + "\n" + \ 57 | "F1: " + str(round(total_f, 4)) + "\n" + \ 58 | "all: " + str(round(total_prec, 4)) + ", " + str(round(total_rec, 4)) + ", " + str(round(total_f, 4)) 59 | print log 60 | metrics = {'prec': total_prec, 'rec': total_rec, 'f': total_f} 61 | return metrics 62 | 63 | 64 | def erd_eval_query(query_qrels, query_results): 65 | """ 66 | Evaluates a single query. 67 | 68 | :param query_qrels: Query interpretations from Qrel [{en1, en2, ..}, ..] 69 | :param query_results: Query interpretations from result file [{en1, en2, ..}, ..] 70 | :return: precision, recall, and F1 for a query 71 | """ 72 | prec, rec = 0, 0 73 | 74 | # ----- Query has at least an interpretation set. ----- 75 | # Iterate over qrels to calculate TP and FN 76 | for qrel_item in query_qrels: 77 | if find_item(qrel_item, query_results): 78 | prec += 1 79 | rec += 1 80 | 81 | prec /= len(query_qrels) 82 | rec /= len(query_qrels) 83 | f = (2 * prec * rec) / (prec + rec) if prec + rec != 0 else 0 84 | metrics = {'prec': prec, 'rec': rec, 'f': f} 85 | return metrics 86 | 87 | 88 | def find_item(item_to_find, items_list): 89 | """ 90 | Returns True if an item is found in the item list. 91 | 92 | :param item_to_find: item to be found 93 | :param items_list: list of items to search in 94 | :return boolean 95 | """ 96 | is_found = False 97 | 98 | for item in items_list: 99 | if item[1] == item_to_find[1]: 100 | is_found = True 101 | return is_found 102 | 103 | 104 | def parse_file(file_name, res=False): 105 | """ 106 | Parses file and returns the positive instances for each query. 107 | 108 | :param file_name: Name of file to be parsed 109 | :return list of lines [[qid, label, en_id, ...], ...] 110 | """ 111 | null_lines = [] 112 | file_lines = [] 113 | efile = open(file_name, "r") 114 | for line in efile.readlines(): 115 | if line.strip() == "": 116 | continue 117 | cols = line.strip().split("\t") 118 | if (not res) and (cols[2].strip() == "*NONE*"): 119 | null_lines.append(cols) 120 | else: 121 | file_lines.append(cols) 122 | return file_lines, null_lines 123 | 124 | 125 | def main(args): 126 | if len(args) < 2: 127 | print "\tUsage: " 128 | exit(0) 129 | print "parsing qrel ..." 130 | qrels, null_qrels = parse_file(args[0]) # here qrel does not contain null entities 131 | print "parsing results ..." 132 | results = parse_file(args[1], res=True)[0] 133 | print "evaluating ..." 
134 | evaluator = EvaluatorDisamb(qrels, results, null_qrels=null_qrels) 135 | evaluator.eval(erd_eval_query) 136 | 137 | 138 | if __name__ == '__main__': 139 | main(sys.argv[1:]) 140 | -------------------------------------------------------------------------------- /scripts/evaluator_strict.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script evaluates query interpretations based on the strict evaluation metrics; 3 | macro averaging of precision, recall and F-measure. 4 | 5 | For detailed information see: 6 | F. Hasibi, K. Balog, and S. E. Bratsberg. "Entity Linking in Queries: Tasks and Evaluation", 7 | In Proceedings of ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '15), Sep 2015. 8 | DOI: http://dx.doi.org/10.1145/2808194.2809473 9 | 10 | Usage: 11 | python evaluation_erd.py 12 | e.g. 13 | python evaluation_erd.py qrels_sets_ERD-dev.txt ERD-dev_MLMcg-GIF.txt 14 | 15 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 16 | """ 17 | 18 | from __future__ import division 19 | import sys 20 | from collections import defaultdict 21 | 22 | 23 | class Evaluator(object): 24 | 25 | def __init__(self, qrels, results): 26 | self.qrels_dict = self.__group_by_queries(qrels) 27 | self.results_dict = self.__group_by_queries(results) 28 | qid_overlap = set(self.qrels_dict.keys()) & set(self.results_dict.keys()) 29 | if len(qid_overlap) == 0: 30 | print "ERR: Query mismatch between qrel and result file!" 31 | exit(0) 32 | 33 | @staticmethod 34 | def __group_by_queries(file_lines): 35 | """ 36 | Groups the lines by query id. 37 | 38 | :param file_lines: list of lines [[qid, label, en_id, ...], ...] 39 | :return: {qid: [iset0, iset1, ..], ..}; isets are sets of entity ids 40 | """ 41 | grouped_inters = defaultdict(list) 42 | for cols in file_lines: 43 | if len(cols) > 2: 44 | grouped_inters[cols[0]].append(set(cols[2:])) 45 | elif cols[0] not in grouped_inters: 46 | grouped_inters[cols[0]] = [] 47 | 48 | # check that identical interpretations are not assigned to a query 49 | for qid, interprets in grouped_inters.iteritems(): 50 | q_interprets = set() 51 | for inter in interprets: 52 | if tuple(sorted(inter)) in q_interprets: 53 | print "Err: Identical interpretations for query [" + qid + "]!" 54 | exit(0) 55 | else: 56 | q_interprets.add(tuple(sorted(inter))) 57 | return grouped_inters 58 | 59 | def eval(self, eval_query_func): 60 | """ 61 | Evaluates all queries and calculates total precision, recall and F1 (macro averaging). 
62 | 63 | :param eval_query_func: A function that takes qrel and results for a query and returns evaluation metrics 64 | :return Total precision, recall, and F1 for all queries 65 | """ 66 | queries_eval = {} 67 | total_prec, total_rec, total_f = 0, 0, 0 68 | for qid in sorted(self.qrels_dict): 69 | queries_eval[qid] = eval_query_func(self.qrels_dict[qid], self.results_dict.get(qid, [])) 70 | total_prec += queries_eval[qid]['prec'] 71 | total_rec += queries_eval[qid]['rec'] 72 | n = len(self.qrels_dict) # number of queries 73 | total_prec /= n 74 | total_rec /= n 75 | total_f = (2 * total_rec * total_prec) / (total_rec + total_prec) if total_prec + total_rec != 0 else 0 76 | 77 | log = "\n----------------" + "\nEvaluation results:\n" + \ 78 | "Prec: " + str(round(total_prec, 4)) + "\n" +\ 79 | "Rec: " + str(round(total_rec, 4)) + "\n" + \ 80 | "F1: " + str(round(total_f, 4)) + "\n" + \ 81 | "all: " + str(round(total_prec, 4)) + ", " + str(round(total_rec, 4)) + ", " + str(round(total_f, 4)) 82 | print log 83 | metrics = {'prec': total_prec, 'rec': total_rec, 'f': total_f} 84 | return metrics 85 | 86 | 87 | def erd_eval_query(query_qrels, query_results): 88 | """ 89 | Evaluates a single query. 90 | 91 | :param query_qrels: Query interpretations from Qrel [{en1, en2, ..}, ..] 92 | :param query_results: Query interpretations from result file [{en1, en2, ..}, ..] 93 | :return: precision, recall, and F1 for a query 94 | """ 95 | tp = 0 # correct 96 | fn = 0 # missed 97 | fp = 0 # incorrectly returned 98 | 99 | # ----- Query has no interpretation set. ------ 100 | if len(query_qrels) == 0: 101 | if len(query_results) == 0: 102 | return {'prec': 1, 'rec': 1, 'f': 1} 103 | return {'prec': 0, 'rec': 0, 'f': 0} 104 | 105 | # ----- Query has at least an interpretation set. ----- 106 | # Iterate over qrels to calculate TP and FN 107 | for qrel_item in query_qrels: 108 | if find_item(qrel_item, query_results): 109 | tp += 1 110 | else: 111 | fn += 1 112 | # Iterate over results to calculate FP 113 | for res_item in query_results: 114 | if not find_item(res_item, query_qrels): # Finds the result in the qrels 115 | fp += 1 116 | 117 | prec = tp / (tp+fp) if tp+fp != 0 else 0 118 | rec = tp / (tp+fn) if tp+fn != 0 else 0 119 | metrics = {'prec': prec, 'rec': rec} 120 | return metrics 121 | 122 | 123 | def find_item(item_to_find, items_list): 124 | """ 125 | Returns True if an item is found in the item list. 126 | 127 | :param item_to_find: item to be found 128 | :param items_list: list of items to search in 129 | :return boolean 130 | """ 131 | is_found = False 132 | 133 | item_to_find = set([en.lower() for en in item_to_find]) 134 | 135 | for item in items_list: 136 | item = set([en.lower() for en in item]) 137 | if item == item_to_find: 138 | is_found = True 139 | return is_found 140 | 141 | 142 | def parse_file(file_name): 143 | """ 144 | Parses file and returns the positive instances for each query. 145 | 146 | :param file_name: Name of file to be parsed 147 | :return list of lines [[qid, label, en_id, ...], ...] 
148 | """ 149 | file_lines = [] 150 | efile = open(file_name, "r") 151 | for line in efile.readlines(): 152 | if line.strip() == "": 153 | continue 154 | cols = line.strip().split("\t") 155 | file_lines.append(cols) 156 | return file_lines 157 | 158 | 159 | def main(args): 160 | if len(args) < 2: 161 | print "\tUsage: [qrel_file] [result_file]" 162 | exit(0) 163 | qrels = parse_file(args[0]) 164 | results = parse_file(args[1]) 165 | evaluator = Evaluator(qrels, results) 166 | evaluator.eval(erd_eval_query) 167 | 168 | if __name__ == '__main__': 169 | main(sys.argv[1:]) 170 | -------------------------------------------------------------------------------- /scripts/evaluator_topics.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script computes Topic metrics for the end-to-end performance. 3 | Precision and recall are micro-averaged. 4 | Matching condition: only entities should match. 5 | 6 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 7 | """ 8 | 9 | from __future__ import division 10 | import sys 11 | from collections import defaultdict 12 | 13 | 14 | class EvaluatorTopics(object): 15 | 16 | def __init__(self, qrels, results, null_qrels=None, score_th=0): 17 | self.qrels_dict = self.__group_by_queries(qrels) 18 | self.results_dict = self.__group_by_queries(results, score_th=score_th) 19 | self.null_qrels = self.__group_by_queries(null_qrels) if null_qrels else None 20 | self.score_th = score_th 21 | 22 | @staticmethod 23 | def __group_by_queries(file_lines, score_th=None): 24 | """ 25 | Groups the lines by query id. 26 | 27 | :param file_lines: list of lines [[qid, score, en_id, mention, page_id], ...] 28 | :return: {qid: {(men0, en0), (men1, en01), ..}, ..}; 29 | """ 30 | grouped_inters = defaultdict(set) 31 | for cols in file_lines: 32 | if len(cols) > 2: 33 | if score_th and (float(cols[1]) < score_th): 34 | continue 35 | grouped_inters[cols[0]].add((cols[3].lower(), cols[2].lower())) 36 | return grouped_inters 37 | 38 | def rm_nulls_res(self): 39 | """ 40 | Removes mentions that not linked to an entity in the qrel. 41 | There are some entities in the qrel with "*NONE*" as id. We remove the related mentions from the result file. 42 | Null entities are generated due to the inconsistency between TAGME Wikipedia dump (2009) and our dump (2010). 43 | """ 44 | print "Removing mentions with null entities ..." 45 | new_results_dict = defaultdict(set) 46 | for qid in self.results_dict: 47 | # easy case 48 | if qid not in set(self.null_qrels.keys()): 49 | new_results_dict[qid] = self.results_dict[qid] 50 | continue 51 | 52 | qrel_null_mentions = [item[0] for item in self.null_qrels[qid]] 53 | # check null mentions with results mentions 54 | for men, en in self.results_dict[qid]: 55 | is_null = False 56 | for qrel_null_men in qrel_null_mentions: 57 | # results mention does not match null qrel mention 58 | if mention_match(qrel_null_men, men): 59 | is_null = True 60 | break 61 | 62 | if not is_null: 63 | new_results_dict[qid].add((men, en)) 64 | # else: 65 | # print qid, men, en, "QREL mention:", qrel_null_men, "-*-" 66 | self.results_dict = new_results_dict 67 | 68 | def eval(self, eval_query_func): 69 | """ 70 | Evaluates all queries and calculates total precision, recall and F1 (macro averaging). 
71 | 72 | :param eval_query_func: A function that takes qrel and results for a query and returns evaluation metrics 73 | :return Total precision, recall, and F1 for all queries 74 | """ 75 | self.rm_nulls_res() 76 | print "comparing results ..." 77 | queries_eval = {} 78 | total_tp, total_fp, total_fn = 0, 0, 0 79 | for qid in sorted(self.qrels_dict): 80 | queries_eval[qid] = eval_query_func(self.qrels_dict[qid], self.results_dict.get(qid, {})) 81 | 82 | total_tp += queries_eval[qid]['tp'] 83 | total_fp += queries_eval[qid]['fp'] 84 | total_fn += queries_eval[qid]['fn'] 85 | 86 | total_prec = total_tp / (total_tp + total_fp) 87 | total_rec = total_tp / (total_tp + total_fn) 88 | total_f = 2 * total_prec * total_rec / (total_prec + total_rec) 89 | 90 | log = "\n----------------" + "\nEvaluation results:\n" + \ 91 | "Prec: " + str(round(total_prec, 4)) + "\n" +\ 92 | "Rec: " + str(round(total_rec, 4)) + "\n" + \ 93 | "F1: " + str(round(total_f, 4)) + "\n" + \ 94 | "all: " + str(round(total_prec, 4)) + ", " + str(round(total_rec, 4)) + ", " + str(round(total_f, 4)) 95 | print log 96 | metrics = {'prec': total_prec, 'rec': total_rec, 'f': total_f} 97 | return metrics 98 | 99 | 100 | def erd_eval_query(query_qrels, query_results): 101 | """ 102 | Evaluates a single query. 103 | 104 | :param query_qrels: Query interpretations from Qrel [{en1, en2, ..}, ..] 105 | :param query_results: Query interpretations from result file [{en1, en2, ..}, ..] 106 | :return: precision, recall, and F1 for a query 107 | """ 108 | tp = 0 # correct 109 | fn = 0 # missed 110 | fp = 0 # incorrectly returned 111 | 112 | # ----- Query has at least an interpretation set. ----- 113 | # Iterate over qrels to calculate TP and FN 114 | results_ens = [item[1] for item in query_results] 115 | qrel_ens = [item[1] for item in query_qrels] 116 | for qrel_item in qrel_ens: 117 | if find_item(qrel_item, results_ens): 118 | tp += 1 119 | else: 120 | fn += 1 121 | # Iterate over results to calculate FP 122 | for res_item in results_ens: 123 | if not find_item(res_item, qrel_ens): # Finds the result in the qrels 124 | fp += 1 125 | 126 | stats = {'tp': tp, 'fp': fp, 'fn': fn} 127 | return stats 128 | 129 | 130 | def find_item(item_to_find, items_list): 131 | """ 132 | Returns True if an item is found in the item list. 133 | 134 | :param item_to_find: item to be found 135 | :param items_list: list of items to search in 136 | :return boolean 137 | """ 138 | is_found = False 139 | for item in items_list: 140 | if item == item_to_find: 141 | is_found = True 142 | return is_found 143 | 144 | 145 | def mention_match(mention1, mention2): 146 | """ 147 | Checks if two mentions matches each other. 148 | Matching condition: One of the mentions is sub-string of the other one. 149 | """ 150 | match = ((mention1 in mention2) or (mention2 in mention1)) 151 | return match 152 | 153 | 154 | def parse_file(file_name, res=False): 155 | """ 156 | Parses file and returns the positive instances for each query. 157 | 158 | :param file_name: Name of file to be parsed 159 | :return list of lines [[qid, score, en_id, mention, ...], ...] 
160 | """ 161 | null_lines = [] 162 | file_lines = [] 163 | infile = open(file_name, "r") 164 | for line in infile.readlines(): 165 | if line.strip() == "": 166 | continue 167 | cols = line.strip().split("\t") 168 | if (not res) and (cols[2].strip() == "*NONE*"): 169 | null_lines.append(cols) 170 | else: 171 | file_lines.append(cols) 172 | return file_lines, null_lines 173 | 174 | 175 | def main(args): 176 | if len(args) < 2: 177 | print "\tUsage: " 178 | exit(0) 179 | print "parsing qrel ..." 180 | qrels, null_qrels = parse_file(args[0]) # here qrel does not contain null entities 181 | print "parsing results ..." 182 | results = parse_file(args[1])[0] 183 | print "evaluating ..." 184 | score_th = 0 if len(args) == 2 else float(args[2]) 185 | evaluator = EvaluatorTopics(qrels, results, null_qrels=null_qrels, score_th=score_th) 186 | evaluator.eval(erd_eval_query) 187 | 188 | if __name__ == '__main__': 189 | main(sys.argv[1:]) 190 | -------------------------------------------------------------------------------- /scripts/to_elq.py: -------------------------------------------------------------------------------- 1 | """ 2 | Converts the results to ELQ format: 3 | - Filters general concept entities (keeps only proper noun entities) 4 | - Creates ELQ format file 5 | 6 | @author: Faegheh Hasibi (faegheh.hasibi@idi.ntnu.no) 7 | """ 8 | 9 | import sys 10 | from collections import defaultdict 11 | from nordlys.config import DATA_DIR 12 | from nordlys.wikipedia.utils import WikipediaUtils 13 | 14 | 15 | def load_kb(): 16 | """Loads Freebase snapshot of proper noun entities.""" 17 | print "Loading knowledge base snapshot ..." 18 | __fb_dbp_file = open(DATA_DIR + "/fb_dbp_snapshot.txt", "r") 19 | global KB_SNP_DBP 20 | for line in __fb_dbp_file: 21 | cols = line.strip().split("\t") 22 | KB_SNP_DBP.add(cols[1]) 23 | __fb_dbp_file.close() 24 | 25 | KB_SNP_DBP = set() 26 | 27 | 28 | def read_file(input_file, score_th): 29 | lines = [] 30 | with open(input_file, "r") as input: 31 | for line in input: 32 | cols = line.strip().split("\t") 33 | if float(cols[1]) < score_th: 34 | continue 35 | lines.append(cols) 36 | return lines 37 | 38 | 39 | def filter_general_ens(lines): 40 | """Returns tab-separated lines: qid score en men fb_id""" 41 | filtered_annots = [] 42 | for line in lines: 43 | dbp_uri = WikipediaUtils.wiki_uri_to_dbp_uri(line[2]) 44 | if dbp_uri in KB_SNP_DBP: # check fb is in the KB snapshot 45 | filtered_annots.append(line) 46 | return filtered_annots 47 | 48 | 49 | def to_inter_sets(lines): 50 | """Groups linked entities and interpretation set.""" 51 | group_by_qid = defaultdict(set) 52 | for cols in lines: 53 | group_by_qid[cols[0]].add(cols[2]) 54 | return group_by_qid 55 | 56 | 57 | def main(args): 58 | if len(args) < 2: 59 | print "USAGE: " 60 | exit(0) 61 | load_kb() 62 | lines = read_file(args[0], float(args[1])) 63 | filtered_annots = filter_general_ens(lines) 64 | inter_sets = to_inter_sets(filtered_annots) 65 | 66 | out_str = "" 67 | for qid in sorted(inter_sets.keys()): 68 | ens = inter_sets[qid] 69 | out_str += qid + "\t1\t" + "\t".join(ens) + "\n" 70 | out_file = args[0][:args[0].rfind(".")] + "_" + str(args[1]) + ".elq" 71 | open(out_file, "w").write(out_str) 72 | print "Output file:", out_file 73 | 74 | 75 | if __name__ == "__main__": 76 | main(sys.argv[1:]) -------------------------------------------------------------------------------- /setup.md: -------------------------------------------------------------------------------- 1 | # Setup In order to set up and run our 
implementation of TAGME, you need to install [PyLucene](https://lucene.apache.org/pylucene/) and the packages listed in the ``requirements.txt`` file. Once the required packages are installed, you need to have the resources required for running the code, which are: (i) a surface form dictionary and (ii) indices for a Wikipedia dump. You can directly ask the authors of [1] to provide you with these resources or build them using the following steps: 1. Downloading a Wikipedia dump 2. Preprocessing the dump 3. Building indices 4. Building a surface form dictionary 5. Setting the config file Below we describe each of these steps. ## 1. Downloading a Wikipedia dump Our TAGME implementation is built from a Wikipedia dump, i.e., an ``enwiki-YYYYMMDD-pages-articles.xml.bz2`` file that can be downloaded from [here](http://dumps.wikimedia.org/enwiki/). For the experiments in [1], we used the dumps from *20100408* and *20120502*, which are available upon request. ## 2. Preprocessing the dump The [Wikipedia Extractor](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) tool is used for preprocessing the Wikipedia dump. The version of the extractor used for our experiments is available under `lib/wikiextractor-master`. The following command is executed to pre-process the dumps. Note that the `-l` option is necessary, as it preserves the links. ``` python tagme-rep/lib/wikiextractor-master/WikiExtractor.py -o path/to/output/folder -l path/to/enwiki-YYYYMMDD-pages-articles.xml.bz2 ``` We assume that the resulting files are stored under the `preprocessed-YYYYMMDD` folder. ## 3. Building indices Two types of indices are built from the Wikipedia dumps: - **YYYYMMDD-index**: Index of Wikipedia articles (with resolved URIs). - **YYYYMMDD-index-annot**: Index containing only Wikipedia annotations. This index is used to compute relatedness between entities. Run the following commands to build these indices: - ``python -m nordlys.wikipedia.indexer -i preprocessed-YYYYMMDD/ -o YYYYMMDD-index/`` - ``python -m nordlys.wikipedia.indexer -a -i preprocessed-YYYYMMDD/ -o YYYYMMDD-index-annot/`` We note that the following pages are excluded from the indices: - **List pages**: Wikipedia URIs starting with ". ``` [1] F. Hasibi, K. Balog, and S.E. Bratsberg. “On the reproducibility of the TAGME Entity Linking System”, In proceedings of 38th European Conference on Information Retrieval (ECIR ’16), Padova, Italy, March 2016. ``` --------------------------------------------------------------------------------
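
A note on the result files consumed by the evaluation scripts above. The TAGME API wrapper (`nordlys.tagme.tagme_api`) writes one tab-separated line per annotation (qid, rho score, Wikipedia URI, mention, page id, start, end), and the evaluators keep only the leading columns and drop annotations whose score falls below the given threshold. The snippet below is a minimal, illustrative sketch of that filtering step; it is not part of the repository, and the sample lines, URIs, and scores are invented.

```python
from __future__ import print_function
from collections import defaultdict

# Hypothetical result lines in the format written by the API wrapper:
# qid <TAB> score <TAB> wiki-uri <TAB> mention <TAB> page-id
SAMPLE_RESULTS = [
    "q1\t0.35\t<wikipedia:Barack_Obama>\tobama\t534366",
    "q1\t0.05\t<wikipedia:Political_party>\tparty\t23040",
]


def group_by_query(lines, score_th):
    """Keeps (mention, entity) pairs per query, dropping low-confidence annotations
    (mirrors how the evaluator scripts apply the score threshold)."""
    grouped = defaultdict(set)
    for line in lines:
        cols = line.strip().split("\t")
        if len(cols) > 2 and float(cols[1]) >= score_th:
            grouped[cols[0]].add((cols[3].lower(), cols[2].lower()))
    return grouped


if __name__ == "__main__":
    # Only the first annotation survives the 0.2 threshold
    print(group_by_query(SAMPLE_RESULTS, score_th=0.2))
```

This matches the note at the top of `run_scripts.sh`: the linkers are run with a threshold of 0, and the filtering by a specific threshold happens later, inside the evaluation scripts.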
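
Similarly, the surface form dictionary built in step 4 (`nordlys.wikipedia.merge_sf`) stores, for each surface form, the entities it refers to grouped by source (anchor, title, title name variant, redirect), with anchor counts attached. The snippet below only illustrates the entry layout suggested by `Merger.merge_all`; the example surface form, counts, and URI style are invented for illustration.

```python
from __future__ import print_function
import json

# Invented example of one entry in sf_dict_mongo.json, following the layout
# produced by Merger.merge_all: {"_id": surface_form, source: {entity: count}}
entry = {
    "_id": "new york",
    "anchor": {"<wikipedia:New_York_City>": 12345, "<wikipedia:New_York>": 6789},
    "redirect": {"<wikipedia:New_York>": 1},
    "title": {"<wikipedia:New_York>": 1},
}

print(json.dumps(entry, indent=4, sort_keys=True))
```

Entries of this shape are what the `mongoimport` command mentioned in the `merge_sf` docstring loads into the `surfaceforms_wiki_YYYYMMDD` collection.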