├── LICENSE
├── README.md
└── src
    └── com
        └── lc
            └── nlp
                ├── keyword
                │   ├── algorithm
                │   │   ├── TFIDF.java
                │   │   ├── TextRank.java
                │   │   ├── TextRankWithMultiWin.java
                │   │   └── TextRankWithTFIDF.java
                │   └── evaluate
                │       └── F1Score.java
                └── parsedoc
                    ├── ParseXML.java
                    ├── ReadDir.java
                    └── ReadFile.java

/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2016-2017 GitHub Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Keyword Extraction in Java

Implementation of several algorithms for keyword extraction, including TextRank, TF-IDF and TextRank combined with TF-IDF. Word segmentation and stop-word filtering rely on [HanLP](https://github.com/hankcs/HanLP).

The repository mainly consists of three parts:

**1. Algorithm**: implementations of several keyword extraction algorithms, including TextRank, TF-IDF and combinations of TextRank and TF-IDF

**2. Evaluate**: methods to evaluate the results of the algorithms; currently only the F1 score is available

**3. Parse Documents**: methods to read the contents of the corpus used for testing

More details can be found in [this post](http://wulc.me/2016/05/28/%E5%85%B3%E9%94%AE%E8%AF%8D%E6%8A%BD%E5%8F%96%E7%AE%97%E6%B3%95%E7%9A%84%E7%A0%94%E7%A9%B6/).
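All of the algorithms share the same preprocessing step: HanLP segments the text and its core stop-word dictionary filters the tokens. A minimal sketch of that shared step (assuming HanLP 1.x is on the classpath; the class name `SegmentDemo` is just for illustration):

```java
import java.util.List;
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary;
import com.hankcs.hanlp.seg.common.Term;

public class SegmentDemo
{
    public static void main(String[] args)
    {
        List<Term> terms = HanLP.segment("关键词自动提取是一种自动化技术");
        for (Term t : terms)
            if (CoreStopWordDictionary.shouldInclude(t)) // drop stop words
                System.out.println(t.word);
    }
}
```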
## 1. Algorithm

### 1.1 TextRank

Source File: `TextRank.java`

With the title and content of a document as input, it returns 5 keywords of the document. For example:

```java
String title = "关键词抽取";
String content = "关键词自动提取是一种识别有意义且具有代表性片段或词汇的自动化技术。关键词自动提取在文本挖掘域被称为关键词抽取,在计算语言学领域通常着眼于术语自动识别,在信息检索领域,就是指自动标引。";
System.out.println(TextRank.getKeyword(title, content));

// Output: [自动, 领域, 关键词, 提取, 抽取]
```

You can change the number of keywords and the size of the co-occurrence window, whose default values are 5 and 3 respectively. For example:

```java
TextRank.setKeywordNumber(6);
TextRank.setWindowSize(4);
String title = "关键词抽取";
String content = "关键词自动提取是一种识别有意义且具有代表性片段或词汇的自动化技术。关键词自动提取在文本挖掘域被称为关键词抽取,在计算语言学领域通常着眼于术语自动识别,在信息检索领域,就是指自动标引。";
System.out.println(TextRank.getKeyword(title, content));

// Output: [自动, 关键词, 领域, 提取, 抽取, 自动识别]
```

From the output you can see that the number of keywords has changed due to `TextRank.setKeywordNumber(6);`. The window size is not directly visible in the result, but it changes the word graph that TextRank builds and therefore affects the final ranking.


### 1.2 TF-IDF

Source File: `TFIDF.java`

The TF-IDF algorithm extracts the keywords of a corpus, that is, it extracts keywords for multiple documents at the same time. For example:

```java
String dir = "G:/corpusMini";
Map<String, List<String>> result = TFIDF.getKeywords(dir);
System.out.println(result);
```

The output looks like this (supposing there are only 3 documents in the directory):

```
{
    G:/corpusMini/00001.xml=[曹奔, 周某, 民警, 摩托车, 禁区],
    G:/corpusMini/00002.xml=[翁刚, 吴红娟, 婆婆, 分割, 丈夫],
    G:/corpusMini/00003.xml=[止痛片, 伤肾, 医生, 肾衰竭, 胡婆婆]
}
```

The default number of keywords to extract is 5; you can change it yourself, for example:

```java
String dir = "G:/corpusMini";
TFIDF.setKeywordsNumber(3);
Map<String, List<String>> result = TFIDF.getKeywords(dir);
System.out.println(result);
```

and the output will be like this:

```
{
    G:/corpusMini/00001.xml=[曹奔, 周某, 民警],
    G:/corpusMini/00002.xml=[翁刚, 吴红娟, 婆婆],
    G:/corpusMini/00003.xml=[止痛片, 伤肾, 医生]
}
```

**To be able to run the above code, remember to specify how to read the content of the files under the directory in `ReadFile.java`; part `3.2 ReadFile` below explains the details.**
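For reference, the scores computed in `TFIDF.java` are the classic unsmoothed variants, where `N` is the number of documents in the corpus and `df(w)` is the number of documents containing `w`:

```
TF(w, d)     = count(w, d) / |d|
IDF(w)       = log(N / df(w))
TFIDF(w, d)  = TF(w, d) * IDF(w)
```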

### 1.3 TextRank With Multiple Windows

Source File: `TextRankWithMultiWin.java`

`TextRank With Multiple Windows` builds on the classical TextRank algorithm.

As we know, the co-occurrence window is an uncertain parameter in TextRank, and there is currently no way to find the best window size for a document. Therefore this algorithm integrates the results generated by several window sizes by summing up the scores they assign to each word.

Sample code:

```java
String title = "关键词抽取";
String content = "关键词自动提取是一种识别有意义且具有代表性片段或词汇的自动化技术。关键词自动提取在文本挖掘域被称为关键词抽取,在计算语言学领域通常着眼于术语自动识别,在信息检索领域,就是指自动标引。";
int miniWindow = 3, maxWindow = 5;
List<String> keywords = TextRankWithMultiWin.integrateMultiWindow(title, content, miniWindow, maxWindow);
System.out.println(keywords);

// Output: [自动, 关键词, 领域, 提取, 抽取]
```

The above code combines the results generated by co-occurrence windows of size 3 to 5; notice that the result differs slightly from the original TextRank algorithm in 1.1.

Also, you can set the number of keywords to extract with `TextRankWithMultiWin.setKeywordNumber(3);`


### 1.4 TextRank With TF-IDF

This part includes two algorithms, `TextRank score multiplied by IDF` and `TextRank and TF-IDF vote together`, which are both based on the original TextRank and TF-IDF.

**To be able to run these two algorithms, remember to specify how to read the content of the files under the directory in `ReadFile.java`; part `3.2 ReadFile` below explains the details.**

#### 1.4.1 TextRank score multiplied by IDF

Source File: `TextRankWithTFIDF.java`

Starting from the scores generated by the original TextRank algorithm, this algorithm multiplies each word's score by its IDF value. In this way it takes the weight of a word within the whole corpus into account. **Therefore, this is an algorithm for a corpus (that is, it can't be used to extract keywords from a single document).**

```java
String dir = "G:/corpusMini";
//TextRankWithTFIDF.setKeywordsNumber(3); // set the number of keywords to extract, default 5
Map<String, List<String>> result = TextRankWithTFIDF.textRankMultiplyIDF(dir);
System.out.println(result);
```

The output looks like this (supposing there are only 3 documents in the directory):

```
{
    G:/corpusMini/00001.xml=[曹奔, 周某, 民警, 摩托车, 禁区],
    G:/corpusMini/00002.xml=[吴红娟, 翁刚, 婆婆, 丈夫, 分割],
    G:/corpusMini/00003.xml=[伤肾, 止痛片, 胡婆婆, 肾内科, 肾衰竭]
}
```

Also, you can change the number of keywords to extract as the commented line in the code above does.

#### 1.4.2 TextRank and TF-IDF vote together

Source File: `TextRankWithTFIDF.java`

This algorithm is based on both TextRank and TF-IDF.

In order to extract n keywords from a document, it first extracts 2*n candidate keywords with TextRank and TF-IDF respectively, then selects the words that occur in both candidate lists as final keywords; if that does not yield enough keywords, the rest are taken from the result generated by TF-IDF. **Therefore, this is also an algorithm for a corpus.**

Sample code:

```java
String dir = "G:/corpusMini";
//TextRankWithTFIDF.setKeywordsNumber(3); // set the number of keywords to extract, default 5
Map<String, List<String>> result = TextRankWithTFIDF.textRankTFIDFVote(dir);
System.out.println(result);
```

The output looks like this:

```
{
    G:/corpusMini/00007.xml=[曹奔, 周某, 民警, 摩托车, 困难],
    G:/corpusMini/00002.xml=[翁刚, 吴红娟, 婆婆, 丈夫, 法院],
    G:/corpusMini/00018.xml=[止痛片, 伤肾, 肾衰竭, 胡婆婆, 肾内科]
}
```

Also, you can change the number of keywords to extract as the commented line in the code above does.
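The voting step itself is simple; here is a minimal self-contained sketch (the class and method names are hypothetical, but the selection logic mirrors `textRankTFIDFVote`):

```java
import java.util.ArrayList;
import java.util.List;

class VoteSketch
{
    // pick n final keywords from the two candidate lists (2n entries each)
    static List<String> vote(List<String> trCand, List<String> tfidfCand, int n)
    {
        List<String> keywords = new ArrayList<String>();
        for (String w : tfidfCand)        // 1. words both rankings agree on
            if (trCand.contains(w) && keywords.size() < n)
                keywords.add(w);
        for (String w : tfidfCand)        // 2. top up from TF-IDF's ranking
            if (!keywords.contains(w) && keywords.size() < n)
                keywords.add(w);
        return keywords;
    }
}
```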

## 2. Evaluate

The class `F1Score` uses the F1 score to evaluate the keywords extracted by an algorithm. Pass the keywords extracted by the algorithm and the manually assigned keywords as input. Sample code:

```java
String title = "关键词抽取";
String content = "关键词自动提取是一种识别有意义且具有代表性片段或词汇的自动化技术。关键词自动提取在文本挖掘域被称为关键词抽取,在计算语言学领域通常着眼于术语自动识别,在信息检索领域,就是指自动标引。";
List<String> sysKeywords = TextRank.getKeyword(title, content);
String[] manualKeywords = {"关键词", "自动提取"};
List<Float> result = F1Score.calculate(sysKeywords, manualKeywords);
System.out.println(result);

// Output: [20.0, 50.0, 28.57]
```

The three numbers are precision = 20%, recall = 50% and F1 = 28.57%: with 1 hit out of 5 extracted keywords and 2 manual keywords, precision = hit/5, recall = hit/2, and F1 = 2*P*R/(P+R) is their harmonic mean.


## 3. Parse Documents

### 3.1 ReadDir

The `ReadDir` class provides a method to find the paths of all files under a certain directory, including its sub-directories. For example:

```java
String dirPath = "G:/corpusMini";
List<String> fileList = ReadDir.readDirFileNames(dirPath);
for(String file : fileList)
    System.out.println(file);
```

and the output looks like this:

```
G:/corpusMini/00001.xml
G:/corpusMini/00002.xml
G:/corpusMini/test/00003.xml
```

As you can see, the method also reads the files of the subdirectory `test`. Because sub-directory paths are built by joining with `/`, remember not to use `/` as the last character of `dirPath`.

### 3.2 ReadFile

The `ReadFile` class is designed to load the content of files of a certain type. **Remember that you have to implement the `loadFile` method according to the type of your files.** The default implementation parses the XML files [here](https://github.com/iamxiatian/data/tree/master/sohu-dataset), and the code is like this:

```java
// remember to replace the following code according to the type of your files
/*
String filePath = "G:/corpusMini/00001.xml";
ParseXML parser = new ParseXML();
String content = parser.parseXML(filePath, "content");
*/
```
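For the default sohu-dataset case, a minimal `loadFile` body inside `ReadFile.java` could look like this — a sketch that simply delegates to the repository's `ParseXML`, assuming each document stores its text under a second-level `content` tag:

```java
public static String loadFile(String filePath)
{
    // default: treat each file as a sohu-corpus XML document and
    // return the text under its second-level <content> tag
    ParseXML parser = new ParseXML();
    return parser.parseXML(filePath, "content");
}
```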

--------------------------------------------------------------------------------
/src/com/lc/nlp/keyword/algorithm/TFIDF.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Date: 2016-05-22 17:46:15
 * Last modified by: WuLC
 * Last Modified time: 2016-05-23 23:31:25
 * Email: liangchaowu5@gmail.com
 ************************************************************
 * Function: get keywords of files through the TF-IDF algorithm
 * Input: path of the directory of the files that need keyword extraction
 * Output: keywords of each file
 */

package com.lc.nlp.keyword.algorithm;

import java.util.*;
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary;
import com.hankcs.hanlp.seg.common.Term;

import com.lc.nlp.parsedoc.*;


public class TFIDF
{

    private static int keywordsNumber = 5;

    /**
     * change the number of keywords, default 5
     * @param keywordNum(int): number of keywords that need to be extracted
     */
    public static void setKeywordsNumber(int keywordNum)
    {
        keywordsNumber = keywordNum;
    }

    /**
     * calculate the TF value of each word in terms of the content of a file
     * @param fileContent(String): content of the file
     * @return(HashMap<String, Float>): "word:TF value" pairs
     */
    public static HashMap<String, Float> getTF(String fileContent)
    {
        List<Term> terms = new ArrayList<Term>();
        ArrayList<String> words = new ArrayList<String>();

        terms = HanLP.segment(fileContent);
        for(Term t : terms)
        {
            if(TFIDF.shouldInclude(t))
            {
                words.add(t.word);
            }
        }

        // count the raw occurrences of each word
        HashMap<String, Integer> wordCount = new HashMap<String, Integer>();
        HashMap<String, Float> TFValues = new HashMap<String, Float>();
        for(String word : words)
        {
            if(wordCount.get(word) == null)
            {
                wordCount.put(word, 1);
            }
            else
            {
                wordCount.put(word, wordCount.get(word) + 1);
            }
        }

        // TF(w) = count(w) / number of words in the document
        int wordLen = words.size();
        Iterator<Map.Entry<String, Integer>> iter = wordCount.entrySet().iterator();
        while(iter.hasNext())
        {
            Map.Entry<String, Integer> entry = iter.next();
            TFValues.put(entry.getKey(), entry.getValue() / (float) wordLen);
        }
        return TFValues;
    }


    /**
     * judge whether a word belongs to the stop words
     * @param term(Term): word to be judged
     * @return(boolean): false if the word is a stop word, true otherwise
     */
    public static boolean shouldInclude(Term term)
    {
        return CoreStopWordDictionary.shouldInclude(term);
    }


    /**
     * calculate the TF value of each word of each file under a directory
     * @param dirPath(String): path of the directory
     * @return(HashMap<String, HashMap<String, Float>>): path of each file and its "word:TF value" pairs
     */
    public static HashMap<String, HashMap<String, Float>> tfForDir(String dirPath)
    {
        HashMap<String, HashMap<String, Float>> allTF = new HashMap<String, HashMap<String, Float>>();
        List<String> filelist = ReadDir.readDirFileNames(dirPath);

        for(String file : filelist)
        {
            String content = ReadFile.loadFile(file); // remember to adapt ReadFile.loadFile to your file type
            HashMap<String, Float> dict = TFIDF.getTF(content);
            allTF.put(file, dict);
        }
        return allTF;
    }


    /**
     * calculate the IDF value of each word under a directory
     * @param dirPath(String): path of the directory
     * @return(HashMap<String, Float>): "word:IDF value" pairs
     */
    public static HashMap<String, Float> idfForDir(String dirPath)
    {
        List<String> fileList = ReadDir.readDirFileNames(dirPath);
        int docNum = fileList.size();

        Map<String, Set<String>> passageWords = new HashMap<String, Set<String>>();
        // collect the distinct words of each file
        for(String filePath : fileList)
        {
            Set<String> words = new HashSet<String>();
            String content = ReadFile.loadFile(filePath); // remember to adapt ReadFile.loadFile to your file type
            List<Term> terms = HanLP.segment(content);
            for(Term t : terms)
            {
                if(TFIDF.shouldInclude(t))
                {
                    words.add(t.word);
                }
            }
            passageWords.put(filePath, words);
        }

        // count the number of documents containing each word (document frequency)
        HashMap<String, Integer> wordPassageNum = new HashMap<String, Integer>();
        for(String filePath : fileList)
        {
            Set<String> wordSet = passageWords.get(filePath);
            for(String word : wordSet)
            {
                if(wordPassageNum.get(word) == null)
                    wordPassageNum.put(word, 1);
                else
                    wordPassageNum.put(word, wordPassageNum.get(word) + 1);
            }
        }
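
        // turn document frequencies into IDF values using the classic unsmoothed
        // formula idf(w) = log(N / df(w)); every word seen here has df(w) >= 1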
        HashMap<String, Float> wordIDF = new HashMap<String, Float>();
        Iterator<Map.Entry<String, Integer>> iter_dict = wordPassageNum.entrySet().iterator();
        while(iter_dict.hasNext())
        {
            Map.Entry<String, Integer> entry = iter_dict.next();
            float value = (float) Math.log(docNum / (float) entry.getValue());
            wordIDF.put(entry.getKey(), value);
        }
        return wordIDF;
    }


    /**
     * calculate the TF-IDF value of each word of each file under a directory
     * @param dirPath(String): path of the directory
     * @return(Map<String, HashMap<String, Float>>): path of each file and its "word:TF-IDF value" pairs
     */
    public static Map<String, HashMap<String, Float>> getDirTFIDF(String dirPath)
    {
        HashMap<String, HashMap<String, Float>> dirFilesTF = TFIDF.tfForDir(dirPath);
        HashMap<String, Float> dirFilesIDF = TFIDF.idfForDir(dirPath);

        // TFIDF(w, d) = TF(w, d) * IDF(w)
        Map<String, HashMap<String, Float>> dirFilesTFIDF = new HashMap<String, HashMap<String, Float>>();
        List<String> fileList = ReadDir.readDirFileNames(dirPath);
        for (String filePath : fileList)
        {
            HashMap<String, Float> temp = new HashMap<String, Float>();
            Map<String, Float> singlePassageWord = dirFilesTF.get(filePath);
            Iterator<Map.Entry<String, Float>> it = singlePassageWord.entrySet().iterator();
            while(it.hasNext())
            {
                Map.Entry<String, Float> entry = it.next();
                String word = entry.getKey();
                Float tfidf = entry.getValue() * dirFilesIDF.get(word);
                temp.put(word, tfidf);
            }
            dirFilesTFIDF.put(filePath, temp);
        }
        return dirFilesTFIDF;
    }


    /**
     * get the keywords of each file under a certain directory
     * @param dirPath(String): path of the directory
     * @return(Map<String, List<String>>): path of each file and its keywords
     */
    public static Map<String, List<String>> getKeywords(String dirPath)
    {
        List<String> fileList = ReadDir.readDirFileNames(dirPath);

        // calculate the TF-IDF value of each word of each file under dirPath
        Map<String, HashMap<String, Float>> dirTFIDF = TFIDF.getDirTFIDF(dirPath);

        Map<String, List<String>> keywordsForDir = new HashMap<String, List<String>>();
        for (String file : fileList)
        {
            Map<String, Float> singlePassageTFIDF = dirTFIDF.get(file);

            // sort the words by TF-IDF value in descending order
            List<Map.Entry<String, Float>> entryList = new ArrayList<Map.Entry<String, Float>>(singlePassageTFIDF.entrySet());
            Collections.sort(entryList, new Comparator<Map.Entry<String, Float>>()
            {
                @Override
                public int compare(Map.Entry<String, Float> c1, Map.Entry<String, Float> c2)
                {
                    return c2.getValue().compareTo(c1.getValue());
                }
            });

            // take the top keywordsNumber words as keywords
            List<String> systemKeywordList = new ArrayList<String>();
            for(int k = 0; k < keywordsNumber && k < entryList.size(); k++)
                systemKeywordList.add(entryList.get(k).getKey());
            keywordsForDir.put(file, systemKeywordList);
        }
        return keywordsForDir;
    }
}

--------------------------------------------------------------------------------
/src/com/lc/nlp/keyword/algorithm/TextRank.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Email: liangchaowu5@gmail.com
 ************************************************************
 * Function: extract the keywords of a document through the TextRank algorithm
 * Input: title and content of a document
 * Output(List<String>): keywords of the text
 */
package com.lc.nlp.keyword.algorithm;

import java.util.*;
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary;
import com.hankcs.hanlp.seg.common.Term;


public class TextRank
{
    static final float d = 0.85f;              // damping factor, default 0.85
    static final int max_iter = 200;           // maximum number of iterations
    static final float min_diff = 0.0001f;     // convergence threshold
    private static int nKeyword = 5;           // number of keywords to extract, default 5
    private static int coOccuranceWindow = 3;  // size of the co-occurrence window, default 3

    // change default parameters
    public static void setKeywordNumber(int sysKeywordNum)
    {
        nKeyword = sysKeywordNum;
    }


    public static void setWindowSize(int window)
    {
        coOccuranceWindow = window;
    }


    /**
     * extract keywords in terms of the title and content of a document
     * @param title(String): title of the document
     * @param content(String): content of the document
     * @return (List<String>): list of keywords
     */
    public static List<String> getKeyword(String title, String content)
    {
        Map<String, Float> score = TextRank.getWordScore(title, content);

        // rank the words by their score
        List<Map.Entry<String, Float>> entryList = new ArrayList<Map.Entry<String, Float>>(score.entrySet());
        Collections.sort(entryList, new Comparator<Map.Entry<String, Float>>()
        {
            @Override
            public int compare(Map.Entry<String, Float> o1, Map.Entry<String, Float> o2)
            {
                return (o1.getValue() - o2.getValue() > 0 ? -1 : 1);
            }
        });

        // take the top nKeyword words, guarding against short texts
        List<String> sysKeywordList = new ArrayList<String>();
        for (int i = 0; i < nKeyword && i < entryList.size(); ++i)
        {
            sysKeywordList.add(entryList.get(i).getKey());
        }
        return sysKeywordList;
    }


    /**
     * judge whether a word belongs to the stop words
     * @param term(Term): word to be judged
     * @return(boolean): false if the word is a stop word, true otherwise
     */
    public static boolean shouldInclude(Term term)
    {
        return CoreStopWordDictionary.shouldInclude(term);
    }


    /**
     * return the score of each word computed by the TextRank algorithm
     * @param title(String): title of the document
     * @param content(String): content of the document
     * @return (Map<String, Float>): score of each word
     */
    public static Map<String, Float> getWordScore(String title, String content)
    {
        // segment the text into words
        List<Term> termList = HanLP.segment(title + content);

        int count = 1; // position of each word
        Map<String, Integer> wordPosition = new HashMap<String, Integer>();
        List<String> wordList = new ArrayList<String>();

        // filter stop words
        for (Term t : termList)
        {
            if (shouldInclude(t))
            {
                wordList.add(t.word);
                if(!wordPosition.containsKey(t.word))
                {
                    wordPosition.put(t.word, count);
                    count++;
                }
            }
        }

        // build the word graph in terms of the co-occurrence window:
        // each key's set holds the words linked to it within the window
        Map<String, Set<String>> words = new HashMap<String, Set<String>>();
        Queue<String> que = new LinkedList<String>();
        for (String w : wordList)
        {
            if (!words.containsKey(w))
            {
                words.put(w, new HashSet<String>());
            }
            que.offer(w); // push to the end of the queue
            if (que.size() > coOccuranceWindow)
            {
                que.poll(); // pop from the front of the queue
            }

            for (String w1 : que)
            {
                for (String w2 : que)
                {
                    if (w1.equals(w2))
                    {
                        continue;
                    }

                    words.get(w1).add(w2);
                    words.get(w2).add(w1);
                }
            }
        }

        // iterate until convergence
        Map<String, Float> score = new HashMap<String, Float>();
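        // PageRank-style update: S(w) = (1 - d) + d * sum over linked words u of S(u) / deg(u);
        // stop after max_iter rounds, or earlier once the largest per-word change drops below min_diff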
        for (int i = 0; i < max_iter; ++i)
        {
            Map<String, Float> m = new HashMap<String, Float>();
            float max_diff = 0;
            for (Map.Entry<String, Set<String>> entry : words.entrySet())
            {
                String key = entry.getKey();
                Set<String> value = entry.getValue();
                m.put(key, 1 - d);
                for (String other : value)
                {
                    int size = words.get(other).size();
                    if (key.equals(other) || size == 0) continue;
                    m.put(key, m.get(key) + d / size * (score.get(other) == null ? 0 : score.get(other)));
                }

                max_diff = Math.max(max_diff, Math.abs(m.get(key) - (score.get(key) == null ? 1 : score.get(key))));
            }
            score = m;

            // exit once the scores have converged
            if (max_diff <= min_diff)
                break;
        }
        return score;
    }
}

--------------------------------------------------------------------------------
/src/com/lc/nlp/keyword/algorithm/TextRankWithMultiWin.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Date: 2016-05-23 16:04:42
 * Last modified by: WuLC
 * Last Modified time: 2016-05-24 16:00:11
 * Email: liangchaowu5@gmail.com
 ***********************************************************************************
 * Function: integrate the results of the TextRank algorithm under different sizes of the co-occurrence window
 * Input: title and content of a document
 * Output: keywords of the document
 */
package com.lc.nlp.keyword.algorithm;

import java.util.*;

public class TextRankWithMultiWin
{
    private static int keywordNum = 5;

    /**
     * set the number of keywords to extract
     * @param sysKeywordNum(int): number of keywords to extract
     */
    public static void setKeywordNumber(int sysKeywordNum)
    {
        keywordNum = sysKeywordNum;
    }


    /**
     * integrate the results of the TextRank algorithm under different co-occurrence windows
     * @param title(String): title of the document
     * @param content(String): content of the document
     * @param minWindow(int): the minimum size of the co-occurrence window
     * @param maxWindow(int): the maximum size of the co-occurrence window
     * @return(List<String>): keywords of the document
     */
    public static List<String> integrateMultiWindow(String title, String content, int minWindow, int maxWindow)
    {
        Map<String, Float> tempKeywordScore = new HashMap<String, Float>();
        Map<String, Float> allKeywordScore = new HashMap<String, Float>();
        String key = null;
        Float value = null;
        for(int i = minWindow; i <= maxWindow; i++)
        {
            TextRank.setWindowSize(i); // set the size of the co-occurrence window
            tempKeywordScore = TextRank.getWordScore(title, content);
            // sum up the score of each word over all window sizes
            Iterator<Map.Entry<String, Float>> it = tempKeywordScore.entrySet().iterator();
            while (it.hasNext())
            {
                Map.Entry<String, Float> entry = it.next();
                key = entry.getKey();
                value = entry.getValue();
                if(allKeywordScore.containsKey(key))
                    allKeywordScore.put(key, allKeywordScore.get(key) + value);
                else
                    allKeywordScore.put(key, value);
            }
        }
        // sort the words by their accumulated score
        List<Map.Entry<String, Float>> entryList = new ArrayList<Map.Entry<String, Float>>(allKeywordScore.entrySet());
        Collections.sort(entryList, new Comparator<Map.Entry<String, Float>>()
        {
            @Override
            public int compare(Map.Entry<String, Float> c1, Map.Entry<String, Float> c2)
            {
                return c2.getValue().compareTo(c1.getValue());
            }
        });

        List<String> fileKeywords = new ArrayList<String>();
        for(int j = 0; j < keywordNum && j < entryList.size(); j++)
            fileKeywords.add(entryList.get(j).getKey());
        return fileKeywords;
    }
}

--------------------------------------------------------------------------------
/src/com/lc/nlp/keyword/algorithm/TextRankWithTFIDF.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Email: liangchaowu5@gmail.com
 ***********************************************************************************
 * Function: two algorithms combining TextRank and TF-IDF: multiply the TextRank score
 *           of each word by its IDF value, or let TextRank and TF-IDF vote together
 * Input: path of the directory of a corpus
 * Output: keywords of each document of the corpus
 */
package com.lc.nlp.keyword.algorithm;

import java.util.*;
import com.lc.nlp.parsedoc.*;

public class TextRankWithTFIDF
{
    private static int keywordsNumber = 5;                       // number of keywords to extract, default 5
    private static int keywordCandidateNum = 2 * keywordsNumber; // each method contributes 2n candidates for the vote

    /**
     * set the number of keywords to extract
     * @param keywordNum(int): number of keywords to extract
     */
    public static void setKeywordsNumber(int keywordNum)
    {
        keywordsNumber = keywordNum;
        keywordCandidateNum = 2 * keywordNum;
    }


    /**
     * multiply the TextRank score of each word by its IDF value in the corpus
     * @param dirPath(String): path of the directory of the corpus
     * @return(Map<String, List<String>>): keywords of each document of the corpus
     */
    public static Map<String, List<String>> textRankMultiplyIDF(String dirPath)
    {
        Map<String, List<String>> result = new HashMap<String, List<String>>();

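        // idea: re-weight each word's TextRank score by its corpus-level IDF,
        // so that words common across the whole corpus are pushed down the ranking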
        // get the IDF value of each word of the corpus
        Map<String, Float> idfForDir = TFIDF.idfForDir(dirPath);
        List<String> fileList = ReadDir.readDirFileNames(dirPath);
        String content = null;

        for(String file : fileList)
        {
            content = ReadFile.loadFile(file);
            Map<String, Float> trKeywords = TextRank.getWordScore("", content);
            Iterator<Map.Entry<String, Float>> it = trKeywords.entrySet().iterator();
            while(it.hasNext())
            {
                Map.Entry<String, Float> temp = it.next();
                String key = temp.getKey();
                trKeywords.put(key, temp.getValue() * idfForDir.get(key));
            }

            // sort the words by their score in descending order
            List<Map.Entry<String, Float>> entryList = new ArrayList<Map.Entry<String, Float>>(trKeywords.entrySet());
            Collections.sort(entryList,
                new Comparator<Map.Entry<String, Float>>()
                {
                    public int compare(Map.Entry<String, Float> c1, Map.Entry<String, Float> c2)
                    {
                        return c2.getValue().compareTo(c1.getValue());
                    }
                }
            );

            List<String> temp = new ArrayList<String>();
            for (int i = 0; i < keywordsNumber && i < entryList.size(); i++)
                temp.add(entryList.get(i).getKey());
            result.put(file, temp);
        }
        return result;
    }


    /**
     * let TextRank and TF-IDF vote together: words ranked among the candidates of both
     * methods become keywords first, and TF-IDF's ranking fills any remaining slots
     * @param dirPath(String): path of the directory of the corpus
     * @return(Map<String, List<String>>): keywords of each document of the corpus
     */
    public static Map<String, List<String>> textRankTFIDFVote(String dirPath)
    {
        Map<String, List<String>> result = new HashMap<String, List<String>>();
        List<String> fileList = ReadDir.readDirFileNames(dirPath);

        // get 2n candidate keywords per document from TF-IDF and TextRank
        TFIDF.setKeywordsNumber(keywordCandidateNum);
        TextRank.setKeywordNumber(keywordCandidateNum); // per the README, TextRank contributes 2n candidates as well
        Map<String, List<String>> tfidfKeywordsForDir = TFIDF.getKeywords(dirPath);

        List<String> trKeyword = new ArrayList<String>();
        List<String> tfidfKeyword = new ArrayList<String>();
        String content = null;
        for(String file : fileList)
        {
            content = ReadFile.loadFile(file);
            trKeyword = TextRank.getKeyword("", content);
            tfidfKeyword = tfidfKeywordsForDir.get(file);

            // 1. words that appear in both candidate lists become keywords first
            List<String> temp = new ArrayList<String>();
            for(String keyword : tfidfKeyword)
            {
                if (trKeyword.contains(keyword))
                    temp.add(keyword);
                if (temp.size() == keywordsNumber)
                    break;
            }
            // 2. if that is not enough, top up from TF-IDF's ranking
            if (temp.size() < keywordsNumber)
                for(String keyword : tfidfKeyword)
                {
                    if (!temp.contains(keyword))
                        temp.add(keyword);
                    if (temp.size() == keywordsNumber)
                        break;
                }
            result.put(file, temp);
        }
        return result;
    }

}

--------------------------------------------------------------------------------
/src/com/lc/nlp/keyword/evaluate/F1Score.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Date: 2016-05-20 22:56:12
 * Last modified by: WuLC
 * Last Modified time: 2016-05-20 23:01:16
 * Email: liangchaowu5@gmail.com
 ***************************************************************************
 * Function: calculate precision, recall and F1 score in terms of the keywords extracted by the algorithm and manually
 * Input: keywords extracted by the algorithm and manually
 * Output: precision, recall, F1 score
 */

package com.lc.nlp.keyword.evaluate;

import java.util.*;
import java.text.DecimalFormat;

public class F1Score
{
    private static DecimalFormat df = new DecimalFormat("0.00"); // format the output to two decimal places

    /**
     * calculate the precision, recall and F1 score of the keywords extracted by the
     * algorithm against the keywords extracted manually
     * @param systemKeywords(List<String>): keywords extracted by the algorithm
     * @param manualKeywords(String[]): keywords extracted manually
     * @return (List<Float>): precision, recall, F1 score
     */
    public static List<Float> calculate(List<String> systemKeywords, String[] manualKeywords)
    {
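        // precision = hit / |system|, recall = hit / |manual| and F1 = 2PR / (P + R),
        // where hit counts the extracted keywords that also appear in the manual list;
        // all three are scaled to percentages and rounded to two decimals below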
        int sysLen = systemKeywords.size();
        int manLen = manualKeywords.length;
        int hit = 0;
        for(int i = 0; i < sysLen; i++)
        {
            for(int j = 0; j < manLen; j++)
            {
                if(systemKeywords.get(i).equals(manualKeywords[j]))
                {
                    hit++;
                    break;
                }
            }
        }

        float pValue = (float) hit / sysLen * 100;
        float rValue = (float) hit / manLen * 100;
        float fValue = 2 * pValue * rValue / (pValue + rValue);

        List<Float> result = new ArrayList<Float>();
        result.add(Float.parseFloat(df.format(pValue)));
        result.add(Float.parseFloat(df.format(rValue)));
        result.add(Float.parseFloat(df.format(fValue)));
        return result;
    }
}

--------------------------------------------------------------------------------
/src/com/lc/nlp/parsedoc/ParseXML.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Date: 2016-05-19 22:49:42
 * Last modified by: WuLC
 * Last Modified time: 2016-05-19 22:53:31
 * Email: liangchaowu5@gmail.com
 **************************************************************
 * Function: parse an XML file in terms of a certain tag
 * Input: file path of the XML file and a tag
 * Output: content of the tag
 */

package com.lc.nlp.parsedoc;

import java.io.File;
import java.util.Iterator;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

public class ParseXML
{
    /**
     * return the content of a certain tag of an XML file;
     * the tag should be at the second level, or you can modify the code to reach other levels
     * @param fileName(String): file path of the XML file
     * @param tag(String): a second-level tag of the XML file
     * @return(String): content of the tag of the file
     */
    public String parseXML(String fileName, String tag)
    {
        File inputXml = new File(fileName);
        SAXReader saxReader = new SAXReader();
        Document document = null;
        Element rootTag = null, subTag = null;
        boolean hasTag = false;
        try
        {
            document = saxReader.read(inputXml);
            rootTag = document.getRootElement();
            // scan the second-level elements for the wanted tag
            for (Iterator i = rootTag.elementIterator(); i.hasNext();)
            {
                subTag = (Element) i.next();
                if(subTag.getName().equals(tag))
                {
                    hasTag = true;
                    break;
                }
            }
        }
        catch (DocumentException e)
        {
            System.out.println(e.getMessage());
        }
        if(hasTag)
            return subTag.getText();
        else
            return "no such tag in the xml file";
    }
}

--------------------------------------------------------------------------------
/src/com/lc/nlp/parsedoc/ReadDir.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Date: 2016-05-19 22:49:42
 * Last modified by: WuLC
 * Last Modified time: 2016-05-19 22:53:09
 * Email: liangchaowu5@gmail.com
 ************************************************************************************
 * Function: read the paths of all the files under a directory, including its sub-directories
 * Input(String): path of the directory; the last character can't be / because sub-directory paths are joined with /
 * Output(List<String>): paths of all files under the directory
 */

package com.lc.nlp.parsedoc;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ReadDir
{
    /**
     * read the paths of all the files under a directory, including its sub-directories
     * @param dirPath(String): path of the directory; remember the last character can't be /
     * @return(List<String>): paths of all files under the directory
     */
    public static List<String> readDirFileNames(String dirPath)
    {
        if (dirPath == null || dirPath.equals(""))
        {
            System.out.println("The path of the directory can't be empty");
            System.exit(0);
        }
        else if(dirPath.substring(dirPath.length() - 1).equals("/"))
        {
            System.out.println("The last character of the path of the directory can't be /");
            System.exit(0);
        }

        File dirFile = new File(dirPath);
        String[] fileNameList = null;
        String tmp = null;
        List<String> fileList = new ArrayList<String>();
        List<String> subFileList = new ArrayList<String>();

        fileNameList = dirFile.list();
        for(int i = 0; i < fileNameList.length; i++)
        {
            tmp = dirPath + "/" + fileNameList[i]; // this join is why dirPath must not end with /
            File f = new File(tmp);
            if (f.isDirectory())
            {
                subFileList = readDirFileNames(tmp); // recurse into the sub-directory
                fileList.addAll(subFileList);
            }
            else
            {
                fileList.add(tmp);
            }
        }
        return fileList;
    }
}