├── LICENSE
├── README.md
└── src
    └── com
        └── lc
            └── nlp
                ├── keyword
                │   ├── algorithm
                │   │   ├── TFIDF.java
                │   │   ├── TextRank.java
                │   │   ├── TextRankWithMultiWin.java
                │   │   └── TextRankWithTFIDF.java
                │   └── evaluate
                │       └── F1Score.java
                └── parsedoc
                    ├── ParseXML.java
                    ├── ReadDir.java
                    └── ReadFile.java

/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2016-2017 GitHub Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Keyword Extraction in Java

Implementation of several algorithms for keyword extraction, including TextRank, TF-IDF and TextRank combined with TF-IDF. Word segmentation and stop-word filtering rely on [HanLP](https://github.com/hankcs/HanLP).

The repository mainly consists of three parts:

**1. Algorithm**: implementations of several keyword extraction algorithms, including TextRank, TF-IDF and combinations of TextRank and TF-IDF

**2. Evaluate**: methods to evaluate the results of the algorithms; currently only the F1 score is available

**3. Parse Documents**: methods to read the contents of the corpus used for testing

More details can be found in [this post](http://wulc.me/2016/05/28/%E5%85%B3%E9%94%AE%E8%AF%8D%E6%8A%BD%E5%8F%96%E7%AE%97%E6%B3%95%E7%9A%84%E7%A0%94%E7%A9%B6/).
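All of the algorithms share the same preprocessing step: HanLP segments the text and its core stop-word dictionary filters the tokens. A minimal sketch of that shared step (assuming HanLP 1.x is on the classpath; the class name `SegmentDemo` is just for illustration):

```java
import java.util.List;
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary;
import com.hankcs.hanlp.seg.common.Term;

public class SegmentDemo
{
    public static void main(String[] args)
    {
        List<Term> terms = HanLP.segment("关键词自动提取是一种自动化技术");
        for (Term t : terms)
            if (CoreStopWordDictionary.shouldInclude(t)) // drop stop words
                System.out.println(t.word);
    }
}
```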
## 1. Algorithm

### 1.1 TextRank

Source File: `TextRank.java`

With the title and content of a document as input, it returns 5 keywords of the document. For example:

```java
String title = "关键词抽取";
String content = "关键词自动提取是一种识别有意义且具有代表性片段或词汇的自动化技术。关键词自动提取在文本挖掘域被称为关键词抽取,在计算语言学领域通常着眼于术语自动识别,在信息检索领域,就是指自动标引。";
System.out.println(TextRank.getKeyword(title, content));

// Output: [自动, 领域, 关键词, 提取, 抽取]
```

You can change the number of keywords and the size of the co-occurrence window, whose default values are 5 and 3 respectively. For example:

```java
TextRank.setKeywordNumber(6);
TextRank.setWindowSize(4);
String title = "关键词抽取";
String content = "关键词自动提取是一种识别有意义且具有代表性片段或词汇的自动化技术。关键词自动提取在文本挖掘域被称为关键词抽取,在计算语言学领域通常着眼于术语自动识别,在信息检索领域,就是指自动标引。";
System.out.println(TextRank.getKeyword(title, content));

// Output: [自动, 关键词, 领域, 提取, 抽取, 自动识别]
```

From the output you can see that the number of keywords has changed due to `TextRank.setKeywordNumber(6);`. The window size is not directly visible in the result, but it changes the word graph that TextRank builds and therefore affects the final ranking.


### 1.2 TF-IDF

Source File: `TFIDF.java`

The TF-IDF algorithm extracts the keywords of a corpus, that is, it extracts keywords for multiple documents at the same time. For example:

```java
String dir = "G:/corpusMini";
Map<String, List<String>> result = TFIDF.getKeywords(dir);
System.out.println(result);
```

The output looks like this (supposing there are only 3 documents in the directory):

```
{
    G:/corpusMini/00001.xml=[曹奔, 周某, 民警, 摩托车, 禁区],
    G:/corpusMini/00002.xml=[翁刚, 吴红娟, 婆婆, 分割, 丈夫],
    G:/corpusMini/00003.xml=[止痛片, 伤肾, 医生, 肾衰竭, 胡婆婆]
}
```

The default number of keywords to extract is 5; you can change it yourself, for example:

```java
String dir = "G:/corpusMini";
TFIDF.setKeywordsNumber(3);
Map<String, List<String>> result = TFIDF.getKeywords(dir);
System.out.println(result);
```

and the output will be like this:

```
{
    G:/corpusMini/00001.xml=[曹奔, 周某, 民警],
    G:/corpusMini/00002.xml=[翁刚, 吴红娟, 婆婆],
    G:/corpusMini/00003.xml=[止痛片, 伤肾, 医生]
}
```

**To be able to run the above code, remember to specify how to read the content of the files under the directory in `ReadFile.java`; part `3.2 ReadFile` below explains the details.**
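For reference, the scores computed in `TFIDF.java` are the classic unsmoothed variants, where `N` is the number of documents in the corpus and `df(w)` is the number of documents containing `w`:

```
TF(w, d)     = count(w, d) / |d|
IDF(w)       = log(N / df(w))
TFIDF(w, d)  = TF(w, d) * IDF(w)
```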

### 1.3 TextRank With Multiple Windows

Source File: `TextRankWithMultiWin.java`

`TextRank With Multiple Windows` builds on the classical TextRank algorithm.

As we know, the co-occurrence window is an uncertain parameter in TextRank, and there is currently no way to find the best window size for a document. Therefore this algorithm integrates the results generated by several window sizes by summing up the scores they assign to each word.

Sample code:

```java
String title = "关键词抽取";
String content = "关键词自动提取是一种识别有意义且具有代表性片段或词汇的自动化技术。关键词自动提取在文本挖掘域被称为关键词抽取,在计算语言学领域通常着眼于术语自动识别,在信息检索领域,就是指自动标引。";
int miniWindow = 3, maxWindow = 5;
List<String> keywords = TextRankWithMultiWin.integrateMultiWindow(title, content, miniWindow, maxWindow);
System.out.println(keywords);

// Output: [自动, 关键词, 领域, 提取, 抽取]
```

The above code combines the results generated by co-occurrence windows of size 3 to 5; notice that the result differs slightly from the original TextRank algorithm in 1.1.

Also, you can set the number of keywords to extract with `TextRankWithMultiWin.setKeywordNumber(3);`


### 1.4 TextRank With TF-IDF

This part includes two algorithms, `TextRank score multiplied by IDF` and `TextRank and TF-IDF vote together`, which are both based on the original TextRank and TF-IDF.

**To be able to run these two algorithms, remember to specify how to read the content of the files under the directory in `ReadFile.java`; part `3.2 ReadFile` below explains the details.**

#### 1.4.1 TextRank score multiplied by IDF

Source File: `TextRankWithTFIDF.java`

Starting from the scores generated by the original TextRank algorithm, this algorithm multiplies each word's score by its IDF value. In this way it takes the weight of a word within the whole corpus into account. **Therefore, this is an algorithm for a corpus (that is, it can't be used to extract keywords from a single document).**

```java
String dir = "G:/corpusMini";
//TextRankWithTFIDF.setKeywordsNumber(3); // set the number of keywords to extract, default 5
Map<String, List<String>> result = TextRankWithTFIDF.textRankMultiplyIDF(dir);
System.out.println(result);
```

The output looks like this (supposing there are only 3 documents in the directory):

```
{
    G:/corpusMini/00001.xml=[曹奔, 周某, 民警, 摩托车, 禁区],
    G:/corpusMini/00002.xml=[吴红娟, 翁刚, 婆婆, 丈夫, 分割],
    G:/corpusMini/00003.xml=[伤肾, 止痛片, 胡婆婆, 肾内科, 肾衰竭]
}
```

Also, you can change the number of keywords to extract as the commented line in the code above does.

#### 1.4.2 TextRank and TF-IDF vote together

Source File: `TextRankWithTFIDF.java`

This algorithm is based on both TextRank and TF-IDF.

In order to extract n keywords from a document, it first extracts 2*n candidate keywords with TextRank and TF-IDF respectively, then selects the words that occur in both candidate lists as final keywords; if that does not yield enough keywords, the rest are taken from the result generated by TF-IDF. **Therefore, this is also an algorithm for a corpus.**

Sample code:

```java
String dir = "G:/corpusMini";
//TextRankWithTFIDF.setKeywordsNumber(3); // set the number of keywords to extract, default 5
Map<String, List<String>> result = TextRankWithTFIDF.textRankTFIDFVote(dir);
System.out.println(result);
```

The output looks like this:

```
{
    G:/corpusMini/00007.xml=[曹奔, 周某, 民警, 摩托车, 困难],
    G:/corpusMini/00002.xml=[翁刚, 吴红娟, 婆婆, 丈夫, 法院],
    G:/corpusMini/00018.xml=[止痛片, 伤肾, 肾衰竭, 胡婆婆, 肾内科]
}
```

Also, you can change the number of keywords to extract as the commented line in the code above does.
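The voting step itself is simple; here is a minimal self-contained sketch (the class and method names are hypothetical, but the selection logic mirrors `textRankTFIDFVote`):

```java
import java.util.ArrayList;
import java.util.List;

class VoteSketch
{
    // pick n final keywords from the two candidate lists (2n entries each)
    static List<String> vote(List<String> trCand, List<String> tfidfCand, int n)
    {
        List<String> keywords = new ArrayList<String>();
        for (String w : tfidfCand)        // 1. words both rankings agree on
            if (trCand.contains(w) && keywords.size() < n)
                keywords.add(w);
        for (String w : tfidfCand)        // 2. top up from TF-IDF's ranking
            if (!keywords.contains(w) && keywords.size() < n)
                keywords.add(w);
        return keywords;
    }
}
```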

## 2. Evaluate

The class `F1Score` uses the F1 score to evaluate the keywords extracted by an algorithm. Pass the keywords extracted by the algorithm and the manually assigned keywords as input. Sample code:

```java
String title = "关键词抽取";
String content = "关键词自动提取是一种识别有意义且具有代表性片段或词汇的自动化技术。关键词自动提取在文本挖掘域被称为关键词抽取,在计算语言学领域通常着眼于术语自动识别,在信息检索领域,就是指自动标引。";
List<String> sysKeywords = TextRank.getKeyword(title, content);
String[] manualKeywords = {"关键词", "自动提取"};
List<Float> result = F1Score.calculate(sysKeywords, manualKeywords);
System.out.println(result);

// Output: [20.0, 50.0, 28.57]
```

The three numbers are precision = 20%, recall = 50% and F1 = 28.57%: with 1 hit out of 5 extracted keywords and 2 manual keywords, precision = hit/5, recall = hit/2, and F1 = 2*P*R/(P+R) is their harmonic mean.


## 3. Parse Documents

### 3.1 ReadDir

The `ReadDir` class provides a method to find the paths of all files under a certain directory, including its sub-directories. For example:

```java
String dirPath = "G:/corpusMini";
List<String> fileList = ReadDir.readDirFileNames(dirPath);
for(String file : fileList)
    System.out.println(file);
```

and the output looks like this:

```
G:/corpusMini/00001.xml
G:/corpusMini/00002.xml
G:/corpusMini/test/00003.xml
```

As you can see, the method also reads the files of the subdirectory `test`. Because sub-directory paths are built by joining with `/`, remember not to use `/` as the last character of `dirPath`.

### 3.2 ReadFile

The `ReadFile` class is designed to load the content of files of a certain type. **Remember that you have to implement the `loadFile` method according to the type of your files.** The default implementation parses the XML files [here](https://github.com/iamxiatian/data/tree/master/sohu-dataset), and the code is like this:

```java
// remember to replace the following code according to the type of your files
/*
String filePath = "G:/corpusMini/00001.xml";
ParseXML parser = new ParseXML();
String content = parser.parseXML(filePath, "content");
*/
```
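For the default sohu-dataset case, a minimal `loadFile` body inside `ReadFile.java` could look like this — a sketch that simply delegates to the repository's `ParseXML`, assuming each document stores its text under a second-level `content` tag:

```java
public static String loadFile(String filePath)
{
    // default: treat each file as a sohu-corpus XML document and
    // return the text under its second-level <content> tag
    ParseXML parser = new ParseXML();
    return parser.parseXML(filePath, "content");
}
```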

--------------------------------------------------------------------------------
/src/com/lc/nlp/keyword/algorithm/TFIDF.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Date: 2016-05-22 17:46:15
 * Last modified by: WuLC
 * Last Modified time: 2016-05-23 23:31:25
 * Email: liangchaowu5@gmail.com
 ************************************************************
 * Function: get keywords of files through the TF-IDF algorithm
 * Input: path of the directory of the files that need keyword extraction
 * Output: keywords of each file
 */

package com.lc.nlp.keyword.algorithm;

import java.util.*;
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary;
import com.hankcs.hanlp.seg.common.Term;

import com.lc.nlp.parsedoc.*;


public class TFIDF
{

    private static int keywordsNumber = 5;

    /**
     * change the number of keywords, default 5
     * @param keywordNum(int): number of keywords that need to be extracted
     */
    public static void setKeywordsNumber(int keywordNum)
    {
        keywordsNumber = keywordNum;
    }

    /**
     * calculate the TF value of each word in terms of the content of a file
     * @param fileContent(String): content of the file
     * @return(HashMap<String, Float>): "word:TF value" pairs
     */
    public static HashMap<String, Float> getTF(String fileContent)
    {
        List<Term> terms = new ArrayList<Term>();
        ArrayList<String> words = new ArrayList<String>();

        terms = HanLP.segment(fileContent);
        for(Term t : terms)
        {
            if(TFIDF.shouldInclude(t))
            {
                words.add(t.word);
            }
        }

        // count the raw occurrences of each word
        HashMap<String, Integer> wordCount = new HashMap<String, Integer>();
        HashMap<String, Float> TFValues = new HashMap<String, Float>();
        for(String word : words)
        {
            if(wordCount.get(word) == null)
            {
                wordCount.put(word, 1);
            }
            else
            {
                wordCount.put(word, wordCount.get(word) + 1);
            }
        }

        // TF(w) = count(w) / number of words in the document
        int wordLen = words.size();
        Iterator<Map.Entry<String, Integer>> iter = wordCount.entrySet().iterator();
        while(iter.hasNext())
        {
            Map.Entry<String, Integer> entry = iter.next();
            TFValues.put(entry.getKey(), entry.getValue() / (float) wordLen);
        }
        return TFValues;
    }


    /**
     * judge whether a word belongs to the stop words
     * @param term(Term): word to be judged
     * @return(boolean): false if the word is a stop word, true otherwise
     */
    public static boolean shouldInclude(Term term)
    {
        return CoreStopWordDictionary.shouldInclude(term);
    }


    /**
     * calculate the TF value of each word of each file under a directory
     * @param dirPath(String): path of the directory
     * @return(HashMap<String, HashMap<String, Float>>): path of each file and its "word:TF value" pairs
     */
    public static HashMap<String, HashMap<String, Float>> tfForDir(String dirPath)
    {
        HashMap<String, HashMap<String, Float>> allTF = new HashMap<String, HashMap<String, Float>>();
        List<String> filelist = ReadDir.readDirFileNames(dirPath);

        for(String file : filelist)
        {
            String content = ReadFile.loadFile(file); // remember to adapt ReadFile.loadFile to your file type
            HashMap<String, Float> dict = TFIDF.getTF(content);
            allTF.put(file, dict);
        }
        return allTF;
    }


    /**
     * calculate the IDF value of each word under a directory
     * @param dirPath(String): path of the directory
     * @return(HashMap<String, Float>): "word:IDF value" pairs
     */
    public static HashMap<String, Float> idfForDir(String dirPath)
    {
        List<String> fileList = ReadDir.readDirFileNames(dirPath);
        int docNum = fileList.size();

        Map<String, Set<String>> passageWords = new HashMap<String, Set<String>>();
        // collect the distinct words of each file
        for(String filePath : fileList)
        {
            Set<String> words = new HashSet<String>();
            String content = ReadFile.loadFile(filePath); // remember to adapt ReadFile.loadFile to your file type
            List<Term> terms = HanLP.segment(content);
            for(Term t : terms)
            {
                if(TFIDF.shouldInclude(t))
                {
                    words.add(t.word);
                }
            }
            passageWords.put(filePath, words);
        }

        // count the number of documents containing each word (document frequency)
        HashMap<String, Integer> wordPassageNum = new HashMap<String, Integer>();
        for(String filePath : fileList)
        {
            Set<String> wordSet = passageWords.get(filePath);
            for(String word : wordSet)
            {
                if(wordPassageNum.get(word) == null)
                    wordPassageNum.put(word, 1);
                else
                    wordPassageNum.put(word, wordPassageNum.get(word) + 1);
            }
        }
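
        // turn document frequencies into IDF values using the classic unsmoothed
        // formula idf(w) = log(N / df(w)); every word seen here has df(w) >= 1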
        HashMap<String, Float> wordIDF = new HashMap<String, Float>();
        Iterator<Map.Entry<String, Integer>> iter_dict = wordPassageNum.entrySet().iterator();
        while(iter_dict.hasNext())
        {
            Map.Entry<String, Integer> entry = iter_dict.next();
            float value = (float) Math.log(docNum / (float) entry.getValue());
            wordIDF.put(entry.getKey(), value);
        }
        return wordIDF;
    }


    /**
     * calculate the TF-IDF value of each word of each file under a directory
     * @param dirPath(String): path of the directory
     * @return(Map<String, HashMap<String, Float>>): path of each file and its "word:TF-IDF value" pairs
     */
    public static Map<String, HashMap<String, Float>> getDirTFIDF(String dirPath)
    {
        HashMap<String, HashMap<String, Float>> dirFilesTF = TFIDF.tfForDir(dirPath);
        HashMap<String, Float> dirFilesIDF = TFIDF.idfForDir(dirPath);

        // TFIDF(w, d) = TF(w, d) * IDF(w)
        Map<String, HashMap<String, Float>> dirFilesTFIDF = new HashMap<String, HashMap<String, Float>>();
        List<String> fileList = ReadDir.readDirFileNames(dirPath);
        for (String filePath : fileList)
        {
            HashMap<String, Float> temp = new HashMap<String, Float>();
            Map<String, Float> singlePassageWord = dirFilesTF.get(filePath);
            Iterator<Map.Entry<String, Float>> it = singlePassageWord.entrySet().iterator();
            while(it.hasNext())
            {
                Map.Entry<String, Float> entry = it.next();
                String word = entry.getKey();
                Float tfidf = entry.getValue() * dirFilesIDF.get(word);
                temp.put(word, tfidf);
            }
            dirFilesTFIDF.put(filePath, temp);
        }
        return dirFilesTFIDF;
    }


    /**
     * get the keywords of each file under a certain directory
     * @param dirPath(String): path of the directory
     * @return(Map<String, List<String>>): path of each file and its keywords
     */
    public static Map<String, List<String>> getKeywords(String dirPath)
    {
        List<String> fileList = ReadDir.readDirFileNames(dirPath);

        // calculate the TF-IDF value of each word of each file under dirPath
        Map<String, HashMap<String, Float>> dirTFIDF = TFIDF.getDirTFIDF(dirPath);

        Map<String, List<String>> keywordsForDir = new HashMap<String, List<String>>();
        for (String file : fileList)
        {
            Map<String, Float> singlePassageTFIDF = dirTFIDF.get(file);

            // sort the words by TF-IDF value in descending order
            List<Map.Entry<String, Float>> entryList = new ArrayList<Map.Entry<String, Float>>(singlePassageTFIDF.entrySet());
            Collections.sort(entryList, new Comparator<Map.Entry<String, Float>>()
            {
                @Override
                public int compare(Map.Entry<String, Float> c1, Map.Entry<String, Float> c2)
                {
                    return c2.getValue().compareTo(c1.getValue());
                }
            });

            // take the top keywordsNumber words as keywords
            List<String> systemKeywordList = new ArrayList<String>();
            for(int k = 0; k < keywordsNumber && k < entryList.size(); k++)
                systemKeywordList.add(entryList.get(k).getKey());
            keywordsForDir.put(file, systemKeywordList);
        }
        return keywordsForDir;
    }
}

--------------------------------------------------------------------------------
/src/com/lc/nlp/keyword/algorithm/TextRank.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Email: liangchaowu5@gmail.com
 ************************************************************
 * Function: extract the keywords of a document through the TextRank algorithm
 * Input: title and content of a document
 * Output(List<String>): keywords of the text
 */
package com.lc.nlp.keyword.algorithm;

import java.util.*;
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary;
import com.hankcs.hanlp.seg.common.Term;


public class TextRank
{
    static final float d = 0.85f;              // damping factor, default 0.85
    static final int max_iter = 200;           // maximum number of iterations
    static final float min_diff = 0.0001f;     // convergence threshold
    private static int nKeyword = 5;           // number of keywords to extract, default 5
    private static int coOccuranceWindow = 3;  // size of the co-occurrence window, default 3

    // change default parameters
    public static void setKeywordNumber(int sysKeywordNum)
    {
        nKeyword = sysKeywordNum;
    }


    public static void setWindowSize(int window)
    {
        coOccuranceWindow = window;
    }


    /**
     * extract keywords in terms of the title and content of a document
     * @param title(String): title of the document
     * @param content(String): content of the document
     * @return (List<String>): list of keywords
     */
    public static List<String> getKeyword(String title, String content)
    {
        Map<String, Float> score = TextRank.getWordScore(title, content);

        // rank the words by their score
        List<Map.Entry<String, Float>> entryList = new ArrayList<Map.Entry<String, Float>>(score.entrySet());
        Collections.sort(entryList, new Comparator<Map.Entry<String, Float>>()
        {
            @Override
            public int compare(Map.Entry<String, Float> o1, Map.Entry<String, Float> o2)
            {
                return (o1.getValue() - o2.getValue() > 0 ? -1 : 1);
            }
        });

        // take the top nKeyword words, guarding against short texts
        List<String> sysKeywordList = new ArrayList<String>();
        for (int i = 0; i < nKeyword && i < entryList.size(); ++i)
        {
            sysKeywordList.add(entryList.get(i).getKey());
        }
        return sysKeywordList;
    }


    /**
     * judge whether a word belongs to the stop words
     * @param term(Term): word to be judged
     * @return(boolean): false if the word is a stop word, true otherwise
     */
    public static boolean shouldInclude(Term term)
    {
        return CoreStopWordDictionary.shouldInclude(term);
    }


    /**
     * return the score of each word computed by the TextRank algorithm
     * @param title(String): title of the document
     * @param content(String): content of the document
     * @return (Map<String, Float>): score of each word
     */
    public static Map<String, Float> getWordScore(String title, String content)
    {
        // segment the text into words
        List<Term> termList = HanLP.segment(title + content);

        int count = 1; // position of each word
        Map<String, Integer> wordPosition = new HashMap<String, Integer>();
        List<String> wordList = new ArrayList<String>();

        // filter stop words
        for (Term t : termList)
        {
            if (shouldInclude(t))
            {
                wordList.add(t.word);
                if(!wordPosition.containsKey(t.word))
                {
                    wordPosition.put(t.word, count);
                    count++;
                }
            }
        }

        // build the word graph in terms of the co-occurrence window:
        // each key's set holds the words linked to it within the window
        Map<String, Set<String>> words = new HashMap<String, Set<String>>();
        Queue<String> que = new LinkedList<String>();
        for (String w : wordList)
        {
            if (!words.containsKey(w))
            {
                words.put(w, new HashSet<String>());
            }
            que.offer(w); // push to the end of the queue
            if (que.size() > coOccuranceWindow)
            {
                que.poll(); // pop from the front of the queue
            }

            for (String w1 : que)
            {
                for (String w2 : que)
                {
                    if (w1.equals(w2))
                    {
                        continue;
                    }

                    words.get(w1).add(w2);
                    words.get(w2).add(w1);
                }
            }
        }

        // iterate until convergence
        Map<String, Float> score = new HashMap<String, Float>();
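        // PageRank-style update: S(w) = (1 - d) + d * sum over linked words u of S(u) / deg(u);
        // stop after max_iter rounds, or earlier once the largest per-word change drops below min_diff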
        for (int i = 0; i < max_iter; ++i)
        {
            Map<String, Float> m = new HashMap<String, Float>();
            float max_diff = 0;
            for (Map.Entry<String, Set<String>> entry : words.entrySet())
            {
                String key = entry.getKey();
                Set<String> value = entry.getValue();
                m.put(key, 1 - d);
                for (String other : value)
                {
                    int size = words.get(other).size();
                    if (key.equals(other) || size == 0) continue;
                    m.put(key, m.get(key) + d / size * (score.get(other) == null ? 0 : score.get(other)));
                }

                max_diff = Math.max(max_diff, Math.abs(m.get(key) - (score.get(key) == null ? 1 : score.get(key))));
            }
            score = m;

            // exit once the scores have converged
            if (max_diff <= min_diff)
                break;
        }
        return score;
    }
}

--------------------------------------------------------------------------------
/src/com/lc/nlp/keyword/algorithm/TextRankWithMultiWin.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Date: 2016-05-23 16:04:42
 * Last modified by: WuLC
 * Last Modified time: 2016-05-24 16:00:11
 * Email: liangchaowu5@gmail.com
 ***********************************************************************************
 * Function: integrate the results of the TextRank algorithm under different sizes of the co-occurrence window
 * Input: title and content of a document
 * Output: keywords of the document
 */
package com.lc.nlp.keyword.algorithm;

import java.util.*;

public class TextRankWithMultiWin
{
    private static int keywordNum = 5;

    /**
     * set the number of keywords to extract
     * @param sysKeywordNum(int): number of keywords to extract
     */
    public static void setKeywordNumber(int sysKeywordNum)
    {
        keywordNum = sysKeywordNum;
    }


    /**
     * integrate the results of the TextRank algorithm under different co-occurrence windows
     * @param title(String): title of the document
     * @param content(String): content of the document
     * @param minWindow(int): the minimum size of the co-occurrence window
     * @param maxWindow(int): the maximum size of the co-occurrence window
     * @return(List<String>): keywords of the document
     */
    public static List<String> integrateMultiWindow(String title, String content, int minWindow, int maxWindow)
    {
        Map<String, Float> tempKeywordScore = new HashMap<String, Float>();
        Map<String, Float> allKeywordScore = new HashMap<String, Float>();
        String key = null;
        Float value = null;
        for(int i = minWindow; i <= maxWindow; i++)
        {
            TextRank.setWindowSize(i); // set the size of the co-occurrence window
            tempKeywordScore = TextRank.getWordScore(title, content);
            // sum up the score of each word over all window sizes
            Iterator<Map.Entry<String, Float>> it = tempKeywordScore.entrySet().iterator();
            while (it.hasNext())
            {
                Map.Entry<String, Float> entry = it.next();
                key = entry.getKey();
                value = entry.getValue();
                if(allKeywordScore.containsKey(key))
                    allKeywordScore.put(key, allKeywordScore.get(key) + value);
                else
                    allKeywordScore.put(key, value);
            }
        }
        // sort the words by their accumulated score
        List<Map.Entry<String, Float>> entryList = new ArrayList<Map.Entry<String, Float>>(allKeywordScore.entrySet());
        Collections.sort(entryList, new Comparator<Map.Entry<String, Float>>()
        {
            @Override
            public int compare(Map.Entry<String, Float> c1, Map.Entry<String, Float> c2)
            {
                return c2.getValue().compareTo(c1.getValue());
            }
        });

        List<String> fileKeywords = new ArrayList<String>();
        for(int j = 0; j < keywordNum && j < entryList.size(); j++)
            fileKeywords.add(entryList.get(j).getKey());
        return fileKeywords;
    }
}

--------------------------------------------------------------------------------
/src/com/lc/nlp/keyword/algorithm/TextRankWithTFIDF.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Email: liangchaowu5@gmail.com
 ***********************************************************************************
 * Function: two algorithms combining TextRank and TF-IDF: multiply the TextRank score
 *           of each word by its IDF value, or let TextRank and TF-IDF vote together
 * Input: path of the directory of a corpus
 * Output: keywords of each document of the corpus
 */
package com.lc.nlp.keyword.algorithm;

import java.util.*;
import com.lc.nlp.parsedoc.*;

public class TextRankWithTFIDF
{
    private static int keywordsNumber = 5;                       // number of keywords to extract, default 5
    private static int keywordCandidateNum = 2 * keywordsNumber; // each method contributes 2n candidates for the vote

    /**
     * set the number of keywords to extract
     * @param keywordNum(int): number of keywords to extract
     */
    public static void setKeywordsNumber(int keywordNum)
    {
        keywordsNumber = keywordNum;
        keywordCandidateNum = 2 * keywordNum;
    }


    /**
     * multiply the TextRank score of each word by its IDF value in the corpus
     * @param dirPath(String): path of the directory of the corpus
     * @return(Map<String, List<String>>): keywords of each document of the corpus
     */
    public static Map<String, List<String>> textRankMultiplyIDF(String dirPath)
    {
        Map<String, List<String>> result = new HashMap<String, List<String>>();

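        // idea: re-weight each word's TextRank score by its corpus-level IDF,
        // so that words common across the whole corpus are pushed down the ranking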
        // get the IDF value of each word of the corpus
        Map<String, Float> idfForDir = TFIDF.idfForDir(dirPath);
        List<String> fileList = ReadDir.readDirFileNames(dirPath);
        String content = null;

        for(String file : fileList)
        {
            content = ReadFile.loadFile(file);
            Map<String, Float> trKeywords = TextRank.getWordScore("", content);
            Iterator<Map.Entry<String, Float>> it = trKeywords.entrySet().iterator();
            while(it.hasNext())
            {
                Map.Entry<String, Float> temp = it.next();
                String key = temp.getKey();
                trKeywords.put(key, temp.getValue() * idfForDir.get(key));
            }

            // sort the words by their score in descending order
            List<Map.Entry<String, Float>> entryList = new ArrayList<Map.Entry<String, Float>>(trKeywords.entrySet());
            Collections.sort(entryList,
                new Comparator<Map.Entry<String, Float>>()
                {
                    public int compare(Map.Entry<String, Float> c1, Map.Entry<String, Float> c2)
                    {
                        return c2.getValue().compareTo(c1.getValue());
                    }
                }
            );

            List<String> temp = new ArrayList<String>();
            for (int i = 0; i < keywordsNumber && i < entryList.size(); i++)
                temp.add(entryList.get(i).getKey());
            result.put(file, temp);
        }
        return result;
    }


    /**
     * let TextRank and TF-IDF vote together: words ranked among the candidates of both
     * methods become keywords first, and TF-IDF's ranking fills any remaining slots
     * @param dirPath(String): path of the directory of the corpus
     * @return(Map<String, List<String>>): keywords of each document of the corpus
     */
    public static Map<String, List<String>> textRankTFIDFVote(String dirPath)
    {
        Map<String, List<String>> result = new HashMap<String, List<String>>();
        List<String> fileList = ReadDir.readDirFileNames(dirPath);

        // get 2n candidate keywords per document from TF-IDF and TextRank
        TFIDF.setKeywordsNumber(keywordCandidateNum);
        TextRank.setKeywordNumber(keywordCandidateNum); // per the README, TextRank contributes 2n candidates as well
        Map<String, List<String>> tfidfKeywordsForDir = TFIDF.getKeywords(dirPath);

        List<String> trKeyword = new ArrayList<String>();
        List<String> tfidfKeyword = new ArrayList<String>();
        String content = null;
        for(String file : fileList)
        {
            content = ReadFile.loadFile(file);
            trKeyword = TextRank.getKeyword("", content);
            tfidfKeyword = tfidfKeywordsForDir.get(file);

            // 1. words that appear in both candidate lists become keywords first
            List<String> temp = new ArrayList<String>();
            for(String keyword : tfidfKeyword)
            {
                if (trKeyword.contains(keyword))
                    temp.add(keyword);
                if (temp.size() == keywordsNumber)
                    break;
            }
            // 2. if that is not enough, top up from TF-IDF's ranking
            if (temp.size() < keywordsNumber)
                for(String keyword : tfidfKeyword)
                {
                    if (!temp.contains(keyword))
                        temp.add(keyword);
                    if (temp.size() == keywordsNumber)
                        break;
                }
            result.put(file, temp);
        }
        return result;
    }

}

--------------------------------------------------------------------------------
/src/com/lc/nlp/keyword/evaluate/F1Score.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Date: 2016-05-20 22:56:12
 * Last modified by: WuLC
 * Last Modified time: 2016-05-20 23:01:16
 * Email: liangchaowu5@gmail.com
 ***************************************************************************
 * Function: calculate precision, recall and F1 score in terms of the keywords extracted by the algorithm and manually
 * Input: keywords extracted by the algorithm and manually
 * Output: precision, recall, F1 score
 */

package com.lc.nlp.keyword.evaluate;

import java.util.*;
import java.text.DecimalFormat;

public class F1Score
{
    private static DecimalFormat df = new DecimalFormat("0.00"); // format the output to two decimal places

    /**
     * calculate the precision, recall and F1 score of the keywords extracted by the
     * algorithm against the keywords extracted manually
     * @param systemKeywords(List<String>): keywords extracted by the algorithm
     * @param manualKeywords(String[]): keywords extracted manually
     * @return (List<Float>): precision, recall, F1 score
     */
    public static List<Float> calculate(List<String> systemKeywords, String[] manualKeywords)
    {
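        // precision = hit / |system|, recall = hit / |manual| and F1 = 2PR / (P + R),
        // where hit counts the extracted keywords that also appear in the manual list;
        // all three are scaled to percentages and rounded to two decimals below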
        int sysLen = systemKeywords.size();
        int manLen = manualKeywords.length;
        int hit = 0;
        for(int i = 0; i < sysLen; i++)
        {
            for(int j = 0; j < manLen; j++)
            {
                if(systemKeywords.get(i).equals(manualKeywords[j]))
                {
                    hit++;
                    break;
                }
            }
        }

        float pValue = (float) hit / sysLen * 100;
        float rValue = (float) hit / manLen * 100;
        float fValue = 2 * pValue * rValue / (pValue + rValue);

        List<Float> result = new ArrayList<Float>();
        result.add(Float.parseFloat(df.format(pValue)));
        result.add(Float.parseFloat(df.format(rValue)));
        result.add(Float.parseFloat(df.format(fValue)));
        return result;
    }
}

--------------------------------------------------------------------------------
/src/com/lc/nlp/parsedoc/ParseXML.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Date: 2016-05-19 22:49:42
 * Last modified by: WuLC
 * Last Modified time: 2016-05-19 22:53:31
 * Email: liangchaowu5@gmail.com
 **************************************************************
 * Function: parse an XML file in terms of a certain tag
 * Input: file path of the XML file and a tag
 * Output: content of the tag
 */

package com.lc.nlp.parsedoc;

import java.io.File;
import java.util.Iterator;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

public class ParseXML
{
    /**
     * return the content of a certain tag of an XML file;
     * the tag should be at the second level, or you can modify the code to reach other levels
     * @param fileName(String): file path of the XML file
     * @param tag(String): a second-level tag of the XML file
     * @return(String): content of the tag of the file
     */
    public String parseXML(String fileName, String tag)
    {
        File inputXml = new File(fileName);
        SAXReader saxReader = new SAXReader();
        Document document = null;
        Element rootTag = null, subTag = null;
        boolean hasTag = false;
        try
        {
            document = saxReader.read(inputXml);
            rootTag = document.getRootElement();
            // scan the second-level elements for the wanted tag
            for (Iterator i = rootTag.elementIterator(); i.hasNext();)
            {
                subTag = (Element) i.next();
                if(subTag.getName().equals(tag))
                {
                    hasTag = true;
                    break;
                }
            }
        }
        catch (DocumentException e)
        {
            System.out.println(e.getMessage());
        }
        if(hasTag)
            return subTag.getText();
        else
            return "no such tag in the xml file";
    }
}

--------------------------------------------------------------------------------
/src/com/lc/nlp/parsedoc/ReadDir.java:
--------------------------------------------------------------------------------
/**
 * Author: WuLC
 * Date: 2016-05-19 22:49:42
 * Last modified by: WuLC
 * Last Modified time: 2016-05-19 22:53:09
 * Email: liangchaowu5@gmail.com
 ************************************************************************************
 * Function: read the paths of all the files under a directory, including its sub-directories
 * Input(String): path of the directory; the last character can't be / because sub-directory paths are joined with /
 * Output(List<String>): paths of all files under the directory
 */

package com.lc.nlp.parsedoc;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ReadDir
{
    /**
     * read the paths of all the files under a directory, including its sub-directories
     * @param dirPath(String): path of the directory; remember the last character can't be /
     * @return(List<String>): paths of all files under the directory
     */
    public static List<String> readDirFileNames(String dirPath)
    {
        if (dirPath == null || dirPath.equals(""))
        {
            System.out.println("The path of the directory can't be empty");
            System.exit(0);
        }
        else if(dirPath.substring(dirPath.length() - 1).equals("/"))
        {
            System.out.println("The last character of the path of the directory can't be /");
            System.exit(0);
        }

        File dirFile = new File(dirPath);
        String[] fileNameList = null;
        String tmp = null;
        List<String> fileList = new ArrayList<String>();
        List<String> subFileList = new ArrayList<String>();

        fileNameList = dirFile.list();
        for(int i = 0; i < fileNameList.length; i++)
        {
            tmp = dirPath + "/" + fileNameList[i]; // this join is why dirPath must not end with /
            File f = new File(tmp);
            if (f.isDirectory())
            {
                subFileList = readDirFileNames(tmp); // recurse into the sub-directory
                fileList.addAll(subFileList);
            }
            else
            {
                fileList.add(tmp);
            }
        }
        return fileList;
    }
}