├── .gitignore
├── README.md
├── build.gradle
├── data
│   └── model
│       └── perceptron
│           └── pku199801
│               └── cws.bin
├── libs
│   ├── hanlp-1.6.0-sources.jar
│   └── hanlp-1.6.0.jar
├── settings.gradle
└── src
    └── main
        ├── java
        │   └── com
        │       └── zongwu33
        │           └── test
        │               ├── CreateSimpleCorpus.java
        │               ├── DeleteBlankLine.java
        │               ├── PerformanceTest.java
        │               ├── TestForSIGHan2005.java
        │               └── TestModel.java
        └── resources
            └── hanlp.properties

/.gitignore:
--------------------------------------------------------------------------------
.idea/**
.idea/vcs.xml
build/**
out/**
.gradle/**
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# HanLP vs LTP word-segmentation benchmark

This benchmark uses HanLP 1.6.0 and LTP 3.4.0.

## Approach

Train both segmenters on the same corpus, then evaluate both on the same test data.

The training corpus is the January 1998 People's Daily corpus: [199801 People's Daily corpus](https://github.com/hankcs/OpenCorpus/blob/master/pku98/199801.txt).

The corpus carries part-of-speech tags; to conform to LTP's training-data format, the POS tags must be stripped.

The test data is the open test sets provided by [SIGHan2005](http://sighan.cs.uchicago.edu/bakeoff2005/).

See the readme bundled with SIGHan2005 for how to use it.

## HanLP

```bash
java -cp libs/hanlp-1.6.0.jar com.hankcs.hanlp.model.perceptron.Main -task CWS -train -reference ../OpenCorpus/pku98/199801.txt -model cws.bin

mkdir -p data/model/perceptron/pku199801

mv -f cws.bin data/model/perceptron/pku199801/cws.bin
```

By default, training runs for 5 iterations.

Edit src/main/resources/hanlp.properties so that:

```properties
root=../test-hanlp-ltp
```

Build the jar:

```bash
gradle clean build
```

### SIGHan2005 MSR test set

Run:

```bash
java -cp build/libs/test-hanlp-ltp-1.0-SNAPSHOT.jar com.zongwu33.test.TestForSIGHan2005 ../NLP/icwb2-data/testing/msr_test.utf8 segment-msr-result.txt
```

This writes the segmentation output to `segment-msr-result.txt`.
Use the SIGHan2005 script to compute the scores:

```bash
perl ../NLP/icwb2-data/scripts/score ../NLP/icwb2-data/gold/msr_training_words.utf8 \
  ../NLP/icwb2-data/gold/msr_test_gold.utf8 segment-msr-result.txt > score-msr.utf8
```

HanLP's results on the MSR test set:

```
=== TOTAL TRUE WORDS RECALL: 0.870
=== TOTAL TEST WORDS PRECISION: 0.848
=== F MEASURE: 0.859
```

### SIGHan2005 PKU test set

```bash
java -cp build/libs/test-hanlp-ltp-1.0-SNAPSHOT.jar com.zongwu33.test.TestForSIGHan2005 ../NLP/icwb2-data/testing/pku_test.utf8 segment-pku-result.txt
```

```bash
perl ../NLP/icwb2-data/scripts/score ../NLP/icwb2-data/gold/pku_training_words.utf8 ../NLP/icwb2-data/gold/pku_test_gold.utf8 segment-pku-result.txt > score-pku.utf8
```

Results:

```
=== TOTAL TRUE WORDS RECALL: 0.894
=== TOTAL TEST WORDS PRECISION: 0.915
=== F MEASURE: 0.905
```

## LTP

LTP was installed via Docker.

Generate a training file that conforms to LTP's training format:

```bash
java -cp build/libs/test-hanlp-ltp-1.0-SNAPSHOT.jar com.zongwu33.test.CreateSimpleCorpus ../OpenCorpus/pku98/199801.txt simple-199801.txt
```

simple-199801.txt is the result.
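For illustration (an abridged sample; actual corpus lines also begin with a date/ID token), CreateSimpleCorpus turns a tagged line such as

```
迈向/v  充满/v  希望/n  的/u  新/a  世纪/n  ——/w  一九九八年/t  新年/t  讲话/n
```

into a plain space-separated line:

```
迈向 充满 希望 的 新 世纪 —— 一九九八年 新年 讲话
```

Bracketed compounds such as `[中央/n 人民/n 广播/vn 电台/n]/nt` are split into their component words.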
Specify this same file as both the training set and the development set:

```bash
../LTP/ltp-3.4.0/tools/train/otcws learn --model model-test --reference simple-199801.txt --development simple-199801.txt --max-iter 5
```

### SIGHan2005 MSR test set

Test:

```bash
../LTP/ltp-3.4.0/tools/train/otcws test --model model-test --input /data/testLTP/icwb2-data/testing/msr_test.utf8 > msr_result.txt
```

Use the SIGHan2005 script to compute the scores:

```bash
perl icwb2-data/scripts/score icwb2-data/gold/msr_training_words.utf8 \
  icwb2-data/gold/msr_test_gold.utf8 msr_result.txt > ltp-msr-score.utf8
```

Inspect ltp-msr-score.utf8:

```
=== TOTAL TRUE WORDS RECALL: 0.886
=== TOTAL TEST WORDS PRECISION: 0.854
=== F MEASURE: 0.870
```

### SIGHan2005 PKU test set

```bash
../LTP/ltp-3.4.0/tools/train/otcws test --model model-test --input /data/testLTP/icwb2-data/testing/pku_test.utf8 > pku_result.txt
```

```bash
perl icwb2-data/scripts/score icwb2-data/gold/pku_training_words.utf8 \
  icwb2-data/gold/pku_test_gold.utf8 pku_result.txt > ltp-pku-score.utf8
```

```
=== TOTAL TRUE WORDS RECALL: 0.928
=== TOTAL TEST WORDS PRECISION: 0.939
=== F MEASURE: 0.934
```

## Comparison

MSR test set:

|       | Iterations | RECALL | PRECISION | F1    |
|-------|------------|--------|-----------|-------|
| HanLP | 5          | 0.870  | 0.848     | 0.859 |
| LTP   | 5          | 0.886  | 0.854     | 0.870 |
| HanLP | 50         | 0.881  | 0.855     | 0.868 |
| LTP   | 50         | 0.888  | 0.859     | 0.873 |

PKU test set:

|       | Iterations | RECALL | PRECISION | F1    |
|-------|------------|--------|-----------|-------|
| HanLP | 5          | 0.894  | 0.915     | 0.905 |
| LTP   | 5          | 0.928  | 0.939     | 0.934 |
| HanLP | 50         | 0.908  | 0.922     | 0.915 |
| LTP   | 50         | 0.931  | 0.946     | 0.939 |
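For reference, the F MEASURE reported by the score script is the harmonic mean of recall and precision:

```
F1 = 2 × P × R / (P + R)
```

e.g. for HanLP on MSR with 5 iterations: 2 × 0.848 × 0.870 / (0.848 + 0.870) ≈ 0.859, matching the table.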
### Performance test

Alibaba Cloud ECS instance: 4 × Intel Xeon CPU @ 2.50 GHz, 16 GB RAM.

Test data: a 20 MB web novel, approximately 140,315 sentences (blank lines removed).

#### HanLP

```bash
java -cp test-hanlp-ltp-1.0-SNAPSHOT.jar com.zongwu33.test.PerformanceTest ../NLP/strict-utf8-booken.txt
```

```
init model: 313 ms
total time:15677 ms
total num:140315
```

Segmentation takes 15.677 s in total, which works out to a processing speed of about 1375 KB/s.

#### LTP

```bash
../LTP/ltp-3.4.0/tools/train/otcws test --model model-test --input strict-utf8-booken.txt > /dev/null
```

```
[INFO] 2018-03-26 17:04:19 ||| ltp segmentor, testing ...
[INFO] 2018-03-26 17:04:19 report: input file = strict-utf8-booken.txt
[INFO] 2018-03-26 17:04:19 report: model file = model-test
[INFO] 2018-03-26 17:04:19 report: evaluate = false
[INFO] 2018-03-26 17:04:19 report: sequence probability = false
[INFO] 2018-03-26 17:04:19 report: marginal probability = false
[INFO] 2018-03-26 17:04:19 report: number of labels = 4
[INFO] 2018-03-26 17:04:19 report: number of features = 491820
[INFO] 2018-03-26 17:04:19 report: number of dimension = 1967296
[INFO] 2018-03-26 17:05:13 Elapsed time 53.680000
```

It takes 53.68 s, a processing speed of about 389 KB/s.

### Comparison

|       | Total time | Speed     |
|-------|------------|-----------|
| HanLP | 15.68 s    | 1375 KB/s |
| LTP   | 53.68 s    | 389 KB/s  |

## License

**Apache License Version 2.0**

## Acknowledgements

[HanLP](https://github.com/hankcs/HanLP)

[LTP](https://github.com/HIT-SCIR/ltp)
--------------------------------------------------------------------------------
/build.gradle:
--------------------------------------------------------------------------------
group 'com.zongwu233'
version '1.0-SNAPSHOT'

apply plugin: 'java'

sourceCompatibility = 1.8

repositories {
    mavenCentral()
}

configurations {
    extraLibs
}

sourceSets {
    main {
        compileClasspath += configurations.extraLibs
    }
    test {
        compileClasspath += configurations.extraLibs
    }
}

dependencies {
    extraLibs fileTree(dir: 'libs', include: ['*.jar'])
    extraLibs 'org.apache.commons:commons-lang3:3.6'

    testCompile group: 'junit', name: 'junit', version: '4.12'
}

jar {
    from {
        configurations.extraLibs.collect {
            it.isDirectory() ? it : zipTree(it)
        }
    }
}
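
// Note: the jar block above unpacks everything in configurations.extraLibs
// (the HanLP jar from libs/ plus commons-lang3) into the build output, producing
// a self-contained "fat" jar. This is why the README can run every test class
// with a single -cp build/libs/test-hanlp-ltp-1.0-SNAPSHOT.jar entry.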
--------------------------------------------------------------------------------
/data/model/perceptron/pku199801/cws.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zongwu233/HanLPvsLTP/bd120ed5ee393592e43a9674fd1cf0c0c7631227/data/model/perceptron/pku199801/cws.bin
--------------------------------------------------------------------------------
/libs/hanlp-1.6.0-sources.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zongwu233/HanLPvsLTP/bd120ed5ee393592e43a9674fd1cf0c0c7631227/libs/hanlp-1.6.0-sources.jar
--------------------------------------------------------------------------------
/libs/hanlp-1.6.0.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zongwu233/HanLPvsLTP/bd120ed5ee393592e43a9674fd1cf0c0c7631227/libs/hanlp-1.6.0.jar
--------------------------------------------------------------------------------
/settings.gradle:
--------------------------------------------------------------------------------
rootProject.name = 'test-hanlp-ltp'
--------------------------------------------------------------------------------
/src/main/java/com/zongwu33/test/CreateSimpleCorpus.java:
--------------------------------------------------------------------------------
package com.zongwu33.test;

import org.apache.commons.lang3.StringUtils;

import java.io.*;
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Created by zongwu
 * on 26/03/2018.
 * Converts the POS-tagged People's Daily corpus into a segmentation-only corpus.
 */
public class CreateSimpleCorpus {

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("args length < 2!");
            return;
        }
        String originFile = args[0];
        String resultFile = args[1];
        if (StringUtils.equals(originFile, resultFile)) {
            System.out.println("args can not be the same!");
            return;
        }
        File outFile = new File(resultFile);
        if (outFile.exists()) {
            System.out.println("warning! out file exists!");
            return;
        }
        try {
            BufferedReader reader = new BufferedReader(new FileReader(new File(originFile)));
            BufferedWriter writer = new BufferedWriter(new FileWriter(outFile));
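            // The corpus mixes two token shapes: a bracketed compound whose parts each
            // carry their own tag, e.g. "[中央/n 人民/n 广播/vn 电台/n]/nt", and a plain
            // "word/pos" token, e.g. "朋友/n". The pattern below matches both; compounds
            // go through createComb(), plain tokens through createSingle().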
            // Compile the token pattern once, outside the read loop.
            Pattern pattern = Pattern.compile("(\\[(([^\\s]+/[0-9a-zA-Z]+)\\s+)+?([^\\s]+/[0-9a-zA-Z]+)]/?[0-9a-zA-Z]+)|([^\\s]+/[0-9a-zA-Z]+)");
            String line;
            while ((line = reader.readLine()) != null) {
                List<String> wordList = new LinkedList<>();
                Matcher matcher = pattern.matcher(line);
                while (matcher.find()) {
                    String single = matcher.group();
                    if (single.startsWith("[") && !single.startsWith("[/")) {
                        List<String> words = createComb(single);
                        wordList.addAll(words);
                    } else {
                        String word = createSingle(single);
                        wordList.add(word);
                    }
                }

                save(wordList, writer);
            }
            writer.flush();
            reader.close();
            writer.close();

        } catch (IOException e) {
            e.printStackTrace();
        }

    }

    // [中央/n 人民/n 广播/vn 电台/n]/nt -> "中央", "人民", "广播", "电台"
    private static List<String> createComb(String param) {
        if (param == null) return null;
        int cutIndex = param.lastIndexOf(']');
        if (cutIndex <= 2 || cutIndex == param.length() - 1) return null;
        String wordParam = param.substring(1, cutIndex);
        List<String> wordList = new LinkedList<>();
        for (String single : wordParam.split("\\s+")) {
            if (single.length() == 0) continue;
            wordList.add(createSingle(single));
        }
        return wordList;
    }

    // "朋友/n" -> "朋友", "们/k" -> "们", ",/w" -> ",", "致以/v" -> "致以"
    private static String createSingle(String rawWord) {
        int cutIndex = rawWord.lastIndexOf('/');
        return rawWord.substring(0, cutIndex);
    }
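    // Note: createSingle cuts at the LAST '/', so a word that itself contains a
    // slash is only split before its final POS tag, not at the first slash.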

    private static void save(List<String> segments, BufferedWriter writer) throws IOException {
        String content = StringUtils.join(segments, " ") + '\n';
        System.out.println(content);
        writer.append(content);
    }
}
--------------------------------------------------------------------------------
/src/main/java/com/zongwu33/test/DeleteBlankLine.java:
--------------------------------------------------------------------------------
package com.zongwu33.test;

import org.apache.commons.lang3.StringUtils;

import java.io.*;

/**
 * Created by zongwu
 * on 23/03/2018.
 *
 * Deletes blank lines from the novel; if there are multiple blank lines,
 * LTP aborts processing.
 */
public class DeleteBlankLine {

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("args length < 2!");
            return;
        }
        String testFile = args[0];
        String resultFile = args[1];
        File outFile = new File(resultFile);
        if (outFile.exists()) {
            System.out.println("warning! out file exists!");
            return;
        }
        try {
            BufferedReader reader = new BufferedReader(new FileReader(new File(testFile)));
            BufferedWriter writer = new BufferedWriter(new FileWriter(outFile));
            String line;
            while ((line = reader.readLine()) != null) {
                if (StringUtils.isBlank(line)) continue;
                save(line, writer);
            }
            writer.flush();
            writer.close();
            reader.close();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void save(String content, BufferedWriter writer) throws IOException {
        writer.append(content + '\n');
    }
}
--------------------------------------------------------------------------------
/src/main/java/com/zongwu33/test/PerformanceTest.java:
--------------------------------------------------------------------------------
package com.zongwu33.test;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.model.perceptron.PerceptronSegmenter;
import org.apache.commons.lang3.StringUtils;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

/**
 * Created by zongwu
 * on 26/03/2018.
 */
public class PerformanceTest {

    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("args length < 1!");
            return;
        }
        String testFile = args[0];
        try {
            long stt = System.currentTimeMillis();
            PerceptronSegmenter segmenter = new PerceptronSegmenter(
                    HanLP.Config.PerceptronCWSModelPath);

            // Trigger the model's lazy loading.
            segmenter.segment("商品和服务");
            long edd = System.currentTimeMillis();
            System.out.println("init model: " + (edd - stt));

            BufferedReader reader = new BufferedReader(new FileReader(new File(testFile)));
            String line = null;
            long totalTime = 0L;
            long num = 0;
            while ((line = reader.readLine()) != null) {
                if (StringUtils.isBlank(line)) continue;
                long start = System.currentTimeMillis();
                segmenter.segment(line);
                long end = System.currentTimeMillis();
                ++num;
                totalTime += (end - start);
            }
            reader.close();
            System.out.println("total time:" + totalTime);
            System.out.println("total num:" + num);

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
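// Note: only segmenter.segment(line) sits inside the timed region, so "total time"
// measures pure segmentation and excludes file I/O. The LTP figure in the README
// comes from otcws's own elapsed-time log, which appears to include reading its
// input, so the two throughput numbers are not measured on exactly equal terms.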
--------------------------------------------------------------------------------
/src/main/java/com/zongwu33/test/TestForSIGHan2005.java:
--------------------------------------------------------------------------------
package com.zongwu33.test;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.model.perceptron.PerceptronSegmenter;
import org.apache.commons.lang3.StringUtils;

import java.io.*;
import java.util.List;

/**
 * Created by zongwu
 * on 23/03/2018.
 */
public class TestForSIGHan2005 {

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("args length < 2!");
            return;
        }
        String testFile = args[0];
        String resultFile = args[1];
        File outFile = new File(resultFile);
        if (outFile.exists()) {
            System.out.println("warning! out file exists!");
            return;
        }
        try {
            PerceptronSegmenter segmenter = new PerceptronSegmenter(
                    HanLP.Config.PerceptronCWSModelPath);
            BufferedReader reader = new BufferedReader(new FileReader(new File(testFile)));
            BufferedWriter writer = new BufferedWriter(new FileWriter(outFile));
            String line = null;
            while ((line = reader.readLine()) != null) {
                if (StringUtils.isBlank(line)) continue;
                List segments = segmenter.segment(line);
                save(segments, writer);
            }
            writer.flush();
            writer.close();
            reader.close();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void save(List segments, BufferedWriter writer) throws IOException {
        String content = StringUtils.join(segments, " ") + '\n';
        writer.append(content);
    }
}
--------------------------------------------------------------------------------
/src/main/java/com/zongwu33/test/TestModel.java:
--------------------------------------------------------------------------------
package com.zongwu33.test;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.model.perceptron.PerceptronSegmenter;

/**
 * Created by zongwu
 * on 23/03/2018.
 */
public class TestModel {
    public static void main(String[] args) {
        try {
            PerceptronSegmenter segmenter = new PerceptronSegmenter(
                    HanLP.Config.PerceptronCWSModelPath);
            System.out.println(segmenter.segment("商品和服务"));
        } catch (Exception e) {
            System.out.println(e.getLocalizedMessage());
        }

    }
}
--------------------------------------------------------------------------------
/src/main/resources/hanlp.properties:
--------------------------------------------------------------------------------
#Root directory for every relative path in this file (root + relative path = full path; relative roots are supported, see https://github.com/hankcs/HanLP/pull/254)
#Windows users: use / as the path separator
root=../test-hanlp-ltp
#Note: the test classes load the segmenter via HanLP.Config.PerceptronCWSModelPath;
#the README places the trained model at data/model/perceptron/pku199801/cws.bin
#under root, where that default path points.
#Core dictionary path
CoreDictionaryPath=data/dictionary/CoreNatureDictionary.txt
#Bigram dictionary path
BiGramDictionaryPath=data/dictionary/CoreNatureDictionary.ngram.txt
#Stopword dictionary path
CoreStopWordDictionaryPath=data/dictionary/stopwords.txt
#Synonym dictionary path
CoreSynonymDictionaryDictionaryPath=data/dictionary/synonym/CoreSynonym.txt
#Person-name dictionary path
PersonDictionaryPath=data/dictionary/person/nr.txt
#Person-name transition-matrix path
PersonDictionaryTrPath=data/dictionary/person/nr.tr.txt
#Root of the traditional/simplified Chinese dictionaries
tcDictionaryRoot=data/dictionary/tc
#Custom dictionary paths, multiple entries separated by ;. A leading space means the same directory as the previous entry; the form "filename pos" makes that POS the dictionary's default. Priority decreases from left to right.
#data/dictionary/custom/CustomDictionary.txt is a high-quality dictionary, please do not delete it. All dictionaries use UTF-8 encoding.
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; 现代汉语补充词库.txt; 全国地名大全.txt ns; 人名词典.txt; 机构名词典.txt; 上海地名.txt ns;data/dictionary/person/nrf.txt nrf;
#CRF segmentation model path
CRFSegmentModelPath=data/model/segment/CRFSegmentModel.txt
#HMM segmentation model path
HMMSegmentModelPath=data/model/segment/HMMSegmentModel.bin
#Whether segmentation output includes POS tags
ShowTermNature=true
#IO adapter: implement com.hankcs.hanlp.corpus.io.IIOAdapter to run HanLP on other platforms (Hadoop, Redis, ...)
#The default adapter below is based on the ordinary file system.
#IOAdapter=com.hankcs.hanlp.corpus.io.FileIOAdapter
--------------------------------------------------------------------------------