├── NLPCC2016-Ensemble of Feature Sets and Classification Methods for Stance Detection.pdf ├── README.md ├── code ├── ChiSquare │ └── README.md ├── Ensemble_models_matlab │ ├── TaskA_Ensemble_models.m │ ├── main_TaskA_Ensemble_models.m │ ├── normalize.m │ ├── predict.mexw64 │ ├── svmpredict.mexw64 │ ├── svmtrain.mexw64 │ ├── tf_idf.m │ └── train.mexw64 ├── Feature_process_java │ ├── lib │ │ ├── split.jar │ │ ├── tree-split-word-1.1.jar │ │ └── typetrans.jar │ ├── library │ │ ├── stopDict.dic │ │ └── userLibrary │ │ │ └── userLibrary.dic │ └── src │ │ ├── DataPreprocess │ │ ├── Step01_01_TaskA_DataInfoStat.java │ │ ├── Step02_01_TaskA_GenDict.java │ │ ├── Step02_01_TaskA_GenVSM.java │ │ ├── Step04_01_TaskA_ChiSquare.java │ │ ├── Step04_02_TaskA_ChiSquare_Rank.java │ │ ├── Step04_03_TaskA_ChiSquare_VSM.java │ │ ├── Step05_01_TaskA_Opinion.java │ │ └── Step09_01_ResultSubmit.java │ │ └── Tools │ │ ├── CharacterAnalyzer.java │ │ ├── ConvertUnicode.java │ │ ├── PreProcessText.java │ │ ├── StringAnalyzer.java │ │ └── WordSegment_Ansj.java ├── Feature_ranking_matlab │ ├── TaskA_Feature_ranking.m │ ├── main_TaskA_Feature_ranking.m │ ├── normalize.m │ ├── predict.mexw64 │ ├── svmpredict.mexw64 │ ├── svmtrain.mexw64 │ ├── tf_idf.m │ └── train.mexw64 ├── LDA │ ├── README.md │ └── jgibblda.jar ├── LSA_and_LPI_and_LE_matlab │ ├── LE │ │ └── LapEig.m │ ├── LPI │ │ ├── LGE.m │ │ ├── lpp.m │ │ └── mySVD.m │ ├── LSA │ │ └── LSA.m │ ├── Step_03_main_nlpcc2016.m │ ├── Step_03_nlpcc2016.m │ ├── Step_04_main_nlpcc2016.m │ ├── Step_04_nlpcc2016.m │ └── tools │ │ ├── EuDist2.m │ │ ├── constructW.m │ │ ├── normalize.m │ │ └── tf_idf.m └── Para2vec │ ├── README.md │ ├── go_paravec.sh │ └── word2vec.c ├── corpus ├── Hownet │ ├── neg_opinion.txt │ ├── pos_opinion.txt │ └── url.txt └── Tsinghua │ ├── readme.txt │ ├── tsinghua.negative.gb.txt │ ├── tsinghua.positive.gb.txt │ └── url.txt └── data ├── NLPCC2016_Stance_Detection_Task_A_Testdata.txt ├── NLPCC_2016_Stance_Detection_Task_A_Unknown.txt ├── NLPCC_2016_Stance_Detection_Task_A_gold.txt ├── evasampledata4-TaskAA.txt └── evasampledata4-TaskAR.txt /NLPCC2016-Ensemble of Feature Sets and Classification Methods for Stance Detection.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/NLPCC2016-Ensemble of Feature Sets and Classification Methods for Stance Detection.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # README # 2 | 3 | This is our solution for the [NLPCC2016 shared task: Detecting Stance in Chinese Weibo (Task A)](http://tcci.ccf.org.cn/conference/2016/pages/page05_CFPTasks.html). 4 | 5 | @inproceedings{xu2016ensemble, 6 | title={Ensemble of Feature Sets and Classification Methods for Stance Detection}, 7 | author={Xu, Jiaming and Zheng, Suncong and Shi, Jing and Yao, Yiqun and Xu, Bo}, 8 | booktitle={Natural Language Processing and Chinese Computing (NLPCC)}, 9 | year={2016}, 10 | publisher={Springer} 11 | } 12 | 13 | - This is a supervised task over five targets. For each target, 600 labeled Weibo texts, 600 unlabeled Weibo texts and 3,000 test Weibo texts are provided. The task is to detect the author's stance towards the given target: FAVOR, AGAINST or NONE.
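For reference, the preprocessing code (`Step01_01_TaskA_DataInfoStat.java` below) parses each line of the labeled file `evasampledata4-TaskAA.txt` (after a header line) as four tab-separated fields, ID, TARGET, TEXT and STANCE, with STANCE one of FAVOR, AGAINST or NONE. An invented illustrative line (not a real corpus entry):

        ID<TAB>TARGET<TAB>TEXT<TAB>STANCE
        42	春节放鞭炮	过年不放鞭炮,哪里还有年味?	FAVOR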
14 | 15 | We present an ensemble framework that integrates various feature sets, such as Paragraph Vector (Para2vec) [1], Latent Dirichlet Allocation (LDA) [2], Latent Semantic Analysis (LSA) [3], Laplacian Eigenmaps (LE) [4] and Locality Preserving Indexing (LPI) [5], with various classification methods, such as Random Forest (RF) [6], Linear Support Vector Machines (SVM-Linear) [7], SVM with RBF Kernel (SVM-RBF) [8] and AdaBoost [9]. 16 | 17 | The official results show that the solution of our team "CBrain" achieves one 1st place and one 2nd place among the five targets, and ranks 4th overall out of 16 teams with an F1 score of 0.6856. 18 | 19 | **Team members**: [Jiaming Xu](http://jacoxu.com/?page_id=2), Suncong Zheng, Jing Shi, Yiqun Yao, Bo Xu. 20 | 21 | Please feel free to email me (*jacoxu@msn.com*) if you have any questions. 22 | 23 | [1]. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: ICML. vol. 14, pp. 1188-1196 (2014) 24 | [2]. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993-1022 (2003) 25 | [3]. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. JASIS 41(6), 391 (1990) 26 | [4]. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373-1396 (2003) 27 | [5]. He, X., Cai, D., Liu, H., Ma, W.Y.: Locality preserving indexing for document representation. In: SIGIR. pp. 96-103. ACM (2004) 28 | [6]. Breiman, L.: Random forests. Machine Learning 45(1), 5-32 (2001) 29 | [7]. Joachims, T.: Learning to classify text using support vector machines: Methods, theory and algorithms. Kluwer Academic Publishers (2002) 30 | [8]. Scholkopf, B., Smola, A.J.: Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press (2002) 31 | [9]. Freund, Y., Schapire, R.E., et al.: Experiments with a new boosting algorithm. In: ICML. vol. 96, pp. 148-156 (1996) 32 | -------------------------------------------------------------------------------- /code/ChiSquare/README.md: -------------------------------------------------------------------------------- 1 | # Chi-Squared Test 2 | This project provides a text feature selection method based on the chi-squared test.
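The `Step04_*` classes in this repository consume the per-(stance, word) scores such a tool writes to `output_chi2_<topic>.tst` and keep the top-ranked words for each class. As a minimal sketch of the statistic itself, computed from a 2x2 contingency table of document counts (illustrative only, not the linked project's implementation):

    // One-vs-rest chi-squared score of a single word for a single stance class.
    // a = docs of the class containing the word, b = docs of other classes containing it,
    // c = docs of the class lacking the word,    d = docs of other classes lacking it.
    public class Chi2Sketch {
        static double chi2(double a, double b, double c, double d) {
            double n = a + b + c + d;
            double denom = (a + b) * (c + d) * (a + c) * (b + d);
            return denom == 0.0 ? 0.0 : n * Math.pow(a * d - b * c, 2) / denom;
        }
        public static void main(String[] args) {
            // e.g. a word appearing in 30 FAVOR and 5 non-FAVOR texts,
            // absent from 170 FAVOR and 395 non-FAVOR texts
            System.out.println(chi2(30, 5, 170, 395)); // larger = stronger association
        }
    }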
3 | 4 | Please find the project at the following url: 5 | 6 | https://github.com/kn45/Chi-Square 7 | -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/TaskA_Ensemble_models.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/TaskA_Ensemble_models.m -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/main_TaskA_Ensemble_models.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/main_TaskA_Ensemble_models.m -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/normalize.m: -------------------------------------------------------------------------------- 1 | function Xn = normalize(X) 2 | % Normalize all feature vectors to unit length 3 | 4 | n = size(X,1); % the number of documents 5 | Xt = X'; 6 | l = sqrt(sum(Xt.^2)); % the row vector length (L2 norm) 7 | Ni = sparse(1:n,1:n,l); 8 | Ni(Ni>0) = 1./Ni(Ni>0); 9 | Xn = (Xt*Ni)'; 10 | 11 | end 12 | -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/predict.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/predict.mexw64 -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/svmpredict.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/svmpredict.mexw64 -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/svmtrain.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/svmtrain.mexw64 -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/tf_idf.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/tf_idf.m -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/train.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/train.mexw64 -------------------------------------------------------------------------------- /code/Feature_process_java/lib/split.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_process_java/lib/split.jar 
-------------------------------------------------------------------------------- /code/Feature_process_java/lib/tree-split-word-1.1.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_process_java/lib/tree-split-word-1.1.jar -------------------------------------------------------------------------------- /code/Feature_process_java/lib/typetrans.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_process_java/lib/typetrans.jar -------------------------------------------------------------------------------- /code/Feature_process_java/library/stopDict.dic: -------------------------------------------------------------------------------- 1 | 2 | , 3 | ? 4 | 、 5 | 。 6 | “ 7 | ” 8 | 《 9 | 》 10 | ! 11 | , 12 | : 13 | ; 14 | ? 15 | 】 16 | 【 17 | 人民 18 | 啊 19 | 阿 20 | 哎 21 | 哎呀 22 | 哎哟 23 | 唉 24 | 俺 25 | 俺们 26 | 按 27 | 按照 28 | 吧 29 | 吧哒 30 | 把 31 | 罢了 32 | 被 33 | 本 34 | 本着 35 | 比 36 | 比方 37 | 比如 38 | 鄙人 39 | 彼 40 | 彼此 41 | 边 42 | 别 43 | 别的 44 | 别说 45 | 并 46 | 并且 47 | 不比 48 | 不成 49 | 不单 50 | 不但 51 | 不独 52 | 不管 53 | 不光 54 | 不过 55 | 不仅 56 | 不拘 57 | 不论 58 | 不怕 59 | 不然 60 | 不如 61 | 不特 62 | 不惟 63 | 不问 64 | 不只 65 | 朝 66 | 朝着 67 | 趁 68 | 趁着 69 | 乘 70 | 冲 71 | 除 72 | 除此之外 73 | 除非 74 | 除了 75 | 此 76 | 此间 77 | 此外 78 | 从 79 | 从而 80 | 打 81 | 待 82 | 但 83 | 但是 84 | 当 85 | 当着 86 | 到 87 | 得 88 | 的 89 | 的话 90 | 等 91 | 等等 92 | 地 93 | 第 94 | 叮咚 95 | 对 96 | 对于 97 | 多 98 | 多少 99 | 而 100 | 而况 101 | 而且 102 | 而是 103 | 而外 104 | 而言 105 | 而已 106 | 尔后 107 | 反过来 108 | 反过来说 109 | 反之 110 | 非但 111 | 非徒 112 | 否则 113 | 嘎 114 | 嘎登 115 | 该 116 | 赶 117 | 个 118 | 各 119 | 各个 120 | 各位 121 | 各种 122 | 各自 123 | 给 124 | 根据 125 | 跟 126 | 故 127 | 故此 128 | 固然 129 | 关于 130 | 管 131 | 归 132 | 果然 133 | 果真 134 | 过 135 | 哈 136 | 哈哈 137 | 呵 138 | 和 139 | 何 140 | 何处 141 | 何况 142 | 何时 143 | 嘿 144 | 哼 145 | 哼唷 146 | 呼哧 147 | 乎 148 | 哗 149 | 还是 150 | 还有 151 | 换句话说 152 | 换言之 153 | 或 154 | 或是 155 | 或者 156 | 极了 157 | 及 158 | 及其 159 | 及至 160 | 即 161 | 即便 162 | 即或 163 | 即令 164 | 即若 165 | 即使 166 | 几 167 | 几时 168 | 己 169 | 既 170 | 既然 171 | 既是 172 | 继而 173 | 加之 174 | 假如 175 | 假若 176 | 假使 177 | 鉴于 178 | 将 179 | 较 180 | 较之 181 | 叫 182 | 接着 183 | 结果 184 | 借 185 | 紧接着 186 | 进而 187 | 尽 188 | 尽管 189 | 经 190 | 经过 191 | 就 192 | 就是 193 | 就是说 194 | 据 195 | 具体地说 196 | 具体说来 197 | 开始 198 | 开外 199 | 靠 200 | 咳 201 | 可 202 | 可见 203 | 可是 204 | 可以 205 | 况且 206 | 啦 207 | 来 208 | 来着 209 | 离 210 | 例如 211 | 哩 212 | 连 213 | 连同 214 | 两者 215 | 了 216 | 临 217 | 另 218 | 另外 219 | 另一方面 220 | 论 221 | 嘛 222 | 吗 223 | 慢说 224 | 漫说 225 | 冒 226 | 么 227 | 每 228 | 每当 229 | 们 230 | 莫若 231 | 某 232 | 某个 233 | 某些 234 | 拿 235 | 哪 236 | 哪边 237 | 哪儿 238 | 哪个 239 | 哪里 240 | 哪年 241 | 哪怕 242 | 哪天 243 | 哪些 244 | 哪样 245 | 那 246 | 那边 247 | 那儿 248 | 那个 249 | 那会儿 250 | 那里 251 | 那么 252 | 那么些 253 | 那么样 254 | 那时 255 | 那些 256 | 那样 257 | 乃 258 | 乃至 259 | 呢 260 | 能 261 | 你 262 | 你们 263 | 您 264 | 宁 265 | 宁可 266 | 宁肯 267 | 宁愿 268 | 哦 269 | 呕 270 | 啪达 271 | 旁人 272 | 呸 273 | 凭 274 | 凭借 275 | 其 276 | 其次 277 | 其二 278 | 其他 279 | 其它 280 | 其一 281 | 其余 282 | 其中 283 | 起 284 | 起见 285 | 岂但 286 | 恰恰相反 287 | 前后 288 | 前者 289 | 且 290 | 然而 291 | 然后 292 | 然则 293 | 让 294 | 人家 295 | 任 296 | 任何 297 | 任凭 298 | 如 299 | 如此 300 | 如果 301 | 如何 302 | 如其 303 | 如若 304 | 如上所述 305 | 若 306 | 若非 307 | 若是 308 | 啥 309 | 上下 310 | 尚且 311 | 设若 312 | 设使 313 
| 甚而 314 | 甚么 315 | 甚至 316 | 省得 317 | 时候 318 | 什么 319 | 什么样 320 | 使得 321 | 是 322 | 是的 323 | 首先 324 | 谁 325 | 谁知 326 | 顺 327 | 顺着 328 | 似的 329 | 虽 330 | 虽然 331 | 虽说 332 | 虽则 333 | 随 334 | 随着 335 | 所 336 | 所以 337 | 他 338 | 他们 339 | 他人 340 | 它 341 | 它们 342 | 她 343 | 她们 344 | 倘 345 | 倘或 346 | 倘然 347 | 倘若 348 | 倘使 349 | 腾 350 | 替 351 | 通过 352 | 同 353 | 同时 354 | 哇 355 | 万一 356 | 往 357 | 望 358 | 为 359 | 为何 360 | 为了 361 | 为什么 362 | 为着 363 | 喂 364 | 嗡嗡 365 | 我 366 | 我们 367 | 呜 368 | 呜呼 369 | 乌乎 370 | 无论 371 | 无宁 372 | 毋宁 373 | 嘻 374 | 吓 375 | 相对而言 376 | 像 377 | 向 378 | 向着 379 | 嘘 380 | 呀 381 | 焉 382 | 沿 383 | 沿着 384 | 要 385 | 要不 386 | 要不然 387 | 要不是 388 | 要么 389 | 要是 390 | 也 391 | 也罢 392 | 也好 393 | 一 394 | 一般 395 | 一旦 396 | 一方面 397 | 一来 398 | 一切 399 | 一样 400 | 一则 401 | 一个 402 | 依 403 | 依照 404 | 矣 405 | 以 406 | 以便 407 | 以及 408 | 以免 409 | 以至 410 | 以至于 411 | 以致 412 | 抑或 413 | 因 414 | 因此 415 | 因而 416 | 因为 417 | 哟 418 | 用 419 | 由 420 | 由此可见 421 | 由于 422 | 有 423 | 有的 424 | 有关 425 | 有些 426 | 又 427 | 于 428 | 于是 429 | 于是乎 430 | 与 431 | 与此同时 432 | 与否 433 | 与其 434 | 越是 435 | 云云 436 | 哉 437 | 再说 438 | 再者 439 | 在 440 | 在下 441 | 咱 442 | 咱们 443 | 则 444 | 怎 445 | 怎么 446 | 怎么办 447 | 怎么样 448 | 怎样 449 | 咋 450 | 照 451 | 照着 452 | 者 453 | 这 454 | 这边 455 | 这儿 456 | 这个 457 | 这会儿 458 | 这就是说 459 | 这里 460 | 这么 461 | 这么点儿 462 | 这么些 463 | 这么样 464 | 这时 465 | 这些 466 | 这样 467 | 正如 468 | 吱 469 | 之 470 | 之类 471 | 之所以 472 | 之一 473 | 只是 474 | 只限 475 | 只要 476 | 只有 477 | 至 478 | 至于 479 | 诸位 480 | 着 481 | 着呢 482 | 自 483 | 自从 484 | 自个儿 485 | 自各儿 486 | 自己 487 | 自家 488 | 自身 489 | 综上所述 490 | 总的来看 491 | 总的来说 492 | 总的说来 493 | 总而言之 494 | 总之 495 | 纵 496 | 纵令 497 | 纵然 498 | 纵使 499 | 遵照 500 | 作为 501 | 兮 502 | 呃 503 | 呗 504 | 咚 505 | 咦 506 | 喏 507 | 啐 508 | 喔唷 509 | 嗬 510 | 嗯 511 | 嗳 512 | ~ 513 | ! 514 | . 515 | : 516 | " 517 | ' 518 | ( 519 | ) 520 | * 521 | A 522 | 白 523 | 社会主义 524 | -- 525 | .. 526 | >> 527 | [ 528 | ] 529 | 530 | < 531 | > 532 | / 533 | \ 534 | | 535 | - 536 | _ 537 | + 538 | = 539 | & 540 | ^ 541 | % 542 | # 543 | @ 544 | ` 545 | ; 546 | $ 547 | ( 548 | ) 549 | —— 550 | — 551 | ¥ 552 | · 553 | ... 
554 | ‘ 555 | ’ 556 | 〉 557 | 〈 558 | … 559 |   560 | 0 561 | 1 562 | 2 563 | 3 564 | 4 565 | 5 566 | 6 567 | 7 568 | 8 569 | 9 570 | 0 571 | 1 572 | 2 573 | 3 574 | 4 575 | 5 576 | 6 577 | 7 578 | 8 579 | 9 580 | 二 581 | 三 582 | 四 583 | 五 584 | 六 585 | 七 586 | 八 587 | 九 588 | 零 589 | > 590 | < 591 | @ 592 | # 593 | $ 594 | % 595 | ︿ 596 | & 597 | * 598 | + 599 | ~ 600 | | 601 | [ 602 | ] 603 | { 604 | } 605 | 啊哈 606 | 啊呀 607 | 啊哟 608 | 挨次 609 | 挨个 610 | 挨家挨户 611 | 挨门挨户 612 | 挨门逐户 613 | 挨着 614 | 按理 615 | 按期 616 | 按时 617 | 按说 618 | 暗地里 619 | 暗中 620 | 暗自 621 | 昂然 622 | 八成 623 | 白白 624 | 半 625 | 梆 626 | 保管 627 | 保险 628 | 饱 629 | 背地里 630 | 背靠背 631 | 倍感 632 | 倍加 633 | 本人 634 | 本身 635 | 甭 636 | 比起 637 | 比如说 638 | 比照 639 | 毕竟 640 | 必 641 | 必定 642 | 必将 643 | 必须 644 | 便 645 | 别人 646 | 并非 647 | 并肩 648 | 并没 649 | 并没有 650 | 并排 651 | 并无 652 | 勃然 653 | 不 654 | 不必 655 | 不常 656 | 不大 657 | 不但...而且 658 | 不得 659 | 不得不 660 | 不得了 661 | 不得已 662 | 不迭 663 | 不定 664 | 不对 665 | 不妨 666 | 不管怎样 667 | 不会 668 | 不仅...而且 669 | 不仅仅 670 | 不仅仅是 671 | 不经意 672 | 不可开交 673 | 不可抗拒 674 | 不力 675 | 不了 676 | 不料 677 | 不满 678 | 不免 679 | 不能不 680 | 不起 681 | 不巧 682 | 不然的话 683 | 不日 684 | 不少 685 | 不胜 686 | 不时 687 | 不是 688 | 不同 689 | 不能 690 | 不要 691 | 不外 692 | 不外乎 693 | 不下 694 | 不限 695 | 不消 696 | 不已 697 | 不亦乐乎 698 | 不由得 699 | 不再 700 | 不择手段 701 | 不怎么 702 | 不曾 703 | 不知不觉 704 | 不止 705 | 不止一次 706 | 不至于 707 | 才 708 | 才能 709 | 策略地 710 | 差不多 711 | 差一点 712 | 常 713 | 常常 714 | 常言道 715 | 常言说 716 | 常言说得好 717 | 长此下去 718 | 长话短说 719 | 长期以来 720 | 长线 721 | 敞开儿 722 | 彻夜 723 | 陈年 724 | 趁便 725 | 趁机 726 | 趁热 727 | 趁势 728 | 趁早 729 | 成年 730 | 成年累月 731 | 成心 732 | 乘机 733 | 乘胜 734 | 乘势 735 | 乘隙 736 | 乘虚 737 | 诚然 738 | 迟早 739 | 充分 740 | 充其极 741 | 充其量 742 | 抽冷子 743 | 臭 744 | 初 745 | 出 746 | 出来 747 | 出去 748 | 除此 749 | 除此而外 750 | 除此以外 751 | 除开 752 | 除去 753 | 除却 754 | 除外 755 | 处处 756 | 川流不息 757 | 传 758 | 传说 759 | 传闻 760 | 串行 761 | 纯 762 | 纯粹 763 | 此后 764 | 此中 765 | 次第 766 | 匆匆 767 | 从不 768 | 从此 769 | 从此以后 770 | 从古到今 771 | 从古至今 772 | 从今以后 773 | 从宽 774 | 从来 775 | 从轻 776 | 从速 777 | 从头 778 | 从未 779 | 从无到有 780 | 从小 781 | 从新 782 | 从严 783 | 从优 784 | 从早到晚 785 | 从中 786 | 从重 787 | 凑巧 788 | 粗 789 | 存心 790 | 达旦 791 | 打从 792 | 打开天窗说亮话 793 | 大 794 | 大不了 795 | 大大 796 | 大抵 797 | 大都 798 | 大多 799 | 大凡 800 | 大概 801 | 大家 802 | 大举 803 | 大略 804 | 大面儿上 805 | 大事 806 | 大体 807 | 大体上 808 | 大约 809 | 大张旗鼓 810 | 大致 811 | 呆呆地 812 | 带 813 | 殆 814 | 待到 815 | 单 816 | 单纯 817 | 单单 818 | 但愿 819 | 弹指之间 820 | 当场 821 | 当儿 822 | 当即 823 | 当口儿 824 | 当然 825 | 当庭 826 | 当头 827 | 当下 828 | 当真 829 | 当中 830 | 倒不如 831 | 倒不如说 832 | 倒是 833 | 到处 834 | 到底 835 | 到了儿 836 | 到目前为止 837 | 到头 838 | 到头来 839 | 得起 840 | 得天独厚 841 | 的确 842 | 等到 843 | 叮当 844 | 顶多 845 | 定 846 | 动不动 847 | 动辄 848 | 陡然 849 | 都 850 | 独 851 | 独自 852 | 断然 853 | 顿时 854 | 多次 855 | 多多 856 | 多多少少 857 | 多多益善 858 | 多亏 859 | 多年来 860 | 多年前 861 | 而后 862 | 而论 863 | 而又 864 | 尔等 865 | 二话不说 866 | 二话没说 867 | 反倒 868 | 反倒是 869 | 反而 870 | 反手 871 | 反之亦然 872 | 反之则 873 | 方 874 | 方才 875 | 方能 876 | 放量 877 | 非常 878 | 非得 879 | 分期 880 | 分期分批 881 | 分头 882 | 奋勇 883 | 愤然 884 | 风雨无阻 885 | 逢 886 | 弗 887 | 甫 888 | 嘎嘎 889 | 该当 890 | 概 891 | 赶快 892 | 赶早不赶晚 893 | 敢 894 | 敢情 895 | 敢于 896 | 刚 897 | 刚才 898 | 刚好 899 | 刚巧 900 | 高低 901 | 格外 902 | 隔日 903 | 隔夜 904 | 个人 905 | 各式 906 | 更 907 | 更加 908 | 更进一步 909 | 更为 910 | 公然 911 | 共 912 | 共总 913 | 够瞧的 914 | 姑且 915 | 古来 916 | 故而 917 | 故意 918 | 固 919 | 怪 920 | 怪不得 921 | 惯常 922 | 光 923 | 光是 924 | 归根到底 925 | 归根结底 926 | 过于 927 | 毫不 928 | 毫无 929 | 毫无保留地 930 | 毫无例外 931 | 好在 932 | 何必 933 | 何尝 934 | 何妨 935 | 何苦 936 | 何乐而不为 937 | 何须 938 | 何止 939 | 很 940 | 很多 941 | 
很少 942 | 轰然 943 | 后来 944 | 呼啦 945 | 忽地 946 | 忽然 947 | 互 948 | 互相 949 | 哗啦 950 | 话说 951 | 还 952 | 恍然 953 | 会 954 | 豁然 955 | 活 956 | 伙同 957 | 或多或少 958 | 或许 959 | 基本 960 | 基本上 961 | 基于 962 | 极 963 | 极大 964 | 极度 965 | 极端 966 | 极力 967 | 极其 968 | 极为 969 | 急匆匆 970 | 即将 971 | 即刻 972 | 即是说 973 | 几度 974 | 几番 975 | 几乎 976 | 几经 977 | 既...又 978 | 继之 979 | 加上 980 | 加以 981 | 间或 982 | 简而言之 983 | 简言之 984 | 简直 985 | 见 986 | 将才 987 | 将近 988 | 将要 989 | 交口 990 | 较比 991 | 较为 992 | 接连不断 993 | 接下来 994 | 皆可 995 | 截然 996 | 截至 997 | 藉以 998 | 借此 999 | 借以 1000 | 届时 1001 | 仅 1002 | 仅仅 1003 | 谨 1004 | 进来 1005 | 进去 1006 | 近 1007 | 近几年来 1008 | 近来 1009 | 近年来 1010 | 尽管如此 1011 | 尽可能 1012 | 尽快 1013 | 尽量 1014 | 尽然 1015 | 尽如人意 1016 | 尽心竭力 1017 | 尽心尽力 1018 | 尽早 1019 | 精光 1020 | 经常 1021 | 竟 1022 | 竟然 1023 | 究竟 1024 | 就此 1025 | 就地 1026 | 就算 1027 | 居然 1028 | 局外 1029 | 举凡 1030 | 据称 1031 | 据此 1032 | 据实 1033 | 据说 1034 | 据我所知 1035 | 据悉 1036 | 具体来说 1037 | 决不 1038 | 决非 1039 | 绝 1040 | 绝不 1041 | 绝顶 1042 | 绝对 1043 | 绝非 1044 | 均 1045 | 喀 1046 | 看 1047 | 看来 1048 | 看起来 1049 | 看上去 1050 | 看样子 1051 | 可好 1052 | 可能 1053 | 恐怕 1054 | 快 1055 | 快要 1056 | 来不及 1057 | 来得及 1058 | 来讲 1059 | 来看 1060 | 拦腰 1061 | 牢牢 1062 | 老 1063 | 老大 1064 | 老老实实 1065 | 老是 1066 | 累次 1067 | 累年 1068 | 理当 1069 | 理该 1070 | 理应 1071 | 历 1072 | 立 1073 | 立地 1074 | 立刻 1075 | 立马 1076 | 立时 1077 | 联袂 1078 | 连连 1079 | 连日 1080 | 连日来 1081 | 连声 1082 | 连袂 1083 | 临到 1084 | 另方面 1085 | 另行 1086 | 另一个 1087 | 路经 1088 | 屡 1089 | 屡次 1090 | 屡次三番 1091 | 屡屡 1092 | 缕缕 1093 | 率尔 1094 | 率然 1095 | 略 1096 | 略加 1097 | 略微 1098 | 略为 1099 | 论说 1100 | 马上 1101 | 蛮 1102 | 满 1103 | 没 1104 | 没有 1105 | 每逢 1106 | 每每 1107 | 每时每刻 1108 | 猛然 1109 | 猛然间 1110 | 莫 1111 | 莫不 1112 | 莫非 1113 | 莫如 1114 | 默默地 1115 | 默然 1116 | 呐 1117 | 那末 1118 | 奈 1119 | 难道 1120 | 难得 1121 | 难怪 1122 | 难说 1123 | 内 1124 | 年复一年 1125 | 凝神 1126 | 偶而 1127 | 偶尔 1128 | 怕 1129 | 砰 1130 | 碰巧 1131 | 譬如 1132 | 偏偏 1133 | 乒 1134 | 平素 1135 | 颇 1136 | 迫于 1137 | 扑通 1138 | 其后 1139 | 其实 1140 | 奇 1141 | 齐 1142 | 起初 1143 | 起来 1144 | 起首 1145 | 起头 1146 | 起先 1147 | 岂 1148 | 岂非 1149 | 岂止 1150 | 迄 1151 | 恰逢 1152 | 恰好 1153 | 恰恰 1154 | 恰巧 1155 | 恰如 1156 | 恰似 1157 | 千 1158 | 千万 1159 | 千万千万 1160 | 切 1161 | 切不可 1162 | 切莫 1163 | 切切 1164 | 切勿 1165 | 窃 1166 | 亲口 1167 | 亲身 1168 | 亲手 1169 | 亲眼 1170 | 亲自 1171 | 顷 1172 | 顷刻 1173 | 顷刻间 1174 | 顷刻之间 1175 | 请勿 1176 | 穷年累月 1177 | 取道 1178 | 去 1179 | 权时 1180 | 全都 1181 | 全力 1182 | 全年 1183 | 全然 1184 | 全身心 1185 | 然 1186 | 人人 1187 | 仍 1188 | 仍旧 1189 | 仍然 1190 | 日复一日 1191 | 日见 1192 | 日渐 1193 | 日益 1194 | 日臻 1195 | 如常 1196 | 如此等等 1197 | 如次 1198 | 如今 1199 | 如期 1200 | 如前所述 1201 | 如上 1202 | 如下 1203 | 汝 1204 | 三番两次 1205 | 三番五次 1206 | 三天两头 1207 | 瑟瑟 1208 | 沙沙 1209 | 上 1210 | 上来 1211 | 上去 1212 | 标签 1213 | 转发 1214 | 关注 1215 | 分享 1216 | 回复 1217 | 地址 1218 | -------------------------------------------------------------------------------- /code/Feature_process_java/library/userLibrary/userLibrary.dic: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_process_java/library/userLibrary/userLibrary.dic -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step01_01_TaskA_DataInfoStat.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileInputStream; 7 | import 
java.io.FileOutputStream; 8 | import java.io.InputStreamReader; 9 | import java.io.OutputStreamWriter; 10 | import java.util.HashMap; 11 | import Tools.PreProcessText; 12 | 13 | public class Step01_01_TaskA_DataInfoStat { 14 | /** 15 | * @param args 16 | * @author jacoxu.com-2016/07/02 17 | */ 18 | //rawTrainData, dataRefine, tagInfo 19 | private static void dataAnalysis(String rawTrain_labeled, String rawTrain_unlabeled, String rawTest_unlabeled 20 | , String train_labeled_tag_pre, String train_labeled_text_pre, String train_unlabeled_text_pre 21 | , String test_unlabeled_text_pre, HashMap topic2filesuffix 22 | , boolean hasTestData) { 23 | // 24 | HashMap topicCountInfoMap = new HashMap(); 25 | HashMap tagCountInfoMap = new HashMap(); 26 | 27 | try { 28 | String encoding = "UTF-8"; 29 | File rawTrainLabelledFile = new File(rawTrain_labeled); 30 | File rawTrainUnlabeledFile = new File(rawTrain_unlabeled); 31 | File rawTestUnlabeledFile = new File(rawTest_unlabeled); 32 | if (rawTrainLabelledFile.exists()) { 33 | BufferedReader rawTrainLabelledReader = new BufferedReader(new InputStreamReader( 34 | new FileInputStream(rawTrainLabelledFile), encoding)); 35 | BufferedReader rawTrainUnlabeledReader = new BufferedReader(new InputStreamReader( 36 | new FileInputStream(rawTrainUnlabeledFile), encoding)); 37 | 38 | //开始读取训练数据 39 | System.out.println("Start to process train data info..."); 40 | String tmpId; 41 | String tmpTopic; 42 | String tmpText; 43 | String tmpTag; 44 | 45 | int lineNum=0; 46 | int dropNum = 0; 47 | String tmpLineStr = null; 48 | while ((tmpLineStr = rawTrainLabelledReader.readLine()) != null) { 49 | lineNum++; 50 | //第一行是Title,所以跳过去 51 | if (lineNum<=1) continue; 52 | String[] trainSentInfo = tmpLineStr.split("\t"); 53 | if (trainSentInfo.length==3) { 54 | System.err.println("Error: rawTrainLabelledReader.length is "+ trainSentInfo.length 55 | + " in lineNum" + lineNum + ", we drop it."); 56 | dropNum++; 57 | continue; 58 | } 59 | if (trainSentInfo.length!=4) { 60 | System.err.println("Error: rawTrainLabelledReader.length is "+ trainSentInfo.length 61 | + " in lineNum" + lineNum); 62 | System.exit(0); 63 | } 64 | tmpId = trainSentInfo[0].trim(); 65 | tmpTopic = trainSentInfo[1].trim(); 66 | tmpText = trainSentInfo[2].trim(); 67 | tmpTag = trainSentInfo[3].trim(); 68 | 69 | //如果是初次遇到此标签 70 | int tmpTopicCount = 0; 71 | int tmpTagCount = 0; 72 | 73 | if (topicCountInfoMap.containsKey(tmpTopic)) { 74 | tmpTopicCount = topicCountInfoMap.get(tmpTopic); 75 | }else { 76 | tmpTopicCount = 0; 77 | } 78 | tmpTopicCount++; 79 | topicCountInfoMap.put(tmpTopic, tmpTopicCount); 80 | 81 | if (tagCountInfoMap.containsKey(tmpTopic+"_"+tmpTag)) { 82 | tmpTagCount = tagCountInfoMap.get(tmpTopic+"_"+tmpTag); 83 | }else { 84 | tmpTagCount = 0; 85 | } 86 | tmpTagCount++; 87 | tagCountInfoMap.put(tmpTopic+"_"+tmpTag, tmpTagCount); 88 | 89 | //对文本进行预处理 90 | tmpText = PreProcessText.preProcess4NLPCC2016(tmpText, tmpTopic); 91 | int tag_flag = -1; 92 | if (tmpTag.equals("FAVOR")) 93 | tag_flag = 1; 94 | else if (tmpTag.equals("AGAINST")) 95 | tag_flag = 2; 96 | else if (tmpTag.equals("NONE")) 97 | tag_flag = 0; 98 | else { 99 | System.out.println("Error tag:" + tmpTag); 100 | } 101 | Result2Txt(train_labeled_tag_pre+"_"+topic2filesuffix.get(tmpTopic) 102 | , String.valueOf(tag_flag)); 103 | Result2Txt(train_labeled_text_pre+"_"+topic2filesuffix.get(tmpTopic) 104 | , tmpText); 105 | if (lineNum%1000 ==0) { 106 | System.out.println("hasProcessed train data numbers:" + lineNum); 107 | } 108 | } 109 | 
System.out.println("topicCountInfoMap:"+topicCountInfoMap.toString()); 110 | System.out.println("tagCountInfoMap:"+tagCountInfoMap.toString()); 111 | System.out.println("Totally processed train data numbers:" + lineNum); 112 | System.out.println("rawTrainLabelledFile dropNum is:" + dropNum); 113 | rawTrainLabelledReader.close(); 114 | 115 | dropNum = 0; 116 | lineNum=0; 117 | topicCountInfoMap.clear(); 118 | tagCountInfoMap.clear(); 119 | while ((tmpLineStr = rawTrainUnlabeledReader.readLine()) != null) { 120 | lineNum++; 121 | //第一行是Title,所以跳过去 122 | if (lineNum<=1) continue; 123 | String[] trainSentInfo = tmpLineStr.split("\t"); 124 | if (trainSentInfo.length == 2){ 125 | System.err.println("Error: rawTrainUnlabeledFile.length is "+ trainSentInfo.length 126 | + " in lineNum" + lineNum + ", we drop it."); 127 | dropNum++; 128 | continue; 129 | } 130 | 131 | if (trainSentInfo.length!=3) { 132 | System.err.println("Error: rawTrainUnlabeledFile.length is "+ trainSentInfo.length 133 | + " in lineNum" + lineNum); 134 | System.exit(0); 135 | } 136 | tmpId = trainSentInfo[0].trim(); 137 | tmpTopic = trainSentInfo[1].trim(); 138 | tmpText = trainSentInfo[2].trim(); 139 | 140 | //如果是初次遇到此主题 141 | int tmpTopicCount = 0; 142 | 143 | if (topicCountInfoMap.containsKey(tmpTopic)) { 144 | tmpTopicCount = topicCountInfoMap.get(tmpTopic); 145 | }else { 146 | tmpTopicCount = 0; 147 | } 148 | tmpTopicCount++; 149 | topicCountInfoMap.put(tmpTopic, tmpTopicCount); 150 | 151 | //对文本进行预处理 152 | tmpText = PreProcessText.preProcess4NLPCC2016(tmpText, tmpTopic); 153 | Result2Txt(train_unlabeled_text_pre+"_"+topic2filesuffix.get(tmpTopic) 154 | , tmpText); 155 | if (lineNum%1000 ==0) { 156 | System.out.println("hasProcessed unlabeled train data numbers:" + lineNum); 157 | } 158 | } 159 | System.out.println("topicCountInfoMap:"+topicCountInfoMap.toString()); 160 | System.out.println("Totally processed unlabeled train data numbers:" + lineNum); 161 | System.out.println("rawTrainUnlabeledFile dropNum is:" + dropNum); 162 | rawTrainUnlabeledReader.close(); 163 | 164 | 165 | if (hasTestData) { 166 | BufferedReader rawTestUnlabeledReader = new BufferedReader(new InputStreamReader( 167 | new FileInputStream(rawTestUnlabeledFile), encoding)); 168 | //开始读取测试数据 169 | lineNum=0; 170 | topicCountInfoMap.clear(); 171 | tagCountInfoMap.clear(); 172 | while ((tmpLineStr = rawTestUnlabeledReader.readLine()) != null) { 173 | lineNum++; 174 | //第一行是Title,所以跳过去 175 | if (lineNum<=1) continue; 176 | String[] testSentInfo = tmpLineStr.split("\t"); 177 | if (testSentInfo.length!=4) { 178 | // System.err.println("【Error】: rawTestUnlabeledFile.length is "+ testSentInfo.length 179 | // + " in lineNum" + lineNum); 180 | System.err.println(testSentInfo[0].trim() + "\t" + testSentInfo[4].trim()); 181 | // System.exit(0); 182 | } 183 | tmpId = testSentInfo[0].trim(); 184 | tmpTopic = testSentInfo[1].trim(); 185 | tmpText = testSentInfo[2].trim(); 186 | 187 | //如果是初次遇到此主题 188 | int tmpTopicCount = 0; 189 | 190 | if (topicCountInfoMap.containsKey(tmpTopic)) { 191 | tmpTopicCount = topicCountInfoMap.get(tmpTopic); 192 | }else { 193 | tmpTopicCount = 0; 194 | } 195 | tmpTopicCount++; 196 | topicCountInfoMap.put(tmpTopic, tmpTopicCount); 197 | 198 | //对文本进行预处理 199 | tmpText = PreProcessText.preProcess4NLPCC2016(tmpText, tmpTopic); 200 | Result2Txt(test_unlabeled_text_pre+"_"+topic2filesuffix.get(tmpTopic) 201 | , tmpText); 202 | if (lineNum%1000 ==0) { 203 | System.out.println("hasProcessed unlabeled test data numbers:" + lineNum); 204 | } 205 | } 
206 | System.out.println("topicCountInfoMap:"+topicCountInfoMap.toString()); 207 | System.out.println("Totally processed unlabeled test data numbers:" + lineNum); 208 | rawTestUnlabeledReader.close(); 209 | } 210 | } else { 211 | System.out.println("can't find the file"); 212 | } 213 | } catch (Exception e) { 214 | System.out.println("something error when reading the content of the file"); 215 | e.printStackTrace(); 216 | } 217 | return; 218 | 219 | } 220 | 221 | public static void Result2Txt(String file, String txt) { 222 | try { 223 | BufferedWriter os = new BufferedWriter(new OutputStreamWriter( 224 | new FileOutputStream(new File(file),true), "UTF-8")); 225 | os.write(txt + "\n"); 226 | os.close(); 227 | } catch (Exception e) { 228 | e.printStackTrace(); 229 | } 230 | } 231 | 232 | public static void main(String[] args) { 233 | //***********测试区域************ 234 | System.out.println("test"); 235 | //***********测试区域************ 236 | 237 | String dataPathStr="./../../data/"; 238 | String resultsPathStr="./../../data/RefineData/Step01_topics/"; 239 | File resultsPathFile = new File(resultsPathStr); 240 | if (!resultsPathFile.exists()) resultsPathFile.mkdir(); 241 | 242 | //分析训练/测试数据集 243 | String rawTrain_labeled = dataPathStr+"evasampledata4-TaskAA.txt"; 244 | String rawTrain_unlabeled = dataPathStr+"evasampledata4-TaskAR.txt"; 245 | 246 | String rawTest_unlabeled = dataPathStr+"NLPCC2016_Stance_Detection_Task_A_Testdata.txt"; 247 | //按5个主题 拆分 标签和文本 到文件中 248 | String train_labeled_tag_pre = resultsPathStr+"TaskA_train_labeled_tag"; 249 | String train_labeled_text_pre = resultsPathStr+"TaskA_train_labeled_text"; 250 | String train_unlabeled_text_pre = resultsPathStr+"TaskA_train_unlabeled_text"; 251 | String test_unlabeled_text_pre = resultsPathStr+"TaskA_test_unlabeled_text"; 252 | //主题到文件后缀的映射 253 | HashMap topic2filesuffix = new HashMap(); 254 | topic2filesuffix.put("IphoneSE", "iphonese"); 255 | topic2filesuffix.put("iPhone SE", "iphonese"); 256 | topic2filesuffix.put("春节放鞭炮", "bianpao"); 257 | topic2filesuffix.put("俄罗斯在叙利亚的反恐行动", "fankong"); 258 | topic2filesuffix.put("俄罗斯叙利亚反恐行动", "fankong"); 259 | topic2filesuffix.put("开放二胎", "ertai"); 260 | topic2filesuffix.put("深圳禁摩限电", "jinmo"); 261 | 262 | boolean hasTestData = true; 263 | 264 | long readstart=System.currentTimeMillis(); 265 | dataAnalysis(rawTrain_labeled, rawTrain_unlabeled, rawTest_unlabeled 266 | , train_labeled_tag_pre, train_labeled_text_pre, train_unlabeled_text_pre 267 | , test_unlabeled_text_pre, topic2filesuffix, hasTestData); 268 | long readend=System.currentTimeMillis(); 269 | System.out.println((readend-readstart)/1000.0+"s had been consumed to process the raw train/test data"); 270 | } 271 | } 272 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step02_01_TaskA_GenDict.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileOutputStream; 7 | import java.io.FileReader; 8 | import java.io.OutputStreamWriter; 9 | import java.util.ArrayList; 10 | import java.util.HashMap; 11 | import java.util.Iterator; 12 | import java.util.Map; 13 | 14 | public class Step02_01_TaskA_GenDict { 15 | public static void main(String[] args) throws Exception { 16 | 17 | //读取训练数据文件 18 | String dataPathStr="./../../data/RefineData/Step01_topics/"; 19 | String 
resultsPathStr="./../../data/RefineData/Step02_vsm/"; 20 | File resultsPathFile = new File(resultsPathStr); 21 | if (!resultsPathFile.exists()) resultsPathFile.mkdir(); 22 | 23 | String train_labeled_text_pre = dataPathStr+"TaskA_train_labeled_text"; 24 | String train_unlabeled_text_pre = dataPathStr+"TaskA_train_unlabeled_text"; 25 | String test_unlabeled_text_pre = dataPathStr+"TaskA_test_unlabeled_text"; 26 | 27 | //主题到文件后缀的映射 28 | ArrayList topic2filesuffix = new ArrayList(); 29 | topic2filesuffix.add("iphonese"); 30 | topic2filesuffix.add("bianpao"); 31 | topic2filesuffix.add("fankong"); 32 | topic2filesuffix.add("ertai"); 33 | topic2filesuffix.add("jinmo"); 34 | 35 | boolean hasTestData = true; 36 | 37 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 38 | String usrWordDict = resultsPathStr + "usrWordDict_" + topic2filesuffix.get(topic_idx); 39 | BufferedWriter wordDictFileW = new BufferedWriter(new OutputStreamWriter( 40 | new FileOutputStream(new File(usrWordDict)), "UTF-8")); 41 | 42 | String train_labeled_text = 43 | train_labeled_text_pre+"_"+topic2filesuffix.get(topic_idx); 44 | FileReader train_labeled_text_fr = new FileReader(train_labeled_text); 45 | BufferedReader train_labeled_text_br = new BufferedReader(train_labeled_text_fr); 46 | 47 | String train_unlabeled_text = 48 | train_unlabeled_text_pre+"_"+topic2filesuffix.get(topic_idx); 49 | FileReader train_unlabeled_text_fr = new FileReader(train_unlabeled_text); 50 | BufferedReader train_unlabeled_text_br = new BufferedReader(train_unlabeled_text_fr); 51 | 52 | //先读入处理好的训练语料 53 | int wordReadNum = 0; 54 | int wordWriteNum = 0; 55 | 56 | HashMap wordMap = new HashMap(); 57 | System.out.println("Start to read wordSet ..."); 58 | //train_labeled_text 59 | int lineNum = 0; 60 | String tempLine; 61 | while ((tempLine = train_labeled_text_br.readLine()) != null) { 62 | lineNum++; 63 | String[] wordArraysStr = tempLine.trim().split("\\s+"); 64 | for (int i = 0; i < wordArraysStr.length; i++) { 65 | String tmpWord = wordArraysStr[i].trim(); 66 | if (tmpWord.length()<1) continue; 67 | if (wordMap.containsKey(tmpWord)) { 68 | wordMap.put(tmpWord, wordMap.get(tmpWord)+1); 69 | }else { 70 | wordMap.put(tmpWord,(long) 1); 71 | wordReadNum++; 72 | } 73 | } 74 | if (lineNum%1000 ==0) { 75 | System.out.println("hasProcessed train data numbers:" + lineNum); 76 | } 77 | } 78 | System.out.println("train_labeled_text total lineNum:"+lineNum 79 | +", and wordSet.size():"+wordMap.size()); 80 | train_labeled_text_br.close(); 81 | 82 | //train_unlabeled_text 83 | lineNum = 0; 84 | while ((tempLine = train_unlabeled_text_br.readLine()) != null) { 85 | lineNum++; 86 | String[] wordArraysStr = tempLine.trim().split("\\s+"); 87 | for (int i = 0; i < wordArraysStr.length; i++) { 88 | String tmpWord = wordArraysStr[i].trim(); 89 | if (tmpWord.length()<1) continue; 90 | if (wordMap.containsKey(tmpWord)) { 91 | wordMap.put(tmpWord, wordMap.get(tmpWord)+1); 92 | }else { 93 | wordMap.put(tmpWord,(long) 1); 94 | wordReadNum++; 95 | } 96 | } 97 | if (lineNum%1000 ==0) { 98 | System.out.println("hasProcessed train_unlabeled_text data numbers:" + lineNum); 99 | } 100 | } 101 | System.out.println("train_unlabeled_text total lineNum:"+lineNum 102 | +", and wordSet.size():"+wordMap.size()); 103 | train_unlabeled_text_br.close(); 104 | 105 | if (hasTestData){ 106 | String test_unlabeled_text = 107 | test_unlabeled_text_pre+"_"+topic2filesuffix.get(topic_idx); 108 | FileReader test_unlabeled_text_fr = new 
FileReader(test_unlabeled_text); 109 | BufferedReader test_unlabeled_text_br = new BufferedReader(test_unlabeled_text_fr); 110 | 111 | //test_unlabeled_text 112 | lineNum = 0; 113 | while ((tempLine = test_unlabeled_text_br.readLine()) != null) { 114 | lineNum++; 115 | String[] wordArraysStr = tempLine.trim().split("\\s+"); 116 | for (int i = 0; i < wordArraysStr.length; i++) { 117 | String tmpWord = wordArraysStr[i].trim(); 118 | if (tmpWord.length()<1) continue; 119 | if (wordMap.containsKey(tmpWord)) { 120 | wordMap.put(tmpWord, wordMap.get(tmpWord)+1); 121 | }else { 122 | wordMap.put(tmpWord,(long) 1); 123 | wordReadNum++; 124 | } 125 | } 126 | if (lineNum%1000 ==0) { 127 | System.out.println("hasProcessed test_unlabeled_text data numbers:" + lineNum); 128 | } 129 | } 130 | System.out.println("test_unlabeled_text total lineNum:"+lineNum 131 | +", and wordSet.size():"+wordMap.size()); 132 | test_unlabeled_text_br.close(); 133 | } 134 | 135 | Iterator iter=wordMap.entrySet().iterator(); 136 | while(iter.hasNext()){ 137 | Map.Entry entry = (Map.Entry) iter.next(); 138 | String tmpWord = entry.getKey(); 139 | 140 | wordWriteNum++; 141 | wordDictFileW.write(tmpWord+"\t"+wordWriteNum+"\n"); 142 | } 143 | System.out.println("wordReadNum:" +wordReadNum); 144 | System.out.println("wordWriteNum:" +wordWriteNum); 145 | if (wordMap.size()!=wordReadNum) { 146 | System.out.println("Error! wordReadNum differs from wordMap.size()"); 147 | } 148 | wordDictFileW.close(); 149 | } 150 | } 151 | } -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step02_01_TaskA_GenVSM.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileInputStream; 7 | import java.io.FileOutputStream; 8 | import java.io.IOException; 9 | import java.io.InputStreamReader; 10 | import java.io.OutputStreamWriter; 11 | import java.util.ArrayList; 12 | import java.util.HashMap; 13 | 14 | public class Step02_01_TaskA_GenVSM { 15 | public static void main(String[] args) throws Exception { 16 | //Test 17 | // String tempLine ="宸濄伄娴併倢銇倛銇?28521"; 18 | // String[] tokensStr = tempLine.split("\t"); 19 | //Build a term-frequency vector space model from the plain text and the word map, used for STH preprocessing 20 | // all = [test_data;train_data]!
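/* Format note: creatVSMText below writes one line per document, holding space-separated raw term frequencies with one column per entry of usrWordDict_<topic>; the labeled, unlabeled and test matrices therefore share the same column order, which the Matlab side (tf_idf.m, normalize.m) presumably relies on when loading them as dense matrices. */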
21 | String dataPathStr = "./../../data/RefineData/Step01_topics/"; 22 | String resultsPathStr = "./../../data/RefineData/Step02_vsm/"; 23 | 24 | String train_labeled_text_pre = dataPathStr+"TaskA_train_labeled_text"; 25 | String train_unlabeled_text_pre = dataPathStr+"TaskA_train_unlabeled_text"; 26 | String test_unlabeled_text_pre = dataPathStr+"TaskA_test_unlabeled_text"; 27 | 28 | //主题到文件后缀的映射 29 | ArrayList topic2filesuffix = new ArrayList(); 30 | topic2filesuffix.add("iphonese"); 31 | topic2filesuffix.add("bianpao"); 32 | topic2filesuffix.add("fankong"); 33 | topic2filesuffix.add("ertai"); 34 | topic2filesuffix.add("jinmo"); 35 | 36 | boolean hasTestData = true; 37 | 38 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 39 | String wordMapStr = resultsPathStr + "usrWordDict_" + topic2filesuffix.get(topic_idx); 40 | 41 | String vsmBW_train_labeledStr = resultsPathStr + "/vsm_train_labeled_" 42 | + topic2filesuffix.get(topic_idx); 43 | String vsmBW_train_unlabeledStr = resultsPathStr+"/vsm_train_unlabeled_" 44 | + topic2filesuffix.get(topic_idx); 45 | String vsmBW_test_unlabeledStr = resultsPathStr+"/vsm_test_unlabeled_" 46 | + topic2filesuffix.get(topic_idx); 47 | BufferedReader train_labeled_File = new BufferedReader(new InputStreamReader( 48 | new FileInputStream(new File(train_labeled_text_pre 49 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 50 | BufferedReader train_unlabeled_File = new BufferedReader(new InputStreamReader( 51 | new FileInputStream(new File(train_unlabeled_text_pre 52 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 53 | 54 | //构造训练VSM词频向量空间模型 55 | BufferedReader wordMapRD = new BufferedReader( 56 | new InputStreamReader(new FileInputStream(new File(wordMapStr)), "UTF-8")); 57 | creatVSMText(train_labeled_File, wordMapRD, vsmBW_train_labeledStr); 58 | wordMapRD.close(); 59 | train_labeled_File.close(); 60 | 61 | wordMapRD = new BufferedReader( 62 | new InputStreamReader(new FileInputStream(new File(wordMapStr)), "UTF-8")); 63 | creatVSMText(train_unlabeled_File, wordMapRD, vsmBW_train_unlabeledStr); 64 | wordMapRD.close(); 65 | train_unlabeled_File.close(); 66 | 67 | if(hasTestData){ 68 | BufferedReader test_unlabeled_File = new BufferedReader(new InputStreamReader( 69 | new FileInputStream(new File(test_unlabeled_text_pre 70 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 71 | 72 | wordMapRD = new BufferedReader( 73 | new InputStreamReader(new FileInputStream(new File(wordMapStr)), "UTF-8")); 74 | creatVSMText(test_unlabeled_File, wordMapRD, vsmBW_test_unlabeledStr); 75 | wordMapRD.close(); 76 | test_unlabeled_File.close(); 77 | } 78 | } 79 | 80 | System.out.println("It is done, ok!"); 81 | } 82 | 83 | public static void creatVSMText(BufferedReader sourceTextRD, 84 | BufferedReader wordMapRD, String vsmBW_Str) throws IOException, Exception { 85 | System.out.println("Start to create VSM ...!"); 86 | String tempLine; 87 | //先读入词典 88 | int wordIdxNum = 1; 89 | HashMap wordMap = new HashMap(); 90 | 91 | while ((tempLine = wordMapRD.readLine()) != null) { 92 | //词典中放着词和索引号,索引号 93 | if (wordMap.containsKey(tempLine.trim())) { 94 | System.out.println("Test, the word is replicate:"+tempLine.trim() 95 | +", in wordIdxNum:"+wordIdxNum); 96 | } 97 | if (tempLine.trim().length()==0) continue; 98 | //wordMap.put(tempLine.trim(), wordIdxNum); 99 | wordMap.put(tempLine.split("\\s+")[0].trim(), Integer.valueOf(tempLine.split("\\s+")[1])); 100 | wordIdxNum =wordIdxNum+1; 101 | } 102 | //定义了这个数据集的特征维数 103 | int dimVector = wordIdxNum-1; 
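/* dimVector equals the dictionary size: each document read below is expanded into a dense count vector of this fixed length, and tokens missing from the dictionary are logged as errors rather than silently skipped. */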
104 | System.out.println("Has read the dictionary, the size is:"+wordMap.size()); 105 | ArrayList wordFreqList = new ArrayList(); 106 | int lineNum = 1; 107 | boolean hasWordFeature = false; 108 | StringBuffer tmpVSMBuffer = new StringBuffer(); 109 | while ((tempLine = sourceTextRD.readLine()) != null) { 110 | hasWordFeature = false; 111 | //读入一行,即一个文档; 112 | wordFreqList.clear(); 113 | for (int i = 0; i < dimVector; i++) { 114 | wordFreqList.add(0); 115 | } 116 | 117 | String[] tokensStr; 118 | boolean isvalid = true; 119 | 120 | tokensStr = tempLine.trim().split("\\s+"); 121 | 122 | if (!(tokensStr.length<1)) { 123 | for (int j = 0; j < tokensStr.length; j++) { 124 | String tempToken = tokensStr[j]; 125 | if (wordMap.containsKey(tempToken.trim())) { 126 | hasWordFeature = true; 127 | int index = wordMap.get(tempToken.trim()); 128 | if (index>dimVector) { 129 | System.out.print("Error, and the word is: "+tempToken.trim()); 130 | } 131 | wordFreqList.set(index-1, wordFreqList.get(index-1)+1); 132 | }else { 133 | System.out.println("error: the map has not contain the word:" 134 | +tempToken+" in Line:"+lineNum); 135 | } 136 | } 137 | }else { 138 | isvalid = false; 139 | } 140 | 141 | if (!isvalid) { 142 | System.out.println("warning: the string has lacked contents:" 143 | +tempLine.trim()+" in Line:"+lineNum); 144 | } 145 | for (int tempFreq:wordFreqList) { 146 | tmpVSMBuffer.append(String.valueOf(tempFreq)+" "); 147 | } 148 | //把处理好的文本写入到新的文本文件中 149 | Result2Txt(vsmBW_Str,tmpVSMBuffer.toString().trim()); 150 | tmpVSMBuffer.delete(0, tmpVSMBuffer.length()); 151 | 152 | if (!hasWordFeature) { 153 | System.out.println("++++++++++"+"has no word in Line:"+lineNum+"++++++++++"); 154 | } 155 | lineNum++; 156 | if (lineNum%1000 ==0) { 157 | System.out.println("hasProcessed text numbers:" + lineNum); 158 | } 159 | } 160 | } 161 | public static void Result2Txt(String file, String txt) { 162 | try { 163 | BufferedWriter os = new BufferedWriter(new OutputStreamWriter( 164 | new FileOutputStream(new File(file),true), "UTF-8")); 165 | os.write(txt + "\n"); 166 | os.close(); 167 | } catch (Exception e) { 168 | e.printStackTrace(); 169 | } 170 | } 171 | } 172 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step04_01_TaskA_ChiSquare.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileOutputStream; 7 | import java.io.FileReader; 8 | import java.io.OutputStreamWriter; 9 | import java.util.ArrayList; 10 | import java.util.HashMap; 11 | import java.util.Iterator; 12 | import java.util.Map; 13 | 14 | public class Step04_01_TaskA_ChiSquare { 15 | public static void main(String[] args) throws Exception { 16 | 17 | //读取训练数据文件 18 | String dataPathStr="./../../data/RefineData/Step01_topics/"; 19 | String resultsPathStr="./../../data/RefineData/Step04_chisquare/"; 20 | File resultsPathFile = new File(resultsPathStr); 21 | if (!resultsPathFile.exists()) resultsPathFile.mkdir(); 22 | 23 | String train_labeled_text_pre = dataPathStr+"TaskA_train_labeled_text"; 24 | String train_labeled_tag_pre = dataPathStr+"TaskA_train_labeled_tag"; 25 | 26 | //主题到文件后缀的映射 27 | ArrayList topic2filesuffix = new ArrayList(); 28 | topic2filesuffix.add("iphonese"); 29 | topic2filesuffix.add("bianpao"); 30 | topic2filesuffix.add("fankong"); 31 | topic2filesuffix.add("ertai"); 32 | 
topic2filesuffix.add("jinmo"); 33 | int low_feq = 1; 34 | boolean hasTestData = false; 35 | 36 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 37 | String usrWordDict = resultsPathStr + "usrWordDict_low_feq_" + String.valueOf(low_feq) 38 | + "_"+ topic2filesuffix.get(topic_idx); 39 | BufferedWriter wordDictFileW = new BufferedWriter(new OutputStreamWriter( 40 | new FileOutputStream(new File(usrWordDict)), "UTF-8")); 41 | 42 | String taskA_ChiSquare = resultsPathStr + "TaskA_chiSquare_" 43 | + topic2filesuffix.get(topic_idx); 44 | BufferedWriter taskA_ChiSquareFileW = new BufferedWriter(new OutputStreamWriter( 45 | new FileOutputStream(new File(taskA_ChiSquare)), "UTF-8")); 46 | 47 | String train_labeled_text = 48 | train_labeled_text_pre+"_"+topic2filesuffix.get(topic_idx); 49 | FileReader train_labeled_text_fr = new FileReader(train_labeled_text); 50 | BufferedReader train_labeled_text_br = new BufferedReader(train_labeled_text_fr); 51 | 52 | String train_labeled_tag = 53 | train_labeled_tag_pre+"_"+topic2filesuffix.get(topic_idx); 54 | FileReader train_labeled_tag_fr = new FileReader(train_labeled_tag); 55 | BufferedReader train_labeled_tag_br = new BufferedReader(train_labeled_tag_fr); 56 | 57 | //先读入处理好的训练语料 58 | int wordReadNum = 0; 59 | int wordWriteNum = 0; 60 | 61 | HashMap wordMap = new HashMap(); 62 | System.out.println("Start to read wordSet ..."); 63 | //train_labeled_text 64 | int lineNum = 0; 65 | String tempLine; 66 | while ((tempLine = train_labeled_text_br.readLine()) != null) { 67 | lineNum++; 68 | String[] wordArraysStr = tempLine.trim().split("\\s+"); 69 | for (int i = 0; i < wordArraysStr.length; i++) { 70 | String tmpWord = wordArraysStr[i].trim(); 71 | if (tmpWord.length()<1) continue; 72 | if (wordMap.containsKey(tmpWord)) { 73 | wordMap.put(tmpWord, wordMap.get(tmpWord)+1); 74 | }else { 75 | wordMap.put(tmpWord,(long) 1); 76 | wordReadNum++; 77 | } 78 | } 79 | if (lineNum%1000 ==0) { 80 | System.out.println("hasProcessed train data numbers:" + lineNum); 81 | } 82 | } 83 | System.out.println("train_labeled_text total lineNum:"+lineNum 84 | +", and wordSet.size():"+wordMap.size()); 85 | train_labeled_text_br.close(); 86 | 87 | //开始筛选low_feq>3的并输出 88 | System.out.println("raw wordMap size:" + wordMap.size()); 89 | Iterator iter = wordMap.entrySet().iterator(); 90 | while(iter.hasNext()){ 91 | Map.Entry entry = (Map.Entry) iter.next(); 92 | String tmpWord = entry.getKey(); 93 | long tmp_low_feq = entry.getValue(); 94 | if (tmp_low_feq < low_feq) { 95 | wordMap.remove(tmpWord); 96 | continue; 97 | } 98 | wordWriteNum++; 99 | wordDictFileW.write(tmpWord+"\t"+wordWriteNum+"\n"); 100 | } 101 | System.out.println("wordReadNum:" + wordReadNum); 102 | System.out.println("wordWriteNum:" + wordWriteNum); 103 | System.out.println("refined wordMap size:" + wordMap.size()); 104 | wordDictFileW.close(); 105 | 106 | //开始输出卡方格式 107 | //tag \t text 108 | lineNum = 0; 109 | train_labeled_text_fr = new FileReader(train_labeled_text); 110 | train_labeled_text_br = new BufferedReader(train_labeled_text_fr); 111 | String tempTag; 112 | String tempText; 113 | String[] tokensStr; 114 | StringBuffer tmpLineBuffer = new StringBuffer(); 115 | while ((tempTag = train_labeled_tag_br.readLine()) != null) { 116 | lineNum++; 117 | tempText = train_labeled_text_br.readLine(); 118 | 119 | tokensStr = tempText.trim().split("\\s+"); 120 | 121 | if (!(tokensStr.length<1)) { 122 | for (int j = 0; j < tokensStr.length; j++) { 123 | String tempToken = tokensStr[j]; 124 | if 
(wordMap.containsKey(tempToken.trim())) { 125 | tmpLineBuffer.append(tempToken + " "); 126 | }else { 127 | System.out.println("error: the map has not contain the word:" 128 | + tempToken+" in Line:" + lineNum); 129 | } 130 | } 131 | taskA_ChiSquareFileW.write(tempTag+"\t"+tmpLineBuffer.toString().trim()+"\n"); 132 | } 133 | 134 | tmpLineBuffer.delete(0, tmpLineBuffer.length()); 135 | if (lineNum%1000 ==0) { 136 | System.out.println("hasProcessed train_unlabeled_text data numbers:" + lineNum); 137 | } 138 | } 139 | taskA_ChiSquareFileW.close(); 140 | System.out.println("taskA_ChiSquareFileW total lineNum:"+lineNum 141 | +", and wordSet.size():"+wordMap.size()); 142 | train_labeled_tag_br.close(); 143 | train_labeled_text_br.close(); 144 | 145 | } 146 | } 147 | } -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step04_02_TaskA_ChiSquare_Rank.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileOutputStream; 7 | import java.io.FileReader; 8 | import java.io.OutputStreamWriter; 9 | import java.util.ArrayList; 10 | import java.util.Collections; 11 | import java.util.Comparator; 12 | import java.util.HashMap; 13 | import java.util.Iterator; 14 | import java.util.List; 15 | import java.util.Map; 16 | 17 | public class Step04_02_TaskA_ChiSquare_Rank { 18 | public static void main(String[] args) throws Exception { 19 | 20 | //读取训练数据文件 21 | String dataPathStr="./../../data/RefineData/Step04_chisquare/"; 22 | String resultsPathStr="./../../data/RefineData/Step04_chisquare/"; 23 | File resultsPathFile = new File(resultsPathStr); 24 | if (!resultsPathFile.exists()) resultsPathFile.mkdir(); 25 | 26 | String train_chi2_score_pre = dataPathStr+"output_chi2_"; 27 | 28 | //主题到文件后缀的映射 29 | ArrayList topic2filesuffix = new ArrayList(); 30 | topic2filesuffix.add("iphonese"); 31 | topic2filesuffix.add("bianpao"); 32 | topic2filesuffix.add("fankong"); 33 | topic2filesuffix.add("ertai"); 34 | topic2filesuffix.add("jinmo"); 35 | int top_words = 500; 36 | 37 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 38 | HashMap chi2_rank_favor_map = new HashMap(); 39 | HashMap chi2_rank_none_map = new HashMap(); 40 | HashMap chi2_rank_against_map = new HashMap(); 41 | 42 | String usrChi2ScoreDict = resultsPathStr + "usrWordDict_chi2_score_" 43 | + topic2filesuffix.get(topic_idx); 44 | BufferedWriter wordChi2ScoreFileW = new BufferedWriter(new OutputStreamWriter( 45 | new FileOutputStream(new File(usrChi2ScoreDict)), "UTF-8")); 46 | 47 | String usrWordDict = resultsPathStr + "usrWordDict_chi2_refined_" 48 | + topic2filesuffix.get(topic_idx); 49 | BufferedWriter wordDictFileW = new BufferedWriter(new OutputStreamWriter( 50 | new FileOutputStream(new File(usrWordDict)), "UTF-8")); 51 | 52 | // String taskA_ChiSquare = resultsPathStr + "TaskA_chi2_refined_" 53 | // + topic2filesuffix.get(topic_idx); 54 | // BufferedWriter taskA_ChiSquareFileW = new BufferedWriter(new OutputStreamWriter( 55 | // new FileOutputStream(new File(taskA_ChiSquare)), "UTF-8")); 56 | 57 | String train_chi2_score_text = 58 | train_chi2_score_pre+topic2filesuffix.get(topic_idx)+".tst"; 59 | FileReader train_chi2_score_fr = new FileReader(train_chi2_score_text); 60 | BufferedReader train_chi2_score_br = new BufferedReader(train_chi2_score_fr); 61 | 62 | //先读入处理好的训练语料 63 
| int wordReadNum = 0; 64 | int wordWriteNum = 0; 65 | 66 | //train_labeled_text 67 | int lineNum = 0; 68 | String tempLine; 69 | while ((tempLine = train_chi2_score_br.readLine()) != null) { 70 | lineNum++; 71 | String[] tokensStr = tempLine.trim().split("\t"); 72 | String tmpTag = tokensStr[0].trim(); 73 | String tmpWord = tokensStr[1].trim(); 74 | Float tmpScore = Float.valueOf(tokensStr[2].trim()); 75 | 76 | if(tmpTag.equals("FAVOR")){ 77 | chi2_rank_favor_map.put(tmpWord, tmpScore); 78 | }else if (tmpTag.equals("AGAINST")) { 79 | chi2_rank_against_map.put(tmpWord, tmpScore); 80 | }else if (tmpTag.equals("NONE")) { 81 | chi2_rank_none_map.put(tmpWord, tmpScore); 82 | }else { 83 | System.out.println("Error: wrong tag is "+tmpTag); 84 | } 85 | } 86 | System.out.println("train_labeled_text total lineNum:"+lineNum 87 | +", and wordSet.size():"+chi2_rank_favor_map.size()); 88 | train_chi2_score_br.close(); 89 | 90 | //开始对chi2进行排序,同时读入词典 91 | HashMap wordMap = new HashMap(); 92 | System.out.println("Start to read wordSet ..."); 93 | //Favor 94 | List> chi2_rank_favor_List = 95 | new ArrayList>(chi2_rank_favor_map.entrySet()); 96 | Collections.sort(chi2_rank_favor_List, new Comparator>() { 97 | @Override 98 | public int compare(Map.Entry o1, Map.Entry o2) { 99 | return (o2.getValue().compareTo(o1.getValue())); 100 | //return (o1.getKey()).toString().compareTo(o2.getKey()); 101 | } 102 | }); 103 | if (top_words>chi2_rank_favor_List.size()) 104 | top_words = chi2_rank_favor_List.size(); 105 | for (int i = 0; i < chi2_rank_favor_List.size(); i++) { 106 | String tmpWord = chi2_rank_favor_List.get(i).getKey(); 107 | float tmpChi2 = chi2_rank_favor_List.get(i).getValue(); 108 | wordChi2ScoreFileW.write(tmpWord+"\t"+tmpChi2+"\n"); 109 | if (i < top_words){ 110 | wordReadNum++; 111 | wordMap.put(tmpWord, tmpChi2); 112 | } 113 | } 114 | //Against 115 | List> chi2_rank_against_List = 116 | new ArrayList>(chi2_rank_against_map.entrySet()); 117 | Collections.sort(chi2_rank_against_List, new Comparator>() { 118 | @Override 119 | public int compare(Map.Entry o1, Map.Entry o2) { 120 | return (o2.getValue().compareTo(o1.getValue())); 121 | //return (o1.getKey()).toString().compareTo(o2.getKey()); 122 | } 123 | }); 124 | for (int i = 0; i < chi2_rank_against_List.size(); i++) { 125 | String tmpWord = chi2_rank_against_List.get(i).getKey(); 126 | float tmpChi2 = chi2_rank_against_List.get(i).getValue(); 127 | wordChi2ScoreFileW.write(tmpWord+"\t"+tmpChi2+"\n"); 128 | if (i < top_words){ 129 | wordReadNum++; 130 | wordMap.put(tmpWord, tmpChi2); 131 | } 132 | } 133 | //None 134 | List> chi2_rank_none_List = 135 | new ArrayList>(chi2_rank_none_map.entrySet()); 136 | Collections.sort(chi2_rank_none_List, new Comparator>() { 137 | @Override 138 | public int compare(Map.Entry o1, Map.Entry o2) { 139 | return (o2.getValue().compareTo(o1.getValue())); 140 | //return (o1.getKey()).toString().compareTo(o2.getKey()); 141 | } 142 | }); 143 | for (int i = 0; i < chi2_rank_none_List.size(); i++) { 144 | String tmpWord = chi2_rank_none_List.get(i).getKey(); 145 | float tmpChi2 = chi2_rank_none_List.get(i).getValue(); 146 | wordChi2ScoreFileW.write(tmpWord+"\t"+tmpChi2+"\n"); 147 | if (i < top_words){ 148 | wordReadNum++; 149 | wordMap.put(tmpWord, tmpChi2); 150 | } 151 | } 152 | 153 | //开始输出基于Chi2选择后的词典 154 | System.out.println("raw wordMap size:" + wordMap.size()); 155 | Iterator iter = wordMap.entrySet().iterator(); 156 | while(iter.hasNext()){ 157 | Map.Entry entry = (Map.Entry) iter.next(); 158 | String tmpWord = 
entry.getKey(); 159 | 160 | wordWriteNum++; 161 | wordDictFileW.write(tmpWord+"\t"+wordWriteNum+"\n"); 162 | } 163 | System.out.println("wordReadNum:" + wordReadNum); 164 | System.out.println("wordWriteNum:" + wordWriteNum); 165 | System.out.println("refined wordMap size:" + wordMap.size()); 166 | wordChi2ScoreFileW.close(); 167 | wordDictFileW.close(); 168 | } 169 | } 170 | } -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step04_03_TaskA_ChiSquare_VSM.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileInputStream; 7 | import java.io.FileOutputStream; 8 | import java.io.IOException; 9 | import java.io.InputStreamReader; 10 | import java.io.OutputStreamWriter; 11 | import java.util.ArrayList; 12 | import java.util.HashMap; 13 | 14 | public class Step04_03_TaskA_ChiSquare_VSM { 15 | public static void main(String[] args) throws Exception { 16 | //Test 17 | // String tempLine ="宸濄伄娴併倢銇倛銇?28521"; 18 | // String[] tokensStr = tempLine.split("\t"); 19 | //利用纯文本和wordmap构建基于词频的向量空间模型,用于STH预处理 20 | // all = [test_data;train_data]! 21 | String dataPathStr = "./../../data/RefineData/Step01_topics/"; 22 | String resultsPathStr = "./../../data/RefineData/Step04_chisquare/"; 23 | 24 | String train_labeled_text_pre = dataPathStr+"TaskA_train_labeled_text"; 25 | String train_unlabeled_text_pre = dataPathStr+"TaskA_train_unlabeled_text"; 26 | String test_unlabeled_text_pre = dataPathStr+"TaskA_test_unlabeled_text"; 27 | 28 | //主题到文件后缀的映射 29 | ArrayList topic2filesuffix = new ArrayList(); 30 | topic2filesuffix.add("iphonese"); 31 | topic2filesuffix.add("bianpao"); 32 | topic2filesuffix.add("fankong"); 33 | topic2filesuffix.add("ertai"); 34 | topic2filesuffix.add("jinmo"); 35 | 36 | boolean hasTestData = true; 37 | 38 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 39 | String wordMapStr = resultsPathStr + "usrWordDict_chi2_refined_" + topic2filesuffix.get(topic_idx); 40 | 41 | String vsmBW_train_labeledStr = resultsPathStr + "/vsm_train_labeled_chi2_" 42 | + topic2filesuffix.get(topic_idx); 43 | String vsmBW_train_unlabeledStr = resultsPathStr+"/vsm_train_unlabeled_chi2_" 44 | + topic2filesuffix.get(topic_idx); 45 | String vsmBW_test_unlabeledStr = resultsPathStr+"/vsm_test_unlabeled_chi2_" 46 | + topic2filesuffix.get(topic_idx); 47 | BufferedReader train_labeled_File = new BufferedReader(new InputStreamReader( 48 | new FileInputStream(new File(train_labeled_text_pre 49 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 50 | BufferedReader train_unlabeled_File = new BufferedReader(new InputStreamReader( 51 | new FileInputStream(new File(train_unlabeled_text_pre 52 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 53 | 54 | //构造训练VSM词频向量空间模型 55 | BufferedReader wordMapRD = new BufferedReader( 56 | new InputStreamReader(new FileInputStream(new File(wordMapStr)), "UTF-8")); 57 | creatVSMText(train_labeled_File, wordMapRD, vsmBW_train_labeledStr); 58 | wordMapRD.close(); 59 | train_labeled_File.close(); 60 | 61 | wordMapRD = new BufferedReader( 62 | new InputStreamReader(new FileInputStream(new File(wordMapStr)), "UTF-8")); 63 | creatVSMText(train_unlabeled_File, wordMapRD, vsmBW_train_unlabeledStr); 64 | wordMapRD.close(); 65 | train_unlabeled_File.close(); 66 | 67 | if(hasTestData){ 68 | BufferedReader 
test_unlabeled_File = new BufferedReader(new InputStreamReader( 69 | new FileInputStream(new File(test_unlabeled_text_pre 70 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 71 | 72 | wordMapRD = new BufferedReader( 73 | new InputStreamReader(new FileInputStream(new File(wordMapStr)), "UTF-8")); 74 | creatVSMText(test_unlabeled_File, wordMapRD, vsmBW_test_unlabeledStr); 75 | wordMapRD.close(); 76 | test_unlabeled_File.close(); 77 | } 78 | } 79 | 80 | System.out.println("It is done, ok!"); 81 | } 82 | 83 | public static void creatVSMText(BufferedReader sourceTextRD, 84 | BufferedReader wordMapRD, String vsmBW_Str) throws IOException, Exception { 85 | System.out.println("Start to create VSM ...!"); 86 | String tempLine; 87 | //Load the dictionary first 88 | int wordIdxNum = 1; 89 | HashMap<String, Integer> wordMap = new HashMap<String, Integer>(); 90 | 91 | while ((tempLine = wordMapRD.readLine()) != null) { 92 | //The dictionary holds each word together with its index number 93 | if (wordMap.containsKey(tempLine.trim())) { 94 | System.out.println("Test, the word is duplicated:"+tempLine.trim() 95 | +", in wordIdxNum:"+wordIdxNum); 96 | } 97 | if (tempLine.trim().length()==0) continue; 98 | //wordMap.put(tempLine.trim(), wordIdxNum); 99 | wordMap.put(tempLine.split("\\s+")[0].trim(), Integer.valueOf(tempLine.split("\\s+")[1])); 100 | wordIdxNum = wordIdxNum+1; 101 | } 102 | //Feature dimensionality of this dataset 103 | int dimVector = wordIdxNum-1; 104 | System.out.println("Has read the dictionary, the size is:"+wordMap.size()); 105 | ArrayList<Integer> wordFreqList = new ArrayList<Integer>(); 106 | int lineNum = 0; 107 | boolean hasWordFeature = false; 108 | int hasNoWordFeature = 0; 109 | StringBuffer tmpVSMBuffer = new StringBuffer(); 110 | while ((tempLine = sourceTextRD.readLine()) != null) { 111 | hasWordFeature = false; 112 | //Read one line, i.e. one document 113 | wordFreqList.clear(); 114 | for (int i = 0; i < dimVector; i++) { 115 | wordFreqList.add(0); 116 | } 117 | 118 | String[] tokensStr; 119 | boolean isvalid = true; 120 | 121 | tokensStr = tempLine.trim().split("\\s+"); 122 | 123 | if (!(tokensStr.length<1)) { 124 | for (int j = 0; j < tokensStr.length; j++) { 125 | String tempToken = tokensStr[j]; 126 | if (wordMap.containsKey(tempToken.trim())) { 127 | hasWordFeature = true; 128 | int index = wordMap.get(tempToken.trim()); 129 | if (index>dimVector) { 130 | System.out.print("Error, and the word is: "+tempToken.trim()); 131 | } 132 | wordFreqList.set(index-1, wordFreqList.get(index-1)+1); 133 | } 134 | // else { 135 | // System.out.println("error: the map has not contain the word:" 136 | // +tempToken+" in Line:"+lineNum); 137 | // } 138 | } 139 | }else { 140 | isvalid = false; 141 | } 142 | 143 | if (!isvalid) { 144 | System.out.println("warning: the string lacks content:" 145 | +tempLine.trim()+" in Line:"+lineNum); 146 | } 147 | for (int tempFreq:wordFreqList) { 148 | tmpVSMBuffer.append(String.valueOf(tempFreq)+" "); 149 | } 150 | //Write the processed text into a new text file 151 | Result2Txt(vsmBW_Str,tmpVSMBuffer.toString().trim()); 152 | tmpVSMBuffer.delete(0, tmpVSMBuffer.length()); 153 | 154 | if (!hasWordFeature) { 155 | hasNoWordFeature++; 156 | } 157 | lineNum++; 158 | if (lineNum%1000 ==0) { 159 | System.out.println("hasProcessed text numbers:" + lineNum); 160 | } 161 | } 162 | System.out.println("Total number is:"+lineNum+" and no word lines is:"+ hasNoWordFeature); 163 | } 164 | public static void Result2Txt(String file, String txt) { 165 | try { 166 | BufferedWriter os = new BufferedWriter(new OutputStreamWriter( 167 | new FileOutputStream(new File(file),true), "UTF-8")); 168 | os.write(txt + "\n"); 169 | os.close(); 170 | } catch 
(Exception e) { 171 | e.printStackTrace(); 172 | } 173 | } 174 | } 175 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step05_01_TaskA_Opinion.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileInputStream; 7 | import java.io.FileNotFoundException; 8 | import java.io.FileOutputStream; 9 | import java.io.IOException; 10 | import java.io.InputStreamReader; 11 | import java.io.OutputStreamWriter; 12 | import java.util.ArrayList; 13 | import java.util.HashMap; 14 | import java.util.HashSet; 15 | 16 | public class Step05_01_TaskA_Opinion { 17 | public static void main(String[] args) throws Exception { 18 | //Test 19 | // String tempLine ="宸濄伄娴併倢銇倛銇?28521"; 20 | // String[] tokensStr = tempLine.split("\t"); 21 | //Build opinion and sentiment lexicon features from the plain segmented text 22 | // all = [test_data;train_data]! 23 | String dataPathStr = "./../../data/RefineData/Step01_topics/"; 24 | String corpusPathStr = "./../../corpus/"; 25 | String resultsPathStr = "./../../data/RefineData/Step05_opinion/"; 26 | 27 | String train_labeled_text_pre = dataPathStr+"TaskA_train_labeled_text"; 28 | String test_unlabeled_text_pre = dataPathStr+"TaskA_test_unlabeled_text"; 29 | 30 | //Mapping from topics to file suffixes 31 | ArrayList<String> topic2filesuffix = new ArrayList<String>(); 32 | topic2filesuffix.add("iphonese"); 33 | topic2filesuffix.add("bianpao"); 34 | topic2filesuffix.add("fankong"); 35 | topic2filesuffix.add("ertai"); 36 | topic2filesuffix.add("jinmo"); 37 | 38 | boolean hasTestData = true; 39 | HashSet<String> opinion_pos_set = new HashSet<String>(); 40 | HashSet<String> opinion_neg_set = new HashSet<String>(); 41 | String opinion_pos_str = corpusPathStr + "Hownet/pos_opinion.txt"; 42 | String opinion_neg_str = corpusPathStr + "Hownet/neg_opinion.txt"; 43 | loadWordSet(opinion_pos_str, opinion_pos_set); 44 | loadWordSet(opinion_neg_str, opinion_neg_set); 45 | 46 | HashSet<String> sentiment_pos_set = new HashSet<String>(); 47 | HashSet<String> sentiment_neg_set = new HashSet<String>(); 48 | String sentiment_pos_str = corpusPathStr + "Tsinghua/tsinghua.positive.gb.txt"; 49 | String sentiment_neg_str = corpusPathStr + "Tsinghua/tsinghua.negative.gb.txt"; 50 | loadWordSet(sentiment_pos_str, sentiment_pos_set); 51 | loadWordSet(sentiment_neg_str, sentiment_neg_set); 52 | 53 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 54 | 55 | String opinion_train_labeledStr = resultsPathStr + "/opinion_train_labeled_" 56 | + topic2filesuffix.get(topic_idx); 57 | String opinion_test_unlabeledStr = resultsPathStr+"/opinion_test_unlabeled_" 58 | + topic2filesuffix.get(topic_idx); 59 | BufferedReader train_labeled_File = new BufferedReader(new InputStreamReader( 60 | new FileInputStream(new File(train_labeled_text_pre 61 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 62 | 63 | //Build the opinion features for the labeled training data 64 | creatOpinoinFea(train_labeled_File, opinion_train_labeledStr, 65 | opinion_pos_set, opinion_neg_set, sentiment_pos_set, sentiment_neg_set); 66 | train_labeled_File.close(); 67 | 68 | 69 | if(hasTestData){ 70 | BufferedReader test_unlabeled_File = new BufferedReader(new InputStreamReader( 71 | new FileInputStream(new File(test_unlabeled_text_pre 72 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 73 | 74 | creatOpinoinFea(test_unlabeled_File, opinion_test_unlabeledStr, 75 | opinion_pos_set, opinion_neg_set, sentiment_pos_set, sentiment_neg_set); 76 | 
test_unlabeled_File.close(); 77 | } 78 | } 79 | 80 | System.out.println("It is done, ok!"); 81 | } 82 | 83 | private static void loadWordSet(String wordFileStr, HashSet<String> wordSet) throws IOException { 84 | File wordFile = new File(wordFileStr); 85 | FileInputStream fis = null; 86 | try { 87 | fis = new FileInputStream(wordFile); 88 | } catch (FileNotFoundException fnfe) { 89 | System.out.println("not found wordFileStr:"+wordFileStr); 90 | } 91 | try { 92 | BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-8")); 93 | String tempString = null; 94 | while ((tempString = reader.readLine()) != null) { 95 | wordSet.add(tempString.trim()); 96 | } 97 | System.out.println("load word dictionary... done"); 98 | reader.close(); 99 | } finally { 100 | fis.close(); 101 | } 102 | } 103 | 104 | public static void creatOpinoinFea(BufferedReader file_br, String opinion_str 105 | , HashSet<String> opinion_pos_set, HashSet<String> opinion_neg_set 106 | , HashSet<String> sentiment_pos_set, HashSet<String> sentiment_neg_set 107 | ) throws IOException, Exception { 108 | System.out.println("Start to create opinion feature ...!"); 109 | String tempLine; 110 | //Counters and smoothed scores for the opinion/sentiment features 111 | double opinion_pos_num = 0.0; 112 | double opinion_neg_num = 0.0; 113 | double sentiment_pos_num = 0.0; 114 | double sentiment_neg_num = 0.0; 115 | 116 | double opinion_pos_score = 0.0; 117 | double opinion_neg_score = 0.0; 118 | double sentiment_pos_score = 0.0; 119 | double sentiment_neg_score = 0.0; 120 | int lineNum = 0; 121 | while ((tempLine = file_br.readLine()) != null) { 122 | opinion_pos_num = 0; 123 | opinion_neg_num = 0; 124 | sentiment_pos_num = 0; 125 | sentiment_neg_num = 0; 126 | String[] tokensStr; 127 | tokensStr = tempLine.trim().split("\\s+"); 128 | 129 | for (int j = 0; j < tokensStr.length; j++) { 130 | String tempToken = tokensStr[j]; 131 | if (opinion_pos_set.contains(tempToken.trim())) { 132 | opinion_pos_num+=1.0; 133 | }else if (opinion_neg_set.contains(tempToken.trim())) { 134 | opinion_neg_num+=1.0; 135 | }else if (sentiment_pos_set.contains(tempToken.trim())) { 136 | sentiment_pos_num+=1.0; 137 | }else if (sentiment_neg_set.contains(tempToken.trim())) { 138 | sentiment_neg_num+=1.0; 139 | } 140 | } 141 | opinion_neg_num = Math.pow(opinion_neg_num, 1.3); //amplify negative counts with an empirical exponent 142 | sentiment_neg_num = Math.pow(sentiment_neg_num, 1.3); 143 | opinion_pos_score = 144 | (opinion_pos_num+1.0)/(opinion_pos_num+opinion_neg_num+2.0); //add-one (Laplace) smoothing 145 | opinion_neg_score = 146 | (opinion_neg_num+1.0)/(opinion_pos_num+opinion_neg_num+2.0); 147 | sentiment_pos_score = 148 | (sentiment_pos_num+1.0)/(sentiment_pos_num+sentiment_neg_num+2.0); 149 | sentiment_neg_score = 150 | (sentiment_neg_num+1.0)/(sentiment_pos_num+sentiment_neg_num+2.0); 151 | 152 | //Write the computed feature scores into a new text file 153 | Result2Txt(opinion_str,String.valueOf(opinion_pos_score)+" "+String.valueOf(opinion_neg_score)+" " 154 | + String.valueOf(sentiment_pos_score)+" "+String.valueOf(sentiment_neg_score)); 155 | lineNum++; 156 | if (lineNum%1000 ==0) { 157 | System.out.println("hasProcessed text numbers:" + lineNum); 158 | } 159 | } 160 | } 161 | public static void Result2Txt(String file, String txt) { 162 | try { 163 | BufferedWriter os = new BufferedWriter(new OutputStreamWriter( 164 | new FileOutputStream(new File(file),true), "UTF-8")); 165 | os.write(txt + "\n"); 166 | os.close(); 167 | } catch (Exception e) { 168 | e.printStackTrace(); 169 | } 170 | } 171 | } 172 | --------------------------------------------------------------------------------
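Note: the following is a minimal, self-contained sketch of the smoothed scoring that creatOpinoinFea above applies per document, on a toy token list. OpinionScoreDemo and the two-word lexicons are hypothetical illustrations, not files of this repository; only the 1.3 exponent on negative counts and the add-one smoothing come from Step05_01_TaskA_Opinion.java.

    import java.util.Arrays;
    import java.util.HashSet;

    // Hypothetical demo class (not part of this repository).
    public class OpinionScoreDemo {
        public static void main(String[] args) {
            HashSet<String> posSet = new HashSet<String>(Arrays.asList("支持", "赞"));
            HashSet<String> negSet = new HashSet<String>(Arrays.asList("反对", "差"));
            String[] tokens = {"我", "支持", "这", "个", "政策", "反对", "无效"};

            double pos = 0.0, neg = 0.0;
            for (String t : tokens) {
                if (posSet.contains(t)) pos += 1.0;        // one hit: 支持
                else if (negSet.contains(t)) neg += 1.0;   // one hit: 反对
            }
            neg = Math.pow(neg, 1.3);                      // amplify negative evidence, as in Step05
            double posScore = (pos + 1.0) / (pos + neg + 2.0);  // add-one (Laplace) smoothing
            double negScore = (neg + 1.0) / (pos + neg + 2.0);
            System.out.println(posScore + " " + negScore); // -> 0.5 0.5 for one hit each
        }
    }

With one positive and one negative hit the scores tie at 0.5; as negative hits accumulate, the 1.3 exponent pushes the pair toward the negative side faster than symmetric counting would, which appears to be the intent of that constant.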
/code/Feature_process_java/src/DataPreprocess/Step09_01_ResultSubmit.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileInputStream; 7 | import java.io.FileOutputStream; 8 | import java.io.InputStreamReader; 9 | import java.io.OutputStreamWriter; 10 | import java.util.ArrayList; 11 | import java.util.HashMap; 12 | import java.util.Iterator; 13 | import java.util.Map; 14 | 15 | public class Step09_01_ResultSubmit { 16 | /** 17 | * @param args 18 | * @author jacoxu.com-2016/06/23 19 | */ 20 | //predictResult_pre, testFile_str, submitFile_str, topic2filesuffix 21 | private static void dataAnalysis(String predictResult_pre, String testFile_str 22 | , String submitFile_str, ArrayList<String> topic2filesuffix) { 23 | 24 | try { 25 | String encoding = "UTF-8"; 26 | File testFile = new File(testFile_str); 27 | BufferedReader testFileReader = new BufferedReader(new InputStreamReader( 28 | new FileInputStream(testFile), encoding)); 29 | String tmpPredictLineStr; 30 | int readPredictLineNum = 0; 31 | String tmpTestLineStr; 32 | int readTestLineNum = 0; 33 | //First copy the header line of the test file into the submission file 34 | tmpTestLineStr = testFileReader.readLine(); 35 | if (tmpTestLineStr.split("\t").length != 4) { 36 | System.err.println("Error, in the first line of text file:" + tmpTestLineStr); 37 | System.exit(0); 38 | } 39 | Result2Txt(submitFile_str, tmpTestLineStr); 40 | //Write the results, topic by topic 41 | String tmpPredictLabel = "UNKNOWN"; 42 | String tmpNewTestStr; 43 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 44 | System.out.println("Current topic is:"+topic2filesuffix.get(topic_idx)); 45 | String predictResult_str = predictResult_pre + topic2filesuffix.get(topic_idx); 46 | File predictResultFile = new File(predictResult_str); 47 | BufferedReader predictResultReader = new BufferedReader(new InputStreamReader( 48 | new FileInputStream(predictResultFile), encoding)); 49 | while ((tmpPredictLineStr = predictResultReader.readLine()) != null) { 50 | readPredictLineNum++; 51 | float predictScore = Float.valueOf(tmpPredictLineStr); //scores encode the labels: 0=NONE, 1=FAVOR, 2=AGAINST 52 | if (predictScore < -0.2) 53 | System.err.println("Error, wrong predict score:" 54 | + String.valueOf(predictScore)); 55 | else if (predictScore < 0.2) 56 | tmpPredictLabel = "NONE"; 57 | else if (predictScore < 1.2) 58 | tmpPredictLabel = "FAVOR"; 59 | else if (predictScore < 2.2) 60 | tmpPredictLabel = "AGAINST"; 61 | else { 62 | System.err.println("Error, wrong predict score:" 63 | + String.valueOf(predictScore)); 64 | } 65 | tmpTestLineStr = testFileReader.readLine(); 66 | String[] tmpTestTokens = tmpTestLineStr.split("\t"); 67 | if (tmpTestTokens.length != 4) { 68 | // System.err.println("Error, in the first line of text file:" + tmpTestLineStr); 69 | // System.exit(0); 70 | System.out.println(tmpTestTokens[0]+"\t"+tmpTestTokens[tmpTestTokens.length-1]); //print the first and last fields; a fixed index here could run out of bounds 71 | } 72 | readTestLineNum++; 73 | tmpNewTestStr = tmpTestTokens[0]+"\t"+tmpTestTokens[1]+"\t"+tmpTestTokens[2]+"\t"+tmpPredictLabel; 74 | Result2Txt(submitFile_str, tmpNewTestStr); 75 | } 76 | predictResultReader.close(); 77 | } 78 | testFileReader.close(); 79 | if (readPredictLineNum != readTestLineNum) { 80 | System.err.println("Error, readPredictLineNum:" + readPredictLineNum 81 | + ", readTestLineNum:"+ readTestLineNum); 82 | System.exit(0); 83 | } 84 | 85 | } catch (Exception e) { 86 | System.out.println("Something went wrong while reading the content of the file"); 87 | 
e.printStackTrace(); 88 | } 89 | return; 90 | 91 | } 92 | 93 | public static void Result2Txt(String file, String txt) { 94 | try { 95 | BufferedWriter os = new BufferedWriter(new OutputStreamWriter( 96 | new FileOutputStream(new File(file),true), "UTF-8")); 97 | os.write(txt + "\n"); 98 | os.close(); 99 | } catch (Exception e) { 100 | e.printStackTrace(); 101 | } 102 | } 103 | 104 | public static void main(String[] args) { 105 | //*********** Test area ************ 106 | System.out.println("test"); 107 | 108 | //*********** Test area ************ 109 | 110 | String dataPathStr="./../../data/"; 111 | String resultsPathStr="./../../data/RefineData/Results/"; 112 | File resultsPathFile = new File(resultsPathStr); 113 | if (!resultsPathFile.exists()) resultsPathFile.mkdir(); 114 | 115 | //Generate the submission file for the test set 116 | String predictResult_pre=resultsPathStr+"predict_";//resultsPathStr+topic; 117 | //TODO: this is the file name of the test set 118 | String testFile_str = dataPathStr+"NLPCC2016_Stance_Detection_Task_A_Testdata.txt"; 119 | String submitFile_str = dataPathStr +"TaskA_prediction_submission.txt"; 120 | ArrayList<String> topic2filesuffix = new ArrayList<String>(); 121 | topic2filesuffix.add("bianpao"); 122 | topic2filesuffix.add("iphonese"); 123 | topic2filesuffix.add("fankong"); 124 | topic2filesuffix.add("ertai"); 125 | topic2filesuffix.add("jinmo"); 126 | 127 | long readstart=System.currentTimeMillis(); 128 | dataAnalysis(predictResult_pre, testFile_str, submitFile_str, topic2filesuffix); 129 | long readend=System.currentTimeMillis(); 130 | System.out.println((readend-readstart)/1000.0+"s were spent processing the submission."); 131 | } 132 | 133 | } 134 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/Tools/CharacterAnalyzer.java: -------------------------------------------------------------------------------- 1 | package Tools; 2 | 3 | /** 4 | * @author cBrain 5 | * 6 | */ 7 | public class CharacterAnalyzer { 8 | // Java Unicode character blocks 9 | // http://doc.java.sun.com/DocWeb/api/java.lang.Character.UnicodeBlock 10 | // http://hi.baidu.com/qing419925094/item/bb32a7b915273871244b09c9 11 | public static final boolean isChinese(char c) { 12 | return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS; 13 | } 14 | 15 | public static final boolean isNumber(char c) 16 | { 17 | return (c<='9')&&(c>='0'); 18 | } 19 | 20 | public static final boolean isEnglish(char c) 21 | { 22 | return ((c<='z')&&(c>='a'))||((c<='Z')&&(c>='A')); 23 | } 24 | 25 | public static final boolean isWhiteSpace(char c) 26 | { 27 | return c == ' '; 28 | } 29 | 30 | public static final boolean isRegularSymbol(char c) 31 | { 32 | return (c=='*')||(c=='?'); 33 | } 34 | 35 | //Check whether the character is Chinese/English/digit/common punctuation; filter out all other characters 36 | public static final boolean isGoodCharacter(char c) 37 | { 38 | if(filterUnicode(c)) 39 | return false; 40 | if(isChinese(c)) 41 | return true; 42 | if(isWhiteSpace(c)) 43 | return true; 44 | // if (isSpecialSymbol(c)) //filter the special symbol 45 | // return false; 46 | if(isNumber(c)) 47 | return true; 48 | if(isEnglish(c)) 49 | return true; 50 | // if(isSolrKeywords(c)) //filter the solr keywords 51 | // return false; 52 | // if(isBiaoDian(c)) 53 | // return true; 54 | // if(isRegularSymbol(c)) 55 | // return true; 56 | // if(isOtherBiaoDian(c)) 57 | // return true; 58 | return false; 59 | } 60 | 61 | private static boolean filterUnicode(char c) { 62 | //Method 1: use UnicodeBlock directly 63 | Character.UnicodeBlock ub = Character.UnicodeBlock.of(c); 64 | if (ub == 
Character.UnicodeBlock.GENERAL_PUNCTUATION 65 | || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION) { 66 | return true; 67 | } 68 | //Method 2: strip characters by Unicode code-point ranges 69 | //ASCII punctuation 70 | if((c>=0x0020)&&(c<=0x002F)||(c>=0x003A)&&(c<=0x0040)||(c>=0x005B)&&(c<=0x0060)||(c>=0x007B)&&(c<=0X007E)) 71 | return true; 72 | //Latin-1 Supplement punctuation 73 | if((c>=0x00A0)&&(c<=0x00BF)) 74 | return true; 75 | //General punctuation 76 | if((c>=0x2000)&&(c<=0x206F)) 77 | return true; 78 | //Supplemental punctuation 79 | if((c>=0x2E00)&&(c<=0x2E7F)) 80 | return true; 81 | //CJK symbols and punctuation 82 | if((c>=0x3000)&&(c<=0x303F)) 83 | return true; 84 | //Fullwidth ASCII punctuation 85 | if((c>=0xFF01)&&(c<=0xFF0F)||(c>=0xFF1A)&&(c<=0xFF20)||(c>=0xFF3B)&&(c<=0xFF40)||(c>=0xFF5B)&&(c<=0xFF5E)) 86 | return true; 87 | //Vertical-form punctuation 88 | if((c>=0xFE10)&&(c<=0xFE1F)) 89 | return true; 90 | 91 | //Method 3: filter via a dictionary to stay consistent with the front end 92 | // if (SmsBase.getSymbolSet().contains((int)c)) return true; 93 | 94 | return false; 95 | } 96 | 97 | private static boolean isSpecialSymbol(char c) { 98 | //"`~!@#$%^&*()-_=+[]{}|;:',<.>/?\ 99 | // if(c=='"'||c=='`'||c=='~'||c=='!'||c=='@'||c=='#'||c=='$'|| 100 | // c=='%'||c=='^'||c=='&'||c=='*'||c=='('||c==')'||c=='-'|| 101 | // c=='_'||c=='='||c=='+'||c=='['||c==']'||c=='{'||c=='}'|| 102 | // c=='|'||c==';'||c==':'||c=='\''||c==','||c=='<'||c=='.'|| 103 | // c=='>'||c=='/'||c=='?'||c=='\\') 104 | // return true; 105 | 106 | //ASCII punctuation 107 | if((c>=0x0020)&&(c<=0x002F)||(c>=0x003A)&&(c<=0x0040)||(c>=0x005B)&&(c<=0x0060)||(c>=0x007B)&&(c<=0X007E)) 108 | return true; 109 | //Fullwidth ASCII punctuation 110 | if((c>=0xFF01)&&(c<=0xFF0F)||(c>=0xFF1A)&&(c<=0xFF20)||(c>=0xFF3B)&&(c<=0xFF40)||(c>=0xFF5B)&&(c<=0xFF5E)) 111 | return true; 112 | return false; 113 | } 114 | 115 | private static boolean isSolrKeywords(char c) { 116 | if(c==':'||c=='?'||c=='*'||c=='"'||c=='('||c==')') 117 | return true; 118 | return false; 119 | } 120 | 121 | public static final boolean isOtherBiaoDian(char c) 122 | { 123 | return (c=='\"')||(c=='$')||(c==':')||(c=='|')||(c==','); 124 | } 125 | 126 | public static final boolean isBiaoDian(char c) 127 | { 128 | Character.UnicodeBlock ub = Character.UnicodeBlock.of(c); 129 | if (ub == Character.UnicodeBlock.GENERAL_PUNCTUATION 130 | || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION) { 131 | return true; 132 | } 133 | return false; 134 | } 135 | /** 136 | * @param c 137 | * @return 138 | */ 139 | public static final boolean isSymbol(char c) 140 | { 141 | // Character.UnicodeBlock ub = Character.UnicodeBlock.of(c); 142 | // if (ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS 143 | // || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS 144 | // || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A 145 | // || ub == Character.UnicodeBlock.GENERAL_PUNCTUATION 146 | // || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION 147 | // || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) { 148 | // return true; 149 | // } 150 | // return false; 151 | if(c=='-'||c=='/'||c=='_'||c==':') 152 | return true; 153 | return false; 154 | } 155 | public static void main (String[] args){ 156 | // String ss = "、!@#$%^&*北京()!@¥%……&*()明天!@#¥%……&×"; 157 | String ss ="℃$¤¢£‰§№☆★○●◎◇◆□■△▲※→←↑↓〓ⅰⅱ我ⅲⅳⅴⅵⅶ北 京ⅷⅸⅹ⒈⒉,你呢?⒊⒋⒌⒍⒎⒏"; 158 | StringBuffer str = new StringBuffer(); 159 | char test1 = 0x25; 160 | System.out.println(test1); 161 | char[] ch = ss.toCharArray(); 162 | for (int i = 0; i < ch.length; i++) { 163 | int test = ch[i]; 164 | System.out.println(ch[i]+" - 0x"+Integer.toHexString(test)); 165 | Character.UnicodeBlock ub = 
Character.UnicodeBlock.of(ch[i]); 166 | if (!isGoodCharacter(ch[i])) 167 | continue; 168 | str.append(ch[i]); 169 | } 170 | System.out.println(ss); 171 | System.out.println(str.toString()); 172 | } 173 | 174 | private static boolean filterSympoTest(char c) { 175 | //Method 1: use UnicodeBlock directly 176 | // Character.UnicodeBlock ub = Character.UnicodeBlock.of(c); 177 | // if (ub == Character.UnicodeBlock.GENERAL_PUNCTUATION 178 | // || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION) { 179 | // return true; 180 | // } 181 | //Method 2: strip characters by Unicode code-point ranges 182 | if (isWhiteSpace(c)) return false; 183 | //ASCII punctuation 184 | if((c>=0x0020)&&(c<=0x002F)||(c>=0x003A)&&(c<=0x0040)||(c>=0x005B)&&(c<=0x0060)||(c>=0x007B)&&(c<=0X007E)) 185 | return true; 186 | //Latin-1 Supplement punctuation 187 | if((c>=0x00A0)&&(c<=0x00BF)) 188 | return true; 189 | //General punctuation 190 | if((c>=0x2000)&&(c<=0x206F)) 191 | return true; 192 | //Supplemental punctuation 193 | if((c>=0x2E00)&&(c<=0x2E7F)) 194 | return true; 195 | //CJK symbols and punctuation 196 | if((c>=0x3000)&&(c<=0x303F)) 197 | return true; 198 | //Fullwidth ASCII punctuation 199 | if((c>=0xFF01)&&(c<=0xFF0F)||(c>=0xFF1A)&&(c<=0xFF20)||(c>=0xFF3B)&&(c<=0xFF40)||(c>=0xFF5B)&&(c<=0xFF5E)) 200 | return true; 201 | //Vertical-form punctuation 202 | if((c>=0xFE10)&&(c<=0xFE1F)) 203 | return true; 204 | 205 | return false; 206 | } 207 | } -------------------------------------------------------------------------------- /code/Feature_process_java/src/Tools/ConvertUnicode.java: -------------------------------------------------------------------------------- 1 | package Tools; 2 | 3 | public class ConvertUnicode { 4 | public static String convertUnicode(String ori){ 5 | char aChar; 6 | int len = ori.length(); 7 | StringBuffer outBuffer = new StringBuffer(len); 8 | for (int x = 0; x < len;) { 9 | aChar = ori.charAt(x++); 10 | if (aChar == '\\') { 11 | aChar = ori.charAt(x++); 12 | if (aChar == 'u') { 13 | // Read the xxxx 14 | int value = 0; 15 | for (int i = 0; i < 4; i++) { 16 | aChar = ori.charAt(x++); 17 | switch (aChar) { 18 | case '0': 19 | case '1': 20 | case '2': 21 | case '3': 22 | case '4': 23 | case '5': 24 | case '6': 25 | case '7': 26 | case '8': 27 | case '9': 28 | value = (value << 4) + aChar - '0'; 29 | break; 30 | case 'a': 31 | case 'b': 32 | case 'c': 33 | case 'd': 34 | case 'e': 35 | case 'f': 36 | value = (value << 4) + 10 + aChar - 'a'; 37 | break; 38 | case 'A': 39 | case 'B': 40 | case 'C': 41 | case 'D': 42 | case 'E': 43 | case 'F': 44 | value = (value << 4) + 10 + aChar - 'A'; 45 | break; 46 | default: 47 | throw new IllegalArgumentException( 48 | "Malformed \\uxxxx encoding."); 49 | } 50 | } 51 | outBuffer.append((char) value); 52 | } else { 53 | if (aChar == 't') 54 | aChar = '\t'; 55 | else if (aChar == 'r') 56 | aChar = '\r'; 57 | else if (aChar == 'n') 58 | aChar = '\n'; 59 | else if (aChar == 'f') 60 | aChar = '\f'; 61 | outBuffer.append(aChar); 62 | } 63 | } else 64 | outBuffer.append(aChar); 65 | 66 | } 67 | return outBuffer.toString(); 68 | } 69 | } 70 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/Tools/PreProcessText.java: -------------------------------------------------------------------------------- 1 | package Tools; 2 | import java.io.IOException; 3 | 4 | import Tools.StringAnalyzer; 5 | import TypeTrans.Full2Half; 6 | 7 | public class PreProcessText { 8 | static public String preProcess4Task1(String inputStr, String tmpRelationP, String tmpEntityS, String tmpEntityO) throws IOException{ 9 | if (inputStr.length()<1) return inputStr; 10 | //Force the entity words to be separated out by spaces 11 | 
if (tmpRelationP!=null) { 12 | if (tmpRelationP.length()==4) { //For 4-character relations (e.g. 传闻不和, 同为校花, 昔日情敌, 绯闻女友) the head word comes last 13 | if (!inputStr.contains(tmpRelationP)) { 14 | tmpRelationP = tmpRelationP.substring(2); 15 | } 16 | } 17 | inputStr = inputStr.replaceAll(tmpRelationP, " "+tmpRelationP+" "); 18 | } 19 | inputStr = inputStr.replaceAll(tmpEntityS, " "+tmpEntityS+" "); 20 | inputStr = inputStr.replaceAll(tmpEntityO, " "+tmpEntityO+" "); 21 | inputStr=Full2Half.ToDBC(inputStr);//Convert fullwidth characters to halfwidth 22 | inputStr=inputStr.toLowerCase();//Lowercase all letters 23 | inputStr=inputStr.replaceAll("\\s+", " ");//Collapse multiple spaces into one 24 | inputStr = StringAnalyzer.extractGoodCharacter(inputStr); //Remove all special characters 25 | // without POS tags / with POS tags 26 | inputStr = WordSegment_Ansj.splitWord(inputStr)+"\t"+WordSegment_Ansj.splitWordwithTag(inputStr);//Word segmentation 27 | 28 | return inputStr; 29 | } 30 | static public String preProcess4Task2(String inputStr) throws IOException{ 31 | if (inputStr.length()<1) return inputStr; 32 | inputStr=Full2Half.ToDBC(inputStr);//Convert fullwidth characters to halfwidth 33 | inputStr=inputStr.toLowerCase();//Lowercase all letters 34 | inputStr=inputStr.replaceAll("\\s+", " ");//Collapse multiple spaces into one 35 | inputStr = StringAnalyzer.extractGoodCharacter(inputStr); //Remove all special characters 36 | // without POS tags 37 | inputStr = WordSegment_Ansj.splitWordwithOutTag4Task2(inputStr);//Word segmentation 38 | 39 | return inputStr.trim(); 40 | } 41 | 42 | static public String preProcess4NLPCC2016(String inputStr, String topic) throws IOException{ 43 | if (inputStr.length()<1) return inputStr; 44 | inputStr=inputStr.replaceAll("#"+topic+"#", " ");//(1) Strip the hashtag marker 45 | inputStr=inputStr.replaceAll("http://t.cn/(.{7})", " ");//(2) Strip short links of the form http://t.cn/[7 characters] 46 | //(3) Strip some special share markers, such as: 47 | //(分享[up to 10 chars]) (分享[up to 10 chars]) 【分享[up to 10 chars]】 48 | //(来自[up to 10 chars]) (来自[up to 10 chars]) 【来自[up to 10 chars]】 49 | inputStr=inputStr.replaceAll("(分享(.{0,9}))", " "); 50 | inputStr=inputStr.replaceAll("\\(分享(.{0,9})\\)", " "); 51 | inputStr=inputStr.replaceAll("【分享(.{0,9})】", " "); 52 | 53 | inputStr=inputStr.replaceAll("(来自(.{0,9}))", " "); 54 | inputStr=inputStr.replaceAll("\\(来自(.{0,9})\\)", " "); 55 | inputStr=inputStr.replaceAll("【来自(.{0,9})】", " "); 56 | //(4) Strip all @mentions, e.g. @腾讯新闻客户端: @[up to 10 chars] followed by another @, a space, or a line break 57 | String[] inputStr_sub = inputStr.split("\\s+"); 58 | StringBuffer inputStr_bf = new StringBuffer(); 59 | for (String tmpinputStr_sub:inputStr_sub) { 60 | tmpinputStr_sub = tmpinputStr_sub+""; 61 | tmpinputStr_sub = tmpinputStr_sub.replaceAll("@(.{0,9})@", "@"); 62 | tmpinputStr_sub = tmpinputStr_sub.replaceAll("@(.{0,9}) ", " "); 63 | tmpinputStr_sub = tmpinputStr_sub.replaceAll("@(.{0,9})", " "); 64 | tmpinputStr_sub = tmpinputStr_sub.replaceAll("", ""); 65 | inputStr_bf.append(tmpinputStr_sub); 66 | inputStr_bf.append(" "); 67 | } 68 | inputStr = inputStr_bf.toString().trim(); 69 | inputStr_bf = null; 70 | 71 | inputStr=Full2Half.ToDBC(inputStr);//(5) Convert fullwidth characters to halfwidth 72 | inputStr=inputStr.toLowerCase();//(6) Lowercase all letters 73 | inputStr=inputStr.replaceAll("\\s+", " ");//(7) Collapse multiple spaces into one 74 | inputStr = StringAnalyzer.extractGoodCharacter(inputStr); //(8) Remove all special characters 75 | inputStr = WordSegment_Ansj.splitWord(inputStr);//(9) Word segmentation 76 | 77 | return inputStr.trim(); 78 | } 79 | 80 | private static boolean isITSuffixSpamInfo(String tmpQuerySnippet, String tmpEntityS, String tmpEntityO) { 81 | if ((tmpQuerySnippet.contains(tmpEntityS)||tmpQuerySnippet.contains(tmpEntityO)) 82 | &&tmpQuerySnippet.length()>4) { 83 | return false; 84 | }else { 85 | return true; 86 | } 87 | } 88 | } 89 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/Tools/StringAnalyzer.java: 
-------------------------------------------------------------------------------- 1 | package Tools; 2 | 3 | public class StringAnalyzer { 4 | 5 | 6 | private static String extractSendTime(String ss) { 7 | String[] tokens = ss.split("\\{\\$\\}"); //escape the braces so the literal "{$}" delimiter is a valid regex 8 | return tokens[0]; 9 | } 10 | private static String removeShortTerm(String ss){ 11 | StringBuffer sb = new StringBuffer(); 12 | String[] tokens = ss.split(" "); 13 | for(int i = 0;i<tokens.length;i++) 14 | { 15 | if(tokens[i].length()>1) 16 | { 17 | sb.append(tokens[i]); 18 | sb.append(" "); 19 | } 20 | } 21 | return sb.toString(); 22 | } 23 | public static String extractChineseCharacterWithoutSpace(String ss) { 24 | Boolean lastCharTag = true; 25 | StringBuffer str = new StringBuffer(); 26 | char[] ch = ss.toCharArray(); 27 | for (int i = 0; i < ch.length; i++) { 28 | if(CharacterAnalyzer.isChinese(ch[i])) 29 | { 30 | str.append(ch[i]); 31 | } 32 | } 33 | return str.toString(); 34 | } 35 | 36 | public static String extractGoodCharacter(String ss){ 37 | if(ss == null) 38 | return null; 39 | Boolean lastCharTag = true; 40 | StringBuffer str = new StringBuffer(); 41 | char[] ch = ss.toCharArray(); 42 | for (int i = 0; i < ch.length; i++) { 43 | if(CharacterAnalyzer.isGoodCharacter(ch[i])){ 44 | str.append(ch[i]); 45 | }else { 46 | str.append(' '); 47 | } 48 | } 49 | return str.toString().replaceAll("\\s+", " ").trim(); 50 | } 51 | 52 | public static String extractChineseCharacter(String ss) { 53 | Boolean lastCharTag = true; 54 | StringBuffer str = new StringBuffer(); 55 | char[] ch = ss.toCharArray(); 56 | for (int i = 0; i < ch.length; i++) { 57 | if(CharacterAnalyzer.isChinese(ch[i])) 58 | { 59 | if(lastCharTag) 60 | { 61 | str.append(ch[i]); 62 | } 63 | else 64 | { 65 | str.append(" "); 66 | str.append(ch[i]); 67 | lastCharTag = true; 68 | } 69 | } 70 | else 71 | { 72 | lastCharTag = false; 73 | } 74 | } 75 | //return removeShortTerm(str.toString()); 76 | if(str.toString().length() == 0) 77 | { 78 | return ""; 79 | } 80 | 81 | return str.toString().toLowerCase(); 82 | } 83 | 84 | public static String extractEnglishCharacter(String ss){ 85 | Boolean lastCharTag = true; 86 | StringBuffer str = new StringBuffer(); 87 | char[] ch = ss.toCharArray(); 88 | for (int i = 0; i < ch.length; i++) { 89 | if(CharacterAnalyzer.isEnglish(ch[i])){ 90 | if(lastCharTag){ 91 | str.append(ch[i]); 92 | } 93 | else{ 94 | str.append(" "); 95 | str.append(ch[i]); 96 | lastCharTag = true; 97 | } 98 | } 99 | else{ 100 | lastCharTag = false; 101 | } 102 | } 103 | if(str.toString().length() == 0){ 104 | return null; 105 | } 106 | 107 | return str.toString().toLowerCase(); 108 | } 109 | 110 | public static Boolean isNumberString(String ss){ 111 | char[] ch = ss.toCharArray(); 112 | for (int i = 0; i < ch.length; i++) { 113 | if(!CharacterAnalyzer.isNumber(ch[i])) 114 | return false; 115 | } 116 | // int telephoneNumberLength = 14; 117 | // if(ss.length()> telephoneNumberLength || ss.length() < 10)//TODO 118 | // return false; 119 | // 120 | return true; 121 | } 122 | 123 | public static Boolean isNumberString2(String ss){ 124 | char[] ch = ss.toCharArray(); 125 | for (int i = 1; i < ch.length-1; i++) { 126 | if(!CharacterAnalyzer.isNumber(ch[i])) 127 | return false; 128 | } 129 | int telephoneNumberLength = 14; 130 | if(ss.length()> telephoneNumberLength || ss.length() < 10) 131 | return false; 132 | 133 | return true; 134 | } 135 | 136 | 137 | 138 | public static String extractNumberCharacter(String ss){ 139 | Boolean lastCharTag = true; 140 | StringBuffer str = new StringBuffer(); 141 | char[] ch = ss.toCharArray(); 142 | for 
(int i = 0; i < ch.length; i++) { 143 | //if(characterAnalyzer.isNumber(ch[i])||characterAnalyzer.isSymbol(ch[i])) 144 | if(CharacterAnalyzer.isNumber(ch[i])){ 145 | if(lastCharTag){ 146 | str.append(ch[i]); 147 | } 148 | else{ 149 | str.append(" "); 150 | str.append(ch[i]); 151 | lastCharTag = true; 152 | } 153 | } 154 | else{ 155 | lastCharTag = false; 156 | } 157 | } 158 | if(str.toString().length() == 0){ 159 | return null; 160 | } 161 | return str.toString(); 162 | } 163 | 164 | public static void main(String args[]){ 165 | String testString = "丨~~@昨天喝12了[酒],今天|丨血压高。 大事没办了1 6,小-事耽误了。 横批是:他阿了吊嬳!!!"; 166 | System.out.println(extractGoodCharacter(testString)); 167 | System.out.println(extractEnglishCharacter(testString)); 168 | System.out.println(extractNumberCharacter(testString)); 169 | } 170 | } 171 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/Tools/WordSegment_Ansj.java: -------------------------------------------------------------------------------- 1 | package Tools; 2 | 3 | import java.io.IOException; 4 | import java.io.StringReader; 5 | import java.util.ArrayList; 6 | import java.util.List; 7 | 8 | import org.ansj.domain.Term; 9 | import org.ansj.splitWord.Analysis; 10 | import org.ansj.splitWord.analysis.ToAnalysis; 11 | import org.ansj.util.recognition.NatureRecognition; 12 | 13 | //Reference: https://github.com/NLPchina/ansj_seg/wiki/%E5%88%86%E8%AF%8D%E4%BD%BF%E7%94%A8demo 14 | public class WordSegment_Ansj { 15 | 16 | public static String splitWord(String shortDoc) throws IOException { 17 | ArrayList<String> termsOnDoc = new ArrayList<String>(); 18 | List<Term> all = new ArrayList<Term>(); 19 | Analysis udf = new ToAnalysis(new StringReader(shortDoc)); 20 | Term term = null; 21 | while ((term = udf.next()) != null) { 22 | String tempTerm = term.toString().trim(); 23 | if (tempTerm.length()>0){ 24 | termsOnDoc.add(tempTerm); 25 | } 26 | } 27 | //Append the segmentation result of the sentence 28 | StringBuffer tmpContentBuffer = new StringBuffer(); 29 | for (int i = 0; i < termsOnDoc.size(); i++) { 30 | if (i!=0) { 31 | tmpContentBuffer.append(" "); 32 | } 33 | tmpContentBuffer.append(termsOnDoc.get(i)); 34 | } 35 | return tmpContentBuffer.toString(); 36 | } 37 | public static String splitWordwithTag(String shortDoc) throws IOException { 38 | ArrayList<String> termsOnDoc = new ArrayList<String>(); 39 | List<Term> allTerms = new ArrayList<Term>(); 40 | Analysis udf = new ToAnalysis(new StringReader(shortDoc)); 41 | Term term = null; 42 | while ((term = udf.next()) != null) { 43 | String tempTerm = term.toString().trim(); 44 | if (tempTerm.length()>0){ 45 | allTerms.add(term); 46 | // termsOnDoc.add(tempTerm); 47 | } 48 | } 49 | new NatureRecognition(allTerms).recogntion(); 50 | for (int i = 0; i < allTerms.size(); i++) { 51 | termsOnDoc.add(allTerms.get(i).toString()); 52 | } 53 | //Append the segmentation result of the sentence 54 | StringBuffer tmpContentBuffer = new StringBuffer(); 55 | for (int i = 0; i < termsOnDoc.size(); i++) { 56 | if (i!=0) { 57 | tmpContentBuffer.append(" "); 58 | } 59 | tmpContentBuffer.append(termsOnDoc.get(i)); 60 | } 61 | return tmpContentBuffer.toString(); 62 | } 63 | public static String splitWordwithOutTag4Task2(String shortDoc) throws IOException { 64 | ArrayList<String> termsOnDoc = new ArrayList<String>(); 65 | List<Term> allTerms = new ArrayList<Term>(); 66 | Analysis udf = new ToAnalysis(new StringReader(shortDoc)); 67 | Term term = null; 68 | while ((term = udf.next()) != null) { 69 | String tempTerm = term.toString().trim(); 70 | if (tempTerm.length()>0){ 71 | allTerms.add(term); 72 | // termsOnDoc.add(tempTerm); 73 | } 74 | } 75 | 
new NatureRecognition(allTerms).recogntion(); 76 | for (int i = 0; i < allTerms.size(); i++) { 77 | termsOnDoc.add(allTerms.get(i).toString()); 78 | } 79 | //Append the segmentation result of the sentence (stripping the POS tag from each term) 80 | StringBuffer tmpContentBuffer = new StringBuffer(); 81 | for (int i = 0; i < termsOnDoc.size(); i++) { 82 | String[] tmpTermArrays = termsOnDoc.get(i).split("/"); 83 | if (tmpTermArrays.length==2){ 84 | String tmpWordStr = tmpTermArrays[0]; 85 | termsOnDoc.set(i, tmpWordStr); 86 | } 87 | if (i!=0) { 88 | tmpContentBuffer.append(" "); 89 | } 90 | tmpContentBuffer.append(termsOnDoc.get(i)); 91 | } 92 | return tmpContentBuffer.toString().trim(); 93 | } 94 | } 95 | 96 | 97 | -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/TaskA_Feature_ranking.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/TaskA_Feature_ranking.m -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/main_TaskA_Feature_ranking.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/main_TaskA_Feature_ranking.m -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/normalize.m: -------------------------------------------------------------------------------- 1 | function Xn = normalize(X) 2 | % Normalize all feature vectors to unit length 3 | 4 | n = size(X,1); % the number of documents 5 | Xt = X'; 6 | l = sqrt(sum(Xt.^2)); % the row vector length (L2 norm) 7 | Ni = sparse(1:n,1:n,l); 8 | Ni(Ni>0) = 1./Ni(Ni>0); 9 | Xn = (Xt*Ni)'; 10 | 11 | end 12 | -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/predict.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/predict.mexw64 -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/svmpredict.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/svmpredict.mexw64 -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/svmtrain.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/svmtrain.mexw64 -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/tf_idf.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/tf_idf.m -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/train.mexw64: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/train.mexw64 -------------------------------------------------------------------------------- /code/LDA/README.md: -------------------------------------------------------------------------------- 1 | # Latent Dirichlet Allocation (LDA) 2 | 3 | This project is modified from the following open-source implementation, GibbsLDA: 4 | 5 | http://jgibblda.sourceforge.net/ 6 | 7 | The usage can be found here: http://jacoxu.com/?p=648 8 | -------------------------------------------------------------------------------- /code/LDA/jgibblda.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/LDA/jgibblda.jar -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/LE/LapEig.m: -------------------------------------------------------------------------------- 1 | function Y = LapEig(X, options, d) 2 | % Laplacian Eigenmap 3 | W = constructW(X,options); 4 | 5 | D = diag(sum(W)); 6 | ii = find(diag(D)==0); 7 | if ~isempty(ii) % give isolated nodes a small positive degree 8 | for i=1:length(ii) 9 | D(ii(i),ii(i)) = 0.01; 10 | end 11 | end 12 | D = sparse(D); 13 | %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 14 | L = D - W; 15 | 16 | options = []; 17 | options.disp = 0; 18 | [eigVecs,eigVals] = eigs(L,D,1+d,'sa',options); 19 | Y = eigVecs(:,2:end); % drop the trivial constant eigenvector 20 | 21 | end 22 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/LPI/LGE.m: -------------------------------------------------------------------------------- 1 | function [eigvector, eigvalue] = LGE(W, D, options, data) 2 | % LGE: Linear Graph Embedding 3 | % 4 | % [eigvector, eigvalue] = LGE(W, D, options, data) 5 | % 6 | % Input: 7 | % data - data matrix. Each row vector of data is a 8 | % sample vector. 9 | % W - Affinity graph matrix. 10 | % D - Constraint graph matrix. 11 | % LGE solves the optimization problem of 12 | % a* = argmax (a'*data'*W*data*a)/(a'*data'*D*data*a) 13 | % Default: D = I 14 | % 15 | % options - Struct value in Matlab. The fields in options 16 | % that can be set: 17 | % 18 | % ReducedDim - The dimensionality of the reduced 19 | % subspace. If 0, all the dimensions 20 | % will be kept. Default is 30. 21 | % 22 | % Regu - 1: regularized solution, 23 | % a* = argmax (a'*data'*W*data*a)/(a'*(data'*D*data + ReguAlpha*I)*a) 24 | % 0: solve the singularity problem by SVD (PCA) 25 | % Default: 0 26 | % 27 | % ReguAlpha - The regularization parameter. Valid 28 | % when Regu==1. Default value is 0.1. 29 | % 30 | % ReguType - 'Ridge': Tikhonov regularization 31 | % 'Custom': User provided 32 | % regularization matrix 33 | % Default: 'Ridge' 34 | % regularizerR - (nFea x nFea) regularization 35 | % matrix which should be provided 36 | % if ReguType is 'Custom'. nFea is 37 | % the feature number of data 38 | % matrix 39 | % 40 | % PCARatio - The percentage of principal 41 | % component kept in the PCA 42 | % step. The percentage is 43 | % calculated based on the 44 | % eigenvalue. Default is 1 45 | % (100%, all the non-zero 46 | % eigenvalues will be kept). 47 | % If PCARatio > 1, the PCA step 48 | % will keep exactly PCARatio principal 49 | % components (does not exceed the 50 | % exact number of non-zero components). 
51 | % 52 | % 53 | % Output: 54 | % eigvector - Each column is an embedding function, for a new 55 | % sample vector (row vector) x, y = x*eigvector 56 | % will be the embedding result of x. 57 | % eigvalue - The sorted eigvalue of the eigen-problem. 58 | % elapse - Time spent on different steps 59 | % 60 | % 61 | % 62 | % Examples: 63 | % 64 | % See also LPP, NPE, IsoProjection, LSDA. 65 | % 66 | % 67 | %Reference: 68 | % 69 | % 1. Deng Cai, Xiaofei He, Jiawei Han, "Spectral Regression for Efficient 70 | % Regularized Subspace Learning", IEEE International Conference on 71 | % Computer Vision (ICCV), Rio de Janeiro, Brazil, Oct. 2007. 72 | % 73 | % 2. Deng Cai, Xiaofei He, Yuxiao Hu, Jiawei Han, and Thomas Huang, 74 | % "Learning a Spatially Smooth Subspace for Face Recognition", CVPR'2007 75 | % 76 | % 3. Deng Cai, Xiaofei He, Jiawei Han, "Spectral Regression: A Unified 77 | % Subspace Learning Framework for Content-Based Image Retrieval", ACM 78 | % Multimedia 2007, Augsburg, Germany, Sep. 2007. 79 | % 80 | % 4. Deng Cai, "Spectral Regression: A Regression Framework for 81 | % Efficient Regularized Subspace Learning", PhD Thesis, Department of 82 | % Computer Science, UIUC, 2009. 83 | % 84 | % version 3.0 --Dec/2011 85 | % version 2.1 --June/2007 86 | % version 2.0 --May/2007 87 | % version 1.0 --Sep/2006 88 | % 89 | % Written by Deng Cai (dengcai AT gmail.com) 90 | % 91 | 92 | MAX_MATRIX_SIZE = 1600; % You can change this number according to your machine's computational power 93 | EIGVECTOR_RATIO = 0.1; % You can change this number according to your machine's computational power 94 | 95 | if (~exist('options','var')) 96 | options = []; 97 | end 98 | 99 | ReducedDim = 30; 100 | if isfield(options,'ReducedDim') 101 | ReducedDim = options.ReducedDim; 102 | end 103 | 104 | if ~isfield(options,'Regu') || ~options.Regu 105 | bPCA = 1; 106 | if ~isfield(options,'PCARatio') 107 | options.PCARatio = 1; 108 | end 109 | else 110 | bPCA = 0; 111 | if ~isfield(options,'ReguType') 112 | options.ReguType = 'Ridge'; 113 | end 114 | if ~isfield(options,'ReguAlpha') 115 | options.ReguAlpha = 0.1; 116 | end 117 | end 118 | 119 | bD = 1; 120 | if ~exist('D','var') || isempty(D) 121 | bD = 0; 122 | end 123 | 124 | 125 | [nSmp,nFea] = size(data); 126 | if size(W,1) ~= nSmp 127 | error('W and data mismatch!'); 128 | end 129 | if bD && (size(D,1) ~= nSmp) 130 | error('D and data mismatch!'); 131 | end 132 | 133 | bChol = 0; 134 | if bPCA && (nSmp > nFea) && (options.PCARatio >= 1) 135 | if bD 136 | DPrime = data'*D*data; 137 | else 138 | DPrime = data'*data; 139 | end 140 | DPrime = full(DPrime); 141 | DPrime = max(DPrime,DPrime'); 142 | [R,p] = chol(DPrime); 143 | if p == 0 144 | bPCA = 0; 145 | bChol = 1; 146 | end 147 | end 148 | 149 | %====================================== 150 | % SVD 151 | %====================================== 152 | 153 | if bPCA 154 | [U, S, V] = mySVD(data); 155 | [U, S, V]=CutonRatio(U,S,V,options); 156 | eigvalue_PCA = full(diag(S)); 157 | if bD 158 | data = U*S; 159 | eigvector_PCA = V; 160 | 161 | DPrime = data'*D*data; 162 | DPrime = max(DPrime,DPrime'); 163 | else 164 | data = U; 165 | eigvector_PCA = V*spdiags(eigvalue_PCA.^-1,0,length(eigvalue_PCA),length(eigvalue_PCA)); 166 | end 167 | else 168 | if ~bChol 169 | if bD 170 | DPrime = data'*D*data; 171 | else 172 | DPrime = data'*data; 173 | end 174 | 175 | switch lower(options.ReguType) 176 | case {lower('Ridge')} 177 | if options.ReguAlpha > 0 178 | for i=1:size(DPrime,1) 179 | DPrime(i,i) = DPrime(i,i) + options.ReguAlpha; 180 | end 
181 | end 182 | case {lower('Tensor')} 183 | if options.ReguAlpha > 0 184 | DPrime = DPrime + options.ReguAlpha*options.regularizerR; 185 | end 186 | case {lower('Custom')} 187 | if options.ReguAlpha > 0 188 | DPrime = DPrime + options.ReguAlpha*options.regularizerR; 189 | end 190 | otherwise 191 | error('ReguType does not exist!'); 192 | end 193 | 194 | DPrime = max(DPrime,DPrime'); 195 | end 196 | end 197 | 198 | WPrime = data'*W*data; 199 | WPrime = max(WPrime,WPrime'); 200 | 201 | 202 | 203 | %====================================== 204 | % Generalized Eigen 205 | %====================================== 206 | 207 | dimMatrix = size(WPrime,2); 208 | 209 | if ReducedDim > dimMatrix 210 | ReducedDim = dimMatrix; 211 | end 212 | 213 | 214 | if isfield(options,'bEigs') 215 | bEigs = options.bEigs; 216 | else 217 | if (dimMatrix > MAX_MATRIX_SIZE) && (ReducedDim < dimMatrix*EIGVECTOR_RATIO) 218 | bEigs = 1; 219 | else 220 | bEigs = 0; 221 | end 222 | end 223 | 224 | 225 | if bEigs 226 | %disp('use eigs to speed up!'); 227 | option = struct('disp',0); 228 | if bPCA && ~bD 229 | [eigvector, eigvalue] = eigs(WPrime,ReducedDim,'la',option); 230 | else 231 | if bChol 232 | option.cholB = 1; 233 | [eigvector, eigvalue] = eigs(WPrime,R,ReducedDim,'la',option); 234 | else 235 | [eigvector, eigvalue] = eigs(WPrime,DPrime,ReducedDim,'la',option); 236 | end 237 | end 238 | eigvalue = diag(eigvalue); 239 | else 240 | if bPCA && ~bD 241 | [eigvector, eigvalue] = eig(WPrime); 242 | else 243 | [eigvector, eigvalue] = eig(WPrime,DPrime); 244 | end 245 | eigvalue = diag(eigvalue); 246 | 247 | [junk, index] = sort(-eigvalue); 248 | eigvalue = eigvalue(index); 249 | eigvector = eigvector(:,index); 250 | 251 | if ReducedDim < size(eigvector,2) 252 | eigvector = eigvector(:, 1:ReducedDim); 253 | eigvalue = eigvalue(1:ReducedDim); 254 | end 255 | end 256 | 257 | 258 | if bPCA 259 | eigvector = eigvector_PCA*eigvector; 260 | end 261 | 262 | for i = 1:size(eigvector,2) 263 | eigvector(:,i) = eigvector(:,i)./norm(eigvector(:,i)); 264 | end 265 | 266 | 267 | 268 | 269 | 270 | function [U, S, V]=CutonRatio(U,S,V,options) 271 | if ~isfield(options, 'PCARatio') 272 | options.PCARatio = 1; 273 | end 274 | 275 | eigvalue_PCA = full(diag(S)); 276 | if options.PCARatio > 1 277 | idx = options.PCARatio; 278 | if idx < length(eigvalue_PCA) 279 | U = U(:,1:idx); 280 | V = V(:,1:idx); 281 | S = S(1:idx,1:idx); 282 | end 283 | elseif options.PCARatio < 1 284 | sumEig = sum(eigvalue_PCA); 285 | sumEig = sumEig*options.PCARatio; 286 | sumNow = 0; 287 | for idx = 1:length(eigvalue_PCA) 288 | sumNow = sumNow + eigvalue_PCA(idx); 289 | if sumNow >= sumEig 290 | break; 291 | end 292 | end 293 | U = U(:,1:idx); 294 | V = V(:,1:idx); 295 | S = S(1:idx,1:idx); 296 | end 297 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/LPI/lpp.m: -------------------------------------------------------------------------------- 1 | function [eigvector, eigvalue] = lpp(W, options, data) 2 | % LPP: Locality Preserving Projections 3 | % 4 | % [eigvector, eigvalue] = LPP(W, options, data) 5 | % 6 | % Input: 7 | % data - Data matrix. Each row vector of data is a data point. 8 | % W - Affinity matrix. You can either call "constructW" 9 | % to construct the W, or construct it by yourself. 10 | % options - Struct value in Matlab. The fields in options 11 | % that can be set: 12 | % 13 | % Please see LGE.m for other options. 
14 | % 15 | % Output: 16 | % eigvector - Each column is an embedding function, for a new 17 | % data point (row vector) x, y = x*eigvector 18 | % will be the embedding result of x. 19 | % eigvalue - The sorted eigvalue of LPP eigen-problem. 20 | % 21 | % 22 | % Examples: 23 | % 24 | % fea = rand(50,70); 25 | % options = []; 26 | % options.Metric = 'Euclidean'; 27 | % options.NeighborMode = 'KNN'; 28 | % options.k = 5; 29 | % options.WeightMode = 'HeatKernel'; 30 | % options.t = 5; 31 | % W = constructW(fea,options); 32 | % options.PCARatio = 0.99 33 | % [eigvector, eigvalue] = LPP(W, options, fea); 34 | % Y = fea*eigvector; 35 | % 36 | % 37 | % fea = rand(50,70); 38 | % gnd = [ones(10,1);ones(15,1)*2;ones(10,1)*3;ones(15,1)*4]; 39 | % options = []; 40 | % options.Metric = 'Euclidean'; 41 | % options.NeighborMode = 'Supervised'; 42 | % options.gnd = gnd; 43 | % options.bLDA = 1; 44 | % W = constructW(fea,options); 45 | % options.PCARatio = 1; 46 | % [eigvector, eigvalue] = LPP(W, options, fea); 47 | % Y = fea*eigvector; 48 | % 49 | % 50 | % Note: After applying some simple algebra, the smallest eigenvalue problem: 51 | % data^T*L*data = \lambda data^T*D*data 52 | % is equivalent to the largest eigenvalue problem: 53 | % data^T*W*data = \beta data^T*D*data 54 | % where L=D-W; \lambda = 1 - \beta. 55 | % Thus, the smallest eigenvalue problem can be transformed to a largest 56 | % eigenvalue problem. Such tricks are adopted in this code for better 57 | % numerical precision in Matlab. 58 | % 59 | % 60 | % See also constructW, LGE 61 | % 62 | %Reference: 63 | % Xiaofei He, and Partha Niyogi, "Locality Preserving Projections" 64 | % Advances in Neural Information Processing Systems 16 (NIPS 2003), 65 | % Vancouver, Canada, 2003. 66 | % 67 | % Xiaofei He, Shuicheng Yan, Yuxiao Hu, Partha Niyogi, and Hong-Jiang 68 | % Zhang, "Face Recognition Using Laplacianfaces", IEEE PAMI, Vol. 27, No. 69 | % 3, Mar. 2005. 70 | % 71 | % Deng Cai, Xiaofei He and Jiawei Han, "Document Clustering Using 72 | % Locality Preserving Indexing" IEEE TKDE, Dec. 2005. 73 | % 74 | % Deng Cai, Xiaofei He and Jiawei Han, "Using Graph Model for Face Analysis", 75 | % Technical Report, UIUCDCS-R-2005-2636, UIUC, Sept. 2005 76 | % 77 | % Xiaofei He, "Locality Preserving Projections" 78 | % PhD's thesis, Computer Science Department, The University of Chicago, 79 | % 2005. 
80 | % 81 | % version 2.1 --June/2007 82 | % version 2.0 --May/2007 83 | % version 1.1 --Feb/2006 84 | % version 1.0 --April/2004 85 | % 86 | % Written by Deng Cai (dengcai2 AT cs.uiuc.edu) 87 | % 88 | 89 | 90 | if (~exist('options','var')) 91 | options = []; 92 | end 93 | 94 | [nSmp,nFea] = size(data); 95 | if size(W,1) ~= nSmp 96 | error('W and data mismatch!'); 97 | end 98 | 99 | 100 | %========================== 101 | % If the data is too large, the following centering code can be commented out 102 | % options.keepMean = 1; 103 | %========================== 104 | if isfield(options,'keepMean') && options.keepMean 105 | ; 106 | else 107 | if issparse(data) 108 | data = full(data); 109 | end 110 | sampleMean = mean(data); 111 | data = (data - repmat(sampleMean,nSmp,1)); 112 | end 113 | %========================== 114 | 115 | 116 | 117 | 118 | D = full(sum(W,2)); 119 | 120 | 121 | if ~isfield(options,'Regu') || ~options.Regu 122 | DToPowerHalf = D.^.5; 123 | D_mhalf = DToPowerHalf.^-1; 124 | 125 | if nSmp < 5000 126 | tmpD_mhalf = repmat(D_mhalf,1,nSmp); 127 | W = (tmpD_mhalf.*W).*tmpD_mhalf'; 128 | clear tmpD_mhalf; 129 | else 130 | [i_idx,j_idx,v_idx] = find(W); 131 | v1_idx = zeros(size(v_idx)); 132 | for i=1:length(v_idx) 133 | v1_idx(i) = v_idx(i)*D_mhalf(i_idx(i))*D_mhalf(j_idx(i)); 134 | end 135 | W = sparse(i_idx,j_idx,v1_idx); 136 | clear i_idx j_idx v_idx v1_idx 137 | end 138 | W = max(W,W'); 139 | 140 | data = repmat(DToPowerHalf,1,nFea).*data; 141 | [eigvector, eigvalue] = LGE(W, [], options, data); 142 | else 143 | options.ReguAlpha = options.ReguAlpha*sum(D)/length(D); 144 | 145 | D = sparse(1:nSmp,1:nSmp,D,nSmp,nSmp); 146 | [eigvector, eigvalue] = LGE(W, D, options, data); 147 | end 148 | 149 | 150 | eigIdx = find(eigvalue < 1e-3); 151 | eigvalue (eigIdx) = []; 152 | eigvector(:,eigIdx) = []; -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/LPI/mySVD.m: -------------------------------------------------------------------------------- 1 | function [U, S, V] = mySVD(X,ReducedDim) 2 | %mySVD Accelerated singular value decomposition. 3 | % [U,S,V] = mySVD(X) produces a diagonal matrix S, of the 4 | % dimension as the rank of X and with nonnegative diagonal elements in 5 | % decreasing order, and unitary matrices U and V so that 6 | % X = U*S*V'. 7 | % 8 | % [U,S,V] = mySVD(X,ReducedDim) produces a diagonal matrix S, of the 9 | % dimension as ReducedDim and with nonnegative diagonal elements in 10 | % decreasing order, and unitary matrices U and V so that 11 | % Xhat = U*S*V' is the best approximation (with respect to F norm) of X 12 | % among all the matrices with rank no larger than ReducedDim. 13 | % 14 | % Based on the size of X, mySVD computes the eigvectors of X*X^T or X^T*X 15 | % first, and then converts them to the eigenvectors of the other. 16 | % 17 | % See also SVD. 
18 | % 19 | % version 2.0 --Feb/2009 20 | % version 1.0 --April/2004 21 | % 22 | % Written by Deng Cai (dengcai AT gmail.com) 23 | % 24 | 25 | MAX_MATRIX_SIZE = 1600; % You can change this number according to your machine's computational power 26 | EIGVECTOR_RATIO = 0.1; % You can change this number according to your machine's computational power 27 | 28 | 29 | if ~exist('ReducedDim','var') 30 | ReducedDim = 0; 31 | end 32 | 33 | [nSmp, mFea] = size(X); 34 | if mFea/nSmp > 1.0713 35 | ddata = X*X'; 36 | ddata = max(ddata,ddata'); 37 | 38 | dimMatrix = size(ddata,1); 39 | if (ReducedDim > 0) && (dimMatrix > MAX_MATRIX_SIZE) && (ReducedDim < dimMatrix*EIGVECTOR_RATIO) 40 | option = struct('disp',0); 41 | [U, eigvalue] = eigs(ddata,ReducedDim,'la',option); 42 | eigvalue = diag(eigvalue); 43 | else 44 | if issparse(ddata) 45 | ddata = full(ddata); 46 | end 47 | 48 | [U, eigvalue] = eig(ddata); 49 | eigvalue = diag(eigvalue); 50 | [dump, index] = sort(-eigvalue); 51 | eigvalue = eigvalue(index); 52 | U = U(:, index); 53 | end 54 | clear ddata; 55 | 56 | maxEigValue = max(abs(eigvalue)); 57 | eigIdx = find(abs(eigvalue)/maxEigValue < 1e-10); 58 | eigvalue(eigIdx) = []; 59 | U(:,eigIdx) = []; 60 | 61 | if (ReducedDim > 0) && (ReducedDim < length(eigvalue)) 62 | eigvalue = eigvalue(1:ReducedDim); 63 | U = U(:,1:ReducedDim); 64 | end 65 | 66 | eigvalue_Half = eigvalue.^.5; 67 | S = spdiags(eigvalue_Half,0,length(eigvalue_Half),length(eigvalue_Half)); 68 | 69 | if nargout >= 3 70 | eigvalue_MinusHalf = eigvalue_Half.^-1; 71 | V = X'*(U.*repmat(eigvalue_MinusHalf',size(U,1),1)); 72 | end 73 | else 74 | ddata = X'*X; 75 | ddata = max(ddata,ddata'); 76 | 77 | dimMatrix = size(ddata,1); 78 | if (ReducedDim > 0) && (dimMatrix > MAX_MATRIX_SIZE) && (ReducedDim < dimMatrix*EIGVECTOR_RATIO) 79 | option = struct('disp',0); 80 | [V, eigvalue] = eigs(ddata,ReducedDim,'la',option); 81 | eigvalue = diag(eigvalue); 82 | else 83 | if issparse(ddata) 84 | ddata = full(ddata); 85 | end 86 | 87 | [V, eigvalue] = eig(ddata); 88 | eigvalue = diag(eigvalue); 89 | 90 | [dump, index] = sort(-eigvalue); 91 | eigvalue = eigvalue(index); 92 | V = V(:, index); 93 | end 94 | clear ddata; 95 | 96 | maxEigValue = max(abs(eigvalue)); 97 | eigIdx = find(abs(eigvalue)/maxEigValue < 1e-10); 98 | eigvalue(eigIdx) = []; 99 | V(:,eigIdx) = []; 100 | 101 | if (ReducedDim > 0) && (ReducedDim < length(eigvalue)) 102 | eigvalue = eigvalue(1:ReducedDim); 103 | V = V(:,1:ReducedDim); 104 | end 105 | 106 | eigvalue_Half = eigvalue.^.5; 107 | S = spdiags(eigvalue_Half,0,length(eigvalue_Half),length(eigvalue_Half)); 108 | 109 | eigvalue_MinusHalf = eigvalue_Half.^-1; 110 | U = X*(V.*repmat(eigvalue_MinusHalf',size(V,1),1)); 111 | end 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/LSA/LSA.m: -------------------------------------------------------------------------------- 1 | function Y = LSA(X, nLowVec) 2 | 3 | k = nLowVec; 4 | 5 | [Y,~,~] = svds(X,k); 6 | 7 | end 8 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/Step_03_main_nlpcc2016.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/LSA_and_LPI_and_LE_matlab/Step_03_main_nlpcc2016.m --------------------------------------------------------------------------------
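Note: a minimal usage sketch of how the tf_idf, normalize and LSA helpers above are chained by Step_03_nlpcc2016.m below. The random matrix is hypothetical stand-in data, not part of the repository, and tf_idf.m (whose body is only linked above) is assumed to map a documents-by-terms count matrix to its TF-IDF weighting, as its call sites suggest.

    % Hypothetical stand-in data: 100 documents over a 500-term vocabulary.
    fea = sparse(round(rand(100, 500) * 3));
    fea = tf_idf(fea);      % reweight raw term counts (assumed signature, see tf_idf.m above)
    fea = normalize(fea);   % unit-L2-normalize each document vector (normalize.m above)
    Y = LSA(fea, 50);       % 50 leading left singular vectors via svds, one row per document
    disp(size(Y));          % prints [100 50]

The same reduced matrix Y is what Step_03_nlpcc2016.m saves as the LSA feature set; LapEig and lpp are drop-in alternatives for the LE and LPI features.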
/code/LSA_and_LPI_and_LE_matlab/Step_03_nlpcc2016.m: -------------------------------------------------------------------------------- 1 | function Step_03_nlpcc2016(method, topic, parameters, hasTestData) 2 | 3 | dataStr = ['./../../data/RefineData/Step02_vsm/vsm_train_labeled_',topic]; 4 | load(dataStr); 5 | evalStr = ['fea = vsm_train_labeled_',topic,';']; 6 | eval(evalStr) 7 | dataStr = ['./../../data/RefineData/Step02_vsm/vsm_train_unlabeled_',topic]; 8 | load(dataStr); 9 | evalStr = ['fea = [fea;vsm_train_unlabeled_',topic,'];']; 10 | eval(evalStr) 11 | if hasTestData 12 | dataStr = ['./../../data/RefineData/Step02_vsm/vsm_test_unlabeled_',topic]; 13 | load(dataStr); 14 | evalStr = ['fea = [fea;vsm_test_unlabeled_',topic,'];']; 15 | eval(evalStr) 16 | end 17 | 18 | nLowVec = parameters.nLowVec; 19 | %% init parameters 20 | options = []; 21 | options.NeighborMode = 'KNN'; 22 | options.Metric = 'Euclidean'; 23 | options.WeightMode = 'HeatKernel'; 24 | options.t = 1; 25 | options.bSelfConnected = 0; 26 | options.k = 15; 27 | 28 | %% 29 | if (strcmp(method,'LSA')) 30 | disp('Running baseline method: LSA!'); 31 | fea=tf_idf(fea); 32 | fea = normalize(fea); 33 | disp(['nLowVec is:',num2str(nLowVec)]); 34 | Y = LSA(fea,nLowVec); 35 | resultStr = ['./../../data/RefineData/Step03_lsa_lpi_le/fea_lsa_',topic]; 36 | save(resultStr, 'Y', '-ascii'); 37 | disp('LSA feature is ok!'); 38 | 39 | elseif (strcmp(method,'Spectral_LE')) 40 | disp('Running baseline method: Spectral_LE!'); 41 | fea=tf_idf(fea); 42 | fea = normalize(fea); 43 | disp(['nLowVec is:',num2str(nLowVec)]); 44 | Y = LapEig(fea,options,nLowVec); 45 | resultStr = ['./../../data/RefineData/Step03_lsa_lpi_le/fea_le_',topic]; 46 | save(resultStr, 'Y', '-ascii'); 47 | disp('LE feature is ok!'); 48 | 49 | elseif (strcmp(method,'LPI')) 50 | disp('Running baseline method: LPI!'); 51 | fea=tf_idf(fea); 52 | fea = normalize(fea); 53 | W = constructW(fea,options); 54 | options.PCARatio = 1; 55 | options.ReducedDim=nLowVec; 56 | [eigvector, ~] = lpp(W, options, fea); 57 | disp(['nLowVec is:',num2str(nLowVec)]); 58 | Y = fea*eigvector; 59 | resultStr = ['./../../data/RefineData/Step03_lsa_lpi_le/fea_lpi_',topic]; 60 | save(resultStr, 'Y', '-ascii'); 61 | disp('LPI feature is ok!'); 62 | 63 | else 64 | error(['You input an invalid method: ',method,'!']) 65 | end -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/Step_04_main_nlpcc2016.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/LSA_and_LPI_and_LE_matlab/Step_04_main_nlpcc2016.m --------------------------------------------------------------------------------
/code/LSA_and_LPI_and_LE_matlab/Step_04_nlpcc2016.m: -------------------------------------------------------------------------------- 1 | function Step_04_nlpcc2016(method, topic, parameters, hasTestData) 2 | 3 | dataStr = ['./../../data/RefineData/Step04_chisquare/vsm_train_labeled_chi2_',topic]; 4 | load(dataStr); 5 | evalStr = ['fea = vsm_train_labeled_chi2_',topic,';']; 6 | eval(evalStr) 7 | dataStr = ['./../../data/RefineData/Step04_chisquare/vsm_train_unlabeled_chi2_',topic]; 8 | load(dataStr); 9 | evalStr = ['fea = [fea;vsm_train_unlabeled_chi2_',topic,'];']; 10 | eval(evalStr) 11 | if hasTestData 12 | dataStr = ['./../../data/RefineData/Step04_chisquare/vsm_test_unlabeled_chi2_',topic]; 13 | load(dataStr); 14 | evalStr = ['fea = [fea;vsm_test_unlabeled_chi2_',topic,'];']; 15 | eval(evalStr) 16 | end 17 | 18 | nLowVec = parameters.nLowVec; 19 | %% init parameters 20 | options = []; 21 | options.NeighborMode = 'KNN'; 22 | options.Metric = 'Euclidean'; 23 | options.WeightMode = 'HeatKernel'; 24 | options.t = 1; 25 | options.bSelfConnected = 0; 26 | options.k = 15; 27 | 28 | %% 29 | if (strcmp(method,'LSA')) 30 | disp('Running baseline method: LSA!'); 31 | fea=tf_idf(fea); 32 | fea = normalize(fea); 33 | disp(['nLowVec is:',num2str(nLowVec)]); 34 | Y = LSA(fea,nLowVec); 35 | resultStr = ['./../../data/RefineData/Step04_chisquare/fea_chi2_lsa_',topic]; 36 | save(resultStr, 'Y', '-ascii'); 37 | disp('LSA feature is ok!'); 38 | 39 | elseif (strcmp(method,'Spectral_LE')) 40 | disp('Running baseline method: Spectral_LE!'); 41 | fea=tf_idf(fea); 42 | fea = normalize(fea); 43 | disp(['nLowVec is:',num2str(nLowVec)]); 44 | Y = LapEig(fea,options,nLowVec); 45 | resultStr = ['./../../data/RefineData/Step04_chisquare/fea_chi2_le_',topic]; 46 | save(resultStr, 'Y', '-ascii'); 47 | disp('LE feature is ok!'); 48 | 49 | elseif (strcmp(method,'LPI')) 50 | disp('Running baseline method: LPI!'); 51 | fea=tf_idf(fea); 52 | fea = normalize(fea); 53 | W = constructW(fea,options); 54 | options.PCARatio = 1; 55 | options.ReducedDim=nLowVec; 56 | [eigvector, ~] = lpp(W, options, fea); 57 | disp(['nLowVec is:',num2str(nLowVec)]); 58 | Y = fea*eigvector; 59 | resultStr = ['./../../data/RefineData/Step04_chisquare/fea_chi2_lpi_',topic]; 60 | save(resultStr, 'Y', '-ascii'); 61 | disp('LPI feature is ok!'); 62 | 63 | else 64 | error(['You input an invalid method: ',method,'!']) 65 | end -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/tools/EuDist2.m: -------------------------------------------------------------------------------- 1 | function D = EuDist2(fea_a,fea_b,bSqrt) 2 | % Euclidean Distance matrix 3 | % D = EuDist(fea_a,fea_b) 4 | % fea_a: nSample_a * nFeature 5 | % fea_b: nSample_b * nFeature 6 | % D: nSample_a * nSample_a 7 | % or nSample_a * nSample_b 8 | 9 | 10 | if ~exist('bSqrt','var') 11 | bSqrt = 1; 12 | end 13 | 14 | 15 | if (~exist('fea_b','var')) | isempty(fea_b) 16 | [nSmp, nFea] = size(fea_a); 17 | 18 | aa = sum(fea_a.*fea_a,2); 19 | ab = fea_a*fea_a'; 20 | 21 | aa = full(aa); 22 | ab = full(ab); 23 | 24 | if bSqrt 25 | D = sqrt(repmat(aa, 1, nSmp) + repmat(aa', nSmp, 1) - 2*ab); 26 | D = real(D); 27 | else 28 | D = repmat(aa, 1, nSmp) + repmat(aa', nSmp, 1) - 2*ab; 29 | end 30 | 31 | D = max(D,D'); 32 | D = D - diag(diag(D)); 33 | D = abs(D); 34 | else 35 | [nSmp_a, nFea] = size(fea_a); 36 | [nSmp_b, nFea] = size(fea_b); 37 | 38 | aa = sum(fea_a.*fea_a,2); 39 | bb = sum(fea_b.*fea_b,2); 40 | ab = fea_a*fea_b'; 41 | 42 | aa = full(aa); 43 | bb = full(bb); 44 | ab = full(ab); 45 | 46 | if bSqrt 47 | D = sqrt(repmat(aa, 1, nSmp_b) + repmat(bb', nSmp_a, 1) - 2*ab); 48 | D = real(D); 49 | else 50 | D = repmat(aa, 1, nSmp_b) + repmat(bb', nSmp_a, 1) - 2*ab; 51 | end 52 | 53 | D = abs(D); 54 | end 55 | 56 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/tools/constructW.m: -------------------------------------------------------------------------------- 1 | function [W, elapse] = constructW(fea,options) 2 | % Usage: 3 | % W = constructW(fea,options) 4 | % 5 | % fea: Rows of vectors of data points. Each row is x_i 6 | % options: Struct value in Matlab.
The fields in options that can be set: 7 | % Metric - Choices are: 8 | % 'Euclidean' - Will use the Euclidean distance of two data 9 | % points to evaluate the "closeness" between 10 | % them. [Default One] 11 | % 'Cosine' - Will use the cosine value of two vectors 12 | % to evaluate the "closeness" between them. 13 | % A popular similarity measure used in 14 | % Information Retrieval. 15 | % 16 | % NeighborMode - Indicates how to construct the graph. Choices 17 | % are: [Default 'KNN'] 18 | % 'KNN' - k = 0 19 | % Complete graph 20 | % k > 0 21 | % Put an edge between two nodes if and 22 | % only if they are among the k nearest 23 | % neighbors of each other. You are 24 | % required to provide the parameter k in 25 | % the options. Default k=5. 26 | % 'Supervised' - k = 0 27 | % Put an edge between two nodes if and 28 | % only if they belong to same class. 29 | % k > 0 30 | % Put an edge between two nodes if 31 | % they belong to same class and they 32 | % are among the k nearest neighbors of 33 | % each other. 34 | % Default: k=0 35 | % You are required to provide the label 36 | % information gnd in the options. 37 | % 38 | % WeightMode - Indicates how to assign weights for each edge 39 | % in the graph. Choices are: 40 | % 'Binary' - 0-1 weighting. Every edge receives weight 41 | % of 1. [Default One] 42 | % 'HeatKernel' - If nodes i and j are connected, put weight 43 | % W_ij = exp(-norm(x_i-x_j)^2/(2t^2)). This 44 | % weight mode can only be used under 45 | % 'Euclidean' metric and you are required to 46 | % provide the parameter t. 47 | % 'Cosine' - If nodes i and j are connected, put weight 48 | % cosine(x_i,x_j). Can only be used under 49 | % 'Cosine' metric. 50 | % 51 | % k - The parameter needed under 'KNN' NeighborMode. 52 | % Default will be 5. 53 | % gnd - The parameter needed under 'Supervised' 54 | % NeighborMode. Column vector of the label 55 | % information for each data point. 56 | % bLDA - 0 or 1. Only effective under 'Supervised' 57 | % NeighborMode. If 1, the graph will be constructed 58 | % to make LPP exactly same as LDA. Default will be 59 | % 0. 60 | % t - The parameter needed under 'HeatKernel' 61 | % WeightMode. Default will be 1 62 | % bNormalized - 0 or 1. Only effective under 'Cosine' metric. 63 | % Indicates whether the fea are already 64 | % normalized to 1. Default will be 0 65 | % bSelfConnected - 0 or 1. Indicates whether W(i,i) == 1. If 66 | % 'Supervised' NeighborMode & bLDA == 1, 67 | % bSelfConnected will always be 1. Default 1.
68 | % 69 | % 70 | % Examples: 71 | % 72 | % fea = rand(50,15); 73 | % options = []; 74 | % options.Metric = 'Euclidean'; 75 | % options.NeighborMode = 'KNN'; 76 | % options.k = 5; 77 | % options.WeightMode = 'HeatKernel'; 78 | % options.t = 1; 79 | % W = constructW(fea,options); 80 | % 81 | % 82 | % fea = rand(50,15); 83 | % gnd = [ones(10,1);ones(15,1)*2;ones(10,1)*3;ones(15,1)*4]; 84 | % options = []; 85 | % options.Metric = 'Euclidean'; 86 | % options.NeighborMode = 'Supervised'; 87 | % options.gnd = gnd; 88 | % options.WeightMode = 'HeatKernel'; 89 | % options.t = 1; 90 | % W = constructW(fea,options); 91 | % 92 | % 93 | % fea = rand(50,15); 94 | % gnd = [ones(10,1);ones(15,1)*2;ones(10,1)*3;ones(15,1)*4]; 95 | % options = []; 96 | % options.Metric = 'Euclidean'; 97 | % options.NeighborMode = 'Supervised'; 98 | % options.gnd = gnd; 99 | % options.bLDA = 1; 100 | % W = constructW(fea,options); 101 | % 102 | % 103 | % For more details about the different ways to construct the W, please 104 | % refer: 105 | % Deng Cai, Xiaofei He and Jiawei Han, "Document Clustering Using 106 | % Locality Preserving Indexing" IEEE TKDE, Dec. 2005. 107 | % 108 | % 109 | % Written by Deng Cai (dengcai2 AT cs.uiuc.edu), April/2004, Feb/2006, 110 | % May/2007 111 | % 112 | 113 | if (~exist('options','var')) 114 | options = []; 115 | else 116 | if ~isstruct(options) 117 | error('parameter error!'); 118 | end 119 | end 120 | 121 | %================================================= 122 | if ~isfield(options,'Metric') 123 | options.Metric = 'Cosine'; 124 | end 125 | 126 | switch lower(options.Metric) 127 | case {lower('Euclidean')} 128 | case {lower('Cosine')} 129 | if ~isfield(options,'bNormalized') 130 | options.bNormalized = 0; 131 | end 132 | otherwise 133 | error('Metric does not exist!'); 134 | end 135 | 136 | %================================================= 137 | if ~isfield(options,'NeighborMode') 138 | options.NeighborMode = 'KNN'; 139 | end 140 | 141 | switch lower(options.NeighborMode) 142 | case {lower('KNN')} %For simplicity, we include the data point itself in the kNN 143 | if ~isfield(options,'k') 144 | options.k = 5; 145 | end 146 | case {lower('Supervised')} 147 | if ~isfield(options,'bLDA') 148 | options.bLDA = 0; 149 | end 150 | if options.bLDA 151 | options.bSelfConnected = 1; 152 | end 153 | if ~isfield(options,'k') 154 | options.k = 0; 155 | end 156 | if ~isfield(options,'gnd') 157 | error('Label(gnd) should be provided under ''Supervised'' NeighborMode!'); 158 | end 159 | if ~isempty(fea) && length(options.gnd) ~= size(fea,1) 160 | error('gnd doesn''t match with fea!'); 161 | end 162 | otherwise 163 | error('NeighborMode does not exist!'); 164 | end 165 | 166 | %================================================= 167 | 168 | if ~isfield(options,'WeightMode') 169 | options.WeightMode = 'Binary'; 170 | end 171 | 172 | bBinary = 0; 173 | switch lower(options.WeightMode) 174 | case {lower('Binary')} 175 | bBinary = 1; 176 | case {lower('HeatKernel')} 177 | if ~strcmpi(options.Metric,'Euclidean') 178 | warning('''HeatKernel'' WeightMode should be used under ''Euclidean'' Metric!'); 179 | options.Metric = 'Euclidean'; 180 | end 181 | if ~isfield(options,'t') 182 | options.t = 1; 183 | end 184 | case {lower('Cosine')} 185 | if ~strcmpi(options.Metric,'Cosine') 186 | warning('''Cosine'' WeightMode should be used under ''Cosine'' Metric!'); 187 | options.Metric = 'Cosine'; 188 | end 189 | if ~isfield(options,'bNormalized') 190 | options.bNormalized = 0; 191 | end 192 | otherwise 193 | 
error('WeightMode does not exist!'); 194 | end 195 | 196 | %================================================= 197 | 198 | if ~isfield(options,'bSelfConnected') 199 | options.bSelfConnected = 1; 200 | end 201 | 202 | %================================================= 203 | tmp_T = cputime; 204 | 205 | if isfield(options,'gnd') 206 | nSmp = length(options.gnd); 207 | else 208 | nSmp = size(fea,1); 209 | end 210 | 211 | maxM = 62500000; %500M 212 | BlockSize = floor(maxM/(nSmp*3)); 213 | 214 | 215 | if strcmpi(options.NeighborMode,'Supervised') 216 | Label = unique(options.gnd); 217 | nLabel = length(Label); 218 | if options.bLDA 219 | G = zeros(nSmp,nSmp); 220 | for idx=1:nLabel 221 | classIdx = options.gnd==Label(idx); 222 | G(classIdx,classIdx) = 1/sum(classIdx); 223 | end 224 | W = sparse(G); 225 | elapse = cputime - tmp_T; 226 | return; 227 | end 228 | 229 | switch lower(options.WeightMode) 230 | case {lower('Binary')} 231 | if options.k > 0 232 | G = zeros(nSmp*(options.k+1),3); 233 | idNow = 0; 234 | for i=1:nLabel 235 | classIdx = find(options.gnd==Label(i)); 236 | D = EuDist2(fea(classIdx,:),[],0); 237 | [dump idx] = sort(D,2); % sort each row 238 | clear D dump; 239 | idx = idx(:,1:options.k+1); 240 | 241 | nSmpClass = length(classIdx)*(options.k+1); 242 | G(idNow+1:nSmpClass+idNow,1) = repmat(classIdx,[options.k+1,1]); 243 | G(idNow+1:nSmpClass+idNow,2) = classIdx(idx(:)); 244 | G(idNow+1:nSmpClass+idNow,3) = 1; 245 | idNow = idNow+nSmpClass; 246 | clear idx 247 | end 248 | G = sparse(G(:,1),G(:,2),G(:,3),nSmp,nSmp); 249 | G = max(G,G'); 250 | else 251 | G = zeros(nSmp,nSmp); 252 | for i=1:nLabel 253 | classIdx = find(options.gnd==Label(i)); 254 | G(classIdx,classIdx) = 1; 255 | end 256 | end 257 | 258 | if ~options.bSelfConnected 259 | for i=1:size(G,1) 260 | G(i,i) = 0; 261 | end 262 | end 263 | 264 | W = sparse(G); 265 | case {lower('HeatKernel')} 266 | if options.k > 0 267 | G = zeros(nSmp*(options.k+1),3); 268 | idNow = 0; 269 | for i=1:nLabel 270 | classIdx = find(options.gnd==Label(i)); 271 | D = EuDist2(fea(classIdx,:),[],0); 272 | [dump idx] = sort(D,2); % sort each row 273 | clear D; 274 | idx = idx(:,1:options.k+1); 275 | dump = dump(:,1:options.k+1); 276 | dump = exp(-dump/(2*options.t^2)); 277 | 278 | nSmpClass = length(classIdx)*(options.k+1); 279 | G(idNow+1:nSmpClass+idNow,1) = repmat(classIdx,[options.k+1,1]); 280 | G(idNow+1:nSmpClass+idNow,2) = classIdx(idx(:)); 281 | G(idNow+1:nSmpClass+idNow,3) = dump(:); 282 | idNow = idNow+nSmpClass; 283 | clear dump idx 284 | end 285 | G = sparse(G(:,1),G(:,2),G(:,3),nSmp,nSmp); 286 | else 287 | G = zeros(nSmp,nSmp); 288 | for i=1:nLabel 289 | classIdx = find(options.gnd==Label(i)); 290 | D = EuDist2(fea(classIdx,:),[],0); 291 | D = exp(-D/(2*options.t^2)); 292 | G(classIdx,classIdx) = D; 293 | end 294 | end 295 | 296 | if ~options.bSelfConnected 297 | for i=1:size(G,1) 298 | G(i,i) = 0; 299 | end 300 | end 301 | 302 | W = sparse(max(G,G')); 303 | case {lower('Cosine')} 304 | if ~options.bNormalized 305 | [nSmp, nFea] = size(fea); 306 | if issparse(fea) 307 | fea2 = fea'; 308 | feaNorm = sum(fea2.^2,1).^.5; 309 | for i = 1:nSmp 310 | fea2(:,i) = fea2(:,i) ./ max(1e-10,feaNorm(i)); 311 | end 312 | fea = fea2'; 313 | clear fea2; 314 | else 315 | feaNorm = sum(fea.^2,2).^.5; 316 | for i = 1:nSmp 317 | fea(i,:) = fea(i,:) ./ max(1e-12,feaNorm(i)); 318 | end 319 | end 320 | 321 | end 322 | 323 | if options.k > 0 324 | G = zeros(nSmp*(options.k+1),3); 325 | idNow = 0; 326 | for i=1:nLabel 327 | classIdx = 
find(options.gnd==Label(i)); 328 | D = fea(classIdx,:)*fea(classIdx,:)'; 329 | [dump idx] = sort(-D,2); % sort each row 330 | clear D; 331 | idx = idx(:,1:options.k+1); 332 | dump = -dump(:,1:options.k+1); 333 | 334 | nSmpClass = length(classIdx)*(options.k+1); 335 | G(idNow+1:nSmpClass+idNow,1) = repmat(classIdx,[options.k+1,1]); 336 | G(idNow+1:nSmpClass+idNow,2) = classIdx(idx(:)); 337 | G(idNow+1:nSmpClass+idNow,3) = dump(:); 338 | idNow = idNow+nSmpClass; 339 | clear dump idx 340 | end 341 | G = sparse(G(:,1),G(:,2),G(:,3),nSmp,nSmp); 342 | else 343 | G = zeros(nSmp,nSmp); 344 | for i=1:nLabel 345 | classIdx = find(options.gnd==Label(i)); 346 | G(classIdx,classIdx) = fea(classIdx,:)*fea(classIdx,:)'; 347 | end 348 | end 349 | 350 | if ~options.bSelfConnected 351 | for i=1:size(G,1) 352 | G(i,i) = 0; 353 | end 354 | end 355 | 356 | W = sparse(max(G,G')); 357 | otherwise 358 | error('WeightMode does not exist!'); 359 | end 360 | elapse = cputime - tmp_T; 361 | return; 362 | end 363 | 364 | 365 | if strcmpi(options.NeighborMode,'KNN') && (options.k > 0) 366 | if strcmpi(options.Metric,'Euclidean') 367 | G = zeros(nSmp*(options.k+1),3); 368 | for i = 1:ceil(nSmp/BlockSize) 369 | if i == ceil(nSmp/BlockSize) 370 | smpIdx = (i-1)*BlockSize+1:nSmp; 371 | dist = EuDist2(fea(smpIdx,:),fea,0); 372 | dist = full(dist); 373 | [dump idx] = sort(dist,2); % sort each row 374 | idx = idx(:,1:options.k+1); 375 | dump = dump(:,1:options.k+1); 376 | if ~bBinary 377 | dump = exp(-dump/(2*options.t^2)); 378 | end 379 | 380 | G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),1) = repmat(smpIdx',[options.k+1,1]); 381 | G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),2) = idx(:); 382 | if ~bBinary 383 | G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),3) = dump(:); 384 | else 385 | G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),3) = 1; 386 | end 387 | else 388 | smpIdx = (i-1)*BlockSize+1:i*BlockSize; 389 | dist = EuDist2(fea(smpIdx,:),fea,0); 390 | dist = full(dist); 391 | [dump idx] = sort(dist,2); % sort each row 392 | idx = idx(:,1:options.k+1); 393 | dump = dump(:,1:options.k+1); 394 | if ~bBinary 395 | dump = exp(-dump/(2*options.t^2)); 396 | end 397 | 398 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),1) = repmat(smpIdx',[options.k+1,1]); 399 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),2) = idx(:); 400 | if ~bBinary 401 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),3) = dump(:); 402 | else 403 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),3) = 1; 404 | end 405 | end 406 | end 407 | 408 | W = sparse(G(:,1),G(:,2),G(:,3),nSmp,nSmp); 409 | else 410 | if ~options.bNormalized 411 | [nSmp, nFea] = size(fea); 412 | if issparse(fea) 413 | fea2 = fea'; 414 | clear fea; 415 | for i = 1:nSmp 416 | fea2(:,i) = fea2(:,i) ./ max(1e-10,sum(fea2(:,i).^2,1).^.5); 417 | end 418 | fea = fea2'; 419 | clear fea2; 420 | else 421 | feaNorm = sum(fea.^2,2).^.5; 422 | for i = 1:nSmp 423 | fea(i,:) = fea(i,:) ./ max(1e-12,feaNorm(i)); 424 | end 425 | end 426 | end 427 | 428 | G = zeros(nSmp*(options.k+1),3); 429 | for i = 1:ceil(nSmp/BlockSize) 430 | if i == ceil(nSmp/BlockSize) 431 | smpIdx = (i-1)*BlockSize+1:nSmp; 432 | dist = fea(smpIdx,:)*fea'; 433 | dist = full(dist); 434 | [dump idx] = sort(-dist,2); % sort each row 435 | idx = idx(:,1:options.k+1); 436 | dump = -dump(:,1:options.k+1); 437 | 438 | G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),1) = repmat(smpIdx',[options.k+1,1]); 439 | 
G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),2) = idx(:); 440 | G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),3) = dump(:); 441 | else 442 | smpIdx = (i-1)*BlockSize+1:i*BlockSize; 443 | dist = fea(smpIdx,:)*fea'; 444 | dist = full(dist); 445 | [dump idx] = sort(-dist,2); % sort each row 446 | idx = idx(:,1:options.k+1); 447 | dump = -dump(:,1:options.k+1); 448 | 449 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),1) = repmat(smpIdx',[options.k+1,1]); 450 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),2) = idx(:); 451 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),3) = dump(:); 452 | end 453 | end 454 | 455 | W = sparse(G(:,1),G(:,2),G(:,3),nSmp,nSmp); 456 | end 457 | 458 | if strcmpi(options.WeightMode,'Binary') 459 | W(find(W)) = 1; 460 | end 461 | 462 | if isfield(options,'bSemiSupervised') && options.bSemiSupervised 463 | tmpgnd = options.gnd(options.semiSplit); 464 | 465 | Label = unique(tmpgnd); 466 | nLabel = length(Label); 467 | G = zeros(sum(options.semiSplit),sum(options.semiSplit)); 468 | for idx=1:nLabel 469 | classIdx = tmpgnd==Label(idx); 470 | G(classIdx,classIdx) = 1; 471 | end 472 | Wsup = sparse(G); 473 | if ~isfield(options,'SameCategoryWeight') 474 | options.SameCategoryWeight = 1; 475 | end 476 | W(options.semiSplit,options.semiSplit) = (Wsup>0)*options.SameCategoryWeight; 477 | end 478 | 479 | if ~options.bSelfConnected 480 | for i=1:size(W,1) 481 | W(i,i) = 0; 482 | end 483 | end 484 | 485 | W = max(W,W'); 486 | 487 | elapse = cputime - tmp_T; 488 | return; 489 | end 490 | 491 | 492 | % strcmpi(options.NeighborMode,'KNN') & (options.k == 0) 493 | % Complete Graph 494 | 495 | if strcmpi(options.Metric,'Euclidean') 496 | W = EuDist2(fea,[],0); 497 | W = exp(-W/(2*options.t^2)); 498 | else 499 | if ~options.bNormalized 500 | % feaNorm = sum(fea.^2,2).^.5; 501 | % fea = fea ./ repmat(max(1e-10,feaNorm),1,size(fea,2)); 502 | [nSmp, nFea] = size(fea); 503 | if issparse(fea) 504 | fea2 = fea'; 505 | feaNorm = sum(fea2.^2,1).^.5; 506 | for i = 1:nSmp 507 | fea2(:,i) = fea2(:,i) ./ max(1e-10,feaNorm(i)); 508 | end 509 | fea = fea2'; 510 | clear fea2; 511 | else 512 | feaNorm = sum(fea.^2,2).^.5; 513 | for i = 1:nSmp 514 | fea(i,:) = fea(i,:) ./ max(1e-12,feaNorm(i)); 515 | end 516 | end 517 | end 518 | 519 | W = full(fea*fea'); 520 | end 521 | 522 | if ~options.bSelfConnected 523 | for i=1:size(W,1) 524 | W(i,i) = 0; 525 | end 526 | end 527 | 528 | W = max(W,W'); 529 | 530 | 531 | 532 | elapse = cputime - tmp_T; 533 | 534 | 535 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/tools/normalize.m: -------------------------------------------------------------------------------- 1 | function Xn = normalize(X) 2 | % Normalize all feature vectors to unit length 3 | 4 | n = size(X,1); % the number of documents 5 | Xt = X'; 6 | l = sqrt(sum(Xt.^2)); % the row vector length (L2 norm) 7 | Ni = sparse(1:n,1:n,l); 8 | Ni(Ni>0) = 1./Ni(Ni>0); 9 | Xn = (Xt*Ni)'; 10 | 11 | end 12 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/tools/tf_idf.m: -------------------------------------------------------------------------------- 1 | function [trainX] = tf_idf(trainTF) 2 | % TF-IDF weighting 3 | % ([1+log(tf)]*log[N/df]) 4 | [n,m] = size(trainTF); % the number of (training) documents and terms 5 | df = sum(trainTF>0); % (training) document frequency 6 | idf = log(n./df); 7 | IDF = sparse(1:m,1:m,idf); 8 | 
[trainI,trainJ,trainV] = find(trainTF); 9 | trainLogTF = sparse(trainI,trainJ,1+log(trainV),size(trainTF,1),size(trainTF,2)); 10 | trainX = trainLogTF*IDF; 11 | end 12 | -------------------------------------------------------------------------------- /code/Para2vec/README.md: -------------------------------------------------------------------------------- 1 | # Paragraph Vector (Para2vec) 2 | 3 | This project is modified from the following open-source Para2vec implementation: 4 | 5 | https://github.com/mesnilgr/iclr15 6 | 7 | -------------------------------------------------------------------------------- /code/Para2vec/go_paravec.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #author: Mikolov; modifier: Jacob Xu. -- 20160614 3 | # Topics: iphonese, bianpao, fankong, ertai, jinmo 4 | topic="iphonese" 5 | rm alldata.txt 6 | rm alldata-id.txt 7 | 8 | echo "Current topic is "${topic}" ..." 9 | cat "TaskA_train_labeled_text_"${topic} "TaskA_train_unlabeled_text_"${topic} "TaskA_test_unlabeled_text_"${topic} > alldata.txt 10 | awk 'BEGIN{a=0;}{print "_*" a " " $0; a++;}' < alldata.txt > alldata-id.txt 11 | 12 | gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -funroll-loops 13 | time ./word2vec -train ./alldata-id.txt -output ${topic}"_vectors.txt" -cbow 0 -size 50 -window 5 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1 14 | grep '_\*' ${topic}"_vectors.txt" > ${topic}"_para2vecs.txt" 15 | sed -i "s+_\*+0+g" ${topic}"_para2vecs.txt" 16 | echo "It is done, OK!" -------------------------------------------------------------------------------- /code/Para2vec/word2vec.c: -------------------------------------------------------------------------------- 1 | // Copyright 2013 Google Inc. All Rights Reserved. 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // http://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License.
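// (Note added for readability; the summary below is not part of the original
// Google source. Three details of this file matter for go_paravec.sh above:
//  1. InitUnigramTable() draws negative samples from the unigram distribution
//     raised to the 3/4 power, i.e. P(w) proportional to count(w)^0.75.
//  2. TrainModelThread() subsamples frequent words: word w is kept with
//     probability (sqrt(f(w)/(sample*T)) + 1) * (sample*T)/f(w), where f(w)
//     is its count, T the total word count, and 'sample' the -sample value.
//  3. With -sentence-vectors 1, the first token of each line (the "_*<id>"
//     prefix written by the awk step in go_paravec.sh) is trained with the
//     whole line as context, so its learned vector becomes the Para2vec
//     feature of that Weibo text.)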
14 | 15 | #include <stdio.h> 16 | #include <stdlib.h> 17 | #include <string.h> 18 | #include <math.h> 19 | #include <pthread.h> 20 | 21 | #define MAX_STRING 100 22 | #define EXP_TABLE_SIZE 1000 23 | #define MAX_EXP 6 24 | #define MAX_SENTENCE_LENGTH 1000 25 | #define MAX_CODE_LENGTH 40 26 | 27 | const int vocab_hash_size = 30000000; // Maximum 30 * 0.7 = 21M words in the vocabulary 28 | 29 | typedef float real; // Precision of float numbers 30 | 31 | struct vocab_word { 32 | long long cn; 33 | int *point; 34 | char *word, *code, codelen; 35 | }; 36 | 37 | char train_file[MAX_STRING], output_file[MAX_STRING]; 38 | char save_vocab_file[MAX_STRING], read_vocab_file[MAX_STRING]; 39 | struct vocab_word *vocab; 40 | int binary = 0, cbow = 1, debug_mode = 2, window = 5, min_count = 1, num_threads = 12, min_reduce = 1; 41 | int *vocab_hash; 42 | long long vocab_max_size = 1000, vocab_size = 0, layer1_size = 100, sentence_vectors = 0; 43 | long long train_words = 0, word_count_actual = 0, iter = 5, file_size = 0, classes = 0; 44 | real alpha = 0.025, starting_alpha, sample = 1e-3; 45 | real *syn0, *syn1, *syn1neg, *expTable; 46 | clock_t start; 47 | 48 | int hs = 0, negative = 5; 49 | const int table_size = 1e8; 50 | int *table; 51 | 52 | void InitUnigramTable() { 53 | int a, i; 54 | long long train_words_pow = 0; 55 | real d1, power = 0.75; 56 | table = (int *)malloc(table_size * sizeof(int)); 57 | for (a = 0; a < vocab_size; a++) train_words_pow += pow(vocab[a].cn, power); 58 | i = 0; 59 | d1 = pow(vocab[i].cn, power) / (real)train_words_pow; 60 | for (a = 0; a < table_size; a++) { 61 | table[a] = i; 62 | if (a / (real)table_size > d1) { 63 | i++; 64 | d1 += pow(vocab[i].cn, power) / (real)train_words_pow; 65 | } 66 | if (i >= vocab_size) i = vocab_size - 1; 67 | } 68 | } 69 | 70 | // Reads a single word from a file, assuming space + tab + EOL to be word boundaries 71 | void ReadWord(char *word, FILE *fin) { 72 | int a = 0, ch; 73 | while (!feof(fin)) { 74 | ch = fgetc(fin); 75 | if (ch == 13) continue; 76 | if ((ch == ' ') || (ch == '\t') || (ch == '\n')) { 77 | if (a > 0) { 78 | if (ch == '\n') ungetc(ch, fin); 79 | break; 80 | } 81 | if (ch == '\n') { 82 | strcpy(word, (char *)"</s>"); 83 | return; 84 | } else continue; 85 | } 86 | word[a] = ch; 87 | a++; 88 | if (a >= MAX_STRING - 1) a--; // Truncate too long words 89 | } 90 | word[a] = 0; 91 | } 92 | 93 | // Returns hash value of a word 94 | int GetWordHash(char *word) { 95 | unsigned long long a, hash = 0; 96 | for (a = 0; a < strlen(word); a++) hash = hash * 257 + word[a]; 97 | hash = hash % vocab_hash_size; 98 | return hash; 99 | } 100 | 101 | // Returns position of a word in the vocabulary; if the word is not found, returns -1 102 | int SearchVocab(char *word) { 103 | unsigned int hash = GetWordHash(word); 104 | while (1) { 105 | if (vocab_hash[hash] == -1) return -1; 106 | if (!strcmp(word, vocab[vocab_hash[hash]].word)) return vocab_hash[hash]; 107 | hash = (hash + 1) % vocab_hash_size; 108 | } 109 | return -1; 110 | } 111 | 112 | // Reads a word and returns its index in the vocabulary 113 | int ReadWordIndex(FILE *fin) { 114 | char word[MAX_STRING]; 115 | ReadWord(word, fin); 116 | if (feof(fin)) return -1; 117 | return SearchVocab(word); 118 | } 119 | 120 | // Adds a word to the vocabulary 121 | int AddWordToVocab(char *word) { 122 | unsigned int hash, length = strlen(word) + 1; 123 | if (length > MAX_STRING) length = MAX_STRING; 124 | vocab[vocab_size].word = (char *)calloc(length, sizeof(char)); 125 | strcpy(vocab[vocab_size].word, word); 126 | vocab[vocab_size].cn = 0; 127 |
vocab_size++; 128 | // Reallocate memory if needed 129 | if (vocab_size + 2 >= vocab_max_size) { 130 | vocab_max_size += 1000; 131 | vocab = (struct vocab_word *)realloc(vocab, vocab_max_size * sizeof(struct vocab_word)); 132 | } 133 | hash = GetWordHash(word); 134 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 135 | vocab_hash[hash] = vocab_size - 1; 136 | return vocab_size - 1; 137 | } 138 | 139 | // Used later for sorting by word counts 140 | int VocabCompare(const void *a, const void *b) { 141 | return ((struct vocab_word *)b)->cn - ((struct vocab_word *)a)->cn; 142 | } 143 | 144 | // Sorts the vocabulary by frequency using word counts 145 | void SortVocab() { 146 | int a, size; 147 | unsigned int hash; 148 | // Sort the vocabulary and keep </s> at the first position 149 | qsort(&vocab[1], vocab_size - 1, sizeof(struct vocab_word), VocabCompare); 150 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 151 | size = vocab_size; 152 | train_words = 0; 153 | for (a = 0; a < size; a++) { 154 | // Words occurring less than min_count times will be discarded from the vocab 155 | if (vocab[a].cn < min_count) { 156 | vocab_size--; 157 | free(vocab[vocab_size].word); 158 | } else { 159 | // Hash will be re-computed, as it is no longer valid after the sorting 160 | hash=GetWordHash(vocab[a].word); 161 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 162 | vocab_hash[hash] = a; 163 | train_words += vocab[a].cn; 164 | } 165 | } 166 | vocab = (struct vocab_word *)realloc(vocab, (vocab_size + 1) * sizeof(struct vocab_word)); 167 | // Allocate memory for the binary tree construction 168 | for (a = 0; a < vocab_size; a++) { 169 | vocab[a].code = (char *)calloc(MAX_CODE_LENGTH, sizeof(char)); 170 | vocab[a].point = (int *)calloc(MAX_CODE_LENGTH, sizeof(int)); 171 | } 172 | } 173 | 174 | // Reduces the vocabulary by removing infrequent tokens 175 | void ReduceVocab() { 176 | int a, b = 0; 177 | unsigned int hash; 178 | for (a = 0; a < vocab_size; a++) if (vocab[a].cn > min_reduce) { 179 | vocab[b].cn = vocab[a].cn; 180 | vocab[b].word = vocab[a].word; 181 | b++; 182 | } else free(vocab[a].word); 183 | vocab_size = b; 184 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 185 | for (a = 0; a < vocab_size; a++) { 186 | // Hash will be re-computed, as it is no longer valid 187 | hash = GetWordHash(vocab[a].word); 188 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 189 | vocab_hash[hash] = a; 190 | } 191 | fflush(stdout); 192 | min_reduce++; 193 | } 194 | 195 | // Create binary Huffman tree using the word counts 196 | // Frequent words will have short unique binary codes 197 | void CreateBinaryTree() { 198 | long long a, b, i, min1i, min2i, pos1, pos2, point[MAX_CODE_LENGTH]; 199 | char code[MAX_CODE_LENGTH]; 200 | long long *count = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long)); 201 | long long *binary = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long)); 202 | long long *parent_node = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long)); 203 | for (a = 0; a < vocab_size; a++) count[a] = vocab[a].cn; 204 | for (a = vocab_size; a < vocab_size * 2; a++) count[a] = 1e15; 205 | pos1 = vocab_size - 1; 206 | pos2 = vocab_size; 207 | // Following algorithm constructs the Huffman tree by adding one node at a time 208 | for (a = 0; a < vocab_size - 1; a++) { 209 | // First, find two smallest nodes 'min1, min2' 210 | if (pos1 >= 0) { 211 | if (count[pos1] < count[pos2]) { 212 | min1i = pos1; 213 | pos1--; 214 | } else {
min1i = pos2; 216 | pos2++; 217 | } 218 | } else { 219 | min1i = pos2; 220 | pos2++; 221 | } 222 | if (pos1 >= 0) { 223 | if (count[pos1] < count[pos2]) { 224 | min2i = pos1; 225 | pos1--; 226 | } else { 227 | min2i = pos2; 228 | pos2++; 229 | } 230 | } else { 231 | min2i = pos2; 232 | pos2++; 233 | } 234 | count[vocab_size + a] = count[min1i] + count[min2i]; 235 | parent_node[min1i] = vocab_size + a; 236 | parent_node[min2i] = vocab_size + a; 237 | binary[min2i] = 1; 238 | } 239 | // Now assign binary code to each vocabulary word 240 | for (a = 0; a < vocab_size; a++) { 241 | b = a; 242 | i = 0; 243 | while (1) { 244 | code[i] = binary[b]; 245 | point[i] = b; 246 | i++; 247 | b = parent_node[b]; 248 | if (b == vocab_size * 2 - 2) break; 249 | } 250 | vocab[a].codelen = i; 251 | vocab[a].point[0] = vocab_size - 2; 252 | for (b = 0; b < i; b++) { 253 | vocab[a].code[i - b - 1] = code[b]; 254 | vocab[a].point[i - b] = point[b] - vocab_size; 255 | } 256 | } 257 | free(count); 258 | free(binary); 259 | free(parent_node); 260 | } 261 | 262 | void LearnVocabFromTrainFile() { 263 | char word[MAX_STRING]; 264 | FILE *fin; 265 | long long a, i; 266 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 267 | fin = fopen(train_file, "rb"); 268 | if (fin == NULL) { 269 | printf("ERROR: training data file not found!\n"); 270 | exit(1); 271 | } 272 | vocab_size = 0; 273 | AddWordToVocab((char *)"</s>"); 274 | while (1) { 275 | ReadWord(word, fin); 276 | if (feof(fin)) break; 277 | train_words++; 278 | if ((debug_mode > 1) && (train_words % 100000 == 0)) { 279 | printf("%lldK%c", train_words / 1000, 13); 280 | fflush(stdout); 281 | } 282 | i = SearchVocab(word); 283 | if (i == -1) { 284 | a = AddWordToVocab(word); 285 | vocab[a].cn = 1; 286 | } else vocab[i].cn++; 287 | if (vocab_size > vocab_hash_size * 0.7) ReduceVocab(); 288 | } 289 | SortVocab(); 290 | if (debug_mode > 0) { 291 | printf("Vocab size: %lld\n", vocab_size); 292 | printf("Words in train file: %lld\n", train_words); 293 | } 294 | file_size = ftell(fin); 295 | fclose(fin); 296 | } 297 | 298 | void SaveVocab() { 299 | long long i; 300 | FILE *fo = fopen(save_vocab_file, "wb"); 301 | for (i = 0; i < vocab_size; i++) fprintf(fo, "%s %lld\n", vocab[i].word, vocab[i].cn); 302 | fclose(fo); 303 | } 304 | 305 | void ReadVocab() { 306 | long long a, i = 0; 307 | char c; 308 | char word[MAX_STRING]; 309 | FILE *fin = fopen(read_vocab_file, "rb"); 310 | if (fin == NULL) { 311 | printf("Vocabulary file not found\n"); 312 | exit(1); 313 | } 314 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 315 | vocab_size = 0; 316 | while (1) { 317 | ReadWord(word, fin); 318 | if (feof(fin)) break; 319 | a = AddWordToVocab(word); 320 | fscanf(fin, "%lld%c", &vocab[a].cn, &c); 321 | i++; 322 | } 323 | SortVocab(); 324 | if (debug_mode > 0) { 325 | printf("Vocab size: %lld\n", vocab_size); 326 | printf("Words in train file: %lld\n", train_words); 327 | } 328 | fin = fopen(train_file, "rb"); 329 | if (fin == NULL) { 330 | printf("ERROR: training data file not found!\n"); 331 | exit(1); 332 | } 333 | fseek(fin, 0, SEEK_END); 334 | file_size = ftell(fin); 335 | fclose(fin); 336 | } 337 | 338 | void InitNet() { 339 | long long a, b; 340 | unsigned long long next_random = 1; 341 | a = posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real)); 342 | if (syn0 == NULL) {printf("Memory allocation failed\n"); exit(1);} 343 | if (hs) { 344 | a = posix_memalign((void **)&syn1, 128, (long long)vocab_size * layer1_size * sizeof(real));
345 | if (syn1 == NULL) {printf("Memory allocation failed\n"); exit(1);} 346 | for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) 347 | syn1[a * layer1_size + b] = 0; 348 | } 349 | if (negative>0) { 350 | a = posix_memalign((void **)&syn1neg, 128, (long long)vocab_size * layer1_size * sizeof(real)); 351 | if (syn1neg == NULL) {printf("Memory allocation failed\n"); exit(1);} 352 | for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) 353 | syn1neg[a * layer1_size + b] = 0; 354 | } 355 | for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) { 356 | next_random = next_random * (unsigned long long)25214903917 + 11; 357 | syn0[a * layer1_size + b] = (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size; 358 | } 359 | CreateBinaryTree(); 360 | } 361 | 362 | void *TrainModelThread(void *id) { 363 | long long a, b, d, cw, word, last_word, sentence_length = 0, sentence_position = 0; 364 | long long word_count = 0, last_word_count = 0, sen[MAX_SENTENCE_LENGTH + 1]; 365 | long long l1, l2, c, target, label, local_iter = iter; 366 | unsigned long long next_random = (long long)id; 367 | real f, g; 368 | clock_t now; 369 | real *neu1 = (real *)calloc(layer1_size, sizeof(real)); 370 | real *neu1e = (real *)calloc(layer1_size, sizeof(real)); 371 | FILE *fi = fopen(train_file, "rb"); 372 | fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET); 373 | while (1) { 374 | if (word_count - last_word_count > 10000) { 375 | word_count_actual += word_count - last_word_count; 376 | last_word_count = word_count; 377 | if ((debug_mode > 1)) { 378 | now=clock(); 379 | printf("%cAlpha: %f Progress: %.2f%% Words/thread/sec: %.2fk ", 13, alpha, 380 | word_count_actual / (real)(iter * train_words + 1) * 100, 381 | word_count_actual / ((real)(now - start + 1) / (real)CLOCKS_PER_SEC * 1000)); 382 | fflush(stdout); 383 | } 384 | alpha = starting_alpha * (1 - word_count_actual / (real)(iter * train_words + 1)); 385 | if (alpha < starting_alpha * 0.0001) alpha = starting_alpha * 0.0001; 386 | } 387 | if (sentence_length == 0) { 388 | while (1) { 389 | word = ReadWordIndex(fi); 390 | if (feof(fi)) break; 391 | if (word == -1) continue; 392 | word_count++; 393 | if (word == 0) break; 394 | // The subsampling randomly discards frequent words while keeping the ranking same 395 | if (sample > 0) { 396 | real ran = (sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn; 397 | next_random = next_random * (unsigned long long)25214903917 + 11; 398 | if (ran < (next_random & 0xFFFF) / (real)65536) continue; 399 | } 400 | sen[sentence_length] = word; 401 | sentence_length++; 402 | if (sentence_length >= MAX_SENTENCE_LENGTH) break; 403 | } 404 | sentence_position = 0; 405 | } 406 | if (feof(fi) || (word_count > train_words / num_threads)) { 407 | word_count_actual += word_count - last_word_count; 408 | local_iter--; 409 | if (local_iter == 0) break; 410 | word_count = 0; 411 | last_word_count = 0; 412 | sentence_length = 0; 413 | fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET); 414 | continue; 415 | } 416 | word = sen[sentence_position]; 417 | if (word == -1) continue; 418 | for (c = 0; c < layer1_size; c++) neu1[c] = 0; 419 | for (c = 0; c < layer1_size; c++) neu1e[c] = 0; 420 | next_random = next_random * (unsigned long long)25214903917 + 11; 421 | b = next_random % window; 422 | if (cbow) { //train the cbow architecture 423 | // in -> hidden 424 | cw = 0; 425 | for (a = b; a < window * 1 + 1 - b; a++) if (a 
!= window) { 426 | c = sentence_position - window + a; 427 | if (c < 0) continue; 428 | if (c >= sentence_length) continue; 429 | if (sentence_vectors && (c == 0)) continue; 430 | last_word = sen[c]; 431 | if (last_word == -1) continue; 432 | for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size]; 433 | cw++; 434 | } 435 | if (sentence_vectors) { 436 | last_word = sen[0]; 437 | if (last_word == -1) continue; 438 | for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size]; 439 | cw++; 440 | } 441 | if (cw) { 442 | for (c = 0; c < layer1_size; c++) neu1[c] /= cw; 443 | if (hs) for (d = 0; d < vocab[word].codelen; d++) { 444 | f = 0; 445 | l2 = vocab[word].point[d] * layer1_size; 446 | // Propagate hidden -> output 447 | for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1[c + l2]; 448 | if (f <= -MAX_EXP) continue; 449 | else if (f >= MAX_EXP) continue; 450 | else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]; 451 | // 'g' is the gradient multiplied by the learning rate 452 | g = (1 - vocab[word].code[d] - f) * alpha; 453 | // Propagate errors output -> hidden 454 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2]; 455 | // Learn weights hidden -> output 456 | for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c]; 457 | } 458 | // NEGATIVE SAMPLING 459 | if (negative > 0) for (d = 0; d < negative + 1; d++) { 460 | if (d == 0) { 461 | target = word; 462 | label = 1; 463 | } else { 464 | next_random = next_random * (unsigned long long)25214903917 + 11; 465 | target = table[(next_random >> 16) % table_size]; 466 | if (target == 0) target = next_random % (vocab_size - 1) + 1; 467 | if (target == word) continue; 468 | label = 0; 469 | } 470 | l2 = target * layer1_size; 471 | f = 0; 472 | for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1neg[c + l2]; 473 | if (f > MAX_EXP) g = (label - 1) * alpha; 474 | else if (f < -MAX_EXP) g = (label - 0) * alpha; 475 | else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha; 476 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2]; 477 | for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * neu1[c]; 478 | } 479 | // hidden -> in 480 | for (a = b; a < window * 1 + 1 - b; a++) if (a != window) { 481 | c = sentence_position - window + a; 482 | if (c < 0) continue; 483 | if (c >= sentence_length) continue; 484 | if (sentence_vectors && (c == 0)) continue; 485 | last_word = sen[c]; 486 | if (last_word == -1) continue; 487 | for (c = 0; c < layer1_size; c++) syn0[c + last_word * layer1_size] += neu1e[c]; 488 | } 489 | if (sentence_vectors) { 490 | last_word = sen[0]; 491 | if (last_word == -1) continue; 492 | for (c = 0; c < layer1_size; c++) syn0[c + last_word * layer1_size] += neu1e[c]; 493 | } 494 | } 495 | } else { //train skip-gram 496 | for (a = b; a < window * 2 + 1 + sentence_vectors - b; a++) if (a != window) { 497 | c = sentence_position - window + a; 498 | if (sentence_vectors) if (a >= window * 2 + sentence_vectors - b) c = 0; 499 | if (c < 0) continue; 500 | if (c >= sentence_length) continue; 501 | last_word = sen[c]; 502 | if (last_word == -1) continue; 503 | l1 = last_word * layer1_size; 504 | for (c = 0; c < layer1_size; c++) neu1e[c] = 0; 505 | // HIERARCHICAL SOFTMAX 506 | if (hs) for (d = 0; d < vocab[word].codelen; d++) { 507 | f = 0; 508 | l2 = vocab[word].point[d] * layer1_size; 509 | // Propagate hidden -> output 510 | for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2]; 511 | if (f 
<= -MAX_EXP) continue; 512 | else if (f >= MAX_EXP) continue; 513 | else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]; 514 | // 'g' is the gradient multiplied by the learning rate 515 | g = (1 - vocab[word].code[d] - f) * alpha; 516 | // Propagate errors output -> hidden 517 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2]; 518 | // Learn weights hidden -> output 519 | for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * syn0[c + l1]; 520 | } 521 | // NEGATIVE SAMPLING 522 | if (negative > 0) for (d = 0; d < negative + 1; d++) { 523 | if (d == 0) { 524 | target = word; 525 | label = 1; 526 | } else { 527 | next_random = next_random * (unsigned long long)25214903917 + 11; 528 | target = table[(next_random >> 16) % table_size]; 529 | if (target == 0) target = next_random % (vocab_size - 1) + 1; 530 | if (target == word) continue; 531 | label = 0; 532 | } 533 | l2 = target * layer1_size; 534 | f = 0; 535 | for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2]; 536 | if (f > MAX_EXP) g = (label - 1) * alpha; 537 | else if (f < -MAX_EXP) g = (label - 0) * alpha; 538 | else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha; 539 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2]; 540 | for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1]; 541 | } 542 | // Learn weights input -> hidden 543 | for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c]; 544 | } 545 | } 546 | sentence_position++; 547 | if (sentence_position >= sentence_length) { 548 | sentence_length = 0; 549 | continue; 550 | } 551 | } 552 | fclose(fi); 553 | free(neu1); 554 | free(neu1e); 555 | pthread_exit(NULL); 556 | } 557 | 558 | void TrainModel() { 559 | long a, b, c, d; 560 | FILE *fo; 561 | pthread_t *pt = (pthread_t *)malloc(num_threads * sizeof(pthread_t)); 562 | printf("Starting training using file %s\n", train_file); 563 | starting_alpha = alpha; 564 | if (read_vocab_file[0] != 0) ReadVocab(); else LearnVocabFromTrainFile(); 565 | if (save_vocab_file[0] != 0) SaveVocab(); 566 | if (output_file[0] == 0) return; 567 | InitNet(); 568 | if (negative > 0) InitUnigramTable(); 569 | start = clock(); 570 | for (a = 0; a < num_threads; a++) pthread_create(&pt[a], NULL, TrainModelThread, (void *)a); 571 | for (a = 0; a < num_threads; a++) pthread_join(pt[a], NULL); 572 | fo = fopen(output_file, "wb"); 573 | if (classes == 0) { 574 | // Save the word vectors 575 | fprintf(fo, "%lld %lld\n", vocab_size, layer1_size); 576 | for (a = 0; a < vocab_size; a++) { 577 | fprintf(fo, "%s ", vocab[a].word); 578 | if (binary) for (b = 0; b < layer1_size; b++) fwrite(&syn0[a * layer1_size + b], sizeof(real), 1, fo); 579 | else for (b = 0; b < layer1_size; b++) fprintf(fo, "%lf ", syn0[a * layer1_size + b]); 580 | fprintf(fo, "\n"); 581 | } 582 | } else { 583 | // Run K-means on the word vectors 584 | int clcn = classes, iter = 10, closeid; 585 | int *centcn = (int *)malloc(classes * sizeof(int)); 586 | int *cl = (int *)calloc(vocab_size, sizeof(int)); 587 | real closev, x; 588 | real *cent = (real *)calloc(classes * layer1_size, sizeof(real)); 589 | for (a = 0; a < vocab_size; a++) cl[a] = a % clcn; 590 | for (a = 0; a < iter; a++) { 591 | for (b = 0; b < clcn * layer1_size; b++) cent[b] = 0; 592 | for (b = 0; b < clcn; b++) centcn[b] = 1; 593 | for (c = 0; c < vocab_size; c++) { 594 | for (d = 0; d < layer1_size; d++) cent[layer1_size * cl[c] + d] += syn0[c * layer1_size + d]; 595 | centcn[cl[c]]++; 596 | } 597 | for (b 
= 0; b < clcn; b++) { 598 | closev = 0; 599 | for (c = 0; c < layer1_size; c++) { 600 | cent[layer1_size * b + c] /= centcn[b]; 601 | closev += cent[layer1_size * b + c] * cent[layer1_size * b + c]; 602 | } 603 | closev = sqrt(closev); 604 | for (c = 0; c < layer1_size; c++) cent[layer1_size * b + c] /= closev; 605 | } 606 | for (c = 0; c < vocab_size; c++) { 607 | closev = -10; 608 | closeid = 0; 609 | for (d = 0; d < clcn; d++) { 610 | x = 0; 611 | for (b = 0; b < layer1_size; b++) x += cent[layer1_size * d + b] * syn0[c * layer1_size + b]; 612 | if (x > closev) { 613 | closev = x; 614 | closeid = d; 615 | } 616 | } 617 | cl[c] = closeid; 618 | } 619 | } 620 | // Save the K-means classes 621 | for (a = 0; a < vocab_size; a++) fprintf(fo, "%s %d\n", vocab[a].word, cl[a]); 622 | free(centcn); 623 | free(cent); 624 | free(cl); 625 | } 626 | fclose(fo); 627 | } 628 | 629 | int ArgPos(char *str, int argc, char **argv) { 630 | int a; 631 | for (a = 1; a < argc; a++) if (!strcmp(str, argv[a])) { 632 | if (a == argc - 1) { 633 | printf("Argument missing for %s\n", str); 634 | exit(1); 635 | } 636 | return a; 637 | } 638 | return -1; 639 | } 640 | 641 | int main(int argc, char **argv) { 642 | int i; 643 | if (argc == 1) { 644 | printf("WORD VECTOR estimation toolkit v 0.1c\n\n"); 645 | printf("Options:\n"); 646 | printf("Parameters for training:\n"); 647 | printf("\t-train <file>\n"); 648 | printf("\t\tUse text data from <file> to train the model\n"); 649 | printf("\t-output <file>\n"); 650 | printf("\t\tUse <file> to save the resulting word vectors / word clusters\n"); 651 | printf("\t-size <int>\n"); 652 | printf("\t\tSet size of word vectors; default is 100\n"); 653 | printf("\t-window <int>\n"); 654 | printf("\t\tSet max skip length between words; default is 5\n"); 655 | printf("\t-sample <float>\n"); 656 | printf("\t\tSet threshold for occurrence of words. Those that appear with higher frequency in the training data\n"); 657 | printf("\t\twill be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5)\n"); 658 | printf("\t-hs <int>\n"); 659 | printf("\t\tUse Hierarchical Softmax; default is 0 (not used)\n"); 660 | printf("\t-negative <int>\n"); 661 | printf("\t\tNumber of negative examples; default is 5, common values are 3 - 10 (0 = not used)\n"); 662 | printf("\t-threads <int>\n"); 663 | printf("\t\tUse <int> threads (default 12)\n"); 664 | printf("\t-iter <int>\n"); 665 | printf("\t\tRun more training iterations (default 5)\n"); 666 | printf("\t-min-count <int>\n"); 667 | printf("\t\tThis will discard words that appear less than <int> times; default is 5\n"); 668 | printf("\t-alpha <float>\n"); 669 | printf("\t\tSet the starting learning rate; default is 0.025 for skip-gram and 0.05 for CBOW\n"); 670 | printf("\t-classes <int>\n"); 671 | printf("\t\tOutput word classes rather than word vectors; default number of classes is 0 (vectors are written)\n"); 672 | printf("\t-debug <int>\n"); 673 | printf("\t\tSet the debug mode (default = 2 = more info during training)\n"); 674 | printf("\t-binary <int>\n"); 675 | printf("\t\tSave the resulting vectors in binary mode; default is 0 (off)\n"); 676 | printf("\t-save-vocab <file>\n"); 677 | printf("\t\tThe vocabulary will be saved to <file>\n"); 678 | printf("\t-read-vocab <file>\n"); 679 | printf("\t\tThe vocabulary will be read from <file>, not constructed from the training data\n"); 680 | printf("\t-cbow <int>\n"); 681 | printf("\t\tUse the continuous bag of words model; default is 1 (use 0 for skip-gram model)\n"); 682 | printf("\t-sentence-vectors <int>\n"); 683 | printf("\t\tAssume the first token at the beginning of each line is a sentence ID.
This token will be trained\n"); 684 | printf("\t\twith full sentence context instead of just the window. Use 1 to turn on.\n"); 685 | printf("\nExamples:\n"); 686 | printf("./word2vec -train data.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 1 -iter 3\n\n"); 687 | return 0; 688 | } 689 | output_file[0] = 0; 690 | save_vocab_file[0] = 0; 691 | read_vocab_file[0] = 0; 692 | if ((i = ArgPos((char *)"-size", argc, argv)) > 0) layer1_size = atoi(argv[i + 1]); 693 | if ((i = ArgPos((char *)"-train", argc, argv)) > 0) strcpy(train_file, argv[i + 1]); 694 | if ((i = ArgPos((char *)"-save-vocab", argc, argv)) > 0) strcpy(save_vocab_file, argv[i + 1]); 695 | if ((i = ArgPos((char *)"-read-vocab", argc, argv)) > 0) strcpy(read_vocab_file, argv[i + 1]); 696 | if ((i = ArgPos((char *)"-debug", argc, argv)) > 0) debug_mode = atoi(argv[i + 1]); 697 | if ((i = ArgPos((char *)"-binary", argc, argv)) > 0) binary = atoi(argv[i + 1]); 698 | if ((i = ArgPos((char *)"-cbow", argc, argv)) > 0) cbow = atoi(argv[i + 1]); 699 | if (cbow) alpha = 0.05; 700 | if ((i = ArgPos((char *)"-alpha", argc, argv)) > 0) alpha = atof(argv[i + 1]); 701 | if ((i = ArgPos((char *)"-output", argc, argv)) > 0) strcpy(output_file, argv[i + 1]); 702 | if ((i = ArgPos((char *)"-window", argc, argv)) > 0) window = atoi(argv[i + 1]); 703 | if ((i = ArgPos((char *)"-sample", argc, argv)) > 0) sample = atof(argv[i + 1]); 704 | if ((i = ArgPos((char *)"-hs", argc, argv)) > 0) hs = atoi(argv[i + 1]); 705 | if ((i = ArgPos((char *)"-negative", argc, argv)) > 0) negative = atoi(argv[i + 1]); 706 | if ((i = ArgPos((char *)"-threads", argc, argv)) > 0) num_threads = atoi(argv[i + 1]); 707 | if ((i = ArgPos((char *)"-iter", argc, argv)) > 0) iter = atoi(argv[i + 1]); 708 | if ((i = ArgPos((char *)"-min-count", argc, argv)) > 0) min_count = atoi(argv[i + 1]); 709 | if ((i = ArgPos((char *)"-classes", argc, argv)) > 0) classes = atoi(argv[i + 1]); 710 | if ((i = ArgPos((char *)"-sentence-vectors", argc, argv)) > 0) sentence_vectors = atoi(argv[i + 1]); 711 | vocab = (struct vocab_word *)calloc(vocab_max_size, sizeof(struct vocab_word)); 712 | vocab_hash = (int *)calloc(vocab_hash_size, sizeof(int)); 713 | expTable = (real *)malloc((EXP_TABLE_SIZE + 1) * sizeof(real)); 714 | for (i = 0; i < EXP_TABLE_SIZE; i++) { 715 | expTable[i] = exp((i / (real)EXP_TABLE_SIZE * 2 - 1) * MAX_EXP); // Precompute the exp() table 716 | expTable[i] = expTable[i] / (expTable[i] + 1); // Precompute f(x) = x / (x + 1) 717 | } 718 | TrainModel(); 719 | return 0; 720 | } 721 | -------------------------------------------------------------------------------- /corpus/Hownet/neg_opinion.txt: -------------------------------------------------------------------------------- 1 | 僄 2 | 啰啰唆唆 3 | 啰啰嗦嗦 4 | 啰里啰唆 5 | 啰里啰嗦 6 | 啰唆 7 | 啰嗦 8 | 噲 9 | 奓着头发 10 | 婞 11 | 婞直 12 | 崒 13 | 弇陋 14 | 惛 15 | 惼 16 | 梼昧 17 | 獪 18 | 瘆 19 | 瘆得慌 20 | 哀鸿遍野 21 | 矮 22 | 碍难 23 | 碍眼 24 | 爱搭不理 25 | 爱理不理 26 | 暗 27 | 暗暗 28 | 暗沉沉 29 | 暗淡 30 | 暗地 31 | 暗地里 32 | 暗黑 33 | 暗里 34 | 暗昧 35 | 暗弱 36 | 暗无天日 37 | 暗下 38 | 暗中 39 | 暗自 40 | 暗朦 41 | 岸然 42 | 肮里肮脏 43 | 肮脏 44 | 昂贵 45 | 凹凸 46 | 凹凸不平 47 | 傲 48 | 傲岸 49 | 傲慢 50 | 八面玲珑 51 | 跋扈 52 | 霸道 53 | 霸气 54 | 白 55 | 白白 56 | 白痴般 57 | 白搭 58 | 白忙 59 | 白忙活儿 60 | 白衣苍狗 61 | 白云苍狗 62 | 百孔千疮 63 | 败坏 64 | 稗 65 | 板 66 | 板板六十四 67 | 板滞 68 | 半半拉拉 69 | 半路出家 70 | 半新不旧 71 | 半真半假 72 | 薄 73 | 薄情 74 | 薄弱 75 | 薄幸 76 | 保残守缺 77 | 保守 78 | 抱残守缺 79 | 暴 80 | 暴烈 81 | 暴虐 82 | 暴躁 83 | 暴戾 84 | 暴戾恣睢 85 | 爆炸性 86 | 悲惨 87 | 悲观 88 | 悲观地 89 | 悲剧 90 | 悲凉 91 | 卑 92 | 卑鄙 93 | 卑鄙无耻 94 | 
卑贱 95 | 卑劣 96 | 卑陋 97 | 卑怯 98 | 卑俗 99 | 卑琐 100 | 卑微 101 | 卑污 102 | 卑下 103 | 卑猥 104 | 背地 105 | 背地里 106 | 背光 107 | 背后 108 | 背悔 109 | 背静 110 | 背靠背 111 | 背理 112 | 背令 113 | 背人 114 | 背时 115 | 背阴 116 | 被动 117 | 被动式 118 | 被动性 119 | 本本主义 120 | 笨 121 | 笨手笨脚 122 | 笨头笨脑 123 | 笨重 124 | 笨拙 125 | 比肩接踵 126 | 鄙 127 | 鄙贱 128 | 鄙吝 129 | 鄙陋 130 | 鄙俗 131 | 鄙俚 132 | 蔽塞 133 | 闭塞 134 | 必修 135 | 变化不定 136 | 变化多端 137 | 变化万千 138 | 变化无常 139 | 变化无穷 140 | 变幻不定 141 | 变幻莫测 142 | 变幻无常 143 | 变态 144 | 变相 145 | 表里不一 146 | 憋拗 147 | 别别扭扭 148 | 别扭 149 | 别有用心 150 | 冰冷 151 | 冰炭不相容 152 | 秉性剌戾 153 | 病病歪歪 154 | 病弱 155 | 病态 156 | 病歪歪 157 | 病殃殃 158 | 病恹恹 159 | 波谲云诡 160 | 驳杂 161 | 捕风捉影 162 | 不爱交际 163 | 不便 164 | 不辨菽麦 165 | 不才 166 | 不成材 167 | 不成功 168 | 不成话 169 | 不成器 170 | 不成熟 171 | 不成体统 172 | 不成样子 173 | 不打紧 174 | 不大重要 175 | 不当 176 | 不到黄河心不死 177 | 不道德 178 | 不得当 179 | 不得劲 180 | 不得了 181 | 不得体 182 | 不登大雅之堂 183 | 不等 184 | 不端 185 | 不对 186 | 不对茬儿 187 | 不对称 188 | 不对付 189 | 不对劲 190 | 不对头 191 | 不发达 192 | 不法 193 | 不方便 194 | 不分青红皂白 195 | 不分皂白 196 | 不负责任 197 | 不干不净 198 | 不更事 199 | 不公 200 | 不公正 201 | 不共戴天 202 | 不够格 203 | 不够完善 204 | 不够意思 205 | 不关痛痒 206 | 不管三七二十一 207 | 不管用 208 | 不光彩 209 | 不光明 210 | 不规则 211 | 不轨 212 | 不好吃 213 | 不好看 214 | 不好客 215 | 不好卖 216 | 不好使 217 | 不好听 218 | 不好用 219 | 不和 220 | 不和蔼 221 | 不和谐 222 | 不合法 223 | 不合格 224 | 不合理 225 | 不合逻辑 226 | 不合情理 227 | 不合时令 228 | 不合时宜 229 | 不合适 230 | 不合算 231 | 不合宜 232 | 不合语法 233 | 不划算 234 | 不济 235 | 不济事 236 | 不俭省 237 | 不健康 238 | 不洁 239 | 不谨慎 240 | 不近情理 241 | 不近人情 242 | 不尽 243 | 不尽如人意 244 | 不精确 245 | 不敬 246 | 不绝如缕 247 | 不均匀 248 | 不开化 249 | 不堪入耳 250 | 不堪入目 251 | 不堪设想 252 | 不堪一击 253 | 不堪造就 254 | 不科学 255 | 不可爱 256 | 不可补救 257 | 不可读 258 | 不可告人 259 | 不可更新 260 | 不可恢复 261 | 不可降解 262 | 不可接受 263 | 不可救药 264 | 不可理喻 265 | 不可逆转 266 | 不可容忍 267 | 不可收拾 268 | 不可挽回 269 | 不可行 270 | 不可一世 271 | 不可逾越 272 | 不客气 273 | 不宽容 274 | 不郎不秀 275 | 不冷不热 276 | 不理智 277 | 不礼貌 278 | 不利 279 | 不利于健康 280 | 不力 281 | 不良 282 | 不灵敏 283 | 不灵巧 284 | 不流行 285 | 不伦不类 286 | 不美观 287 | 不妙 288 | 不民主 289 | 不明不白 290 | 不明显 291 | 不明智 292 | 不名一文 293 | 不名誉 294 | 不能解救 295 | 不能容忍 296 | 不宁 297 | 不努力 298 | 不平 299 | 不平等 300 | 不平衡 301 | 不起眼 302 | 不起眼儿 303 | 不巧 304 | 不切实际 305 | 不清不白 306 | 不清不楚 307 | 不清楚 308 | 不清洁 309 | 不确切 310 | 不仁 311 | 不仁不义 312 | 不人道 313 | 不三不四 314 | 不善 315 | 不善交际 316 | 不善交谈 317 | 不甚重要 318 | 不慎 319 | 不胜 320 | 不是味儿 321 | 不是滋味儿 322 | 不适当 323 | 不适宜 324 | 不适应 325 | 不适于居住 326 | 不受欢迎 327 | 不熟练 328 | 不疼不痒 329 | 不体面 330 | 不痛不痒 331 | 不透明 332 | 不透气 333 | 不妥 334 | 不为人知 335 | 不卫生 336 | 不文明 337 | 不文雅 338 | 不稳定 339 | 不问青红皂白 340 | 不问三七二十一 341 | 不问是非情由 342 | 不显眼 343 | 不现实 344 | 不相适应 345 | 不祥 346 | 不详 347 | 不详尽 348 | 不像话 349 | 不消化 350 | 不孝 351 | 不肖 352 | 不协调 353 | 不兴 354 | 不行 355 | 不幸 356 | 不修边幅 357 | 不学无术 358 | 不逊 359 | 不雅 360 | 不雅观 361 | 不雅致 362 | 不要紧 363 | 不一致 364 | 不宜 365 | 不宜居住 366 | 不宜说出口 367 | 不易 368 | 不友好 369 | 不友善 370 | 不择手段 371 | 不真诚 372 | 不真实 373 | 不贞洁 374 | 不正常 375 | 不正当 376 | 不正派 377 | 不正直 378 | 不值得羡慕 379 | 不值一文 380 | 不中用 381 | 不重要 382 | 不周 383 | 不周到 384 | 不注意 385 | 不着边际 386 | 不着调 387 | 不足道 388 | 不足挂齿 389 | 不足轻重 390 | 不足取 391 | 不足为外人道 392 | 不足为训 393 | 不羁 394 | 不稂不莠 395 | 不虔诚 396 | 才疏学浅 397 | 财迷心窍 398 | 残 399 | 残败 400 | 残暴 401 | 残毒 402 | 残酷 403 | 残酷无情 404 | 残虐 405 | 残破 406 | 残破不全 407 | 残缺 408 | 残缺不全 409 | 残忍 410 | 残损 411 | 惨 412 | 惨不忍睹 413 | 惨淡 414 | 惨毒 415 | 惨绝人寰 416 | 惨苦 417 | 惨厉 418 | 惨烈 419 | 惨痛 420 | 惨无人道 421 | 苍白 422 | 苍白无力 423 | 苍凉 424 | 苍茫 425 | 操切 426 | 糙 427 | 草 428 | 草草 429 | 草荒 430 | 草率 431 | 草木皆兵 432 | 策略 433 | 策略性 434 | 差 435 | 差点儿 436 | 差劲 437 | 差可 438 | 豺狼成性 439 | 豺狼当道 440 | 缠手 441 | 颤颤巍巍 442 | 颤颤悠悠 443 | 颤巍巍 444 | 猖 445 | 猖狂 446 | 长长短短 447 | 
长篇大论 448 | 长篇累牍 449 | 长线 450 | 超标 451 | 超常 452 | 超然 453 | 超重 454 | 朝不保夕 455 | 朝不谋夕 456 | 朝秦暮楚 457 | 朝三暮四 458 | 潮 459 | 吵吵闹闹 460 | 吵人 461 | 沉闷 462 | 沉痛 463 | 沉滞 464 | 陈 465 | 陈腐 466 | 陈旧 467 | 成事不足,败事有余 468 | 逞性 469 | 逞性子 470 | 吃不开 471 | 吃劲 472 | 吃力 473 | 吃重 474 | 痴 475 | 痴痴 476 | 痴呆 477 | 痴呆呆 478 | 痴傻 479 | 痴愚 480 | 迟钝 481 | 侈 482 | 侈靡 483 | 侈糜 484 | 赤地千里 485 | 赤裸裸淫秽 486 | 赤贫 487 | 充满危机 488 | 冲昏头脑 489 | 丑 490 | 丑恶 491 | 丑陋 492 | 臭 493 | 臭不可闻 494 | 臭哄哄 495 | 臭烘烘 496 | 臭名远扬 497 | 臭名昭彰 498 | 臭名昭著 499 | 臭气冲天 500 | 臭气熏天 501 | 臭味 502 | 初出茅庐 503 | 出手阔气 504 | 触目惊心 505 | 穿不出去 506 | 穿不得 507 | 串秧儿 508 | 疮痍满目 509 | 蠢 510 | 蠢笨 511 | 蠢头蠢脑 512 | 刺鼻 513 | 刺耳 514 | 刺眼 515 | 次 516 | 次等 517 | 从动 518 | 从心所欲 519 | 从严 520 | 从重 521 | 粗 522 | 粗暴 523 | 粗笨 524 | 粗鄙 525 | 粗糙 526 | 粗放 527 | 粗拉 528 | 粗里粗气 529 | 粗劣 530 | 粗陋 531 | 粗鲁 532 | 粗率 533 | 粗蛮 534 | 粗莽 535 | 粗浅 536 | 粗涩 537 | 粗手笨脚 538 | 粗疏 539 | 粗俗 540 | 粗线条 541 | 粗心 542 | 粗心大意 543 | 粗野 544 | 粗枝大叶 545 | 粗制滥造 546 | 粗重 547 | 粗拙 548 | 粗犷 549 | 促狭 550 | 脆弱 551 | 村气 552 | 村野 553 | 寸草不生 554 | 错 555 | 错乱 556 | 错误 557 | 错误百出 558 | 错杂 559 | 错综 560 | 错综复杂 561 | 大 562 | 大错而特错 563 | 大错特错 564 | 大大咧咧 565 | 大而笨拙 566 | 大而化之 567 | 大而无当 568 | 大海捞针 569 | 大面额 570 | 大谬不然 571 | 大手大脚 572 | 大肆 573 | 大摇大摆 574 | 大意 575 | 大咧咧 576 | 呆 577 | 呆板 578 | 呆笨 579 | 呆痴 580 | 呆呆 581 | 呆钝 582 | 呆气 583 | 呆傻 584 | 呆头呆脑 585 | 歹 586 | 歹毒 587 | 带有敌意 588 | 殆 589 | 怠惰 590 | 单 591 | 单薄 592 | 单调 593 | 单调枯燥 594 | 单弱 595 | 胆怯 596 | 胆小 597 | 胆小怕事 598 | 胆小如鼠 599 | 淡 600 | 淡薄 601 | 淡淡 602 | 淡而无味 603 | 淡漠 604 | 淡然 605 | 诞 606 | 荡 607 | 刀光剑影 608 | 蹈常袭故 609 | 倒胃口 610 | 道德败坏 611 | 道貌岸然 612 | 德行 613 | 德性 614 | 得寸进尺 615 | 得陇望蜀 616 | 得鱼忘筌 617 | 灯红酒绿 618 | 灯火阑珊 619 | 等而下之 620 | 等外 621 | 等因奉此 622 | 低 623 | 低卑 624 | 低标准 625 | 低层 626 | 低档 627 | 低等 628 | 低端 629 | 低级 630 | 低贱 631 | 低劣 632 | 低迷 633 | 低能 634 | 低人一等 635 | 低三下四 636 | 低声下气 637 | 低俗 638 | 低下 639 | 低效 640 | 低效能 641 | 低值 642 | 低智 643 | 低质 644 | 滴里嘟噜 645 | 敌对 646 | 地位低下 647 | 地下 648 | 地狱般 649 | 颠倒 650 | 颠连 651 | 颠三倒四 652 | 凋敝 653 | 刁 654 | 刁恶 655 | 刁悍 656 | 刁滑 657 | 刁赖 658 | 刁蛮 659 | 刁钻 660 | 刁钻古怪 661 | 吊儿郎当 662 | 调皮 663 | 鼎沸 664 | 丢魂 665 | 丢脸 666 | 丢三落四 667 | 东倒西歪 668 | 冬烘 669 | 动荡 670 | 动荡不安 671 | 动魄惊心 672 | 动作迟顿 673 | 毒 674 | 毒辣 675 | 独裁 676 | 独断 677 | 度量小 678 | 短浅 679 | 短视 680 | 钝 681 | 多变 682 | 多病 683 | 多事 684 | 多义 685 | 多余 686 | 惰 687 | 惰性 688 | 讹 689 | 恶 690 | 恶毒 691 | 恶贯满盈 692 | 恶狠狠 693 | 恶劣 694 | 恶煞煞 695 | 恶心 696 | 恶浊 697 | 饿殍遍野 698 | 耳生 699 | 二把刀 700 | 二手 701 | 二五眼 702 | 发狂 703 | 发腻 704 | 发育不全 705 | 乏 706 | 乏味 707 | 翻手为云,覆手为雨 708 | 翻云覆雨 709 | 繁复 710 | 繁乱 711 | 繁难 712 | 繁冗 713 | 繁琐 714 | 繁芜 715 | 繁杂 716 | 繁重 717 | 繁缛 718 | 烦 719 | 烦难 720 | 烦冗 721 | 烦琐 722 | 烦嚣 723 | 反 724 | 反常 725 | 反对称 726 | 反反复复 727 | 反复无常 728 | 反面 729 | 反叛 730 | 反社会 731 | 犯有罪行 732 | 饭桶 733 | 泛 734 | 泛泛 735 | 放诞 736 | 放荡 737 | 放荡不羁 738 | 放浪 739 | 放肆 740 | 放纵 741 | 菲 742 | 菲薄 743 | 非 744 | 非法 745 | 非分 746 | 非婚生 747 | 非礼 748 | 非人 749 | 非生产性 750 | 非正常 751 | 非正统 752 | 非正义 753 | 废 754 | 废弛 755 | 废旧 756 | 废物 757 | 沸沸扬扬 758 | 费 759 | 费工夫 760 | 费功夫 761 | 费劲 762 | 费力 763 | 费时 764 | 费事 765 | 纷 766 | 纷繁 767 | 纷乱 768 | 纷扰 769 | 纷杂 770 | 封闭 771 | 封闭式 772 | 封闭型 773 | 封建 774 | 锋芒毕露 775 | 风吹日晒 776 | 风刀霜剑 777 | 风风火火 778 | 风流 779 | 风骚 780 | 风声鹤唳 781 | 风雨飘摇 782 | 疯疯癫癫 783 | 疯狂 784 | 疯狂般 785 | 疯癫癫 786 | 否 787 | 否定 788 | 肤泛 789 | 肤皮潦草 790 | 肤浅 791 | 浮 792 | 浮泛 793 | 浮光掠影 794 | 浮滑 795 | 浮皮蹭痒 796 | 浮漂 797 | 浮浅 798 | 浮躁 799 | 浮噪 800 | 腐败 801 | 腐败堕落 802 | 腐臭 803 | 腐恶 804 | 腐化 805 | 腐化堕落 806 | 腐旧 807 | 腐烂 808 | 腐朽 809 | 腐朽没落 810 | 覆雨翻云 811 | 复 812 | 复合 813 | 复合式 814 | 复合型 815 | 复杂 816 | 复杂多变 817 | 傅会 818 | 负 819 | 
负面 820 | 富余 821 | 附会 822 | 嘎 823 | 该死 824 | 概念化 825 | 干 826 | 干巴 827 | 干巴巴 828 | 干瘪 829 | 干瘪瘪 830 | 干干巴巴 831 | 干燥 832 | 赶尽杀绝 833 | 刚愎 834 | 刚愎自用 835 | 高昂 836 | 高傲 837 | 高不成,低不就 838 | 高不成低不就 839 | 高成本 840 | 高价 841 | 高价位 842 | 高难 843 | 高难度 844 | 高压 845 | 疙疙瘩瘩 846 | 疙里疙瘩 847 | 隔靴搔痒 848 | 勾心斗角 849 | 苟且 850 | 狗眼看人 851 | 垢 852 | 够呛 853 | 够戗 854 | 孤 855 | 孤傲 856 | 孤傲不群 857 | 孤单 858 | 孤单单 859 | 孤独 860 | 孤孤单单 861 | 孤寡 862 | 孤寂 863 | 孤立 864 | 孤立无援 865 | 孤零零 866 | 孤陋寡闻 867 | 孤僻 868 | 古怪 869 | 古旧 870 | 古里古怪 871 | 固定不变 872 | 固执 873 | 寡 874 | 寡淡 875 | 寡断 876 | 寡了叭叽 877 | 寡情 878 | 寡味 879 | 寡言 880 | 寡言少语 881 | 挂漏 882 | 挂名 883 | 挂一漏万 884 | 乖谬 885 | 乖僻 886 | 乖张 887 | 乖剌 888 | 乖戾 889 | 怪里怪气 890 | 怪僻 891 | 官僚 892 | 官僚主义 893 | 光怪陆离 894 | 鬼 895 | 鬼鬼祟祟 896 | 鬼计多端 897 | 鬼头鬼脑 898 | 诡 899 | 诡计多端 900 | 诡秘 901 | 诡诈 902 | 诡谲 903 | 贵 904 | 过当 905 | 过分简单化 906 | 过分拥挤 907 | 过河拆桥 908 | 过了气 909 | 过气 910 | 过桥抽板 911 | 过时 912 | 哈喇 913 | 孩子气 914 | 海底捞月 915 | 海底捞针 916 | 害 917 | 骇人听闻 918 | 憨 919 | 含含糊糊 920 | 含含混混 921 | 含糊 922 | 含糊不清 923 | 含糊其辞 924 | 含糊其词 925 | 含混 926 | 含混不清 927 | 含蓄 928 | 涵蓄 929 | 寒 930 | 寒苦 931 | 寒素 932 | 寒酸 933 | 寒微 934 | 寒伧 935 | 寒碜 936 | 悍 937 | 悍然 938 | 豪 939 | 豪侈 940 | 豪横 941 | 豪强 942 | 豪奢 943 | 毫不客气 944 | 毫不留情 945 | 毫无价值 946 | 毫无目标 947 | 毫无意义 948 | 毫无用处 949 | 好不容易 950 | 好容易 951 | 好事多磨 952 | 黑 953 | 黑暗 954 | 黑沉沉 955 | 黑灯瞎火 956 | 黑洞洞 957 | 黑咕隆咚 958 | 黑乎乎 959 | 黑茫茫 960 | 黑蒙蒙 961 | 黑漆寥光 962 | 黑漆漆 963 | 黑森森 964 | 黑心 965 | 黑心肠 966 | 黑黝黝 967 | 黑黢黢 968 | 狠 969 | 狠毒 970 | 狠劲 971 | 狠心 972 | 横 973 | 横暴 974 | 横加 975 | 横蛮无理 976 | 横七竖八 977 | 哄然 978 | 猴 979 | 后患无穷 980 | 后进 981 | 呼幺喝六 982 | 胡 983 | 胡里胡涂 984 | 胡乱 985 | 胡子拉茬 986 | 胡子拉碴 987 | 糊糊涂涂 988 | 糊里糊涂 989 | 糊涂 990 | 虎踞龙蟠 991 | 虎头蛇尾 992 | 花 993 | 花插着 994 | 花搭着 995 | 花花搭搭 996 | 花里胡哨 997 | 花钱浪费 998 | 花拳绣腿 999 | 花天酒地 1000 | 花心 1001 | 哗然 1002 | 华 1003 | 华而不实 1004 | 猾 1005 | 滑 1006 | 滑头 1007 | 滑头滑脑 1008 | 怀着恶意 1009 | 坏 1010 | 坏脾气 1011 | 坏人当道 1012 | 幻 1013 | 幻异 1014 | 荒 1015 | 荒诞 1016 | 荒诞不经 1017 | 荒诞派 1018 | 荒诞无稽 1019 | 荒废 1020 | 荒寂 1021 | 荒凉 1022 | 荒乱 1023 | 荒落 1024 | 荒谬 1025 | 荒谬绝伦 1026 | 荒漠 1027 | 荒僻 1028 | 荒弃 1029 | 荒疏 1030 | 荒唐 1031 | 荒唐无稽 1032 | 荒无人烟 1033 | 荒芜 1034 | 荒淫 1035 | 荒淫无耻 1036 | 荒淫无度 1037 | 荒瘠 1038 | 黄色 1039 | 晃晃悠悠 1040 | 晃悠悠 1041 | 恍恍惚惚 1042 | 恍惚 1043 | 谎 1044 | 灰暗 1045 | 灰沉沉 1046 | 灰溜溜 1047 | 灰茫茫 1048 | 灰蒙蒙 1049 | 灰色 1050 | 灰头灰脸 1051 | 灰头土脸 1052 | 灰秃秃 1053 | 灰朦朦 1054 | 慧黠 1055 | 晦 1056 | 晦暗 1057 | 晦涩 1058 | 晦冥 1059 | 晦暝 1060 | 秽 1061 | 秽恶 1062 | 秽乱 1063 | 秽土 1064 | 秽亵 1065 | 会来事 1066 | 昏 1067 | 昏暗 1068 | 昏沉 1069 | 昏黑 1070 | 昏乱 1071 | 昏昧 1072 | 昏天黑地 1073 | 昏头昏脑 1074 | 昏庸 1075 | 昏愦 1076 | 昏聩 1077 | 婚外 1078 | 浑 1079 | 浑浑噩噩 1080 | 浑头浑脑 1081 | 浑浊 1082 | 浑噩 1083 | 混 1084 | 混合 1085 | 混混沌沌 1086 | 混交 1087 | 混乱 1088 | 混淆不清 1089 | 混血 1090 | 混账 1091 | 混浊 1092 | 混沌 1093 | 活动 1094 | 火暴 1095 | 火爆 1096 | 祸不单行 1097 | 祸从天降 1098 | 机变 1099 | 机械 1100 | 机械式 1101 | 机械性 1102 | 畸 1103 | 畸轻畸重 1104 | 畸形 1105 | 积满灰尘 1106 | 积重难返 1107 | 鸡零狗碎 1108 | 鸡毛蒜皮 1109 | 鸡犬不留 1110 | 鸡犬不宁 1111 | 棘手 1112 | 急不可待 1113 | 急功近利 1114 | 急切 1115 | 急性子 1116 | 急于 1117 | 急躁 1118 | 疾言厉色 1119 | 挤 1120 | 挤巴 1121 | 挤得水泄不通 1122 | 挤得要命 1123 | 挤挤插插 1124 | 挤满 1125 | 寂 1126 | 寂寥 1127 | 寂寞 1128 | 忌刻 1129 | 夹七夹八 1130 | 家长式 1131 | 家贫如洗 1132 | 家徒壁立 1133 | 家徒四壁 1134 | 假 1135 | 假冒 1136 | 假模假式 1137 | 假仁假义 1138 | 假想 1139 | 假惺惺 1140 | 假意 1141 | 假造 1142 | 假正经 1143 | 假装神圣 1144 | 价高 1145 | 价格不菲 1146 | 价格高昂 1147 | 架空 1148 | 尖刻 1149 | 尖酸 1150 | 尖酸刻薄 1151 | 尖嘴薄舌 1152 | 尖嘴猴腮 1153 | 间不容发 1154 | 间杂 1155 | 肩摩毂击 1156 | 艰 1157 | 艰巨 1158 | 艰苦 1159 | 艰苦卓绝 1160 | 艰难 1161 | 艰难曲折 1162 | 艰难险阻 1163 | 艰涩 1164 | 艰深 1165 | 艰危 1166 | 艰辛 1167 | 奸 
1168 | 奸刁 1169 | 奸恶 1170 | 奸猾 1171 | 奸险 1172 | 奸邪 1173 | 奸诈 1174 | 奸佞 1175 | 简单 1176 | 简陋 1177 | 简慢 1178 | 贱 1179 | 见不得人 1180 | 见风使舵 1181 | 见风转舵 1182 | 见识短浅 1183 | 见异思迁 1184 | 剑拔弩张 1185 | 僵 1186 | 僵化 1187 | 僵硬 1188 | 胶柱鼓瑟 1189 | 浇薄 1190 | 浇漓 1191 | 骄 1192 | 骄傲 1193 | 骄傲自满 1194 | 骄横 1195 | 骄慢 1196 | 骄气 1197 | 骄人 1198 | 骄奢淫逸 1199 | 骄纵 1200 | 骄矜 1201 | 娇 1202 | 娇痴 1203 | 娇贵 1204 | 娇憨 1205 | 娇嫩 1206 | 娇气 1207 | 娇弱 1208 | 娇生惯养 1209 | 矫情 1210 | 矫情造作 1211 | 矫揉造作 1212 | 侥 1213 | 狡 1214 | 狡猾 1215 | 狡计多端 1216 | 狡兔三窟 1217 | 狡诈 1218 | 狡狯 1219 | 狡黠 1220 | 揭不开锅 1221 | 竭蹶 1222 | 洁身自好 1223 | 结结巴巴 1224 | 斤斤计较 1225 | 金刚努目 1226 | 紧 1227 | 紧巴 1228 | 紧巴巴 1229 | 近视 1230 | 荆棘载途 1231 | 惊爆 1232 | 惊人 1233 | 惊天动地 1234 | 惊险 1235 | 精力枯竭 1236 | 精神不振 1237 | 精神溜号 1238 | 经济拮据 1239 | 经验不足 1240 | 静僻 1241 | 净余 1242 | 窘 1243 | 窘促 1244 | 窘急 1245 | 窘困 1246 | 窘迫 1247 | 窘涩 1248 | 旧 1249 | 旧式 1250 | 拘 1251 | 拘礼 1252 | 拘执 1253 | 狙 1254 | 拒人于千里之外 1255 | 剧毒 1256 | 倔 1257 | 倔强 1258 | 倔头倔脑 1259 | 倔犟 1260 | 绝 1261 | 绝情 1262 | 峻 1263 | 开小差 1264 | 坎坷 1265 | 坎坷不平 1266 | 看风使舵 1267 | 糠 1268 | 亢 1269 | 靠不住 1270 | 苛 1271 | 苛刻 1272 | 磕磕绊绊 1273 | 磕头碰脑 1274 | 可悲 1275 | 可鄙 1276 | 可怖 1277 | 可耻 1278 | 可恶 1279 | 可骇 1280 | 可恨 1281 | 可惊 1282 | 可怜 1283 | 可怕 1284 | 可叹 1285 | 可有可无 1286 | 可憎 1287 | 刻板 1288 | 刻薄 1289 | 刻毒 1290 | 刻舟求剑 1291 | 坑坑洼洼 1292 | 坑洼 1293 | 坑洼不平 1294 | 空 1295 | 空洞 1296 | 空洞洞 1297 | 空洞无聊 1298 | 空洞无物 1299 | 空乏 1300 | 空泛 1301 | 空幻 1302 | 空空洞洞 1303 | 空落落 1304 | 空头 1305 | 空虚 1306 | 空中楼阁 1307 | 恐怖 1308 | 抠 1309 | 抠门儿 1310 | 抠搜 1311 | 抠唆 1312 | 口蜜腹剑 1313 | 口是心非 1314 | 口头上 1315 | 枯 1316 | 枯寂 1317 | 枯涩 1318 | 枯燥 1319 | 枯燥乏味 1320 | 枯燥无味 1321 | 枯槁 1322 | 苦 1323 | 苦不唧 1324 | 苦口 1325 | 苦苦 1326 | 苦涩 1327 | 酷 1328 | 酷烈 1329 | 酷虐 1330 | 夸诞 1331 | 狂 1332 | 狂傲 1333 | 狂暴 1334 | 狂荡 1335 | 狂妄 1336 | 狂妄自大 1337 | 狂躁 1338 | 狂悖 1339 | 狂恣 1340 | 困顿 1341 | 困窘 1342 | 困苦 1343 | 困难 1344 | 困难重重 1345 | 困人 1346 | 阔绰 1347 | 阔气 1348 | 拉忽 1349 | 拉拉杂杂 1350 | 拉杂 1351 | 辣 1352 | 辣手 1353 | 来路不明 1354 | 来之不易 1355 | 赖 1356 | 赖皮 1357 | 懒 1358 | 懒到极点 1359 | 懒惰 1360 | 懒散 1361 | 烂 1362 | 滥 1363 | 狼狈 1364 | 狼狈不堪 1365 | 狼籍 1366 | 狼藉 1367 | 狼心狗肺 1368 | 浪 1369 | 浪荡 1370 | 劳而无功 1371 | 老 1372 | 老大难 1373 | 老掉牙 1374 | 老赶 1375 | 老虎屁股摸不得 1376 | 老奸巨猾 1377 | 老奸巨滑 1378 | 老辣 1379 | 老派 1380 | 老气 1381 | 老气横秋 1382 | 老弱病残 1383 | 老实 1384 | 老式 1385 | 老朽 1386 | 累卵 1387 | 累赘 1388 | 累牍连篇 1389 | 肋脦 1390 | 冷 1391 | 冷冰冰 1392 | 冷淡 1393 | 冷峻 1394 | 冷酷 1395 | 冷酷无情 1396 | 冷冷 1397 | 冷冷清清 1398 | 冷厉 1399 | 冷落 1400 | 冷门 1401 | 冷漠 1402 | 冷峭 1403 | 冷清 1404 | 冷清清 1405 | 冷若冰霜 1406 | 冷销 1407 | 冷血 1408 | 冷噤 1409 | 离索 1410 | 离题 1411 | 离心离德 1412 | 理亏 1413 | 理屈 1414 | 理屈词穷 1415 | 理由不充分 1416 | 里出外进 1417 | 厉 1418 | 厉害 1419 | 厉声 1420 | 利令智昏 1421 | 利已 1422 | 利欲熏心 1423 | 哩哩啦啦 1424 | 哩哩罗罗 1425 | 哩溜歪斜 1426 | 连篇累牍 1427 | 良莠不齐 1428 | 两面光 1429 | 两面三刀 1430 | 寥 1431 | 寥寂 1432 | 潦草 1433 | 了不得 1434 | 了不起 1435 | 烈 1436 | 烈性子 1437 | 劣 1438 | 劣等 1439 | 劣质 1440 | 劣中之劣 1441 | 鳞状 1442 | 凛 1443 | 凛凛 1444 | 凛然 1445 | 吝 1446 | 吝啬 1447 | 零 1448 | 零丁 1449 | 零零散散 1450 | 零零碎碎 1451 | 零乱 1452 | 零落 1453 | 零七八碎 1454 | 零散 1455 | 零碎 1456 | 零星 1457 | 伶仃 1458 | 凌乱 1459 | 凌杂 1460 | 令人不安 1461 | 令人齿冷 1462 | 令人恶心 1463 | 令人发指 1464 | 令人费解 1465 | 令人寒心 1466 | 令人敬畏 1467 | 令人困倦 1468 | 令人毛骨悚然 1469 | 令人恼火 1470 | 令人疲倦 1471 | 令人生气 1472 | 令人生厌 1473 | 令人讨厌 1474 | 令人厌恶 1475 | 令人厌倦 1476 | 令人遗憾 1477 | 令人折断腰 1478 | 令人窒息 1479 | 令人作呕 1480 | 溜号 1481 | 流里流气 1482 | 流气 1483 | 六亲不认 1484 | 娄 1485 | 漏洞百出 1486 | 陋 1487 | 鲁 1488 | 鲁钝 1489 | 鲁莽 1490 | 碌 1491 | 碌碌 1492 | 碌碌无为 1493 | 驴唇不对马嘴 1494 | 率 1495 | 率尔 1496 | 率然 1497 | 乱 1498 | 乱成一团 1499 | 乱纷纷 1500 | 乱哄哄 1501 | 乱烘烘 1502 | 乱乎 1503 | 
乱了营 1504 | 乱乱哄哄 1505 | 乱虐并生 1506 | 乱蓬蓬 1507 | 乱七八糟 1508 | 乱套 1509 | 乱腾 1510 | 乱腾腾 1511 | 乱杂 1512 | 乱糟糟 1513 | 乱真 1514 | 乱嘈嘈 1515 | 落后 1516 | 落落寡合 1517 | 落寞 1518 | 落市 1519 | 落俗套 1520 | 落套 1521 | 落拓 1522 | 落伍 1523 | 麻 1524 | 麻痹 1525 | 麻烦 1526 | 麻麻黑 1527 | 麻木 1528 | 麻木不仁 1529 | 马虎 1530 | 马马虎虎 1531 | 埋汰 1532 | 卖不掉 1533 | 卖不动 1534 | 蛮 1535 | 蛮不讲理 1536 | 蛮悍 1537 | 蛮横 1538 | 蛮横无理 1539 | 蛮荒 1540 | 满 1541 | 满脸横肉 1542 | 满目疮痍 1543 | 漫 1544 | 漫不经心 1545 | 漫不经意 1546 | 漫无边际 1547 | 漫无目标 1548 | 漫无目的 1549 | 漫漶 1550 | 谩 1551 | 茫 1552 | 茫茫 1553 | 茫茫然 1554 | 盲目 1555 | 盲人瞎马 1556 | 莽 1557 | 莽苍 1558 | 莽莽苍苍 1559 | 莽莽撞撞 1560 | 莽撞 1561 | 猫哭老鼠 1562 | 毛 1563 | 毛糙 1564 | 毛毛躁躁 1565 | 毛手毛脚 1566 | 毛头毛脑 1567 | 毛躁 1568 | 冒 1569 | 冒牌 1570 | 冒失 1571 | 冒险 1572 | 冒有风险 1573 | 貌似强大 1574 | 貌似真实的 1575 | 贸贸然 1576 | 贸然 1577 | 没边儿 1578 | 没出息 1579 | 没骨头 1580 | 没关系 1581 | 没好气 1582 | 没见过世面 1583 | 没教养 1584 | 没劲 1585 | 没理 1586 | 没礼貌 1587 | 没良心 1588 | 没两下子 1589 | 没轻没重 1590 | 没什么了不得 1591 | 没什么了不起 1592 | 没受过教育 1593 | 没头没脑 1594 | 没头脑 1595 | 没味 1596 | 没心没肺 1597 | 没心眼儿 1598 | 没意思 1599 | 没用 1600 | 没有教养 1601 | 没有礼貌 1602 | 没有头脑 1603 | 没有学问 1604 | 没有勇气 1605 | 媚俗 1606 | 闷 1607 | 闷气 1608 | 蒙昧 1609 | 蒙蒙 1610 | 蒙蒙胧胧 1611 | 蒙胧 1612 | 孟浪 1613 | 靡丽 1614 | 靡靡 1615 | 糜 1616 | 糜烂 1617 | 迷濛 1618 | 迷宫般 1619 | 迷糊 1620 | 迷离 1621 | 迷离扑朔 1622 | 迷离倘恍 1623 | 迷漫 1624 | 迷茫 1625 | 迷蒙 1626 | 迷蒙蒙 1627 | 迷迷糊糊 1628 | 迷迷茫茫 1629 | 迷迷蒙蒙 1630 | 迷迷怔怔 1631 | 弥天 1632 | 米珠薪桂 1633 | 秘 1634 | 秘密 1635 | 密 1636 | 密不透风 1637 | 绵里藏针 1638 | 绵软 1639 | 勉勉强强 1640 | 勉强 1641 | 面呈病色 1642 | 面黄肌瘦 1643 | 面目可憎 1644 | 面目狰狞 1645 | 面色蜡黄 1646 | 面生 1647 | 面无表情 1648 | 藐小 1649 | 渺 1650 | 渺茫 1651 | 渺渺 1652 | 渺然 1653 | 渺若烟云 1654 | 渺小 1655 | 灭绝人性 1656 | 明哲保身 1657 | 名不副实 1658 | 名过其实 1659 | 名义 1660 | 名义上 1661 | 名誉扫地 1662 | 命苦 1663 | 谬 1664 | 模糊 1665 | 模糊不清 1666 | 模棱两可 1667 | 摩肩接踵 1668 | 魔鬼般 1669 | 魔怔 1670 | 莫须有 1671 | 墨 1672 | 漠 1673 | 漠不关心 1674 | 漠漠 1675 | 漠然 1676 | 寞 1677 | 陌生 1678 | 暮气 1679 | 暮气沉沉 1680 | 暮色苍茫 1681 | 幕后 1682 | 木 1683 | 木雕泥塑 1684 | 木头木脑 1685 | 木讷 1686 | 目不识丁 1687 | 目光短浅 1688 | 目光如豆 1689 | 目光凶狠 1690 | 目空一切 1691 | 目无余子 1692 | 目中无人 1693 | 拿腔拿调 1694 | 拿腔作势 1695 | 奶声奶气 1696 | 男盗女娼 1697 | 难 1698 | 难吃 1699 | 难看 1700 | 难人 1701 | 难上加难 1702 | 难上难 1703 | 难说话 1704 | 难听 1705 | 难闻 1706 | 难相处 1707 | 难驯服 1708 | 难以 1709 | 难以沟通 1710 | 囊空如洗 1711 | 囊中羞涩 1712 | 闹 1713 | 闹得慌 1714 | 闹哄哄 1715 | 闹闹哄哄 1716 | 闹闹嚷嚷 1717 | 闹嚷嚷 1718 | 嫩 1719 | 泥沙俱下 1720 | 你死我活 1721 | 匿名 1722 | 腻 1723 | 腻人 1724 | 逆 1725 | 逆耳 1726 | 蔫不唧儿 1727 | 蔫儿坏 1728 | 蔫头耷脑 1729 | 拈轻怕重 1730 | 年久失修 1731 | 鸟尽弓藏 1732 | 狞 1733 | 狞恶 1734 | 凝滞 1735 | 泞 1736 | 牛 1737 | 牛气 1738 | 扭扭捏捏 1739 | 奴颜婢膝 1740 | 虐 1741 | 懦 1742 | 懦怯 1743 | 懦弱 1744 | 盘根错节 1745 | 盘陁 1746 | 庞杂 1747 | 旁若无人 1748 | 配不上 1749 | 蓬乱 1750 | 蓬散 1751 | 蓬首垢面 1752 | 蓬头垢面 1753 | 蓬头散发 1754 | 脾气暴 1755 | 脾气爆躁 1756 | 脾气坏 1757 | 脾气火暴 1758 | 脾气急躁 1759 | 皮 1760 | 皮毛 1761 | 皮相 1762 | 僻 1763 | 僻静 1764 | 偏 1765 | 偏激 1766 | 偏颇 1767 | 偏听偏信 1768 | 偏狭 1769 | 偏斜 1770 | 偏心 1771 | 偏心眼 1772 | 片断 1773 | 片面 1774 | 骗人 1775 | 漂浮 1776 | 贫 1777 | 贫寒 1778 | 贫苦 1779 | 贫困 1780 | 贫穷 1781 | 贫瘠 1782 | 平白 1783 | 平白无故 1784 | 平淡 1785 | 平淡无奇 1786 | 平淡无味 1787 | 平铺直叙 1788 | 平铺直序 1789 | 凭白无故 1790 | 凭空 1791 | 坡 1792 | 泼 1793 | 泼辣 1794 | 婆婆妈妈 1795 | 破 1796 | 破败 1797 | 破坏性 1798 | 破旧 1799 | 破烂不堪 1800 | 破陋 1801 | 扑朔迷离 1802 | 铺张 1803 | 铺张浪费 1804 | 欺诈性 1805 | 七零八落 1806 | 凄 1807 | 凄惨 1808 | 凄楚 1809 | 凄寒 1810 | 凄寂 1811 | 凄苦 1812 | 凄冷 1813 | 凄厉 1814 | 凄凉 1815 | 凄迷 1816 | 凄怆 1817 | 漆黑 1818 | 漆黑一团 1819 | 其貌不扬 1820 | 奇丑无比 1821 | 奇形怪状 1822 | 崎 1823 | 崎岖 1824 | 崎岖不平 1825 | 起绉 1826 | 起褶子 1827 | 岂有此理 1828 | 气粗 1829 | 气闷 1830 | 气盛 1831 | 气势汹汹 1832 | 气壮如牛 
1833 | 千变万化 1834 | 千疮百孔 1835 | 千金一掷 1836 | 千钧一发 1837 | 千篇一律 1838 | 前呼后拥 1839 | 潜 1840 | 浅 1841 | 浅薄 1842 | 浅尝辄止 1843 | 浅陋 1844 | 欠妥 1845 | 欠完善 1846 | 欠周到 1847 | 强 1848 | 强暴 1849 | 强横 1850 | 强行 1851 | 强制 1852 | 强制性 1853 | 巧 1854 | 巧黠 1855 | 翘尾巴 1856 | 峭 1857 | 峭直 1858 | 怯 1859 | 怯懦 1860 | 怯然 1861 | 怯弱 1862 | 怯生生 1863 | 窃 1864 | 禽兽不如 1865 | 轻 1866 | 轻薄 1867 | 轻淡 1868 | 轻浮 1869 | 轻贱 1870 | 轻狂 1871 | 轻率 1872 | 轻描淡写 1873 | 轻易 1874 | 轻佻 1875 | 倾斜 1876 | 清淡 1877 | 清高 1878 | 清寒 1879 | 清苦 1880 | 清冷 1881 | 清贫 1882 | 穷 1883 | 穷乏 1884 | 穷极潦倒 1885 | 穷苦 1886 | 穷困 1887 | 穷困潦倒 1888 | 穷奢极侈 1889 | 穷奢极欲 1890 | 穷酸 1891 | 穷途潦倒 1892 | 穷途末路 1893 | 穷凶极恶 1894 | 穷匮 1895 | 囚首垢面 1896 | 区区 1897 | 曲曲折折 1898 | 曲折 1899 | 屈才 1900 | 屈理 1901 | 犬牙交错 1902 | 缺 1903 | 缺德 1904 | 缺乏才智 1905 | 缺乏教养 1906 | 缺乏绅士风度 1907 | 缺乏幽默 1908 | 缺心眼 1909 | 缺心眼儿 1910 | 群魔乱舞 1911 | 攘攘 1912 | 扰扰 1913 | 绕脖子 1914 | 人不为己,天诛地灭 1915 | 人不知,鬼不觉 1916 | 人声鼎沸 1917 | 人声嘈杂 1918 | 人头攒动 1919 | 人为财死,鸟为食亡 1920 | 任重道远 1921 | 任纵 1922 | 认死理 1923 | 认死理儿 1924 | 冗 1925 | 冗长 1926 | 冗余 1927 | 冗赘 1928 | 柔弱 1929 | 肉 1930 | 肉了叭叽 1931 | 肉麻 1932 | 如临大敌 1933 | 如临深渊 1934 | 如履薄冰 1935 | 乳臭未干 1936 | 软 1937 | 软绵绵 1938 | 软弱 1939 | 软弱无力 1940 | 若明若暗 1941 | 若隐若现 1942 | 弱 1943 | 弱不禁风 1944 | 弱不胜衣 1945 | 弱势 1946 | 弱小 1947 | 弱智 1948 | 三天打鱼两天晒网 1949 | 散 1950 | 散乱 1951 | 散漫 1952 | 嗓子不好 1953 | 丧尽天良 1954 | 丧心病狂 1955 | 骚 1956 | 骚乱性 1957 | 色厉内荏 1958 | 色迷迷 1959 | 色情 1960 | 涩 1961 | 涩苦 1962 | 涩滞 1963 | 森 1964 | 杀气腾腾 1965 | 杀人不见血 1966 | 杀人不眨眼 1967 | 杀人如麻 1968 | 傻 1969 | 傻呵呵 1970 | 傻乎乎 1971 | 傻里瓜唧 1972 | 傻里傻气 1973 | 傻头傻脑 1974 | 山南海北 1975 | 山穷水尽 1976 | 闪烁 1977 | 伤风败俗 1978 | 伤脑筋 1979 | 伤天害理 1980 | 伤心惨目 1981 | 少不更事 1982 | 奢 1983 | 奢侈 1984 | 奢华 1985 | 奢靡 1986 | 奢糜 1987 | 蛇蝎心肠 1988 | 涉世不深 1989 | 身无分文 1990 | 深重 1991 | 神不知,鬼不觉 1992 | 神不知鬼不觉 1993 | 神秘 1994 | 神气活现 1995 | 神气十足 1996 | 神神秘秘 1997 | 神志委靡 1998 | 声名狼藉 1999 | 声色俱厉 2000 | 生 2001 | 生拉硬拽 2002 | 生涩 2003 | 生疏 2004 | 生硬 2005 | 盛气凌人 2006 | 剩余 2007 | 失常 2008 | 失当 2009 | 失检 2010 | 失礼 2011 | 失落 2012 | 失去理性 2013 | 失神 2014 | 失慎 2015 | 失实 2016 | 失宜 2017 | 十恶不赦 2018 | 十室九空 2019 | 什 2020 | 什锦 2021 | 食而不化 2022 | 食而不知其味 2023 | 食古不化 2024 | 实属不易 2025 | 使不得 2026 | 使人疲劳 2027 | 世故 2028 | 世情冷暖 2029 | 世俗 2030 | 世态炎凉 2031 | 誓不两立 2032 | 势不两立 2033 | 势利 2034 | 势利眼 2035 | 嗜杀成性 2036 | 嗜血 2037 | 嗜血成性 2038 | 恃才傲物 2039 | 手脚不干净 2040 | 手紧 2041 | 手生 2042 | 手头紧 2043 | 手无缚鸡之力 2044 | 守旧 2045 | 守株待兔 2046 | 瘦 2047 | 瘦弱 2048 | 输理 2049 | 疏忽 2050 | 疏懒 2051 | 疏松 2052 | 书生气 2053 | 鼠胆 2054 | 鼠目寸光 2055 | 数不上 2056 | 数不着 2057 | 衰弱 2058 | 衰颓 2059 | 水火不相容 2060 | 水泄不通 2061 | 水性杨花 2062 | 水中捞月 2063 | 瞬息万变 2064 | 说不过去 2065 | 说来话长 2066 | 斯文扫地 2067 | 私 2068 | 私底下 2069 | 私密 2070 | 私下 2071 | 私下里 2072 | 私自 2073 | 死 2074 | 死板 2075 | 死板板 2076 | 死沉沉 2077 | 死脑筋 2078 | 死气沉沉 2079 | 死去活来 2080 | 死死 2081 | 死心塌地 2082 | 死心眼 2083 | 死心眼儿 2084 | 死性 2085 | 死一般 2086 | 死硬 2087 | 死有余辜 2088 | 肆 2089 | 肆无忌惮 2090 | 肆意 2091 | 四大皆空 2092 | 四面楚歌 2093 | 似 2094 | 似乎 2095 | 似是而非 2096 | 松垮 2097 | 松垮垮 2098 | 松散 2099 | 松散散 2100 | 松松垮垮 2101 | 耸人听闻 2102 | 酥 2103 | 酥软 2104 | 酥松 2105 | 俗 2106 | 俗气 2107 | 素不相识 2108 | 素昧平生 2109 | 肃 2110 | 肃杀 2111 | 酸 2112 | 酸不溜丢 2113 | 酸臭 2114 | 酸刻 2115 | 酸溜溜 2116 | 酸涩 2117 | 随便 2118 | 随风倒 2119 | 随风使舵 2120 | 随风转舵 2121 | 随随便便 2122 | 随心所欲 2123 | 碎 2124 | 祟 2125 | 损 2126 | 损人利己 2127 | 琐 2128 | 琐碎 2129 | 琐细 2130 | 琐屑 2131 | 索 2132 | 索然 2133 | 索然乏味 2134 | 索然寡味 2135 | 索然无味 2136 | 所谓 2137 | 太随便 2138 | 太虚 2139 | 贪 2140 | 贪得无厌 2141 | 贪婪 2142 | 贪心 2143 | 贪心不足 2144 | 瘫软 2145 | 谈何容易 2146 | 唐突 2147 | 烫手 2148 | 淘 2149 | 淘气 2150 | 淘神 2151 | 讨厌 2152 | 特困 2153 | 特贫 2154 | 体力不支 2155 | 体弱 2156 | 体衰 2157 | 天昏地暗 2158 | 天南地北 2159 | 
天南海北 2160 | 天真 2161 | 恬淡 2162 | 腆 2163 | 挑逗性 2164 | 铁杆儿 2165 | 铁公鸡一毛不拔 2166 | 铁石心肠 2167 | 铁血 2168 | 听天由命 2169 | 偷 2170 | 偷工减料 2171 | 偷偷 2172 | 偷偷摸摸 2173 | 投机 2174 | 头脑空虚 2175 | 头痛 2176 | 秃 2177 | 徒 2178 | 徒劳 2179 | 徒劳无功 2180 | 徒劳无益 2181 | 徒然 2182 | 土 2183 | 土得掉渣 2184 | 土里土气 2185 | 土气 2186 | 土俗 2187 | 土头土脑 2188 | 兔死狗烹 2189 | 兔子不吃窝边草 2190 | 兔子尾巴长不了 2191 | 颓 2192 | 颓败 2193 | 颓废 2194 | 蜕化 2195 | 蜕化变质 2196 | 退化 2197 | 拖泥带水 2198 | 拖沓 2199 | 歪 2200 | 歪歪扭扭 2201 | 歪斜 2202 | 外面儿光 2203 | 外行 2204 | 顽 2205 | 顽钝 2206 | 顽梗 2207 | 顽固 2208 | 顽劣 2209 | 顽皮 2210 | 完全不重要 2211 | 万恶 2212 | 万花筒似 2213 | 万马齐喑 2214 | 万难 2215 | 枉 2216 | 枉费心机 2217 | 枉然 2218 | 望梅止渴 2219 | 忘恩负义 2220 | 忘情 2221 | 妄 2222 | 妄自尊大 2223 | 威 2224 | 威厉 2225 | 微不足道 2226 | 微贱 2227 | 微茫 2228 | 微末 2229 | 危殆 2230 | 危机四伏 2231 | 危机重重 2232 | 危急 2233 | 危如累卵 2234 | 危亡 2235 | 危险 2236 | 危在旦夕 2237 | 唯利是图 2238 | 唯我独尊 2239 | 惟利是图 2240 | 惟我独尊 2241 | 为人作嫁 2242 | 为人作嫁衣裳 2243 | 为所欲为 2244 | 萎靡不振 2245 | 委靡不振 2246 | 委琐 2247 | 伪 2248 | 伪善 2249 | 伪造 2250 | 未便 2251 | 未成熟 2252 | 未归类 2253 | 未揭露 2254 | 未老先衰 2255 | 未列计划 2256 | 未受过教育 2257 | 味道不好 2258 | 味同嚼蜡 2259 | 畏怯 2260 | 畏首畏尾 2261 | 文不对题 2262 | 文弱 2263 | 文恬武嬉 2264 | 紊 2265 | 紊乱 2266 | 问道于盲 2267 | 窝囊 2268 | 乌沉沉 2269 | 乌灯黑火 2270 | 乌洞洞 2271 | 乌七八糟 2272 | 乌漆墨黑 2273 | 乌涂 2274 | 乌托邦 2275 | 乌压压 2276 | 乌烟瘴气 2277 | 污 2278 | 污秽 2279 | 污七八糟 2280 | 污浊 2281 | 无伴 2282 | 无表情 2283 | 无补 2284 | 无补于事 2285 | 无常 2286 | 无诚意 2287 | 无道德观念 2288 | 无的放矢 2289 | 无动于衷 2290 | 无度 2291 | 无端 2292 | 无端端 2293 | 无法无天 2294 | 无根据 2295 | 无故 2296 | 无关大局 2297 | 无关宏旨 2298 | 无关紧要 2299 | 无关痛痒 2300 | 无光 2301 | 无光泽 2302 | 无规 2303 | 无涵养 2304 | 无稽 2305 | 无济于事 2306 | 无计划 2307 | 无记名 2308 | 无纪律 2309 | 无价值 2310 | 无教养 2311 | 无节制 2312 | 无可无不可 2313 | 无口才 2314 | 无赖 2315 | 无理 2316 | 无礼 2317 | 无力 2318 | 无聊 2319 | 无眉目 2320 | 无目的 2321 | 无能 2322 | 无能为力 2323 | 无凭无据 2324 | 无情 2325 | 无情无义 2326 | 无人过问 2327 | 无人问津 2328 | 无伤大雅 2329 | 无生气 2330 | 无实效 2331 | 无实质 2332 | 无所帮助 2333 | 无所不用其极 2334 | 无特色 2335 | 无望 2336 | 无味 2337 | 无谓 2338 | 无吸引力 2339 | 无限制 2340 | 无效 2341 | 无依无靠 2342 | 无意义 2343 | 无益 2344 | 无用 2345 | 无原则 2346 | 无缘无故 2347 | 无证据 2348 | 无知 2349 | 无中生有 2350 | 无助 2351 | 无足轻重 2352 | 芜 2353 | 芜杂 2354 | 武断 2355 | 雾里看花 2356 | 误 2357 | 误诊 2358 | 稀里糊涂 2359 | 稀松 2360 | 稀松平常 2361 | 喜新厌旧 2362 | 细 2363 | 细碎 2364 | 细小 2365 | 瞎 2366 | 下 2367 | 下乘 2368 | 下道儿 2369 | 下等 2370 | 下贱 2371 | 下流 2372 | 下品 2373 | 下三烂 2374 | 下三滥 2375 | 下作 2376 | 吓人 2377 | 纤弱 2378 | 险 2379 | 险毒 2380 | 险恶 2381 | 险峻 2382 | 险峭 2383 | 险象环生 2384 | 险要 2385 | 险诈 2386 | 险阻 2387 | 现行 2388 | 羡余 2389 | 香艳 2390 | 享乐 2391 | 向上倾斜 2392 | 向下倾斜 2393 | 象征性 2394 | 萧 2395 | 萧然 2396 | 萧瑟 2397 | 萧森 2398 | 萧疏 2399 | 萧索 2400 | 萧条 2401 | 萧飒 2402 | 嚣杂 2403 | 嚣张 2404 | 消极 2405 | 小 2406 | 小肚鸡肠 2407 | 小儿科 2408 | 小家子气 2409 | 小家子相 2410 | 小里小气 2411 | 小气 2412 | 小手小脚 2413 | 小小不言 2414 | 小心眼 2415 | 小心眼儿 2416 | 笑里藏刀 2417 | 效率很差 2418 | 携贰 2419 | 邪 2420 | 邪恶 2421 | 斜 2422 | 斜体 2423 | 斜歪 2424 | 卸磨杀驴 2425 | 懈怠 2426 | 辛 2427 | 辛苦 2428 | 辛酸 2429 | 辛辛苦苦 2430 | 心不在焉 2431 | 心粗 2432 | 心地狭窄 2433 | 心毒 2434 | 心浮 2435 | 心黑手辣 2436 | 心狠 2437 | 心狠手辣 2438 | 心口不一 2439 | 心切 2440 | 心如蛇蝎 2441 | 心如铁石 2442 | 心胸狭隘 2443 | 心胸狭窄 2444 | 心眼儿小 2445 | 心眼儿窄 2446 | 心眼小 2447 | 心猿意马 2448 | 星星点点 2449 | 腥 2450 | 腥臭 2451 | 腥臊 2452 | 腥膻 2453 | 形单影只 2454 | 形格势禁 2455 | 形同路人 2456 | 形同虚设 2457 | 形影相吊 2458 | 行不通 2459 | 行为不端 2460 | 行为不检 2461 | 性格内向 2462 | 性急 2463 | 性情急躁 2464 | 凶 2465 | 凶巴巴 2466 | 凶暴 2467 | 凶残 2468 | 凶毒 2469 | 凶恶 2470 | 凶悍 2471 | 凶狠 2472 | 凶横 2473 | 凶狂 2474 | 凶蛮 2475 | 凶猛 2476 | 凶煞 2477 | 凶顽 2478 | 凶险 2479 | 凶戾 2480 | 胸无城府 2481 | 胸无点墨 2482 | 熊 2483 | 虚 2484 | 虚诞 2485 | 虚浮 2486 | 虚幻 2487 | 虚假 
2488 | 虚空 2489 | 虚夸 2490 | 虚拟 2491 | 虚荣 2492 | 虚弱 2493 | 虚设 2494 | 虚妄 2495 | 虚伪 2496 | 虚无 2497 | 虚无飘渺 2498 | 虚无缥缈 2499 | 虚虚实实 2500 | 虚有其表 2501 | 虚诈 2502 | 絮 2503 | 絮叨 2504 | 絮聒 2505 | 喧 2506 | 喧天 2507 | 喧嚣 2508 | 喧杂 2509 | 喧噪 2510 | 悬 2511 | 悬乎 2512 | 悬空 2513 | 玄 2514 | 学究气 2515 | 学识浅薄 2516 | 学识谫陋 2517 | 雪上加霜 2518 | 血淋淋 2519 | 血腥 2520 | 血雨腥风 2521 | 牙碜 2522 | 亚 2523 | 烟雾弥漫 2524 | 烟雾腾腾 2525 | 严 2526 | 严加 2527 | 严峻 2528 | 严苛 2529 | 严酷 2530 | 严冷 2531 | 严厉 2532 | 严肃 2533 | 严重 2534 | 言不由衷 2535 | 言之无物 2536 | 眼巴巴 2537 | 眼光短浅 2538 | 眼皮子高 2539 | 眼皮子浅 2540 | 眼生 2541 | 衍 2542 | 扬长 2543 | 羊质虎皮 2544 | 阳奉阴违 2545 | 妖 2546 | 妖里妖气 2547 | 摇摇晃晃 2548 | 摇摇欲坠 2549 | 要不得 2550 | 野 2551 | 野鸡 2552 | 野蛮 2553 | 叶公好龙 2554 | 夜郎自大 2555 | 一把死拿 2556 | 一暴十寒 2557 | 一波三折 2558 | 一不小心 2559 | 一场空 2560 | 一成不变 2561 | 一触即溃 2562 | 一发千钧 2563 | 一锅粥 2564 | 一脸横肉 2565 | 一脸稚气 2566 | 一毛不拔 2567 | 一偏 2568 | 一贫如洗 2569 | 一钱不值 2570 | 一仍旧贯 2571 | 一手遮天 2572 | 一塌糊涂 2573 | 一团乱麻 2574 | 一团漆黑 2575 | 一团糟 2576 | 一文不名 2577 | 一文不值 2578 | 一窝蜂 2579 | 一无可取 2580 | 一无是处 2581 | 一无所长 2582 | 一无所有 2583 | 一言堂 2584 | 一掷千金 2585 | 依违 2586 | 依稀 2587 | 衣衫不整 2588 | 颐指气使 2589 | 疑难 2590 | 倚老卖老 2591 | 以怨报德 2592 | 易变 2593 | 易怒 2594 | 臆 2595 | 意马心猿 2596 | 义正词严 2597 | 溢价 2598 | 异常 2599 | 异形 2600 | 荫 2601 | 因循守旧 2602 | 殷 2603 | 殷切 2604 | 阴 2605 | 阴暗 2606 | 阴沉 2607 | 阴沉沉 2608 | 阴毒 2609 | 阴恶 2610 | 阴晦 2611 | 阴冷 2612 | 阴凄 2613 | 阴森 2614 | 阴森森 2615 | 阴损 2616 | 阴险 2617 | 阴险毒辣 2618 | 阴性 2619 | 阴阳怪气 2620 | 淫 2621 | 淫荡 2622 | 淫秽 2623 | 淫贱 2624 | 淫乱 2625 | 淫靡 2626 | 淫邪 2627 | 淫逸 2628 | 淫亵 2629 | 淫猥 2630 | 引起反感 2631 | 隐 2632 | 隐晦 2633 | 隐秘 2634 | 隐然 2635 | 隐身 2636 | 隐形 2637 | 隐性 2638 | 隐隐 2639 | 隐隐绰绰 2640 | 隐隐约约 2641 | 隐约 2642 | 应名儿 2643 | 影影绰绰 2644 | 硬 2645 | 硬气 2646 | 硬生生 2647 | 硬性 2648 | 拥挤 2649 | 拥挤不堪 2650 | 庸 2651 | 庸碌 2652 | 庸俗 2653 | 庸庸碌碌 2654 | 用不着 2655 | 幽 2656 | 幽暗 2657 | 幽晦 2658 | 幽幽 2659 | 幽冥 2660 | 幽黯 2661 | 优柔 2662 | 优柔寡断 2663 | 悠谬 2664 | 犹犹豫豫 2665 | 犹豫不决 2666 | 犹豫不前 2667 | 油 2668 | 油乎乎 2669 | 油滑 2670 | 油腻 2671 | 油腻腻 2672 | 油头滑脑 2673 | 油汪汪 2674 | 油脂麻花 2675 | 油渍渍 2676 | 有碍观瞻 2677 | 有弊 2678 | 有点旧 2679 | 有毒 2680 | 有毒性 2681 | 有害 2682 | 有名无实 2683 | 有难度 2684 | 有伤风化 2685 | 有失检点 2686 | 有失偏颇 2687 | 有失身分 2688 | 有始无终 2689 | 有恃无恐 2690 | 有头无尾 2691 | 有一搭没一搭 2692 | 有义务 2693 | 有罪 2694 | 幼稚 2695 | 迂 2696 | 迂腐 2697 | 迂阔 2698 | 迂拙 2699 | 于事无补 2700 | 愚 2701 | 愚笨 2702 | 愚不可及 2703 | 愚痴 2704 | 愚蠢 2705 | 愚钝 2706 | 愚陋 2707 | 愚鲁 2708 | 愚昧 2709 | 愚昧无知 2710 | 愚蒙 2711 | 愚傻 2712 | 愚顽 2713 | 愚妄 2714 | 愚拙 2715 | 余剩 2716 | 逾分 2717 | 鱼龙混杂 2718 | 鱼游釜中 2719 | 与虎谋皮 2720 | 与世隔绝 2721 | 与世无争 2722 | 语无伦次 2723 | 语焉不详 2724 | 羽毛未丰 2725 | 欲壑难填 2726 | 原始 2727 | 圆 2728 | 圆滑 2729 | 约略 2730 | 越轨 2731 | 越礼 2732 | 云遮雾障 2733 | 云谲波诡 2734 | 蕴藉 2735 | 晕头转向 2736 | 杂 2737 | 杂草丛生 2738 | 杂乱 2739 | 杂乱无章 2740 | 杂牌 2741 | 杂七杂八 2742 | 杂遝 2743 | 杂沓 2744 | 灾难性 2745 | 在困难中 2746 | 脏 2747 | 脏乎乎 2748 | 脏乱 2749 | 脏乱差 2750 | 脏兮兮 2751 | 糟 2752 | 糟糕 2753 | 凿空 2754 | 凿死理儿 2755 | 躁 2756 | 躁急 2757 | 躁狂 2758 | 造次 2759 | 造作 2760 | 贼 2761 | 贼溜溜 2762 | 贼眉鼠眼 2763 | 贼去关门 2764 | 贼头贼脑 2765 | 扎手 2766 | 轧 2767 | 窄 2768 | 张冠李戴 2769 | 张狂 2770 | 招致不幸 2771 | 照本宣科 2772 | 狰狞 2773 | 正色 2774 | 正颜厉色 2775 | 枝蔓 2776 | 支离 2777 | 支离破碎 2778 | 直呆呆 2779 | 直瞪瞪 2780 | 直盯盯 2781 | 直勾勾 2782 | 直愣愣 2783 | 执迷不悟 2784 | 执拗 2785 | 趾高气扬 2786 | 只顾自身利益 2787 | 只听楼梯响,不见人下来 2788 | 纸上谈兵 2789 | 纸醉金迷 2790 | 志大才疏 2791 | 智障 2792 | 稚气 2793 | 质次价高 2794 | 质量差 2795 | 滞 2796 | 滞背 2797 | 滞钝 2798 | 滞涩 2799 | 滞销 2800 | 窒闷 2801 | 重 2802 | 重沓 2803 | 众叛亲离 2804 | 皱 2805 | 皱巴 2806 | 皱巴巴 2807 | 皱皱巴巴 2808 | 竹篮打水 2809 | 竹篮子打水 2810 | 竹篮子打水一场空 2811 | 煮豆燃萁 2812 | 主观 2813 | 主观上 2814 | 讆 2815 | 专横 2816 | 专横跋扈 2817 | 
专制 2818 | 转移性 2819 | 装备不良 2820 | 装模作样 2821 | 装腔 2822 | 装腔作势 2823 | 装相 2824 | 装样子 2825 | 赘 2826 | 赘余 2827 | 捉襟见肘 2828 | 捉摸不定 2829 | 拙 2830 | 拙笨 2831 | 拙劣 2832 | 着三不着两 2833 | 浊 2834 | 子虚 2835 | 子虚乌有 2836 | 自傲 2837 | 自大 2838 | 自负 2839 | 自高自大 2840 | 自豪 2841 | 自命不凡 2842 | 自命清高 2843 | 自恃 2844 | 自私 2845 | 自私自利 2846 | 自相矛盾 2847 | 纵恣 2848 | 走神 2849 | 走油 2850 | 嘴尖 2851 | 醉翁之意不在酒 2852 | 最差 2853 | 最坏 2854 | 罪不容诛 2855 | 罪大恶极 2856 | 罪恶 2857 | 罪恶多端 2858 | 罪恶深重 2859 | 罪恶滔天 2860 | 罪恶昭彰 2861 | 罪恶昭著 2862 | 罪该万死 2863 | 罪孽深重 2864 | 左 2865 | 做作 2866 | 作势 2867 | 坐而论道 2868 | 坐井观天 2869 | 兀突 2870 | 孬 2871 | 噩 2872 | 卮 2873 | 孛 2874 | 啬 2875 | 啬刻 2876 | 厝火积薪 2877 | 赝 2878 | 剌 2879 | 剌戾 2880 | 剽悍 2881 | 罔 2882 | 伧 2883 | 伧俗 2884 | 佶屈聱牙 2885 | 侉 2886 | 佻 2887 | 俚 2888 | 俚俗 2889 | 倜然 2890 | 倥 2891 | 倥侗 2892 | 倥偬 2893 | 倨 2894 | 倨傲 2895 | 傥 2896 | 僭 2897 | 儇 2898 | 巽 2899 | 亵 2900 | 羸弱 2901 | 跅弛 2902 | 跅驰 2903 | 冥 2904 | 冥顽 2905 | 冥顽不化 2906 | 冥顽不灵 2907 | 冥冥 2908 | 讷 2909 | 诎 2910 | 诘屈聱牙 2911 | 谫 2912 | 谫陋 2913 | 谲 2914 | 谲诈 2915 | 阢 2916 | 阽 2917 | 刍 2918 | 堙 2919 | 艽 2920 | 芴 2921 | 苴 2922 | 茕 2923 | 茕茕 2924 | 茕茕孑立 2925 | 茕茕孑立,形影相吊 2926 | 荏 2927 | 荏弱 2928 | 萋迷 2929 | 迍邅 2930 | 瞢 2931 | 拗 2932 | 拮据 2933 | 吆三喝四 2934 | 吆五喝六 2935 | 咄咄逼人 2936 | 哙 2937 | 哝 2938 | 啷 2939 | 嗲 2940 | 嗲声嗲气 2941 | 嘈 2942 | 嘈杂 2943 | 嘀里嘟噜 2944 | 岌岌 2945 | 岌岌不可终日 2946 | 岌岌可危 2947 | 嶙峋 2948 | 嶙嶙 2949 | 犷 2950 | 犷悍 2951 | 狃 2952 | 狎 2953 | 狎昵 2954 | 狯 2955 | 狷 2956 | 狷急 2957 | 猥 2958 | 猥鄙 2959 | 猥贱 2960 | 猥劣 2961 | 猥陋 2962 | 猥琐 2963 | 猥亵 2964 | 獐头鼠目 2965 | 獠 2966 | 舛 2967 | 馀 2968 | 廪 2969 | 忉 2970 | 忮 2971 | 忸忸怩怩 2972 | 忸怩作态 2973 | 怊 2974 | 恹 2975 | 恹恹 2976 | 悖 2977 | 悖晦 2978 | 悖谬 2979 | 悖逆 2980 | 悖妄 2981 | 悭 2982 | 悭吝 2983 | 悱 2984 | 愦 2985 | 愣 2986 | 愣头愣脑 2987 | 愀然 2988 | 愎 2989 | 慵懒 2990 | 懵 2991 | 懵懂 2992 | 懵里懵懂 2993 | 懵懵懂懂 2994 | 阙陋 2995 | 阙略 2996 | 湎 2997 | 溲 2998 | 溷浊 2999 | 滂 3000 | 澹然 3001 | 蹇 3002 | 遴 3003 | 邋里邋遢 3004 | 邋遢 3005 | 邋邋遢遢 3006 | 孱 3007 | 孱弱 3008 | 羼 3009 | 娆 3010 | 媸 3011 | 孑 3012 | 孑然 3013 | 孑然一身 3014 | 孑身 3015 | 驽 3016 | 驽钝 3017 | 骈枝 3018 | 绌 3019 | 缈 3020 | 缛 3021 | 缥缈 3022 | 缭乱 3023 | 幺麽 3024 | 杌 3025 | 桀 3026 | 桀骜不驯 3027 | 棼 3028 | 槁 3029 | 轫 3030 | 辁 3031 | 暧 3032 | 暧昧 3033 | 暝 3034 | 犟 3035 | 毵毵 3036 | 虢 3037 | 朦 3038 | 朦胧 3039 | 朦朦胧胧 3040 | 臊 3041 | 膻 3042 | 膻气 3043 | 膻腥 3044 | 熹微 3045 | 戾 3046 | 恝 3047 | 恝然 3048 | 恣 3049 | 恣肆 3050 | 恣意 3051 | 憝 3052 | 戆 3053 | 戆头戆脑 3054 | 沓 3055 | 硗 3056 | 硗薄 3057 | 硗瘠 3058 | 碜 3059 | 睨 3060 | 瞀 3061 | 瞑 3062 | 瞽 3063 | 锱铢必较 3064 | 鸷 3065 | 鸷悍 3066 | 疣赘 3067 | 瘠 3068 | 瘠薄 3069 | 癃 3070 | 癫狂 3071 | 窈冥 3072 | 窭 3073 | 窳 3074 | 窳败 3075 | 窳惰 3076 | 窳劣 3077 | 褊 3078 | 褊急 3079 | 褊狭 3080 | 褶 3081 | 矜 3082 | 矜持 3083 | 矜夸 3084 | 聒 3085 | 聒噪 3086 | 颛 3087 | 颛蒙 3088 | 颟顸 3089 | 蚍蜉撼大树 3090 | 蚍蜉撼树 3091 | 蚩 3092 | 蜻蜓点水 3093 | 笃 3094 | 箪食瓢饮 3095 | 趄 3096 | 蹩脚 3097 | 霭霭 3098 | 龃 3099 | 龃龉 3100 | 龉 3101 | 龌 3102 | 龌龊 3103 | 鲰 3104 | 饕 3105 | 黝 3106 | 黝暗 3107 | 黝黯 3108 | 黠 3109 | 黠慧 3110 | 黢 3111 | 黢黑 3112 | 黩 3113 | 黪 3114 | 黯 3115 | 黯淡 3116 | 黯然 3117 | -------------------------------------------------------------------------------- /corpus/Hownet/url.txt: -------------------------------------------------------------------------------- 1 | 数据标题:Hownet情感词语集 2 | 数据网址:http://www.datatang.com/datares/go.aspx?dataid=603399 3 | -------------------------------------------------------------------------------- /corpus/Tsinghua/readme.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/corpus/Tsinghua/readme.txt -------------------------------------------------------------------------------- /corpus/Tsinghua/url.txt: -------------------------------------------------------------------------------- 1 | 数据标题:中文褒贬义词典v1.0(清华大学李军) 2 | 数据网址:http://www.datatang.com/datares/go.aspx?dataid=618405 3 | -------------------------------------------------------------------------------- /data/NLPCC_2016_Stance_Detection_Task_A_Unknown.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/data/NLPCC_2016_Stance_Detection_Task_A_Unknown.txt -------------------------------------------------------------------------------- /data/NLPCC_2016_Stance_Detection_Task_A_gold.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/data/NLPCC_2016_Stance_Detection_Task_A_gold.txt --------------------------------------------------------------------------------
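Two usage notes on the files above, both illustrative rather than authoritative.

First, the -sentence-vectors option parsed near the end of word2vec.c is what enables sentence/paragraph vector training: per its own usage text, turning it on (e.g. adding "-sentence-vectors 1" to the example command printed in the help) trains a per-sentence token with the full sentence context instead of just the window.

Second, the Hownet lexicons are plain one-word-per-line files (as shown above), which makes it straightforward to derive simple opinion-count features for a segmented Weibo text. The Java sketch below is a minimal, hypothetical illustration of that idea; the class name, the UTF-8 encoding, and the hard-coded sample input are assumptions of this sketch, not the repository's actual opinion-feature code (note also that the Tsinghua lists are GB-encoded, per their filenames, and would need a GBK charset instead).

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch (not the repository's code): count lexicon hits in a segmented text.
public class OpinionFeatureSketch {

    // Load a one-word-per-line lexicon into a set, skipping blank lines.
    static Set<String> loadLexicon(String path) throws IOException {
        Set<String> words = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            String w = line.trim();
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Paths follow the repository layout; UTF-8 is an assumption of this sketch.
        Set<String> pos = loadLexicon("corpus/Hownet/pos_opinion.txt");
        Set<String> neg = loadLexicon("corpus/Hownet/neg_opinion.txt");

        // A Weibo text after word segmentation, tokens separated by spaces (sample input).
        String segmented = "这 种 做法 荒谬 而且 粗暴";

        int posCount = 0, negCount = 0;
        for (String token : segmented.split("\\s+")) {
            if (pos.contains(token)) posCount++;
            if (neg.contains(token)) negCount++;
        }
        // Both 荒谬 and 粗暴 appear in neg_opinion.txt above, so this prints pos=0 neg=2.
        System.out.printf("pos=%d neg=%d%n", posCount, negCount);
    }
}

Such counts could then be appended to the document's feature vector before classifier training; how the original pipeline actually combines them is not shown in this dump.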