├── NLPCC2016-Ensemble of Feature Sets and Classification Methods for Stance Detection.pdf ├── README.md ├── code ├── ChiSquare │ └── README.md ├── Ensemble_models_matlab │ ├── TaskA_Ensemble_models.m │ ├── main_TaskA_Ensemble_models.m │ ├── normalize.m │ ├── predict.mexw64 │ ├── svmpredict.mexw64 │ ├── svmtrain.mexw64 │ ├── tf_idf.m │ └── train.mexw64 ├── Feature_process_java │ ├── lib │ │ ├── split.jar │ │ ├── tree-split-word-1.1.jar │ │ └── typetrans.jar │ ├── library │ │ ├── stopDict.dic │ │ └── userLibrary │ │ │ └── userLibrary.dic │ └── src │ │ ├── DataPreprocess │ │ ├── Step01_01_TaskA_DataInfoStat.java │ │ ├── Step02_01_TaskA_GenDict.java │ │ ├── Step02_01_TaskA_GenVSM.java │ │ ├── Step04_01_TaskA_ChiSquare.java │ │ ├── Step04_02_TaskA_ChiSquare_Rank.java │ │ ├── Step04_03_TaskA_ChiSquare_VSM.java │ │ ├── Step05_01_TaskA_Opinion.java │ │ └── Step09_01_ResultSubmit.java │ │ └── Tools │ │ ├── CharacterAnalyzer.java │ │ ├── ConvertUnicode.java │ │ ├── PreProcessText.java │ │ ├── StringAnalyzer.java │ │ └── WordSegment_Ansj.java ├── Feature_ranking_matlab │ ├── TaskA_Feature_ranking.m │ ├── main_TaskA_Feature_ranking.m │ ├── normalize.m │ ├── predict.mexw64 │ ├── svmpredict.mexw64 │ ├── svmtrain.mexw64 │ ├── tf_idf.m │ └── train.mexw64 ├── LDA │ ├── README.md │ └── jgibblda.jar ├── LSA_and_LPI_and_LE_matlab │ ├── LE │ │ └── LapEig.m │ ├── LPI │ │ ├── LGE.m │ │ ├── lpp.m │ │ └── mySVD.m │ ├── LSA │ │ └── LSA.m │ ├── Step_03_main_nlpcc2016.m │ ├── Step_03_nlpcc2016.m │ ├── Step_04_main_nlpcc2016.m │ ├── Step_04_nlpcc2016.m │ └── tools │ │ ├── EuDist2.m │ │ ├── constructW.m │ │ ├── normalize.m │ │ └── tf_idf.m └── Para2vec │ ├── README.md │ ├── go_paravec.sh │ └── word2vec.c ├── corpus ├── Hownet │ ├── neg_opinion.txt │ ├── pos_opinion.txt │ └── url.txt └── Tsinghua │ ├── readme.txt │ ├── tsinghua.negative.gb.txt │ ├── tsinghua.positive.gb.txt │ └── url.txt └── data ├── NLPCC2016_Stance_Detection_Task_A_Testdata.txt ├── NLPCC_2016_Stance_Detection_Task_A_Unknown.txt ├── NLPCC_2016_Stance_Detection_Task_A_gold.txt ├── evasampledata4-TaskAA.txt └── evasampledata4-TaskAR.txt /NLPCC2016-Ensemble of Feature Sets and Classification Methods for Stance Detection.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/NLPCC2016-Ensemble of Feature Sets and Classification Methods for Stance Detection.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # README # 2 | 3 | This is our solution for the [NLPCC2016 shared task: Detecting Stance in Chinese Weibo (Task A)](http://tcci.ccf.org.cn/conference/2016/pages/page05_CFPTasks.html). 4 | 5 | @inproceedings{xu2016ensemble, 6 | title={Ensemble of Feature Sets and Classification Methods for Stance Detection}, 7 | author={Xu, Jiaming and Zheng, Suncong and Shi, Jing and Yao, Yiqun and Xu, Bo}, 8 | booktitle={Natural Language Processing and Chinese Computing (NLPCC)}, 9 | year={2016}, 10 | publisher={Springer} 11 | } 12 | 13 | - This is a supervised task over five targets. For each target, 600 labeled Weibo texts, 600 unlabeled Weibo texts and 3,000 test Weibo texts are provided. The task is to detect the author's stance towards the given target: FAVOR, AGAINST or NONE.
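For reference, the preprocessing code (`Step01_01_TaskA_DataInfoStat.java` below) parses each line of the labeled file `evasampledata4-TaskAA.txt` (after a header line) as four tab-separated fields, ID, TARGET, TEXT and STANCE, with STANCE one of FAVOR, AGAINST or NONE. An invented illustrative line (not a real corpus entry):

        ID<TAB>TARGET<TAB>TEXT<TAB>STANCE
        42	春节放鞭炮	过年不放鞭炮,哪里还有年味?	FAVOR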
14 | 15 | We present an ensemble framework that integrates various feature sets, such as Paragraph Vector (Para2vec) [1], Latent Dirichlet Allocation (LDA) [2], Latent Semantic Analysis (LSA) [3], Laplacian Eigenmaps (LE) [4] and Locality Preserving Indexing (LPI) [5], with various classification methods, such as Random Forest (RF) [6], Linear Support Vector Machines (SVM-Linear) [7], SVM with RBF Kernel (SVM-RBF) [8] and AdaBoost [9]. 16 | 17 | The official results show that the solution of our team "CBrain" achieves one 1st place and one 2nd place among the five targets, and ranks 4th overall out of 16 teams with an F1 score of 0.6856. 18 | 19 | **Team members**: [Jiaming Xu](http://jacoxu.com/?page_id=2), Suncong Zheng, Jing Shi, Yiqun Yao, Bo Xu. 20 | 21 | Please feel free to email me (*jacoxu@msn.com*) if you have any questions. 22 | 23 | [1]. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: ICML. vol. 14, pp. 1188-1196 (2014) 24 | [2]. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993-1022 (2003) 25 | [3]. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. JASIS 41(6), 391 (1990) 26 | [4]. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373-1396 (2003) 27 | [5]. He, X., Cai, D., Liu, H., Ma, W.Y.: Locality preserving indexing for document representation. In: SIGIR. pp. 96-103. ACM (2004) 28 | [6]. Breiman, L.: Random forests. Machine Learning 45(1), 5-32 (2001) 29 | [7]. Joachims, T.: Learning to classify text using support vector machines: Methods, theory and algorithms. Kluwer Academic Publishers (2002) 30 | [8]. Scholkopf, B., Smola, A.J.: Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press (2002) 31 | [9]. Freund, Y., Schapire, R.E., et al.: Experiments with a new boosting algorithm. In: ICML. vol. 96, pp. 148-156 (1996) 32 | -------------------------------------------------------------------------------- /code/ChiSquare/README.md: -------------------------------------------------------------------------------- 1 | # Chi-Squared Test 2 | This project provides a text feature selection method based on the chi-squared test.
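The `Step04_*` classes in this repository consume the per-(stance, word) scores such a tool writes to `output_chi2_<topic>.tst` and keep the top-ranked words for each class. As a minimal sketch of the statistic itself, computed from a 2x2 contingency table of document counts (illustrative only, not the linked project's implementation):

    // One-vs-rest chi-squared score of a single word for a single stance class.
    // a = docs of the class containing the word, b = docs of other classes containing it,
    // c = docs of the class lacking the word,    d = docs of other classes lacking it.
    public class Chi2Sketch {
        static double chi2(double a, double b, double c, double d) {
            double n = a + b + c + d;
            double denom = (a + b) * (c + d) * (a + c) * (b + d);
            return denom == 0.0 ? 0.0 : n * Math.pow(a * d - b * c, 2) / denom;
        }
        public static void main(String[] args) {
            // e.g. a word appearing in 30 FAVOR and 5 non-FAVOR texts,
            // absent from 170 FAVOR and 395 non-FAVOR texts
            System.out.println(chi2(30, 5, 170, 395)); // larger = stronger association
        }
    }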
3 | 4 | Please find the project at the following url: 5 | 6 | https://github.com/kn45/Chi-Square 7 | -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/TaskA_Ensemble_models.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/TaskA_Ensemble_models.m -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/main_TaskA_Ensemble_models.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/main_TaskA_Ensemble_models.m -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/normalize.m: -------------------------------------------------------------------------------- 1 | function Xn = normalize(X) 2 | % Normalize all feature vectors to unit length 3 | 4 | n = size(X,1); % the number of documents 5 | Xt = X'; 6 | l = sqrt(sum(Xt.^2)); % the row vector length (L2 norm) 7 | Ni = sparse(1:n,1:n,l); 8 | Ni(Ni>0) = 1./Ni(Ni>0); 9 | Xn = (Xt*Ni)'; 10 | 11 | end 12 | -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/predict.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/predict.mexw64 -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/svmpredict.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/svmpredict.mexw64 -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/svmtrain.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/svmtrain.mexw64 -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/tf_idf.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/tf_idf.m -------------------------------------------------------------------------------- /code/Ensemble_models_matlab/train.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Ensemble_models_matlab/train.mexw64 -------------------------------------------------------------------------------- /code/Feature_process_java/lib/split.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_process_java/lib/split.jar 
-------------------------------------------------------------------------------- /code/Feature_process_java/lib/tree-split-word-1.1.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_process_java/lib/tree-split-word-1.1.jar -------------------------------------------------------------------------------- /code/Feature_process_java/lib/typetrans.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_process_java/lib/typetrans.jar -------------------------------------------------------------------------------- /code/Feature_process_java/library/stopDict.dic: -------------------------------------------------------------------------------- 1 | 2 | , 3 | ? 4 | 、 5 | 。 6 | “ 7 | ” 8 | 《 9 | 》 10 | ! 11 | , 12 | : 13 | ; 14 | ? 15 | 】 16 | 【 17 | 人民 18 | 啊 19 | 阿 20 | 哎 21 | 哎呀 22 | 哎哟 23 | 唉 24 | 俺 25 | 俺们 26 | 按 27 | 按照 28 | 吧 29 | 吧哒 30 | 把 31 | 罢了 32 | 被 33 | 本 34 | 本着 35 | 比 36 | 比方 37 | 比如 38 | 鄙人 39 | 彼 40 | 彼此 41 | 边 42 | 别 43 | 别的 44 | 别说 45 | 并 46 | 并且 47 | 不比 48 | 不成 49 | 不单 50 | 不但 51 | 不独 52 | 不管 53 | 不光 54 | 不过 55 | 不仅 56 | 不拘 57 | 不论 58 | 不怕 59 | 不然 60 | 不如 61 | 不特 62 | 不惟 63 | 不问 64 | 不只 65 | 朝 66 | 朝着 67 | 趁 68 | 趁着 69 | 乘 70 | 冲 71 | 除 72 | 除此之外 73 | 除非 74 | 除了 75 | 此 76 | 此间 77 | 此外 78 | 从 79 | 从而 80 | 打 81 | 待 82 | 但 83 | 但是 84 | 当 85 | 当着 86 | 到 87 | 得 88 | 的 89 | 的话 90 | 等 91 | 等等 92 | 地 93 | 第 94 | 叮咚 95 | 对 96 | 对于 97 | 多 98 | 多少 99 | 而 100 | 而况 101 | 而且 102 | 而是 103 | 而外 104 | 而言 105 | 而已 106 | 尔后 107 | 反过来 108 | 反过来说 109 | 反之 110 | 非但 111 | 非徒 112 | 否则 113 | 嘎 114 | 嘎登 115 | 该 116 | 赶 117 | 个 118 | 各 119 | 各个 120 | 各位 121 | 各种 122 | 各自 123 | 给 124 | 根据 125 | 跟 126 | 故 127 | 故此 128 | 固然 129 | 关于 130 | 管 131 | 归 132 | 果然 133 | 果真 134 | 过 135 | 哈 136 | 哈哈 137 | 呵 138 | 和 139 | 何 140 | 何处 141 | 何况 142 | 何时 143 | 嘿 144 | 哼 145 | 哼唷 146 | 呼哧 147 | 乎 148 | 哗 149 | 还是 150 | 还有 151 | 换句话说 152 | 换言之 153 | 或 154 | 或是 155 | 或者 156 | 极了 157 | 及 158 | 及其 159 | 及至 160 | 即 161 | 即便 162 | 即或 163 | 即令 164 | 即若 165 | 即使 166 | 几 167 | 几时 168 | 己 169 | 既 170 | 既然 171 | 既是 172 | 继而 173 | 加之 174 | 假如 175 | 假若 176 | 假使 177 | 鉴于 178 | 将 179 | 较 180 | 较之 181 | 叫 182 | 接着 183 | 结果 184 | 借 185 | 紧接着 186 | 进而 187 | 尽 188 | 尽管 189 | 经 190 | 经过 191 | 就 192 | 就是 193 | 就是说 194 | 据 195 | 具体地说 196 | 具体说来 197 | 开始 198 | 开外 199 | 靠 200 | 咳 201 | 可 202 | 可见 203 | 可是 204 | 可以 205 | 况且 206 | 啦 207 | 来 208 | 来着 209 | 离 210 | 例如 211 | 哩 212 | 连 213 | 连同 214 | 两者 215 | 了 216 | 临 217 | 另 218 | 另外 219 | 另一方面 220 | 论 221 | 嘛 222 | 吗 223 | 慢说 224 | 漫说 225 | 冒 226 | 么 227 | 每 228 | 每当 229 | 们 230 | 莫若 231 | 某 232 | 某个 233 | 某些 234 | 拿 235 | 哪 236 | 哪边 237 | 哪儿 238 | 哪个 239 | 哪里 240 | 哪年 241 | 哪怕 242 | 哪天 243 | 哪些 244 | 哪样 245 | 那 246 | 那边 247 | 那儿 248 | 那个 249 | 那会儿 250 | 那里 251 | 那么 252 | 那么些 253 | 那么样 254 | 那时 255 | 那些 256 | 那样 257 | 乃 258 | 乃至 259 | 呢 260 | 能 261 | 你 262 | 你们 263 | 您 264 | 宁 265 | 宁可 266 | 宁肯 267 | 宁愿 268 | 哦 269 | 呕 270 | 啪达 271 | 旁人 272 | 呸 273 | 凭 274 | 凭借 275 | 其 276 | 其次 277 | 其二 278 | 其他 279 | 其它 280 | 其一 281 | 其余 282 | 其中 283 | 起 284 | 起见 285 | 岂但 286 | 恰恰相反 287 | 前后 288 | 前者 289 | 且 290 | 然而 291 | 然后 292 | 然则 293 | 让 294 | 人家 295 | 任 296 | 任何 297 | 任凭 298 | 如 299 | 如此 300 | 如果 301 | 如何 302 | 如其 303 | 如若 304 | 如上所述 305 | 若 306 | 若非 307 | 若是 308 | 啥 309 | 上下 310 | 尚且 311 | 设若 312 | 设使 313 
| 甚而 314 | 甚么 315 | 甚至 316 | 省得 317 | 时候 318 | 什么 319 | 什么样 320 | 使得 321 | 是 322 | 是的 323 | 首先 324 | 谁 325 | 谁知 326 | 顺 327 | 顺着 328 | 似的 329 | 虽 330 | 虽然 331 | 虽说 332 | 虽则 333 | 随 334 | 随着 335 | 所 336 | 所以 337 | 他 338 | 他们 339 | 他人 340 | 它 341 | 它们 342 | 她 343 | 她们 344 | 倘 345 | 倘或 346 | 倘然 347 | 倘若 348 | 倘使 349 | 腾 350 | 替 351 | 通过 352 | 同 353 | 同时 354 | 哇 355 | 万一 356 | 往 357 | 望 358 | 为 359 | 为何 360 | 为了 361 | 为什么 362 | 为着 363 | 喂 364 | 嗡嗡 365 | 我 366 | 我们 367 | 呜 368 | 呜呼 369 | 乌乎 370 | 无论 371 | 无宁 372 | 毋宁 373 | 嘻 374 | 吓 375 | 相对而言 376 | 像 377 | 向 378 | 向着 379 | 嘘 380 | 呀 381 | 焉 382 | 沿 383 | 沿着 384 | 要 385 | 要不 386 | 要不然 387 | 要不是 388 | 要么 389 | 要是 390 | 也 391 | 也罢 392 | 也好 393 | 一 394 | 一般 395 | 一旦 396 | 一方面 397 | 一来 398 | 一切 399 | 一样 400 | 一则 401 | 一个 402 | 依 403 | 依照 404 | 矣 405 | 以 406 | 以便 407 | 以及 408 | 以免 409 | 以至 410 | 以至于 411 | 以致 412 | 抑或 413 | 因 414 | 因此 415 | 因而 416 | 因为 417 | 哟 418 | 用 419 | 由 420 | 由此可见 421 | 由于 422 | 有 423 | 有的 424 | 有关 425 | 有些 426 | 又 427 | 于 428 | 于是 429 | 于是乎 430 | 与 431 | 与此同时 432 | 与否 433 | 与其 434 | 越是 435 | 云云 436 | 哉 437 | 再说 438 | 再者 439 | 在 440 | 在下 441 | 咱 442 | 咱们 443 | 则 444 | 怎 445 | 怎么 446 | 怎么办 447 | 怎么样 448 | 怎样 449 | 咋 450 | 照 451 | 照着 452 | 者 453 | 这 454 | 这边 455 | 这儿 456 | 这个 457 | 这会儿 458 | 这就是说 459 | 这里 460 | 这么 461 | 这么点儿 462 | 这么些 463 | 这么样 464 | 这时 465 | 这些 466 | 这样 467 | 正如 468 | 吱 469 | 之 470 | 之类 471 | 之所以 472 | 之一 473 | 只是 474 | 只限 475 | 只要 476 | 只有 477 | 至 478 | 至于 479 | 诸位 480 | 着 481 | 着呢 482 | 自 483 | 自从 484 | 自个儿 485 | 自各儿 486 | 自己 487 | 自家 488 | 自身 489 | 综上所述 490 | 总的来看 491 | 总的来说 492 | 总的说来 493 | 总而言之 494 | 总之 495 | 纵 496 | 纵令 497 | 纵然 498 | 纵使 499 | 遵照 500 | 作为 501 | 兮 502 | 呃 503 | 呗 504 | 咚 505 | 咦 506 | 喏 507 | 啐 508 | 喔唷 509 | 嗬 510 | 嗯 511 | 嗳 512 | ~ 513 | ! 514 | . 515 | : 516 | " 517 | ' 518 | ( 519 | ) 520 | * 521 | A 522 | 白 523 | 社会主义 524 | -- 525 | .. 526 | >> 527 | [ 528 | ] 529 | 530 | < 531 | > 532 | / 533 | \ 534 | | 535 | - 536 | _ 537 | + 538 | = 539 | & 540 | ^ 541 | % 542 | # 543 | @ 544 | ` 545 | ; 546 | $ 547 | ( 548 | ) 549 | —— 550 | — 551 | ¥ 552 | · 553 | ... 
554 | ‘ 555 | ’ 556 | 〉 557 | 〈 558 | … 559 |   560 | 0 561 | 1 562 | 2 563 | 3 564 | 4 565 | 5 566 | 6 567 | 7 568 | 8 569 | 9 570 | 0 571 | 1 572 | 2 573 | 3 574 | 4 575 | 5 576 | 6 577 | 7 578 | 8 579 | 9 580 | 二 581 | 三 582 | 四 583 | 五 584 | 六 585 | 七 586 | 八 587 | 九 588 | 零 589 | > 590 | < 591 | @ 592 | # 593 | $ 594 | % 595 | ︿ 596 | & 597 | * 598 | + 599 | ~ 600 | | 601 | [ 602 | ] 603 | { 604 | } 605 | 啊哈 606 | 啊呀 607 | 啊哟 608 | 挨次 609 | 挨个 610 | 挨家挨户 611 | 挨门挨户 612 | 挨门逐户 613 | 挨着 614 | 按理 615 | 按期 616 | 按时 617 | 按说 618 | 暗地里 619 | 暗中 620 | 暗自 621 | 昂然 622 | 八成 623 | 白白 624 | 半 625 | 梆 626 | 保管 627 | 保险 628 | 饱 629 | 背地里 630 | 背靠背 631 | 倍感 632 | 倍加 633 | 本人 634 | 本身 635 | 甭 636 | 比起 637 | 比如说 638 | 比照 639 | 毕竟 640 | 必 641 | 必定 642 | 必将 643 | 必须 644 | 便 645 | 别人 646 | 并非 647 | 并肩 648 | 并没 649 | 并没有 650 | 并排 651 | 并无 652 | 勃然 653 | 不 654 | 不必 655 | 不常 656 | 不大 657 | 不但...而且 658 | 不得 659 | 不得不 660 | 不得了 661 | 不得已 662 | 不迭 663 | 不定 664 | 不对 665 | 不妨 666 | 不管怎样 667 | 不会 668 | 不仅...而且 669 | 不仅仅 670 | 不仅仅是 671 | 不经意 672 | 不可开交 673 | 不可抗拒 674 | 不力 675 | 不了 676 | 不料 677 | 不满 678 | 不免 679 | 不能不 680 | 不起 681 | 不巧 682 | 不然的话 683 | 不日 684 | 不少 685 | 不胜 686 | 不时 687 | 不是 688 | 不同 689 | 不能 690 | 不要 691 | 不外 692 | 不外乎 693 | 不下 694 | 不限 695 | 不消 696 | 不已 697 | 不亦乐乎 698 | 不由得 699 | 不再 700 | 不择手段 701 | 不怎么 702 | 不曾 703 | 不知不觉 704 | 不止 705 | 不止一次 706 | 不至于 707 | 才 708 | 才能 709 | 策略地 710 | 差不多 711 | 差一点 712 | 常 713 | 常常 714 | 常言道 715 | 常言说 716 | 常言说得好 717 | 长此下去 718 | 长话短说 719 | 长期以来 720 | 长线 721 | 敞开儿 722 | 彻夜 723 | 陈年 724 | 趁便 725 | 趁机 726 | 趁热 727 | 趁势 728 | 趁早 729 | 成年 730 | 成年累月 731 | 成心 732 | 乘机 733 | 乘胜 734 | 乘势 735 | 乘隙 736 | 乘虚 737 | 诚然 738 | 迟早 739 | 充分 740 | 充其极 741 | 充其量 742 | 抽冷子 743 | 臭 744 | 初 745 | 出 746 | 出来 747 | 出去 748 | 除此 749 | 除此而外 750 | 除此以外 751 | 除开 752 | 除去 753 | 除却 754 | 除外 755 | 处处 756 | 川流不息 757 | 传 758 | 传说 759 | 传闻 760 | 串行 761 | 纯 762 | 纯粹 763 | 此后 764 | 此中 765 | 次第 766 | 匆匆 767 | 从不 768 | 从此 769 | 从此以后 770 | 从古到今 771 | 从古至今 772 | 从今以后 773 | 从宽 774 | 从来 775 | 从轻 776 | 从速 777 | 从头 778 | 从未 779 | 从无到有 780 | 从小 781 | 从新 782 | 从严 783 | 从优 784 | 从早到晚 785 | 从中 786 | 从重 787 | 凑巧 788 | 粗 789 | 存心 790 | 达旦 791 | 打从 792 | 打开天窗说亮话 793 | 大 794 | 大不了 795 | 大大 796 | 大抵 797 | 大都 798 | 大多 799 | 大凡 800 | 大概 801 | 大家 802 | 大举 803 | 大略 804 | 大面儿上 805 | 大事 806 | 大体 807 | 大体上 808 | 大约 809 | 大张旗鼓 810 | 大致 811 | 呆呆地 812 | 带 813 | 殆 814 | 待到 815 | 单 816 | 单纯 817 | 单单 818 | 但愿 819 | 弹指之间 820 | 当场 821 | 当儿 822 | 当即 823 | 当口儿 824 | 当然 825 | 当庭 826 | 当头 827 | 当下 828 | 当真 829 | 当中 830 | 倒不如 831 | 倒不如说 832 | 倒是 833 | 到处 834 | 到底 835 | 到了儿 836 | 到目前为止 837 | 到头 838 | 到头来 839 | 得起 840 | 得天独厚 841 | 的确 842 | 等到 843 | 叮当 844 | 顶多 845 | 定 846 | 动不动 847 | 动辄 848 | 陡然 849 | 都 850 | 独 851 | 独自 852 | 断然 853 | 顿时 854 | 多次 855 | 多多 856 | 多多少少 857 | 多多益善 858 | 多亏 859 | 多年来 860 | 多年前 861 | 而后 862 | 而论 863 | 而又 864 | 尔等 865 | 二话不说 866 | 二话没说 867 | 反倒 868 | 反倒是 869 | 反而 870 | 反手 871 | 反之亦然 872 | 反之则 873 | 方 874 | 方才 875 | 方能 876 | 放量 877 | 非常 878 | 非得 879 | 分期 880 | 分期分批 881 | 分头 882 | 奋勇 883 | 愤然 884 | 风雨无阻 885 | 逢 886 | 弗 887 | 甫 888 | 嘎嘎 889 | 该当 890 | 概 891 | 赶快 892 | 赶早不赶晚 893 | 敢 894 | 敢情 895 | 敢于 896 | 刚 897 | 刚才 898 | 刚好 899 | 刚巧 900 | 高低 901 | 格外 902 | 隔日 903 | 隔夜 904 | 个人 905 | 各式 906 | 更 907 | 更加 908 | 更进一步 909 | 更为 910 | 公然 911 | 共 912 | 共总 913 | 够瞧的 914 | 姑且 915 | 古来 916 | 故而 917 | 故意 918 | 固 919 | 怪 920 | 怪不得 921 | 惯常 922 | 光 923 | 光是 924 | 归根到底 925 | 归根结底 926 | 过于 927 | 毫不 928 | 毫无 929 | 毫无保留地 930 | 毫无例外 931 | 好在 932 | 何必 933 | 何尝 934 | 何妨 935 | 何苦 936 | 何乐而不为 937 | 何须 938 | 何止 939 | 很 940 | 很多 941 | 
很少 942 | 轰然 943 | 后来 944 | 呼啦 945 | 忽地 946 | 忽然 947 | 互 948 | 互相 949 | 哗啦 950 | 话说 951 | 还 952 | 恍然 953 | 会 954 | 豁然 955 | 活 956 | 伙同 957 | 或多或少 958 | 或许 959 | 基本 960 | 基本上 961 | 基于 962 | 极 963 | 极大 964 | 极度 965 | 极端 966 | 极力 967 | 极其 968 | 极为 969 | 急匆匆 970 | 即将 971 | 即刻 972 | 即是说 973 | 几度 974 | 几番 975 | 几乎 976 | 几经 977 | 既...又 978 | 继之 979 | 加上 980 | 加以 981 | 间或 982 | 简而言之 983 | 简言之 984 | 简直 985 | 见 986 | 将才 987 | 将近 988 | 将要 989 | 交口 990 | 较比 991 | 较为 992 | 接连不断 993 | 接下来 994 | 皆可 995 | 截然 996 | 截至 997 | 藉以 998 | 借此 999 | 借以 1000 | 届时 1001 | 仅 1002 | 仅仅 1003 | 谨 1004 | 进来 1005 | 进去 1006 | 近 1007 | 近几年来 1008 | 近来 1009 | 近年来 1010 | 尽管如此 1011 | 尽可能 1012 | 尽快 1013 | 尽量 1014 | 尽然 1015 | 尽如人意 1016 | 尽心竭力 1017 | 尽心尽力 1018 | 尽早 1019 | 精光 1020 | 经常 1021 | 竟 1022 | 竟然 1023 | 究竟 1024 | 就此 1025 | 就地 1026 | 就算 1027 | 居然 1028 | 局外 1029 | 举凡 1030 | 据称 1031 | 据此 1032 | 据实 1033 | 据说 1034 | 据我所知 1035 | 据悉 1036 | 具体来说 1037 | 决不 1038 | 决非 1039 | 绝 1040 | 绝不 1041 | 绝顶 1042 | 绝对 1043 | 绝非 1044 | 均 1045 | 喀 1046 | 看 1047 | 看来 1048 | 看起来 1049 | 看上去 1050 | 看样子 1051 | 可好 1052 | 可能 1053 | 恐怕 1054 | 快 1055 | 快要 1056 | 来不及 1057 | 来得及 1058 | 来讲 1059 | 来看 1060 | 拦腰 1061 | 牢牢 1062 | 老 1063 | 老大 1064 | 老老实实 1065 | 老是 1066 | 累次 1067 | 累年 1068 | 理当 1069 | 理该 1070 | 理应 1071 | 历 1072 | 立 1073 | 立地 1074 | 立刻 1075 | 立马 1076 | 立时 1077 | 联袂 1078 | 连连 1079 | 连日 1080 | 连日来 1081 | 连声 1082 | 连袂 1083 | 临到 1084 | 另方面 1085 | 另行 1086 | 另一个 1087 | 路经 1088 | 屡 1089 | 屡次 1090 | 屡次三番 1091 | 屡屡 1092 | 缕缕 1093 | 率尔 1094 | 率然 1095 | 略 1096 | 略加 1097 | 略微 1098 | 略为 1099 | 论说 1100 | 马上 1101 | 蛮 1102 | 满 1103 | 没 1104 | 没有 1105 | 每逢 1106 | 每每 1107 | 每时每刻 1108 | 猛然 1109 | 猛然间 1110 | 莫 1111 | 莫不 1112 | 莫非 1113 | 莫如 1114 | 默默地 1115 | 默然 1116 | 呐 1117 | 那末 1118 | 奈 1119 | 难道 1120 | 难得 1121 | 难怪 1122 | 难说 1123 | 内 1124 | 年复一年 1125 | 凝神 1126 | 偶而 1127 | 偶尔 1128 | 怕 1129 | 砰 1130 | 碰巧 1131 | 譬如 1132 | 偏偏 1133 | 乒 1134 | 平素 1135 | 颇 1136 | 迫于 1137 | 扑通 1138 | 其后 1139 | 其实 1140 | 奇 1141 | 齐 1142 | 起初 1143 | 起来 1144 | 起首 1145 | 起头 1146 | 起先 1147 | 岂 1148 | 岂非 1149 | 岂止 1150 | 迄 1151 | 恰逢 1152 | 恰好 1153 | 恰恰 1154 | 恰巧 1155 | 恰如 1156 | 恰似 1157 | 千 1158 | 千万 1159 | 千万千万 1160 | 切 1161 | 切不可 1162 | 切莫 1163 | 切切 1164 | 切勿 1165 | 窃 1166 | 亲口 1167 | 亲身 1168 | 亲手 1169 | 亲眼 1170 | 亲自 1171 | 顷 1172 | 顷刻 1173 | 顷刻间 1174 | 顷刻之间 1175 | 请勿 1176 | 穷年累月 1177 | 取道 1178 | 去 1179 | 权时 1180 | 全都 1181 | 全力 1182 | 全年 1183 | 全然 1184 | 全身心 1185 | 然 1186 | 人人 1187 | 仍 1188 | 仍旧 1189 | 仍然 1190 | 日复一日 1191 | 日见 1192 | 日渐 1193 | 日益 1194 | 日臻 1195 | 如常 1196 | 如此等等 1197 | 如次 1198 | 如今 1199 | 如期 1200 | 如前所述 1201 | 如上 1202 | 如下 1203 | 汝 1204 | 三番两次 1205 | 三番五次 1206 | 三天两头 1207 | 瑟瑟 1208 | 沙沙 1209 | 上 1210 | 上来 1211 | 上去 1212 | 标签 1213 | 转发 1214 | 关注 1215 | 分享 1216 | 回复 1217 | 地址 1218 | -------------------------------------------------------------------------------- /code/Feature_process_java/library/userLibrary/userLibrary.dic: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_process_java/library/userLibrary/userLibrary.dic -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step01_01_TaskA_DataInfoStat.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileInputStream; 7 | import 
java.io.FileOutputStream; 8 | import java.io.InputStreamReader; 9 | import java.io.OutputStreamWriter; 10 | import java.util.HashMap; 11 | import Tools.PreProcessText; 12 | 13 | public class Step01_01_TaskA_DataInfoStat { 14 | /** 15 | * @param args 16 | * @author jacoxu.com-2016/07/02 17 | */ 18 | //rawTrainData, dataRefine, tagInfo 19 | private static void dataAnalysis(String rawTrain_labeled, String rawTrain_unlabeled, String rawTest_unlabeled 20 | , String train_labeled_tag_pre, String train_labeled_text_pre, String train_unlabeled_text_pre 21 | , String test_unlabeled_text_pre, HashMap topic2filesuffix 22 | , boolean hasTestData) { 23 | // 24 | HashMap topicCountInfoMap = new HashMap(); 25 | HashMap tagCountInfoMap = new HashMap(); 26 | 27 | try { 28 | String encoding = "UTF-8"; 29 | File rawTrainLabelledFile = new File(rawTrain_labeled); 30 | File rawTrainUnlabeledFile = new File(rawTrain_unlabeled); 31 | File rawTestUnlabeledFile = new File(rawTest_unlabeled); 32 | if (rawTrainLabelledFile.exists()) { 33 | BufferedReader rawTrainLabelledReader = new BufferedReader(new InputStreamReader( 34 | new FileInputStream(rawTrainLabelledFile), encoding)); 35 | BufferedReader rawTrainUnlabeledReader = new BufferedReader(new InputStreamReader( 36 | new FileInputStream(rawTrainUnlabeledFile), encoding)); 37 | 38 | //开始读取训练数据 39 | System.out.println("Start to process train data info..."); 40 | String tmpId; 41 | String tmpTopic; 42 | String tmpText; 43 | String tmpTag; 44 | 45 | int lineNum=0; 46 | int dropNum = 0; 47 | String tmpLineStr = null; 48 | while ((tmpLineStr = rawTrainLabelledReader.readLine()) != null) { 49 | lineNum++; 50 | //第一行是Title,所以跳过去 51 | if (lineNum<=1) continue; 52 | String[] trainSentInfo = tmpLineStr.split("\t"); 53 | if (trainSentInfo.length==3) { 54 | System.err.println("Error: rawTrainLabelledReader.length is "+ trainSentInfo.length 55 | + " in lineNum" + lineNum + ", we drop it."); 56 | dropNum++; 57 | continue; 58 | } 59 | if (trainSentInfo.length!=4) { 60 | System.err.println("Error: rawTrainLabelledReader.length is "+ trainSentInfo.length 61 | + " in lineNum" + lineNum); 62 | System.exit(0); 63 | } 64 | tmpId = trainSentInfo[0].trim(); 65 | tmpTopic = trainSentInfo[1].trim(); 66 | tmpText = trainSentInfo[2].trim(); 67 | tmpTag = trainSentInfo[3].trim(); 68 | 69 | //如果是初次遇到此标签 70 | int tmpTopicCount = 0; 71 | int tmpTagCount = 0; 72 | 73 | if (topicCountInfoMap.containsKey(tmpTopic)) { 74 | tmpTopicCount = topicCountInfoMap.get(tmpTopic); 75 | }else { 76 | tmpTopicCount = 0; 77 | } 78 | tmpTopicCount++; 79 | topicCountInfoMap.put(tmpTopic, tmpTopicCount); 80 | 81 | if (tagCountInfoMap.containsKey(tmpTopic+"_"+tmpTag)) { 82 | tmpTagCount = tagCountInfoMap.get(tmpTopic+"_"+tmpTag); 83 | }else { 84 | tmpTagCount = 0; 85 | } 86 | tmpTagCount++; 87 | tagCountInfoMap.put(tmpTopic+"_"+tmpTag, tmpTagCount); 88 | 89 | //对文本进行预处理 90 | tmpText = PreProcessText.preProcess4NLPCC2016(tmpText, tmpTopic); 91 | int tag_flag = -1; 92 | if (tmpTag.equals("FAVOR")) 93 | tag_flag = 1; 94 | else if (tmpTag.equals("AGAINST")) 95 | tag_flag = 2; 96 | else if (tmpTag.equals("NONE")) 97 | tag_flag = 0; 98 | else { 99 | System.out.println("Error tag:" + tmpTag); 100 | } 101 | Result2Txt(train_labeled_tag_pre+"_"+topic2filesuffix.get(tmpTopic) 102 | , String.valueOf(tag_flag)); 103 | Result2Txt(train_labeled_text_pre+"_"+topic2filesuffix.get(tmpTopic) 104 | , tmpText); 105 | if (lineNum%1000 ==0) { 106 | System.out.println("hasProcessed train data numbers:" + lineNum); 107 | } 108 | } 109 | 
System.out.println("topicCountInfoMap:"+topicCountInfoMap.toString()); 110 | System.out.println("tagCountInfoMap:"+tagCountInfoMap.toString()); 111 | System.out.println("Totally processed train data numbers:" + lineNum); 112 | System.out.println("rawTrainLabelledFile dropNum is:" + dropNum); 113 | rawTrainLabelledReader.close(); 114 | 115 | dropNum = 0; 116 | lineNum=0; 117 | topicCountInfoMap.clear(); 118 | tagCountInfoMap.clear(); 119 | while ((tmpLineStr = rawTrainUnlabeledReader.readLine()) != null) { 120 | lineNum++; 121 | //第一行是Title,所以跳过去 122 | if (lineNum<=1) continue; 123 | String[] trainSentInfo = tmpLineStr.split("\t"); 124 | if (trainSentInfo.length == 2){ 125 | System.err.println("Error: rawTrainUnlabeledFile.length is "+ trainSentInfo.length 126 | + " in lineNum" + lineNum + ", we drop it."); 127 | dropNum++; 128 | continue; 129 | } 130 | 131 | if (trainSentInfo.length!=3) { 132 | System.err.println("Error: rawTrainUnlabeledFile.length is "+ trainSentInfo.length 133 | + " in lineNum" + lineNum); 134 | System.exit(0); 135 | } 136 | tmpId = trainSentInfo[0].trim(); 137 | tmpTopic = trainSentInfo[1].trim(); 138 | tmpText = trainSentInfo[2].trim(); 139 | 140 | //如果是初次遇到此主题 141 | int tmpTopicCount = 0; 142 | 143 | if (topicCountInfoMap.containsKey(tmpTopic)) { 144 | tmpTopicCount = topicCountInfoMap.get(tmpTopic); 145 | }else { 146 | tmpTopicCount = 0; 147 | } 148 | tmpTopicCount++; 149 | topicCountInfoMap.put(tmpTopic, tmpTopicCount); 150 | 151 | //对文本进行预处理 152 | tmpText = PreProcessText.preProcess4NLPCC2016(tmpText, tmpTopic); 153 | Result2Txt(train_unlabeled_text_pre+"_"+topic2filesuffix.get(tmpTopic) 154 | , tmpText); 155 | if (lineNum%1000 ==0) { 156 | System.out.println("hasProcessed unlabeled train data numbers:" + lineNum); 157 | } 158 | } 159 | System.out.println("topicCountInfoMap:"+topicCountInfoMap.toString()); 160 | System.out.println("Totally processed unlabeled train data numbers:" + lineNum); 161 | System.out.println("rawTrainUnlabeledFile dropNum is:" + dropNum); 162 | rawTrainUnlabeledReader.close(); 163 | 164 | 165 | if (hasTestData) { 166 | BufferedReader rawTestUnlabeledReader = new BufferedReader(new InputStreamReader( 167 | new FileInputStream(rawTestUnlabeledFile), encoding)); 168 | //开始读取测试数据 169 | lineNum=0; 170 | topicCountInfoMap.clear(); 171 | tagCountInfoMap.clear(); 172 | while ((tmpLineStr = rawTestUnlabeledReader.readLine()) != null) { 173 | lineNum++; 174 | //第一行是Title,所以跳过去 175 | if (lineNum<=1) continue; 176 | String[] testSentInfo = tmpLineStr.split("\t"); 177 | if (testSentInfo.length!=4) { 178 | // System.err.println("【Error】: rawTestUnlabeledFile.length is "+ testSentInfo.length 179 | // + " in lineNum" + lineNum); 180 | System.err.println(testSentInfo[0].trim() + "\t" + testSentInfo[4].trim()); 181 | // System.exit(0); 182 | } 183 | tmpId = testSentInfo[0].trim(); 184 | tmpTopic = testSentInfo[1].trim(); 185 | tmpText = testSentInfo[2].trim(); 186 | 187 | //如果是初次遇到此主题 188 | int tmpTopicCount = 0; 189 | 190 | if (topicCountInfoMap.containsKey(tmpTopic)) { 191 | tmpTopicCount = topicCountInfoMap.get(tmpTopic); 192 | }else { 193 | tmpTopicCount = 0; 194 | } 195 | tmpTopicCount++; 196 | topicCountInfoMap.put(tmpTopic, tmpTopicCount); 197 | 198 | //对文本进行预处理 199 | tmpText = PreProcessText.preProcess4NLPCC2016(tmpText, tmpTopic); 200 | Result2Txt(test_unlabeled_text_pre+"_"+topic2filesuffix.get(tmpTopic) 201 | , tmpText); 202 | if (lineNum%1000 ==0) { 203 | System.out.println("hasProcessed unlabeled test data numbers:" + lineNum); 204 | } 205 | } 
206 | System.out.println("topicCountInfoMap:"+topicCountInfoMap.toString()); 207 | System.out.println("Totally processed unlabeled test data numbers:" + lineNum); 208 | rawTestUnlabeledReader.close(); 209 | } 210 | } else { 211 | System.out.println("can't find the file"); 212 | } 213 | } catch (Exception e) { 214 | System.out.println("something error when reading the content of the file"); 215 | e.printStackTrace(); 216 | } 217 | return; 218 | 219 | } 220 | 221 | public static void Result2Txt(String file, String txt) { 222 | try { 223 | BufferedWriter os = new BufferedWriter(new OutputStreamWriter( 224 | new FileOutputStream(new File(file),true), "UTF-8")); 225 | os.write(txt + "\n"); 226 | os.close(); 227 | } catch (Exception e) { 228 | e.printStackTrace(); 229 | } 230 | } 231 | 232 | public static void main(String[] args) { 233 | //***********测试区域************ 234 | System.out.println("test"); 235 | //***********测试区域************ 236 | 237 | String dataPathStr="./../../data/"; 238 | String resultsPathStr="./../../data/RefineData/Step01_topics/"; 239 | File resultsPathFile = new File(resultsPathStr); 240 | if (!resultsPathFile.exists()) resultsPathFile.mkdir(); 241 | 242 | //分析训练/测试数据集 243 | String rawTrain_labeled = dataPathStr+"evasampledata4-TaskAA.txt"; 244 | String rawTrain_unlabeled = dataPathStr+"evasampledata4-TaskAR.txt"; 245 | 246 | String rawTest_unlabeled = dataPathStr+"NLPCC2016_Stance_Detection_Task_A_Testdata.txt"; 247 | //按5个主题 拆分 标签和文本 到文件中 248 | String train_labeled_tag_pre = resultsPathStr+"TaskA_train_labeled_tag"; 249 | String train_labeled_text_pre = resultsPathStr+"TaskA_train_labeled_text"; 250 | String train_unlabeled_text_pre = resultsPathStr+"TaskA_train_unlabeled_text"; 251 | String test_unlabeled_text_pre = resultsPathStr+"TaskA_test_unlabeled_text"; 252 | //主题到文件后缀的映射 253 | HashMap topic2filesuffix = new HashMap(); 254 | topic2filesuffix.put("IphoneSE", "iphonese"); 255 | topic2filesuffix.put("iPhone SE", "iphonese"); 256 | topic2filesuffix.put("春节放鞭炮", "bianpao"); 257 | topic2filesuffix.put("俄罗斯在叙利亚的反恐行动", "fankong"); 258 | topic2filesuffix.put("俄罗斯叙利亚反恐行动", "fankong"); 259 | topic2filesuffix.put("开放二胎", "ertai"); 260 | topic2filesuffix.put("深圳禁摩限电", "jinmo"); 261 | 262 | boolean hasTestData = true; 263 | 264 | long readstart=System.currentTimeMillis(); 265 | dataAnalysis(rawTrain_labeled, rawTrain_unlabeled, rawTest_unlabeled 266 | , train_labeled_tag_pre, train_labeled_text_pre, train_unlabeled_text_pre 267 | , test_unlabeled_text_pre, topic2filesuffix, hasTestData); 268 | long readend=System.currentTimeMillis(); 269 | System.out.println((readend-readstart)/1000.0+"s had been consumed to process the raw train/test data"); 270 | } 271 | } 272 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step02_01_TaskA_GenDict.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileOutputStream; 7 | import java.io.FileReader; 8 | import java.io.OutputStreamWriter; 9 | import java.util.ArrayList; 10 | import java.util.HashMap; 11 | import java.util.Iterator; 12 | import java.util.Map; 13 | 14 | public class Step02_01_TaskA_GenDict { 15 | public static void main(String[] args) throws Exception { 16 | 17 | //读取训练数据文件 18 | String dataPathStr="./../../data/RefineData/Step01_topics/"; 19 | String 
resultsPathStr="./../../data/RefineData/Step02_vsm/"; 20 | File resultsPathFile = new File(resultsPathStr); 21 | if (!resultsPathFile.exists()) resultsPathFile.mkdir(); 22 | 23 | String train_labeled_text_pre = dataPathStr+"TaskA_train_labeled_text"; 24 | String train_unlabeled_text_pre = dataPathStr+"TaskA_train_unlabeled_text"; 25 | String test_unlabeled_text_pre = dataPathStr+"TaskA_test_unlabeled_text"; 26 | 27 | //主题到文件后缀的映射 28 | ArrayList topic2filesuffix = new ArrayList(); 29 | topic2filesuffix.add("iphonese"); 30 | topic2filesuffix.add("bianpao"); 31 | topic2filesuffix.add("fankong"); 32 | topic2filesuffix.add("ertai"); 33 | topic2filesuffix.add("jinmo"); 34 | 35 | boolean hasTestData = true; 36 | 37 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 38 | String usrWordDict = resultsPathStr + "usrWordDict_" + topic2filesuffix.get(topic_idx); 39 | BufferedWriter wordDictFileW = new BufferedWriter(new OutputStreamWriter( 40 | new FileOutputStream(new File(usrWordDict)), "UTF-8")); 41 | 42 | String train_labeled_text = 43 | train_labeled_text_pre+"_"+topic2filesuffix.get(topic_idx); 44 | FileReader train_labeled_text_fr = new FileReader(train_labeled_text); 45 | BufferedReader train_labeled_text_br = new BufferedReader(train_labeled_text_fr); 46 | 47 | String train_unlabeled_text = 48 | train_unlabeled_text_pre+"_"+topic2filesuffix.get(topic_idx); 49 | FileReader train_unlabeled_text_fr = new FileReader(train_unlabeled_text); 50 | BufferedReader train_unlabeled_text_br = new BufferedReader(train_unlabeled_text_fr); 51 | 52 | //先读入处理好的训练语料 53 | int wordReadNum = 0; 54 | int wordWriteNum = 0; 55 | 56 | HashMap wordMap = new HashMap(); 57 | System.out.println("Start to read wordSet ..."); 58 | //train_labeled_text 59 | int lineNum = 0; 60 | String tempLine; 61 | while ((tempLine = train_labeled_text_br.readLine()) != null) { 62 | lineNum++; 63 | String[] wordArraysStr = tempLine.trim().split("\\s+"); 64 | for (int i = 0; i < wordArraysStr.length; i++) { 65 | String tmpWord = wordArraysStr[i].trim(); 66 | if (tmpWord.length()<1) continue; 67 | if (wordMap.containsKey(tmpWord)) { 68 | wordMap.put(tmpWord, wordMap.get(tmpWord)+1); 69 | }else { 70 | wordMap.put(tmpWord,(long) 1); 71 | wordReadNum++; 72 | } 73 | } 74 | if (lineNum%1000 ==0) { 75 | System.out.println("hasProcessed train data numbers:" + lineNum); 76 | } 77 | } 78 | System.out.println("train_labeled_text total lineNum:"+lineNum 79 | +", and wordSet.size():"+wordMap.size()); 80 | train_labeled_text_br.close(); 81 | 82 | //train_unlabeled_text 83 | lineNum = 0; 84 | while ((tempLine = train_unlabeled_text_br.readLine()) != null) { 85 | lineNum++; 86 | String[] wordArraysStr = tempLine.trim().split("\\s+"); 87 | for (int i = 0; i < wordArraysStr.length; i++) { 88 | String tmpWord = wordArraysStr[i].trim(); 89 | if (tmpWord.length()<1) continue; 90 | if (wordMap.containsKey(tmpWord)) { 91 | wordMap.put(tmpWord, wordMap.get(tmpWord)+1); 92 | }else { 93 | wordMap.put(tmpWord,(long) 1); 94 | wordReadNum++; 95 | } 96 | } 97 | if (lineNum%1000 ==0) { 98 | System.out.println("hasProcessed train_unlabeled_text data numbers:" + lineNum); 99 | } 100 | } 101 | System.out.println("train_unlabeled_text total lineNum:"+lineNum 102 | +", and wordSet.size():"+wordMap.size()); 103 | train_unlabeled_text_br.close(); 104 | 105 | if (hasTestData){ 106 | String test_unlabeled_text = 107 | test_unlabeled_text_pre+"_"+topic2filesuffix.get(topic_idx); 108 | FileReader test_unlabeled_text_fr = new 
FileReader(test_unlabeled_text); 109 | BufferedReader test_unlabeled_text_br = new BufferedReader(test_unlabeled_text_fr); 110 | 111 | //test_unlabeled_text 112 | lineNum = 0; 113 | while ((tempLine = test_unlabeled_text_br.readLine()) != null) { 114 | lineNum++; 115 | String[] wordArraysStr = tempLine.trim().split("\\s+"); 116 | for (int i = 0; i < wordArraysStr.length; i++) { 117 | String tmpWord = wordArraysStr[i].trim(); 118 | if (tmpWord.length()<1) continue; 119 | if (wordMap.containsKey(tmpWord)) { 120 | wordMap.put(tmpWord, wordMap.get(tmpWord)+1); 121 | }else { 122 | wordMap.put(tmpWord,(long) 1); 123 | wordReadNum++; 124 | } 125 | } 126 | if (lineNum%1000 ==0) { 127 | System.out.println("hasProcessed test_unlabeled_text data numbers:" + lineNum); 128 | } 129 | } 130 | System.out.println("test_unlabeled_text total lineNum:"+lineNum 131 | +", and wordSet.size():"+wordMap.size()); 132 | test_unlabeled_text_br.close(); 133 | } 134 | 135 | Iterator iter=wordMap.entrySet().iterator(); 136 | while(iter.hasNext()){ 137 | Map.Entry entry = (Map.Entry) iter.next(); 138 | String tmpWord = entry.getKey(); 139 | 140 | wordWriteNum++; 141 | wordDictFileW.write(tmpWord+"\t"+wordWriteNum+"\n"); 142 | } 143 | System.out.println("wordReadNum:" +wordReadNum); 144 | System.out.println("wordWriteNum:" +wordWriteNum); 145 | if (wordMap.size()!=wordReadNum) { 146 | System.out.println("Error! wordReadNum differs from wordMap.size()"); 147 | } 148 | wordDictFileW.close(); 149 | } 150 | } 151 | } -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step02_01_TaskA_GenVSM.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileInputStream; 7 | import java.io.FileOutputStream; 8 | import java.io.IOException; 9 | import java.io.InputStreamReader; 10 | import java.io.OutputStreamWriter; 11 | import java.util.ArrayList; 12 | import java.util.HashMap; 13 | 14 | public class Step02_01_TaskA_GenVSM { 15 | public static void main(String[] args) throws Exception { 16 | //Test 17 | // String tempLine ="宸濄伄娴併倢銇倛銇?28521"; 18 | // String[] tokensStr = tempLine.split("\t"); 19 | //Build a term-frequency vector space model from the plain text and the word map, used for STH preprocessing 20 | // all = [test_data;train_data]!
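/* Format note: creatVSMText below writes one line per document, holding space-separated raw term frequencies with one column per entry of usrWordDict_<topic>; the labeled, unlabeled and test matrices therefore share the same column order, which the Matlab side (tf_idf.m, normalize.m) presumably relies on when loading them as dense matrices. */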
21 | String dataPathStr = "./../../data/RefineData/Step01_topics/"; 22 | String resultsPathStr = "./../../data/RefineData/Step02_vsm/"; 23 | 24 | String train_labeled_text_pre = dataPathStr+"TaskA_train_labeled_text"; 25 | String train_unlabeled_text_pre = dataPathStr+"TaskA_train_unlabeled_text"; 26 | String test_unlabeled_text_pre = dataPathStr+"TaskA_test_unlabeled_text"; 27 | 28 | //主题到文件后缀的映射 29 | ArrayList topic2filesuffix = new ArrayList(); 30 | topic2filesuffix.add("iphonese"); 31 | topic2filesuffix.add("bianpao"); 32 | topic2filesuffix.add("fankong"); 33 | topic2filesuffix.add("ertai"); 34 | topic2filesuffix.add("jinmo"); 35 | 36 | boolean hasTestData = true; 37 | 38 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 39 | String wordMapStr = resultsPathStr + "usrWordDict_" + topic2filesuffix.get(topic_idx); 40 | 41 | String vsmBW_train_labeledStr = resultsPathStr + "/vsm_train_labeled_" 42 | + topic2filesuffix.get(topic_idx); 43 | String vsmBW_train_unlabeledStr = resultsPathStr+"/vsm_train_unlabeled_" 44 | + topic2filesuffix.get(topic_idx); 45 | String vsmBW_test_unlabeledStr = resultsPathStr+"/vsm_test_unlabeled_" 46 | + topic2filesuffix.get(topic_idx); 47 | BufferedReader train_labeled_File = new BufferedReader(new InputStreamReader( 48 | new FileInputStream(new File(train_labeled_text_pre 49 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 50 | BufferedReader train_unlabeled_File = new BufferedReader(new InputStreamReader( 51 | new FileInputStream(new File(train_unlabeled_text_pre 52 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 53 | 54 | //构造训练VSM词频向量空间模型 55 | BufferedReader wordMapRD = new BufferedReader( 56 | new InputStreamReader(new FileInputStream(new File(wordMapStr)), "UTF-8")); 57 | creatVSMText(train_labeled_File, wordMapRD, vsmBW_train_labeledStr); 58 | wordMapRD.close(); 59 | train_labeled_File.close(); 60 | 61 | wordMapRD = new BufferedReader( 62 | new InputStreamReader(new FileInputStream(new File(wordMapStr)), "UTF-8")); 63 | creatVSMText(train_unlabeled_File, wordMapRD, vsmBW_train_unlabeledStr); 64 | wordMapRD.close(); 65 | train_unlabeled_File.close(); 66 | 67 | if(hasTestData){ 68 | BufferedReader test_unlabeled_File = new BufferedReader(new InputStreamReader( 69 | new FileInputStream(new File(test_unlabeled_text_pre 70 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 71 | 72 | wordMapRD = new BufferedReader( 73 | new InputStreamReader(new FileInputStream(new File(wordMapStr)), "UTF-8")); 74 | creatVSMText(test_unlabeled_File, wordMapRD, vsmBW_test_unlabeledStr); 75 | wordMapRD.close(); 76 | test_unlabeled_File.close(); 77 | } 78 | } 79 | 80 | System.out.println("It is done, ok!"); 81 | } 82 | 83 | public static void creatVSMText(BufferedReader sourceTextRD, 84 | BufferedReader wordMapRD, String vsmBW_Str) throws IOException, Exception { 85 | System.out.println("Start to create VSM ...!"); 86 | String tempLine; 87 | //先读入词典 88 | int wordIdxNum = 1; 89 | HashMap wordMap = new HashMap(); 90 | 91 | while ((tempLine = wordMapRD.readLine()) != null) { 92 | //词典中放着词和索引号,索引号 93 | if (wordMap.containsKey(tempLine.trim())) { 94 | System.out.println("Test, the word is replicate:"+tempLine.trim() 95 | +", in wordIdxNum:"+wordIdxNum); 96 | } 97 | if (tempLine.trim().length()==0) continue; 98 | //wordMap.put(tempLine.trim(), wordIdxNum); 99 | wordMap.put(tempLine.split("\\s+")[0].trim(), Integer.valueOf(tempLine.split("\\s+")[1])); 100 | wordIdxNum =wordIdxNum+1; 101 | } 102 | //定义了这个数据集的特征维数 103 | int dimVector = wordIdxNum-1; 
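/* dimVector equals the dictionary size: each document read below is expanded into a dense count vector of this fixed length, and tokens missing from the dictionary are logged as errors rather than silently skipped. */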
104 | System.out.println("Has read the dictionary, the size is:"+wordMap.size()); 105 | ArrayList wordFreqList = new ArrayList(); 106 | int lineNum = 1; 107 | boolean hasWordFeature = false; 108 | StringBuffer tmpVSMBuffer = new StringBuffer(); 109 | while ((tempLine = sourceTextRD.readLine()) != null) { 110 | hasWordFeature = false; 111 | //读入一行,即一个文档; 112 | wordFreqList.clear(); 113 | for (int i = 0; i < dimVector; i++) { 114 | wordFreqList.add(0); 115 | } 116 | 117 | String[] tokensStr; 118 | boolean isvalid = true; 119 | 120 | tokensStr = tempLine.trim().split("\\s+"); 121 | 122 | if (!(tokensStr.length<1)) { 123 | for (int j = 0; j < tokensStr.length; j++) { 124 | String tempToken = tokensStr[j]; 125 | if (wordMap.containsKey(tempToken.trim())) { 126 | hasWordFeature = true; 127 | int index = wordMap.get(tempToken.trim()); 128 | if (index>dimVector) { 129 | System.out.print("Error, and the word is: "+tempToken.trim()); 130 | } 131 | wordFreqList.set(index-1, wordFreqList.get(index-1)+1); 132 | }else { 133 | System.out.println("error: the map has not contain the word:" 134 | +tempToken+" in Line:"+lineNum); 135 | } 136 | } 137 | }else { 138 | isvalid = false; 139 | } 140 | 141 | if (!isvalid) { 142 | System.out.println("warning: the string has lacked contents:" 143 | +tempLine.trim()+" in Line:"+lineNum); 144 | } 145 | for (int tempFreq:wordFreqList) { 146 | tmpVSMBuffer.append(String.valueOf(tempFreq)+" "); 147 | } 148 | //把处理好的文本写入到新的文本文件中 149 | Result2Txt(vsmBW_Str,tmpVSMBuffer.toString().trim()); 150 | tmpVSMBuffer.delete(0, tmpVSMBuffer.length()); 151 | 152 | if (!hasWordFeature) { 153 | System.out.println("++++++++++"+"has no word in Line:"+lineNum+"++++++++++"); 154 | } 155 | lineNum++; 156 | if (lineNum%1000 ==0) { 157 | System.out.println("hasProcessed text numbers:" + lineNum); 158 | } 159 | } 160 | } 161 | public static void Result2Txt(String file, String txt) { 162 | try { 163 | BufferedWriter os = new BufferedWriter(new OutputStreamWriter( 164 | new FileOutputStream(new File(file),true), "UTF-8")); 165 | os.write(txt + "\n"); 166 | os.close(); 167 | } catch (Exception e) { 168 | e.printStackTrace(); 169 | } 170 | } 171 | } 172 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step04_01_TaskA_ChiSquare.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileOutputStream; 7 | import java.io.FileReader; 8 | import java.io.OutputStreamWriter; 9 | import java.util.ArrayList; 10 | import java.util.HashMap; 11 | import java.util.Iterator; 12 | import java.util.Map; 13 | 14 | public class Step04_01_TaskA_ChiSquare { 15 | public static void main(String[] args) throws Exception { 16 | 17 | //读取训练数据文件 18 | String dataPathStr="./../../data/RefineData/Step01_topics/"; 19 | String resultsPathStr="./../../data/RefineData/Step04_chisquare/"; 20 | File resultsPathFile = new File(resultsPathStr); 21 | if (!resultsPathFile.exists()) resultsPathFile.mkdir(); 22 | 23 | String train_labeled_text_pre = dataPathStr+"TaskA_train_labeled_text"; 24 | String train_labeled_tag_pre = dataPathStr+"TaskA_train_labeled_tag"; 25 | 26 | //主题到文件后缀的映射 27 | ArrayList topic2filesuffix = new ArrayList(); 28 | topic2filesuffix.add("iphonese"); 29 | topic2filesuffix.add("bianpao"); 30 | topic2filesuffix.add("fankong"); 31 | topic2filesuffix.add("ertai"); 32 | 
topic2filesuffix.add("jinmo"); 33 | int low_feq = 1; 34 | boolean hasTestData = false; 35 | 36 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 37 | String usrWordDict = resultsPathStr + "usrWordDict_low_feq_" + String.valueOf(low_feq) 38 | + "_"+ topic2filesuffix.get(topic_idx); 39 | BufferedWriter wordDictFileW = new BufferedWriter(new OutputStreamWriter( 40 | new FileOutputStream(new File(usrWordDict)), "UTF-8")); 41 | 42 | String taskA_ChiSquare = resultsPathStr + "TaskA_chiSquare_" 43 | + topic2filesuffix.get(topic_idx); 44 | BufferedWriter taskA_ChiSquareFileW = new BufferedWriter(new OutputStreamWriter( 45 | new FileOutputStream(new File(taskA_ChiSquare)), "UTF-8")); 46 | 47 | String train_labeled_text = 48 | train_labeled_text_pre+"_"+topic2filesuffix.get(topic_idx); 49 | FileReader train_labeled_text_fr = new FileReader(train_labeled_text); 50 | BufferedReader train_labeled_text_br = new BufferedReader(train_labeled_text_fr); 51 | 52 | String train_labeled_tag = 53 | train_labeled_tag_pre+"_"+topic2filesuffix.get(topic_idx); 54 | FileReader train_labeled_tag_fr = new FileReader(train_labeled_tag); 55 | BufferedReader train_labeled_tag_br = new BufferedReader(train_labeled_tag_fr); 56 | 57 | //先读入处理好的训练语料 58 | int wordReadNum = 0; 59 | int wordWriteNum = 0; 60 | 61 | HashMap wordMap = new HashMap(); 62 | System.out.println("Start to read wordSet ..."); 63 | //train_labeled_text 64 | int lineNum = 0; 65 | String tempLine; 66 | while ((tempLine = train_labeled_text_br.readLine()) != null) { 67 | lineNum++; 68 | String[] wordArraysStr = tempLine.trim().split("\\s+"); 69 | for (int i = 0; i < wordArraysStr.length; i++) { 70 | String tmpWord = wordArraysStr[i].trim(); 71 | if (tmpWord.length()<1) continue; 72 | if (wordMap.containsKey(tmpWord)) { 73 | wordMap.put(tmpWord, wordMap.get(tmpWord)+1); 74 | }else { 75 | wordMap.put(tmpWord,(long) 1); 76 | wordReadNum++; 77 | } 78 | } 79 | if (lineNum%1000 ==0) { 80 | System.out.println("hasProcessed train data numbers:" + lineNum); 81 | } 82 | } 83 | System.out.println("train_labeled_text total lineNum:"+lineNum 84 | +", and wordSet.size():"+wordMap.size()); 85 | train_labeled_text_br.close(); 86 | 87 | //开始筛选low_feq>3的并输出 88 | System.out.println("raw wordMap size:" + wordMap.size()); 89 | Iterator iter = wordMap.entrySet().iterator(); 90 | while(iter.hasNext()){ 91 | Map.Entry entry = (Map.Entry) iter.next(); 92 | String tmpWord = entry.getKey(); 93 | long tmp_low_feq = entry.getValue(); 94 | if (tmp_low_feq < low_feq) { 95 | wordMap.remove(tmpWord); 96 | continue; 97 | } 98 | wordWriteNum++; 99 | wordDictFileW.write(tmpWord+"\t"+wordWriteNum+"\n"); 100 | } 101 | System.out.println("wordReadNum:" + wordReadNum); 102 | System.out.println("wordWriteNum:" + wordWriteNum); 103 | System.out.println("refined wordMap size:" + wordMap.size()); 104 | wordDictFileW.close(); 105 | 106 | //开始输出卡方格式 107 | //tag \t text 108 | lineNum = 0; 109 | train_labeled_text_fr = new FileReader(train_labeled_text); 110 | train_labeled_text_br = new BufferedReader(train_labeled_text_fr); 111 | String tempTag; 112 | String tempText; 113 | String[] tokensStr; 114 | StringBuffer tmpLineBuffer = new StringBuffer(); 115 | while ((tempTag = train_labeled_tag_br.readLine()) != null) { 116 | lineNum++; 117 | tempText = train_labeled_text_br.readLine(); 118 | 119 | tokensStr = tempText.trim().split("\\s+"); 120 | 121 | if (!(tokensStr.length<1)) { 122 | for (int j = 0; j < tokensStr.length; j++) { 123 | String tempToken = tokensStr[j]; 124 | if 
(wordMap.containsKey(tempToken.trim())) { 125 | tmpLineBuffer.append(tempToken + " "); 126 | }else { 127 | System.out.println("error: the map has not contain the word:" 128 | + tempToken+" in Line:" + lineNum); 129 | } 130 | } 131 | taskA_ChiSquareFileW.write(tempTag+"\t"+tmpLineBuffer.toString().trim()+"\n"); 132 | } 133 | 134 | tmpLineBuffer.delete(0, tmpLineBuffer.length()); 135 | if (lineNum%1000 ==0) { 136 | System.out.println("hasProcessed train_unlabeled_text data numbers:" + lineNum); 137 | } 138 | } 139 | taskA_ChiSquareFileW.close(); 140 | System.out.println("taskA_ChiSquareFileW total lineNum:"+lineNum 141 | +", and wordSet.size():"+wordMap.size()); 142 | train_labeled_tag_br.close(); 143 | train_labeled_text_br.close(); 144 | 145 | } 146 | } 147 | } -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step04_02_TaskA_ChiSquare_Rank.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileOutputStream; 7 | import java.io.FileReader; 8 | import java.io.OutputStreamWriter; 9 | import java.util.ArrayList; 10 | import java.util.Collections; 11 | import java.util.Comparator; 12 | import java.util.HashMap; 13 | import java.util.Iterator; 14 | import java.util.List; 15 | import java.util.Map; 16 | 17 | public class Step04_02_TaskA_ChiSquare_Rank { 18 | public static void main(String[] args) throws Exception { 19 | 20 | //读取训练数据文件 21 | String dataPathStr="./../../data/RefineData/Step04_chisquare/"; 22 | String resultsPathStr="./../../data/RefineData/Step04_chisquare/"; 23 | File resultsPathFile = new File(resultsPathStr); 24 | if (!resultsPathFile.exists()) resultsPathFile.mkdir(); 25 | 26 | String train_chi2_score_pre = dataPathStr+"output_chi2_"; 27 | 28 | //主题到文件后缀的映射 29 | ArrayList topic2filesuffix = new ArrayList(); 30 | topic2filesuffix.add("iphonese"); 31 | topic2filesuffix.add("bianpao"); 32 | topic2filesuffix.add("fankong"); 33 | topic2filesuffix.add("ertai"); 34 | topic2filesuffix.add("jinmo"); 35 | int top_words = 500; 36 | 37 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 38 | HashMap chi2_rank_favor_map = new HashMap(); 39 | HashMap chi2_rank_none_map = new HashMap(); 40 | HashMap chi2_rank_against_map = new HashMap(); 41 | 42 | String usrChi2ScoreDict = resultsPathStr + "usrWordDict_chi2_score_" 43 | + topic2filesuffix.get(topic_idx); 44 | BufferedWriter wordChi2ScoreFileW = new BufferedWriter(new OutputStreamWriter( 45 | new FileOutputStream(new File(usrChi2ScoreDict)), "UTF-8")); 46 | 47 | String usrWordDict = resultsPathStr + "usrWordDict_chi2_refined_" 48 | + topic2filesuffix.get(topic_idx); 49 | BufferedWriter wordDictFileW = new BufferedWriter(new OutputStreamWriter( 50 | new FileOutputStream(new File(usrWordDict)), "UTF-8")); 51 | 52 | // String taskA_ChiSquare = resultsPathStr + "TaskA_chi2_refined_" 53 | // + topic2filesuffix.get(topic_idx); 54 | // BufferedWriter taskA_ChiSquareFileW = new BufferedWriter(new OutputStreamWriter( 55 | // new FileOutputStream(new File(taskA_ChiSquare)), "UTF-8")); 56 | 57 | String train_chi2_score_text = 58 | train_chi2_score_pre+topic2filesuffix.get(topic_idx)+".tst"; 59 | FileReader train_chi2_score_fr = new FileReader(train_chi2_score_text); 60 | BufferedReader train_chi2_score_br = new BufferedReader(train_chi2_score_fr); 61 | 62 | //先读入处理好的训练语料 63 
| int wordReadNum = 0; 64 | int wordWriteNum = 0; 65 | 66 | //train_labeled_text 67 | int lineNum = 0; 68 | String tempLine; 69 | while ((tempLine = train_chi2_score_br.readLine()) != null) { 70 | lineNum++; 71 | String[] tokensStr = tempLine.trim().split("\t"); 72 | String tmpTag = tokensStr[0].trim(); 73 | String tmpWord = tokensStr[1].trim(); 74 | Float tmpScore = Float.valueOf(tokensStr[2].trim()); 75 | 76 | if(tmpTag.equals("FAVOR")){ 77 | chi2_rank_favor_map.put(tmpWord, tmpScore); 78 | }else if (tmpTag.equals("AGAINST")) { 79 | chi2_rank_against_map.put(tmpWord, tmpScore); 80 | }else if (tmpTag.equals("NONE")) { 81 | chi2_rank_none_map.put(tmpWord, tmpScore); 82 | }else { 83 | System.out.println("Error: wrong tag is "+tmpTag); 84 | } 85 | } 86 | System.out.println("train_labeled_text total lineNum:"+lineNum 87 | +", and wordSet.size():"+chi2_rank_favor_map.size()); 88 | train_chi2_score_br.close(); 89 | 90 | //开始对chi2进行排序,同时读入词典 91 | HashMap wordMap = new HashMap(); 92 | System.out.println("Start to read wordSet ..."); 93 | //Favor 94 | List> chi2_rank_favor_List = 95 | new ArrayList>(chi2_rank_favor_map.entrySet()); 96 | Collections.sort(chi2_rank_favor_List, new Comparator>() { 97 | @Override 98 | public int compare(Map.Entry o1, Map.Entry o2) { 99 | return (o2.getValue().compareTo(o1.getValue())); 100 | //return (o1.getKey()).toString().compareTo(o2.getKey()); 101 | } 102 | }); 103 | if (top_words>chi2_rank_favor_List.size()) 104 | top_words = chi2_rank_favor_List.size(); 105 | for (int i = 0; i < chi2_rank_favor_List.size(); i++) { 106 | String tmpWord = chi2_rank_favor_List.get(i).getKey(); 107 | float tmpChi2 = chi2_rank_favor_List.get(i).getValue(); 108 | wordChi2ScoreFileW.write(tmpWord+"\t"+tmpChi2+"\n"); 109 | if (i < top_words){ 110 | wordReadNum++; 111 | wordMap.put(tmpWord, tmpChi2); 112 | } 113 | } 114 | //Against 115 | List> chi2_rank_against_List = 116 | new ArrayList>(chi2_rank_against_map.entrySet()); 117 | Collections.sort(chi2_rank_against_List, new Comparator>() { 118 | @Override 119 | public int compare(Map.Entry o1, Map.Entry o2) { 120 | return (o2.getValue().compareTo(o1.getValue())); 121 | //return (o1.getKey()).toString().compareTo(o2.getKey()); 122 | } 123 | }); 124 | for (int i = 0; i < chi2_rank_against_List.size(); i++) { 125 | String tmpWord = chi2_rank_against_List.get(i).getKey(); 126 | float tmpChi2 = chi2_rank_against_List.get(i).getValue(); 127 | wordChi2ScoreFileW.write(tmpWord+"\t"+tmpChi2+"\n"); 128 | if (i < top_words){ 129 | wordReadNum++; 130 | wordMap.put(tmpWord, tmpChi2); 131 | } 132 | } 133 | //None 134 | List> chi2_rank_none_List = 135 | new ArrayList>(chi2_rank_none_map.entrySet()); 136 | Collections.sort(chi2_rank_none_List, new Comparator>() { 137 | @Override 138 | public int compare(Map.Entry o1, Map.Entry o2) { 139 | return (o2.getValue().compareTo(o1.getValue())); 140 | //return (o1.getKey()).toString().compareTo(o2.getKey()); 141 | } 142 | }); 143 | for (int i = 0; i < chi2_rank_none_List.size(); i++) { 144 | String tmpWord = chi2_rank_none_List.get(i).getKey(); 145 | float tmpChi2 = chi2_rank_none_List.get(i).getValue(); 146 | wordChi2ScoreFileW.write(tmpWord+"\t"+tmpChi2+"\n"); 147 | if (i < top_words){ 148 | wordReadNum++; 149 | wordMap.put(tmpWord, tmpChi2); 150 | } 151 | } 152 | 153 | //开始输出基于Chi2选择后的词典 154 | System.out.println("raw wordMap size:" + wordMap.size()); 155 | Iterator iter = wordMap.entrySet().iterator(); 156 | while(iter.hasNext()){ 157 | Map.Entry entry = (Map.Entry) iter.next(); 158 | String tmpWord = 
entry.getKey(); 159 | 160 | wordWriteNum++; 161 | wordDictFileW.write(tmpWord+"\t"+wordWriteNum+"\n"); 162 | } 163 | System.out.println("wordReadNum:" + wordReadNum); 164 | System.out.println("wordWriteNum:" + wordWriteNum); 165 | System.out.println("refined wordMap size:" + wordMap.size()); 166 | wordChi2ScoreFileW.close(); 167 | wordDictFileW.close(); 168 | } 169 | } 170 | } -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step04_03_TaskA_ChiSquare_VSM.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileInputStream; 7 | import java.io.FileOutputStream; 8 | import java.io.IOException; 9 | import java.io.InputStreamReader; 10 | import java.io.OutputStreamWriter; 11 | import java.util.ArrayList; 12 | import java.util.HashMap; 13 | 14 | public class Step04_03_TaskA_ChiSquare_VSM { 15 | public static void main(String[] args) throws Exception { 16 | //Test 17 | // String tempLine ="宸濄伄娴併倢銇倛銇?28521"; 18 | // String[] tokensStr = tempLine.split("\t"); 19 | //利用纯文本和wordmap构建基于词频的向量空间模型,用于STH预处理 20 | // all = [test_data;train_data]! 21 | String dataPathStr = "./../../data/RefineData/Step01_topics/"; 22 | String resultsPathStr = "./../../data/RefineData/Step04_chisquare/"; 23 | 24 | String train_labeled_text_pre = dataPathStr+"TaskA_train_labeled_text"; 25 | String train_unlabeled_text_pre = dataPathStr+"TaskA_train_unlabeled_text"; 26 | String test_unlabeled_text_pre = dataPathStr+"TaskA_test_unlabeled_text"; 27 | 28 | //主题到文件后缀的映射 29 | ArrayList topic2filesuffix = new ArrayList(); 30 | topic2filesuffix.add("iphonese"); 31 | topic2filesuffix.add("bianpao"); 32 | topic2filesuffix.add("fankong"); 33 | topic2filesuffix.add("ertai"); 34 | topic2filesuffix.add("jinmo"); 35 | 36 | boolean hasTestData = true; 37 | 38 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 39 | String wordMapStr = resultsPathStr + "usrWordDict_chi2_refined_" + topic2filesuffix.get(topic_idx); 40 | 41 | String vsmBW_train_labeledStr = resultsPathStr + "/vsm_train_labeled_chi2_" 42 | + topic2filesuffix.get(topic_idx); 43 | String vsmBW_train_unlabeledStr = resultsPathStr+"/vsm_train_unlabeled_chi2_" 44 | + topic2filesuffix.get(topic_idx); 45 | String vsmBW_test_unlabeledStr = resultsPathStr+"/vsm_test_unlabeled_chi2_" 46 | + topic2filesuffix.get(topic_idx); 47 | BufferedReader train_labeled_File = new BufferedReader(new InputStreamReader( 48 | new FileInputStream(new File(train_labeled_text_pre 49 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 50 | BufferedReader train_unlabeled_File = new BufferedReader(new InputStreamReader( 51 | new FileInputStream(new File(train_unlabeled_text_pre 52 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 53 | 54 | //构造训练VSM词频向量空间模型 55 | BufferedReader wordMapRD = new BufferedReader( 56 | new InputStreamReader(new FileInputStream(new File(wordMapStr)), "UTF-8")); 57 | creatVSMText(train_labeled_File, wordMapRD, vsmBW_train_labeledStr); 58 | wordMapRD.close(); 59 | train_labeled_File.close(); 60 | 61 | wordMapRD = new BufferedReader( 62 | new InputStreamReader(new FileInputStream(new File(wordMapStr)), "UTF-8")); 63 | creatVSMText(train_unlabeled_File, wordMapRD, vsmBW_train_unlabeledStr); 64 | wordMapRD.close(); 65 | train_unlabeled_File.close(); 66 | 67 | if(hasTestData){ 68 | BufferedReader 
test_unlabeled_File = new BufferedReader(new InputStreamReader( 69 | new FileInputStream(new File(test_unlabeled_text_pre 70 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 71 | 72 | wordMapRD = new BufferedReader( 73 | new InputStreamReader(new FileInputStream(new File(wordMapStr)), "UTF-8")); 74 | creatVSMText(test_unlabeled_File, wordMapRD, vsmBW_test_unlabeledStr); 75 | wordMapRD.close(); 76 | test_unlabeled_File.close(); 77 | } 78 | } 79 | 80 | System.out.println("It is done, ok!"); 81 | } 82 | 83 | public static void creatVSMText(BufferedReader sourceTextRD, 84 | BufferedReader wordMapRD, String vsmBW_Str) throws IOException, Exception { 85 | System.out.println("Start to create VSM ...!"); 86 | String tempLine; 87 | //Load the dictionary first 88 | int wordIdxNum = 1; 89 | HashMap<String, Integer> wordMap = new HashMap<String, Integer>(); 90 | 91 | while ((tempLine = wordMapRD.readLine()) != null) { 92 | //The dictionary holds each word together with its index number 93 | if (wordMap.containsKey(tempLine.trim())) { 94 | System.out.println("Test, the word is duplicated:"+tempLine.trim() 95 | +", in wordIdxNum:"+wordIdxNum); 96 | } 97 | if (tempLine.trim().length()==0) continue; 98 | //wordMap.put(tempLine.trim(), wordIdxNum); 99 | wordMap.put(tempLine.split("\\s+")[0].trim(), Integer.valueOf(tempLine.split("\\s+")[1])); 100 | wordIdxNum = wordIdxNum+1; 101 | } 102 | //Feature dimensionality of this dataset 103 | int dimVector = wordIdxNum-1; 104 | System.out.println("Has read the dictionary, the size is:"+wordMap.size()); 105 | ArrayList<Integer> wordFreqList = new ArrayList<Integer>(); 106 | int lineNum = 0; 107 | boolean hasWordFeature = false; 108 | int hasNoWordFeature = 0; 109 | StringBuffer tmpVSMBuffer = new StringBuffer(); 110 | while ((tempLine = sourceTextRD.readLine()) != null) { 111 | hasWordFeature = false; 112 | //Read one line, i.e. one document 113 | wordFreqList.clear(); 114 | for (int i = 0; i < dimVector; i++) { 115 | wordFreqList.add(0); 116 | } 117 | 118 | String[] tokensStr; 119 | boolean isvalid = true; 120 | 121 | tokensStr = tempLine.trim().split("\\s+"); 122 | 123 | if (!(tokensStr.length<1)) { 124 | for (int j = 0; j < tokensStr.length; j++) { 125 | String tempToken = tokensStr[j]; 126 | if (wordMap.containsKey(tempToken.trim())) { 127 | hasWordFeature = true; 128 | int index = wordMap.get(tempToken.trim()); 129 | if (index>dimVector) { 130 | System.out.print("Error, and the word is: "+tempToken.trim()); 131 | } 132 | wordFreqList.set(index-1, wordFreqList.get(index-1)+1); 133 | } 134 | // else { 135 | // System.out.println("error: the map has not contain the word:" 136 | // +tempToken+" in Line:"+lineNum); 137 | // } 138 | } 139 | }else { 140 | isvalid = false; 141 | } 142 | 143 | if (!isvalid) { 144 | System.out.println("warning: the string lacks content:" 145 | +tempLine.trim()+" in Line:"+lineNum); 146 | } 147 | for (int tempFreq:wordFreqList) { 148 | tmpVSMBuffer.append(String.valueOf(tempFreq)+" "); 149 | } 150 | //Write the processed text into a new text file 151 | Result2Txt(vsmBW_Str,tmpVSMBuffer.toString().trim()); 152 | tmpVSMBuffer.delete(0, tmpVSMBuffer.length()); 153 | 154 | if (!hasWordFeature) { 155 | hasNoWordFeature++; 156 | } 157 | lineNum++; 158 | if (lineNum%1000 ==0) { 159 | System.out.println("hasProcessed text numbers:" + lineNum); 160 | } 161 | } 162 | System.out.println("Total number is:"+lineNum+" and no word lines is:"+ hasNoWordFeature); 163 | } 164 | public static void Result2Txt(String file, String txt) { 165 | try { 166 | BufferedWriter os = new BufferedWriter(new OutputStreamWriter( 167 | new FileOutputStream(new File(file),true), "UTF-8")); 168 | os.write(txt + "\n"); 169 | os.close(); 170 | } catch 
(Exception e) { 171 | e.printStackTrace(); 172 | } 173 | } 174 | } 175 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/DataPreprocess/Step05_01_TaskA_Opinion.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileInputStream; 7 | import java.io.FileNotFoundException; 8 | import java.io.FileOutputStream; 9 | import java.io.IOException; 10 | import java.io.InputStreamReader; 11 | import java.io.OutputStreamWriter; 12 | import java.util.ArrayList; 13 | import java.util.HashMap; 14 | import java.util.HashSet; 15 | 16 | public class Step05_01_TaskA_Opinion { 17 | public static void main(String[] args) throws Exception { 18 | //Test 19 | // String tempLine ="宸濄伄娴併倢銇倛銇?28521"; 20 | // String[] tokensStr = tempLine.split("\t"); 21 | //Build opinion and sentiment lexicon features from the plain segmented text 22 | // all = [test_data;train_data]! 23 | String dataPathStr = "./../../data/RefineData/Step01_topics/"; 24 | String corpusPathStr = "./../../corpus/"; 25 | String resultsPathStr = "./../../data/RefineData/Step05_opinion/"; 26 | 27 | String train_labeled_text_pre = dataPathStr+"TaskA_train_labeled_text"; 28 | String test_unlabeled_text_pre = dataPathStr+"TaskA_test_unlabeled_text"; 29 | 30 | //Mapping from topics to file suffixes 31 | ArrayList<String> topic2filesuffix = new ArrayList<String>(); 32 | topic2filesuffix.add("iphonese"); 33 | topic2filesuffix.add("bianpao"); 34 | topic2filesuffix.add("fankong"); 35 | topic2filesuffix.add("ertai"); 36 | topic2filesuffix.add("jinmo"); 37 | 38 | boolean hasTestData = true; 39 | HashSet<String> opinion_pos_set = new HashSet<String>(); 40 | HashSet<String> opinion_neg_set = new HashSet<String>(); 41 | String opinion_pos_str = corpusPathStr + "Hownet/pos_opinion.txt"; 42 | String opinion_neg_str = corpusPathStr + "Hownet/neg_opinion.txt"; 43 | loadWordSet(opinion_pos_str, opinion_pos_set); 44 | loadWordSet(opinion_neg_str, opinion_neg_set); 45 | 46 | HashSet<String> sentiment_pos_set = new HashSet<String>(); 47 | HashSet<String> sentiment_neg_set = new HashSet<String>(); 48 | String sentiment_pos_str = corpusPathStr + "Tsinghua/tsinghua.positive.gb.txt"; 49 | String sentiment_neg_str = corpusPathStr + "Tsinghua/tsinghua.negative.gb.txt"; 50 | loadWordSet(sentiment_pos_str, sentiment_pos_set); 51 | loadWordSet(sentiment_neg_str, sentiment_neg_set); 52 | 53 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 54 | 55 | String opinion_train_labeledStr = resultsPathStr + "/opinion_train_labeled_" 56 | + topic2filesuffix.get(topic_idx); 57 | String opinion_test_unlabeledStr = resultsPathStr+"/opinion_test_unlabeled_" 58 | + topic2filesuffix.get(topic_idx); 59 | BufferedReader train_labeled_File = new BufferedReader(new InputStreamReader( 60 | new FileInputStream(new File(train_labeled_text_pre 61 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 62 | 63 | //Build the opinion features for the labeled training data 64 | creatOpinoinFea(train_labeled_File, opinion_train_labeledStr, 65 | opinion_pos_set, opinion_neg_set, sentiment_pos_set, sentiment_neg_set); 66 | train_labeled_File.close(); 67 | 68 | 69 | if(hasTestData){ 70 | BufferedReader test_unlabeled_File = new BufferedReader(new InputStreamReader( 71 | new FileInputStream(new File(test_unlabeled_text_pre 72 | +"_"+topic2filesuffix.get(topic_idx))), "UTF-8")); 73 | 74 | creatOpinoinFea(test_unlabeled_File, opinion_test_unlabeledStr, 75 | opinion_pos_set, opinion_neg_set, sentiment_pos_set, sentiment_neg_set); 76 | 
test_unlabeled_File.close(); 77 | } 78 | } 79 | 80 | System.out.println("It is done, ok!"); 81 | } 82 | 83 | private static void loadWordSet(String wordFileStr, HashSet<String> wordSet) throws IOException { 84 | File wordFile = new File(wordFileStr); 85 | FileInputStream fis = null; 86 | try { 87 | fis = new FileInputStream(wordFile); 88 | } catch (FileNotFoundException fnfe) { 89 | System.out.println("not found wordFileStr:"+wordFileStr); 90 | } 91 | try { 92 | BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-8")); 93 | String tempString = null; 94 | while ((tempString = reader.readLine()) != null) { 95 | wordSet.add(tempString.trim()); 96 | } 97 | System.out.println("load word dictionary... done"); 98 | reader.close(); 99 | } finally { 100 | fis.close(); 101 | } 102 | } 103 | 104 | public static void creatOpinoinFea(BufferedReader file_br, String opinion_str 105 | , HashSet<String> opinion_pos_set, HashSet<String> opinion_neg_set 106 | , HashSet<String> sentiment_pos_set, HashSet<String> sentiment_neg_set 107 | ) throws IOException, Exception { 108 | System.out.println("Start to create opinion feature ...!"); 109 | String tempLine; 110 | //Counters and smoothed scores for the opinion/sentiment features 111 | double opinion_pos_num = 0.0; 112 | double opinion_neg_num = 0.0; 113 | double sentiment_pos_num = 0.0; 114 | double sentiment_neg_num = 0.0; 115 | 116 | double opinion_pos_score = 0.0; 117 | double opinion_neg_score = 0.0; 118 | double sentiment_pos_score = 0.0; 119 | double sentiment_neg_score = 0.0; 120 | int lineNum = 0; 121 | while ((tempLine = file_br.readLine()) != null) { 122 | opinion_pos_num = 0; 123 | opinion_neg_num = 0; 124 | sentiment_pos_num = 0; 125 | sentiment_neg_num = 0; 126 | String[] tokensStr; 127 | tokensStr = tempLine.trim().split("\\s+"); 128 | 129 | for (int j = 0; j < tokensStr.length; j++) { 130 | String tempToken = tokensStr[j]; 131 | if (opinion_pos_set.contains(tempToken.trim())) { 132 | opinion_pos_num+=1.0; 133 | }else if (opinion_neg_set.contains(tempToken.trim())) { 134 | opinion_neg_num+=1.0; 135 | }else if (sentiment_pos_set.contains(tempToken.trim())) { 136 | sentiment_pos_num+=1.0; 137 | }else if (sentiment_neg_set.contains(tempToken.trim())) { 138 | sentiment_neg_num+=1.0; 139 | } 140 | } 141 | opinion_neg_num = Math.pow(opinion_neg_num, 1.3); //amplify negative counts with an empirical exponent 142 | sentiment_neg_num = Math.pow(sentiment_neg_num, 1.3); 143 | opinion_pos_score = 144 | (opinion_pos_num+1.0)/(opinion_pos_num+opinion_neg_num+2.0); //add-one (Laplace) smoothing 145 | opinion_neg_score = 146 | (opinion_neg_num+1.0)/(opinion_pos_num+opinion_neg_num+2.0); 147 | sentiment_pos_score = 148 | (sentiment_pos_num+1.0)/(sentiment_pos_num+sentiment_neg_num+2.0); 149 | sentiment_neg_score = 150 | (sentiment_neg_num+1.0)/(sentiment_pos_num+sentiment_neg_num+2.0); 151 | 152 | //Write the computed feature scores into a new text file 153 | Result2Txt(opinion_str,String.valueOf(opinion_pos_score)+" "+String.valueOf(opinion_neg_score)+" " 154 | + String.valueOf(sentiment_pos_score)+" "+String.valueOf(sentiment_neg_score)); 155 | lineNum++; 156 | if (lineNum%1000 ==0) { 157 | System.out.println("hasProcessed text numbers:" + lineNum); 158 | } 159 | } 160 | } 161 | public static void Result2Txt(String file, String txt) { 162 | try { 163 | BufferedWriter os = new BufferedWriter(new OutputStreamWriter( 164 | new FileOutputStream(new File(file),true), "UTF-8")); 165 | os.write(txt + "\n"); 166 | os.close(); 167 | } catch (Exception e) { 168 | e.printStackTrace(); 169 | } 170 | } 171 | } 172 | --------------------------------------------------------------------------------
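Note: the following is a minimal, self-contained sketch of the smoothed scoring that creatOpinoinFea above applies per document, on a toy token list. OpinionScoreDemo and the two-word lexicons are hypothetical illustrations, not files of this repository; only the 1.3 exponent on negative counts and the add-one smoothing come from Step05_01_TaskA_Opinion.java.

    import java.util.Arrays;
    import java.util.HashSet;

    // Hypothetical demo class (not part of this repository).
    public class OpinionScoreDemo {
        public static void main(String[] args) {
            HashSet<String> posSet = new HashSet<String>(Arrays.asList("支持", "赞"));
            HashSet<String> negSet = new HashSet<String>(Arrays.asList("反对", "差"));
            String[] tokens = {"我", "支持", "这", "个", "政策", "反对", "无效"};

            double pos = 0.0, neg = 0.0;
            for (String t : tokens) {
                if (posSet.contains(t)) pos += 1.0;        // one hit: 支持
                else if (negSet.contains(t)) neg += 1.0;   // one hit: 反对
            }
            neg = Math.pow(neg, 1.3);                      // amplify negative evidence, as in Step05
            double posScore = (pos + 1.0) / (pos + neg + 2.0);  // add-one (Laplace) smoothing
            double negScore = (neg + 1.0) / (pos + neg + 2.0);
            System.out.println(posScore + " " + negScore); // -> 0.5 0.5 for one hit each
        }
    }

With one positive and one negative hit the scores tie at 0.5; as negative hits accumulate, the 1.3 exponent pushes the pair toward the negative side faster than symmetric counting would, which appears to be the intent of that constant.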
/code/Feature_process_java/src/DataPreprocess/Step09_01_ResultSubmit.java: -------------------------------------------------------------------------------- 1 | package DataPreprocess; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileInputStream; 7 | import java.io.FileOutputStream; 8 | import java.io.InputStreamReader; 9 | import java.io.OutputStreamWriter; 10 | import java.util.ArrayList; 11 | import java.util.HashMap; 12 | import java.util.Iterator; 13 | import java.util.Map; 14 | 15 | public class Step09_01_ResultSubmit { 16 | /** 17 | * @param args 18 | * @author jacoxu.com-2016/06/23 19 | */ 20 | //predictResult_pre, testFile_str, submitFile_str, topic2filesuffix 21 | private static void dataAnalysis(String predictResult_pre, String testFile_str 22 | , String submitFile_str, ArrayList<String> topic2filesuffix) { 23 | 24 | try { 25 | String encoding = "UTF-8"; 26 | File testFile = new File(testFile_str); 27 | BufferedReader testFileReader = new BufferedReader(new InputStreamReader( 28 | new FileInputStream(testFile), encoding)); 29 | String tmpPredictLineStr; 30 | int readPredictLineNum = 0; 31 | String tmpTestLineStr; 32 | int readTestLineNum = 0; 33 | //First copy the header line of the test file into the submission file 34 | tmpTestLineStr = testFileReader.readLine(); 35 | if (tmpTestLineStr.split("\t").length != 4) { 36 | System.err.println("Error, in the first line of text file:" + tmpTestLineStr); 37 | System.exit(0); 38 | } 39 | Result2Txt(submitFile_str, tmpTestLineStr); 40 | //Write the results, topic by topic 41 | String tmpPredictLabel = "UNKNOWN"; 42 | String tmpNewTestStr; 43 | for (int topic_idx = 0; topic_idx < topic2filesuffix.size(); topic_idx++) { 44 | System.out.println("Current topic is:"+topic2filesuffix.get(topic_idx)); 45 | String predictResult_str = predictResult_pre + topic2filesuffix.get(topic_idx); 46 | File predictResultFile = new File(predictResult_str); 47 | BufferedReader predictResultReader = new BufferedReader(new InputStreamReader( 48 | new FileInputStream(predictResultFile), encoding)); 49 | while ((tmpPredictLineStr = predictResultReader.readLine()) != null) { 50 | readPredictLineNum++; 51 | float predictScore = Float.valueOf(tmpPredictLineStr); //scores encode the labels: 0=NONE, 1=FAVOR, 2=AGAINST 52 | if (predictScore < -0.2) 53 | System.err.println("Error, wrong predict score:" 54 | + String.valueOf(predictScore)); 55 | else if (predictScore < 0.2) 56 | tmpPredictLabel = "NONE"; 57 | else if (predictScore < 1.2) 58 | tmpPredictLabel = "FAVOR"; 59 | else if (predictScore < 2.2) 60 | tmpPredictLabel = "AGAINST"; 61 | else { 62 | System.err.println("Error, wrong predict score:" 63 | + String.valueOf(predictScore)); 64 | } 65 | tmpTestLineStr = testFileReader.readLine(); 66 | String[] tmpTestTokens = tmpTestLineStr.split("\t"); 67 | if (tmpTestTokens.length != 4) { 68 | // System.err.println("Error, in the first line of text file:" + tmpTestLineStr); 69 | // System.exit(0); 70 | System.out.println(tmpTestTokens[0]+"\t"+tmpTestTokens[tmpTestTokens.length-1]); //print the first and last fields; a fixed index here could run out of bounds 71 | } 72 | readTestLineNum++; 73 | tmpNewTestStr = tmpTestTokens[0]+"\t"+tmpTestTokens[1]+"\t"+tmpTestTokens[2]+"\t"+tmpPredictLabel; 74 | Result2Txt(submitFile_str, tmpNewTestStr); 75 | } 76 | predictResultReader.close(); 77 | } 78 | testFileReader.close(); 79 | if (readPredictLineNum != readTestLineNum) { 80 | System.err.println("Error, readPredictLineNum:" + readPredictLineNum 81 | + ", readTestLineNum:"+ readTestLineNum); 82 | System.exit(0); 83 | } 84 | 85 | } catch (Exception e) { 86 | System.out.println("Something went wrong while reading the content of the file"); 87 | 
e.printStackTrace(); 88 | } 89 | return; 90 | 91 | } 92 | 93 | public static void Result2Txt(String file, String txt) { 94 | try { 95 | BufferedWriter os = new BufferedWriter(new OutputStreamWriter( 96 | new FileOutputStream(new File(file),true), "UTF-8")); 97 | os.write(txt + "\n"); 98 | os.close(); 99 | } catch (Exception e) { 100 | e.printStackTrace(); 101 | } 102 | } 103 | 104 | public static void main(String[] args) { 105 | //*********** Test area ************ 106 | System.out.println("test"); 107 | 108 | //*********** Test area ************ 109 | 110 | String dataPathStr="./../../data/"; 111 | String resultsPathStr="./../../data/RefineData/Results/"; 112 | File resultsPathFile = new File(resultsPathStr); 113 | if (!resultsPathFile.exists()) resultsPathFile.mkdir(); 114 | 115 | //Generate the submission file for the test set 116 | String predictResult_pre=resultsPathStr+"predict_";//resultsPathStr+topic; 117 | //TODO: this is the file name of the test set 118 | String testFile_str = dataPathStr+"NLPCC2016_Stance_Detection_Task_A_Testdata.txt"; 119 | String submitFile_str = dataPathStr +"TaskA_prediction_submission.txt"; 120 | ArrayList<String> topic2filesuffix = new ArrayList<String>(); 121 | topic2filesuffix.add("bianpao"); 122 | topic2filesuffix.add("iphonese"); 123 | topic2filesuffix.add("fankong"); 124 | topic2filesuffix.add("ertai"); 125 | topic2filesuffix.add("jinmo"); 126 | 127 | long readstart=System.currentTimeMillis(); 128 | dataAnalysis(predictResult_pre, testFile_str, submitFile_str, topic2filesuffix); 129 | long readend=System.currentTimeMillis(); 130 | System.out.println((readend-readstart)/1000.0+"s were spent processing the submission."); 131 | } 132 | 133 | } 134 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/Tools/CharacterAnalyzer.java: -------------------------------------------------------------------------------- 1 | package Tools; 2 | 3 | /** 4 | * @author cBrain 5 | * 6 | */ 7 | public class CharacterAnalyzer { 8 | // Java Unicode character blocks 9 | // http://doc.java.sun.com/DocWeb/api/java.lang.Character.UnicodeBlock 10 | // http://hi.baidu.com/qing419925094/item/bb32a7b915273871244b09c9 11 | public static final boolean isChinese(char c) { 12 | return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS; 13 | } 14 | 15 | public static final boolean isNumber(char c) 16 | { 17 | return (c<='9')&&(c>='0'); 18 | } 19 | 20 | public static final boolean isEnglish(char c) 21 | { 22 | return ((c<='z')&&(c>='a'))||((c<='Z')&&(c>='A')); 23 | } 24 | 25 | public static final boolean isWhiteSpace(char c) 26 | { 27 | return c == ' '; 28 | } 29 | 30 | public static final boolean isRegularSymbol(char c) 31 | { 32 | return (c=='*')||(c=='?'); 33 | } 34 | 35 | //Check whether the character is Chinese/English/digit/common punctuation; filter out all other characters 36 | public static final boolean isGoodCharacter(char c) 37 | { 38 | if(filterUnicode(c)) 39 | return false; 40 | if(isChinese(c)) 41 | return true; 42 | if(isWhiteSpace(c)) 43 | return true; 44 | // if (isSpecialSymbol(c)) //filter the special symbol 45 | // return false; 46 | if(isNumber(c)) 47 | return true; 48 | if(isEnglish(c)) 49 | return true; 50 | // if(isSolrKeywords(c)) //filter the solr keywords 51 | // return false; 52 | // if(isBiaoDian(c)) 53 | // return true; 54 | // if(isRegularSymbol(c)) 55 | // return true; 56 | // if(isOtherBiaoDian(c)) 57 | // return true; 58 | return false; 59 | } 60 | 61 | private static boolean filterUnicode(char c) { 62 | //Method 1: use UnicodeBlock directly 63 | Character.UnicodeBlock ub = Character.UnicodeBlock.of(c); 64 | if (ub == 
Character.UnicodeBlock.GENERAL_PUNCTUATION 65 | || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION) { 66 | return true; 67 | } 68 | //Method 2: strip characters by Unicode code-point ranges 69 | //ASCII punctuation 70 | if((c>=0x0020)&&(c<=0x002F)||(c>=0x003A)&&(c<=0x0040)||(c>=0x005B)&&(c<=0x0060)||(c>=0x007B)&&(c<=0X007E)) 71 | return true; 72 | //Latin-1 Supplement punctuation 73 | if((c>=0x00A0)&&(c<=0x00BF)) 74 | return true; 75 | //General punctuation 76 | if((c>=0x2000)&&(c<=0x206F)) 77 | return true; 78 | //Supplemental punctuation 79 | if((c>=0x2E00)&&(c<=0x2E7F)) 80 | return true; 81 | //CJK symbols and punctuation 82 | if((c>=0x3000)&&(c<=0x303F)) 83 | return true; 84 | //Fullwidth ASCII punctuation 85 | if((c>=0xFF01)&&(c<=0xFF0F)||(c>=0xFF1A)&&(c<=0xFF20)||(c>=0xFF3B)&&(c<=0xFF40)||(c>=0xFF5B)&&(c<=0xFF5E)) 86 | return true; 87 | //Vertical-form punctuation 88 | if((c>=0xFE10)&&(c<=0xFE1F)) 89 | return true; 90 | 91 | //Method 3: filter via a dictionary to stay consistent with the front end 92 | // if (SmsBase.getSymbolSet().contains((int)c)) return true; 93 | 94 | return false; 95 | } 96 | 97 | private static boolean isSpecialSymbol(char c) { 98 | //"`~!@#$%^&*()-_=+[]{}|;:',<.>/?\ 99 | // if(c=='"'||c=='`'||c=='~'||c=='!'||c=='@'||c=='#'||c=='$'|| 100 | // c=='%'||c=='^'||c=='&'||c=='*'||c=='('||c==')'||c=='-'|| 101 | // c=='_'||c=='='||c=='+'||c=='['||c==']'||c=='{'||c=='}'|| 102 | // c=='|'||c==';'||c==':'||c=='\''||c==','||c=='<'||c=='.'|| 103 | // c=='>'||c=='/'||c=='?'||c=='\\') 104 | // return true; 105 | 106 | //ASCII punctuation 107 | if((c>=0x0020)&&(c<=0x002F)||(c>=0x003A)&&(c<=0x0040)||(c>=0x005B)&&(c<=0x0060)||(c>=0x007B)&&(c<=0X007E)) 108 | return true; 109 | //Fullwidth ASCII punctuation 110 | if((c>=0xFF01)&&(c<=0xFF0F)||(c>=0xFF1A)&&(c<=0xFF20)||(c>=0xFF3B)&&(c<=0xFF40)||(c>=0xFF5B)&&(c<=0xFF5E)) 111 | return true; 112 | return false; 113 | } 114 | 115 | private static boolean isSolrKeywords(char c) { 116 | if(c==':'||c=='?'||c=='*'||c=='"'||c=='('||c==')') 117 | return true; 118 | return false; 119 | } 120 | 121 | public static final boolean isOtherBiaoDian(char c) 122 | { 123 | return (c=='\"')||(c=='$')||(c==':')||(c=='|')||(c==','); 124 | } 125 | 126 | public static final boolean isBiaoDian(char c) 127 | { 128 | Character.UnicodeBlock ub = Character.UnicodeBlock.of(c); 129 | if (ub == Character.UnicodeBlock.GENERAL_PUNCTUATION 130 | || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION) { 131 | return true; 132 | } 133 | return false; 134 | } 135 | /** 136 | * @param c 137 | * @return 138 | */ 139 | public static final boolean isSymbol(char c) 140 | { 141 | // Character.UnicodeBlock ub = Character.UnicodeBlock.of(c); 142 | // if (ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS 143 | // || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS 144 | // || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A 145 | // || ub == Character.UnicodeBlock.GENERAL_PUNCTUATION 146 | // || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION 147 | // || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) { 148 | // return true; 149 | // } 150 | // return false; 151 | if(c=='-'||c=='/'||c=='_'||c==':') 152 | return true; 153 | return false; 154 | } 155 | public static void main (String[] args){ 156 | // String ss = "、!@#$%^&*北京()!@¥%……&*()明天!@#¥%……&×"; 157 | String ss ="℃$¤¢£‰§№☆★○●◎◇◆□■△▲※→←↑↓〓ⅰⅱ我ⅲⅳⅴⅵⅶ北 京ⅷⅸⅹ⒈⒉,你呢?⒊⒋⒌⒍⒎⒏"; 158 | StringBuffer str = new StringBuffer(); 159 | char test1 = 0x25; 160 | System.out.println(test1); 161 | char[] ch = ss.toCharArray(); 162 | for (int i = 0; i < ch.length; i++) { 163 | int test = ch[i]; 164 | System.out.println(ch[i]+" - 0x"+Integer.toHexString(test)); 165 | Character.UnicodeBlock ub = 
Character.UnicodeBlock.of(ch[i]); 166 | if (!isGoodCharacter(ch[i])) 167 | continue; 168 | str.append(ch[i]); 169 | } 170 | System.out.println(ss); 171 | System.out.println(str.toString()); 172 | } 173 | 174 | private static boolean filterSympoTest(char c) { 175 | //Method 1: use UnicodeBlock directly 176 | // Character.UnicodeBlock ub = Character.UnicodeBlock.of(c); 177 | // if (ub == Character.UnicodeBlock.GENERAL_PUNCTUATION 178 | // || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION) { 179 | // return true; 180 | // } 181 | //Method 2: strip characters by Unicode code-point ranges 182 | if (isWhiteSpace(c)) return false; 183 | //ASCII punctuation 184 | if((c>=0x0020)&&(c<=0x002F)||(c>=0x003A)&&(c<=0x0040)||(c>=0x005B)&&(c<=0x0060)||(c>=0x007B)&&(c<=0X007E)) 185 | return true; 186 | //Latin-1 Supplement punctuation 187 | if((c>=0x00A0)&&(c<=0x00BF)) 188 | return true; 189 | //General punctuation 190 | if((c>=0x2000)&&(c<=0x206F)) 191 | return true; 192 | //Supplemental punctuation 193 | if((c>=0x2E00)&&(c<=0x2E7F)) 194 | return true; 195 | //CJK symbols and punctuation 196 | if((c>=0x3000)&&(c<=0x303F)) 197 | return true; 198 | //Fullwidth ASCII punctuation 199 | if((c>=0xFF01)&&(c<=0xFF0F)||(c>=0xFF1A)&&(c<=0xFF20)||(c>=0xFF3B)&&(c<=0xFF40)||(c>=0xFF5B)&&(c<=0xFF5E)) 200 | return true; 201 | //Vertical-form punctuation 202 | if((c>=0xFE10)&&(c<=0xFE1F)) 203 | return true; 204 | 205 | return false; 206 | } 207 | } -------------------------------------------------------------------------------- /code/Feature_process_java/src/Tools/ConvertUnicode.java: -------------------------------------------------------------------------------- 1 | package Tools; 2 | 3 | public class ConvertUnicode { 4 | public static String convertUnicode(String ori){ 5 | char aChar; 6 | int len = ori.length(); 7 | StringBuffer outBuffer = new StringBuffer(len); 8 | for (int x = 0; x < len;) { 9 | aChar = ori.charAt(x++); 10 | if (aChar == '\\') { 11 | aChar = ori.charAt(x++); 12 | if (aChar == 'u') { 13 | // Read the xxxx 14 | int value = 0; 15 | for (int i = 0; i < 4; i++) { 16 | aChar = ori.charAt(x++); 17 | switch (aChar) { 18 | case '0': 19 | case '1': 20 | case '2': 21 | case '3': 22 | case '4': 23 | case '5': 24 | case '6': 25 | case '7': 26 | case '8': 27 | case '9': 28 | value = (value << 4) + aChar - '0'; 29 | break; 30 | case 'a': 31 | case 'b': 32 | case 'c': 33 | case 'd': 34 | case 'e': 35 | case 'f': 36 | value = (value << 4) + 10 + aChar - 'a'; 37 | break; 38 | case 'A': 39 | case 'B': 40 | case 'C': 41 | case 'D': 42 | case 'E': 43 | case 'F': 44 | value = (value << 4) + 10 + aChar - 'A'; 45 | break; 46 | default: 47 | throw new IllegalArgumentException( 48 | "Malformed \\uxxxx encoding."); 49 | } 50 | } 51 | outBuffer.append((char) value); 52 | } else { 53 | if (aChar == 't') 54 | aChar = '\t'; 55 | else if (aChar == 'r') 56 | aChar = '\r'; 57 | else if (aChar == 'n') 58 | aChar = '\n'; 59 | else if (aChar == 'f') 60 | aChar = '\f'; 61 | outBuffer.append(aChar); 62 | } 63 | } else 64 | outBuffer.append(aChar); 65 | 66 | } 67 | return outBuffer.toString(); 68 | } 69 | } 70 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/Tools/PreProcessText.java: -------------------------------------------------------------------------------- 1 | package Tools; 2 | import java.io.IOException; 3 | 4 | import Tools.StringAnalyzer; 5 | import TypeTrans.Full2Half; 6 | 7 | public class PreProcessText { 8 | static public String preProcess4Task1(String inputStr, String tmpRelationP, String tmpEntityS, String tmpEntityO) throws IOException{ 9 | if (inputStr.length()<1) return inputStr; 10 | //Force the entity words to be separated out by spaces 11 | 
if (tmpRelationP!=null) { 12 | if (tmpRelationP.length()==4) { //For 4-character relations (e.g. 传闻不和, 同为校花, 昔日情敌, 绯闻女友) the head word comes last 13 | if (!inputStr.contains(tmpRelationP)) { 14 | tmpRelationP = tmpRelationP.substring(2); 15 | } 16 | } 17 | inputStr = inputStr.replaceAll(tmpRelationP, " "+tmpRelationP+" "); 18 | } 19 | inputStr = inputStr.replaceAll(tmpEntityS, " "+tmpEntityS+" "); 20 | inputStr = inputStr.replaceAll(tmpEntityO, " "+tmpEntityO+" "); 21 | inputStr=Full2Half.ToDBC(inputStr);//Convert fullwidth characters to halfwidth 22 | inputStr=inputStr.toLowerCase();//Lowercase all letters 23 | inputStr=inputStr.replaceAll("\\s+", " ");//Collapse multiple spaces into one 24 | inputStr = StringAnalyzer.extractGoodCharacter(inputStr); //Remove all special characters 25 | // without POS tags / with POS tags 26 | inputStr = WordSegment_Ansj.splitWord(inputStr)+"\t"+WordSegment_Ansj.splitWordwithTag(inputStr);//Word segmentation 27 | 28 | return inputStr; 29 | } 30 | static public String preProcess4Task2(String inputStr) throws IOException{ 31 | if (inputStr.length()<1) return inputStr; 32 | inputStr=Full2Half.ToDBC(inputStr);//Convert fullwidth characters to halfwidth 33 | inputStr=inputStr.toLowerCase();//Lowercase all letters 34 | inputStr=inputStr.replaceAll("\\s+", " ");//Collapse multiple spaces into one 35 | inputStr = StringAnalyzer.extractGoodCharacter(inputStr); //Remove all special characters 36 | // without POS tags 37 | inputStr = WordSegment_Ansj.splitWordwithOutTag4Task2(inputStr);//Word segmentation 38 | 39 | return inputStr.trim(); 40 | } 41 | 42 | static public String preProcess4NLPCC2016(String inputStr, String topic) throws IOException{ 43 | if (inputStr.length()<1) return inputStr; 44 | inputStr=inputStr.replaceAll("#"+topic+"#", " ");//(1) Strip the hashtag marker 45 | inputStr=inputStr.replaceAll("http://t.cn/(.{7})", " ");//(2) Strip short links of the form http://t.cn/[7 characters] 46 | //(3) Strip some special share markers, such as: 47 | //(分享[up to 10 chars]) (分享[up to 10 chars]) 【分享[up to 10 chars]】 48 | //(来自[up to 10 chars]) (来自[up to 10 chars]) 【来自[up to 10 chars]】 49 | inputStr=inputStr.replaceAll("(分享(.{0,9}))", " "); 50 | inputStr=inputStr.replaceAll("\\(分享(.{0,9})\\)", " "); 51 | inputStr=inputStr.replaceAll("【分享(.{0,9})】", " "); 52 | 53 | inputStr=inputStr.replaceAll("(来自(.{0,9}))", " "); 54 | inputStr=inputStr.replaceAll("\\(来自(.{0,9})\\)", " "); 55 | inputStr=inputStr.replaceAll("【来自(.{0,9})】", " "); 56 | //(4) Strip all @mentions, e.g. @腾讯新闻客户端: @[up to 10 chars] followed by another @, a space, or a line break 57 | String[] inputStr_sub = inputStr.split("\\s+"); 58 | StringBuffer inputStr_bf = new StringBuffer(); 59 | for (String tmpinputStr_sub:inputStr_sub) { 60 | tmpinputStr_sub = tmpinputStr_sub+""; 61 | tmpinputStr_sub = tmpinputStr_sub.replaceAll("@(.{0,9})@", "@"); 62 | tmpinputStr_sub = tmpinputStr_sub.replaceAll("@(.{0,9}) ", " "); 63 | tmpinputStr_sub = tmpinputStr_sub.replaceAll("@(.{0,9})", " "); 64 | tmpinputStr_sub = tmpinputStr_sub.replaceAll("", ""); 65 | inputStr_bf.append(tmpinputStr_sub); 66 | inputStr_bf.append(" "); 67 | } 68 | inputStr = inputStr_bf.toString().trim(); 69 | inputStr_bf = null; 70 | 71 | inputStr=Full2Half.ToDBC(inputStr);//(5) Convert fullwidth characters to halfwidth 72 | inputStr=inputStr.toLowerCase();//(6) Lowercase all letters 73 | inputStr=inputStr.replaceAll("\\s+", " ");//(7) Collapse multiple spaces into one 74 | inputStr = StringAnalyzer.extractGoodCharacter(inputStr); //(8) Remove all special characters 75 | inputStr = WordSegment_Ansj.splitWord(inputStr);//(9) Word segmentation 76 | 77 | return inputStr.trim(); 78 | } 79 | 80 | private static boolean isITSuffixSpamInfo(String tmpQuerySnippet, String tmpEntityS, String tmpEntityO) { 81 | if ((tmpQuerySnippet.contains(tmpEntityS)||tmpQuerySnippet.contains(tmpEntityO)) 82 | &&tmpQuerySnippet.length()>4) { 83 | return false; 84 | }else { 85 | return true; 86 | } 87 | } 88 | } 89 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/Tools/StringAnalyzer.java: 
-------------------------------------------------------------------------------- 1 | package Tools; 2 | 3 | public class StringAnalyzer { 4 | 5 | 6 | private static String extractSendTime(String ss) { 7 | String[] tokens = ss.split("\\{\\$\\}"); //escape the braces so the literal "{$}" delimiter is a valid regex 8 | return tokens[0]; 9 | } 10 | private static String removeShortTerm(String ss){ 11 | StringBuffer sb = new StringBuffer(); 12 | String[] tokens = ss.split(" "); 13 | for(int i = 0;i<tokens.length;i++) 14 | { 15 | if(tokens[i].length()>1) 16 | { 17 | sb.append(tokens[i]); 18 | sb.append(" "); 19 | } 20 | } 21 | return sb.toString(); 22 | } 23 | public static String extractChineseCharacterWithoutSpace(String ss) { 24 | Boolean lastCharTag = true; 25 | StringBuffer str = new StringBuffer(); 26 | char[] ch = ss.toCharArray(); 27 | for (int i = 0; i < ch.length; i++) { 28 | if(CharacterAnalyzer.isChinese(ch[i])) 29 | { 30 | str.append(ch[i]); 31 | } 32 | } 33 | return str.toString(); 34 | } 35 | 36 | public static String extractGoodCharacter(String ss){ 37 | if(ss == null) 38 | return null; 39 | Boolean lastCharTag = true; 40 | StringBuffer str = new StringBuffer(); 41 | char[] ch = ss.toCharArray(); 42 | for (int i = 0; i < ch.length; i++) { 43 | if(CharacterAnalyzer.isGoodCharacter(ch[i])){ 44 | str.append(ch[i]); 45 | }else { 46 | str.append(' '); 47 | } 48 | } 49 | return str.toString().replaceAll("\\s+", " ").trim(); 50 | } 51 | 52 | public static String extractChineseCharacter(String ss) { 53 | Boolean lastCharTag = true; 54 | StringBuffer str = new StringBuffer(); 55 | char[] ch = ss.toCharArray(); 56 | for (int i = 0; i < ch.length; i++) { 57 | if(CharacterAnalyzer.isChinese(ch[i])) 58 | { 59 | if(lastCharTag) 60 | { 61 | str.append(ch[i]); 62 | } 63 | else 64 | { 65 | str.append(" "); 66 | str.append(ch[i]); 67 | lastCharTag = true; 68 | } 69 | } 70 | else 71 | { 72 | lastCharTag = false; 73 | } 74 | } 75 | //return removeShortTerm(str.toString()); 76 | if(str.toString().length() == 0) 77 | { 78 | return ""; 79 | } 80 | 81 | return str.toString().toLowerCase(); 82 | } 83 | 84 | public static String extractEnglishCharacter(String ss){ 85 | Boolean lastCharTag = true; 86 | StringBuffer str = new StringBuffer(); 87 | char[] ch = ss.toCharArray(); 88 | for (int i = 0; i < ch.length; i++) { 89 | if(CharacterAnalyzer.isEnglish(ch[i])){ 90 | if(lastCharTag){ 91 | str.append(ch[i]); 92 | } 93 | else{ 94 | str.append(" "); 95 | str.append(ch[i]); 96 | lastCharTag = true; 97 | } 98 | } 99 | else{ 100 | lastCharTag = false; 101 | } 102 | } 103 | if(str.toString().length() == 0){ 104 | return null; 105 | } 106 | 107 | return str.toString().toLowerCase(); 108 | } 109 | 110 | public static Boolean isNumberString(String ss){ 111 | char[] ch = ss.toCharArray(); 112 | for (int i = 0; i < ch.length; i++) { 113 | if(!CharacterAnalyzer.isNumber(ch[i])) 114 | return false; 115 | } 116 | // int telephoneNumberLength = 14; 117 | // if(ss.length()> telephoneNumberLength || ss.length() < 10)//TODO 118 | // return false; 119 | // 120 | return true; 121 | } 122 | 123 | public static Boolean isNumberString2(String ss){ 124 | char[] ch = ss.toCharArray(); 125 | for (int i = 1; i < ch.length-1; i++) { 126 | if(!CharacterAnalyzer.isNumber(ch[i])) 127 | return false; 128 | } 129 | int telephoneNumberLength = 14; 130 | if(ss.length()> telephoneNumberLength || ss.length() < 10) 131 | return false; 132 | 133 | return true; 134 | } 135 | 136 | 137 | 138 | public static String extractNumberCharacter(String ss){ 139 | Boolean lastCharTag = true; 140 | StringBuffer str = new StringBuffer(); 141 | char[] ch = ss.toCharArray(); 142 | for 
(int i = 0; i < ch.length; i++) { 143 | //if(characterAnalyzer.isNumber(ch[i])||characterAnalyzer.isSymbol(ch[i])) 144 | if(CharacterAnalyzer.isNumber(ch[i])){ 145 | if(lastCharTag){ 146 | str.append(ch[i]); 147 | } 148 | else{ 149 | str.append(" "); 150 | str.append(ch[i]); 151 | lastCharTag = true; 152 | } 153 | } 154 | else{ 155 | lastCharTag = false; 156 | } 157 | } 158 | if(str.toString().length() == 0){ 159 | return null; 160 | } 161 | return str.toString(); 162 | } 163 | 164 | public static void main(String args[]){ 165 | String testString = "丨~~@昨天喝12了[酒],今天|丨血压高。 大事没办了1 6,小-事耽误了。 横批是:他阿了吊嬳!!!"; 166 | System.out.println(extractGoodCharacter(testString)); 167 | System.out.println(extractEnglishCharacter(testString)); 168 | System.out.println(extractNumberCharacter(testString)); 169 | } 170 | } 171 | -------------------------------------------------------------------------------- /code/Feature_process_java/src/Tools/WordSegment_Ansj.java: -------------------------------------------------------------------------------- 1 | package Tools; 2 | 3 | import java.io.IOException; 4 | import java.io.StringReader; 5 | import java.util.ArrayList; 6 | import java.util.List; 7 | 8 | import org.ansj.domain.Term; 9 | import org.ansj.splitWord.Analysis; 10 | import org.ansj.splitWord.analysis.ToAnalysis; 11 | import org.ansj.util.recognition.NatureRecognition; 12 | 13 | //Reference: https://github.com/NLPchina/ansj_seg/wiki/%E5%88%86%E8%AF%8D%E4%BD%BF%E7%94%A8demo 14 | public class WordSegment_Ansj { 15 | 16 | public static String splitWord(String shortDoc) throws IOException { 17 | ArrayList<String> termsOnDoc = new ArrayList<String>(); 18 | List<Term> all = new ArrayList<Term>(); 19 | Analysis udf = new ToAnalysis(new StringReader(shortDoc)); 20 | Term term = null; 21 | while ((term = udf.next()) != null) { 22 | String tempTerm = term.toString().trim(); 23 | if (tempTerm.length()>0){ 24 | termsOnDoc.add(tempTerm); 25 | } 26 | } 27 | //Append the segmentation result of the sentence 28 | StringBuffer tmpContentBuffer = new StringBuffer(); 29 | for (int i = 0; i < termsOnDoc.size(); i++) { 30 | if (i!=0) { 31 | tmpContentBuffer.append(" "); 32 | } 33 | tmpContentBuffer.append(termsOnDoc.get(i)); 34 | } 35 | return tmpContentBuffer.toString(); 36 | } 37 | public static String splitWordwithTag(String shortDoc) throws IOException { 38 | ArrayList<String> termsOnDoc = new ArrayList<String>(); 39 | List<Term> allTerms = new ArrayList<Term>(); 40 | Analysis udf = new ToAnalysis(new StringReader(shortDoc)); 41 | Term term = null; 42 | while ((term = udf.next()) != null) { 43 | String tempTerm = term.toString().trim(); 44 | if (tempTerm.length()>0){ 45 | allTerms.add(term); 46 | // termsOnDoc.add(tempTerm); 47 | } 48 | } 49 | new NatureRecognition(allTerms).recogntion(); 50 | for (int i = 0; i < allTerms.size(); i++) { 51 | termsOnDoc.add(allTerms.get(i).toString()); 52 | } 53 | //Append the segmentation result of the sentence 54 | StringBuffer tmpContentBuffer = new StringBuffer(); 55 | for (int i = 0; i < termsOnDoc.size(); i++) { 56 | if (i!=0) { 57 | tmpContentBuffer.append(" "); 58 | } 59 | tmpContentBuffer.append(termsOnDoc.get(i)); 60 | } 61 | return tmpContentBuffer.toString(); 62 | } 63 | public static String splitWordwithOutTag4Task2(String shortDoc) throws IOException { 64 | ArrayList<String> termsOnDoc = new ArrayList<String>(); 65 | List<Term> allTerms = new ArrayList<Term>(); 66 | Analysis udf = new ToAnalysis(new StringReader(shortDoc)); 67 | Term term = null; 68 | while ((term = udf.next()) != null) { 69 | String tempTerm = term.toString().trim(); 70 | if (tempTerm.length()>0){ 71 | allTerms.add(term); 72 | // termsOnDoc.add(tempTerm); 73 | } 74 | } 75 | 
new NatureRecognition(allTerms).recogntion(); 76 | for (int i = 0; i < allTerms.size(); i++) { 77 | termsOnDoc.add(allTerms.get(i).toString()); 78 | } 79 | //Append the segmentation result of the sentence (stripping the POS tag from each term) 80 | StringBuffer tmpContentBuffer = new StringBuffer(); 81 | for (int i = 0; i < termsOnDoc.size(); i++) { 82 | String[] tmpTermArrays = termsOnDoc.get(i).split("/"); 83 | if (tmpTermArrays.length==2){ 84 | String tmpWordStr = tmpTermArrays[0]; 85 | termsOnDoc.set(i, tmpWordStr); 86 | } 87 | if (i!=0) { 88 | tmpContentBuffer.append(" "); 89 | } 90 | tmpContentBuffer.append(termsOnDoc.get(i)); 91 | } 92 | return tmpContentBuffer.toString().trim(); 93 | } 94 | } 95 | 96 | 97 | -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/TaskA_Feature_ranking.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/TaskA_Feature_ranking.m -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/main_TaskA_Feature_ranking.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/main_TaskA_Feature_ranking.m -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/normalize.m: -------------------------------------------------------------------------------- 1 | function Xn = normalize(X) 2 | % Normalize all feature vectors to unit length 3 | 4 | n = size(X,1); % the number of documents 5 | Xt = X'; 6 | l = sqrt(sum(Xt.^2)); % the row vector length (L2 norm) 7 | Ni = sparse(1:n,1:n,l); 8 | Ni(Ni>0) = 1./Ni(Ni>0); 9 | Xn = (Xt*Ni)'; 10 | 11 | end 12 | -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/predict.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/predict.mexw64 -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/svmpredict.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/svmpredict.mexw64 -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/svmtrain.mexw64: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/svmtrain.mexw64 -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/tf_idf.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/tf_idf.m -------------------------------------------------------------------------------- /code/Feature_ranking_matlab/train.mexw64: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/Feature_ranking_matlab/train.mexw64 -------------------------------------------------------------------------------- /code/LDA/README.md: -------------------------------------------------------------------------------- 1 | # Latent Dirichlet Allocation (LDA) 2 | 3 | This project is modified from the following open-source implementation, GibbsLDA: 4 | 5 | http://jgibblda.sourceforge.net/ 6 | 7 | The usage can be found here: http://jacoxu.com/?p=648 8 | -------------------------------------------------------------------------------- /code/LDA/jgibblda.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/LDA/jgibblda.jar -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/LE/LapEig.m: -------------------------------------------------------------------------------- 1 | function Y = LapEig(X, options, d) 2 | % Laplacian Eigenmap 3 | W = constructW(X,options); 4 | 5 | D = diag(sum(W)); 6 | ii = find(diag(D)==0); 7 | if ~isempty(ii) % give isolated nodes a small positive degree 8 | for i=1:length(ii) 9 | D(ii(i),ii(i)) = 0.01; 10 | end 11 | end 12 | D = sparse(D); 13 | %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 14 | L = D - W; 15 | 16 | options = []; 17 | options.disp = 0; 18 | [eigVecs,eigVals] = eigs(L,D,1+d,'sa',options); 19 | Y = eigVecs(:,2:end); % drop the trivial constant eigenvector 20 | 21 | end 22 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/LPI/LGE.m: -------------------------------------------------------------------------------- 1 | function [eigvector, eigvalue] = LGE(W, D, options, data) 2 | % LGE: Linear Graph Embedding 3 | % 4 | % [eigvector, eigvalue] = LGE(W, D, options, data) 5 | % 6 | % Input: 7 | % data - data matrix. Each row vector of data is a 8 | % sample vector. 9 | % W - Affinity graph matrix. 10 | % D - Constraint graph matrix. 11 | % LGE solves the optimization problem of 12 | % a* = argmax (a'*data'*W*data*a)/(a'*data'*D*data*a) 13 | % Default: D = I 14 | % 15 | % options - Struct value in Matlab. The fields in options 16 | % that can be set: 17 | % 18 | % ReducedDim - The dimensionality of the reduced 19 | % subspace. If 0, all the dimensions 20 | % will be kept. Default is 30. 21 | % 22 | % Regu - 1: regularized solution, 23 | % a* = argmax (a'*data'*W*data*a)/(a'*(data'*D*data + ReguAlpha*I)*a) 24 | % 0: solve the singularity problem by SVD (PCA) 25 | % Default: 0 26 | % 27 | % ReguAlpha - The regularization parameter. Valid 28 | % when Regu==1. Default value is 0.1. 29 | % 30 | % ReguType - 'Ridge': Tikhonov regularization 31 | % 'Custom': User provided 32 | % regularization matrix 33 | % Default: 'Ridge' 34 | % regularizerR - (nFea x nFea) regularization 35 | % matrix which should be provided 36 | % if ReguType is 'Custom'. nFea is 37 | % the feature number of data 38 | % matrix 39 | % 40 | % PCARatio - The percentage of principal 41 | % component kept in the PCA 42 | % step. The percentage is 43 | % calculated based on the 44 | % eigenvalue. Default is 1 45 | % (100%, all the non-zero 46 | % eigenvalues will be kept). 47 | % If PCARatio > 1, the PCA step 48 | % will keep exactly PCARatio principal 49 | % components (does not exceed the 50 | % exact number of non-zero components). 
51 | % 52 | % 53 | % Output: 54 | % eigvector - Each column is an embedding function, for a new 55 | % sample vector (row vector) x, y = x*eigvector 56 | % will be the embedding result of x. 57 | % eigvalue - The sorted eigvalue of the eigen-problem. 58 | % elapse - Time spent on different steps 59 | % 60 | % 61 | % 62 | % Examples: 63 | % 64 | % See also LPP, NPE, IsoProjection, LSDA. 65 | % 66 | % 67 | %Reference: 68 | % 69 | % 1. Deng Cai, Xiaofei He, Jiawei Han, "Spectral Regression for Efficient 70 | % Regularized Subspace Learning", IEEE International Conference on 71 | % Computer Vision (ICCV), Rio de Janeiro, Brazil, Oct. 2007. 72 | % 73 | % 2. Deng Cai, Xiaofei He, Yuxiao Hu, Jiawei Han, and Thomas Huang, 74 | % "Learning a Spatially Smooth Subspace for Face Recognition", CVPR'2007 75 | % 76 | % 3. Deng Cai, Xiaofei He, Jiawei Han, "Spectral Regression: A Unified 77 | % Subspace Learning Framework for Content-Based Image Retrieval", ACM 78 | % Multimedia 2007, Augsburg, Germany, Sep. 2007. 79 | % 80 | % 4. Deng Cai, "Spectral Regression: A Regression Framework for 81 | % Efficient Regularized Subspace Learning", PhD Thesis, Department of 82 | % Computer Science, UIUC, 2009. 83 | % 84 | % version 3.0 --Dec/2011 85 | % version 2.1 --June/2007 86 | % version 2.0 --May/2007 87 | % version 1.0 --Sep/2006 88 | % 89 | % Written by Deng Cai (dengcai AT gmail.com) 90 | % 91 | 92 | MAX_MATRIX_SIZE = 1600; % You can change this number according to your machine's computational power 93 | EIGVECTOR_RATIO = 0.1; % You can change this number according to your machine's computational power 94 | 95 | if (~exist('options','var')) 96 | options = []; 97 | end 98 | 99 | ReducedDim = 30; 100 | if isfield(options,'ReducedDim') 101 | ReducedDim = options.ReducedDim; 102 | end 103 | 104 | if ~isfield(options,'Regu') || ~options.Regu 105 | bPCA = 1; 106 | if ~isfield(options,'PCARatio') 107 | options.PCARatio = 1; 108 | end 109 | else 110 | bPCA = 0; 111 | if ~isfield(options,'ReguType') 112 | options.ReguType = 'Ridge'; 113 | end 114 | if ~isfield(options,'ReguAlpha') 115 | options.ReguAlpha = 0.1; 116 | end 117 | end 118 | 119 | bD = 1; 120 | if ~exist('D','var') || isempty(D) 121 | bD = 0; 122 | end 123 | 124 | 125 | [nSmp,nFea] = size(data); 126 | if size(W,1) ~= nSmp 127 | error('W and data mismatch!'); 128 | end 129 | if bD && (size(D,1) ~= nSmp) 130 | error('D and data mismatch!'); 131 | end 132 | 133 | bChol = 0; 134 | if bPCA && (nSmp > nFea) && (options.PCARatio >= 1) 135 | if bD 136 | DPrime = data'*D*data; 137 | else 138 | DPrime = data'*data; 139 | end 140 | DPrime = full(DPrime); 141 | DPrime = max(DPrime,DPrime'); 142 | [R,p] = chol(DPrime); 143 | if p == 0 144 | bPCA = 0; 145 | bChol = 1; 146 | end 147 | end 148 | 149 | %====================================== 150 | % SVD 151 | %====================================== 152 | 153 | if bPCA 154 | [U, S, V] = mySVD(data); 155 | [U, S, V]=CutonRatio(U,S,V,options); 156 | eigvalue_PCA = full(diag(S)); 157 | if bD 158 | data = U*S; 159 | eigvector_PCA = V; 160 | 161 | DPrime = data'*D*data; 162 | DPrime = max(DPrime,DPrime'); 163 | else 164 | data = U; 165 | eigvector_PCA = V*spdiags(eigvalue_PCA.^-1,0,length(eigvalue_PCA),length(eigvalue_PCA)); 166 | end 167 | else 168 | if ~bChol 169 | if bD 170 | DPrime = data'*D*data; 171 | else 172 | DPrime = data'*data; 173 | end 174 | 175 | switch lower(options.ReguType) 176 | case {lower('Ridge')} 177 | if options.ReguAlpha > 0 178 | for i=1:size(DPrime,1) 179 | DPrime(i,i) = DPrime(i,i) + options.ReguAlpha; 180 | end 
181 | end 182 | case {lower('Tensor')} 183 | if options.ReguAlpha > 0 184 | DPrime = DPrime + options.ReguAlpha*options.regularizerR; 185 | end 186 | case {lower('Custom')} 187 | if options.ReguAlpha > 0 188 | DPrime = DPrime + options.ReguAlpha*options.regularizerR; 189 | end 190 | otherwise 191 | error('ReguType does not exist!'); 192 | end 193 | 194 | DPrime = max(DPrime,DPrime'); 195 | end 196 | end 197 | 198 | WPrime = data'*W*data; 199 | WPrime = max(WPrime,WPrime'); 200 | 201 | 202 | 203 | %====================================== 204 | % Generalized Eigen 205 | %====================================== 206 | 207 | dimMatrix = size(WPrime,2); 208 | 209 | if ReducedDim > dimMatrix 210 | ReducedDim = dimMatrix; 211 | end 212 | 213 | 214 | if isfield(options,'bEigs') 215 | bEigs = options.bEigs; 216 | else 217 | if (dimMatrix > MAX_MATRIX_SIZE) && (ReducedDim < dimMatrix*EIGVECTOR_RATIO) 218 | bEigs = 1; 219 | else 220 | bEigs = 0; 221 | end 222 | end 223 | 224 | 225 | if bEigs 226 | %disp('use eigs to speed up!'); 227 | option = struct('disp',0); 228 | if bPCA && ~bD 229 | [eigvector, eigvalue] = eigs(WPrime,ReducedDim,'la',option); 230 | else 231 | if bChol 232 | option.cholB = 1; 233 | [eigvector, eigvalue] = eigs(WPrime,R,ReducedDim,'la',option); 234 | else 235 | [eigvector, eigvalue] = eigs(WPrime,DPrime,ReducedDim,'la',option); 236 | end 237 | end 238 | eigvalue = diag(eigvalue); 239 | else 240 | if bPCA && ~bD 241 | [eigvector, eigvalue] = eig(WPrime); 242 | else 243 | [eigvector, eigvalue] = eig(WPrime,DPrime); 244 | end 245 | eigvalue = diag(eigvalue); 246 | 247 | [junk, index] = sort(-eigvalue); 248 | eigvalue = eigvalue(index); 249 | eigvector = eigvector(:,index); 250 | 251 | if ReducedDim < size(eigvector,2) 252 | eigvector = eigvector(:, 1:ReducedDim); 253 | eigvalue = eigvalue(1:ReducedDim); 254 | end 255 | end 256 | 257 | 258 | if bPCA 259 | eigvector = eigvector_PCA*eigvector; 260 | end 261 | 262 | for i = 1:size(eigvector,2) 263 | eigvector(:,i) = eigvector(:,i)./norm(eigvector(:,i)); 264 | end 265 | 266 | 267 | 268 | 269 | 270 | function [U, S, V]=CutonRatio(U,S,V,options) 271 | if ~isfield(options, 'PCARatio') 272 | options.PCARatio = 1; 273 | end 274 | 275 | eigvalue_PCA = full(diag(S)); 276 | if options.PCARatio > 1 277 | idx = options.PCARatio; 278 | if idx < length(eigvalue_PCA) 279 | U = U(:,1:idx); 280 | V = V(:,1:idx); 281 | S = S(1:idx,1:idx); 282 | end 283 | elseif options.PCARatio < 1 284 | sumEig = sum(eigvalue_PCA); 285 | sumEig = sumEig*options.PCARatio; 286 | sumNow = 0; 287 | for idx = 1:length(eigvalue_PCA) 288 | sumNow = sumNow + eigvalue_PCA(idx); 289 | if sumNow >= sumEig 290 | break; 291 | end 292 | end 293 | U = U(:,1:idx); 294 | V = V(:,1:idx); 295 | S = S(1:idx,1:idx); 296 | end 297 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/LPI/lpp.m: -------------------------------------------------------------------------------- 1 | function [eigvector, eigvalue] = lpp(W, options, data) 2 | % LPP: Locality Preserving Projections 3 | % 4 | % [eigvector, eigvalue] = LPP(W, options, data) 5 | % 6 | % Input: 7 | % data - Data matrix. Each row vector of data is a data point. 8 | % W - Affinity matrix. You can either call "constructW" 9 | % to construct the W, or construct it by yourself. 10 | % options - Struct value in Matlab. The fields in options 11 | % that can be set: 12 | % 13 | % Please see LGE.m for other options. 
14 | % 15 | % Output: 16 | % eigvector - Each column is an embedding function, for a new 17 | % data point (row vector) x, y = x*eigvector 18 | % will be the embedding result of x. 19 | % eigvalue - The sorted eigvalue of LPP eigen-problem. 20 | % 21 | % 22 | % Examples: 23 | % 24 | % fea = rand(50,70); 25 | % options = []; 26 | % options.Metric = 'Euclidean'; 27 | % options.NeighborMode = 'KNN'; 28 | % options.k = 5; 29 | % options.WeightMode = 'HeatKernel'; 30 | % options.t = 5; 31 | % W = constructW(fea,options); 32 | % options.PCARatio = 0.99 33 | % [eigvector, eigvalue] = LPP(W, options, fea); 34 | % Y = fea*eigvector; 35 | % 36 | % 37 | % fea = rand(50,70); 38 | % gnd = [ones(10,1);ones(15,1)*2;ones(10,1)*3;ones(15,1)*4]; 39 | % options = []; 40 | % options.Metric = 'Euclidean'; 41 | % options.NeighborMode = 'Supervised'; 42 | % options.gnd = gnd; 43 | % options.bLDA = 1; 44 | % W = constructW(fea,options); 45 | % options.PCARatio = 1; 46 | % [eigvector, eigvalue] = LPP(W, options, fea); 47 | % Y = fea*eigvector; 48 | % 49 | % 50 | % Note: After applying some simple algebra, the smallest eigenvalue problem: 51 | % data^T*L*data = \lambda data^T*D*data 52 | % is equivalent to the largest eigenvalue problem: 53 | % data^T*W*data = \beta data^T*D*data 54 | % where L=D-W; \lambda = 1 - \beta. 55 | % Thus, the smallest eigenvalue problem can be transformed to a largest 56 | % eigenvalue problem. Such tricks are adopted in this code for better 57 | % numerical precision in Matlab. 58 | % 59 | % 60 | % See also constructW, LGE 61 | % 62 | %Reference: 63 | % Xiaofei He, and Partha Niyogi, "Locality Preserving Projections" 64 | % Advances in Neural Information Processing Systems 16 (NIPS 2003), 65 | % Vancouver, Canada, 2003. 66 | % 67 | % Xiaofei He, Shuicheng Yan, Yuxiao Hu, Partha Niyogi, and Hong-Jiang 68 | % Zhang, "Face Recognition Using Laplacianfaces", IEEE PAMI, Vol. 27, No. 69 | % 3, Mar. 2005. 70 | % 71 | % Deng Cai, Xiaofei He and Jiawei Han, "Document Clustering Using 72 | % Locality Preserving Indexing" IEEE TKDE, Dec. 2005. 73 | % 74 | % Deng Cai, Xiaofei He and Jiawei Han, "Using Graph Model for Face Analysis", 75 | % Technical Report, UIUCDCS-R-2005-2636, UIUC, Sept. 2005 76 | % 77 | % Xiaofei He, "Locality Preserving Projections" 78 | % PhD's thesis, Computer Science Department, The University of Chicago, 79 | % 2005. 
80 | % 81 | % version 2.1 --June/2007 82 | % version 2.0 --May/2007 83 | % version 1.1 --Feb/2006 84 | % version 1.0 --April/2004 85 | % 86 | % Written by Deng Cai (dengcai2 AT cs.uiuc.edu) 87 | % 88 | 89 | 90 | if (~exist('options','var')) 91 | options = []; 92 | end 93 | 94 | [nSmp,nFea] = size(data); 95 | if size(W,1) ~= nSmp 96 | error('W and data mismatch!'); 97 | end 98 | 99 | 100 | %========================== 101 | % If the data is too large, the following centering code can be commented out 102 | % options.keepMean = 1; 103 | %========================== 104 | if isfield(options,'keepMean') && options.keepMean 105 | ; 106 | else 107 | if issparse(data) 108 | data = full(data); 109 | end 110 | sampleMean = mean(data); 111 | data = (data - repmat(sampleMean,nSmp,1)); 112 | end 113 | %========================== 114 | 115 | 116 | 117 | 118 | D = full(sum(W,2)); 119 | 120 | 121 | if ~isfield(options,'Regu') || ~options.Regu 122 | DToPowerHalf = D.^.5; 123 | D_mhalf = DToPowerHalf.^-1; 124 | 125 | if nSmp < 5000 126 | tmpD_mhalf = repmat(D_mhalf,1,nSmp); 127 | W = (tmpD_mhalf.*W).*tmpD_mhalf'; 128 | clear tmpD_mhalf; 129 | else 130 | [i_idx,j_idx,v_idx] = find(W); 131 | v1_idx = zeros(size(v_idx)); 132 | for i=1:length(v_idx) 133 | v1_idx(i) = v_idx(i)*D_mhalf(i_idx(i))*D_mhalf(j_idx(i)); 134 | end 135 | W = sparse(i_idx,j_idx,v1_idx); 136 | clear i_idx j_idx v_idx v1_idx 137 | end 138 | W = max(W,W'); 139 | 140 | data = repmat(DToPowerHalf,1,nFea).*data; 141 | [eigvector, eigvalue] = LGE(W, [], options, data); 142 | else 143 | options.ReguAlpha = options.ReguAlpha*sum(D)/length(D); 144 | 145 | D = sparse(1:nSmp,1:nSmp,D,nSmp,nSmp); 146 | [eigvector, eigvalue] = LGE(W, D, options, data); 147 | end 148 | 149 | 150 | eigIdx = find(eigvalue < 1e-3); 151 | eigvalue (eigIdx) = []; 152 | eigvector(:,eigIdx) = []; -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/LPI/mySVD.m: -------------------------------------------------------------------------------- 1 | function [U, S, V] = mySVD(X,ReducedDim) 2 | %mySVD Accelerated singular value decomposition. 3 | % [U,S,V] = mySVD(X) produces a diagonal matrix S, of the 4 | % dimension as the rank of X and with nonnegative diagonal elements in 5 | % decreasing order, and unitary matrices U and V so that 6 | % X = U*S*V'. 7 | % 8 | % [U,S,V] = mySVD(X,ReducedDim) produces a diagonal matrix S, of the 9 | % dimension as ReducedDim and with nonnegative diagonal elements in 10 | % decreasing order, and unitary matrices U and V so that 11 | % Xhat = U*S*V' is the best approximation (with respect to F norm) of X 12 | % among all the matrices with rank no larger than ReducedDim. 13 | % 14 | % Based on the size of X, mySVD computes the eigvectors of X*X^T or X^T*X 15 | % first, and then converts them to the eigenvectors of the other. 16 | % 17 | % See also SVD. 
18 | % 19 | % version 2.0 --Feb/2009 20 | % version 1.0 --April/2004 21 | % 22 | % Written by Deng Cai (dengcai AT gmail.com) 23 | % 24 | 25 | MAX_MATRIX_SIZE = 1600; % You can change this number according to your machine's computational power 26 | EIGVECTOR_RATIO = 0.1; % You can change this number according to your machine's computational power 27 | 28 | 29 | if ~exist('ReducedDim','var') 30 | ReducedDim = 0; 31 | end 32 | 33 | [nSmp, mFea] = size(X); 34 | if mFea/nSmp > 1.0713 35 | ddata = X*X'; 36 | ddata = max(ddata,ddata'); 37 | 38 | dimMatrix = size(ddata,1); 39 | if (ReducedDim > 0) && (dimMatrix > MAX_MATRIX_SIZE) && (ReducedDim < dimMatrix*EIGVECTOR_RATIO) 40 | option = struct('disp',0); 41 | [U, eigvalue] = eigs(ddata,ReducedDim,'la',option); 42 | eigvalue = diag(eigvalue); 43 | else 44 | if issparse(ddata) 45 | ddata = full(ddata); 46 | end 47 | 48 | [U, eigvalue] = eig(ddata); 49 | eigvalue = diag(eigvalue); 50 | [dump, index] = sort(-eigvalue); 51 | eigvalue = eigvalue(index); 52 | U = U(:, index); 53 | end 54 | clear ddata; 55 | 56 | maxEigValue = max(abs(eigvalue)); 57 | eigIdx = find(abs(eigvalue)/maxEigValue < 1e-10); 58 | eigvalue(eigIdx) = []; 59 | U(:,eigIdx) = []; 60 | 61 | if (ReducedDim > 0) && (ReducedDim < length(eigvalue)) 62 | eigvalue = eigvalue(1:ReducedDim); 63 | U = U(:,1:ReducedDim); 64 | end 65 | 66 | eigvalue_Half = eigvalue.^.5; 67 | S = spdiags(eigvalue_Half,0,length(eigvalue_Half),length(eigvalue_Half)); 68 | 69 | if nargout >= 3 70 | eigvalue_MinusHalf = eigvalue_Half.^-1; 71 | V = X'*(U.*repmat(eigvalue_MinusHalf',size(U,1),1)); 72 | end 73 | else 74 | ddata = X'*X; 75 | ddata = max(ddata,ddata'); 76 | 77 | dimMatrix = size(ddata,1); 78 | if (ReducedDim > 0) && (dimMatrix > MAX_MATRIX_SIZE) && (ReducedDim < dimMatrix*EIGVECTOR_RATIO) 79 | option = struct('disp',0); 80 | [V, eigvalue] = eigs(ddata,ReducedDim,'la',option); 81 | eigvalue = diag(eigvalue); 82 | else 83 | if issparse(ddata) 84 | ddata = full(ddata); 85 | end 86 | 87 | [V, eigvalue] = eig(ddata); 88 | eigvalue = diag(eigvalue); 89 | 90 | [dump, index] = sort(-eigvalue); 91 | eigvalue = eigvalue(index); 92 | V = V(:, index); 93 | end 94 | clear ddata; 95 | 96 | maxEigValue = max(abs(eigvalue)); 97 | eigIdx = find(abs(eigvalue)/maxEigValue < 1e-10); 98 | eigvalue(eigIdx) = []; 99 | V(:,eigIdx) = []; 100 | 101 | if (ReducedDim > 0) && (ReducedDim < length(eigvalue)) 102 | eigvalue = eigvalue(1:ReducedDim); 103 | V = V(:,1:ReducedDim); 104 | end 105 | 106 | eigvalue_Half = eigvalue.^.5; 107 | S = spdiags(eigvalue_Half,0,length(eigvalue_Half),length(eigvalue_Half)); 108 | 109 | eigvalue_MinusHalf = eigvalue_Half.^-1; 110 | U = X*(V.*repmat(eigvalue_MinusHalf',size(V,1),1)); 111 | end 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/LSA/LSA.m: -------------------------------------------------------------------------------- 1 | function Y = LSA(X, nLowVec) 2 | 3 | k = nLowVec; 4 | 5 | [Y,~,~] = svds(X,k); 6 | 7 | end 8 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/Step_03_main_nlpcc2016.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/LSA_and_LPI_and_LE_matlab/Step_03_main_nlpcc2016.m --------------------------------------------------------------------------------
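Note: a minimal usage sketch of how the tf_idf, normalize and LSA helpers above are chained by Step_03_nlpcc2016.m below. The random matrix is hypothetical stand-in data, not part of the repository, and tf_idf.m (whose body is only linked above) is assumed to map a documents-by-terms count matrix to its TF-IDF weighting, as its call sites suggest.

    % Hypothetical stand-in data: 100 documents over a 500-term vocabulary.
    fea = sparse(round(rand(100, 500) * 3));
    fea = tf_idf(fea);      % reweight raw term counts (assumed signature, see tf_idf.m above)
    fea = normalize(fea);   % unit-L2-normalize each document vector (normalize.m above)
    Y = LSA(fea, 50);       % 50 leading left singular vectors via svds, one row per document
    disp(size(Y));          % prints [100 50]

The same reduced matrix Y is what Step_03_nlpcc2016.m saves as the LSA feature set; LapEig and lpp are drop-in alternatives for the LE and LPI features.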
/code/LSA_and_LPI_and_LE_matlab/Step_03_nlpcc2016.m: -------------------------------------------------------------------------------- 1 | function Step_03_nlpcc2016(method, topic, parameters, hasTestData) 2 | 3 | dataStr = ['./../../data/RefineData/Step02_vsm/vsm_train_labeled_',topic]; 4 | load(dataStr); 5 | evalStr = ['fea = vsm_train_labeled_',topic,';']; 6 | eval(evalStr) 7 | dataStr = ['./../../data/RefineData/Step02_vsm/vsm_train_unlabeled_',topic]; 8 | load(dataStr); 9 | evalStr = ['fea = [fea;vsm_train_unlabeled_',topic,'];']; 10 | eval(evalStr) 11 | if hasTestData 12 | dataStr = ['./../../data/RefineData/Step02_vsm/vsm_test_unlabeled_',topic]; 13 | load(dataStr); 14 | evalStr = ['fea = [fea;vsm_test_unlabeled_',topic,'];']; 15 | eval(evalStr) 16 | end 17 | 18 | nLowVec = parameters.nLowVec; 19 | %% init parameters 20 | options = []; 21 | options.NeighborMode = 'KNN'; 22 | options.Metric = 'Euclidean'; 23 | options.WeightMode = 'HeatKernel'; 24 | options.t = 1; 25 | options.bSelfConnected = 0; 26 | options.k = 15; 27 | 28 | %% 29 | if (strcmp(method,'LSA')) 30 | disp('Running baseline method: LSA!'); 31 | fea=tf_idf(fea); 32 | fea = normalize(fea); 33 | disp(['nLowVec is:',num2str(nLowVec)]); 34 | Y = LSA(fea,nLowVec); 35 | resultStr = ['./../../data/RefineData/Step03_lsa_lpi_le/fea_lsa_',topic]; 36 | save(resultStr, 'Y', '-ascii'); 37 | disp('LSA feature is ok!'); 38 | 39 | elseif (strcmp(method,'Spectral_LE')) 40 | disp('Running baseline method: Spectral_LE!'); 41 | fea=tf_idf(fea); 42 | fea = normalize(fea); 43 | disp(['nLowVec is:',num2str(nLowVec)]); 44 | Y = LapEig(fea,options,nLowVec); 45 | resultStr = ['./../../data/RefineData/Step03_lsa_lpi_le/fea_le_',topic]; 46 | save(resultStr, 'Y', '-ascii'); 47 | disp('LE feature is ok!'); 48 | 49 | elseif (strcmp(method,'LPI')) 50 | disp('Running baseline method: LPI!'); 51 | fea=tf_idf(fea); 52 | fea = normalize(fea); 53 | W = constructW(fea,options); 54 | options.PCARatio = 1; 55 | options.ReducedDim=nLowVec; 56 | [eigvector, ~] = lpp(W, options, fea); 57 | disp(['nLowVec is:',num2str(nLowVec)]); 58 | Y = fea*eigvector; 59 | resultStr = ['./../../data/RefineData/Step03_lsa_lpi_le/fea_lpi_',topic]; 60 | save(resultStr, 'Y', '-ascii'); 61 | disp('LPI feature is ok!'); 62 | 63 | else 64 | error(['You input an invalid method: ',method,'!']) 65 | end -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/Step_04_main_nlpcc2016.m: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/code/LSA_and_LPI_and_LE_matlab/Step_04_main_nlpcc2016.m --------------------------------------------------------------------------------
/code/LSA_and_LPI_and_LE_matlab/Step_04_nlpcc2016.m: -------------------------------------------------------------------------------- 1 | function Step_04_nlpcc2016(method, topic, parameters, hasTestData) 2 | 3 | dataStr = ['./../../data/RefineData/Step04_chisquare/vsm_train_labeled_chi2_',topic]; 4 | load(dataStr); 5 | evalStr = ['fea = vsm_train_labeled_chi2_',topic,';']; 6 | eval(evalStr) 7 | dataStr = ['./../../data/RefineData/Step04_chisquare/vsm_train_unlabeled_chi2_',topic]; 8 | load(dataStr); 9 | evalStr = ['fea = [fea;vsm_train_unlabeled_chi2_',topic,'];']; 10 | eval(evalStr) 11 | if hasTestData 12 | dataStr = ['./../../data/RefineData/Step04_chisquare/vsm_test_unlabeled_chi2_',topic]; 13 | load(dataStr); 14 | evalStr = ['fea = [fea;vsm_test_unlabeled_chi2_',topic,'];']; 15 | eval(evalStr) 16 | end 17 | 18 | nLowVec = parameters.nLowVec; 19 | %% init parameters 20 | options = []; 21 | options.NeighborMode = 'KNN'; 22 | options.Metric = 'Euclidean'; 23 | options.WeightMode = 'HeatKernel'; 24 | options.t = 1; 25 | options.bSelfConnected = 0; 26 | options.k = 15; 27 | 28 | %% 29 | if (strcmp(method,'LSA')) 30 | disp('Running baseline method: LSA!'); 31 | fea=tf_idf(fea); 32 | fea = normalize(fea); 33 | disp(['nLowVec is:',num2str(nLowVec)]); 34 | Y = LSA(fea,nLowVec); 35 | resultStr = ['./../../data/RefineData/Step04_chisquare/fea_chi2_lsa_',topic]; 36 | save(resultStr, 'Y', '-ascii'); 37 | disp('LSA feature is ok!'); 38 | 39 | elseif (strcmp(method,'Spectral_LE')) 40 | disp('Running baseline method: Spectral_LE!'); 41 | fea=tf_idf(fea); 42 | fea = normalize(fea); 43 | disp(['nLowVec is:',num2str(nLowVec)]); 44 | Y = LapEig(fea,options,nLowVec); 45 | resultStr = ['./../../data/RefineData/Step04_chisquare/fea_chi2_le_',topic]; 46 | save(resultStr, 'Y', '-ascii'); 47 | disp('LE feature is ok!'); 48 | 49 | elseif (strcmp(method,'LPI')) 50 | disp('Running baseline method: LPI!'); 51 | fea=tf_idf(fea); 52 | fea = normalize(fea); 53 | W = constructW(fea,options); 54 | options.PCARatio = 1; 55 | options.ReducedDim=nLowVec; 56 | [eigvector, ~] = lpp(W, options, fea); 57 | disp(['nLowVec is:',num2str(nLowVec)]); 58 | Y = fea*eigvector; 59 | resultStr = ['./../../data/RefineData/Step04_chisquare/fea_chi2_lpi_',topic]; 60 | save(resultStr, 'Y', '-ascii'); 61 | disp('LPI feature is ok!'); 62 | 63 | else 64 | error(['You input an invalid method: ',method,'!']) 65 | end -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/tools/EuDist2.m: -------------------------------------------------------------------------------- 1 | function D = EuDist2(fea_a,fea_b,bSqrt) 2 | % Euclidean Distance matrix 3 | % D = EuDist(fea_a,fea_b) 4 | % fea_a: nSample_a * nFeature 5 | % fea_b: nSample_b * nFeature 6 | % D: nSample_a * nSample_a 7 | % or nSample_a * nSample_b 8 | 9 | 10 | if ~exist('bSqrt','var') 11 | bSqrt = 1; 12 | end 13 | 14 | 15 | if (~exist('fea_b','var')) | isempty(fea_b) 16 | [nSmp, nFea] = size(fea_a); 17 | 18 | aa = sum(fea_a.*fea_a,2); 19 | ab = fea_a*fea_a'; 20 | 21 | aa = full(aa); 22 | ab = full(ab); 23 | 24 | if bSqrt 25 | D = sqrt(repmat(aa, 1, nSmp) + repmat(aa', nSmp, 1) - 2*ab); 26 | D = real(D); 27 | else 28 | D = repmat(aa, 1, nSmp) + repmat(aa', nSmp, 1) - 2*ab; 29 | end 30 | 31 | D = max(D,D'); 32 | D = D - diag(diag(D)); 33 | D = abs(D); 34 | else 35 | [nSmp_a, nFea] = size(fea_a); 36 | [nSmp_b, nFea] = size(fea_b); 37 | 38 | aa = sum(fea_a.*fea_a,2); 39 | bb = sum(fea_b.*fea_b,2); 40 | ab = fea_a*fea_b'; 41 | 42 | aa = full(aa); 43 | bb = full(bb); 44 | ab = full(ab); 45 | 46 | if bSqrt 47 | D = sqrt(repmat(aa, 1, nSmp_b) + repmat(bb', nSmp_a, 1) - 2*ab); 48 | D = real(D); 49 | else 50 | D = repmat(aa, 1, nSmp_b) + repmat(bb', nSmp_a, 1) - 2*ab; 51 | end 52 | 53 | D = abs(D); 54 | end 55 | 56 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/tools/constructW.m: -------------------------------------------------------------------------------- 1 | function [W, elapse] = constructW(fea,options) 2 | % Usage: 3 | % W = constructW(fea,options) 4 | % 5 | % fea: Rows of vectors of data points. Each row is x_i 6 | % options: Struct value in Matlab.
The fields in options that can be set: 7 | % Metric - Choices are: 8 | % 'Euclidean' - Will use the Euclidean distance of two data 9 | % points to evaluate the "closeness" between 10 | % them. [Default One] 11 | % 'Cosine' - Will use the cosine value of two vectors 12 | % to evaluate the "closeness" between them. 13 | % A popular similarity measure used in 14 | % Information Retrieval. 15 | % 16 | % NeighborMode - Indicates how to construct the graph. Choices 17 | % are: [Default 'KNN'] 18 | % 'KNN' - k = 0 19 | % Complete graph 20 | % k > 0 21 | % Put an edge between two nodes if and 22 | % only if they are among the k nearest 23 | % neighbors of each other. You are 24 | % required to provide the parameter k in 25 | % the options. Default k=5. 26 | % 'Supervised' - k = 0 27 | % Put an edge between two nodes if and 28 | % only if they belong to same class. 29 | % k > 0 30 | % Put an edge between two nodes if 31 | % they belong to same class and they 32 | % are among the k nearest neighbors of 33 | % each other. 34 | % Default: k=0 35 | % You are required to provide the label 36 | % information gnd in the options. 37 | % 38 | % WeightMode - Indicates how to assign weights for each edge 39 | % in the graph. Choices are: 40 | % 'Binary' - 0-1 weighting. Every edge receives weight 41 | % of 1. [Default One] 42 | % 'HeatKernel' - If nodes i and j are connected, put weight 43 | % W_ij = exp(-norm(x_i-x_j)^2/(2t^2)). This 44 | % weight mode can only be used under 45 | % 'Euclidean' metric and you are required to 46 | % provide the parameter t. 47 | % 'Cosine' - If nodes i and j are connected, put weight 48 | % cosine(x_i,x_j). Can only be used under 49 | % 'Cosine' metric. 50 | % 51 | % k - The parameter needed under 'KNN' NeighborMode. 52 | % Default will be 5. 53 | % gnd - The parameter needed under 'Supervised' 54 | % NeighborMode. Column vector of the label 55 | % information for each data point. 56 | % bLDA - 0 or 1. Only effective under 'Supervised' 57 | % NeighborMode. If 1, the graph will be constructed 58 | % to make LPP exactly same as LDA. Default will be 59 | % 0. 60 | % t - The parameter needed under 'HeatKernel' 61 | % WeightMode. Default will be 1 62 | % bNormalized - 0 or 1. Only effective under 'Cosine' metric. 63 | % Indicates whether the fea are already 64 | % normalized to 1. Default will be 0 65 | % bSelfConnected - 0 or 1. Indicates whether W(i,i) == 1. If 66 | % 'Supervised' NeighborMode & bLDA == 1, 67 | % bSelfConnected will always be 1. Default 1.
68 | % 69 | % 70 | % Examples: 71 | % 72 | % fea = rand(50,15); 73 | % options = []; 74 | % options.Metric = 'Euclidean'; 75 | % options.NeighborMode = 'KNN'; 76 | % options.k = 5; 77 | % options.WeightMode = 'HeatKernel'; 78 | % options.t = 1; 79 | % W = constructW(fea,options); 80 | % 81 | % 82 | % fea = rand(50,15); 83 | % gnd = [ones(10,1);ones(15,1)*2;ones(10,1)*3;ones(15,1)*4]; 84 | % options = []; 85 | % options.Metric = 'Euclidean'; 86 | % options.NeighborMode = 'Supervised'; 87 | % options.gnd = gnd; 88 | % options.WeightMode = 'HeatKernel'; 89 | % options.t = 1; 90 | % W = constructW(fea,options); 91 | % 92 | % 93 | % fea = rand(50,15); 94 | % gnd = [ones(10,1);ones(15,1)*2;ones(10,1)*3;ones(15,1)*4]; 95 | % options = []; 96 | % options.Metric = 'Euclidean'; 97 | % options.NeighborMode = 'Supervised'; 98 | % options.gnd = gnd; 99 | % options.bLDA = 1; 100 | % W = constructW(fea,options); 101 | % 102 | % 103 | % For more details about the different ways to construct the W, please 104 | % refer: 105 | % Deng Cai, Xiaofei He and Jiawei Han, "Document Clustering Using 106 | % Locality Preserving Indexing" IEEE TKDE, Dec. 2005. 107 | % 108 | % 109 | % Written by Deng Cai (dengcai2 AT cs.uiuc.edu), April/2004, Feb/2006, 110 | % May/2007 111 | % 112 | 113 | if (~exist('options','var')) 114 | options = []; 115 | else 116 | if ~isstruct(options) 117 | error('parameter error!'); 118 | end 119 | end 120 | 121 | %================================================= 122 | if ~isfield(options,'Metric') 123 | options.Metric = 'Cosine'; 124 | end 125 | 126 | switch lower(options.Metric) 127 | case {lower('Euclidean')} 128 | case {lower('Cosine')} 129 | if ~isfield(options,'bNormalized') 130 | options.bNormalized = 0; 131 | end 132 | otherwise 133 | error('Metric does not exist!'); 134 | end 135 | 136 | %================================================= 137 | if ~isfield(options,'NeighborMode') 138 | options.NeighborMode = 'KNN'; 139 | end 140 | 141 | switch lower(options.NeighborMode) 142 | case {lower('KNN')} %For simplicity, we include the data point itself in the kNN 143 | if ~isfield(options,'k') 144 | options.k = 5; 145 | end 146 | case {lower('Supervised')} 147 | if ~isfield(options,'bLDA') 148 | options.bLDA = 0; 149 | end 150 | if options.bLDA 151 | options.bSelfConnected = 1; 152 | end 153 | if ~isfield(options,'k') 154 | options.k = 0; 155 | end 156 | if ~isfield(options,'gnd') 157 | error('Label(gnd) should be provided under ''Supervised'' NeighborMode!'); 158 | end 159 | if ~isempty(fea) && length(options.gnd) ~= size(fea,1) 160 | error('gnd doesn''t match with fea!'); 161 | end 162 | otherwise 163 | error('NeighborMode does not exist!'); 164 | end 165 | 166 | %================================================= 167 | 168 | if ~isfield(options,'WeightMode') 169 | options.WeightMode = 'Binary'; 170 | end 171 | 172 | bBinary = 0; 173 | switch lower(options.WeightMode) 174 | case {lower('Binary')} 175 | bBinary = 1; 176 | case {lower('HeatKernel')} 177 | if ~strcmpi(options.Metric,'Euclidean') 178 | warning('''HeatKernel'' WeightMode should be used under ''Euclidean'' Metric!'); 179 | options.Metric = 'Euclidean'; 180 | end 181 | if ~isfield(options,'t') 182 | options.t = 1; 183 | end 184 | case {lower('Cosine')} 185 | if ~strcmpi(options.Metric,'Cosine') 186 | warning('''Cosine'' WeightMode should be used under ''Cosine'' Metric!'); 187 | options.Metric = 'Cosine'; 188 | end 189 | if ~isfield(options,'bNormalized') 190 | options.bNormalized = 0; 191 | end 192 | otherwise 193 | 
error('WeightMode does not exist!'); 194 | end 195 | 196 | %================================================= 197 | 198 | if ~isfield(options,'bSelfConnected') 199 | options.bSelfConnected = 1; 200 | end 201 | 202 | %================================================= 203 | tmp_T = cputime; 204 | 205 | if isfield(options,'gnd') 206 | nSmp = length(options.gnd); 207 | else 208 | nSmp = size(fea,1); 209 | end 210 | 211 | maxM = 62500000; %500M 212 | BlockSize = floor(maxM/(nSmp*3)); 213 | 214 | 215 | if strcmpi(options.NeighborMode,'Supervised') 216 | Label = unique(options.gnd); 217 | nLabel = length(Label); 218 | if options.bLDA 219 | G = zeros(nSmp,nSmp); 220 | for idx=1:nLabel 221 | classIdx = options.gnd==Label(idx); 222 | G(classIdx,classIdx) = 1/sum(classIdx); 223 | end 224 | W = sparse(G); 225 | elapse = cputime - tmp_T; 226 | return; 227 | end 228 | 229 | switch lower(options.WeightMode) 230 | case {lower('Binary')} 231 | if options.k > 0 232 | G = zeros(nSmp*(options.k+1),3); 233 | idNow = 0; 234 | for i=1:nLabel 235 | classIdx = find(options.gnd==Label(i)); 236 | D = EuDist2(fea(classIdx,:),[],0); 237 | [dump idx] = sort(D,2); % sort each row 238 | clear D dump; 239 | idx = idx(:,1:options.k+1); 240 | 241 | nSmpClass = length(classIdx)*(options.k+1); 242 | G(idNow+1:nSmpClass+idNow,1) = repmat(classIdx,[options.k+1,1]); 243 | G(idNow+1:nSmpClass+idNow,2) = classIdx(idx(:)); 244 | G(idNow+1:nSmpClass+idNow,3) = 1; 245 | idNow = idNow+nSmpClass; 246 | clear idx 247 | end 248 | G = sparse(G(:,1),G(:,2),G(:,3),nSmp,nSmp); 249 | G = max(G,G'); 250 | else 251 | G = zeros(nSmp,nSmp); 252 | for i=1:nLabel 253 | classIdx = find(options.gnd==Label(i)); 254 | G(classIdx,classIdx) = 1; 255 | end 256 | end 257 | 258 | if ~options.bSelfConnected 259 | for i=1:size(G,1) 260 | G(i,i) = 0; 261 | end 262 | end 263 | 264 | W = sparse(G); 265 | case {lower('HeatKernel')} 266 | if options.k > 0 267 | G = zeros(nSmp*(options.k+1),3); 268 | idNow = 0; 269 | for i=1:nLabel 270 | classIdx = find(options.gnd==Label(i)); 271 | D = EuDist2(fea(classIdx,:),[],0); 272 | [dump idx] = sort(D,2); % sort each row 273 | clear D; 274 | idx = idx(:,1:options.k+1); 275 | dump = dump(:,1:options.k+1); 276 | dump = exp(-dump/(2*options.t^2)); 277 | 278 | nSmpClass = length(classIdx)*(options.k+1); 279 | G(idNow+1:nSmpClass+idNow,1) = repmat(classIdx,[options.k+1,1]); 280 | G(idNow+1:nSmpClass+idNow,2) = classIdx(idx(:)); 281 | G(idNow+1:nSmpClass+idNow,3) = dump(:); 282 | idNow = idNow+nSmpClass; 283 | clear dump idx 284 | end 285 | G = sparse(G(:,1),G(:,2),G(:,3),nSmp,nSmp); 286 | else 287 | G = zeros(nSmp,nSmp); 288 | for i=1:nLabel 289 | classIdx = find(options.gnd==Label(i)); 290 | D = EuDist2(fea(classIdx,:),[],0); 291 | D = exp(-D/(2*options.t^2)); 292 | G(classIdx,classIdx) = D; 293 | end 294 | end 295 | 296 | if ~options.bSelfConnected 297 | for i=1:size(G,1) 298 | G(i,i) = 0; 299 | end 300 | end 301 | 302 | W = sparse(max(G,G')); 303 | case {lower('Cosine')} 304 | if ~options.bNormalized 305 | [nSmp, nFea] = size(fea); 306 | if issparse(fea) 307 | fea2 = fea'; 308 | feaNorm = sum(fea2.^2,1).^.5; 309 | for i = 1:nSmp 310 | fea2(:,i) = fea2(:,i) ./ max(1e-10,feaNorm(i)); 311 | end 312 | fea = fea2'; 313 | clear fea2; 314 | else 315 | feaNorm = sum(fea.^2,2).^.5; 316 | for i = 1:nSmp 317 | fea(i,:) = fea(i,:) ./ max(1e-12,feaNorm(i)); 318 | end 319 | end 320 | 321 | end 322 | 323 | if options.k > 0 324 | G = zeros(nSmp*(options.k+1),3); 325 | idNow = 0; 326 | for i=1:nLabel 327 | classIdx = 
find(options.gnd==Label(i)); 328 | D = fea(classIdx,:)*fea(classIdx,:)'; 329 | [dump idx] = sort(-D,2); % sort each row 330 | clear D; 331 | idx = idx(:,1:options.k+1); 332 | dump = -dump(:,1:options.k+1); 333 | 334 | nSmpClass = length(classIdx)*(options.k+1); 335 | G(idNow+1:nSmpClass+idNow,1) = repmat(classIdx,[options.k+1,1]); 336 | G(idNow+1:nSmpClass+idNow,2) = classIdx(idx(:)); 337 | G(idNow+1:nSmpClass+idNow,3) = dump(:); 338 | idNow = idNow+nSmpClass; 339 | clear dump idx 340 | end 341 | G = sparse(G(:,1),G(:,2),G(:,3),nSmp,nSmp); 342 | else 343 | G = zeros(nSmp,nSmp); 344 | for i=1:nLabel 345 | classIdx = find(options.gnd==Label(i)); 346 | G(classIdx,classIdx) = fea(classIdx,:)*fea(classIdx,:)'; 347 | end 348 | end 349 | 350 | if ~options.bSelfConnected 351 | for i=1:size(G,1) 352 | G(i,i) = 0; 353 | end 354 | end 355 | 356 | W = sparse(max(G,G')); 357 | otherwise 358 | error('WeightMode does not exist!'); 359 | end 360 | elapse = cputime - tmp_T; 361 | return; 362 | end 363 | 364 | 365 | if strcmpi(options.NeighborMode,'KNN') && (options.k > 0) 366 | if strcmpi(options.Metric,'Euclidean') 367 | G = zeros(nSmp*(options.k+1),3); 368 | for i = 1:ceil(nSmp/BlockSize) 369 | if i == ceil(nSmp/BlockSize) 370 | smpIdx = (i-1)*BlockSize+1:nSmp; 371 | dist = EuDist2(fea(smpIdx,:),fea,0); 372 | dist = full(dist); 373 | [dump idx] = sort(dist,2); % sort each row 374 | idx = idx(:,1:options.k+1); 375 | dump = dump(:,1:options.k+1); 376 | if ~bBinary 377 | dump = exp(-dump/(2*options.t^2)); 378 | end 379 | 380 | G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),1) = repmat(smpIdx',[options.k+1,1]); 381 | G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),2) = idx(:); 382 | if ~bBinary 383 | G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),3) = dump(:); 384 | else 385 | G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),3) = 1; 386 | end 387 | else 388 | smpIdx = (i-1)*BlockSize+1:i*BlockSize; 389 | dist = EuDist2(fea(smpIdx,:),fea,0); 390 | dist = full(dist); 391 | [dump idx] = sort(dist,2); % sort each row 392 | idx = idx(:,1:options.k+1); 393 | dump = dump(:,1:options.k+1); 394 | if ~bBinary 395 | dump = exp(-dump/(2*options.t^2)); 396 | end 397 | 398 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),1) = repmat(smpIdx',[options.k+1,1]); 399 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),2) = idx(:); 400 | if ~bBinary 401 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),3) = dump(:); 402 | else 403 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),3) = 1; 404 | end 405 | end 406 | end 407 | 408 | W = sparse(G(:,1),G(:,2),G(:,3),nSmp,nSmp); 409 | else 410 | if ~options.bNormalized 411 | [nSmp, nFea] = size(fea); 412 | if issparse(fea) 413 | fea2 = fea'; 414 | clear fea; 415 | for i = 1:nSmp 416 | fea2(:,i) = fea2(:,i) ./ max(1e-10,sum(fea2(:,i).^2,1).^.5); 417 | end 418 | fea = fea2'; 419 | clear fea2; 420 | else 421 | feaNorm = sum(fea.^2,2).^.5; 422 | for i = 1:nSmp 423 | fea(i,:) = fea(i,:) ./ max(1e-12,feaNorm(i)); 424 | end 425 | end 426 | end 427 | 428 | G = zeros(nSmp*(options.k+1),3); 429 | for i = 1:ceil(nSmp/BlockSize) 430 | if i == ceil(nSmp/BlockSize) 431 | smpIdx = (i-1)*BlockSize+1:nSmp; 432 | dist = fea(smpIdx,:)*fea'; 433 | dist = full(dist); 434 | [dump idx] = sort(-dist,2); % sort each row 435 | idx = idx(:,1:options.k+1); 436 | dump = -dump(:,1:options.k+1); 437 | 438 | G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),1) = repmat(smpIdx',[options.k+1,1]); 439 | 
G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),2) = idx(:); 440 | G((i-1)*BlockSize*(options.k+1)+1:nSmp*(options.k+1),3) = dump(:); 441 | else 442 | smpIdx = (i-1)*BlockSize+1:i*BlockSize; 443 | dist = fea(smpIdx,:)*fea'; 444 | dist = full(dist); 445 | [dump idx] = sort(-dist,2); % sort each row 446 | idx = idx(:,1:options.k+1); 447 | dump = -dump(:,1:options.k+1); 448 | 449 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),1) = repmat(smpIdx',[options.k+1,1]); 450 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),2) = idx(:); 451 | G((i-1)*BlockSize*(options.k+1)+1:i*BlockSize*(options.k+1),3) = dump(:); 452 | end 453 | end 454 | 455 | W = sparse(G(:,1),G(:,2),G(:,3),nSmp,nSmp); 456 | end 457 | 458 | if strcmpi(options.WeightMode,'Binary') 459 | W(find(W)) = 1; 460 | end 461 | 462 | if isfield(options,'bSemiSupervised') && options.bSemiSupervised 463 | tmpgnd = options.gnd(options.semiSplit); 464 | 465 | Label = unique(tmpgnd); 466 | nLabel = length(Label); 467 | G = zeros(sum(options.semiSplit),sum(options.semiSplit)); 468 | for idx=1:nLabel 469 | classIdx = tmpgnd==Label(idx); 470 | G(classIdx,classIdx) = 1; 471 | end 472 | Wsup = sparse(G); 473 | if ~isfield(options,'SameCategoryWeight') 474 | options.SameCategoryWeight = 1; 475 | end 476 | W(options.semiSplit,options.semiSplit) = (Wsup>0)*options.SameCategoryWeight; 477 | end 478 | 479 | if ~options.bSelfConnected 480 | for i=1:size(W,1) 481 | W(i,i) = 0; 482 | end 483 | end 484 | 485 | W = max(W,W'); 486 | 487 | elapse = cputime - tmp_T; 488 | return; 489 | end 490 | 491 | 492 | % strcmpi(options.NeighborMode,'KNN') & (options.k == 0) 493 | % Complete Graph 494 | 495 | if strcmpi(options.Metric,'Euclidean') 496 | W = EuDist2(fea,[],0); 497 | W = exp(-W/(2*options.t^2)); 498 | else 499 | if ~options.bNormalized 500 | % feaNorm = sum(fea.^2,2).^.5; 501 | % fea = fea ./ repmat(max(1e-10,feaNorm),1,size(fea,2)); 502 | [nSmp, nFea] = size(fea); 503 | if issparse(fea) 504 | fea2 = fea'; 505 | feaNorm = sum(fea2.^2,1).^.5; 506 | for i = 1:nSmp 507 | fea2(:,i) = fea2(:,i) ./ max(1e-10,feaNorm(i)); 508 | end 509 | fea = fea2'; 510 | clear fea2; 511 | else 512 | feaNorm = sum(fea.^2,2).^.5; 513 | for i = 1:nSmp 514 | fea(i,:) = fea(i,:) ./ max(1e-12,feaNorm(i)); 515 | end 516 | end 517 | end 518 | 519 | W = full(fea*fea'); 520 | end 521 | 522 | if ~options.bSelfConnected 523 | for i=1:size(W,1) 524 | W(i,i) = 0; 525 | end 526 | end 527 | 528 | W = max(W,W'); 529 | 530 | 531 | 532 | elapse = cputime - tmp_T; 533 | 534 | 535 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/tools/normalize.m: -------------------------------------------------------------------------------- 1 | function Xn = normalize(X) 2 | % Normalize all feature vectors to unit length 3 | 4 | n = size(X,1); % the number of documents 5 | Xt = X'; 6 | l = sqrt(sum(Xt.^2)); % the row vector length (L2 norm) 7 | Ni = sparse(1:n,1:n,l); 8 | Ni(Ni>0) = 1./Ni(Ni>0); 9 | Xn = (Xt*Ni)'; 10 | 11 | end 12 | -------------------------------------------------------------------------------- /code/LSA_and_LPI_and_LE_matlab/tools/tf_idf.m: -------------------------------------------------------------------------------- 1 | function [trainX] = tf_idf(trainTF) 2 | % TF-IDF weighting 3 | % ([1+log(tf)]*log[N/df]) 4 | [n,m] = size(trainTF); % the number of (training) documents and terms 5 | df = sum(trainTF>0); % (training) document frequency 6 | idf = log(n./df); 7 | IDF = sparse(1:m,1:m,idf); 8 | 
[trainI,trainJ,trainV] = find(trainTF); 9 | trainLogTF = sparse(trainI,trainJ,1+log(trainV),size(trainTF,1),size(trainTF,2)); 10 | trainX = trainLogTF*IDF; 11 | end 12 | -------------------------------------------------------------------------------- /code/Para2vec/README.md: -------------------------------------------------------------------------------- 1 | # Paragraph Vector (Para2vec) 2 | 3 | This project is modified from the following open-source Para2vec implementation: 4 | 5 | https://github.com/mesnilgr/iclr15 6 | 7 | -------------------------------------------------------------------------------- /code/Para2vec/go_paravec.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #author: Mikolov; modifier: Jacob Xu. -- 20160614 3 | # Topics: iphonese, bianpao, fankong, ertai, jinmo 4 | topic="iphonese" 5 | rm alldata.txt 6 | rm alldata-id.txt 7 | 8 | echo "Current topic is "${topic}" ..." 9 | cat "TaskA_train_labeled_text_"${topic} "TaskA_train_unlabeled_text_"${topic} "TaskA_test_unlabeled_text_"${topic} > alldata.txt 10 | awk 'BEGIN{a=0;}{print "_*" a " " $0; a++;}' < alldata.txt > alldata-id.txt 11 | 12 | gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -funroll-loops 13 | time ./word2vec -train ./alldata-id.txt -output ${topic}"_vectors.txt" -cbow 0 -size 50 -window 5 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1 14 | grep '_\*' ${topic}"_vectors.txt" > ${topic}"_para2vecs.txt" 15 | sed -i "s+_\*+0+g" ${topic}"_para2vecs.txt" 16 | echo "It is done, OK!" -------------------------------------------------------------------------------- /code/Para2vec/word2vec.c: -------------------------------------------------------------------------------- 1 | // Copyright 2013 Google Inc. All Rights Reserved. 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // http://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License.
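// (Note added for readability; the summary below is not part of the original
// Google source. Three details of this file matter for go_paravec.sh above:
//  1. InitUnigramTable() draws negative samples from the unigram distribution
//     raised to the 3/4 power, i.e. P(w) proportional to count(w)^0.75.
//  2. TrainModelThread() subsamples frequent words: word w is kept with
//     probability (sqrt(f(w)/(sample*T)) + 1) * (sample*T)/f(w), where f(w)
//     is its count, T the total word count, and 'sample' the -sample value.
//  3. With -sentence-vectors 1, the first token of each line (the "_*<id>"
//     prefix written by the awk step in go_paravec.sh) is trained with the
//     whole line as context, so its learned vector becomes the Para2vec
//     feature of that Weibo text.)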
14 | 15 | #include <stdio.h> 16 | #include <stdlib.h> 17 | #include <string.h> 18 | #include <math.h> 19 | #include <pthread.h> 20 | 21 | #define MAX_STRING 100 22 | #define EXP_TABLE_SIZE 1000 23 | #define MAX_EXP 6 24 | #define MAX_SENTENCE_LENGTH 1000 25 | #define MAX_CODE_LENGTH 40 26 | 27 | const int vocab_hash_size = 30000000; // Maximum 30 * 0.7 = 21M words in the vocabulary 28 | 29 | typedef float real; // Precision of float numbers 30 | 31 | struct vocab_word { 32 | long long cn; 33 | int *point; 34 | char *word, *code, codelen; 35 | }; 36 | 37 | char train_file[MAX_STRING], output_file[MAX_STRING]; 38 | char save_vocab_file[MAX_STRING], read_vocab_file[MAX_STRING]; 39 | struct vocab_word *vocab; 40 | int binary = 0, cbow = 1, debug_mode = 2, window = 5, min_count = 1, num_threads = 12, min_reduce = 1; 41 | int *vocab_hash; 42 | long long vocab_max_size = 1000, vocab_size = 0, layer1_size = 100, sentence_vectors = 0; 43 | long long train_words = 0, word_count_actual = 0, iter = 5, file_size = 0, classes = 0; 44 | real alpha = 0.025, starting_alpha, sample = 1e-3; 45 | real *syn0, *syn1, *syn1neg, *expTable; 46 | clock_t start; 47 | 48 | int hs = 0, negative = 5; 49 | const int table_size = 1e8; 50 | int *table; 51 | 52 | void InitUnigramTable() { 53 | int a, i; 54 | long long train_words_pow = 0; 55 | real d1, power = 0.75; 56 | table = (int *)malloc(table_size * sizeof(int)); 57 | for (a = 0; a < vocab_size; a++) train_words_pow += pow(vocab[a].cn, power); 58 | i = 0; 59 | d1 = pow(vocab[i].cn, power) / (real)train_words_pow; 60 | for (a = 0; a < table_size; a++) { 61 | table[a] = i; 62 | if (a / (real)table_size > d1) { 63 | i++; 64 | d1 += pow(vocab[i].cn, power) / (real)train_words_pow; 65 | } 66 | if (i >= vocab_size) i = vocab_size - 1; 67 | } 68 | } 69 | 70 | // Reads a single word from a file, assuming space + tab + EOL to be word boundaries 71 | void ReadWord(char *word, FILE *fin) { 72 | int a = 0, ch; 73 | while (!feof(fin)) { 74 | ch = fgetc(fin); 75 | if (ch == 13) continue; 76 | if ((ch == ' ') || (ch == '\t') || (ch == '\n')) { 77 | if (a > 0) { 78 | if (ch == '\n') ungetc(ch, fin); 79 | break; 80 | } 81 | if (ch == '\n') { 82 | strcpy(word, (char *)"</s>"); 83 | return; 84 | } else continue; 85 | } 86 | word[a] = ch; 87 | a++; 88 | if (a >= MAX_STRING - 1) a--; // Truncate too long words 89 | } 90 | word[a] = 0; 91 | } 92 | 93 | // Returns hash value of a word 94 | int GetWordHash(char *word) { 95 | unsigned long long a, hash = 0; 96 | for (a = 0; a < strlen(word); a++) hash = hash * 257 + word[a]; 97 | hash = hash % vocab_hash_size; 98 | return hash; 99 | } 100 | 101 | // Returns position of a word in the vocabulary; if the word is not found, returns -1 102 | int SearchVocab(char *word) { 103 | unsigned int hash = GetWordHash(word); 104 | while (1) { 105 | if (vocab_hash[hash] == -1) return -1; 106 | if (!strcmp(word, vocab[vocab_hash[hash]].word)) return vocab_hash[hash]; 107 | hash = (hash + 1) % vocab_hash_size; 108 | } 109 | return -1; 110 | } 111 | 112 | // Reads a word and returns its index in the vocabulary 113 | int ReadWordIndex(FILE *fin) { 114 | char word[MAX_STRING]; 115 | ReadWord(word, fin); 116 | if (feof(fin)) return -1; 117 | return SearchVocab(word); 118 | } 119 | 120 | // Adds a word to the vocabulary 121 | int AddWordToVocab(char *word) { 122 | unsigned int hash, length = strlen(word) + 1; 123 | if (length > MAX_STRING) length = MAX_STRING; 124 | vocab[vocab_size].word = (char *)calloc(length, sizeof(char)); 125 | strcpy(vocab[vocab_size].word, word); 126 | vocab[vocab_size].cn = 0; 127 |
vocab_size++; 128 | // Reallocate memory if needed 129 | if (vocab_size + 2 >= vocab_max_size) { 130 | vocab_max_size += 1000; 131 | vocab = (struct vocab_word *)realloc(vocab, vocab_max_size * sizeof(struct vocab_word)); 132 | } 133 | hash = GetWordHash(word); 134 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 135 | vocab_hash[hash] = vocab_size - 1; 136 | return vocab_size - 1; 137 | } 138 | 139 | // Used later for sorting by word counts 140 | int VocabCompare(const void *a, const void *b) { 141 | return ((struct vocab_word *)b)->cn - ((struct vocab_word *)a)->cn; 142 | } 143 | 144 | // Sorts the vocabulary by frequency using word counts 145 | void SortVocab() { 146 | int a, size; 147 | unsigned int hash; 148 | // Sort the vocabulary and keep </s> at the first position 149 | qsort(&vocab[1], vocab_size - 1, sizeof(struct vocab_word), VocabCompare); 150 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 151 | size = vocab_size; 152 | train_words = 0; 153 | for (a = 0; a < size; a++) { 154 | // Words occurring less than min_count times will be discarded from the vocab 155 | if (vocab[a].cn < min_count) { 156 | vocab_size--; 157 | free(vocab[vocab_size].word); 158 | } else { 159 | // Hash will be re-computed, as it is no longer valid after the sorting 160 | hash=GetWordHash(vocab[a].word); 161 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 162 | vocab_hash[hash] = a; 163 | train_words += vocab[a].cn; 164 | } 165 | } 166 | vocab = (struct vocab_word *)realloc(vocab, (vocab_size + 1) * sizeof(struct vocab_word)); 167 | // Allocate memory for the binary tree construction 168 | for (a = 0; a < vocab_size; a++) { 169 | vocab[a].code = (char *)calloc(MAX_CODE_LENGTH, sizeof(char)); 170 | vocab[a].point = (int *)calloc(MAX_CODE_LENGTH, sizeof(int)); 171 | } 172 | } 173 | 174 | // Reduces the vocabulary by removing infrequent tokens 175 | void ReduceVocab() { 176 | int a, b = 0; 177 | unsigned int hash; 178 | for (a = 0; a < vocab_size; a++) if (vocab[a].cn > min_reduce) { 179 | vocab[b].cn = vocab[a].cn; 180 | vocab[b].word = vocab[a].word; 181 | b++; 182 | } else free(vocab[a].word); 183 | vocab_size = b; 184 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 185 | for (a = 0; a < vocab_size; a++) { 186 | // Hash will be re-computed, as it is no longer valid 187 | hash = GetWordHash(vocab[a].word); 188 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 189 | vocab_hash[hash] = a; 190 | } 191 | fflush(stdout); 192 | min_reduce++; 193 | } 194 | 195 | // Create binary Huffman tree using the word counts 196 | // Frequent words will have short unique binary codes 197 | void CreateBinaryTree() { 198 | long long a, b, i, min1i, min2i, pos1, pos2, point[MAX_CODE_LENGTH]; 199 | char code[MAX_CODE_LENGTH]; 200 | long long *count = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long)); 201 | long long *binary = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long)); 202 | long long *parent_node = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long)); 203 | for (a = 0; a < vocab_size; a++) count[a] = vocab[a].cn; 204 | for (a = vocab_size; a < vocab_size * 2; a++) count[a] = 1e15; 205 | pos1 = vocab_size - 1; 206 | pos2 = vocab_size; 207 | // Following algorithm constructs the Huffman tree by adding one node at a time 208 | for (a = 0; a < vocab_size - 1; a++) { 209 | // First, find two smallest nodes 'min1, min2' 210 | if (pos1 >= 0) { 211 | if (count[pos1] < count[pos2]) { 212 | min1i = pos1; 213 | pos1--; 214 | } else {
min1i = pos2; 216 | pos2++; 217 | } 218 | } else { 219 | min1i = pos2; 220 | pos2++; 221 | } 222 | if (pos1 >= 0) { 223 | if (count[pos1] < count[pos2]) { 224 | min2i = pos1; 225 | pos1--; 226 | } else { 227 | min2i = pos2; 228 | pos2++; 229 | } 230 | } else { 231 | min2i = pos2; 232 | pos2++; 233 | } 234 | count[vocab_size + a] = count[min1i] + count[min2i]; 235 | parent_node[min1i] = vocab_size + a; 236 | parent_node[min2i] = vocab_size + a; 237 | binary[min2i] = 1; 238 | } 239 | // Now assign binary code to each vocabulary word 240 | for (a = 0; a < vocab_size; a++) { 241 | b = a; 242 | i = 0; 243 | while (1) { 244 | code[i] = binary[b]; 245 | point[i] = b; 246 | i++; 247 | b = parent_node[b]; 248 | if (b == vocab_size * 2 - 2) break; 249 | } 250 | vocab[a].codelen = i; 251 | vocab[a].point[0] = vocab_size - 2; 252 | for (b = 0; b < i; b++) { 253 | vocab[a].code[i - b - 1] = code[b]; 254 | vocab[a].point[i - b] = point[b] - vocab_size; 255 | } 256 | } 257 | free(count); 258 | free(binary); 259 | free(parent_node); 260 | } 261 | 262 | void LearnVocabFromTrainFile() { 263 | char word[MAX_STRING]; 264 | FILE *fin; 265 | long long a, i; 266 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 267 | fin = fopen(train_file, "rb"); 268 | if (fin == NULL) { 269 | printf("ERROR: training data file not found!\n"); 270 | exit(1); 271 | } 272 | vocab_size = 0; 273 | AddWordToVocab((char *)"</s>"); 274 | while (1) { 275 | ReadWord(word, fin); 276 | if (feof(fin)) break; 277 | train_words++; 278 | if ((debug_mode > 1) && (train_words % 100000 == 0)) { 279 | printf("%lldK%c", train_words / 1000, 13); 280 | fflush(stdout); 281 | } 282 | i = SearchVocab(word); 283 | if (i == -1) { 284 | a = AddWordToVocab(word); 285 | vocab[a].cn = 1; 286 | } else vocab[i].cn++; 287 | if (vocab_size > vocab_hash_size * 0.7) ReduceVocab(); 288 | } 289 | SortVocab(); 290 | if (debug_mode > 0) { 291 | printf("Vocab size: %lld\n", vocab_size); 292 | printf("Words in train file: %lld\n", train_words); 293 | } 294 | file_size = ftell(fin); 295 | fclose(fin); 296 | } 297 | 298 | void SaveVocab() { 299 | long long i; 300 | FILE *fo = fopen(save_vocab_file, "wb"); 301 | for (i = 0; i < vocab_size; i++) fprintf(fo, "%s %lld\n", vocab[i].word, vocab[i].cn); 302 | fclose(fo); 303 | } 304 | 305 | void ReadVocab() { 306 | long long a, i = 0; 307 | char c; 308 | char word[MAX_STRING]; 309 | FILE *fin = fopen(read_vocab_file, "rb"); 310 | if (fin == NULL) { 311 | printf("Vocabulary file not found\n"); 312 | exit(1); 313 | } 314 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 315 | vocab_size = 0; 316 | while (1) { 317 | ReadWord(word, fin); 318 | if (feof(fin)) break; 319 | a = AddWordToVocab(word); 320 | fscanf(fin, "%lld%c", &vocab[a].cn, &c); 321 | i++; 322 | } 323 | SortVocab(); 324 | if (debug_mode > 0) { 325 | printf("Vocab size: %lld\n", vocab_size); 326 | printf("Words in train file: %lld\n", train_words); 327 | } 328 | fin = fopen(train_file, "rb"); 329 | if (fin == NULL) { 330 | printf("ERROR: training data file not found!\n"); 331 | exit(1); 332 | } 333 | fseek(fin, 0, SEEK_END); 334 | file_size = ftell(fin); 335 | fclose(fin); 336 | } 337 | 338 | void InitNet() { 339 | long long a, b; 340 | unsigned long long next_random = 1; 341 | a = posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real)); 342 | if (syn0 == NULL) {printf("Memory allocation failed\n"); exit(1);} 343 | if (hs) { 344 | a = posix_memalign((void **)&syn1, 128, (long long)vocab_size * layer1_size * sizeof(real));
345 | if (syn1 == NULL) {printf("Memory allocation failed\n"); exit(1);} 346 | for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) 347 | syn1[a * layer1_size + b] = 0; 348 | } 349 | if (negative>0) { 350 | a = posix_memalign((void **)&syn1neg, 128, (long long)vocab_size * layer1_size * sizeof(real)); 351 | if (syn1neg == NULL) {printf("Memory allocation failed\n"); exit(1);} 352 | for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) 353 | syn1neg[a * layer1_size + b] = 0; 354 | } 355 | for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) { 356 | next_random = next_random * (unsigned long long)25214903917 + 11; 357 | syn0[a * layer1_size + b] = (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size; 358 | } 359 | CreateBinaryTree(); 360 | } 361 | 362 | void *TrainModelThread(void *id) { 363 | long long a, b, d, cw, word, last_word, sentence_length = 0, sentence_position = 0; 364 | long long word_count = 0, last_word_count = 0, sen[MAX_SENTENCE_LENGTH + 1]; 365 | long long l1, l2, c, target, label, local_iter = iter; 366 | unsigned long long next_random = (long long)id; 367 | real f, g; 368 | clock_t now; 369 | real *neu1 = (real *)calloc(layer1_size, sizeof(real)); 370 | real *neu1e = (real *)calloc(layer1_size, sizeof(real)); 371 | FILE *fi = fopen(train_file, "rb"); 372 | fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET); 373 | while (1) { 374 | if (word_count - last_word_count > 10000) { 375 | word_count_actual += word_count - last_word_count; 376 | last_word_count = word_count; 377 | if ((debug_mode > 1)) { 378 | now=clock(); 379 | printf("%cAlpha: %f Progress: %.2f%% Words/thread/sec: %.2fk ", 13, alpha, 380 | word_count_actual / (real)(iter * train_words + 1) * 100, 381 | word_count_actual / ((real)(now - start + 1) / (real)CLOCKS_PER_SEC * 1000)); 382 | fflush(stdout); 383 | } 384 | alpha = starting_alpha * (1 - word_count_actual / (real)(iter * train_words + 1)); 385 | if (alpha < starting_alpha * 0.0001) alpha = starting_alpha * 0.0001; 386 | } 387 | if (sentence_length == 0) { 388 | while (1) { 389 | word = ReadWordIndex(fi); 390 | if (feof(fi)) break; 391 | if (word == -1) continue; 392 | word_count++; 393 | if (word == 0) break; 394 | // The subsampling randomly discards frequent words while keeping the ranking same 395 | if (sample > 0) { 396 | real ran = (sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn; 397 | next_random = next_random * (unsigned long long)25214903917 + 11; 398 | if (ran < (next_random & 0xFFFF) / (real)65536) continue; 399 | } 400 | sen[sentence_length] = word; 401 | sentence_length++; 402 | if (sentence_length >= MAX_SENTENCE_LENGTH) break; 403 | } 404 | sentence_position = 0; 405 | } 406 | if (feof(fi) || (word_count > train_words / num_threads)) { 407 | word_count_actual += word_count - last_word_count; 408 | local_iter--; 409 | if (local_iter == 0) break; 410 | word_count = 0; 411 | last_word_count = 0; 412 | sentence_length = 0; 413 | fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET); 414 | continue; 415 | } 416 | word = sen[sentence_position]; 417 | if (word == -1) continue; 418 | for (c = 0; c < layer1_size; c++) neu1[c] = 0; 419 | for (c = 0; c < layer1_size; c++) neu1e[c] = 0; 420 | next_random = next_random * (unsigned long long)25214903917 + 11; 421 | b = next_random % window; 422 | if (cbow) { //train the cbow architecture 423 | // in -> hidden 424 | cw = 0; 425 | for (a = b; a < window * 1 + 1 - b; a++) if (a 
!= window) { 426 | c = sentence_position - window + a; 427 | if (c < 0) continue; 428 | if (c >= sentence_length) continue; 429 | if (sentence_vectors && (c == 0)) continue; 430 | last_word = sen[c]; 431 | if (last_word == -1) continue; 432 | for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size]; 433 | cw++; 434 | } 435 | if (sentence_vectors) { 436 | last_word = sen[0]; 437 | if (last_word == -1) continue; 438 | for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size]; 439 | cw++; 440 | } 441 | if (cw) { 442 | for (c = 0; c < layer1_size; c++) neu1[c] /= cw; 443 | if (hs) for (d = 0; d < vocab[word].codelen; d++) { 444 | f = 0; 445 | l2 = vocab[word].point[d] * layer1_size; 446 | // Propagate hidden -> output 447 | for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1[c + l2]; 448 | if (f <= -MAX_EXP) continue; 449 | else if (f >= MAX_EXP) continue; 450 | else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]; 451 | // 'g' is the gradient multiplied by the learning rate 452 | g = (1 - vocab[word].code[d] - f) * alpha; 453 | // Propagate errors output -> hidden 454 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2]; 455 | // Learn weights hidden -> output 456 | for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c]; 457 | } 458 | // NEGATIVE SAMPLING 459 | if (negative > 0) for (d = 0; d < negative + 1; d++) { 460 | if (d == 0) { 461 | target = word; 462 | label = 1; 463 | } else { 464 | next_random = next_random * (unsigned long long)25214903917 + 11; 465 | target = table[(next_random >> 16) % table_size]; 466 | if (target == 0) target = next_random % (vocab_size - 1) + 1; 467 | if (target == word) continue; 468 | label = 0; 469 | } 470 | l2 = target * layer1_size; 471 | f = 0; 472 | for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1neg[c + l2]; 473 | if (f > MAX_EXP) g = (label - 1) * alpha; 474 | else if (f < -MAX_EXP) g = (label - 0) * alpha; 475 | else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha; 476 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2]; 477 | for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * neu1[c]; 478 | } 479 | // hidden -> in 480 | for (a = b; a < window * 1 + 1 - b; a++) if (a != window) { 481 | c = sentence_position - window + a; 482 | if (c < 0) continue; 483 | if (c >= sentence_length) continue; 484 | if (sentence_vectors && (c == 0)) continue; 485 | last_word = sen[c]; 486 | if (last_word == -1) continue; 487 | for (c = 0; c < layer1_size; c++) syn0[c + last_word * layer1_size] += neu1e[c]; 488 | } 489 | if (sentence_vectors) { 490 | last_word = sen[0]; 491 | if (last_word == -1) continue; 492 | for (c = 0; c < layer1_size; c++) syn0[c + last_word * layer1_size] += neu1e[c]; 493 | } 494 | } 495 | } else { //train skip-gram 496 | for (a = b; a < window * 2 + 1 + sentence_vectors - b; a++) if (a != window) { 497 | c = sentence_position - window + a; 498 | if (sentence_vectors) if (a >= window * 2 + sentence_vectors - b) c = 0; 499 | if (c < 0) continue; 500 | if (c >= sentence_length) continue; 501 | last_word = sen[c]; 502 | if (last_word == -1) continue; 503 | l1 = last_word * layer1_size; 504 | for (c = 0; c < layer1_size; c++) neu1e[c] = 0; 505 | // HIERARCHICAL SOFTMAX 506 | if (hs) for (d = 0; d < vocab[word].codelen; d++) { 507 | f = 0; 508 | l2 = vocab[word].point[d] * layer1_size; 509 | // Propagate hidden -> output 510 | for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2]; 511 | if (f 
<= -MAX_EXP) continue; 512 | else if (f >= MAX_EXP) continue; 513 | else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]; 514 | // 'g' is the gradient multiplied by the learning rate 515 | g = (1 - vocab[word].code[d] - f) * alpha; 516 | // Propagate errors output -> hidden 517 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2]; 518 | // Learn weights hidden -> output 519 | for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * syn0[c + l1]; 520 | } 521 | // NEGATIVE SAMPLING 522 | if (negative > 0) for (d = 0; d < negative + 1; d++) { 523 | if (d == 0) { 524 | target = word; 525 | label = 1; 526 | } else { 527 | next_random = next_random * (unsigned long long)25214903917 + 11; 528 | target = table[(next_random >> 16) % table_size]; 529 | if (target == 0) target = next_random % (vocab_size - 1) + 1; 530 | if (target == word) continue; 531 | label = 0; 532 | } 533 | l2 = target * layer1_size; 534 | f = 0; 535 | for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2]; 536 | if (f > MAX_EXP) g = (label - 1) * alpha; 537 | else if (f < -MAX_EXP) g = (label - 0) * alpha; 538 | else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha; 539 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2]; 540 | for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1]; 541 | } 542 | // Learn weights input -> hidden 543 | for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c]; 544 | } 545 | } 546 | sentence_position++; 547 | if (sentence_position >= sentence_length) { 548 | sentence_length = 0; 549 | continue; 550 | } 551 | } 552 | fclose(fi); 553 | free(neu1); 554 | free(neu1e); 555 | pthread_exit(NULL); 556 | } 557 | 558 | void TrainModel() { 559 | long a, b, c, d; 560 | FILE *fo; 561 | pthread_t *pt = (pthread_t *)malloc(num_threads * sizeof(pthread_t)); 562 | printf("Starting training using file %s\n", train_file); 563 | starting_alpha = alpha; 564 | if (read_vocab_file[0] != 0) ReadVocab(); else LearnVocabFromTrainFile(); 565 | if (save_vocab_file[0] != 0) SaveVocab(); 566 | if (output_file[0] == 0) return; 567 | InitNet(); 568 | if (negative > 0) InitUnigramTable(); 569 | start = clock(); 570 | for (a = 0; a < num_threads; a++) pthread_create(&pt[a], NULL, TrainModelThread, (void *)a); 571 | for (a = 0; a < num_threads; a++) pthread_join(pt[a], NULL); 572 | fo = fopen(output_file, "wb"); 573 | if (classes == 0) { 574 | // Save the word vectors 575 | fprintf(fo, "%lld %lld\n", vocab_size, layer1_size); 576 | for (a = 0; a < vocab_size; a++) { 577 | fprintf(fo, "%s ", vocab[a].word); 578 | if (binary) for (b = 0; b < layer1_size; b++) fwrite(&syn0[a * layer1_size + b], sizeof(real), 1, fo); 579 | else for (b = 0; b < layer1_size; b++) fprintf(fo, "%lf ", syn0[a * layer1_size + b]); 580 | fprintf(fo, "\n"); 581 | } 582 | } else { 583 | // Run K-means on the word vectors 584 | int clcn = classes, iter = 10, closeid; 585 | int *centcn = (int *)malloc(classes * sizeof(int)); 586 | int *cl = (int *)calloc(vocab_size, sizeof(int)); 587 | real closev, x; 588 | real *cent = (real *)calloc(classes * layer1_size, sizeof(real)); 589 | for (a = 0; a < vocab_size; a++) cl[a] = a % clcn; 590 | for (a = 0; a < iter; a++) { 591 | for (b = 0; b < clcn * layer1_size; b++) cent[b] = 0; 592 | for (b = 0; b < clcn; b++) centcn[b] = 1; 593 | for (c = 0; c < vocab_size; c++) { 594 | for (d = 0; d < layer1_size; d++) cent[layer1_size * cl[c] + d] += syn0[c * layer1_size + d]; 595 | centcn[cl[c]]++; 596 | } 597 | for (b 
= 0; b < clcn; b++) { 598 | closev = 0; 599 | for (c = 0; c < layer1_size; c++) { 600 | cent[layer1_size * b + c] /= centcn[b]; 601 | closev += cent[layer1_size * b + c] * cent[layer1_size * b + c]; 602 | } 603 | closev = sqrt(closev); 604 | for (c = 0; c < layer1_size; c++) cent[layer1_size * b + c] /= closev; 605 | } 606 | for (c = 0; c < vocab_size; c++) { 607 | closev = -10; 608 | closeid = 0; 609 | for (d = 0; d < clcn; d++) { 610 | x = 0; 611 | for (b = 0; b < layer1_size; b++) x += cent[layer1_size * d + b] * syn0[c * layer1_size + b]; 612 | if (x > closev) { 613 | closev = x; 614 | closeid = d; 615 | } 616 | } 617 | cl[c] = closeid; 618 | } 619 | } 620 | // Save the K-means classes 621 | for (a = 0; a < vocab_size; a++) fprintf(fo, "%s %d\n", vocab[a].word, cl[a]); 622 | free(centcn); 623 | free(cent); 624 | free(cl); 625 | } 626 | fclose(fo); 627 | } 628 | 629 | int ArgPos(char *str, int argc, char **argv) { 630 | int a; 631 | for (a = 1; a < argc; a++) if (!strcmp(str, argv[a])) { 632 | if (a == argc - 1) { 633 | printf("Argument missing for %s\n", str); 634 | exit(1); 635 | } 636 | return a; 637 | } 638 | return -1; 639 | } 640 | 641 | int main(int argc, char **argv) { 642 | int i; 643 | if (argc == 1) { 644 | printf("WORD VECTOR estimation toolkit v 0.1c\n\n"); 645 | printf("Options:\n"); 646 | printf("Parameters for training:\n"); 647 | printf("\t-train <file>\n"); 648 | printf("\t\tUse text data from <file> to train the model\n"); 649 | printf("\t-output <file>\n"); 650 | printf("\t\tUse <file> to save the resulting word vectors / word clusters\n"); 651 | printf("\t-size <int>\n"); 652 | printf("\t\tSet size of word vectors; default is 100\n"); 653 | printf("\t-window <int>\n"); 654 | printf("\t\tSet max skip length between words; default is 5\n"); 655 | printf("\t-sample <float>\n"); 656 | printf("\t\tSet threshold for occurrence of words. Those that appear with higher frequency in the training data\n"); 657 | printf("\t\twill be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5)\n"); 658 | printf("\t-hs <int>\n"); 659 | printf("\t\tUse Hierarchical Softmax; default is 0 (not used)\n"); 660 | printf("\t-negative <int>\n"); 661 | printf("\t\tNumber of negative examples; default is 5, common values are 3 - 10 (0 = not used)\n"); 662 | printf("\t-threads <int>\n"); 663 | printf("\t\tUse <int> threads (default 12)\n"); 664 | printf("\t-iter <int>\n"); 665 | printf("\t\tRun more training iterations (default 5)\n"); 666 | printf("\t-min-count <int>\n"); 667 | printf("\t\tThis will discard words that appear less than <int> times; default is 5\n"); 668 | printf("\t-alpha <float>\n"); 669 | printf("\t\tSet the starting learning rate; default is 0.025 for skip-gram and 0.05 for CBOW\n"); 670 | printf("\t-classes <int>\n"); 671 | printf("\t\tOutput word classes rather than word vectors; default number of classes is 0 (vectors are written)\n"); 672 | printf("\t-debug <int>\n"); 673 | printf("\t\tSet the debug mode (default = 2 = more info during training)\n"); 674 | printf("\t-binary <int>\n"); 675 | printf("\t\tSave the resulting vectors in binary mode; default is 0 (off)\n"); 676 | printf("\t-save-vocab <file>\n"); 677 | printf("\t\tThe vocabulary will be saved to <file>\n"); 678 | printf("\t-read-vocab <file>\n"); 679 | printf("\t\tThe vocabulary will be read from <file>, not constructed from the training data\n"); 680 | printf("\t-cbow <int>\n"); 681 | printf("\t\tUse the continuous bag of words model; default is 1 (use 0 for skip-gram model)\n"); 682 | printf("\t-sentence-vectors <int>\n"); 683 | printf("\t\tAssume the first token at the beginning of each line is a sentence ID.
This token will be trained\n"); 684 | printf("\t\twith full sentence context instead of just the window. Use 1 to turn on.\n"); 685 | printf("\nExamples:\n"); 686 | printf("./word2vec -train data.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 1 -iter 3\n\n"); 687 | return 0; 688 | } 689 | output_file[0] = 0; 690 | save_vocab_file[0] = 0; 691 | read_vocab_file[0] = 0; 692 | if ((i = ArgPos((char *)"-size", argc, argv)) > 0) layer1_size = atoi(argv[i + 1]); 693 | if ((i = ArgPos((char *)"-train", argc, argv)) > 0) strcpy(train_file, argv[i + 1]); 694 | if ((i = ArgPos((char *)"-save-vocab", argc, argv)) > 0) strcpy(save_vocab_file, argv[i + 1]); 695 | if ((i = ArgPos((char *)"-read-vocab", argc, argv)) > 0) strcpy(read_vocab_file, argv[i + 1]); 696 | if ((i = ArgPos((char *)"-debug", argc, argv)) > 0) debug_mode = atoi(argv[i + 1]); 697 | if ((i = ArgPos((char *)"-binary", argc, argv)) > 0) binary = atoi(argv[i + 1]); 698 | if ((i = ArgPos((char *)"-cbow", argc, argv)) > 0) cbow = atoi(argv[i + 1]); 699 | if (cbow) alpha = 0.05; 700 | if ((i = ArgPos((char *)"-alpha", argc, argv)) > 0) alpha = atof(argv[i + 1]); 701 | if ((i = ArgPos((char *)"-output", argc, argv)) > 0) strcpy(output_file, argv[i + 1]); 702 | if ((i = ArgPos((char *)"-window", argc, argv)) > 0) window = atoi(argv[i + 1]); 703 | if ((i = ArgPos((char *)"-sample", argc, argv)) > 0) sample = atof(argv[i + 1]); 704 | if ((i = ArgPos((char *)"-hs", argc, argv)) > 0) hs = atoi(argv[i + 1]); 705 | if ((i = ArgPos((char *)"-negative", argc, argv)) > 0) negative = atoi(argv[i + 1]); 706 | if ((i = ArgPos((char *)"-threads", argc, argv)) > 0) num_threads = atoi(argv[i + 1]); 707 | if ((i = ArgPos((char *)"-iter", argc, argv)) > 0) iter = atoi(argv[i + 1]); 708 | if ((i = ArgPos((char *)"-min-count", argc, argv)) > 0) min_count = atoi(argv[i + 1]); 709 | if ((i = ArgPos((char *)"-classes", argc, argv)) > 0) classes = atoi(argv[i + 1]); 710 | if ((i = ArgPos((char *)"-sentence-vectors", argc, argv)) > 0) sentence_vectors = atoi(argv[i + 1]); 711 | vocab = (struct vocab_word *)calloc(vocab_max_size, sizeof(struct vocab_word)); 712 | vocab_hash = (int *)calloc(vocab_hash_size, sizeof(int)); 713 | expTable = (real *)malloc((EXP_TABLE_SIZE + 1) * sizeof(real)); 714 | for (i = 0; i < EXP_TABLE_SIZE; i++) { 715 | expTable[i] = exp((i / (real)EXP_TABLE_SIZE * 2 - 1) * MAX_EXP); // Precompute the exp() table 716 | expTable[i] = expTable[i] / (expTable[i] + 1); // Precompute f(x) = x / (x + 1) 717 | } 718 | TrainModel(); 719 | return 0; 720 | } 721 | -------------------------------------------------------------------------------- /corpus/Hownet/neg_opinion.txt: -------------------------------------------------------------------------------- 1 | 僄 2 | 啰啰唆唆 3 | 啰啰嗦嗦 4 | 啰里啰唆 5 | 啰里啰嗦 6 | 啰唆 7 | 啰嗦 8 | 噲 9 | 奓着头发 10 | 婞 11 | 婞直 12 | 崒 13 | 弇陋 14 | 惛 15 | 惼 16 | 梼昧 17 | 獪 18 | 瘆 19 | 瘆得慌 20 | 哀鸿遍野 21 | 矮 22 | 碍难 23 | 碍眼 24 | 爱搭不理 25 | 爱理不理 26 | 暗 27 | 暗暗 28 | 暗沉沉 29 | 暗淡 30 | 暗地 31 | 暗地里 32 | 暗黑 33 | 暗里 34 | 暗昧 35 | 暗弱 36 | 暗无天日 37 | 暗下 38 | 暗中 39 | 暗自 40 | 暗朦 41 | 岸然 42 | 肮里肮脏 43 | 肮脏 44 | 昂贵 45 | 凹凸 46 | 凹凸不平 47 | 傲 48 | 傲岸 49 | 傲慢 50 | 八面玲珑 51 | 跋扈 52 | 霸道 53 | 霸气 54 | 白 55 | 白白 56 | 白痴般 57 | 白搭 58 | 白忙 59 | 白忙活儿 60 | 白衣苍狗 61 | 白云苍狗 62 | 百孔千疮 63 | 败坏 64 | 稗 65 | 板 66 | 板板六十四 67 | 板滞 68 | 半半拉拉 69 | 半路出家 70 | 半新不旧 71 | 半真半假 72 | 薄 73 | 薄情 74 | 薄弱 75 | 薄幸 76 | 保残守缺 77 | 保守 78 | 抱残守缺 79 | 暴 80 | 暴烈 81 | 暴虐 82 | 暴躁 83 | 暴戾 84 | 暴戾恣睢 85 | 爆炸性 86 | 悲惨 87 | 悲观 88 | 悲观地 89 | 悲剧 90 | 悲凉 91 | 卑 92 | 卑鄙 93 | 卑鄙无耻 94 | 
卑贱 95 | 卑劣 96 | 卑陋 97 | 卑怯 98 | 卑俗 99 | 卑琐 100 | 卑微 101 | 卑污 102 | 卑下 103 | 卑猥 104 | 背地 105 | 背地里 106 | 背光 107 | 背后 108 | 背悔 109 | 背静 110 | 背靠背 111 | 背理 112 | 背令 113 | 背人 114 | 背时 115 | 背阴 116 | 被动 117 | 被动式 118 | 被动性 119 | 本本主义 120 | 笨 121 | 笨手笨脚 122 | 笨头笨脑 123 | 笨重 124 | 笨拙 125 | 比肩接踵 126 | 鄙 127 | 鄙贱 128 | 鄙吝 129 | 鄙陋 130 | 鄙俗 131 | 鄙俚 132 | 蔽塞 133 | 闭塞 134 | 必修 135 | 变化不定 136 | 变化多端 137 | 变化万千 138 | 变化无常 139 | 变化无穷 140 | 变幻不定 141 | 变幻莫测 142 | 变幻无常 143 | 变态 144 | 变相 145 | 表里不一 146 | 憋拗 147 | 别别扭扭 148 | 别扭 149 | 别有用心 150 | 冰冷 151 | 冰炭不相容 152 | 秉性剌戾 153 | 病病歪歪 154 | 病弱 155 | 病态 156 | 病歪歪 157 | 病殃殃 158 | 病恹恹 159 | 波谲云诡 160 | 驳杂 161 | 捕风捉影 162 | 不爱交际 163 | 不便 164 | 不辨菽麦 165 | 不才 166 | 不成材 167 | 不成功 168 | 不成话 169 | 不成器 170 | 不成熟 171 | 不成体统 172 | 不成样子 173 | 不打紧 174 | 不大重要 175 | 不当 176 | 不到黄河心不死 177 | 不道德 178 | 不得当 179 | 不得劲 180 | 不得了 181 | 不得体 182 | 不登大雅之堂 183 | 不等 184 | 不端 185 | 不对 186 | 不对茬儿 187 | 不对称 188 | 不对付 189 | 不对劲 190 | 不对头 191 | 不发达 192 | 不法 193 | 不方便 194 | 不分青红皂白 195 | 不分皂白 196 | 不负责任 197 | 不干不净 198 | 不更事 199 | 不公 200 | 不公正 201 | 不共戴天 202 | 不够格 203 | 不够完善 204 | 不够意思 205 | 不关痛痒 206 | 不管三七二十一 207 | 不管用 208 | 不光彩 209 | 不光明 210 | 不规则 211 | 不轨 212 | 不好吃 213 | 不好看 214 | 不好客 215 | 不好卖 216 | 不好使 217 | 不好听 218 | 不好用 219 | 不和 220 | 不和蔼 221 | 不和谐 222 | 不合法 223 | 不合格 224 | 不合理 225 | 不合逻辑 226 | 不合情理 227 | 不合时令 228 | 不合时宜 229 | 不合适 230 | 不合算 231 | 不合宜 232 | 不合语法 233 | 不划算 234 | 不济 235 | 不济事 236 | 不俭省 237 | 不健康 238 | 不洁 239 | 不谨慎 240 | 不近情理 241 | 不近人情 242 | 不尽 243 | 不尽如人意 244 | 不精确 245 | 不敬 246 | 不绝如缕 247 | 不均匀 248 | 不开化 249 | 不堪入耳 250 | 不堪入目 251 | 不堪设想 252 | 不堪一击 253 | 不堪造就 254 | 不科学 255 | 不可爱 256 | 不可补救 257 | 不可读 258 | 不可告人 259 | 不可更新 260 | 不可恢复 261 | 不可降解 262 | 不可接受 263 | 不可救药 264 | 不可理喻 265 | 不可逆转 266 | 不可容忍 267 | 不可收拾 268 | 不可挽回 269 | 不可行 270 | 不可一世 271 | 不可逾越 272 | 不客气 273 | 不宽容 274 | 不郎不秀 275 | 不冷不热 276 | 不理智 277 | 不礼貌 278 | 不利 279 | 不利于健康 280 | 不力 281 | 不良 282 | 不灵敏 283 | 不灵巧 284 | 不流行 285 | 不伦不类 286 | 不美观 287 | 不妙 288 | 不民主 289 | 不明不白 290 | 不明显 291 | 不明智 292 | 不名一文 293 | 不名誉 294 | 不能解救 295 | 不能容忍 296 | 不宁 297 | 不努力 298 | 不平 299 | 不平等 300 | 不平衡 301 | 不起眼 302 | 不起眼儿 303 | 不巧 304 | 不切实际 305 | 不清不白 306 | 不清不楚 307 | 不清楚 308 | 不清洁 309 | 不确切 310 | 不仁 311 | 不仁不义 312 | 不人道 313 | 不三不四 314 | 不善 315 | 不善交际 316 | 不善交谈 317 | 不甚重要 318 | 不慎 319 | 不胜 320 | 不是味儿 321 | 不是滋味儿 322 | 不适当 323 | 不适宜 324 | 不适应 325 | 不适于居住 326 | 不受欢迎 327 | 不熟练 328 | 不疼不痒 329 | 不体面 330 | 不痛不痒 331 | 不透明 332 | 不透气 333 | 不妥 334 | 不为人知 335 | 不卫生 336 | 不文明 337 | 不文雅 338 | 不稳定 339 | 不问青红皂白 340 | 不问三七二十一 341 | 不问是非情由 342 | 不显眼 343 | 不现实 344 | 不相适应 345 | 不祥 346 | 不详 347 | 不详尽 348 | 不像话 349 | 不消化 350 | 不孝 351 | 不肖 352 | 不协调 353 | 不兴 354 | 不行 355 | 不幸 356 | 不修边幅 357 | 不学无术 358 | 不逊 359 | 不雅 360 | 不雅观 361 | 不雅致 362 | 不要紧 363 | 不一致 364 | 不宜 365 | 不宜居住 366 | 不宜说出口 367 | 不易 368 | 不友好 369 | 不友善 370 | 不择手段 371 | 不真诚 372 | 不真实 373 | 不贞洁 374 | 不正常 375 | 不正当 376 | 不正派 377 | 不正直 378 | 不值得羡慕 379 | 不值一文 380 | 不中用 381 | 不重要 382 | 不周 383 | 不周到 384 | 不注意 385 | 不着边际 386 | 不着调 387 | 不足道 388 | 不足挂齿 389 | 不足轻重 390 | 不足取 391 | 不足为外人道 392 | 不足为训 393 | 不羁 394 | 不稂不莠 395 | 不虔诚 396 | 才疏学浅 397 | 财迷心窍 398 | 残 399 | 残败 400 | 残暴 401 | 残毒 402 | 残酷 403 | 残酷无情 404 | 残虐 405 | 残破 406 | 残破不全 407 | 残缺 408 | 残缺不全 409 | 残忍 410 | 残损 411 | 惨 412 | 惨不忍睹 413 | 惨淡 414 | 惨毒 415 | 惨绝人寰 416 | 惨苦 417 | 惨厉 418 | 惨烈 419 | 惨痛 420 | 惨无人道 421 | 苍白 422 | 苍白无力 423 | 苍凉 424 | 苍茫 425 | 操切 426 | 糙 427 | 草 428 | 草草 429 | 草荒 430 | 草率 431 | 草木皆兵 432 | 策略 433 | 策略性 434 | 差 435 | 差点儿 436 | 差劲 437 | 差可 438 | 豺狼成性 439 | 豺狼当道 440 | 缠手 441 | 颤颤巍巍 442 | 颤颤悠悠 443 | 颤巍巍 444 | 猖 445 | 猖狂 446 | 长长短短 447 | 
长篇大论 448 | 长篇累牍 449 | 长线 450 | 超标 451 | 超常 452 | 超然 453 | 超重 454 | 朝不保夕 455 | 朝不谋夕 456 | 朝秦暮楚 457 | 朝三暮四 458 | 潮 459 | 吵吵闹闹 460 | 吵人 461 | 沉闷 462 | 沉痛 463 | 沉滞 464 | 陈 465 | 陈腐 466 | 陈旧 467 | 成事不足,败事有余 468 | 逞性 469 | 逞性子 470 | 吃不开 471 | 吃劲 472 | 吃力 473 | 吃重 474 | 痴 475 | 痴痴 476 | 痴呆 477 | 痴呆呆 478 | 痴傻 479 | 痴愚 480 | 迟钝 481 | 侈 482 | 侈靡 483 | 侈糜 484 | 赤地千里 485 | 赤裸裸淫秽 486 | 赤贫 487 | 充满危机 488 | 冲昏头脑 489 | 丑 490 | 丑恶 491 | 丑陋 492 | 臭 493 | 臭不可闻 494 | 臭哄哄 495 | 臭烘烘 496 | 臭名远扬 497 | 臭名昭彰 498 | 臭名昭著 499 | 臭气冲天 500 | 臭气熏天 501 | 臭味 502 | 初出茅庐 503 | 出手阔气 504 | 触目惊心 505 | 穿不出去 506 | 穿不得 507 | 串秧儿 508 | 疮痍满目 509 | 蠢 510 | 蠢笨 511 | 蠢头蠢脑 512 | 刺鼻 513 | 刺耳 514 | 刺眼 515 | 次 516 | 次等 517 | 从动 518 | 从心所欲 519 | 从严 520 | 从重 521 | 粗 522 | 粗暴 523 | 粗笨 524 | 粗鄙 525 | 粗糙 526 | 粗放 527 | 粗拉 528 | 粗里粗气 529 | 粗劣 530 | 粗陋 531 | 粗鲁 532 | 粗率 533 | 粗蛮 534 | 粗莽 535 | 粗浅 536 | 粗涩 537 | 粗手笨脚 538 | 粗疏 539 | 粗俗 540 | 粗线条 541 | 粗心 542 | 粗心大意 543 | 粗野 544 | 粗枝大叶 545 | 粗制滥造 546 | 粗重 547 | 粗拙 548 | 粗犷 549 | 促狭 550 | 脆弱 551 | 村气 552 | 村野 553 | 寸草不生 554 | 错 555 | 错乱 556 | 错误 557 | 错误百出 558 | 错杂 559 | 错综 560 | 错综复杂 561 | 大 562 | 大错而特错 563 | 大错特错 564 | 大大咧咧 565 | 大而笨拙 566 | 大而化之 567 | 大而无当 568 | 大海捞针 569 | 大面额 570 | 大谬不然 571 | 大手大脚 572 | 大肆 573 | 大摇大摆 574 | 大意 575 | 大咧咧 576 | 呆 577 | 呆板 578 | 呆笨 579 | 呆痴 580 | 呆呆 581 | 呆钝 582 | 呆气 583 | 呆傻 584 | 呆头呆脑 585 | 歹 586 | 歹毒 587 | 带有敌意 588 | 殆 589 | 怠惰 590 | 单 591 | 单薄 592 | 单调 593 | 单调枯燥 594 | 单弱 595 | 胆怯 596 | 胆小 597 | 胆小怕事 598 | 胆小如鼠 599 | 淡 600 | 淡薄 601 | 淡淡 602 | 淡而无味 603 | 淡漠 604 | 淡然 605 | 诞 606 | 荡 607 | 刀光剑影 608 | 蹈常袭故 609 | 倒胃口 610 | 道德败坏 611 | 道貌岸然 612 | 德行 613 | 德性 614 | 得寸进尺 615 | 得陇望蜀 616 | 得鱼忘筌 617 | 灯红酒绿 618 | 灯火阑珊 619 | 等而下之 620 | 等外 621 | 等因奉此 622 | 低 623 | 低卑 624 | 低标准 625 | 低层 626 | 低档 627 | 低等 628 | 低端 629 | 低级 630 | 低贱 631 | 低劣 632 | 低迷 633 | 低能 634 | 低人一等 635 | 低三下四 636 | 低声下气 637 | 低俗 638 | 低下 639 | 低效 640 | 低效能 641 | 低值 642 | 低智 643 | 低质 644 | 滴里嘟噜 645 | 敌对 646 | 地位低下 647 | 地下 648 | 地狱般 649 | 颠倒 650 | 颠连 651 | 颠三倒四 652 | 凋敝 653 | 刁 654 | 刁恶 655 | 刁悍 656 | 刁滑 657 | 刁赖 658 | 刁蛮 659 | 刁钻 660 | 刁钻古怪 661 | 吊儿郎当 662 | 调皮 663 | 鼎沸 664 | 丢魂 665 | 丢脸 666 | 丢三落四 667 | 东倒西歪 668 | 冬烘 669 | 动荡 670 | 动荡不安 671 | 动魄惊心 672 | 动作迟顿 673 | 毒 674 | 毒辣 675 | 独裁 676 | 独断 677 | 度量小 678 | 短浅 679 | 短视 680 | 钝 681 | 多变 682 | 多病 683 | 多事 684 | 多义 685 | 多余 686 | 惰 687 | 惰性 688 | 讹 689 | 恶 690 | 恶毒 691 | 恶贯满盈 692 | 恶狠狠 693 | 恶劣 694 | 恶煞煞 695 | 恶心 696 | 恶浊 697 | 饿殍遍野 698 | 耳生 699 | 二把刀 700 | 二手 701 | 二五眼 702 | 发狂 703 | 发腻 704 | 发育不全 705 | 乏 706 | 乏味 707 | 翻手为云,覆手为雨 708 | 翻云覆雨 709 | 繁复 710 | 繁乱 711 | 繁难 712 | 繁冗 713 | 繁琐 714 | 繁芜 715 | 繁杂 716 | 繁重 717 | 繁缛 718 | 烦 719 | 烦难 720 | 烦冗 721 | 烦琐 722 | 烦嚣 723 | 反 724 | 反常 725 | 反对称 726 | 反反复复 727 | 反复无常 728 | 反面 729 | 反叛 730 | 反社会 731 | 犯有罪行 732 | 饭桶 733 | 泛 734 | 泛泛 735 | 放诞 736 | 放荡 737 | 放荡不羁 738 | 放浪 739 | 放肆 740 | 放纵 741 | 菲 742 | 菲薄 743 | 非 744 | 非法 745 | 非分 746 | 非婚生 747 | 非礼 748 | 非人 749 | 非生产性 750 | 非正常 751 | 非正统 752 | 非正义 753 | 废 754 | 废弛 755 | 废旧 756 | 废物 757 | 沸沸扬扬 758 | 费 759 | 费工夫 760 | 费功夫 761 | 费劲 762 | 费力 763 | 费时 764 | 费事 765 | 纷 766 | 纷繁 767 | 纷乱 768 | 纷扰 769 | 纷杂 770 | 封闭 771 | 封闭式 772 | 封闭型 773 | 封建 774 | 锋芒毕露 775 | 风吹日晒 776 | 风刀霜剑 777 | 风风火火 778 | 风流 779 | 风骚 780 | 风声鹤唳 781 | 风雨飘摇 782 | 疯疯癫癫 783 | 疯狂 784 | 疯狂般 785 | 疯癫癫 786 | 否 787 | 否定 788 | 肤泛 789 | 肤皮潦草 790 | 肤浅 791 | 浮 792 | 浮泛 793 | 浮光掠影 794 | 浮滑 795 | 浮皮蹭痒 796 | 浮漂 797 | 浮浅 798 | 浮躁 799 | 浮噪 800 | 腐败 801 | 腐败堕落 802 | 腐臭 803 | 腐恶 804 | 腐化 805 | 腐化堕落 806 | 腐旧 807 | 腐烂 808 | 腐朽 809 | 腐朽没落 810 | 覆雨翻云 811 | 复 812 | 复合 813 | 复合式 814 | 复合型 815 | 复杂 816 | 复杂多变 817 | 傅会 818 | 负 819 | 
负面 820 | 富余 821 | 附会 822 | 嘎 823 | 该死 824 | 概念化 825 | 干 826 | 干巴 827 | 干巴巴 828 | 干瘪 829 | 干瘪瘪 830 | 干干巴巴 831 | 干燥 832 | 赶尽杀绝 833 | 刚愎 834 | 刚愎自用 835 | 高昂 836 | 高傲 837 | 高不成,低不就 838 | 高不成低不就 839 | 高成本 840 | 高价 841 | 高价位 842 | 高难 843 | 高难度 844 | 高压 845 | 疙疙瘩瘩 846 | 疙里疙瘩 847 | 隔靴搔痒 848 | 勾心斗角 849 | 苟且 850 | 狗眼看人 851 | 垢 852 | 够呛 853 | 够戗 854 | 孤 855 | 孤傲 856 | 孤傲不群 857 | 孤单 858 | 孤单单 859 | 孤独 860 | 孤孤单单 861 | 孤寡 862 | 孤寂 863 | 孤立 864 | 孤立无援 865 | 孤零零 866 | 孤陋寡闻 867 | 孤僻 868 | 古怪 869 | 古旧 870 | 古里古怪 871 | 固定不变 872 | 固执 873 | 寡 874 | 寡淡 875 | 寡断 876 | 寡了叭叽 877 | 寡情 878 | 寡味 879 | 寡言 880 | 寡言少语 881 | 挂漏 882 | 挂名 883 | 挂一漏万 884 | 乖谬 885 | 乖僻 886 | 乖张 887 | 乖剌 888 | 乖戾 889 | 怪里怪气 890 | 怪僻 891 | 官僚 892 | 官僚主义 893 | 光怪陆离 894 | 鬼 895 | 鬼鬼祟祟 896 | 鬼计多端 897 | 鬼头鬼脑 898 | 诡 899 | 诡计多端 900 | 诡秘 901 | 诡诈 902 | 诡谲 903 | 贵 904 | 过当 905 | 过分简单化 906 | 过分拥挤 907 | 过河拆桥 908 | 过了气 909 | 过气 910 | 过桥抽板 911 | 过时 912 | 哈喇 913 | 孩子气 914 | 海底捞月 915 | 海底捞针 916 | 害 917 | 骇人听闻 918 | 憨 919 | 含含糊糊 920 | 含含混混 921 | 含糊 922 | 含糊不清 923 | 含糊其辞 924 | 含糊其词 925 | 含混 926 | 含混不清 927 | 含蓄 928 | 涵蓄 929 | 寒 930 | 寒苦 931 | 寒素 932 | 寒酸 933 | 寒微 934 | 寒伧 935 | 寒碜 936 | 悍 937 | 悍然 938 | 豪 939 | 豪侈 940 | 豪横 941 | 豪强 942 | 豪奢 943 | 毫不客气 944 | 毫不留情 945 | 毫无价值 946 | 毫无目标 947 | 毫无意义 948 | 毫无用处 949 | 好不容易 950 | 好容易 951 | 好事多磨 952 | 黑 953 | 黑暗 954 | 黑沉沉 955 | 黑灯瞎火 956 | 黑洞洞 957 | 黑咕隆咚 958 | 黑乎乎 959 | 黑茫茫 960 | 黑蒙蒙 961 | 黑漆寥光 962 | 黑漆漆 963 | 黑森森 964 | 黑心 965 | 黑心肠 966 | 黑黝黝 967 | 黑黢黢 968 | 狠 969 | 狠毒 970 | 狠劲 971 | 狠心 972 | 横 973 | 横暴 974 | 横加 975 | 横蛮无理 976 | 横七竖八 977 | 哄然 978 | 猴 979 | 后患无穷 980 | 后进 981 | 呼幺喝六 982 | 胡 983 | 胡里胡涂 984 | 胡乱 985 | 胡子拉茬 986 | 胡子拉碴 987 | 糊糊涂涂 988 | 糊里糊涂 989 | 糊涂 990 | 虎踞龙蟠 991 | 虎头蛇尾 992 | 花 993 | 花插着 994 | 花搭着 995 | 花花搭搭 996 | 花里胡哨 997 | 花钱浪费 998 | 花拳绣腿 999 | 花天酒地 1000 | 花心 1001 | 哗然 1002 | 华 1003 | 华而不实 1004 | 猾 1005 | 滑 1006 | 滑头 1007 | 滑头滑脑 1008 | 怀着恶意 1009 | 坏 1010 | 坏脾气 1011 | 坏人当道 1012 | 幻 1013 | 幻异 1014 | 荒 1015 | 荒诞 1016 | 荒诞不经 1017 | 荒诞派 1018 | 荒诞无稽 1019 | 荒废 1020 | 荒寂 1021 | 荒凉 1022 | 荒乱 1023 | 荒落 1024 | 荒谬 1025 | 荒谬绝伦 1026 | 荒漠 1027 | 荒僻 1028 | 荒弃 1029 | 荒疏 1030 | 荒唐 1031 | 荒唐无稽 1032 | 荒无人烟 1033 | 荒芜 1034 | 荒淫 1035 | 荒淫无耻 1036 | 荒淫无度 1037 | 荒瘠 1038 | 黄色 1039 | 晃晃悠悠 1040 | 晃悠悠 1041 | 恍恍惚惚 1042 | 恍惚 1043 | 谎 1044 | 灰暗 1045 | 灰沉沉 1046 | 灰溜溜 1047 | 灰茫茫 1048 | 灰蒙蒙 1049 | 灰色 1050 | 灰头灰脸 1051 | 灰头土脸 1052 | 灰秃秃 1053 | 灰朦朦 1054 | 慧黠 1055 | 晦 1056 | 晦暗 1057 | 晦涩 1058 | 晦冥 1059 | 晦暝 1060 | 秽 1061 | 秽恶 1062 | 秽乱 1063 | 秽土 1064 | 秽亵 1065 | 会来事 1066 | 昏 1067 | 昏暗 1068 | 昏沉 1069 | 昏黑 1070 | 昏乱 1071 | 昏昧 1072 | 昏天黑地 1073 | 昏头昏脑 1074 | 昏庸 1075 | 昏愦 1076 | 昏聩 1077 | 婚外 1078 | 浑 1079 | 浑浑噩噩 1080 | 浑头浑脑 1081 | 浑浊 1082 | 浑噩 1083 | 混 1084 | 混合 1085 | 混混沌沌 1086 | 混交 1087 | 混乱 1088 | 混淆不清 1089 | 混血 1090 | 混账 1091 | 混浊 1092 | 混沌 1093 | 活动 1094 | 火暴 1095 | 火爆 1096 | 祸不单行 1097 | 祸从天降 1098 | 机变 1099 | 机械 1100 | 机械式 1101 | 机械性 1102 | 畸 1103 | 畸轻畸重 1104 | 畸形 1105 | 积满灰尘 1106 | 积重难返 1107 | 鸡零狗碎 1108 | 鸡毛蒜皮 1109 | 鸡犬不留 1110 | 鸡犬不宁 1111 | 棘手 1112 | 急不可待 1113 | 急功近利 1114 | 急切 1115 | 急性子 1116 | 急于 1117 | 急躁 1118 | 疾言厉色 1119 | 挤 1120 | 挤巴 1121 | 挤得水泄不通 1122 | 挤得要命 1123 | 挤挤插插 1124 | 挤满 1125 | 寂 1126 | 寂寥 1127 | 寂寞 1128 | 忌刻 1129 | 夹七夹八 1130 | 家长式 1131 | 家贫如洗 1132 | 家徒壁立 1133 | 家徒四壁 1134 | 假 1135 | 假冒 1136 | 假模假式 1137 | 假仁假义 1138 | 假想 1139 | 假惺惺 1140 | 假意 1141 | 假造 1142 | 假正经 1143 | 假装神圣 1144 | 价高 1145 | 价格不菲 1146 | 价格高昂 1147 | 架空 1148 | 尖刻 1149 | 尖酸 1150 | 尖酸刻薄 1151 | 尖嘴薄舌 1152 | 尖嘴猴腮 1153 | 间不容发 1154 | 间杂 1155 | 肩摩毂击 1156 | 艰 1157 | 艰巨 1158 | 艰苦 1159 | 艰苦卓绝 1160 | 艰难 1161 | 艰难曲折 1162 | 艰难险阻 1163 | 艰涩 1164 | 艰深 1165 | 艰危 1166 | 艰辛 1167 | 奸 
1168 | 奸刁 1169 | 奸恶 1170 | 奸猾 1171 | 奸险 1172 | 奸邪 1173 | 奸诈 1174 | 奸佞 1175 | 简单 1176 | 简陋 1177 | 简慢 1178 | 贱 1179 | 见不得人 1180 | 见风使舵 1181 | 见风转舵 1182 | 见识短浅 1183 | 见异思迁 1184 | 剑拔弩张 1185 | 僵 1186 | 僵化 1187 | 僵硬 1188 | 胶柱鼓瑟 1189 | 浇薄 1190 | 浇漓 1191 | 骄 1192 | 骄傲 1193 | 骄傲自满 1194 | 骄横 1195 | 骄慢 1196 | 骄气 1197 | 骄人 1198 | 骄奢淫逸 1199 | 骄纵 1200 | 骄矜 1201 | 娇 1202 | 娇痴 1203 | 娇贵 1204 | 娇憨 1205 | 娇嫩 1206 | 娇气 1207 | 娇弱 1208 | 娇生惯养 1209 | 矫情 1210 | 矫情造作 1211 | 矫揉造作 1212 | 侥 1213 | 狡 1214 | 狡猾 1215 | 狡计多端 1216 | 狡兔三窟 1217 | 狡诈 1218 | 狡狯 1219 | 狡黠 1220 | 揭不开锅 1221 | 竭蹶 1222 | 洁身自好 1223 | 结结巴巴 1224 | 斤斤计较 1225 | 金刚努目 1226 | 紧 1227 | 紧巴 1228 | 紧巴巴 1229 | 近视 1230 | 荆棘载途 1231 | 惊爆 1232 | 惊人 1233 | 惊天动地 1234 | 惊险 1235 | 精力枯竭 1236 | 精神不振 1237 | 精神溜号 1238 | 经济拮据 1239 | 经验不足 1240 | 静僻 1241 | 净余 1242 | 窘 1243 | 窘促 1244 | 窘急 1245 | 窘困 1246 | 窘迫 1247 | 窘涩 1248 | 旧 1249 | 旧式 1250 | 拘 1251 | 拘礼 1252 | 拘执 1253 | 狙 1254 | 拒人于千里之外 1255 | 剧毒 1256 | 倔 1257 | 倔强 1258 | 倔头倔脑 1259 | 倔犟 1260 | 绝 1261 | 绝情 1262 | 峻 1263 | 开小差 1264 | 坎坷 1265 | 坎坷不平 1266 | 看风使舵 1267 | 糠 1268 | 亢 1269 | 靠不住 1270 | 苛 1271 | 苛刻 1272 | 磕磕绊绊 1273 | 磕头碰脑 1274 | 可悲 1275 | 可鄙 1276 | 可怖 1277 | 可耻 1278 | 可恶 1279 | 可骇 1280 | 可恨 1281 | 可惊 1282 | 可怜 1283 | 可怕 1284 | 可叹 1285 | 可有可无 1286 | 可憎 1287 | 刻板 1288 | 刻薄 1289 | 刻毒 1290 | 刻舟求剑 1291 | 坑坑洼洼 1292 | 坑洼 1293 | 坑洼不平 1294 | 空 1295 | 空洞 1296 | 空洞洞 1297 | 空洞无聊 1298 | 空洞无物 1299 | 空乏 1300 | 空泛 1301 | 空幻 1302 | 空空洞洞 1303 | 空落落 1304 | 空头 1305 | 空虚 1306 | 空中楼阁 1307 | 恐怖 1308 | 抠 1309 | 抠门儿 1310 | 抠搜 1311 | 抠唆 1312 | 口蜜腹剑 1313 | 口是心非 1314 | 口头上 1315 | 枯 1316 | 枯寂 1317 | 枯涩 1318 | 枯燥 1319 | 枯燥乏味 1320 | 枯燥无味 1321 | 枯槁 1322 | 苦 1323 | 苦不唧 1324 | 苦口 1325 | 苦苦 1326 | 苦涩 1327 | 酷 1328 | 酷烈 1329 | 酷虐 1330 | 夸诞 1331 | 狂 1332 | 狂傲 1333 | 狂暴 1334 | 狂荡 1335 | 狂妄 1336 | 狂妄自大 1337 | 狂躁 1338 | 狂悖 1339 | 狂恣 1340 | 困顿 1341 | 困窘 1342 | 困苦 1343 | 困难 1344 | 困难重重 1345 | 困人 1346 | 阔绰 1347 | 阔气 1348 | 拉忽 1349 | 拉拉杂杂 1350 | 拉杂 1351 | 辣 1352 | 辣手 1353 | 来路不明 1354 | 来之不易 1355 | 赖 1356 | 赖皮 1357 | 懒 1358 | 懒到极点 1359 | 懒惰 1360 | 懒散 1361 | 烂 1362 | 滥 1363 | 狼狈 1364 | 狼狈不堪 1365 | 狼籍 1366 | 狼藉 1367 | 狼心狗肺 1368 | 浪 1369 | 浪荡 1370 | 劳而无功 1371 | 老 1372 | 老大难 1373 | 老掉牙 1374 | 老赶 1375 | 老虎屁股摸不得 1376 | 老奸巨猾 1377 | 老奸巨滑 1378 | 老辣 1379 | 老派 1380 | 老气 1381 | 老气横秋 1382 | 老弱病残 1383 | 老实 1384 | 老式 1385 | 老朽 1386 | 累卵 1387 | 累赘 1388 | 累牍连篇 1389 | 肋脦 1390 | 冷 1391 | 冷冰冰 1392 | 冷淡 1393 | 冷峻 1394 | 冷酷 1395 | 冷酷无情 1396 | 冷冷 1397 | 冷冷清清 1398 | 冷厉 1399 | 冷落 1400 | 冷门 1401 | 冷漠 1402 | 冷峭 1403 | 冷清 1404 | 冷清清 1405 | 冷若冰霜 1406 | 冷销 1407 | 冷血 1408 | 冷噤 1409 | 离索 1410 | 离题 1411 | 离心离德 1412 | 理亏 1413 | 理屈 1414 | 理屈词穷 1415 | 理由不充分 1416 | 里出外进 1417 | 厉 1418 | 厉害 1419 | 厉声 1420 | 利令智昏 1421 | 利已 1422 | 利欲熏心 1423 | 哩哩啦啦 1424 | 哩哩罗罗 1425 | 哩溜歪斜 1426 | 连篇累牍 1427 | 良莠不齐 1428 | 两面光 1429 | 两面三刀 1430 | 寥 1431 | 寥寂 1432 | 潦草 1433 | 了不得 1434 | 了不起 1435 | 烈 1436 | 烈性子 1437 | 劣 1438 | 劣等 1439 | 劣质 1440 | 劣中之劣 1441 | 鳞状 1442 | 凛 1443 | 凛凛 1444 | 凛然 1445 | 吝 1446 | 吝啬 1447 | 零 1448 | 零丁 1449 | 零零散散 1450 | 零零碎碎 1451 | 零乱 1452 | 零落 1453 | 零七八碎 1454 | 零散 1455 | 零碎 1456 | 零星 1457 | 伶仃 1458 | 凌乱 1459 | 凌杂 1460 | 令人不安 1461 | 令人齿冷 1462 | 令人恶心 1463 | 令人发指 1464 | 令人费解 1465 | 令人寒心 1466 | 令人敬畏 1467 | 令人困倦 1468 | 令人毛骨悚然 1469 | 令人恼火 1470 | 令人疲倦 1471 | 令人生气 1472 | 令人生厌 1473 | 令人讨厌 1474 | 令人厌恶 1475 | 令人厌倦 1476 | 令人遗憾 1477 | 令人折断腰 1478 | 令人窒息 1479 | 令人作呕 1480 | 溜号 1481 | 流里流气 1482 | 流气 1483 | 六亲不认 1484 | 娄 1485 | 漏洞百出 1486 | 陋 1487 | 鲁 1488 | 鲁钝 1489 | 鲁莽 1490 | 碌 1491 | 碌碌 1492 | 碌碌无为 1493 | 驴唇不对马嘴 1494 | 率 1495 | 率尔 1496 | 率然 1497 | 乱 1498 | 乱成一团 1499 | 乱纷纷 1500 | 乱哄哄 1501 | 乱烘烘 1502 | 乱乎 1503 | 
乱了营 1504 | 乱乱哄哄 1505 | 乱虐并生 1506 | 乱蓬蓬 1507 | 乱七八糟 1508 | 乱套 1509 | 乱腾 1510 | 乱腾腾 1511 | 乱杂 1512 | 乱糟糟 1513 | 乱真 1514 | 乱嘈嘈 1515 | 落后 1516 | 落落寡合 1517 | 落寞 1518 | 落市 1519 | 落俗套 1520 | 落套 1521 | 落拓 1522 | 落伍 1523 | 麻 1524 | 麻痹 1525 | 麻烦 1526 | 麻麻黑 1527 | 麻木 1528 | 麻木不仁 1529 | 马虎 1530 | 马马虎虎 1531 | 埋汰 1532 | 卖不掉 1533 | 卖不动 1534 | 蛮 1535 | 蛮不讲理 1536 | 蛮悍 1537 | 蛮横 1538 | 蛮横无理 1539 | 蛮荒 1540 | 满 1541 | 满脸横肉 1542 | 满目疮痍 1543 | 漫 1544 | 漫不经心 1545 | 漫不经意 1546 | 漫无边际 1547 | 漫无目标 1548 | 漫无目的 1549 | 漫漶 1550 | 谩 1551 | 茫 1552 | 茫茫 1553 | 茫茫然 1554 | 盲目 1555 | 盲人瞎马 1556 | 莽 1557 | 莽苍 1558 | 莽莽苍苍 1559 | 莽莽撞撞 1560 | 莽撞 1561 | 猫哭老鼠 1562 | 毛 1563 | 毛糙 1564 | 毛毛躁躁 1565 | 毛手毛脚 1566 | 毛头毛脑 1567 | 毛躁 1568 | 冒 1569 | 冒牌 1570 | 冒失 1571 | 冒险 1572 | 冒有风险 1573 | 貌似强大 1574 | 貌似真实的 1575 | 贸贸然 1576 | 贸然 1577 | 没边儿 1578 | 没出息 1579 | 没骨头 1580 | 没关系 1581 | 没好气 1582 | 没见过世面 1583 | 没教养 1584 | 没劲 1585 | 没理 1586 | 没礼貌 1587 | 没良心 1588 | 没两下子 1589 | 没轻没重 1590 | 没什么了不得 1591 | 没什么了不起 1592 | 没受过教育 1593 | 没头没脑 1594 | 没头脑 1595 | 没味 1596 | 没心没肺 1597 | 没心眼儿 1598 | 没意思 1599 | 没用 1600 | 没有教养 1601 | 没有礼貌 1602 | 没有头脑 1603 | 没有学问 1604 | 没有勇气 1605 | 媚俗 1606 | 闷 1607 | 闷气 1608 | 蒙昧 1609 | 蒙蒙 1610 | 蒙蒙胧胧 1611 | 蒙胧 1612 | 孟浪 1613 | 靡丽 1614 | 靡靡 1615 | 糜 1616 | 糜烂 1617 | 迷濛 1618 | 迷宫般 1619 | 迷糊 1620 | 迷离 1621 | 迷离扑朔 1622 | 迷离倘恍 1623 | 迷漫 1624 | 迷茫 1625 | 迷蒙 1626 | 迷蒙蒙 1627 | 迷迷糊糊 1628 | 迷迷茫茫 1629 | 迷迷蒙蒙 1630 | 迷迷怔怔 1631 | 弥天 1632 | 米珠薪桂 1633 | 秘 1634 | 秘密 1635 | 密 1636 | 密不透风 1637 | 绵里藏针 1638 | 绵软 1639 | 勉勉强强 1640 | 勉强 1641 | 面呈病色 1642 | 面黄肌瘦 1643 | 面目可憎 1644 | 面目狰狞 1645 | 面色蜡黄 1646 | 面生 1647 | 面无表情 1648 | 藐小 1649 | 渺 1650 | 渺茫 1651 | 渺渺 1652 | 渺然 1653 | 渺若烟云 1654 | 渺小 1655 | 灭绝人性 1656 | 明哲保身 1657 | 名不副实 1658 | 名过其实 1659 | 名义 1660 | 名义上 1661 | 名誉扫地 1662 | 命苦 1663 | 谬 1664 | 模糊 1665 | 模糊不清 1666 | 模棱两可 1667 | 摩肩接踵 1668 | 魔鬼般 1669 | 魔怔 1670 | 莫须有 1671 | 墨 1672 | 漠 1673 | 漠不关心 1674 | 漠漠 1675 | 漠然 1676 | 寞 1677 | 陌生 1678 | 暮气 1679 | 暮气沉沉 1680 | 暮色苍茫 1681 | 幕后 1682 | 木 1683 | 木雕泥塑 1684 | 木头木脑 1685 | 木讷 1686 | 目不识丁 1687 | 目光短浅 1688 | 目光如豆 1689 | 目光凶狠 1690 | 目空一切 1691 | 目无余子 1692 | 目中无人 1693 | 拿腔拿调 1694 | 拿腔作势 1695 | 奶声奶气 1696 | 男盗女娼 1697 | 难 1698 | 难吃 1699 | 难看 1700 | 难人 1701 | 难上加难 1702 | 难上难 1703 | 难说话 1704 | 难听 1705 | 难闻 1706 | 难相处 1707 | 难驯服 1708 | 难以 1709 | 难以沟通 1710 | 囊空如洗 1711 | 囊中羞涩 1712 | 闹 1713 | 闹得慌 1714 | 闹哄哄 1715 | 闹闹哄哄 1716 | 闹闹嚷嚷 1717 | 闹嚷嚷 1718 | 嫩 1719 | 泥沙俱下 1720 | 你死我活 1721 | 匿名 1722 | 腻 1723 | 腻人 1724 | 逆 1725 | 逆耳 1726 | 蔫不唧儿 1727 | 蔫儿坏 1728 | 蔫头耷脑 1729 | 拈轻怕重 1730 | 年久失修 1731 | 鸟尽弓藏 1732 | 狞 1733 | 狞恶 1734 | 凝滞 1735 | 泞 1736 | 牛 1737 | 牛气 1738 | 扭扭捏捏 1739 | 奴颜婢膝 1740 | 虐 1741 | 懦 1742 | 懦怯 1743 | 懦弱 1744 | 盘根错节 1745 | 盘陁 1746 | 庞杂 1747 | 旁若无人 1748 | 配不上 1749 | 蓬乱 1750 | 蓬散 1751 | 蓬首垢面 1752 | 蓬头垢面 1753 | 蓬头散发 1754 | 脾气暴 1755 | 脾气爆躁 1756 | 脾气坏 1757 | 脾气火暴 1758 | 脾气急躁 1759 | 皮 1760 | 皮毛 1761 | 皮相 1762 | 僻 1763 | 僻静 1764 | 偏 1765 | 偏激 1766 | 偏颇 1767 | 偏听偏信 1768 | 偏狭 1769 | 偏斜 1770 | 偏心 1771 | 偏心眼 1772 | 片断 1773 | 片面 1774 | 骗人 1775 | 漂浮 1776 | 贫 1777 | 贫寒 1778 | 贫苦 1779 | 贫困 1780 | 贫穷 1781 | 贫瘠 1782 | 平白 1783 | 平白无故 1784 | 平淡 1785 | 平淡无奇 1786 | 平淡无味 1787 | 平铺直叙 1788 | 平铺直序 1789 | 凭白无故 1790 | 凭空 1791 | 坡 1792 | 泼 1793 | 泼辣 1794 | 婆婆妈妈 1795 | 破 1796 | 破败 1797 | 破坏性 1798 | 破旧 1799 | 破烂不堪 1800 | 破陋 1801 | 扑朔迷离 1802 | 铺张 1803 | 铺张浪费 1804 | 欺诈性 1805 | 七零八落 1806 | 凄 1807 | 凄惨 1808 | 凄楚 1809 | 凄寒 1810 | 凄寂 1811 | 凄苦 1812 | 凄冷 1813 | 凄厉 1814 | 凄凉 1815 | 凄迷 1816 | 凄怆 1817 | 漆黑 1818 | 漆黑一团 1819 | 其貌不扬 1820 | 奇丑无比 1821 | 奇形怪状 1822 | 崎 1823 | 崎岖 1824 | 崎岖不平 1825 | 起绉 1826 | 起褶子 1827 | 岂有此理 1828 | 气粗 1829 | 气闷 1830 | 气盛 1831 | 气势汹汹 1832 | 气壮如牛 
1833 | 千变万化 1834 | 千疮百孔 1835 | 千金一掷 1836 | 千钧一发 1837 | 千篇一律 1838 | 前呼后拥 1839 | 潜 1840 | 浅 1841 | 浅薄 1842 | 浅尝辄止 1843 | 浅陋 1844 | 欠妥 1845 | 欠完善 1846 | 欠周到 1847 | 强 1848 | 强暴 1849 | 强横 1850 | 强行 1851 | 强制 1852 | 强制性 1853 | 巧 1854 | 巧黠 1855 | 翘尾巴 1856 | 峭 1857 | 峭直 1858 | 怯 1859 | 怯懦 1860 | 怯然 1861 | 怯弱 1862 | 怯生生 1863 | 窃 1864 | 禽兽不如 1865 | 轻 1866 | 轻薄 1867 | 轻淡 1868 | 轻浮 1869 | 轻贱 1870 | 轻狂 1871 | 轻率 1872 | 轻描淡写 1873 | 轻易 1874 | 轻佻 1875 | 倾斜 1876 | 清淡 1877 | 清高 1878 | 清寒 1879 | 清苦 1880 | 清冷 1881 | 清贫 1882 | 穷 1883 | 穷乏 1884 | 穷极潦倒 1885 | 穷苦 1886 | 穷困 1887 | 穷困潦倒 1888 | 穷奢极侈 1889 | 穷奢极欲 1890 | 穷酸 1891 | 穷途潦倒 1892 | 穷途末路 1893 | 穷凶极恶 1894 | 穷匮 1895 | 囚首垢面 1896 | 区区 1897 | 曲曲折折 1898 | 曲折 1899 | 屈才 1900 | 屈理 1901 | 犬牙交错 1902 | 缺 1903 | 缺德 1904 | 缺乏才智 1905 | 缺乏教养 1906 | 缺乏绅士风度 1907 | 缺乏幽默 1908 | 缺心眼 1909 | 缺心眼儿 1910 | 群魔乱舞 1911 | 攘攘 1912 | 扰扰 1913 | 绕脖子 1914 | 人不为己,天诛地灭 1915 | 人不知,鬼不觉 1916 | 人声鼎沸 1917 | 人声嘈杂 1918 | 人头攒动 1919 | 人为财死,鸟为食亡 1920 | 任重道远 1921 | 任纵 1922 | 认死理 1923 | 认死理儿 1924 | 冗 1925 | 冗长 1926 | 冗余 1927 | 冗赘 1928 | 柔弱 1929 | 肉 1930 | 肉了叭叽 1931 | 肉麻 1932 | 如临大敌 1933 | 如临深渊 1934 | 如履薄冰 1935 | 乳臭未干 1936 | 软 1937 | 软绵绵 1938 | 软弱 1939 | 软弱无力 1940 | 若明若暗 1941 | 若隐若现 1942 | 弱 1943 | 弱不禁风 1944 | 弱不胜衣 1945 | 弱势 1946 | 弱小 1947 | 弱智 1948 | 三天打鱼两天晒网 1949 | 散 1950 | 散乱 1951 | 散漫 1952 | 嗓子不好 1953 | 丧尽天良 1954 | 丧心病狂 1955 | 骚 1956 | 骚乱性 1957 | 色厉内荏 1958 | 色迷迷 1959 | 色情 1960 | 涩 1961 | 涩苦 1962 | 涩滞 1963 | 森 1964 | 杀气腾腾 1965 | 杀人不见血 1966 | 杀人不眨眼 1967 | 杀人如麻 1968 | 傻 1969 | 傻呵呵 1970 | 傻乎乎 1971 | 傻里瓜唧 1972 | 傻里傻气 1973 | 傻头傻脑 1974 | 山南海北 1975 | 山穷水尽 1976 | 闪烁 1977 | 伤风败俗 1978 | 伤脑筋 1979 | 伤天害理 1980 | 伤心惨目 1981 | 少不更事 1982 | 奢 1983 | 奢侈 1984 | 奢华 1985 | 奢靡 1986 | 奢糜 1987 | 蛇蝎心肠 1988 | 涉世不深 1989 | 身无分文 1990 | 深重 1991 | 神不知,鬼不觉 1992 | 神不知鬼不觉 1993 | 神秘 1994 | 神气活现 1995 | 神气十足 1996 | 神神秘秘 1997 | 神志委靡 1998 | 声名狼藉 1999 | 声色俱厉 2000 | 生 2001 | 生拉硬拽 2002 | 生涩 2003 | 生疏 2004 | 生硬 2005 | 盛气凌人 2006 | 剩余 2007 | 失常 2008 | 失当 2009 | 失检 2010 | 失礼 2011 | 失落 2012 | 失去理性 2013 | 失神 2014 | 失慎 2015 | 失实 2016 | 失宜 2017 | 十恶不赦 2018 | 十室九空 2019 | 什 2020 | 什锦 2021 | 食而不化 2022 | 食而不知其味 2023 | 食古不化 2024 | 实属不易 2025 | 使不得 2026 | 使人疲劳 2027 | 世故 2028 | 世情冷暖 2029 | 世俗 2030 | 世态炎凉 2031 | 誓不两立 2032 | 势不两立 2033 | 势利 2034 | 势利眼 2035 | 嗜杀成性 2036 | 嗜血 2037 | 嗜血成性 2038 | 恃才傲物 2039 | 手脚不干净 2040 | 手紧 2041 | 手生 2042 | 手头紧 2043 | 手无缚鸡之力 2044 | 守旧 2045 | 守株待兔 2046 | 瘦 2047 | 瘦弱 2048 | 输理 2049 | 疏忽 2050 | 疏懒 2051 | 疏松 2052 | 书生气 2053 | 鼠胆 2054 | 鼠目寸光 2055 | 数不上 2056 | 数不着 2057 | 衰弱 2058 | 衰颓 2059 | 水火不相容 2060 | 水泄不通 2061 | 水性杨花 2062 | 水中捞月 2063 | 瞬息万变 2064 | 说不过去 2065 | 说来话长 2066 | 斯文扫地 2067 | 私 2068 | 私底下 2069 | 私密 2070 | 私下 2071 | 私下里 2072 | 私自 2073 | 死 2074 | 死板 2075 | 死板板 2076 | 死沉沉 2077 | 死脑筋 2078 | 死气沉沉 2079 | 死去活来 2080 | 死死 2081 | 死心塌地 2082 | 死心眼 2083 | 死心眼儿 2084 | 死性 2085 | 死一般 2086 | 死硬 2087 | 死有余辜 2088 | 肆 2089 | 肆无忌惮 2090 | 肆意 2091 | 四大皆空 2092 | 四面楚歌 2093 | 似 2094 | 似乎 2095 | 似是而非 2096 | 松垮 2097 | 松垮垮 2098 | 松散 2099 | 松散散 2100 | 松松垮垮 2101 | 耸人听闻 2102 | 酥 2103 | 酥软 2104 | 酥松 2105 | 俗 2106 | 俗气 2107 | 素不相识 2108 | 素昧平生 2109 | 肃 2110 | 肃杀 2111 | 酸 2112 | 酸不溜丢 2113 | 酸臭 2114 | 酸刻 2115 | 酸溜溜 2116 | 酸涩 2117 | 随便 2118 | 随风倒 2119 | 随风使舵 2120 | 随风转舵 2121 | 随随便便 2122 | 随心所欲 2123 | 碎 2124 | 祟 2125 | 损 2126 | 损人利己 2127 | 琐 2128 | 琐碎 2129 | 琐细 2130 | 琐屑 2131 | 索 2132 | 索然 2133 | 索然乏味 2134 | 索然寡味 2135 | 索然无味 2136 | 所谓 2137 | 太随便 2138 | 太虚 2139 | 贪 2140 | 贪得无厌 2141 | 贪婪 2142 | 贪心 2143 | 贪心不足 2144 | 瘫软 2145 | 谈何容易 2146 | 唐突 2147 | 烫手 2148 | 淘 2149 | 淘气 2150 | 淘神 2151 | 讨厌 2152 | 特困 2153 | 特贫 2154 | 体力不支 2155 | 体弱 2156 | 体衰 2157 | 天昏地暗 2158 | 天南地北 2159 | 
天南海北 2160 | 天真 2161 | 恬淡 2162 | 腆 2163 | 挑逗性 2164 | 铁杆儿 2165 | 铁公鸡一毛不拔 2166 | 铁石心肠 2167 | 铁血 2168 | 听天由命 2169 | 偷 2170 | 偷工减料 2171 | 偷偷 2172 | 偷偷摸摸 2173 | 投机 2174 | 头脑空虚 2175 | 头痛 2176 | 秃 2177 | 徒 2178 | 徒劳 2179 | 徒劳无功 2180 | 徒劳无益 2181 | 徒然 2182 | 土 2183 | 土得掉渣 2184 | 土里土气 2185 | 土气 2186 | 土俗 2187 | 土头土脑 2188 | 兔死狗烹 2189 | 兔子不吃窝边草 2190 | 兔子尾巴长不了 2191 | 颓 2192 | 颓败 2193 | 颓废 2194 | 蜕化 2195 | 蜕化变质 2196 | 退化 2197 | 拖泥带水 2198 | 拖沓 2199 | 歪 2200 | 歪歪扭扭 2201 | 歪斜 2202 | 外面儿光 2203 | 外行 2204 | 顽 2205 | 顽钝 2206 | 顽梗 2207 | 顽固 2208 | 顽劣 2209 | 顽皮 2210 | 完全不重要 2211 | 万恶 2212 | 万花筒似 2213 | 万马齐喑 2214 | 万难 2215 | 枉 2216 | 枉费心机 2217 | 枉然 2218 | 望梅止渴 2219 | 忘恩负义 2220 | 忘情 2221 | 妄 2222 | 妄自尊大 2223 | 威 2224 | 威厉 2225 | 微不足道 2226 | 微贱 2227 | 微茫 2228 | 微末 2229 | 危殆 2230 | 危机四伏 2231 | 危机重重 2232 | 危急 2233 | 危如累卵 2234 | 危亡 2235 | 危险 2236 | 危在旦夕 2237 | 唯利是图 2238 | 唯我独尊 2239 | 惟利是图 2240 | 惟我独尊 2241 | 为人作嫁 2242 | 为人作嫁衣裳 2243 | 为所欲为 2244 | 萎靡不振 2245 | 委靡不振 2246 | 委琐 2247 | 伪 2248 | 伪善 2249 | 伪造 2250 | 未便 2251 | 未成熟 2252 | 未归类 2253 | 未揭露 2254 | 未老先衰 2255 | 未列计划 2256 | 未受过教育 2257 | 味道不好 2258 | 味同嚼蜡 2259 | 畏怯 2260 | 畏首畏尾 2261 | 文不对题 2262 | 文弱 2263 | 文恬武嬉 2264 | 紊 2265 | 紊乱 2266 | 问道于盲 2267 | 窝囊 2268 | 乌沉沉 2269 | 乌灯黑火 2270 | 乌洞洞 2271 | 乌七八糟 2272 | 乌漆墨黑 2273 | 乌涂 2274 | 乌托邦 2275 | 乌压压 2276 | 乌烟瘴气 2277 | 污 2278 | 污秽 2279 | 污七八糟 2280 | 污浊 2281 | 无伴 2282 | 无表情 2283 | 无补 2284 | 无补于事 2285 | 无常 2286 | 无诚意 2287 | 无道德观念 2288 | 无的放矢 2289 | 无动于衷 2290 | 无度 2291 | 无端 2292 | 无端端 2293 | 无法无天 2294 | 无根据 2295 | 无故 2296 | 无关大局 2297 | 无关宏旨 2298 | 无关紧要 2299 | 无关痛痒 2300 | 无光 2301 | 无光泽 2302 | 无规 2303 | 无涵养 2304 | 无稽 2305 | 无济于事 2306 | 无计划 2307 | 无记名 2308 | 无纪律 2309 | 无价值 2310 | 无教养 2311 | 无节制 2312 | 无可无不可 2313 | 无口才 2314 | 无赖 2315 | 无理 2316 | 无礼 2317 | 无力 2318 | 无聊 2319 | 无眉目 2320 | 无目的 2321 | 无能 2322 | 无能为力 2323 | 无凭无据 2324 | 无情 2325 | 无情无义 2326 | 无人过问 2327 | 无人问津 2328 | 无伤大雅 2329 | 无生气 2330 | 无实效 2331 | 无实质 2332 | 无所帮助 2333 | 无所不用其极 2334 | 无特色 2335 | 无望 2336 | 无味 2337 | 无谓 2338 | 无吸引力 2339 | 无限制 2340 | 无效 2341 | 无依无靠 2342 | 无意义 2343 | 无益 2344 | 无用 2345 | 无原则 2346 | 无缘无故 2347 | 无证据 2348 | 无知 2349 | 无中生有 2350 | 无助 2351 | 无足轻重 2352 | 芜 2353 | 芜杂 2354 | 武断 2355 | 雾里看花 2356 | 误 2357 | 误诊 2358 | 稀里糊涂 2359 | 稀松 2360 | 稀松平常 2361 | 喜新厌旧 2362 | 细 2363 | 细碎 2364 | 细小 2365 | 瞎 2366 | 下 2367 | 下乘 2368 | 下道儿 2369 | 下等 2370 | 下贱 2371 | 下流 2372 | 下品 2373 | 下三烂 2374 | 下三滥 2375 | 下作 2376 | 吓人 2377 | 纤弱 2378 | 险 2379 | 险毒 2380 | 险恶 2381 | 险峻 2382 | 险峭 2383 | 险象环生 2384 | 险要 2385 | 险诈 2386 | 险阻 2387 | 现行 2388 | 羡余 2389 | 香艳 2390 | 享乐 2391 | 向上倾斜 2392 | 向下倾斜 2393 | 象征性 2394 | 萧 2395 | 萧然 2396 | 萧瑟 2397 | 萧森 2398 | 萧疏 2399 | 萧索 2400 | 萧条 2401 | 萧飒 2402 | 嚣杂 2403 | 嚣张 2404 | 消极 2405 | 小 2406 | 小肚鸡肠 2407 | 小儿科 2408 | 小家子气 2409 | 小家子相 2410 | 小里小气 2411 | 小气 2412 | 小手小脚 2413 | 小小不言 2414 | 小心眼 2415 | 小心眼儿 2416 | 笑里藏刀 2417 | 效率很差 2418 | 携贰 2419 | 邪 2420 | 邪恶 2421 | 斜 2422 | 斜体 2423 | 斜歪 2424 | 卸磨杀驴 2425 | 懈怠 2426 | 辛 2427 | 辛苦 2428 | 辛酸 2429 | 辛辛苦苦 2430 | 心不在焉 2431 | 心粗 2432 | 心地狭窄 2433 | 心毒 2434 | 心浮 2435 | 心黑手辣 2436 | 心狠 2437 | 心狠手辣 2438 | 心口不一 2439 | 心切 2440 | 心如蛇蝎 2441 | 心如铁石 2442 | 心胸狭隘 2443 | 心胸狭窄 2444 | 心眼儿小 2445 | 心眼儿窄 2446 | 心眼小 2447 | 心猿意马 2448 | 星星点点 2449 | 腥 2450 | 腥臭 2451 | 腥臊 2452 | 腥膻 2453 | 形单影只 2454 | 形格势禁 2455 | 形同路人 2456 | 形同虚设 2457 | 形影相吊 2458 | 行不通 2459 | 行为不端 2460 | 行为不检 2461 | 性格内向 2462 | 性急 2463 | 性情急躁 2464 | 凶 2465 | 凶巴巴 2466 | 凶暴 2467 | 凶残 2468 | 凶毒 2469 | 凶恶 2470 | 凶悍 2471 | 凶狠 2472 | 凶横 2473 | 凶狂 2474 | 凶蛮 2475 | 凶猛 2476 | 凶煞 2477 | 凶顽 2478 | 凶险 2479 | 凶戾 2480 | 胸无城府 2481 | 胸无点墨 2482 | 熊 2483 | 虚 2484 | 虚诞 2485 | 虚浮 2486 | 虚幻 2487 | 虚假 
2488 | 虚空 2489 | 虚夸 2490 | 虚拟 2491 | 虚荣 2492 | 虚弱 2493 | 虚设 2494 | 虚妄 2495 | 虚伪 2496 | 虚无 2497 | 虚无飘渺 2498 | 虚无缥缈 2499 | 虚虚实实 2500 | 虚有其表 2501 | 虚诈 2502 | 絮 2503 | 絮叨 2504 | 絮聒 2505 | 喧 2506 | 喧天 2507 | 喧嚣 2508 | 喧杂 2509 | 喧噪 2510 | 悬 2511 | 悬乎 2512 | 悬空 2513 | 玄 2514 | 学究气 2515 | 学识浅薄 2516 | 学识谫陋 2517 | 雪上加霜 2518 | 血淋淋 2519 | 血腥 2520 | 血雨腥风 2521 | 牙碜 2522 | 亚 2523 | 烟雾弥漫 2524 | 烟雾腾腾 2525 | 严 2526 | 严加 2527 | 严峻 2528 | 严苛 2529 | 严酷 2530 | 严冷 2531 | 严厉 2532 | 严肃 2533 | 严重 2534 | 言不由衷 2535 | 言之无物 2536 | 眼巴巴 2537 | 眼光短浅 2538 | 眼皮子高 2539 | 眼皮子浅 2540 | 眼生 2541 | 衍 2542 | 扬长 2543 | 羊质虎皮 2544 | 阳奉阴违 2545 | 妖 2546 | 妖里妖气 2547 | 摇摇晃晃 2548 | 摇摇欲坠 2549 | 要不得 2550 | 野 2551 | 野鸡 2552 | 野蛮 2553 | 叶公好龙 2554 | 夜郎自大 2555 | 一把死拿 2556 | 一暴十寒 2557 | 一波三折 2558 | 一不小心 2559 | 一场空 2560 | 一成不变 2561 | 一触即溃 2562 | 一发千钧 2563 | 一锅粥 2564 | 一脸横肉 2565 | 一脸稚气 2566 | 一毛不拔 2567 | 一偏 2568 | 一贫如洗 2569 | 一钱不值 2570 | 一仍旧贯 2571 | 一手遮天 2572 | 一塌糊涂 2573 | 一团乱麻 2574 | 一团漆黑 2575 | 一团糟 2576 | 一文不名 2577 | 一文不值 2578 | 一窝蜂 2579 | 一无可取 2580 | 一无是处 2581 | 一无所长 2582 | 一无所有 2583 | 一言堂 2584 | 一掷千金 2585 | 依违 2586 | 依稀 2587 | 衣衫不整 2588 | 颐指气使 2589 | 疑难 2590 | 倚老卖老 2591 | 以怨报德 2592 | 易变 2593 | 易怒 2594 | 臆 2595 | 意马心猿 2596 | 义正词严 2597 | 溢价 2598 | 异常 2599 | 异形 2600 | 荫 2601 | 因循守旧 2602 | 殷 2603 | 殷切 2604 | 阴 2605 | 阴暗 2606 | 阴沉 2607 | 阴沉沉 2608 | 阴毒 2609 | 阴恶 2610 | 阴晦 2611 | 阴冷 2612 | 阴凄 2613 | 阴森 2614 | 阴森森 2615 | 阴损 2616 | 阴险 2617 | 阴险毒辣 2618 | 阴性 2619 | 阴阳怪气 2620 | 淫 2621 | 淫荡 2622 | 淫秽 2623 | 淫贱 2624 | 淫乱 2625 | 淫靡 2626 | 淫邪 2627 | 淫逸 2628 | 淫亵 2629 | 淫猥 2630 | 引起反感 2631 | 隐 2632 | 隐晦 2633 | 隐秘 2634 | 隐然 2635 | 隐身 2636 | 隐形 2637 | 隐性 2638 | 隐隐 2639 | 隐隐绰绰 2640 | 隐隐约约 2641 | 隐约 2642 | 应名儿 2643 | 影影绰绰 2644 | 硬 2645 | 硬气 2646 | 硬生生 2647 | 硬性 2648 | 拥挤 2649 | 拥挤不堪 2650 | 庸 2651 | 庸碌 2652 | 庸俗 2653 | 庸庸碌碌 2654 | 用不着 2655 | 幽 2656 | 幽暗 2657 | 幽晦 2658 | 幽幽 2659 | 幽冥 2660 | 幽黯 2661 | 优柔 2662 | 优柔寡断 2663 | 悠谬 2664 | 犹犹豫豫 2665 | 犹豫不决 2666 | 犹豫不前 2667 | 油 2668 | 油乎乎 2669 | 油滑 2670 | 油腻 2671 | 油腻腻 2672 | 油头滑脑 2673 | 油汪汪 2674 | 油脂麻花 2675 | 油渍渍 2676 | 有碍观瞻 2677 | 有弊 2678 | 有点旧 2679 | 有毒 2680 | 有毒性 2681 | 有害 2682 | 有名无实 2683 | 有难度 2684 | 有伤风化 2685 | 有失检点 2686 | 有失偏颇 2687 | 有失身分 2688 | 有始无终 2689 | 有恃无恐 2690 | 有头无尾 2691 | 有一搭没一搭 2692 | 有义务 2693 | 有罪 2694 | 幼稚 2695 | 迂 2696 | 迂腐 2697 | 迂阔 2698 | 迂拙 2699 | 于事无补 2700 | 愚 2701 | 愚笨 2702 | 愚不可及 2703 | 愚痴 2704 | 愚蠢 2705 | 愚钝 2706 | 愚陋 2707 | 愚鲁 2708 | 愚昧 2709 | 愚昧无知 2710 | 愚蒙 2711 | 愚傻 2712 | 愚顽 2713 | 愚妄 2714 | 愚拙 2715 | 余剩 2716 | 逾分 2717 | 鱼龙混杂 2718 | 鱼游釜中 2719 | 与虎谋皮 2720 | 与世隔绝 2721 | 与世无争 2722 | 语无伦次 2723 | 语焉不详 2724 | 羽毛未丰 2725 | 欲壑难填 2726 | 原始 2727 | 圆 2728 | 圆滑 2729 | 约略 2730 | 越轨 2731 | 越礼 2732 | 云遮雾障 2733 | 云谲波诡 2734 | 蕴藉 2735 | 晕头转向 2736 | 杂 2737 | 杂草丛生 2738 | 杂乱 2739 | 杂乱无章 2740 | 杂牌 2741 | 杂七杂八 2742 | 杂遝 2743 | 杂沓 2744 | 灾难性 2745 | 在困难中 2746 | 脏 2747 | 脏乎乎 2748 | 脏乱 2749 | 脏乱差 2750 | 脏兮兮 2751 | 糟 2752 | 糟糕 2753 | 凿空 2754 | 凿死理儿 2755 | 躁 2756 | 躁急 2757 | 躁狂 2758 | 造次 2759 | 造作 2760 | 贼 2761 | 贼溜溜 2762 | 贼眉鼠眼 2763 | 贼去关门 2764 | 贼头贼脑 2765 | 扎手 2766 | 轧 2767 | 窄 2768 | 张冠李戴 2769 | 张狂 2770 | 招致不幸 2771 | 照本宣科 2772 | 狰狞 2773 | 正色 2774 | 正颜厉色 2775 | 枝蔓 2776 | 支离 2777 | 支离破碎 2778 | 直呆呆 2779 | 直瞪瞪 2780 | 直盯盯 2781 | 直勾勾 2782 | 直愣愣 2783 | 执迷不悟 2784 | 执拗 2785 | 趾高气扬 2786 | 只顾自身利益 2787 | 只听楼梯响,不见人下来 2788 | 纸上谈兵 2789 | 纸醉金迷 2790 | 志大才疏 2791 | 智障 2792 | 稚气 2793 | 质次价高 2794 | 质量差 2795 | 滞 2796 | 滞背 2797 | 滞钝 2798 | 滞涩 2799 | 滞销 2800 | 窒闷 2801 | 重 2802 | 重沓 2803 | 众叛亲离 2804 | 皱 2805 | 皱巴 2806 | 皱巴巴 2807 | 皱皱巴巴 2808 | 竹篮打水 2809 | 竹篮子打水 2810 | 竹篮子打水一场空 2811 | 煮豆燃萁 2812 | 主观 2813 | 主观上 2814 | 讆 2815 | 专横 2816 | 专横跋扈 2817 | 
专制 2818 | 转移性 2819 | 装备不良 2820 | 装模作样 2821 | 装腔 2822 | 装腔作势 2823 | 装相 2824 | 装样子 2825 | 赘 2826 | 赘余 2827 | 捉襟见肘 2828 | 捉摸不定 2829 | 拙 2830 | 拙笨 2831 | 拙劣 2832 | 着三不着两 2833 | 浊 2834 | 子虚 2835 | 子虚乌有 2836 | 自傲 2837 | 自大 2838 | 自负 2839 | 自高自大 2840 | 自豪 2841 | 自命不凡 2842 | 自命清高 2843 | 自恃 2844 | 自私 2845 | 自私自利 2846 | 自相矛盾 2847 | 纵恣 2848 | 走神 2849 | 走油 2850 | 嘴尖 2851 | 醉翁之意不在酒 2852 | 最差 2853 | 最坏 2854 | 罪不容诛 2855 | 罪大恶极 2856 | 罪恶 2857 | 罪恶多端 2858 | 罪恶深重 2859 | 罪恶滔天 2860 | 罪恶昭彰 2861 | 罪恶昭著 2862 | 罪该万死 2863 | 罪孽深重 2864 | 左 2865 | 做作 2866 | 作势 2867 | 坐而论道 2868 | 坐井观天 2869 | 兀突 2870 | 孬 2871 | 噩 2872 | 卮 2873 | 孛 2874 | 啬 2875 | 啬刻 2876 | 厝火积薪 2877 | 赝 2878 | 剌 2879 | 剌戾 2880 | 剽悍 2881 | 罔 2882 | 伧 2883 | 伧俗 2884 | 佶屈聱牙 2885 | 侉 2886 | 佻 2887 | 俚 2888 | 俚俗 2889 | 倜然 2890 | 倥 2891 | 倥侗 2892 | 倥偬 2893 | 倨 2894 | 倨傲 2895 | 傥 2896 | 僭 2897 | 儇 2898 | 巽 2899 | 亵 2900 | 羸弱 2901 | 跅弛 2902 | 跅驰 2903 | 冥 2904 | 冥顽 2905 | 冥顽不化 2906 | 冥顽不灵 2907 | 冥冥 2908 | 讷 2909 | 诎 2910 | 诘屈聱牙 2911 | 谫 2912 | 谫陋 2913 | 谲 2914 | 谲诈 2915 | 阢 2916 | 阽 2917 | 刍 2918 | 堙 2919 | 艽 2920 | 芴 2921 | 苴 2922 | 茕 2923 | 茕茕 2924 | 茕茕孑立 2925 | 茕茕孑立,形影相吊 2926 | 荏 2927 | 荏弱 2928 | 萋迷 2929 | 迍邅 2930 | 瞢 2931 | 拗 2932 | 拮据 2933 | 吆三喝四 2934 | 吆五喝六 2935 | 咄咄逼人 2936 | 哙 2937 | 哝 2938 | 啷 2939 | 嗲 2940 | 嗲声嗲气 2941 | 嘈 2942 | 嘈杂 2943 | 嘀里嘟噜 2944 | 岌岌 2945 | 岌岌不可终日 2946 | 岌岌可危 2947 | 嶙峋 2948 | 嶙嶙 2949 | 犷 2950 | 犷悍 2951 | 狃 2952 | 狎 2953 | 狎昵 2954 | 狯 2955 | 狷 2956 | 狷急 2957 | 猥 2958 | 猥鄙 2959 | 猥贱 2960 | 猥劣 2961 | 猥陋 2962 | 猥琐 2963 | 猥亵 2964 | 獐头鼠目 2965 | 獠 2966 | 舛 2967 | 馀 2968 | 廪 2969 | 忉 2970 | 忮 2971 | 忸忸怩怩 2972 | 忸怩作态 2973 | 怊 2974 | 恹 2975 | 恹恹 2976 | 悖 2977 | 悖晦 2978 | 悖谬 2979 | 悖逆 2980 | 悖妄 2981 | 悭 2982 | 悭吝 2983 | 悱 2984 | 愦 2985 | 愣 2986 | 愣头愣脑 2987 | 愀然 2988 | 愎 2989 | 慵懒 2990 | 懵 2991 | 懵懂 2992 | 懵里懵懂 2993 | 懵懵懂懂 2994 | 阙陋 2995 | 阙略 2996 | 湎 2997 | 溲 2998 | 溷浊 2999 | 滂 3000 | 澹然 3001 | 蹇 3002 | 遴 3003 | 邋里邋遢 3004 | 邋遢 3005 | 邋邋遢遢 3006 | 孱 3007 | 孱弱 3008 | 羼 3009 | 娆 3010 | 媸 3011 | 孑 3012 | 孑然 3013 | 孑然一身 3014 | 孑身 3015 | 驽 3016 | 驽钝 3017 | 骈枝 3018 | 绌 3019 | 缈 3020 | 缛 3021 | 缥缈 3022 | 缭乱 3023 | 幺麽 3024 | 杌 3025 | 桀 3026 | 桀骜不驯 3027 | 棼 3028 | 槁 3029 | 轫 3030 | 辁 3031 | 暧 3032 | 暧昧 3033 | 暝 3034 | 犟 3035 | 毵毵 3036 | 虢 3037 | 朦 3038 | 朦胧 3039 | 朦朦胧胧 3040 | 臊 3041 | 膻 3042 | 膻气 3043 | 膻腥 3044 | 熹微 3045 | 戾 3046 | 恝 3047 | 恝然 3048 | 恣 3049 | 恣肆 3050 | 恣意 3051 | 憝 3052 | 戆 3053 | 戆头戆脑 3054 | 沓 3055 | 硗 3056 | 硗薄 3057 | 硗瘠 3058 | 碜 3059 | 睨 3060 | 瞀 3061 | 瞑 3062 | 瞽 3063 | 锱铢必较 3064 | 鸷 3065 | 鸷悍 3066 | 疣赘 3067 | 瘠 3068 | 瘠薄 3069 | 癃 3070 | 癫狂 3071 | 窈冥 3072 | 窭 3073 | 窳 3074 | 窳败 3075 | 窳惰 3076 | 窳劣 3077 | 褊 3078 | 褊急 3079 | 褊狭 3080 | 褶 3081 | 矜 3082 | 矜持 3083 | 矜夸 3084 | 聒 3085 | 聒噪 3086 | 颛 3087 | 颛蒙 3088 | 颟顸 3089 | 蚍蜉撼大树 3090 | 蚍蜉撼树 3091 | 蚩 3092 | 蜻蜓点水 3093 | 笃 3094 | 箪食瓢饮 3095 | 趄 3096 | 蹩脚 3097 | 霭霭 3098 | 龃 3099 | 龃龉 3100 | 龉 3101 | 龌 3102 | 龌龊 3103 | 鲰 3104 | 饕 3105 | 黝 3106 | 黝暗 3107 | 黝黯 3108 | 黠 3109 | 黠慧 3110 | 黢 3111 | 黢黑 3112 | 黩 3113 | 黪 3114 | 黯 3115 | 黯淡 3116 | 黯然 3117 | -------------------------------------------------------------------------------- /corpus/Hownet/url.txt: -------------------------------------------------------------------------------- 1 | 数据标题:Hownet情感词语集 2 | 数据网址:http://www.datatang.com/datares/go.aspx?dataid=603399 3 | -------------------------------------------------------------------------------- /corpus/Tsinghua/readme.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/corpus/Tsinghua/readme.txt -------------------------------------------------------------------------------- /corpus/Tsinghua/url.txt: -------------------------------------------------------------------------------- 1 | 数据标题:中文褒贬义词典v1.0(清华大学李军) 2 | 数据网址:http://www.datatang.com/datares/go.aspx?dataid=618405 3 | -------------------------------------------------------------------------------- /data/NLPCC_2016_Stance_Detection_Task_A_Unknown.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/data/NLPCC_2016_Stance_Detection_Task_A_Unknown.txt -------------------------------------------------------------------------------- /data/NLPCC_2016_Stance_Detection_Task_A_gold.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jacoxu/2016NLPCC_Stance_Detection/185c76329ec77be974d21e6259696d2c1d84b871/data/NLPCC_2016_Stance_Detection_Task_A_gold.txt --------------------------------------------------------------------------------
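Two usage notes on the files above, both illustrative rather than authoritative.

First, the -sentence-vectors option parsed near the end of word2vec.c is what enables sentence/paragraph vector training: per its own usage text, turning it on (e.g. adding "-sentence-vectors 1" to the example command printed in the help) trains a per-sentence token with the full sentence context instead of just the window.

Second, the Hownet lexicons are plain one-word-per-line files (as shown above), which makes it straightforward to derive simple opinion-count features for a segmented Weibo text. The Java sketch below is a minimal, hypothetical illustration of that idea; the class name, the UTF-8 encoding, and the hard-coded sample input are assumptions of this sketch, not the repository's actual opinion-feature code (note also that the Tsinghua lists are GB-encoded, per their filenames, and would need a GBK charset instead).

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch (not the repository's code): count lexicon hits in a segmented text.
public class OpinionFeatureSketch {

    // Load a one-word-per-line lexicon into a set, skipping blank lines.
    static Set<String> loadLexicon(String path) throws IOException {
        Set<String> words = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            String w = line.trim();
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Paths follow the repository layout; UTF-8 is an assumption of this sketch.
        Set<String> pos = loadLexicon("corpus/Hownet/pos_opinion.txt");
        Set<String> neg = loadLexicon("corpus/Hownet/neg_opinion.txt");

        // A Weibo text after word segmentation, tokens separated by spaces (sample input).
        String segmented = "这 种 做法 荒谬 而且 粗暴";

        int posCount = 0, negCount = 0;
        for (String token : segmented.split("\\s+")) {
            if (pos.contains(token)) posCount++;
            if (neg.contains(token)) negCount++;
        }
        // Both 荒谬 and 粗暴 appear in neg_opinion.txt above, so this prints pos=0 neg=2.
        System.out.printf("pos=%d neg=%d%n", posCount, negCount);
    }
}

Such counts could then be appended to the document's feature vector before classifier training; how the original pipeline actually combines them is not shown in this dump.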