├── README.md
├── dataset
│   ├── CFRA_report_ids.txt
│   └── GJO_question_ids.txt
└── extraction
    ├── extract_numerical_forecasts.py
    └── pdf_extraction_sample_code.py

/README.md:
--------------------------------------------------------------------------------

# Measuring Forecasting Skill from Text

This repository contains the resources from the following paper:

Measuring Forecasting Skill from Text

Shi Zong, Alan Ritter, Eduard Hovy

ACL 2020

https://www.aclweb.org/anthology/2020.acl-main.473.pdf

```
@inproceedings{zong-etal-2020-measuring,
    title = "Measuring Forecasting Skill from Text",
    author = "Zong, Shi and
      Ritter, Alan and
      Hovy, Eduard",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.473",
    pages = "5317--5331",
}
```

### Dataset descriptions

Due to privacy and policy restrictions, we are only able to provide the question IDs for the Good Judgment Open dataset and the report IDs (from the THOMSON ONE database) for the financial dataset under the `dataset` folder. Please check your institution's library subscriptions for access to the THOMSON ONE database.

### Extracting information from PDF files

#### Preprocess PDF files

We provide some suggestions and methods that we have found useful in practice when converting PDF files into text.

- We suggest first splitting PDF files into single pages, then combining pages with the same layout for easier batch processing.
- When extracting paragraph content from PDF files, we suggest opening the files in a PDF reader (e.g., Adobe) and directly copying the text out.
  Using automatic tools may introduce many line breaks, making it hard to recover the original paragraph structure.
- We suggest cropping PDF pages to the area needed, to make post-processing easier.
- Automatic tools (e.g., [tabula](https://github.com/tabulapdf/tabula)) can be used for structured table extraction.
- Always add separators to the header part of the PDF files to divide the different pages.

We provide some sample processing code under the `extraction` folder. Note that you may need to adjust the parameters (for example, the number of blank spaces and the pixel locations of the extraction areas) for your own needs.

#### Extract financial numerical estimates

We provide our code for extracting financial numerical estimates from analyst notes in `extract_numerical_forecasts.py`.

We tokenize our financial analyst notes with Stanford CoreNLP. The tagging results should be organized in .jsonl format: each line has a 'tagging' field containing a list, in which each element is a tuple `[a_sentence_id, a_tokenized_sentence]` (note that tokenized money values need to be recovered, e.g., '$' and '10' reconnected into '$10', as shown in the following example). Then run `extractEstimatesNew(input_data)` to get the extraction results.

```python
[{'tagging':
    [['23038356-0-0', 'We are keeping our EPS forecast for CI , but boost our target price by $10 to $97 , on revised P/E analysis .'],
     ['23038356-0-1', 'X X X'],
     ['23038356-0-2', 'X X X']]}
]
```

### Computational linguistic tools used

We list the computational linguistic tools used in this paper.
- Uncertainty: https://github.com/heikeadel/attention_methods
- LIWC: https://liwc.wpengine.com/
- SO-CAL: https://github.com/sfu-discourse-lab/SO-CAL
- Discourse lexicon: http://connective-lex.info/
- Financial sentiment: https://sraf.nd.edu/textual-analysis/resources/

--------------------------------------------------------------------------------
/dataset/GJO_question_ids.txt:
--------------------------------------------------------------------------------
4
5
6
7
8
9
10
12
13
14
15
16
17
18
19
22
24
27
28
29
30
33
34
35
36
37
38
39
40
42
44
47
49
50
51
52
54
55
56
61
63
64
65
66
70
72
73
74
75
76
78
79
80
84
85
86
90
94
95
96
97
98
99
100
101
103
106
109
111
112
113
115
117
123
127
128
129
130
133
135
137
138
143
146
148
149
150
153
156
157
158
161
162
166
167
168
172
173
174
175
176
177
178
179
180
182
183
184
185
187
188
190
193
194
195
197
198
199
200
201
202
203
204
206
208
210
211
214
215
216
217
219
221
222
223
224
226
228
231
232
233
236
237
244
245
250
251
252
256
257
258
259
261
263
264
265
270
275
277
279
280
281
282
284
287
290
291
293
294
295
298
299
300
302
303
307
308
309
344
345
346
349
350
351
352
353
354
355
357
359
361
363
364
366
367
369
370
371
373
374
375
376
377
378
379
381
382
386
387
388
389
390
391
392
405
407
409
413
414
417
419
420
421
423
426
427
429
430
436
440
442
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
467
468
469
470
471
473
476
479
480
481
482
483
484
485
486
487
490
491
493
498
500
501
502
507
508
509
510
511
512
514
515
516
517
518
519
520
521
523
524
525
530
537
539
540
541
545
546
548
561
563
564
567
568
569
572
583
585
588
597
599
602
608
609
610
614
615
617
618
619
620
622
623
624
625
626
627
628
629
630
631
632
633
640
641
644
645
646
649
655
662
665
666
667
669
671
678
679
680
681
682
683
685
686
689
694
696
698
704
753
755
756
757
759
762
764
765
769
770
771
772
773
795
796
797
798
799
800
802
803
807
811
812
815
818
819
820
821
825
826
827
828
829
831
832
833
834
835
836
837
838
839
840
842
844
877
880
881
882
883
886
889
893
894
896
899
900
901
904
924
926
928
933
936
944
952
958
964
965
967
986
1001
1004
1005
1028
1033
1035
1038
1051
1069

--------------------------------------------------------------------------------
/extraction/extract_numerical_forecasts.py:
--------------------------------------------------------------------------------

### extract numerical forecasts from analyst notes
### ATTENTION: it is designed only for CFRA reports

import re


def maskMapping(mask_name, input_string, start_idx, count1):

    def replaceStrIndex(text, index1, index2, replacement=''):
        return '%s%s%s' % (text[:index1], replacement, text[index2:])

    pattern_iter = re.finditer(mask_name, input_string)
    for tag_idx, match in enumerate(pattern_iter):
        curr_start_idx = match.start() + tag_idx * 3
        curr_end_idx = match.end() + tag_idx * 3
        input_string = replaceStrIndex(input_string, curr_start_idx, curr_end_idx,
                                       mask_name.replace('>', '-') + str(tag_idx + start_idx + count1).zfill(2) + '>')

    return input_string


def maskEntity(input_string, pattern_list, mask_name, count1):

    output_tuple = []
    num_total_replacement = 0
    for each_pattern in pattern_list:
        curr_matches = re.findall(each_pattern[0], input_string)
        input_string, count = re.subn(each_pattern[0],
                                      mask_name, input_string)
        input_string = maskMapping(mask_name, input_string, num_total_replacement, count1)
        # print(input_string, count)
        if curr_matches:
            output_tuple.append([(mask_name.replace('>', '-') + str(idx + num_total_replacement + count1).zfill(2) + '>', i) for idx, i in enumerate(curr_matches)])
        num_total_replacement = num_total_replacement + count

    output_tuple = [j for i in output_tuple for j in i]

    return input_string, output_tuple, num_total_replacement + count1


def maskNumSpecSign(input_string, sign, replacement, count):

    input_string_split = input_string.split(' ')
    ## find token
    token_chosen = [(idx, i) for idx, i in enumerate(input_string_split) if sign in i]
    ## replace and build mapping
    mapping = []
    for idx, i in enumerate(token_chosen):
        input_string_split[i[0]] = replacement.replace('>', '-') + str(idx + count).zfill(2) + '>'
        mapping.append((replacement.replace('>', '-') + str(idx + count).zfill(2) + '>', i[1]))
    output_string = ' '.join(input_string_split)

    return output_string, mapping, len(token_chosen) + count


def performMasking(input_string, count1=0, count2=0):

    time_pattern_list = [(r"'1\d 's", 'year'), (r"1\d 's", 'year'), (r"'1\d's", 'year'),
                         (r"1\d's", 'year'), (r"'1\d", 'year'),  # year
                         (r"Q\d", 'quarter'),  # quarter
                         (r'\d+[ |-]month', 'time-span'),
                         (r'(\d+[ |-]year|\d+[ |-]yr)', 'time-span'),  # time
                         (r'FY [20]*1\d -LRB- [A-Za-z\.]{3,10}? -RRB-', 'FY'),
                         (r'FY 1\d', 'FY'), (r"201\d", 'year')]

    neg_value_list = [(r'-\$[0-9\.]+', 'neg_value'),
                      (r'a loss of \$[0-9\.]+', 'neg_value'),
                      (r'a loss per share of \$[0-9\.]+', 'neg_value'),
                      (r'\$[0-9\.]+ loss', 'neg_value'),
                      (r'\$[0-9\.]+ EPS', 'pos_value'),
                      (r'(\$[0-9\. ]+)(million|billion|M)', 'normal_value')]

    output_tuple = []

    ## time masking
    input_string, output_tuple_time, time_count = maskEntity(input_string, time_pattern_list,
                                                             '