├── README.md
├── dataset
│   ├── CFRA_report_ids.txt
│   └── GJO_question_ids.txt
└── extraction
    ├── extract_numerical_forecasts.py
    └── pdf_extraction_sample_code.py

/README.md:
--------------------------------------------------------------------------------

# Measuring Forecasting Skill from Text

This repository contains the resources from the following paper:

Measuring Forecasting Skill from Text

Shi Zong, Alan Ritter, Eduard Hovy

ACL 2020

https://www.aclweb.org/anthology/2020.acl-main.473.pdf

```
@inproceedings{zong-etal-2020-measuring,
    title = "Measuring Forecasting Skill from Text",
    author = "Zong, Shi and
      Ritter, Alan and
      Hovy, Eduard",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.473",
    pages = "5317--5331",
}
```

### Dataset descriptions

Due to privacy and policy restrictions, we are only able to provide the question IDs for the Good Judgment Open dataset and the report IDs (from the THOMSON ONE database) for the financial dataset under the `dataset` folder. Please check your institution's library subscriptions for access to the THOMSON ONE database.

### Extracting information from PDF files

#### Preprocess PDF files

We provide some suggestions and methods that we have found useful in practice when converting PDF files into text.

- We suggest first splitting PDF files into single pages, then combining pages with the same layout for easier batch processing.
- When extracting paragraph content from PDF files, we suggest opening the files in a PDF reader (e.g., Adobe) and directly copying the text out.
  Using automatic tools may introduce many line breaks, making it hard to recover the original paragraph structure.
- We suggest cropping PDF pages to the area needed, to make post-processing easier.
- Automatic tools (e.g., [tabula](https://github.com/tabulapdf/tabula)) can be used for structured table extraction.
- Always add separators to the header part of the PDF files to divide the different pages.

We provide some sample processing code under the `extraction` folder. Note that you may need to adjust the parameters (for example, the number of blank spaces and the pixel locations of the extraction areas) for your own needs.

#### Extract financial numerical estimates

We provide our code for extracting financial numerical estimates from analyst notes in `extract_numerical_forecasts.py`.

We tokenize our financial analyst notes with Stanford CoreNLP. The tagging results should be organized in .jsonl format: each line has a 'tagging' field containing a list, in which each element is a tuple `[a_sentence_id, a_tokenized_sentence]` (note that tokenized money values need to be recovered, e.g., '$' and '10' reconnected into '$10', as shown in the following example). Then run `extractEstimatesNew(input_data)` to get the extraction results.

```python
[{'tagging':
    [['23038356-0-0', 'We are keeping our EPS forecast for CI , but boost our target price by $10 to $97 , on revised P/E analysis .'],
     ['23038356-0-1', 'X X X'],
     ['23038356-0-2', 'X X X']]}
]
```

### Computational linguistic tools used

We list the computational linguistic tools used in this paper.
- Uncertainty: https://github.com/heikeadel/attention_methods
- LIWC: https://liwc.wpengine.com/
- SO-CAL: https://github.com/sfu-discourse-lab/SO-CAL
- Discourse lexicon: http://connective-lex.info/
- Financial sentiment: https://sraf.nd.edu/textual-analysis/resources/

--------------------------------------------------------------------------------
/dataset/GJO_question_ids.txt:
--------------------------------------------------------------------------------
4
5
6
7
8
9
10
12
13
14
15
16
17
18
19
22
24
27
28
29
30
33
34
35
36
37
38
39
40
42
44
47
49
50
51
52
54
55
56
61
63
64
65
66
70
72
73
74
75
76
78
79
80
84
85
86
90
94
95
96
97
98
99
100
101
103
106
109
111
112
113
115
117
123
127
128
129
130
133
135
137
138
143
146
148
149
150
153
156
157
158
161
162
166
167
168
172
173
174
175
176
177
178
179
180
182
183
184
185
187
188
190
193
194
195
197
198
199
200
201
202
203
204
206
208
210
211
214
215
216
217
219
221
222
223
224
226
228
231
232
233
236
237
244
245
250
251
252
256
257
258
259
261
263
264
265
270
275
277
279
280
281
282
284
287
290
291
293
294
295
298
299
300
302
303
307
308
309
344
345
346
349
350
351
352
353
354
355
357
359
361
363
364
366
367
369
370
371
373
374
375
376
377
378
379
381
382
386
387
388
389
390
391
392
405
407
409
413
414
417
419
420
421
423
426
427
429
430
436
440
442
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
467
468
469
470
471
473
476
479
480
481
482
483
484
485
486
487
490
491
493
498
500
501
502
507
508
509
510
511
512
514
515
516
517
518
519
520
521
523
524
525
530
537
539
540
541
545
546
548
561
563
564
567
568
569
572
583
585
588
597
599
602
608
609
610
614
615
617
618
619
620
622
623
624
625
626
627
628
629
630
631
632
633
640
641
644
645
646
649
655
662
665
666
667
669
671
678
679
680
681
682
683
685
686
689
694
696
698
704
753
755
756
757
759
762
764
765
769
770
771
772
773
795
796
797
798
799
800
802
803
807
811
812
815
818
819
820
821
825
826
827
828
829
831
832
833
834
835
836
837
838
839
840
842
844
877
880
881
882
883
886
889
893
894
896
899
900
901
904
924
926
928
933
936
944
952
958
964
965
967
986
1001
1004
1005
1028
1033
1035
1038
1051
1069

--------------------------------------------------------------------------------
/extraction/extract_numerical_forecasts.py:
--------------------------------------------------------------------------------

### extract numerical forecasts from analyst notes
### ATTENTION: it is designed only for CFRA reports

import re


def maskMapping(mask_name, input_string, start_idx, count1):

    def replaceStrIndex(text, index1, index2, replacement=''):
        return '%s%s%s' % (text[:index1], replacement, text[index2:])

    pattern_iter = re.finditer(mask_name, input_string)
    for tag_idx, match in enumerate(pattern_iter):
        curr_start_idx = match.start() + tag_idx * 3
        curr_end_idx = match.end() + tag_idx * 3
        input_string = replaceStrIndex(input_string, curr_start_idx, curr_end_idx,
                                       mask_name.replace('>', '-') + str(tag_idx + start_idx + count1).zfill(2) + '>')

    return input_string


def maskEntity(input_string, pattern_list, mask_name, count1):

    output_tuple = []
    num_total_replacement = 0
    for each_pattern in pattern_list:
        curr_matches = re.findall(each_pattern[0], input_string)
        input_string, count = re.subn(each_pattern[0],
                                      mask_name, input_string)
        input_string = maskMapping(mask_name, input_string, num_total_replacement, count1)
        # print(input_string, count)
        if curr_matches:
            output_tuple.append([(mask_name.replace('>', '-') + str(idx + num_total_replacement + count1).zfill(2) + '>', i) for idx, i in enumerate(curr_matches)])
        num_total_replacement = num_total_replacement + count

    output_tuple = [j for i in output_tuple for j in i]

    return input_string, output_tuple, num_total_replacement + count1


def maskNumSpecSign(input_string, sign, replacement, count):

    input_string_split = input_string.split(' ')
    ## find token
    token_chosen = [(idx, i) for idx, i in enumerate(input_string_split) if sign in i]
    ## replace and build mapping
    mapping = []
    for idx, i in enumerate(token_chosen):
        input_string_split[i[0]] = replacement.replace('>', '-') + str(idx + count).zfill(2) + '>'
        mapping.append((replacement.replace('>', '-') + str(idx + count).zfill(2) + '>', i[1]))
    output_string = ' '.join(input_string_split)

    return output_string, mapping, len(token_chosen) + count


def performMasking(input_string, count1=0, count2=0):

    time_pattern_list = [(r"'1\d 's", 'year'), (r"1\d 's", 'year'), (r"'1\d's", 'year'),
                         (r"1\d's", 'year'), (r"'1\d", 'year'),  # year
                         (r"Q\d", 'quarter'),  # quarter
                         (r'\d+[ |-]month', 'time-span'),
                         (r'(\d+[ |-]year|\d+[ |-]yr)', 'time-span'),  # time
                         (r'FY [20]*1\d -LRB- [A-Za-z\.]{3,10}? -RRB-', 'FY'),
                         (r'FY 1\d', 'FY'), (r"201\d", 'year')]

    neg_value_list = [(r'-\$[0-9\.]+', 'neg_value'),
                      (r'a loss of \$[0-9\.]+', 'neg_value'),
                      (r'a loss per share of \$[0-9\.]+', 'neg_value'),
                      (r'\$[0-9\.]+ loss', 'neg_value'),
                      (r'\$[0-9\.]+ EPS', 'pos_value'),
                      (r'(\$[0-9\. ]+)(million|billion|M)', 'normal_value')]

    output_tuple = []

    ## time masking
    input_string, output_tuple_time, time_count = maskEntity(input_string, time_pattern_list,
                                                             '