├── .gitignore ├── LICENSE ├── README.md ├── data └── raw │ ├── sample_test.json │ └── sample_train.json ├── information_extraction_t5 ├── __init__.py ├── data │ ├── __init__.py │ ├── basic_to_squad.py │ ├── convert_dataset_to_squad.py │ ├── convert_squad_to_t5.py │ ├── file_handling.py │ └── qa_data.py ├── features │ ├── __init__.py │ ├── context.py │ ├── highlights.py │ ├── postprocess.py │ ├── preprocess.py │ ├── questions │ │ ├── __init__.py │ │ ├── questions.py │ │ └── type_map.py │ └── sentences.py ├── models │ ├── __init__.py │ └── qa_model.py ├── predict.py ├── train.py └── utils │ ├── __init__.py │ ├── balance_data.py │ ├── freeze.py │ ├── metrics.py │ └── processing.py ├── params.yaml ├── requirements.txt └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | # These are some examples of commonly ignored file patterns. 2 | # You should customize this list as applicable to your project. 3 | # Learn more about .gitignore: 4 | # https://www.atlassian.com/git/tutorials/saving-changes/gitignore 5 | 6 | # Node artifact files 7 | node_modules/ 8 | dist/ 9 | 10 | # Compiled Java class files 11 | *.class 12 | 13 | # Compiled Python bytecode 14 | *.py[cod] 15 | 16 | # Log files 17 | *.log 18 | 19 | # Package files 20 | *.jar 21 | 22 | # Maven 23 | target/ 24 | dist/ 25 | 26 | # JetBrains IDE 27 | .idea/ 28 | 29 | # Unit test reports 30 | TEST*.xml 31 | 32 | # Generated by MacOS 33 | .DS_Store 34 | 35 | # Generated by Windows 36 | Thumbs.db 37 | 38 | # Applications 39 | *.app 40 | *.exe 41 | *.war 42 | 43 | # Large media files 44 | *.mp4 45 | *.tiff 46 | *.avi 47 | *.flv 48 | *.mov 49 | *.wmv 50 | 51 | MANIFEST 52 | *egg-info 53 | 54 | .vscode 55 | 56 | cache* 57 | 58 | data/processed 59 | 60 | lightning_logs/* 61 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 NeuralMind 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Information Extraction using T5 2 | 3 | [![arXiv](https://img.shields.io/badge/arXiv-2201.05658-f9f107.svg)](https://arxiv.org/abs/2201.05658) 4 | 5 | This project provides a solution for training and validating seq2seq models for information extraction. The method can be applied to any text-only type of document, such as legal, registration, or news documents. The project extracts information through question answering. 6 | 7 | In this work, we evaluate sequence-to-sequence models as an alternative to token-level classification methods for extracting information from documents. T5 models are fine-tuned to jointly extract the information and generate the output in a structured format. Post-processing steps are learned during training, eliminating the need for rule-based methods and simplifying the pipeline. 8 | 9 | Neither the model weights nor the datasets can be released, for ethical reasons. However, we release source code that works with different models of the T5 family and can easily be extended to new datasets and languages. 10 | 11 | # Installation 12 | 13 | Clone the repository and install via: 14 | 15 | ```bash 16 | pip install . 17 | ``` 18 | 19 | # Fine-tuning 20 | 21 | Configure the parameters in `params.yaml`. Then, preprocess the datasets by running: 22 | 23 | ```bash 24 | python information_extraction_t5/data/convert_dataset_to_squad.py -c params.yaml 25 | ``` 26 | 27 | Start the fine-tuning experiment via: 28 | 29 | ```bash 30 | python information_extraction_t5/train.py -c params.yaml 31 | ``` 32 | 33 | The code runs the training and then the inference using the checkpoint with the best exact match on the validation set, followed by the complete post-processing, which mainly involves: 34 | 35 | - selecting the most likely answer (the sliding window that produced the highest-probability response), and 36 | - breaking compound answers into clean, individual sub-answers. 37 | 38 | It also computes metrics and generates the following output files: 39 | 40 | - `metrics_by_typenames.json`: JSON file with exact match and F1-score for each *field*, each dataset, and all documents. 41 | - `metrics_by_documents.json`: JSON file with exact match and F1-score for each *document*, each dataset, and all documents. 42 | - `outputs_by_typenames.txt`: TXT file with labels, predictions, *document id*, probability of the selected window, and window id, grouped by *field*. 43 | - `outputs_by_documents.txt`: TXT file with labels, predictions, *field id*, probability of the selected window, and window id, grouped by *document*. 44 | - `output_sheet.xlsl`: Excel file with field ids, labels, predictions, and probabilities, grouped by document. 45 | - `output_sheet_client.xlsl`: Excel file with labels, predictions, probabilities, and metrics, organized as one sheet per dataset. 46 | 47 | Note that, for now, only a [tiny synthetic dataset](data/raw/sample_train.json) is available. To extend the project to new datasets, please consult [this section](#extending-for-new-datasets).
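As a rough illustration of the second post-processing step above, the sketch below splits a compound T5 output with bracketed clues into clean, individual sub-answers. This is only a minimal, regex-based approximation written for this README, not the project's actual `postprocess.py` logic, and the clue names are illustrative ones taken from the sample `form` dataset.

```python
import re

def split_compound_answer(answer: str) -> dict:
    """Break a T5 output such as '[Logradouro]: RUA X [Número]: 78'
    into one sub-answer per bracketed clue (illustrative only)."""
    parts = re.findall(r'\[([^\]]+)\]:\s*(.*?)(?=\s*\[[^\]]+\]:|$)', answer)
    return {clue: value.strip() for clue, value in parts}

print(split_compound_answer('[Logradouro]: RUA PEDRO NOVAIS [Número]: 78 [Bairro]: N/A'))
# {'Logradouro': 'RUA PEDRO NOVAIS', 'Número': '78', 'Bairro': 'N/A'}
```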
48 | 49 | # Inference 50 | 51 | Assuming you have finished training and inference and want to use a different test dataset, an intermediate checkpoint, or even explore a different post-processing function or parameter, just run: 52 | 53 | ```bash 54 | python information_extraction_t5/predict.py -c params.yaml 55 | ``` 56 | 57 | For cases that require a new inference round, remember to set `use_cache_predictions: false` to overwrite the cache. Otherwise, if you only intend to rerun the post-processing, set `use_cache_predictions: true`. 58 | 59 | # Setting the hyperparameters 60 | 61 | To see all the settings related to the pre-processing, training, inference and post-processing stages, please run: 62 | 63 | ```bash 64 | python information_extraction_t5/train.py --help 65 | ``` 66 | 67 | You will find an extensive list of parameters because the parser inherits PyTorch Lightning's Trainer arguments. 68 | Give special attention only to the parameters that are in the `params.yaml` file. 69 | 70 | 71 | 72 | 73 | 74 | # Extending for new datasets 75 | 76 | In this section we explain how to include new datasets for fine-tuning and inference. 77 | It is important to emphasize that the four datasets originally used in the project cannot be released, for ethical reasons. 78 | 79 | ## Preparing the questions and type-map 80 | 81 | There are two preliminary steps when extending the project to new datasets. 82 | 83 | ### Mapping field names to clues *[Mandatory]* 84 | 85 | The original field names of a dataset can be noisy and unnatural. One important step is converting those irregular names into natural ones. The natural names will be used as clues in the answers. 86 | 87 | For each dataset, it is necessary to [map](information_extraction_t5/features/questions/type_map.py) field names (called type-names in the code) to types and vice-versa. The types are used as clues, in brackets, in the T5 outputs. The field names are recovered in the post-processing stage. 88 | 89 | Each dataset has its own `TYPENAME_TO_TYPE` dictionary. We strongly recommend that the types used across all projects be consistent, and as generic as possible. For example, use *CPF/CNPJ* for all CPFs and CNPJs, regardless of whether they belong to a consultant, current account holder, business partner, land owner, etc. 90 | 91 | ### Formulating questions *[Optional]* 92 | 93 | If your dataset does not follow the required [SQuAD format](#format-of-the-dataset) and you intend to use the [pre-processing code](#converting-the-dataset-to-squad-format), it is necessary to formulate the [questions](information_extraction_t5/features/questions/questions.py) before starting the conversion. 94 | 95 | Each dataset has its own dictionary of questions, in which each key is a field name (type-name in the code) and each value is a list of questions. 96 | 97 | HINT: We use a list of questions for each field as a strategy to augment the dataset. You can enable this data augmentation by setting `train_choose_question: all` in `params.yaml`. Use `random` to select one question randomly, or `first` to take the first question of each field. 98 | 99 | If you have compound information (the value is an internal dictionary), we recommend representing the dict as an OrderedDict and using the dictionary keys as the field signature, ensuring that a compound answer always has its sub-answers in an immutable order (see the sketch below).
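To make the two steps above concrete, here is a minimal sketch of what the entries for the sample `form` dataset could look like. It is illustrative only: the real dictionaries live in `type_map.py` and `questions.py`, and their exact layout (in particular the `compound` entry and the extra question variants) may differ from what is shown here.

```python
from collections import OrderedDict

# Map noisy field names (type-names) to the natural types used as
# bracketed clues in the T5 answers, e.g. '[Agência]: 2347'.
TYPENAME_TO_TYPE = {
    'agencia': 'Agência',
    'nome_completo': 'Nome',
}

# One list of questions per field; the extra variants act as data
# augmentation when `train_choose_question: all` is set.
QUESTIONS = {
    'agencia': ['Qual é o número da agência?', 'Qual é a agência?'],
    'nome_completo': ['Qual é o nome?'],
    # A compound field keeps its sub-fields in an OrderedDict so that
    # the sub-answers always appear in the same, immutable order.
    'endereco': OrderedDict([
        ('compound', ['Qual é o endereço?']),
        ('logradouro', ['Qual é o logradouro?']),
        ('numero', ['Qual é o número?']),
    ]),
}
```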
100 | 101 | ## Format of the dataset 102 | 103 | As the project extracts information through QA, we adopt SQuAD as the format of the datasets, with a few adaptations. Below we present an example that illustrates the structure of the dataset file and describes the adaptations made to support sliding windows and to reference each pair [document, field], enabling effective metric computation for each document, dataset and field. 104 | 105 | ```json 106 | { 107 | "data": [ 108 | { 109 | "title": "318", 110 | "paragraphs": [ 111 | { 112 | "context": "Proposta de Abertura de Conta, Contrata\u00e7\u00e3o de Cr\u00e9dito e\nAdes\u00e3o a Produtos e Servi\u00e7os Banc\u00e1rios - Pessoa F\u00edsica\nID00147\nAg\u00eancia N\u00ba\n1234\nConta Corrente 0011-2347-0000809875312\nCondi\u00e7\u00e3o de Movimenta\u00e7\u00e3o da Conta X Individual\nAltera\u00e7\u00e3o cadastral\nAngariador (matr\u00edcula) L\n00098961\nDados B\u00e1sicos do Titular\nCPF\n516.759.760-90\n...", 113 | "qas": [ 114 | { 115 | "answers": [ 116 | { 117 | "answer_start": 157, 118 | "text": "[Ag\u00eancia]: 2347" 119 | } 120 | ], 121 | "question": "Qual \u00e9 o n\u00famero da ag\u00eancia?", 122 | "id": "form.agencia" 123 | }, 124 | { 125 | "answers": [ 126 | { 127 | "answer_start": -1, 128 | "text": "[Nome]: N/A" 129 | } 130 | ], 131 | "question": "Qual \u00e9 o nome?", 132 | "id": "form.nome_completo" 133 | } 134 | ] 135 | }, 136 | { 137 | "context": "...\nNome Completo ANA MADALENA SILVEIRA ALVES\nDocumento de Identifica\u00e7\u00e3o CNH CTPS Entidade de Classe Mercosul Passaporte\nProtocolo Refugiado\nRIC RNE\nCIE Guia de Acolhimento ao Menor Registro Nacional Migrat\u00f3rio\nN\u00b0 Documento / N\u00b0 da S\u00e9rie (CTPS)\n73258674 \u00d3rg\u00e3o Emissor SSP\nUF BA\nData de Emiss\u00e3o 21/07/2018 Data de Vcto (passaporte/CNH).", 138 | "qas": [ 139 | { 140 | "answers": [ 141 | { 142 | "answer_start": -1, 143 | "text": "[Ag\u00eancia]: N/A" 144 | } 145 | ], 146 | "question": "Qual \u00e9 o n\u00famero da ag\u00eancia?", 147 | "id": "form.agencia" 148 | }, 149 | { 150 | "answers": [ 151 | { 152 | "answer_start": 18, 153 | "text": "[Nome]: ANA MADALENA SILVEIRA ALVES" 154 | } 155 | ], 156 | "question": "Qual \u00e9 o nome?", 157 | "id": "form.nome_completo" 158 | } 159 | ] 160 | } 161 | ] 162 | } 163 | ], 164 | "version": "0.1" 165 | } 166 | ``` 167 | The example presented here includes one document, whose `id = 318` and whose context fits into two sliding windows. We adapted the SQuAD format by transforming the list of different documents related to the same theme into a list of different sliding windows of the same document. For each document, referenced by `title`, `paragraphs` is a list of dictionaries, each holding a context and an internal list of QAs. 168 | 169 | The QA dictionaries follow the same intuition as in the SQuAD dataset, but we include in `id` the signature of the QA, which combines the project (dataset name) and the field. This is very important, since it enables computing metrics not only for all the datasets together, but also for each dataset individually as well as for each field.
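Assuming the converted files already exist at the paths configured in `params.yaml` (e.g. `data/processed/test-v0.1.json` after running the conversion script), the short sketch below shows how this adapted structure can be traversed: `title` identifies the document, each entry of `paragraphs` is one sliding window, and the `id` signature (`dataset.field`) is what allows grouping metrics per dataset and per field.

```python
import json
from collections import Counter

# Path is an assumption: adjust to wherever your converted file lives.
with open('data/processed/test-v0.1.json', encoding='utf-8') as f:
    squad = json.load(f)

per_field = Counter()
for document in squad['data']:
    doc_id = document['title']            # document reference, e.g. "318"
    windows = document['paragraphs']      # one entry per sliding window
    for window in windows:
        for qa in window['qas']:
            per_field[qa['id']] += 1      # e.g. 'form.agencia'
    print(f'document {doc_id}: {len(windows)} window(s)')

print(dict(per_field))                    # question-answers per field
```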
170 | 171 | ## Adding the dataset 172 | 173 | Assuming you already have a dataset pre-processed in SQuAD format, to include it in the project you just need to choose a name for it and edit the following parameters in the `params.yaml` file: 174 | 175 | ```yaml 176 | project: [ 177 | form, 178 | ] 179 | train_file: data/processed/train-v0.1.json 180 | valid_file: data/processed/dev-v0.1.json 181 | test_file: data/processed/test-v0.1.json 182 | ``` 183 | 184 | Note that it is possible to include several datasets in the list of projects, but each `{train, valid, test}_file` contains the examples of all the datasets listed in `project`. 185 | 186 | ## Converting the dataset to SQuAD format 187 | 188 | If your dataset is not yet in the more complex SQuAD-like format, with the document divided into sliding windows, the question-answer pairs, and the correct qa-ids, don't worry! We release code to [convert the dataset to the expected format](information_extraction_t5/data/basic_to_squad.py). 189 | 190 | All you need to do is ensure the dataset follows a basic JSON format: a dictionary of documents, in which each key is the document id and each value is an internal dict that must have a "text" key holding the document content, plus other key-value pairs representing the fields of the document. 191 | 192 | You can inspect [here](data/raw/sample_train.json) a raw dataset that is ready to be converted into SQuAD format. NOTE: If you want to extract compound information (using the compound QA feature) for a compound field, such as `address`, the value of the respective key must be another dictionary with the expected information. 193 | 194 | Thus, to generate the SQuAD-like dataset illustrated in the previous subsection, just set the parameters as below: 195 | 196 | ```yaml 197 | project: [ 198 | form, 199 | ] 200 | raw_data_file: [ 201 | data/raw/sample_train.json, 202 | ] 203 | raw_valid_data_file: [ 204 | null, 205 | ] 206 | raw_test_data_file: [ 207 | data/raw/sample_test.json, 208 | ] 209 | train_file: data/processed/train-v0.1.json 210 | valid_file: data/processed/dev-v0.1.json 211 | test_file: data/processed/test-v0.1.json 212 | type_names: [ 213 | form.agencia, 214 | form.nome_completo, 215 | ] 216 | ``` 217 | 218 | The parameter names are intuitive. You can include any number of dataset names and their respective train, validation and test paths (the four lists must have the same number of entries). If any of the datasets does not have a validation subset, just put `null` in its position, and a fraction `valid_percent` of the training set will be moved to the validation set. 219 | 220 | Finally, just run the command below to convert the listed datasets to SQuAD format and save them as `{train, valid, test}_file`: 221 | 222 | ```bash 223 | python information_extraction_t5/data/convert_dataset_to_squad.py -c params.yaml 224 | ``` 225 | 226 | ### Limitation 227 | 228 | The released pre-processing code does not include the `sentence-ids` feature and the `raw-text` format, as they would require a more complex and elaborate raw dataset whose structure includes annotations of positions and texts in both raw and canonical formats. Those features are important only for industrial applications, but, depending on the dataset size, you can manually include `answer_start` for each qa, setting the answer to *N/A* when it does not fit in the window.
For training the model to extract canonical and raw-text information, you can change both the questions and answers as: 229 | 230 | ``` 231 | Q: What is the state? 232 | A: [State]: São Paulo 233 | 234 | Q: What is the state and how does it appear in the text? 235 | A: [State]: SP [appears in the text]: São Paulo 236 | ``` 237 | 238 | # Cite as 239 | 240 | ```bibtex 241 | @inproceedings{pires2022seq2seq, 242 | title = {Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents}, 243 | author = {Pires, Ramon and de Souza, Fábio C. and Rosa, Guilherme and Lotufo, Roberto A. and Nogueira, Rodrigo}, 244 | publisher = {arXiv}, 245 | doi = {10.48550/ARXIV.2201.05658}, 246 | url = {https://arxiv.org/abs/2201.05658}, 247 | year = {2022}, 248 | } 249 | ``` 250 | -------------------------------------------------------------------------------- /data/raw/sample_test.json: -------------------------------------------------------------------------------- 1 | { 2 | "965": { 3 | "etiqueta": "ID00123", 4 | "agencia": "1234", 5 | "conta_corrente": "0093-1234-0000103133931", 6 | "cpf": "675.957.460-51", 7 | "nome_completo": "MARCIO MACEDO SOUZA", 8 | "n_doc_serie": "73258674", 9 | "orgao_emissor": "SSP", 10 | "data_emissao": "05/09/1988", 11 | "data_nascimento": "05/05/1971", 12 | "nome_mae": "MARIA ANTONIETA MACEDO", 13 | "nome_pai": "JOABE SALVADOR SOUZA", 14 | "endereco": { 15 | "logradouro": "RUA PEDRO NOVAIS", 16 | "numero": "78", 17 | "complemento": "Apto 8", 18 | "bairro": "AFOGADOS", 19 | "cidade": "RECIFE", 20 | "estado": "PE", 21 | "cep": "56220-000" 22 | }, 23 | "text": "Proposta de Abertura de Conta, Contrata\u00e7\u00e3o de Cr\u00e9dito e\nAdes\u00e3o a Produtos e Servi\u00e7os Banc\u00e1rios - Pessoa F\u00edsica\nID00123\nAg\u00eancia N\u00ba\n1234\nConta Corrente 0093-1234-0000103133931\nCondi\u00e7\u00e3o de Movimenta\u00e7\u00e3o da Conta X Individual\nAltera\u00e7\u00e3o cadastral\nAngariador (matr\u00edcula) L\n000685631\nDados B\u00e1sicos do Titular\nCPF\n675.957.460-51\nNome Completo MARCIO MACEDO SOUZA\nDocumento de Identifica\u00e7\u00e3o CNH CTPS Entidade de Classe Mercosul Passaporte\nProtocolo Refugiado\nRIC RNE\nCIE Guia de Acolhimento ao Menor Registro Nacional Migrat\u00f3rio\nN\u00b0 Documento / N\u00b0 da S\u00e9rie (CTPS)\n73258674 \u00d3rg\u00e3o Emissor SSP\nUF PE\nData de Emiss\u00e3o 05/09/1988 Data de Vcto (passaporte/CNH)\n | Data de Nascimento 05/05/1971 Sexo F X M\nNacionalidade x Brasileira\nNome da M\u00e3e MARIA ANTONIETA MACEDO\nNome do Pai JOABE SALVADOR SOUZA\nCidadania\nBRASILEIRA\nDomic\u00edlio fiscal\nBRASIL\nEndere\u00e7os\nEndere\u00e7o Residencial\nRua/Av/P\u00e7a/Estrada RUA PEDRO NOVAIS\nN\u00famero\n78 Complemento Apto 8\nBairro AFOGADOS\nMunic\u00edpio RECIFE\nUF PE\nPa\u00eds BRASIL\n56220-000" 24 | } 25 | } -------------------------------------------------------------------------------- /data/raw/sample_train.json: -------------------------------------------------------------------------------- 1 | { 2 | "318": { 3 | "etiqueta": "ID00147", 4 | "agencia": "2347", 5 | "conta_corrente": "0011-2347-0000809875312", 6 | "cpf": "516.759.760-90", 7 | "nome_completo": "ANA MADALENA SILVEIRA ALVES", 8 | "n_doc_serie": "73258674", 9 | "orgao_emissor": "SSP", 10 | "data_emissao": "21/07/2018", 11 | "data_nascimento": "12/04/1992", 12 | "nome_mae": "MADALENA COSTA SILVEIRA", 13 | "nome_pai": "JUNIOR AUGUSTO ALVES", 14 | "endereco": { 15 | "logradouro": "AV. 
CRESCENCIO LISBOA", 16 | "numero": "986", 17 | "complemento": "Apto 3", 18 | "bairro": "BARAUNA", 19 | "cidade": "BARREIRAS", 20 | "estado": "BA", 21 | "cep": "47800-013" 22 | }, 23 | "text": "Proposta de Abertura de Conta, Contrata\u00e7\u00e3o de Cr\u00e9dito e\nAdes\u00e3o a Produtos e Servi\u00e7os Banc\u00e1rios - Pessoa F\u00edsica\nID00147\nAg\u00eancia N\u00ba\n1234\nConta Corrente 0011-2347-0000809875312\nCondi\u00e7\u00e3o de Movimenta\u00e7\u00e3o da Conta X Individual\nAltera\u00e7\u00e3o cadastral\nAngariador (matr\u00edcula) L\n00098961\nDados B\u00e1sicos do Titular\nCPF\n516.759.760-90\nNome Completo ANA MADALENA SILVEIRA ALVES\nDocumento de Identifica\u00e7\u00e3o CNH CTPS Entidade de Classe Mercosul Passaporte\nProtocolo Refugiado\nRIC RNE\nCIE Guia de Acolhimento ao Menor Registro Nacional Migrat\u00f3rio\nN\u00b0 Documento / N\u00b0 da S\u00e9rie (CTPS)\n73258674 \u00d3rg\u00e3o Emissor SSP\nUF BA\nData de Emiss\u00e3o 21/07/2018 Data de Vcto (passaporte/CNH)\n | Data de Nascimento 12/04/1992 Sexo X F M\nNacionalidade x Brasileira\nNome da M\u00e3e MADALENA COSTA SILVEIRA\nNome do Pai JUNIOR AUGUSTO ALVES\nCidadania\nBRASILEIRA\nDomic\u00edlio fiscal\nBRASIL\nEndere\u00e7os\nEndere\u00e7o Residencial\nRua/Av/P\u00e7a/Estrada AV. CRESCENCIO LISBOA\nN\u00famero\n986 Complemento Apto 3\nBairro BARAUNA\nMunic\u00edpio BARREIRAS\nUF BA\nPa\u00eds BRASIL\n47800-013" 24 | }, 25 | "108": { 26 | "etiqueta": "ID00357", 27 | "agencia": "2964", 28 | "conta_corrente": "0071-2964-0000798456556", 29 | "cpf": "096.653.550-23", 30 | "nome_completo": "LUCIANA TRINDADE CARDOSO", 31 | "n_doc_serie": "249878615", 32 | "orgao_emissor": "SSP", 33 | "data_emissao": "12/12/2021", 34 | "data_nascimento": "23/06/1997", 35 | "nome_mae": "AMANDA COSTA TRINDADE", 36 | "nome_pai": "MARCELO MOREIRA CARDOSO", 37 | "endereco": { 38 | "logradouro": "RUA ANDERSON TEIXEIRA", 39 | "numero": "988", 40 | "bairro": "CAONZE", 41 | "cidade": "NOVA IGUACU", 42 | "estado": "RJ", 43 | "cep": "13970-190" 44 | }, 45 | "text": "Proposta de Abertura de Conta, Contrata\u00e7\u00e3o de Cr\u00e9dito e\nAdes\u00e3o a Produtos e Servi\u00e7os Banc\u00e1rios - Pessoa F\u00edsica\nID00357\nAg\u00eancia N\u00ba\n2964\nConta Corrente 0071-2964-0000798456556\nCondi\u00e7\u00e3o de Movimenta\u00e7\u00e3o da Conta X Individual\nAltera\u00e7\u00e3o cadastral\nAngariador (matr\u00edcula) L\n00087238978\nDados B\u00e1sicos do Titular\nCPF\n096.653.550-23\nNome Completo LUCIANA TRINDADE CARDOSO\nDocumento de Identifica\u00e7\u00e3o CNH CTPS Entidade de Classe Mercosul Passaporte\nProtocolo Refugiado\nRIC RNE\nCIE Guia de Acolhimento ao Menor Registro Nacional Migrat\u00f3rio\nN\u00b0 Documento / N\u00b0 da S\u00e9rie (CTPS)\n249878615 \u00d3rg\u00e3o Emissor SSP\nUF RJ\nData de Emiss\u00e3o 12/12/2021 Data de Vcto (passaporte/CNH)\n | Data de Nascimento 23/06/1997 Sexo X F M\nNacionalidade x Brasileira\nNome da M\u00e3e AMANDA COSTA TRINDADE\nNome do Pai MARCELO MOREIRA CARDOSO\nCidadania\nBRASILEIRA\nDomic\u00edlio fiscal\nBRASIL\nEndere\u00e7os\nEndere\u00e7o Residencial\nRua/Av/P\u00e7a/Estrada RUA ANDERSON TEIXEIRA\nN\u00famero\n634 Complemento \nBairro CAONZE\nMunic\u00edpio NOVA IGUACU\nUF RJ\nPa\u00eds BRASIL\n13970-190" 46 | }, 47 | "965": { 48 | "etiqueta": "ID00885", 49 | "agencia": "8875", 50 | "conta_corrente": "0044-8875-000080874526544", 51 | "cpf": "010.442.950-07", 52 | "nome_completo": "CARLOS PRATES SOUZA", 53 | "n_doc_serie": "78945646", 54 | "orgao_emissor": "SSP", 55 | "data_emissao": "05/01/2002", 56 | 
"data_nascimento": "13/06/1985", 57 | "nome_mae": "ANABELLE VIEIRA PRATES", 58 | "nome_pai": "PEDRO ROSA SOUZA", 59 | "endereco": { 60 | "logradouro": "RUA JORGE LIMA", 61 | "numero": "634", 62 | "complemento": "Apto 1", 63 | "bairro": "APARECIDA", 64 | "cidade": "SANTOS", 65 | "estado": "SP", 66 | "cep": "01311-000" 67 | }, 68 | "text": "Proposta de Abertura de Conta, Contrata\u00e7\u00e3o de Cr\u00e9dito e\nAdes\u00e3o a Produtos e Servi\u00e7os Banc\u00e1rios - Pessoa F\u00edsica\nID00885\nAg\u00eancia N\u00ba\n8875\nConta Corrente 0044-8875-000080874526544\nCondi\u00e7\u00e3o de Movimenta\u00e7\u00e3o da Conta X Individual\nAltera\u00e7\u00e3o cadastral\nAngariador (matr\u00edcula) L\n0009861245\nDados B\u00e1sicos do Titular\nCPF\n010.442.950-07\nNome Completo CARLOS PRATES SOUZA\nDocumento de Identifica\u00e7\u00e3o CNH CTPS Entidade de Classe Mercosul Passaporte\nProtocolo Refugiado\nRIC RNE\nCIE Guia de Acolhimento ao Menor Registro Nacional Migrat\u00f3rio\nN\u00b0 Documento / N\u00b0 da S\u00e9rie (CTPS)\n78945646 \u00d3rg\u00e3o Emissor SSP\nUF SP\nData de Emiss\u00e3o 05/01/2002 Data de Vcto (passaporte/CNH)\n | Data de Nascimento 13/06/1985 Sexo F X M\nNacionalidade x Brasileira\nNome da M\u00e3e ANABELLE VIEIRA PRATES\nNome do Pai PEDRO ROSA SOUZA\nCidadania\nBRASILEIRA\nDomic\u00edlio fiscal\nBRASIL\nEndere\u00e7os\nEndere\u00e7o Residencial\nRua/Av/P\u00e7a/Estrada RUA JORGE LIMA\nN\u00famero\n634 Complemento Apto 1\nBairro APARECIDA\nMunic\u00edpio SANTOS\nUF SP\nPa\u00eds BRASIL\n01311-000" 69 | } 70 | } -------------------------------------------------------------------------------- /information_extraction_t5/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuralmind-ai/information-extraction-t5/988c589b433d96004139e1f63bfefd6778c0851b/information_extraction_t5/__init__.py -------------------------------------------------------------------------------- /information_extraction_t5/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuralmind-ai/information-extraction-t5/988c589b433d96004139e1f63bfefd6778c0851b/information_extraction_t5/data/__init__.py -------------------------------------------------------------------------------- /information_extraction_t5/data/basic_to_squad.py: -------------------------------------------------------------------------------- 1 | """Convert a simple JSON dataset into SQuAD format.""" 2 | from typing import Dict, List, Optional 3 | from transformers import T5Tokenizer 4 | import numpy.random as nr 5 | 6 | from information_extraction_t5.features.context import get_context 7 | from information_extraction_t5.features.questions.type_map import TYPENAME_TO_TYPE 8 | from information_extraction_t5.features.preprocess import get_questions_for_chunk 9 | 10 | WARNING_MISSING_TYPENAMES = [] 11 | 12 | 13 | def get_question_answers(document: Dict[str, str], 14 | questions: Optional[List[str]] = None, 15 | qa_id: str = 'publicacoes.instancia', 16 | choose_question: str = 'first'): 17 | """Gets question-answers in SQUAD format for the specified type name. 18 | 19 | The answers encompass only the canonical response (value). 20 | The size of the list is: 21 | - zero: if there's no question-answer with the specified type name or if the 22 | corresponding value of a type name key in the document is not a string. 23 | - one: choose_question is 'first' or 'random'. 
24 | - N: an element for each question passed to the questions parameter. 25 | 26 | Returns: 27 | List of dictionaries where each element is a question and its answers. 28 | """ 29 | if questions is None: 30 | questions = [] 31 | 32 | subanswer = document 33 | qa_id_split = qa_id.split('.') 34 | 35 | for type_name in qa_id_split[1:]: 36 | subanswer = subanswer[type_name] 37 | 38 | # select questions 39 | if choose_question == 'first': 40 | selected_questions = [questions[0]] 41 | elif choose_question == 'random': 42 | idx = nr.randint(len(questions)) 43 | selected_questions = [questions[idx]] 44 | else: 45 | selected_questions = questions 46 | 47 | qas = [] 48 | answer = f"[{TYPENAME_TO_TYPE[type_name]}]: {subanswer}" 49 | for question in selected_questions: 50 | answers = [ 51 | { 52 | "answer_start": -1, # None, 53 | "text": answer, 54 | } 55 | ] 56 | qa = { 57 | "answers": answers, 58 | "question": question, 59 | "id": qa_id, 60 | } 61 | qas.append(qa) 62 | return qas 63 | 64 | 65 | def get_compound_question_answers( 66 | document: Dict[str, str], 67 | questions: Optional[List[str]] = None, 68 | qa_id: str = 'publicacoes.instancia_orgao_tipo', 69 | choose_question: str = 'first') -> List[Dict]: 70 | """Gets question-answers in SQUAD format for the specified type names. 71 | 72 | The answers encompass only the canonical response (value). 73 | The size of the list is: 74 | - zero: if there's no question-answer with the specified type name or if the 75 | corresponding value of a type name key in the document is not a string. 76 | - one: choose_question is 'first' or 'random'. 77 | - N: an element for each question passed to the questions parameter. 78 | 79 | Returns: 80 | List of dictionaries where each element is a question and its answers. 81 | """ 82 | # select questions 83 | if questions is None: 84 | questions = [] 85 | if choose_question == 'first': 86 | selected_questions = [questions[0]] 87 | elif choose_question == 'random': 88 | idx = nr.randint(len(questions)) 89 | selected_questions = [questions[idx]] 90 | else: 91 | selected_questions = questions 92 | 93 | type_name = qa_id.split('.')[1] 94 | 95 | all_type_names = get_questions_for_chunk(qa_id=qa_id, return_dict=True).copy() 96 | for tn in all_type_names.keys(): 97 | if tn == 'compound': 98 | continue 99 | all_type_names[tn] = f'[{TYPENAME_TO_TYPE[tn]}]: N/A' 100 | if 'compound' in all_type_names.keys(): 101 | all_type_names.pop('compound') 102 | 103 | # preparing the compound answer 104 | for tn in document[type_name].keys(): 105 | type = TYPENAME_TO_TYPE[tn] 106 | subanswer = document[type_name][tn] 107 | 108 | if tn in all_type_names.keys(): 109 | all_type_names[tn] = f"[{type}]: {subanswer}" 110 | elif not tn in WARNING_MISSING_TYPENAMES: 111 | print(f'WARNING: type-name {tn} is not in question signature for {type_name}: please add it in the OrderedDict if you want to keep.') 112 | WARNING_MISSING_TYPENAMES.append(tn) 113 | 114 | answer = ' '.join(all_type_names.values()) 115 | 116 | qas = [] 117 | for question in selected_questions: 118 | answers = [ 119 | { 120 | "answer_start": -1, # None, 121 | "text": answer 122 | } 123 | ] 124 | qa = { 125 | "answers": answers, 126 | "question": question, 127 | "id": qa_id, 128 | } 129 | qas.append(qa) 130 | return qas 131 | 132 | 133 | def get_notapplicable_question_answers( 134 | qa_id: str = 'matriculas.endereco', 135 | choose_question: str = 'first', 136 | list_of_use_compound_question: Optional[List[str]] = None): 137 | """ 138 | Return a list of question-answers in SQUAD format 
for non-annotated 139 | type-names. 140 | 141 | The size of the list is: 142 | - one (choose_question as 'first' or 'random') 143 | - the number of questions defined as 'compound' for the current chunk 144 | returned by get_questions_for_chunk(chunk) (choose_question as 'all') 145 | """ 146 | if list_of_use_compound_question is None: 147 | list_of_use_compound_question = [] 148 | 149 | is_compound = qa_id in list_of_use_compound_question 150 | 151 | questions = get_questions_for_chunk(qa_id=qa_id, is_compound=is_compound) 152 | if questions is None: 153 | questions = [] 154 | if choose_question == 'first': 155 | selected_questions = [questions[0]] 156 | elif choose_question == 'random': 157 | idx = nr.randint(len(questions)) 158 | selected_questions = [questions[idx]] 159 | else: 160 | selected_questions = questions 161 | 162 | if is_compound: 163 | # type_name = qa_id.split('.')[1] 164 | all_type_names = get_questions_for_chunk(qa_id=qa_id, return_dict=True).copy() 165 | for tn in all_type_names.keys(): 166 | if tn == 'compound': 167 | continue 168 | all_type_names[tn] = f'[{TYPENAME_TO_TYPE[tn]}]: N/A' 169 | if 'compound' in all_type_names.keys(): 170 | all_type_names.pop('compound') 171 | 172 | answer = ' '.join(all_type_names.values()) 173 | else: 174 | type_name = qa_id.split('.', 1)[1] 175 | type = TYPENAME_TO_TYPE[type_name] 176 | 177 | answer = f"[{type}]: N/A" 178 | 179 | qas = [] 180 | for question in selected_questions: 181 | answers = [ 182 | { 183 | "answer_start": -1, # None, 184 | "text": answer 185 | } 186 | ] 187 | qa = { 188 | "answers": answers, 189 | "question": question, 190 | "id": qa_id, 191 | } 192 | qas.append(qa) 193 | return qas 194 | 195 | 196 | def get_document_data(document: Dict, 197 | document_type: str = 'publicacoes', 198 | all_qa_ids: List[str] = ['publicacoes.orgao'], 199 | max_size: int = 4000, 200 | list_of_use_compound_question: Optional[List[str]] = None, 201 | list_of_type_names: Optional[List[str]] = None, 202 | context_content: str = 'abertura', 203 | window_overlap: float = 0.5, 204 | max_windows: int = 3, 205 | tokenizer: T5Tokenizer = None, 206 | max_tokens: int = 512, 207 | choose_question: str = 'first', 208 | use_sentence_id: bool = False): 209 | # using the document uuid as title 210 | # paragraphs will contain only one dict with context of document and all the 211 | # question-answers 212 | if list_of_type_names is None: 213 | list_of_type_names = [] 214 | if list_of_use_compound_question is None: 215 | list_of_use_compound_question = [] 216 | 217 | # assuming that this is the largest question 218 | largest_question = 'Quais são as principais informações do documento de publicação?' 
219 | 220 | # create dummy document 221 | dummy_document = {} 222 | dummy_document['text'] = document['text'] if 'text' in document.keys() else document['texto'] 223 | dummy_document['uuid'] = document['uuid'] 224 | 225 | # exclude crazy chars 226 | dummy_document['text'] = dummy_document['text'].replace('༡༨/༢','') 227 | 228 | # extract the context(s) and respective offset(s) 229 | contexts, offsets = get_context( 230 | dummy_document, 231 | context_content=context_content, 232 | max_size=max_size, 233 | start_position=0, 234 | proportion_before=0.2, 235 | return_position_offset=True, 236 | use_sentence_id=use_sentence_id, 237 | tokenizer=tokenizer, 238 | max_tokens=max_tokens, 239 | question=largest_question, 240 | window_overlap=window_overlap, 241 | max_windows=max_windows) 242 | if not isinstance(contexts, list): 243 | contexts = [contexts] 244 | offsets = [offsets] 245 | 246 | # document structure in SQuAD format 247 | document_data = { 248 | "title": document['uuid'], 249 | "paragraphs": [] 250 | } 251 | counter_qas = 0 252 | 253 | for context, _ in zip(contexts, offsets): 254 | # create one paragraph for each context. 255 | # it will be unique, except for windows-based context_contents 256 | paragraph = { 257 | "context": context, 258 | "qas": [], 259 | } 260 | paragraph_counter_qas = 0 261 | 262 | # control which of the requested qa_ids were satified. It force not-applicable 263 | # qas for qa_ids whose information does not exist in the dataset. 264 | all_qa_ids_satisfied = [] 265 | 266 | # We will use only the fields listed in list_of_type_names 267 | for qa_id in list_of_type_names: 268 | doc_type = qa_id.split('.')[0] 269 | if doc_type != document_type: 270 | continue 271 | 272 | if qa_id in list_of_use_compound_question: 273 | questions = get_questions_for_chunk(qa_id=qa_id, is_compound=True) 274 | qas = get_compound_question_answers( 275 | document, 276 | questions=questions, 277 | qa_id=qa_id, 278 | choose_question=choose_question) 279 | else: 280 | questions = get_questions_for_chunk(qa_id=qa_id) 281 | qas = get_question_answers(document, 282 | questions=questions, 283 | qa_id=qa_id, 284 | choose_question=choose_question) 285 | 286 | paragraph_counter_qas += len(qas) 287 | 288 | # Include the question-answer of the current type_name (e.g., tipo) 289 | # in the current paragraph of the current document 290 | for qa in qas: 291 | paragraph["qas"].append(qa) 292 | all_qa_ids_satisfied.append(qa_id) 293 | 294 | # extract not-applicable qas for non-existent information. 
295 | add_not_applicable = sorted( 296 | list(set(all_qa_ids) - set(all_qa_ids_satisfied)) 297 | ) 298 | 299 | for qa_id in add_not_applicable: 300 | 301 | qas = get_notapplicable_question_answers( 302 | qa_id=qa_id, 303 | choose_question='first', # avoid using too much negatives 304 | list_of_use_compound_question=list_of_use_compound_question) 305 | 306 | paragraph_counter_qas += len(qas) 307 | 308 | # Include the not-applicable question-answer in the current 309 | # paragraph of the current document 310 | for qa in qas: 311 | paragraph["qas"].append(qa) 312 | all_qa_ids_satisfied.append(qa_id) 313 | 314 | # Add the current paragraph in the structure 315 | if paragraph_counter_qas > 0: 316 | document_data["paragraphs"].append(paragraph) 317 | counter_qas += paragraph_counter_qas 318 | 319 | return document_data, counter_qas 320 | -------------------------------------------------------------------------------- /information_extraction_t5/data/convert_dataset_to_squad.py: -------------------------------------------------------------------------------- 1 | """Converts the dataset into SQuAD format.""" 2 | import json 3 | import os 4 | from typing import List, Tuple 5 | 6 | import configargparse 7 | import numpy.random as nr 8 | from sklearn.model_selection import train_test_split 9 | from transformers import AutoTokenizer 10 | 11 | import information_extraction_t5.data.basic_to_squad as basic_to_squad 12 | from information_extraction_t5.data.file_handling import load_raw_data 13 | from information_extraction_t5.features.preprocess import get_all_qa_ids 14 | 15 | DATA_VERSION = "0.1" 16 | 17 | 18 | def convert_raw_data(documents: List[tuple], 19 | project: str, 20 | all_qa_ids: List[str], 21 | tokenizer: AutoTokenizer, 22 | choose_question: str, 23 | use_sentence_id: bool, 24 | args) -> Tuple[List[dict], int]: 25 | """Loops over the documents and converts to SQuaD format. 
26 | 27 | Args: 28 | documents: list with selected document tuples 29 | project: the project name 30 | tokenizer: T5 Tokenizer instance 31 | choose_question: flag to indicate which questions to use 32 | is_true: True for training data (useful for function build_answer) 33 | args: additional configs 34 | """ 35 | qa_data = [] 36 | qa_counter = 0 37 | 38 | for doc_id, document in documents: 39 | document['uuid'] = doc_id 40 | document_data, count = convert_document( 41 | document, 42 | project=project, 43 | all_qa_ids=all_qa_ids, 44 | max_size=args.max_size, 45 | type_names=args.type_names, 46 | use_compound_question=args.use_compound_question, 47 | return_raw_text=args.return_raw_text, 48 | context_content=args.context_content, 49 | window_overlap=args.window_overlap, 50 | max_windows=args.max_windows, 51 | tokenizer=tokenizer, 52 | max_tokens=args.max_seq_length, 53 | choose_question=choose_question, 54 | use_sentence_id=use_sentence_id) 55 | qa_counter += count 56 | 57 | # To finish a document, include its document_data into the 58 | # qa_json 59 | if count > 0: 60 | qa_data.append(document_data) 61 | 62 | return qa_data, qa_counter 63 | 64 | 65 | def convert_document(document, 66 | project='publicacoes', 67 | all_qa_ids=['publicacoes.tipoPublicacao'], 68 | max_size=4000, 69 | type_names=None, 70 | use_compound_question=None, 71 | return_raw_text=None, 72 | context_content='abertura', 73 | window_overlap=0.5, 74 | max_windows=3, 75 | tokenizer=None, 76 | max_tokens=512, 77 | choose_question='first', 78 | use_sentence_id: bool = False): 79 | """Converts a document and returns it along with the question count.""" 80 | if return_raw_text is None: 81 | return_raw_text = [] 82 | if use_compound_question is None: 83 | use_compound_question = [] 84 | if type_names is None: 85 | type_names = [] 86 | 87 | document_data, count = basic_to_squad.get_document_data( 88 | document, 89 | document_type=project, 90 | all_qa_ids=all_qa_ids, 91 | max_size=max_size, 92 | list_of_use_compound_question=use_compound_question, 93 | list_of_type_names=type_names, 94 | context_content=context_content, 95 | window_overlap=window_overlap, 96 | max_windows=max_windows, 97 | tokenizer=tokenizer, 98 | max_tokens=max_tokens, 99 | choose_question=choose_question, 100 | use_sentence_id=use_sentence_id) 101 | 102 | return document_data, count 103 | 104 | 105 | def main(): 106 | """Preparing data for QA in SQuAD format.""" 107 | parser = configargparse.ArgParser( 108 | 'Preparing data for QA', 109 | config_file_parser_class=configargparse.YAMLConfigFileParser) 110 | parser.add_argument('-c', '--my-config', required=True, 111 | is_config_file=True, 112 | help='config file path') 113 | 114 | parser.add_argument('--project', action='append', required=True, 115 | help='List pointing out the project each train/test ' 116 | 'dataset came from') 117 | parser.add_argument('--raw_data_file', action='append', required=True, 118 | help='List of raw train datasets to use in the ' 119 | 'experiment') 120 | parser.add_argument('--raw_valid_data_file', action='append', 121 | help='List of raw validation datasets to use in the ' 122 | 'experiment') 123 | parser.add_argument('--raw_test_data_file', action='append', 124 | help='List of raw test datasets to use in the ' 125 | 'experiment') 126 | parser.add_argument('--train_file', type=str, 127 | default='data/interim/train-v0.1.json') 128 | parser.add_argument('--valid_file', type=str, 129 | default='data/interim/dev-v0.1.json') 130 | parser.add_argument('--test_file', type=str, 131 | 
default='data/interim/test-v0.1.json') 132 | parser.add_argument('--type_names', nargs='+', default=['matriculas.imovel'], 133 | help='List of first-level chunks (qa_id) to use in the ' 134 | 'experiment') 135 | parser.add_argument('--use_compound_question', nargs='+', 136 | default=['matriculas.area_terreno_comp'], 137 | help='List of fields (qa_id) that must use ' 138 | 'compound question gathering all nested information ' 139 | 'in answer (instead of per-subchunk questions)') 140 | parser.add_argument('--return_raw_text', nargs='+', default=['estado'], 141 | help='List of fields (type_name) that ' 142 | 'require both canonical answer and how it appears in ' 143 | 'the text. Valid to individual and compound questions. NOT IMPLEMENTED.') 144 | 145 | parser.add_argument("--valid_percent", default=0.2, type=float, 146 | help='Percentage of dataset to used as validation') 147 | parser.add_argument("--max_size", default=1024, type=int, 148 | help="The maximum input length after char-based " 149 | "tokenization. And also the maximum context size " 150 | "for char-based contexts.") 151 | parser.add_argument("--context_content", type=str, default='abertura', 152 | help="Definition of context content for generic " 153 | "type-names (max_size, position, token, " 154 | "position_token, windows, or windows_token)") 155 | parser.add_argument("--train_choose_question", type=str, default='all', 156 | help='Choose which question of the list to use for ' 157 | 'training set (first, random, all). ' 158 | 'Validation/test set use first.') 159 | parser.add_argument('--train_force_qa', action="store_true", 160 | help='Set this flag if you want to force not-applicable ' 161 | 'qas for qa_ids that does not exist in the document. ' 162 | 'This is required for test set.') 163 | parser.add_argument("--seed", type=int, default=42, 164 | help="random seed for choose qestion") 165 | 166 | # used to get contexts 167 | parser.add_argument("--model_name_or_path", default='t5-small', type=str, 168 | help="Path to pretrained model or model identifier " 169 | "from huggingface.co/models") 170 | parser.add_argument("--config_name", default="", type=str, 171 | help="Pretrained config name or path if not the same " 172 | "as model_name") 173 | parser.add_argument("--tokenizer_name", default="", type=str, 174 | help="Pretrained tokenizer name or path if not the " 175 | "same as model_name") 176 | parser.add_argument("--do_lower_case", action="store_true", 177 | help="Set this flag if you are using an uncased " 178 | "model.") 179 | parser.add_argument("--max_seq_length", default=384, type=int, 180 | help="The maximum total input sequence length after " 181 | "WordPiece tokenization. 
Sequences longer than this " 182 | "will be truncated, and sequences shorter than this " 183 | "will be padded.") 184 | parser.add_argument("--window_overlap", default=0.5, type=float, 185 | help="Define the overlapping of sliding windows.") 186 | parser.add_argument("--max_windows", default=3, type=int, 187 | help="the maximum number of windows to generate, use -1 " 188 | "to get all the possible windows.") 189 | parser.add_argument("--use_sentence_id", action="store_true", 190 | help="Set this flag if you are using the approach that " 191 | "breaks the contexts into sentences.") 192 | 193 | args, _ = parser.parse_known_args() 194 | 195 | assert len(args.project) == len(args.raw_data_file) == \ 196 | len(args.raw_valid_data_file) == len(args.raw_test_data_file), \ 197 | ('raw_data_file, raw_valid_data_file and raw_test_data_file lists ' 198 | 'must have same size of projects list') 199 | assert args.train_choose_question in ['first', 'random', 'all'], \ 200 | ('train_choose_question must be "first", "random" or "all"') 201 | assert args.context_content in ['max_size', 'position', 'token', 202 | 'position_token', 'windows', 'windows_token'], \ 203 | ('context_content must be "max_size", "position", "token", "position_token", ' 204 | '"windows" or "windows_token"') 205 | 206 | # set tokenizer for context_context based on tokens 207 | tokenizer = AutoTokenizer.from_pretrained( 208 | args.tokenizer_name if args.tokenizer_name 209 | else args.model_name_or_path, 210 | use_fast=False, 211 | do_lower_case=args.do_lower_case 212 | ) 213 | 214 | # setting seed for choose question 215 | nr.seed(args.seed) 216 | 217 | print('>> Using the following fields with respective compound-qa indicator:') 218 | for type_name in args.type_names: 219 | print(f'- {type_name:<43} {type_name in args.use_compound_question}\t') 220 | print(f'>> List of fields that require how answer appears in ' 221 | f'the text: {args.return_raw_text}') 222 | 223 | qa_train_json = {'data': [], 'version': DATA_VERSION} 224 | qa_valid_json = {'data': [], 'version': DATA_VERSION} 225 | qa_test_json = {'data': [], 'version': DATA_VERSION} 226 | 227 | train_qa_counter, valid_qa_counter, test_qa_counter = 0, 0, 0 228 | 229 | for (raw_data_file, raw_valid_data_file, raw_test_data_file, project) in \ 230 | zip(args.raw_data_file, args.raw_valid_data_file, 231 | args.raw_test_data_file, args.project): 232 | 233 | print('\n') 234 | 235 | # Extract the list of all possible qa_ids for the current document class. 
236 | # This forces N/A qas for valid/test, and for train if --train_force_qa 237 | all_qa_ids = get_all_qa_ids( 238 | document_class=project, 239 | list_of_type_names=args.type_names, 240 | list_of_use_compound_question=args.use_compound_question) 241 | 242 | # prepare VALIDATION set (if provided) 243 | has_valid_set = raw_valid_data_file is not None \ 244 | and raw_valid_data_file != 'None' 245 | 246 | if has_valid_set: 247 | 248 | print(f'>> Loading the VALID dataset {raw_valid_data_file} ' 249 | f'({project})...') 250 | _, all_documents, raw_data_fname = load_raw_data( 251 | raw_valid_data_file 252 | ) 253 | 254 | print(f'>> Converting the VALID dataset {raw_valid_data_file} ' 255 | 'into SQuAD format...') 256 | qa_data, qa_counter = convert_raw_data( 257 | documents=all_documents, 258 | project=project, 259 | all_qa_ids=all_qa_ids, 260 | tokenizer=tokenizer, 261 | choose_question='first', 262 | use_sentence_id=args.use_sentence_id, 263 | args=args 264 | ) 265 | 266 | if qa_counter > 0: 267 | print(f'{raw_valid_data_file} (valid) dataset has ' 268 | f'{qa_counter} question-answers') 269 | valid_qa_counter += qa_counter 270 | qa_valid_json['data'].extend(qa_data) 271 | 272 | if raw_valid_data_file.endswith('tar') \ 273 | or raw_valid_data_file.endswith('tar.gz'): 274 | os.unlink(raw_data_fname) 275 | 276 | has_test_set = raw_test_data_file is not None \ 277 | and raw_test_data_file != 'None' 278 | 279 | # prepare TEST set (if provided) 280 | if has_test_set: 281 | 282 | print(f'>> Loading the TEST dataset {raw_test_data_file} ' 283 | f'({project})...') 284 | _, all_documents, raw_data_fname = load_raw_data( 285 | raw_test_data_file 286 | ) 287 | 288 | print(f'>> Converting the TEST dataset {raw_test_data_file} into ' 289 | 'SQuAD format...') 290 | qa_data, qa_counter = convert_raw_data( 291 | documents=all_documents, 292 | project=project, 293 | all_qa_ids=all_qa_ids, 294 | tokenizer=tokenizer, 295 | choose_question='first', 296 | use_sentence_id=args.use_sentence_id, 297 | args=args 298 | ) 299 | 300 | if qa_counter > 0: 301 | print(f'{raw_test_data_file} (test) dataset has ' 302 | f'{qa_counter} question-answers') 303 | test_qa_counter += qa_counter 304 | qa_test_json['data'].extend(qa_data) 305 | 306 | if raw_test_data_file.endswith('tar') \ 307 | or raw_test_data_file.endswith('tar.gz'): 308 | os.unlink(raw_data_fname) 309 | 310 | # prepare TRAIN set 311 | print(f'>> Loading the dataset {raw_data_file} ({project})...') 312 | _, all_documents, raw_data_fname = load_raw_data( 313 | raw_data_file 314 | ) 315 | 316 | if not has_valid_set and 0 < args.valid_percent < 1.0: 317 | documents_train, documents_valid = train_test_split( 318 | all_documents, 319 | test_size=args.valid_percent, 320 | random_state=42) 321 | 322 | qa_data, qa_counter = convert_raw_data( 323 | documents=documents_valid, 324 | project=project, 325 | all_qa_ids=all_qa_ids, 326 | tokenizer=tokenizer, 327 | choose_question='first', 328 | use_sentence_id=args.use_sentence_id, 329 | args=args 330 | ) 331 | 332 | # if a TEST dataset is provided, use the split for VALIDATION only, 333 | # otherwise, use it for both VALIDATION and TEST 334 | if has_test_set: 335 | if qa_counter > 0: 336 | print(f'{raw_data_file} (valid) dataset has {qa_counter} ' 337 | f'question-answers') 338 | valid_qa_counter += qa_counter 339 | qa_valid_json['data'].extend(qa_data) 340 | else: 341 | if qa_counter > 0: 342 | print(f'{raw_data_file} (valid/test) dataset has ' 343 | f'{qa_counter} question-answers') 344 | valid_qa_counter += qa_counter 345 
| qa_valid_json['data'].extend(qa_data) 346 | test_qa_counter += qa_counter 347 | qa_test_json['data'].extend(qa_data) 348 | 349 | else: 350 | documents_train = all_documents 351 | 352 | print( 353 | f'>> Converting the dataset {raw_data_file} into SQuAD format...' 354 | ) 355 | qa_data, qa_counter = convert_raw_data( 356 | documents=documents_train, 357 | project=project, 358 | all_qa_ids=all_qa_ids if args.train_force_qa else [], 359 | tokenizer=tokenizer, 360 | choose_question=args.train_choose_question, 361 | use_sentence_id=args.use_sentence_id, 362 | args=args 363 | ) 364 | print(f'{raw_data_file} (train) dataset has {qa_counter} ' 365 | f'question-answers') 366 | train_qa_counter += qa_counter 367 | qa_train_json['data'].extend(qa_data) 368 | 369 | if raw_data_file.endswith('tar') or raw_data_file.endswith('tar.gz'): 370 | os.unlink(raw_data_fname) 371 | 372 | print(f'\nTRAIN dataset has {train_qa_counter} question-answers') 373 | print(f'VALID dataset has {valid_qa_counter} question-answers') 374 | print(f'TEST dataset has {test_qa_counter} question-answers') 375 | 376 | # Save the train, valid and test processed data 377 | os.makedirs(os.path.dirname(args.train_file), exist_ok=True) 378 | with open(args.train_file, 'w', encoding='utf-8') as outfile: 379 | json.dump(qa_train_json, outfile) 380 | with open(args.valid_file, 'w', encoding='utf-8') as outfile: 381 | json.dump(qa_valid_json, outfile) 382 | with open(args.test_file, 'w', encoding='utf-8') as outfile: 383 | json.dump(qa_test_json, outfile) 384 | 385 | 386 | if __name__ == "__main__": 387 | main() 388 | -------------------------------------------------------------------------------- /information_extraction_t5/data/convert_squad_to_t5.py: -------------------------------------------------------------------------------- 1 | """Converts the dataset from SQuAD format to T5 format.""" 2 | import torch 3 | from rich.progress import track 4 | from typing import List, Union 5 | 6 | from transformers.data.processors.squad import SquadExample 7 | 8 | from information_extraction_t5.features.preprocess import generate_t5_input_sentence, generate_t5_label_sentence 9 | from information_extraction_t5.utils.balance_data import balance_data 10 | 11 | class QADataset(torch.utils.data.Dataset): 12 | """ 13 | Dataset for question-answering. 14 | 15 | Args: 16 | examples: the inputs to the model in T5 format. 17 | labels: the targets in T5 format. 18 | document_ids: the IDs to reference specific documents. 19 | example_ids: the IDs to reference specific pairs dataset-field. 20 | negative_ratios: the resultant negative-positive ratio of the samples. 21 | return_ids: indicates if the dataset will return the document-ids and example_ids. 22 | 23 | Returns: 24 | Dataset 25 | 26 | Ex.: 27 | examples = ['question: When was the Third Assessment Report published? 
context: Another example of scientific research ...'] 28 | labels = ['2011'] 29 | document_ids= ['ec57d59d-972c-40fc-82ff-c7c818d7dd39'] 30 | example_ids = ['reports.third_assessment.publication_data'] 31 | """ 32 | 33 | def __init__(self, examples, labels, document_ids, example_ids, negative_ratio=1.0, return_ids=False): 34 | if negative_ratio >= 1.0: 35 | self.examples, self.labels, self.document_ids, self.example_ids = balance_data( 36 | examples, labels, document_ids, example_ids, negative_ratio=negative_ratio 37 | ) 38 | else: 39 | self.examples = examples 40 | self.labels = labels 41 | self.document_ids = document_ids 42 | self.example_ids = example_ids 43 | self.return_ids = return_ids 44 | 45 | def __len__(self): 46 | return len(self.examples) 47 | 48 | def __getitem__(self, idx): 49 | if self.return_ids: 50 | return self.examples[idx], self.labels[idx], self.document_ids[idx], self.example_ids[idx] 51 | else: 52 | return self.examples[idx], self.labels[idx] 53 | 54 | 55 | def squad_convert_examples_to_t5_format( 56 | examples: List[SquadExample], 57 | use_sentence_id: bool = True, 58 | evaluate: bool = False, 59 | negative_ratio: int = 0, 60 | return_dataset: Union[bool, str] = False, 61 | tqdm_enabled: bool = True, 62 | ): 63 | """Converts a list of examples into a list to the T5 format for 64 | question-answer with prefix question/context. 65 | 66 | Args: 67 | examples: examples to convert to T5 format. 68 | evaluate: True for validation or test dataset. 69 | negative_ratio: balances dataset using negative-positive ratio. 70 | return_dataset: if True, returns a torch.data.TensorDataset. 71 | tqdm_enabled: if True, uses tqdm. 72 | 73 | Returns: 74 | list of examples into a list to the T5 format for 75 | question-answer with prefix question/context. 76 | 77 | Examples: 78 | >>> processor = SquadV2Processor() 79 | >>> examples = processor.get_dev_examples(data_dir) 80 | >>> examples, labels = squad_convert_examples_to_t5_format( 81 | >>> examples=examples) 82 | """ 83 | 84 | examples_t5_format = [] 85 | labels_t5_format = [] 86 | document_ids = [] # which document the example came from? (e.g, 54f94949-0fb4-45e5-81dd-c4385f681e2b) 87 | example_ids = [] # which document-type and type-name does the example belong to? 
(e.g., matriculas.endereco) 88 | 89 | for example in track(examples, description="convert examples to T5 format", disable=not tqdm_enabled): 90 | 91 | # prepare the input 92 | x = generate_t5_input_sentence(example.context_text, example.question_text, use_sentence_id) 93 | 94 | # extract answer and start position (squad-example is in evaluate mode) 95 | y = example.answers[0]['text'] # getting the first answer in the list 96 | answer_start = example.answers[0]['answer_start'] 97 | 98 | # prepate the target 99 | y = generate_t5_label_sentence(y, answer_start, example.context_text, use_sentence_id) 100 | 101 | examples_t5_format.append(x) 102 | labels_t5_format.append(y) 103 | document_ids.append(example.title) 104 | example_ids.append(example.qas_id) 105 | 106 | if return_dataset: 107 | # Create the dataset 108 | dataset = QADataset(examples_t5_format, labels_t5_format, document_ids, 109 | example_ids, negative_ratio=negative_ratio, return_ids=evaluate) 110 | 111 | return examples_t5_format, labels_t5_format, dataset 112 | else: 113 | return examples_t5_format, labels_t5_format -------------------------------------------------------------------------------- /information_extraction_t5/data/file_handling.py: -------------------------------------------------------------------------------- 1 | """Tools for handling dataset files.""" 2 | import glob 3 | import json 4 | import tarfile 5 | from typing import Tuple 6 | 7 | 8 | def decompress(fname): 9 | """Unpack a tar file and return the name of the JSON dataset file. 10 | 11 | Args: 12 | fname: compressed dataset file name 13 | 14 | Returns: 15 | The name of the unpacked JSON raw dataset file. 16 | """ 17 | if fname.endswith("tar.gz"): 18 | tar = tarfile.open(fname, "r:gz") 19 | tar.extractall('data/raw/') 20 | tar.close() 21 | elif fname.endswith("tar"): 22 | tar = tarfile.open(fname, "r:") 23 | tar.extractall('data/raw/') 24 | tar.close() 25 | 26 | fname = glob.glob('data/raw/*json')[-1] 27 | 28 | return fname 29 | 30 | 31 | def load_raw_data(fname: str) -> Tuple[dict, list, str]: 32 | """Loads raw dataset file. 33 | 34 | Args: 35 | fname: the dataset file name 36 | 37 | Returns: 38 | A tuple with the json-like raw data dict and a corresponding list 39 | of tuples with keys and values. 40 | """ 41 | if fname.endswith('tar') or fname.endswith('tar.gz'): 42 | print(f'>> Decompressing dataset file {fname}...') 43 | raw_data_fname = decompress(fname) 44 | else: 45 | raw_data_fname = fname 46 | 47 | with open(raw_data_fname) as f: 48 | raw_data = json.load(f) 49 | documents = list(raw_data.items()) 50 | 51 | return raw_data, documents, raw_data_fname 52 | -------------------------------------------------------------------------------- /information_extraction_t5/data/qa_data.py: -------------------------------------------------------------------------------- 1 | """Implement DataModule""" 2 | import os 3 | from typing import Optional 4 | import configargparse 5 | 6 | import torch 7 | from torch.utils.data import DataLoader, Dataset 8 | import pytorch_lightning as pl 9 | from transformers.data.processors.squad import SquadV1Processor 10 | 11 | from information_extraction_t5.data.convert_squad_to_t5 import squad_convert_examples_to_t5_format 12 | 13 | class QADataModule(pl.LightningDataModule): 14 | 15 | def __init__(self, hparams): 16 | super().__init__() 17 | self.hparams.update(vars(hparams)) 18 | 19 | def setup(self, stage: Optional[str] = None): 20 | input_dir = self.hparams.data_dir if self.hparams.data_dir else "." 
21 | 22 | # Prepare train and valid datasets 23 | if stage == 'fit' or stage is None: 24 | # Load data examples from cache or dataset file 25 | cached_examples_train_file = os.path.join( 26 | input_dir, 27 | f"cached_train_{list(filter(None, self.hparams.model_name_or_path.split('/'))).pop()}" 28 | ) 29 | cached_examples_valid_file = os.path.join( 30 | input_dir, 31 | f"cached_valid_{list(filter(None, self.hparams.model_name_or_path.split('/'))).pop()}" 32 | ) 33 | 34 | # Init examples and dataset from cache if it exists 35 | if os.path.exists(cached_examples_train_file) and \ 36 | os.path.exists(cached_examples_valid_file) and not self.hparams.overwrite_cache: 37 | print("Loading examples from cached files %s and %s" % (cached_examples_train_file, cached_examples_valid_file)) 38 | 39 | examples_and_dataset = torch.load(cached_examples_train_file) 40 | self.train_dataset = examples_and_dataset["dataset"] 41 | examples_and_dataset = torch.load(cached_examples_valid_file) 42 | self.valid_dataset = examples_and_dataset["dataset"] 43 | else: 44 | print("Creating examples from dataset file at %s" % input_dir) 45 | 46 | processor = SquadV1Processor() 47 | 48 | # examples_train = processor.get_train_examples(self.hparams.data_dir, filename=self.hparams.train_file) 49 | examples_train = processor.get_dev_examples( 50 | self.hparams.data_dir, filename=self.hparams.train_file 51 | ) 52 | examples_valid = processor.get_dev_examples( 53 | self.hparams.data_dir, filename=self.hparams.valid_file 54 | ) 55 | 56 | _, _, self.train_dataset = squad_convert_examples_to_t5_format( 57 | examples=examples_train, 58 | use_sentence_id=self.hparams.use_sentence_id, 59 | evaluate=False, 60 | negative_ratio=self.hparams.negative_ratio, 61 | return_dataset=True, 62 | ) 63 | _, _, self.valid_dataset = squad_convert_examples_to_t5_format( 64 | examples=examples_valid, 65 | use_sentence_id=self.hparams.use_sentence_id, 66 | evaluate=True, 67 | negative_ratio=0, 68 | return_dataset=True, 69 | ) 70 | 71 | print(f"Saving examples into cached file {cached_examples_train_file}") 72 | torch.save({"dataset": self.train_dataset}, cached_examples_train_file) 73 | print(f"Saving examples into cached file {cached_examples_valid_file}") 74 | torch.save({"dataset": self.valid_dataset}, cached_examples_valid_file) 75 | 76 | print(f'>> train-dataset: {len(self.train_dataset)} samples') 77 | print(f'>> valid-dataset: {len(self.valid_dataset)} samples') 78 | 79 | # Prepare test dataset 80 | if stage == 'test' or stage is None: 81 | 82 | assert self.hparams.test_file, 'test_file must be specificed' 83 | 84 | cached_examples_test_file = os.path.join( 85 | input_dir, 86 | f"cached_test_{list(filter(None, self.hparams.model_name_or_path.split('/'))).pop()}" 87 | ) 88 | 89 | # Init examples and dataset from cache if it exists 90 | if os.path.exists(cached_examples_test_file) and not self.hparams.overwrite_cache: 91 | 92 | print("Loading examples from cached file %s" % (cached_examples_test_file)) 93 | 94 | examples_and_dataset = torch.load(cached_examples_test_file) 95 | self.test_dataset = examples_and_dataset["dataset"] 96 | else: 97 | print("Creating examples from dataset file at %s" % input_dir) 98 | 99 | processor = SquadV1Processor() 100 | 101 | examples_test = processor.get_dev_examples(self.hparams.data_dir, filename=self.hparams.test_file) 102 | 103 | _, _, self.test_dataset = squad_convert_examples_to_t5_format( 104 | examples=examples_test, 105 | use_sentence_id=self.hparams.use_sentence_id, 106 | evaluate=True, 107 | 
negative_ratio=0, 108 | return_dataset=True, 109 | ) 110 | 111 | print("Saving examples into cached file %s" % cached_examples_test_file) 112 | torch.save({"dataset": self.test_dataset}, cached_examples_test_file) 113 | 114 | print(f'>> test-dataset: {len(self.test_dataset)} samples') 115 | 116 | def get_dataloader(self, dataset: Dataset, batch_size: int, shuffle: bool, num_workers: int) -> DataLoader: 117 | return DataLoader( 118 | dataset, 119 | batch_size=batch_size, 120 | shuffle=shuffle, 121 | num_workers=num_workers 122 | ) 123 | 124 | def train_dataloader(self,) -> DataLoader: 125 | return self.get_dataloader( 126 | self.train_dataset, 127 | batch_size=self.hparams.train_batch_size, 128 | shuffle=self.hparams.shuffle_train, 129 | num_workers=self.hparams.num_workers 130 | ) 131 | 132 | def val_dataloader(self,) -> DataLoader: 133 | return self.get_dataloader( 134 | self.valid_dataset, 135 | batch_size=self.hparams.val_batch_size, 136 | shuffle=False, 137 | num_workers=self.hparams.num_workers 138 | ) 139 | 140 | def test_dataloader(self,) -> DataLoader: 141 | return self.get_dataloader( 142 | self.test_dataset, 143 | batch_size=self.hparams.val_batch_size, 144 | shuffle=False, 145 | num_workers=self.hparams.num_workers 146 | ) 147 | 148 | @staticmethod 149 | def add_model_specific_args(parent_parser): 150 | parser = configargparse.ArgumentParser(parents=[parent_parser], add_help=False) 151 | parser.add_argument( 152 | "--data_dir", 153 | default=None, 154 | type=str, 155 | help="The input data dir. Should contain the .json files for the task." 156 | ) 157 | parser.add_argument( 158 | "--train_file", 159 | default=None, 160 | type=str, 161 | help="The input training file. If a data dir is specified, will look for the file there" 162 | ) 163 | parser.add_argument( 164 | "--valid_file", 165 | default=None, 166 | type=str, 167 | help="The input evaluation file. If a data dir is specified, will look for the file there" 168 | ) 169 | parser.add_argument( 170 | "--test_file", 171 | default=None, 172 | type=str, 173 | help="The input test file. If a data dir is specified, will look for the file there" 174 | ) 175 | parser.add_argument("--train_batch_size", default=8, type=int, 176 | help="Batch size per GPU/CPU for training.") 177 | parser.add_argument("--val_batch_size", default=8, type=int, 178 | help="Batch size per GPU/CPU for evaluation.") 179 | parser.add_argument("--shuffle_train", action="store_true", 180 | help="Shuffle the train dataset") 181 | parser.add_argument("--negative_ratio", default=0, type=int, 182 | help="Set the positive-negative ratio of the training dataset. " 183 | "Data balancing is performed for each pair document-typename. 
If less than one, keep the ratio of the original dataset") 184 | parser.add_argument("--use_sentence_id", action="store_true", 185 | help="Set this flag if you are using the approach that breaks the contexts into sentences") 186 | parser.add_argument("--overwrite_cache", action="store_true", 187 | help="Overwrite the cached training and evaluation sets") 188 | 189 | return parser 190 | -------------------------------------------------------------------------------- /information_extraction_t5/features/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuralmind-ai/information-extraction-t5/988c589b433d96004139e1f63bfefd6778c0851b/information_extraction_t5/features/__init__.py -------------------------------------------------------------------------------- /information_extraction_t5/features/context.py: -------------------------------------------------------------------------------- 1 | import math 2 | import numpy as np 3 | import re 4 | from typing import Any, Dict, List, Optional, Tuple, Union 5 | from transformers import AutoTokenizer, PreTrainedTokenizerBase 6 | 7 | 8 | def get_tokens_and_offsets(text: str, tokenizer: PreTrainedTokenizerBase) -> List[Tuple[Any, int, int]]: 9 | tokens = tokenizer.tokenize(text) 10 | token_lens = [len(token) for token in tokens] 11 | token_lens[0] -= 1 # Ignore first "_" token 12 | token_ends = np.cumsum(token_lens) 13 | token_starts = [0] + token_ends[:-1].tolist() 14 | tokens_and_offsets = list(zip(tokens, token_starts, token_ends)) 15 | return tokens_and_offsets 16 | 17 | 18 | def get_token_id_from_position(tokens_and_offsets: List[Tuple[Any, int, int]], position: int) -> int: 19 | for idx, tok_offs in enumerate(tokens_and_offsets): 20 | _, start, end = tok_offs 21 | if start <= position < end: 22 | return idx 23 | return len(tokens_and_offsets) - 1 24 | 25 | 26 | def get_max_size_context(document: Dict, max_size: int = 4000, question: str = 'Qual?') -> str: 27 | """Returns the first max_size characters of the document_text. 28 | """ 29 | document_text = document['text'] 30 | question_sentence = f'question: {question} context: ' 31 | num_chars_question = len(question_sentence) 32 | remaining_chars = max_size - num_chars_question 33 | 34 | context = document_text[:remaining_chars - 4] 35 | context = context + ' ...' 36 | return context 37 | 38 | 39 | def get_position_context( 40 | document: Dict, 41 | max_size: int = 4000, 42 | start_position: int = 0, 43 | proportion_before: float = 0.2, 44 | question: str = 'Qual?', 45 | use_sentence_id: bool = False, 46 | verbose: bool = False, 47 | ) -> Tuple[str, int]: 48 | """Returns the content around a specific position with size controlled by max_size. 49 | proportion_before indicates the proportion of max_size the must be taken before 50 | the position, while 1 - position_before is after. 
51 | """ 52 | document_text = document['text'] 53 | question_sentence = f'question: {question} context: ' 54 | num_chars_question = len(question_sentence) 55 | 56 | remaining_chars = max_size - num_chars_question 57 | start_reticences, end_reticences = False, False 58 | 59 | start = math.floor(remaining_chars * proportion_before) 60 | start = max(0, start_position - start) 61 | end = min(len(document_text), remaining_chars + start) 62 | 63 | if use_sentence_id: 64 | num_chars_each_sentence_id = len('[SENT1]') 65 | num_chars_sentence_id = (document_text[start: end].count('\n') + 1) * num_chars_each_sentence_id 66 | else: 67 | num_chars_sentence_id = 0 68 | size = end - start 69 | 70 | # remove chars if current size + sentence-ids chars exceed the remaining chars 71 | if size + num_chars_sentence_id > remaining_chars: 72 | to_remove = (size + num_chars_sentence_id) - remaining_chars 73 | 74 | # Chars are removed fractionally (20 times), in order to control the 75 | # expected size, avoiding exaggerated removal. In each iteration, as 76 | # long as the the window size is updated, the num_chars_sentence_id 77 | # is updated as well. 78 | to_remove_fractions = [to_remove // 20] * 20 + [to_remove % 20] 79 | 80 | for to_remove in to_remove_fractions: 81 | if start == start_position: 82 | end -= to_remove 83 | else: 84 | remove_before = math.floor(to_remove * proportion_before) 85 | remove_before = min(remove_before, start_position - start) 86 | remove_after = to_remove - remove_before 87 | start += remove_before 88 | end -= remove_after 89 | 90 | num_chars_sentence_id = (document_text[start: end].count('\n') + 1) * num_chars_each_sentence_id 91 | size = end - start 92 | 93 | # the size satifies remaining_tokens 94 | if size + num_chars_sentence_id <= remaining_chars: 95 | break 96 | 97 | # check if it requires reticences 98 | # if it does, try to find a space before/after the start_position 99 | if start != 0: 100 | start_reticences = True 101 | start = max(start, document_text.find(' ', start, start_position)) 102 | position_offset = start - 3 # reticences 103 | else: 104 | position_offset = start 105 | 106 | if end < len(document_text): 107 | end_reticences = True 108 | end = document_text.rfind(' ', start_position, end) 109 | 110 | if verbose: 111 | print('-- MUST CONTAIN: ' + document_text[start_position: start_position+30]) 112 | print(f'-- start: {start}, end: {end}') 113 | c = document_text[start:end] 114 | print(f'-- len (char): {len(c)}') 115 | print(f'-- context: {c} \n') 116 | 117 | context = ('...' if start_reticences else '') \ 118 | + document_text[start: end] \ 119 | + ('...' if end_reticences else '') 120 | 121 | if verbose: 122 | # it can exceed the expected num of chars because of reticences. 123 | print('--> testing the number of chars:') 124 | t5_input = question_sentence + context 125 | n = len(t5_input) 126 | print(f'>> The input occupies {n} chars. ' 127 | f'It will have additional {num_chars_sentence_id} for sentence-ids. ' 128 | f'Total: {n + num_chars_sentence_id}. Expected: {max_size}.') 129 | 130 | return context, position_offset 131 | 132 | 133 | def get_windows_context( 134 | document: Dict, 135 | max_size: int = 4000, 136 | window_overlap: float = 0.5, 137 | max_windows: int = 3, 138 | question: str = 'Qual?', 139 | use_sentence_id: bool = False, 140 | verbose: bool = False, 141 | ) -> Tuple[List[str], List[int]]: 142 | """Returns a list of window contents with size controlled by max_size, with 143 | overlapping near to 50%. 
144 | """ 145 | document_text = document['text'] 146 | 147 | assert max_windows != 0, ( 148 | 'Set max_windows higher than 0 to get a specific quantity of windows, ' 149 | 'or below to extract all possible ones.') 150 | 151 | contexts, offsets = [], [] 152 | 153 | start_position, position_offset = 0, 0 154 | context = '' 155 | # the offset + current context size surpassing document size means the 156 | # window reached the end of document 157 | while position_offset + len(context) < len(document_text): 158 | 159 | context, position_offset = get_position_context(document, max_size=max_size, 160 | start_position=start_position, proportion_before=0, question=question, 161 | use_sentence_id=use_sentence_id, verbose=verbose) 162 | 163 | contexts.append(context) 164 | offsets.append(position_offset) 165 | 166 | if verbose: 167 | print(f'>>>>>>>>>> WINDOW: start_position = {start_position}, offset = {position_offset}') 168 | 169 | start_position += int(len(context) * (1 - window_overlap)) 170 | 171 | if max_windows > 0 and len(contexts) == max_windows: break 172 | 173 | return contexts, offsets 174 | 175 | 176 | def get_token_context(document: Dict, 177 | tokenizer: Union[None, PreTrainedTokenizerBase] = None, 178 | max_tokens: int = 512, 179 | question: str = 'Qual?', 180 | use_sentence_id: bool = False, 181 | verbose: bool = False, 182 | ) -> Tuple[str, int]: 183 | """Returns the first max_tokens tokens of the document_text. 184 | """ 185 | context, position_offset = get_position_token_context(document, start_position=0, 186 | proportion_before=0, tokenizer=tokenizer, max_tokens=max_tokens, question=question, 187 | use_sentence_id=use_sentence_id, verbose=verbose) 188 | return context, position_offset 189 | 190 | 191 | def get_position_token_context( 192 | document: Dict, 193 | start_position: int = 0, 194 | proportion_before: float = 0.2, 195 | tokenizer: Union[None, PreTrainedTokenizerBase] = None, 196 | max_tokens: int = 512, 197 | tokens_and_offsets: Optional[List[Tuple[Any, int, int]]] = None, 198 | question: str = 'Qual?', 199 | use_sentence_id: bool = False, 200 | verbose: bool = False, 201 | ) -> Tuple[str, int]: 202 | """Returns the content around a specific position, with size controlled by max_tokens. 203 | proportion_before indicates the proportion of max_size the must be taken before the 204 | position, while 1 - position_before is after. 
205 | """ 206 | document_text = document['text'] 207 | question_sentence = f'question: {question} context: ' 208 | num_tokens_question = len(tokenizer.tokenize(question_sentence)) 209 | 210 | remaining_tokens = max_tokens - num_tokens_question 211 | start_reticences, end_reticences = False, False 212 | 213 | if tokens_and_offsets is None: 214 | tokens_and_offsets = get_tokens_and_offsets(text=document_text, tokenizer=tokenizer) 215 | positional_token_id = get_token_id_from_position(tokens_and_offsets=tokens_and_offsets, position=start_position) 216 | start_token_id = max(0, positional_token_id - math.floor(remaining_tokens * proportion_before)) 217 | end_token_id = min(positional_token_id + math.ceil(remaining_tokens * (1-proportion_before)), len(tokens_and_offsets)) 218 | 219 | start = tokens_and_offsets[start_token_id][1] 220 | end = tokens_and_offsets[end_token_id-1][2] 221 | 222 | if use_sentence_id: 223 | num_tokens_each_sentence_id = len(tokenizer.tokenize('[SENT10]')) 224 | num_tokens_sentence_id = (document_text[start: end].count('\n') + 1) * num_tokens_each_sentence_id 225 | else: 226 | num_tokens_sentence_id = 0 227 | size = end_token_id - start_token_id 228 | 229 | # remove tokens if current size + sentence-ids tokens exceed the remaining tokens 230 | if size + num_tokens_sentence_id > remaining_tokens: 231 | to_remove = (size + num_tokens_sentence_id) - remaining_tokens 232 | 233 | # Tokens are removed fractionally (20 times), in order to control the 234 | # expected size, avoiding exaggerated removal. In each iteration, as 235 | # long as the the window size is updated, the num_tokens_sentence_id 236 | # is updated as well. 237 | to_remove_fractions = [to_remove // 20] * 20 + [to_remove % 20] 238 | 239 | for to_remove in to_remove_fractions: 240 | if start == start_position: 241 | end_token_id -= to_remove 242 | else: 243 | remove_before = math.floor(to_remove * proportion_before) 244 | remove_before = min(remove_before, positional_token_id - start_token_id) 245 | remove_after = to_remove - remove_before 246 | start_token_id += remove_before 247 | end_token_id -= remove_after 248 | 249 | start = tokens_and_offsets[start_token_id][1] 250 | end = tokens_and_offsets[end_token_id-1][2] 251 | 252 | num_tokens_sentence_id = (document_text[start: end].count('\n') + 1) * num_tokens_each_sentence_id 253 | size = end_token_id - start_token_id 254 | 255 | # the size satifies remaining_tokens 256 | if size + num_tokens_sentence_id <= remaining_tokens: 257 | break 258 | 259 | # check if it requires reticences 260 | # if it does, try to find a space before/after the start_position 261 | if start != 0: 262 | start_reticences = True 263 | start = max(start, document_text.find(' ', start, start_position)) 264 | position_offset = start - 3 # reticences 265 | else: 266 | position_offset = tokens_and_offsets[start_token_id][1] 267 | 268 | if end < len(document_text): 269 | end_reticences = True 270 | end = document_text.rfind(' ', start_position, end) 271 | 272 | if verbose: 273 | print('-- MUST CONTAIN: ' + document_text[start_position: start_position+30]) 274 | print(f'-- start: {start}, end: {end}') 275 | c = document_text[start: end] 276 | print(f'-- len (char): {len(c)}') 277 | print(f'-- len (toks): {end_token_id - start_token_id}') 278 | print(f'-- context: {c} \n') 279 | 280 | context = ('...' if start_reticences else '') \ 281 | + document_text[start: end] \ 282 | + ('...' 
if end_reticences else '') 283 | 284 | if verbose: 285 | # it can exceed the expected num of tokens because of reticences. 286 | print('--> testing the number of tokens:') 287 | t5_input = question_sentence + context 288 | n = len(tokenizer.tokenize(t5_input)) 289 | print(f'>> The input occupies {n} tokens. ' 290 | f'It will have additional {num_tokens_sentence_id} for sentence-ids. ' 291 | f'Total: {n + num_tokens_sentence_id}. Expected: {max_tokens}.') 292 | 293 | return context, position_offset 294 | 295 | 296 | def get_windows_token_context( 297 | document: Dict, 298 | window_overlap: float = 0.5, 299 | max_windows: int = 3, 300 | tokenizer: Union[None, PreTrainedTokenizerBase] = None, 301 | max_tokens: int = 512, 302 | question: str = 'Qual?', 303 | use_sentence_id: bool = False, 304 | verbose: bool = False, 305 | ) -> Tuple[List[str], List[int]]: 306 | """Returns a list of window contents with size controlled by max_tokens, with 307 | overlapping near to 50%. 308 | """ 309 | document_text = document['text'] 310 | 311 | assert max_windows != 0, ( 312 | 'Set max_windows higher than 0 to get a specific quantity of windows, ' 313 | 'or below to extract all possible ones.') 314 | 315 | contexts, offsets = [], [] 316 | tokens_and_offsets = get_tokens_and_offsets(text=document_text, tokenizer=tokenizer) 317 | 318 | assert len(document_text) == tokens_and_offsets[-1][2], ( 319 | f'The original document ({document["uuid"]}) and the end of last token are not matching: {len(document_text)} != {tokens_and_offsets[-1][2]}') 320 | 321 | start_position, position_offset = 0, 0 322 | context = '' 323 | # the offset + current context size surpassing document size means the 324 | # window reached the end of document 325 | while position_offset + len(context) < len(document_text): 326 | 327 | context, position_offset = get_position_token_context(document, start_position=start_position, 328 | proportion_before=0, tokenizer=tokenizer, max_tokens=max_tokens, tokens_and_offsets=tokens_and_offsets, 329 | question=question, use_sentence_id=use_sentence_id, verbose=verbose) 330 | 331 | contexts.append(context) 332 | offsets.append(position_offset) 333 | 334 | if verbose: 335 | print(f'>>>>>>>>>> WINDOW: start_position = {start_position}, offset = {position_offset}') 336 | 337 | start_position += int(len(context) * (1 - window_overlap)) 338 | 339 | if max_windows > 0 and len(contexts) == max_windows: break 340 | 341 | return contexts, offsets 342 | 343 | 344 | def get_context( 345 | document: Dict, 346 | context_content: str = 'windows_token', 347 | max_size: int = 4000, 348 | start_position: int = 0, 349 | proportion_before: float = 0.2, 350 | return_position_offset: bool = False, 351 | use_sentence_id: bool = False, 352 | tokenizer: Union[None, PreTrainedTokenizerBase] = None, 353 | max_tokens: int = 512, 354 | question: str = 'Qual?', 355 | window_overlap: float = 0.5, 356 | max_windows: int = 3, 357 | verbose: bool = False, 358 | ) -> Union[str, List[str], Tuple[Union[str, List[str]], Union[int, List[int]]]]: 359 | """Returns the context to use in T5 input based on context_content. 360 | 361 | Args: 362 | document: dict with all the information of current document. 363 | context_content: type of context (max_size, position, token, 364 | position_token or windows_token). 365 | - max_size: gets the first max_size characters. 366 | - position: gets a window text limited to max_size characters 367 | around a start_position, respecting a proportion before and after 368 | the position. 
369 | - windows: gets a list of sliding windows of max_size, comprising 370 | the complete document. 371 | - token: gets the first max_tokens tokens. 372 | - position_token: gets a window text limited to max_tokens tokens 373 | around a start_position, respecting a proportion before and after 374 | the position, and penalizing tokens that will be occupied by 375 | question and sentence-ids in the T5 input. 376 | - windows_token: gets a list of sliding windows of max_tokens, 377 | comprising the complete document. 378 | max_size: maximum size of context, in chars (used for max_size and 379 | position). 380 | start_position: char index of a keyword in the original document text 381 | (used for position and position_token). 382 | proportion_before: proportion of the maximum context size (max_size or 383 | max_tokens) that must be before start_position (used for position, 384 | position_token and the variants). 385 | return_position_offset: if True, returns the position of the returned 386 | context with respect to the original document text (used for position, 387 | position_token and the variants). 388 | tokenizer: AutoTokenizer used in the model (used for position_token and 389 | windows_token). 390 | max_tokens: maximum size of context, in tokens (used for position_token 391 | and windows_token). 392 | question: question that will be used along with the context in the T5 393 | input (used for position_token and windows_token). 394 | window_overlap: overlapping between windows (used for windows and 395 | windows_token). 396 | max_windows: the maximum number of windows to generate; use -1 to get 397 | all the possible windows (used for windows and windows_token). 398 | verbose: visualize the processing, tests, and resulting contexts. 399 | 400 | Returns: 401 | - the context. 402 | - the position_offset (optional). 403 | """ 404 | position_offset = 0 405 | 406 | # remove repeated line breaks, repeated spaces/tabs, spaces/tabs before 407 | # line breaks, and line breaks at the start/end of the document text to make the token 408 | # positions match the char positions. These rules avoid incorrect alignments.
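# --- Editor's note (illustrative, not part of the original file) ----------
# A minimal sketch of what the normalization below does, using the same
# regexes as the next lines:
#   >>> import re
#   >>> text = '  Rua  X\t\n\n  Av.  Y  '
#   >>> text = text.replace('\t', ' ')
#   >>> text = re.sub(r'\s*\n+\s*', r'\n', text)
#   >>> text = re.sub(r'(\s)\1+', r'\1', text)
#   >>> text.strip()
#   'Rua X\nAv. Y'
# Tabs become spaces, blank-line runs collapse to a single '\n', and
# repeated whitespace collapses, so char offsets stay aligned with tokens.
# ---------------------------------------------------------------------------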
409 | document['text'] = document['text'].replace('\t', ' ') # '\t' 410 | document['text'] = re.sub(r'\s*\n+\s*', r'\n', document['text']) # space (0 or more) + '\n' (1 or more) + space (0 or more) 411 | document['text'] = re.sub(r'(\s)\1+', r'\1', document['text']) # space (1 or more) 412 | # special characters that causes raw and tokinization texts to desagree 413 | document['text'] = document['text'].replace('´', '\'') # 0 char --> 1 char in tokenization (common in publicacoes) 414 | document['text'] = document['text'].replace('™', 'TM') # 1 char --> 2 chars in tokenization 415 | document['text'] = document['text'].replace('…', '...') # 1 char --> 3 chars in tokenization 416 | document['text'] = document['text'].strip() 417 | 418 | if context_content == 'max_size': 419 | context = get_max_size_context(document, max_size=max_size, question=question) 420 | elif context_content == 'position': 421 | context, position_offset = get_position_context(document, max_size=max_size, 422 | start_position=start_position, proportion_before=proportion_before, 423 | question=question, use_sentence_id=use_sentence_id, verbose=verbose) 424 | elif context_content == 'windows': 425 | context, position_offset = get_windows_context(document, max_size=max_size, 426 | window_overlap=window_overlap, max_windows=max_windows, 427 | question=question, use_sentence_id=use_sentence_id, verbose=verbose) 428 | elif context_content == 'token': 429 | context, position_offset = get_token_context(document, 430 | tokenizer=tokenizer, max_tokens=max_tokens, question=question, 431 | use_sentence_id=use_sentence_id, verbose=verbose) 432 | elif context_content == 'position_token': 433 | context, position_offset = get_position_token_context(document, start_position=start_position, 434 | proportion_before=proportion_before, tokenizer=tokenizer, max_tokens=max_tokens, 435 | question=question, use_sentence_id=use_sentence_id, verbose=verbose) 436 | elif context_content == 'windows_token': 437 | context, position_offset = get_windows_token_context(document, 438 | window_overlap=window_overlap, max_windows=max_windows, tokenizer=tokenizer, 439 | max_tokens=max_tokens, question=question, use_sentence_id=use_sentence_id, verbose=verbose) 440 | else: 441 | return '', position_offset 442 | 443 | if verbose: 444 | if isinstance(context, list): 445 | for (i, cont) in enumerate(context): 446 | print(f'--------\nWINDOW {i}\n--------') 447 | print(f'len: {len(cont)} context: {cont} \n') 448 | else: 449 | print(f'len: {len(context)} context: {context} \n') 450 | 451 | if return_position_offset: 452 | return context, position_offset 453 | else: 454 | return context 455 | 456 | 457 | def main(): 458 | document = {} 459 | document['uuid'] = '1234567' 460 | document['text'] = "Que tal fazer uma poc inicial para vermos a viabilidade e identificarmos as dificuldades?\nA motivação da escolha desse problema " \ 461 | "foi que boa parte dos atos de matrícula passam de 512 tokens, e ainda não temos uma solução definida para fazer treinamento e predições em " \ 462 | "janelas usando o QA.\nEssa limitação dificulta o uso de QA para problemas que não sabemos onde a informação está no documento (por enquanto, " \ 463 | "só aplicamos QA em tarefas que sabemos que a resposta está nos primeiros 512 tokens da matrícula).\nComo esse problema de identificar a proporção " \ 464 | "de cada pessoa são duas tarefas (identificação + relação com uma pessoa), podemos usar a localização da pessoa no texto para selecionar apenas " \ 465 | "uma pedaço do ato de alienação 
pra passar como contexto pro modelo, evitando um pouco essa limitação dos 512 tokens." 466 | document['text'] = "PREFEITURA DE CAUCAIA\nSECRETARIA DE FINAN\u00c7AS,PLANEJAMENTO E OR\u00c7AMENTO\nCERTID\u00c3O NEGATIVA DE TRIBUTOS ECON\u00d4MICOS\nLA SULATE\nN\u00ba 2020000982\nRaz\u00e3o Social\nCOMPASS MINERALS AMERICA DO SUL INDUSTRIA E COMERC\nINSCRI\u00c7\u00c3O ECON\u00d4MICA Documento\nBairro\n00002048159\nC.N.P.J.: 60398138001860\nSITIO SALGADO\nLocalizado ROD CE 422 KM 17, S/N - SALA SUPERIOR 01 CXP - CAUCAIA-CE\nCEP\n61600970\nDADOS DO CONTRIBUINTE OU RESPONS\u00c1VEL\nInscri\u00e7\u00e3o Contribuinte / Nome\n169907 - COMPASS MINERALS AMERICA DO SUL INDUSTRIA E COMERC\nEndere\u00e7o\nROD CE 422 KM 17, S/N SALA SUPERIOR 01 CXP\nDocumento\nC.N.P.J.: 60.398.138/0018-60\nSITIO SALGADO CAUCAIA-CE CEP: 61600970\nNo. Requerimento\n2020000982/2020\nNatureza jur\u00eddica\nPessoa Juridica\nCERTID\u00c3O\nCertificamos para os devidos fins, que revendo os registros dos cadastros da d\u00edvida ativa e de\ninadimplentes desta Secretaria, constata-se - at\u00e9 a presente data \u2013 n\u00e3o existirem em nome do (a)\nrequerente, nenhuma pend\u00eancia relativa a tributos municipais.\nSECRETARIA DE FINAN\u00c7AS, PLANEJAMENTO E OR\u00c7AMENTO se reserva o direito de inscrever e cobrar as\nd\u00edvidas que posteriormente venham a ser apurados. Para Constar, foi lavrada a presente Certid\u00e3o.\nA aceita\u00e7\u00e3o desta certid\u00e3o est\u00e1 condicionada a verifica\u00e7\u00e3o de sua autenticidade na internet, nos\nseguinte endere\u00e7o: http://sefin.caucaia.ce.gov.br/\nCAUCAIA-CE, 03 DE AGOSTO DE 2020\nEsta certid\u00e3o \u00e9 v\u00e1lida por 090 dias contados da data de emiss\u00e3o\nVALIDA AT\u00c9: 31/10/2020\nCOD. VALIDA\u00c7\u00c3O 2020000982" 467 | # document['text'] = "DESAANZ\nJUCESP - Junta Comercial do Estado de S\u00e3o Paulo\nMinist\u00e9rio do Desenvolvimento, Ind\u00fastria e Com\u00e9rcio Exterios\nSamas\nSECRETAR\u00cdA DE DESENVOLVIMENTO\ndo Com\u00e9rcio - DNRC\nECONOMICO, CI\u00caNCIA,\nn\u00f4mico, Ci\u00eancia e Tecnologia\nTECNOLOGIA E INOVA\u00c7\u00c3O\nBestellen\nCERTIFICO O REGISTROFLAVIA REAT BRITTO\nSOB O N\u00daMERO SECRETARIA IGERAL EM EXERC\nda Reguerimento:\n5461/15-7 A LEHET BEDS\nSEQ. DOC.\n15 JAN. 2015\n1\nJUCESP\nHU\nSIP\nJUCESP PROTOCOLO\n0.024.119/15-5\n1\nJunta Comba\nEstado de S\u00e3o Paulo\n14\nJUNTA CON\nNubia Cristina da Silva Cembull\nAssessora T\u00e9cnica do Registro Publico\nR.G.: 36.431.427-3\nDADOS CADASTRAIS\n13 Hd\nCODIGO DE BARRAS (NIRE)\nCNPJ DA SEDE\nNIRE DA SEDE\n3522550861-2\n13.896.623/0001-36\nSEM EXIG\u00caNCIA ANTERIOR\nPROIE\nATO(S)\nAltera\u00e7\u00e3o de Endere\u00e7o; Altera\u00e7\u00e3o de Nome Empresarial; Consolida\u00e7\u00e3o da\nNOME EMPRESARIAL\nRF MOTOR'S Com\u00e9rcio de ve\u00edculos Ltda. 
- ME\n!\nLOGRADOURO\nAvenida Regente Feij\u00f3\nN\u00daMERO\n277\n:\nCEP\nCOMPLEMENTO\nBAIRRO/DISTRITO\nVila Regente Feij\u00f3\nC\u00d3DIGO DO MUNICIPIO\n5433\n03342-000\nUF\nMUNICIPIO\nS\u00e3o Paulo\nSP\nTELEFONE\nCORREIO ELETR\u00d4NICO\nIN, OAB\nU.F.\nNOME DO ADVOGADO\nVALORES RECOLHIDOS IDENTIFICA\u00c7\u00c3O DO REPRESENTANTE DA EMPRESA\nDARE 54,00\nNOME:\nBruno Vinicius Ferreira (S\u00f3cio )\nDARF 21,00\nASSINATURA:\nDATA ASSINATURA:\n12/01/2015\nB\nDECLARO, SOB AS PENAS DA LEI, QUE AS INFORMA\u00c7\u00d5ES CONSTANTES DO REQUERIMENTO/PROCESSO S\u00c3O EXPRESS\u00c3O DA VERDADE.\nControle Internet\n\u0421.\n015755122-9\n12/1/2015 10:19:14 - P\u00e1gina 1 de 2\n\n\n1ERCIAL\npy\nOLO\nINSTRUMENTO PARTICULAR DE ALTERA\u00c7\u00c3O\nCONTRATUAL DE SOCIEDADE EMPRES\u00c1RIA DE FORMA\nLIMITADA:\nREAVEL FERREIRA COM\u00c9RCIO DE VE\u00cdCULOS LTDA. ME\nCNPJ 13.896.623/0001-36\nPelo presente instrumento particular de altera\u00e7\u00e3o\ndo contrato social, os abaixo qualificados e ao final assinados:\nBruno Vinicius Ferreira, brasileiro, solteiro, nascido em 26/10/1985,\nempres\u00e1rio, portador da c\u00e9dula de identidade RG sob n\u00ba. 42.318.703-X/SSP-SP, inscrito no CPF/MF sob n\u00ba. 340.446.998-44, residente e\ndomiciliado no Estado de S\u00e3o Paulo, \u00e0 Rua Altina Penna Botto, 16 -\nCasa 02 - Vila Ivone - CEP 03375-001;\nDiogo Gabriel Ferreira, brasileiro, solteiro, nascido em 08/10/1988,\nempres\u00e1rio, portador da c\u00e9dula de identidade RG sob n\u00ba. 44.476.866-X/SSP-SP, inscrito no CPF/MF sob n\u00ba. 359.085.288-70, residente e\ndomiciliado no Estado de S\u00e3o Paulo, \u00e0 Rua Altina Penna Botto, 16 -\nCasa 02 - Vila Ivone - CEP 03375-001;\n\u00danicos s\u00f3cios da sociedade empres\u00e1ria de forma limitada que gira\na denomina\u00e7\u00e3o social de REAVEL FERREIRA\nCom\u00e9rcio de Ve\u00edculos Ltda. ME, inscrita no CNPJ/MF sob n\u00ba.\n13.896.623/0001-36, com estabelecimento e sede \u00e0 Rua Acuru\u00ed, 508 -\nVila Formosa S\u00e3o Paulo CEP 03355-000 S.P., cujos atos\nconstitutivos encontram-se registrados e arquivados na Junta\nComercial do Estado de S\u00e3o Paulo, com NIRE sob n\u00ba 35.2.25508612,\nem sess\u00e3o de 22 de Junho de 2011, t\u00eam, entre si justos e contratados\npromovem a altera\u00e7\u00e3o contratual e consequente consolida\u00e7\u00e3o da\nempresa que obedecera as clausulas e condi\u00e7\u00f5es adiante descritas:\nB\n\n\nVistoContenido\nRG: 36.430.427-3\nAltera\u00e7\u00e3o Contratual\nCl\u00e1usula 1a:- Altera-se a raz\u00e3o social da empresa que passa a ser\nRF MOTOR'S Com\u00e9rcio de Ve\u00edculos Ltda. - ME com denomina\u00e7\u00e3o\nde fantasia RF MOTOR'S;\nCl\u00e1usula 2a:- Altera-se o endere\u00e7o da sociedade que passa a ser \u00e0\nAv. Regente Feij\u00f3, 277 - Vila Regente Feij\u00f3 - S\u00e3o Paulo CEP\n03342-000 - S.P.;\nCl\u00e1usula 3a:- Face \u00e0s altera\u00e7\u00f5es\nos s\u00f3cios deliberam a\nCONSOLIDA\u00c7\u00c3O CONTRATUAL, conforme segue:\nCONTRATUAL\nCl\u00e1usula 1:- A sociedade girar\u00e1 sob a denomina\u00e7\u00e3o social de RF\nMOTOR'S Com\u00e9rcio Veiculos Ltda. - ME com denomina\u00e7\u00e3o de\nfantasia RF MOTOR'S, e ter\u00e1 a sua sede \u00e0 Av. 
Regente Feij\u00f3, 277 -\nVila Regente Feij\u00f3 - S\u00e3o Paulo - CEP 03342-000 - S.P.;\nCl\u00e1usula 2: sociedade tem por fim e objetivo na forma da\nlegisla\u00e7\u00e3o\nCom\u00e9rcio a varejo de autom\u00f3veis, camionetas e utilit\u00e1rios novos;\nCom\u00e9rcio por atacado de autom\u00f3veis, camionetas e utilit\u00e1rios\nnovos e usados;\nCom\u00e9rcio a varejo de autom\u00f3veis, camionetas e utilit\u00e1rios usados;\nCom\u00e9rcio por atacado de motocicletas e motonetas;\nCom\u00e9rcio a varejo de motocicletas e motone novas;\nCl\u00e1usula 3:- A sociedade teve in\u00edcio em 22 de Junho de 2011 e ter\u00e1\ndura\u00e7\u00e3o por tempo indeterminado;\n8.\n\n\n300 m\nVi\u015fte\nCl\u00e1usula 4:- 0 capital social \u00e9 de R$ 10.000,00 (Dez Mil Reais)\ntotalmente subscrito e integralizado em moeda corrente nacional,\nrepresentado por 10.000 (dez mil) cotas no valor unit\u00e1rio de R$ 1,00\n(Hum Real) cada, assim distribu\u00eddo:-1. Bruno Vinicius Ferreira, 9.900 (nove mil e novecentas) cotas de\nvalor unit\u00e1rio de R$ 1,00 (Hum Real), totalizando R$ 9.900,00 (Nove\nMil e Novecentos Reais), totalmente subscritas e integralizadas em\nmoeda corrente nacional, neste ato;\n2. Diogo Gabriel Ferreira, 100 cotas de valor unit\u00e1rio de R$\n1,00 (Hum Real), totalizando R$ 100,00 (Cem Reais), totalmente\nsubscritas e integralizadas em moeda corrente nacional, neste ato;\nCl\u00e1usula 5:- A responsabilidade dos s\u00f3cios \u00e9 restrita ao valor de suas\ncotas, mas todos\ndo Capital\npela integraliza\u00e7\u00e3o\ndeliberam que a administra\u00e7\u00e3o da sociedade,\nbem como sua representa\u00e7\u00e3o ativa e passiva, judicial ou extrajudicial,\nser\u00e1 exercida pelo s\u00f3cio Bruno Vinicius Ferreira individual e\nisoladamente. Inclusive todos os documentos legais e banc\u00e1rios, que\npoder\u00e1 constituir procuradores com tais poderes.\nPar\u00e1grafo Primeiro:- Os s\u00f3cios ter\u00e3o direito, a uma retirada mensal\na t\u00edtulo de Pr\u00f3-Labore e poder\u00e3o efetuar a distribui\u00e7\u00e3o de lucro, desde\nque, fixado em comum acordo no in\u00edcio de cada exerc\u00edcio.\nPar\u00e1grafo Segundo:- Os s\u00f3cios far\u00e3o uso da firma, podendo assinar\nseparadamente, ficando-lhes vedado, entretanto, o uso da firma em\nneg\u00f3cios alheios aos do objetivo social; e t\u00edtulos de responsabilidade\nsocial de esp\u00e9cie alguma, tais como avais, endossos, fian\u00e7as, etc.\n\n\nVisto\nConitor\n.RG36.430.427-3\nPar\u00e1grafo Terceiro:- A onera\u00e7\u00e3o ou venda de bens im\u00f3veis depende\nda expressa anu\u00eancia de s\u00f3cios que representem pelo menos 75%\n(setenta e cinco por cento) das quotas com direito a voto, respondendo\nos administradores solidariamente perante a sociedade e os terceiros\nprejudicados, por culpa no desempenho de suas fun\u00e7\u00f5es , de acordo\ncom o disposto no art. 1016, da Lei n. 10.406 de 10 de janeiro de\n2002.\nPar\u00e1grafo Quarto:- Depender\u00e1 tamb\u00e9m de expressa anu\u00eancia dos\ns\u00f3cios, conforme o disposto da Lei n. 
10.406 de 10 de janeiro de\n2002, ficando assim solidariamente respons\u00e1vel civil e criminalmente\no s\u00f3cio que infringir o presente artigo:-a) Alienar, onerar ou de qualquer forma dispor de t\u00edtulos imobili\u00e1rios,\nbem como cotas ou a\u00e7\u00f5es de que a sociedade seja titular no capital de\noutras empresas;\nb) Fixar remunera\u00e7\u00e3o dos adminis\nistradores e assessores, sem v\u00ednculo\nempregat\u00edcio, a eles subordinados.\nCl\u00e1usula 7 :- Faculta-se a qualquer dos s\u00f3cios, retirar-se da sociedade\ndesde que o fa\u00e7a mediante aviso pr\u00e9vio de sua resolu\u00e7\u00e3o ao outro\ns\u00f3cio, observado o direito de prefer\u00eancia, com anteced\u00eancia m\u00ednima\nde pelo menos 6 (Seis) meses. Seus haveres lhes ser\u00e3o pagos em 12\n(Doze) meses corrigidos pelo IGPM e o primeiro vencimento \u00e0 partir\nde 60 (sessenta) dias da data do Balan\u00e7o Especial.\nCl\u00e1usula 8:- Os lucros e perdas apurados regularmente em balan\u00e7o\nanual que se realizar\u00e1 no dia 31 de Dezembro de cada ano, ser\u00e3o\ndivididos proporcionalmente ao capital social de cada um dos s\u00f3cios,\nem eventual preju\u00edzo os s\u00f3cios poder\u00e3o optar pelo aumento de capital\npara saldar tais preju\u00edzos.\n\n\nVisto\n15\nConfezidb\nRG: 15.530427-3\nCl\u00e1usula 9:- Em caso de falecimento de qualquer um dos s\u00f3cios, na\nvig\u00eancia do presente contrato, n\u00e3o importa na extin\u00e7\u00e3o da sociedade e\nseus neg\u00f3cios, cabendo ao s\u00f3cio remanescente a apura\u00e7\u00e3o dos haveres\ndo s\u00f3cio ausente segundo balan\u00e7o especial na data do \u00f3bito e, ser\u00e3o\npagos aos herdeiros do falecido em 12 (Doze) presta\u00e7\u00f5es mensais\ncorrigidos pelo IGPM, sendo vedado aos herdeiros poss\u00edvel ingresso\nna sociedade.\nCl\u00e1usula 10\u00b0:- Nenhum dos s\u00f3cios, pessoalmente ou por interposta\npessoa, poder\u00e1 participar ou colaborar a qualquer t\u00edtulo em outra\npessoa jur\u00eddica, que tenha por qualquer forma atividade an\u00e1loga ou\nconcorrente \u00e0 da sociedade, sem expressa anu\u00eancia dos demais.\nCl\u00e1usula 11:- Os administradores declaram, sob as penas da Lei, de\nque n\u00e3o est\u00e3o impedidos de exercerem a administra\u00e7\u00e3o da sociedade,\npor lei especial, ou em virtude de condena\u00e7\u00e3o criminal, ou por se\nencontrarem sob os efeitos dela, a pena que vede, ainda que\ntemporariamente, o acesso a cargos p\u00fablicos; ou por crime falimentar,\nde prevarica\u00e7\u00e3o, peita ou subomo, concuss\u00e3o, peculato, ou contra a\neconomia popular, contra o sistema financeiro nacional, contra\nnormas de defesa da concorr\u00eancia, contra as rela\u00e7\u00f5es de consumo, f\u00e9\np\u00fablica, ou a\n. (art. 1.011, \u00a71\u00baCC/2002).\nCl\u00e1usula 12 :- Para os casos omissos neste contrato, os mesmos ser\u00e3o\nregidos pelas disposi\u00e7\u00f5es legais vigentes atinentes \u00e0 mat\u00e9ria, em\nespecial a Lei n. 
10.406, de 10 de janeiro de 2002.\nCl\u00e1usula 134:- Os s\u00f3cios elegem o foro Central da Comarca da\nCapital, no Estado de S\u00e3o Paulo, para as eventuais quest\u00f5es que\npossam advir.\nB\n\n\n..\n..\nVisto\nConletico\nE, assim, por estarem em tudo justos e contratados, as partes\nassinam o presente instrumento em 03 (tr\u00eas) vias de igual teor e valor\npara um s\u00f3 efeito, tudo, ante duas testemunhas a tudo presentes que\ntamb\u00e9m assinam, devendo em seguida ser encaminhado para registro\ne arquivamento junto a JU ESP - Junta Comercial do Estado de S\u00e3o\nPaulo.\nS\u00e3o Paulo, 12 de Janeiro de 2015.\nGolul tenuina\nBruno Vinicius Ferreira\nDiogo Gabriel Ferreira\nTestemunhas:\nRicardo\nHellon Austina da s Santos\nRicardo Silva Bezerra\nRG n\u00ba 29.074.987-6/SSP-SP\nCPF n\u00ba 213.108.838-82\nHellen Cristina da Silva Santos\nRG n\u00b037965378-3/SSP-SP\nCPF n\u00ba 405.216.528-47\nDO\n15 JAN. 2015\nwww.\nSASA\nSECRETARIA DE DESENVOLVIMENTO\n01/ECON\u00d3MICO, CI\u00caNCIA,\nTECNOLOGIA E INOVA\u00c7\u00c3O\nom\nCERTIFICO O REGISTRO FLAVTA REOTTA eri to\nSOB O NUMERO SECRET\u00c1RIA GERAL EM EXERCICIO\n5.461/15-7 tena MRITH FUIT\n...www.si\n\n\nDocumento B\u00e1sico de Entrada\nPage 1 of 1\n...\nREP\u00daBLICA FERERATIVA DO BRASIL\nCADASTRO NACIONAL.JA PESSOA JUR\u00cdDICA - CNPJ\nPROTOCOLO DE TRANSMISS\u00c3O DA FCP JUOVI30\nA an\u00e1lise e o deferimento deste documento ser\u00e3o efetuados pelo seguinte \u00f3rg\u00e3o:\n\u2022 Junta Comercial do Estado de S\u00e3o Paulo\nC\u00d3DIGO DE ACESSO\nSP.63.05.42.31 - 13.896.623.000.136\n01. IDENTIFICA\u00c7\u00c3O\nNOME EMPRESARIAL (firma ou denomina\u00e7\u00e3o)\nN\u00b0 DE INSCRI\u00c7\u00c3O NO CNPJ\nRF MOTORS COMERCIO DE VEICULOS LTDA.\n13.896.623/0001-36\n02. MOTIVO DO PREENCHIMENTO\nRELA\u00c7\u00c3O DOS EVENTOS SOLICITADOS / DATA DO EVENTO\n203 Exclus\u00e3o do t\u00edtulo do estabelecimento (nome de fantasia) - 12/01/2015\n211 Altera\u00e7\u00e3o de endere\u00e7o dentro do mesmo munic\u00edpio - 12/01/2015\n220 Altera\u00e7\u00e3o do nome empresarial (firma ou denomina\u00e7\u00e3o) - 12/01/2\n03. IDENTIFICA\u00c7\u00c3O DO REPRESENTANTE DA PESSOA JUR\u00cdDICA\nNOME\nBRUNO VINICIUS FERREIRA\nCPF\n340.446.998-44\nILOCAL\nDATA\n12/01/2015\n04. C\u00d3DIGO DE CONTROLE DO CERTIFICADO DIGITAL\nEste documento foi assinado com uso de senha da Sefaz SP\nAprovado pela Instru\u00e7\u00e3o Normativa RFB n\u00ba 1.183, de 19 de agosto de 2011\n12/01/2015\nhttp://www.receita fazenda.gov.br/pessoajuridica/cnpj/fcpj/dbe.asp\n\n\nES\nSP\nGOVERNO DO ESTADO DE S\u00c3O BAULO\nSECRETARIA DE DESENVOLVIMENTO ECONOMICO, CIENCIA E TECNOLOGIA\nJUNTA COMERCIAL DO ESTADO.DE S\u00c3O PAULO: JUCES...\nJUCESP\nAnta Comercial do\nEstado de Sio Pub\nDECLARA\u00c7\u00c3O\n,\nEu, Bruno Vinicius Ferreira, portador da C\u00e9dula de Identidade n\u00ba 42318703-X, inscrito no\nCadastro de Pessoas F\u00edsicas - CPF sob n\u00ba 340.446.998-44, na qualidade de titular, s\u00f3cio ou\nrespons\u00e1vel legal da empresa RF MOTOR'S Com\u00e9rcio de ve\u00edculos Ltda. 
- ME, DECLARO\nestar ciente que o ESTABELECIMENTO situado no(a) Avenida Regente Feij\u00f3, 277 Vila\nRegente Feij\u00f3, S\u00e3o Paulo, S\u00e3o Paulo, CEP 03342-000, N\u00c3O PODER\u00c1 EXERCER suas\natividades sem que obtenha o parecer municipal sobre a viabilidade de sua instala\u00e7\u00e3o e\nfuncionamento no local indicado, conforme diretrizes estabelecidas na legisla\u00e7\u00e3o de uso e\nocupa\u00e7\u00e3o do solo, posturas municipais e restri\u00e7\u00f5es das \u00e1reas de prote\u00e7\u00e3o ambiental, nos\ntermos do art. 24, $2 do Decreto Estadual n\u00ba 55.660/2010 e sem que tenha um CERTIFICADO\nDE LICENCIAMENTO INTEGRADO V\u00c1LIDO, obtido pelo sistema Via R\u00e1pida Empresa\nM\u00f3dulo de Licenciamento Estadual.\nDeclaro ainda estar ciente que qualquer altera\u00e7\u00e3o no endere\u00e7o do estabelecimento, em sua\natividade ou grupo de atividades, ou em qualquer outra das condi\u00e7\u00f5es determinantes \u00e0\nexpedi\u00e7\u00e3o do Certificado de Licenciamento Integrado, implica na perda de sua validade,\nassumindo, desde o momento da altera\u00e7\u00e3o, a obriga\u00e7\u00e3o de renov\u00e1-lo.\nPor fim, declaro estar ciente que a emiss\u00e3o do Certificado de Licenciamento Integrado poder\u00e1\nser solicitada por representante legal devidamente habilitado, presencialmente e no ato da\nretirada das certid\u00f5es relativas ao registro empresarial na Prefeitura, ou pelo titular, s\u00f3cio, ou\ncontabilista vinculado no Cadastro Nacional da Pessoa Jur\u00eddica (CNPJ) diretamente no site da\nJucesp, atrav\u00e9s do m\u00f3dulo de licenciamento, mediante uso da respectiva certifica\u00e7\u00e3o digital.\nBruno Vinicius Ferreira\nRG: 42318703-X\nRF MOTOR'S Com\u00e9rcio de ve\u00edculos Ltda. - ME" 468 | 469 | context_content = 'position_token' 470 | context_content = 'windows_token' 471 | use_sentence_id = True 472 | window_overlap = 0.5 473 | max_windows = 3 474 | 475 | start_position = 158 476 | max_size = 200 477 | 478 | #tokenizer = AutoTokenizer.from_pretrained('models/', do_lower_case=False) 479 | tokenizer = AutoTokenizer.from_pretrained('unicamp-dl/ptt5-base-portuguese-vocab', do_lower_case=False) 480 | max_tokens = 150 481 | question = 'Qual o tipo, a classe, o órgão emissor, a localização e a abrangência?' 
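# --- Editor's note (illustrative, not part of the original file) ----------
# With context_content='windows_token', the get_context call below returns
# a list of up to max_windows window strings plus a list of their character
# offsets into document['text'] (the first offset is 0); with
# context_content='position_token' it returns a single string and a single
# int. Each window is then meant to be wrapped into a T5 input roughly as
#   >>> t5_input = f'question: {question} context: {context[0]}'
# ---------------------------------------------------------------------------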
482 | 483 | context, offset = get_context( 484 | document, 485 | context_content=context_content, 486 | max_size=max_size, 487 | start_position=start_position, 488 | proportion_before=0.2, 489 | return_position_offset=True, 490 | use_sentence_id=use_sentence_id, 491 | tokenizer=tokenizer, 492 | max_tokens=max_tokens, 493 | max_windows=max_windows, 494 | question=question, 495 | window_overlap=window_overlap, 496 | verbose=True) 497 | 498 | print('--> testing the offset:') 499 | if isinstance(context, list): 500 | context, offset = context[-1], offset[-1] # last window 501 | print('>>>>>>>>>> using the offset\n' + document['text'][offset:offset + len(context)]) 502 | print('>>>>>>>>>> returned context\n' + context) 503 | 504 | 505 | if __name__ == "__main__": 506 | main() 507 | -------------------------------------------------------------------------------- /information_extraction_t5/features/highlights.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, Tuple, Union, Dict 2 | from collections import OrderedDict 3 | 4 | from fuzzysearch import find_near_matches 5 | from fuzzywuzzy import process 6 | 7 | from information_extraction_t5.features.sentences import ( 8 | check_sent_id_is_valid, 9 | T5_RAW_CONTEXT, 10 | split_context_into_sentences, 11 | ) 12 | 13 | estados = { 14 | 'AC': 'Acre', 15 | 'AL': 'Alagoas', 16 | 'AP': 'Amapá', 17 | 'AM': 'Amazonas', 18 | 'BA': 'Bahia', 19 | 'CE': 'Ceará', 20 | 'DF': 'Distrito Federal', 21 | 'ES': 'Espírito Santo', 22 | 'GO': 'Goiás', 23 | 'MA': 'Maranhão', 24 | 'MT': 'Mato Grosso', 25 | 'MS': 'Mato Grosso do Sul', 26 | 'MG': 'Minas Gerais', 27 | 'PA': 'Pará', 28 | 'PB': 'Paraíba', 29 | 'PR': 'Paraná', 30 | 'PE': 'Pernambuco', 31 | 'PI': 'Piauí', 32 | 'RJ': 'Rio de Janeiro', 33 | 'RN': 'Rio Grande do Norte', 34 | 'RS': 'Rio Grande do Sul', 35 | 'RO': 'Rondônia', 36 | 'RR': 'Roraima', 37 | 'SC': 'Santa Catarina', 38 | 'SP': 'São Paulo', 39 | 'SE': 'Sergipe', 40 | 'TO': 'Tocantins' 41 | } 42 | 43 | area = { 44 | 'metro_quadrado': ['m²', 'm2', 'metros quadrados'], 45 | 'hectare': ['has', 'hectares'], 46 | 'alq_paulista': ['alqueires paulistas', 'alqueires'] 47 | } 48 | 49 | 50 | def include_variations(query): 51 | """Given a canonical format, include possible variations of how the 52 | information can appear in the text. 53 | """ 54 | if query in estados.keys(): 55 | return [estados[query]] 56 | if query in area.keys(): 57 | return area[query] 58 | return [] 59 | 60 | 61 | def find_sentence_of_sent_id(context: T5_RAW_CONTEXT, sent_id: int) -> str: 62 | """Returns the sentence of number `sent_id` in the context. 63 | 64 | This method assumes the sentence ids are defined by linebreaks and start at 65 | 1. 66 | 67 | Args: 68 | context: question raw context 69 | sent_id: Index of the sentence. Must be greater or equal to 0. 70 | """ 71 | assert sent_id >= 0, ( 72 | f'SENT id must be greater or equal to 0. Received: {sent_id}') 73 | 74 | sentences = split_context_into_sentences(context) 75 | 76 | return sentences[sent_id - 1] 77 | 78 | 79 | def find_indexes_of_sentence( 80 | context: T5_RAW_CONTEXT, sent_id: int 81 | ) -> Union[Tuple[int, int], Tuple[None, None]]: 82 | """Returns character indexes for the start and end of a sentence in the 83 | context. 84 | 85 | This method assumes the sentence ids are defined by linebreaks and start at 86 | 1. 
87 | """ 88 | sentence = find_sentence_of_sent_id(context, sent_id) 89 | # get the start_char and end_char of the sentence 90 | start_char = context.find(sentence) 91 | end_char = context.find('\n', start_char) 92 | 93 | return start_char, end_char 94 | 95 | 96 | def get_levenshtein_dist( 97 | query_string, 98 | levenshtein_dist_dict: Optional[Dict[int, int]] = None 99 | ) -> int: 100 | """Returns a maximum levenshtein distance based on query string length.""" 101 | if levenshtein_dist_dict is None: 102 | levenshtein_dist_dict=OrderedDict({3: 0, 10: 1, 20: 3, 30: 5}) 103 | for str_size, dist in levenshtein_dist_dict.items(): 104 | if len(query_string) < str_size: 105 | return dist 106 | return list(levenshtein_dist_dict.values())[-1] 107 | 108 | 109 | def fuzzy_extract( 110 | query_string: str, large_string: str, score_cutoff: int = 30, 111 | max_levenshtein_dist: Union[int, Dict[int, int]] = -1, 112 | verbose: bool = False 113 | ) -> Union[Tuple[int, int], Tuple[None, None]]: 114 | """Fuzzy matches query string (and its variations) on a large string. 115 | 116 | Args: 117 | query_string: substring to be searched inside another string. 118 | large_string: the string to be searched on. 119 | score_cutoff: fuzzy matches with a score below this one will be ignored. 120 | max_levenshtein_dist: if a Dict, then changes the maximum levenshtein 121 | distance of matches (value) based on `query_string` length (key). 122 | Otherwise, an int should be supplied for a fixed maximum distance. 123 | verbose: When True, prints debug messages to stdout. 124 | 125 | Returns: 126 | Indexes of the start and end characters of the best match. If nothing 127 | is found, returns (None, None) instead. 128 | """ 129 | if max_levenshtein_dist == -1: 130 | OrderedDict({3: 0, 10: 1, 20: 3, 30: 5}) 131 | query_strings = include_variations(query_string) + [query_string] 132 | matches = [] 133 | starts = [] 134 | ends = [] 135 | scores = [] 136 | large_string = large_string.lower() 137 | 138 | for query_string in query_strings: 139 | query_string = query_string.lower() 140 | if verbose: 141 | print(f'query: {query_string}') 142 | 143 | # set dynamic Levenshtein distance 144 | if isinstance(max_levenshtein_dist, dict): 145 | max_l_dist_query = get_levenshtein_dist(query_string, max_levenshtein_dist) 146 | else: 147 | max_l_dist_query = max_levenshtein_dist 148 | 149 | all_matches = process.extractBests(query_string, (large_string,), 150 | score_cutoff=score_cutoff) 151 | for large, _ in all_matches: 152 | if verbose: 153 | print('word::: {}'.format(large)) 154 | for match in find_near_matches(query_string, large, 155 | max_l_dist=max_l_dist_query): 156 | matched = match.matched 157 | start = match.start 158 | end = match.end 159 | score = match.dist 160 | 161 | if verbose: 162 | print(f"match: {matched}\tindex: {start}\tscore: {score}") 163 | 164 | matches.append(matched) 165 | starts.append(start) 166 | ends.append(end) 167 | scores.append(score) 168 | 169 | if len(matches) == 0: 170 | return None, None 171 | 172 | best_id = scores.index(min(scores)) 173 | 174 | return starts[best_id], ends[best_id] 175 | 176 | 177 | def get_answer_highlight( 178 | answer: str, sent_id: int, context: T5_RAW_CONTEXT, 179 | sentence_expansion: int = 0, verbose: bool = False 180 | ) -> Union[Tuple[int, int, str], Tuple[None, None, None]]: 181 | r"""Given a single answer and its SENT ID, returns highlights of its 182 | location within the context. 183 | 184 | Sometimes the answer has line breaks in the middle of it (ex.: São\nPaulo). 
185 | To find the highlight even on these cases, this optionally expands the 186 | highlight window some sentences beyond the original ID. 187 | 188 | Args: 189 | answer: the answer to search on the context. 190 | sent_id: ID of the sentence the answer is in (or starts at). 191 | context: the question raw context. 192 | sentence_expansion: When this is 0, looks for the answer only in the 193 | sentence pointed by SENT ID. If the value is a `N` greater than 0, 194 | then looks for it on the `N` sequences that come after SENT ID, 195 | i.e. the interval `[SENT ID, ..., SENT ID + N]`. 196 | verbose: If True, enables debug prints. 197 | 198 | Examples: 199 | >>> answer = 'Rua Albert Einstein' 200 | >>> sent_id = 3 201 | >>> context = "Campinas\n\nRua 4lbert \nE1nstein 1000" 202 | >>> get_answer_highlight(answer, sent_id, context, sentence_expansion=2) 203 | fuzzy ==> answer: Rua Albert Einstein, sentence: "Rua 4lbert E1nstein 1000" 204 | (10, 30, 'Rua 4lbert \nE1nstein') 205 | """ 206 | sentence = find_sentence_of_sent_id(context, sent_id) 207 | 208 | expanded_sentence = [sentence] 209 | for i in range(1, sentence_expansion + 1): 210 | is_valid = check_sent_id_is_valid(context, sent_id + i) 211 | if not is_valid: 212 | break 213 | 214 | extra_sentence = find_sentence_of_sent_id(context, sent_id + i) 215 | expanded_sentence.append(extra_sentence) 216 | sentence = ' '.join(expanded_sentence) 217 | 218 | if verbose: 219 | print(f'fuzzy ==> answer: {answer}, sentence: "{sentence}"') 220 | 221 | shift, _ = find_indexes_of_sentence(context, sent_id) 222 | start_char, end_char = fuzzy_extract(answer, sentence) 223 | 224 | if start_char is None or end_char is None: 225 | highlight = None 226 | 227 | else: 228 | start_char += shift 229 | end_char += shift 230 | highlight = context[start_char:end_char] 231 | 232 | return start_char, end_char, highlight 233 | -------------------------------------------------------------------------------- /information_extraction_t5/features/postprocess.py: -------------------------------------------------------------------------------- 1 | """Utility methods to post-process model output.""" 2 | from typing import Dict, List, Tuple 3 | 4 | import numpy as np 5 | import pandas as pd 6 | 7 | from information_extraction_t5.features.sentences import ( 8 | T5_SENTENCE, 9 | find_ids_of_sent_tokens, 10 | deconstruct_answer, 11 | get_raw_answer_from_subsentence, 12 | get_subanswer_from_subsentence 13 | ) 14 | 15 | 16 | def group_qas(document_or_example_ids: List[str], group_by_typenames=True) -> Dict[str, List[int]]: 17 | """Groups the sentences according to qa-ids of the examples or documents. 18 | 19 | Args: 20 | sentences: List of qa-ids (strings) 21 | 22 | Returns: 23 | Dict with qa-ids (document-type + type-name) as keys and list of indexes 24 | of grouped sentences as values. 25 | """ 26 | qid_dict = {} 27 | for idx, document_or_example_id in enumerate(document_or_example_ids): 28 | # When grouping by example_ids (pattern document_class.typename), add only the project 29 | # (document class), such as matriculas, certidoes, etc. Ex.: qid_dict['matriculas'] = [0, 1, 2]. 30 | # Must include only for original answers, excluding the ones related to dismembered sub-answers. 
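# --- Editor's note (illustrative trace, not part of the original file) -----
# Assuming group_by_typenames=True and chunked qa-ids suffixed with '_<int>',
# the loop below is expected to yield, e.g.:
#   >>> group_qas(['matriculas.endereco_1', 'matriculas.endereco_2'])
#   {'matriculas': [0, 1], 'matriculas.endereco_1': [0],
#    'matriculas.endereco': [0, 1], 'matriculas.endereco_2': [1]}
# i.e. one key per document class, one per full qa-id, and one per qa-id
# with the chunk suffix removed.
# ---------------------------------------------------------------------------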
31 | if group_by_typenames and '~' not in document_or_example_id: # '~' appears in sub-answers of originally compound answers 32 | proj = document_or_example_id.split('.')[0] 33 | if proj in qid_dict.keys(): 34 | qid_dict[proj].append(idx) 35 | else: 36 | qid_dict[proj] = [idx] 37 | 38 | if document_or_example_id in qid_dict.keys(): 39 | qid_dict[document_or_example_id].append(idx) 40 | else: 41 | qid_dict[document_or_example_id] = [idx] 42 | 43 | # multiple chunks are suffixed with _i. Here we group those cases removing the suffix. 44 | if group_by_typenames: 45 | comp = None 46 | try: 47 | document_or_example_id, comp = document_or_example_id.rsplit('~', 1) 48 | except: 49 | pass 50 | 51 | try: 52 | doc_ex_id, t = document_or_example_id.rsplit('_', 1) 53 | has_asterisk = t.endswith('*') 54 | if comp is None: 55 | if has_asterisk: 56 | t = t[:-1] 57 | t = int(t.strip()) # try to convert suffix to integer 58 | if comp is not None: 59 | doc_ex_id += '~' + comp 60 | elif has_asterisk: 61 | doc_ex_id += '*' 62 | 63 | if doc_ex_id in qid_dict.keys(): 64 | qid_dict[doc_ex_id].append(idx) 65 | else: 66 | qid_dict[doc_ex_id] = [idx] 67 | except: 68 | pass 69 | 70 | return qid_dict 71 | 72 | 73 | def split_compound_labels_and_predictions( 74 | labels: List[T5_SENTENCE], predictions: List[T5_SENTENCE], document_ids: List[str], 75 | example_ids: List[str], probs: List[float], window_ids: List[str], keep_original_compound: bool = True, 76 | keep_disjoint_compound: bool = True 77 | ) -> Tuple[List[T5_SENTENCE], List[T5_SENTENCE], List[str], List[str], 78 | List[float], List[int], List[int], List[str], List[int], Dict]: 79 | """Splits compound answers into individual subsentences (complete sub-answers) 80 | like \"[SENT1] [Estado]: SP [aparece no texto]: São Paulo\", extending the 81 | original label and prediction sets. 82 | 83 | This is useful in inference for getting individual metrics for each 84 | subsentence that composes a compound answer. 85 | 86 | For predictions, the function keeps only the first occurrence of the 87 | type-names that compose the labels. If the prediction has some type-name 88 | that is absent in the label, it is ignored. If the prediction has a 89 | sentence-id or raw-text but the label does not have them, those terms are 90 | considered part of the prediction, and will certainly result in a misprediction. 91 | 92 | If keep_original_compound, the function returns the indices of the original 93 | sentences, ignoring the ones that reference individual subsentences. This is 94 | useful for getting metrics only for the answers as returned by the model. 95 | The metrics will appear as 'ORIG' in the metrics file. 96 | 97 | If keep_disjoint_compound, the function returns, for each document class, 98 | (1) the indices of non-compound answers and (2) the indices of subsentences 99 | of compound answers (ignoring the compound answers). In both cases, the 100 | indices reference the sentences with sentence-ids and raw-text complements 101 | already filtered. This is useful for getting metrics for each document class 102 | that can be compared with other experiments that do not use compound qas 103 | and/or do not use sentence-ids and raw-text complements. The metrics will 104 | appear prefixed by 'DISJOINT_' in the metrics file.
105 | 106 | Examples: 107 | >>> labels = ['[SENT1] [Tipo de Logradouro]: Rua [SENT1] [Logradouro]: Abert Einstein'] 108 | >>> predictions = ['[SENT1] [Tipo de Logradouro]: Rua [SENT1] [Logradouro]: 41bert Ein5tein [SENT1] [Bairro]: Cidade Universitária'] 109 | >>> labels, predictions, document_ids, example_ids, probs, window_ids, sent_ids, raw_texts, _, _ = \ 110 | >>> split_compound_labels_and_predictions(labels, predictions, ['doc_1'], ['matriculas.endereco'], [0.98], ['1 1']) 111 | >>> print(labels) 112 | ['[SENT1] [tipo_de_logradouro]: Rua [SENT1] [fp_logradouro]: Abert Einstein', '[SENT1] [tipo_de_logradouro]: Rua', '[tipo_de_logradouro]: Rua', '[SENT1] [fp_logradouro]: Abert Einstein', '[fp_logradouro]: Abert Einstein'] 113 | >>> print(predictions) 114 | ['[SENT1] [tipo_de_logradouro]: Rua [SENT1] [fp_logradouro]: 41bert Ein5tein [SENT1] [fp_bairro]: Cidade Universitária', '[SENT1] [tipo_de_logradouro]: Rua', '[tipo_de_logradouro]: Rua', '[SENT1] [fp_logradouro]: 41bert Ein5tein', '[fp_logradouro]: 41bert Ein5tein'] 115 | >>> print(document_ids) 116 | ['doc_1', 'doc_1', 'doc_1', 'doc_1', 'doc_1'] 117 | >>> print(example_ids) 118 | ['matriculas.endereco', 'matriculas.endereco~tipo_de_logradouro', 'matriculas.endereco~tipo_de_logradouro*', 'matriculas.endereco~fp_logradouro', 'matriculas.endereco~fp_logradouro*'] 119 | >>> print(probs) 120 | [0.98, 0.0, 0.0, 0.0, 0.0] 121 | >>> print(window_ids) 122 | [[1, 1], [1], [1], [1], [1]] 123 | >>> print(sent_ids) 124 | [None, None, [1], None, [1]] 125 | >>> print(raw_texts) 126 | [None, None, None, None, None] 127 | 128 | Returns: 129 | labels, predictions, document_ids, example_ids, probs, window_ids, sent_ids, raw_texts 130 | """ 131 | labels_new, predictions_new = [], [] 132 | document_ids_new, example_ids_new, probs_new, window_ids_new, sent_ids, raw_texts = \ 133 | [], [], [], [], [], [] 134 | original_idx = [] 135 | disjoint_answer_idx_by_doc_class = {} 136 | 137 | for label, prediction, doc_id, ex_id, prob, window_id in zip( 138 | labels, predictions, document_ids, example_ids, probs, window_ids): 139 | window_id = [ int(w) for w in window_id.split(' ') ] 140 | label_subsentences, label_type_names = deconstruct_answer(label) 141 | prediction_subsentences, prediction_type_names = deconstruct_answer(prediction) 142 | 143 | # this is not compound answer, then get the original label/predicion pair 144 | if len(label_type_names) <= 1 or keep_original_compound: 145 | label = ' '.join(label_subsentences) 146 | prediction = ' '.join(prediction_subsentences) 147 | 148 | labels_new.append(label) 149 | predictions_new.append(prediction) 150 | document_ids_new.append(doc_id) 151 | example_ids_new.append(ex_id) 152 | probs_new.append(prob) 153 | window_ids_new.append(window_id) 154 | sent_ids.append(None) 155 | raw_texts.append(None) 156 | 157 | # indexes to compute the f1 and exact ONLY with original (non-splitted) answers 158 | if keep_original_compound: 159 | idx = len(labels_new) - 1 160 | original_idx.append(idx) 161 | 162 | if len(label_type_names) <= 1: 163 | # remove sent-id and raw-text complement, if the label has, 164 | # in order to get metric only for the response per se. 
165 | label_sa = get_subanswer_from_subsentence(label) 166 | pred_sa = get_subanswer_from_subsentence(prediction) 167 | 168 | raw_text = get_raw_answer_from_subsentence(prediction_subsentences[0]) 169 | sent_id = find_ids_of_sent_tokens(prediction_subsentences[0]) 170 | 171 | ex_id_ = ex_id + '*' 172 | 173 | labels_new.append(label_sa) 174 | predictions_new.append(pred_sa) 175 | document_ids_new.append(doc_id) 176 | example_ids_new.append(ex_id_) 177 | probs_new.append(prob) 178 | window_ids_new.append(window_id) 179 | sent_ids.append(sent_id) 180 | raw_texts.append(raw_text) 181 | 182 | # keep by-document-class the indices of non-compound answers. 183 | # The sent-id and raw-text complement are already filtered. 184 | if keep_disjoint_compound: 185 | idx = len(labels_new) - 1 186 | doc_class = ex_id.split('.')[0] 187 | if doc_class in disjoint_answer_idx_by_doc_class.keys(): 188 | disjoint_answer_idx_by_doc_class[doc_class].append(idx) 189 | else: 190 | disjoint_answer_idx_by_doc_class[doc_class] = [idx] 191 | 192 | if len(label_type_names) > 1: 193 | window_id = window_id[:1] # for compound qa, the window_id is repeated 194 | for label_ss, label_tn in zip(label_subsentences, label_type_names): 195 | 196 | try: 197 | # the same type-name was predicted, get the first occurrence 198 | pred_idx = prediction_type_names.index(label_tn) 199 | pred_ss = prediction_subsentences[pred_idx] 200 | except: 201 | # the same type-name was not predicted, use empty 202 | pred_ss = '' 203 | 204 | ex_id_ = ex_id + '~' + label_tn 205 | 206 | labels_new.append(label_ss) 207 | predictions_new.append(pred_ss) 208 | document_ids_new.append(doc_id) 209 | example_ids_new.append(ex_id_) 210 | probs_new.append(0.0) 211 | window_ids_new.append(window_id) 212 | sent_ids.append(None) 213 | raw_texts.append(None) 214 | 215 | # remove sent-id and raw-text complement, if the label has, 216 | # in order to get metric only for the response per se 217 | label_sa = get_subanswer_from_subsentence(label_ss) 218 | pred_sa = get_subanswer_from_subsentence(pred_ss) 219 | 220 | raw_text = get_raw_answer_from_subsentence(pred_ss) 221 | sent_id = find_ids_of_sent_tokens(pred_ss) 222 | 223 | ex_id_ = ex_id + '~' + label_tn + '*' 224 | 225 | labels_new.append(label_sa) 226 | predictions_new.append(pred_sa) 227 | document_ids_new.append(doc_id) 228 | example_ids_new.append(ex_id_) 229 | probs_new.append(0.0) 230 | window_ids_new.append(window_id) 231 | sent_ids.append(sent_id) 232 | raw_texts.append(raw_text) 233 | 234 | # keep by-document-class only indices of sub-responses for compound answers, 235 | # not the original compound answer. 236 | # The sent-id and raw-text complement are already filtered. 
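# Illustrative sketch (comment added here; the indices are hypothetical): at the end of
# the loop, this mapping might look like {'matriculas': [2, 4, 9]}, where each index
# points at a '*'-suffixed entry appended just above.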
237 | if keep_disjoint_compound: 238 | idx = len(labels_new) - 1 239 | doc_class = ex_id.split('.')[0] 240 | if doc_class in disjoint_answer_idx_by_doc_class.keys(): 241 | disjoint_answer_idx_by_doc_class[doc_class].append(idx) 242 | else: 243 | disjoint_answer_idx_by_doc_class[doc_class] = [idx] 244 | 245 | return labels_new, predictions_new, document_ids_new, example_ids_new, probs_new, \ 246 | window_ids_new, sent_ids, raw_texts, original_idx, disjoint_answer_idx_by_doc_class 247 | 248 | 249 | def get_highest_probability_window( 250 | labels: List[T5_SENTENCE], predictions: List[T5_SENTENCE], 251 | document_ids: List[str], example_ids: List[str], probs: List[float], 252 | use_fewer_NA: bool = False 253 | ) -> Tuple[List[T5_SENTENCE], List[T5_SENTENCE], List[str], List[str], List[float], List[str]]: 254 | """Get the highest-probability components for each pair document, example. 255 | """ 256 | if use_fewer_NA: 257 | na_cases = [ pred.count('N/A') for pred in predictions ] 258 | arr = np.vstack([np.array(labels), np.array(predictions), 259 | np.array(document_ids), np.array(example_ids), 260 | np.array(na_cases), np.array(probs, dtype=object)]).transpose() 261 | df1 = pd.DataFrame(arr, columns=['labels', 'predictions', 'document_ids', 'example_ids', 'na', 'probs']) 262 | else: 263 | arr = np.vstack([np.array(labels), np.array(predictions), 264 | np.array(document_ids), np.array(example_ids), 265 | np.array(probs, dtype=object)]).transpose() 266 | df1 = pd.DataFrame(arr, columns=['labels', 'predictions', 'document_ids', 'example_ids', 'probs']) 267 | 268 | # include windows-id to access which window got the highest probability. 269 | # In case of compound qa, the windows-id is replicated for each 270 | # prediction subsentence 271 | df1['window_ids'] = df1.groupby(['document_ids', 'example_ids']).cumcount().astype(str) 272 | df1['window_ids'] = df1.apply(lambda x: ' '.join([x['window_ids']] * len(deconstruct_answer(x['predictions'])[0]) ), axis=1) 273 | 274 | if use_fewer_NA: 275 | # get the highest-probability sample among cases with fewer number if N/As 276 | # for each pair document-id / example-id 277 | df1 = df1.sort_values(['na', 'probs'], ascending=[True, False]).groupby(['document_ids', 'example_ids']).head(1) 278 | df1.sort_index(inplace=True) 279 | 280 | labels, predictions, document_ids, example_ids, _, probs, window_ids = df1.T.values.tolist() 281 | else: 282 | # get the highest-probability sample for each pair document-id / example-id 283 | df1 = df1.sort_values('probs', ascending=False).groupby(['document_ids', 'example_ids']).head(1) 284 | df1.sort_index(inplace=True) 285 | 286 | labels, predictions, document_ids, example_ids, probs, window_ids = df1.T.values.tolist() 287 | 288 | return labels, predictions, document_ids, example_ids, probs, window_ids 289 | -------------------------------------------------------------------------------- /information_extraction_t5/features/preprocess.py: -------------------------------------------------------------------------------- 1 | """Utility methods to preprocess model input.""" 2 | from collections import Counter 3 | from typing import Dict, List, Optional, OrderedDict, Tuple, Union 4 | 5 | from information_extraction_t5.features.questions import ( 6 | COMPLEMENT, 7 | QUESTION, 8 | QUESTION_DICT, 9 | QUESTIONS as ALL_QUESTIONS, 10 | SUBQUESTION_DICT, 11 | ) 12 | from information_extraction_t5.features.questions.type_map import COMPLEMENT_TYPE 13 | from information_extraction_t5.features.sentences import SENT_TOKEN 14 | 15 | # Large 
number to not let the number of sentences be too large for a model.
16 | MAX_SENTENCES = 9999
17 | 
18 | 
19 | def _replace_brackets_with_parenthesis(text: str) -> str:
20 |     text = text.replace('{', '(')
21 |     text = text.replace('}', ')')
22 | 
23 |     return text
24 | 
25 | 
26 | def _replace_linebreak_with_token_patterns(
27 |     text: str, token_pattern: str = SENT_TOKEN
28 | ) -> Tuple[str, int]:
29 |     """Returns new string with `\n` replaced with the token pattern and the
30 |     number of tokens."""
31 |     num_tokens = text.count('\n')
32 |     text = text.replace('\n', token_pattern)
33 | 
34 |     return text, num_tokens
35 | 
36 | 
37 | def _replace_linebreaks_with_tokens(text: str) -> str:
38 |     r"""Replaces every `\n` in a string with a numbered SENT token.
39 | 
40 |     If the input string has brackets, they will be replaced with parentheses.
41 |     Always adds at least one SENT token at the beginning of the new sentence.
42 |     Tokens are numbered starting from 1.
43 | 
44 |     Args:
45 |         text: string to have `\n` replaced. It can't be split into more than
46 |             MAX_SENTENCES.
47 | 
48 |     Examples:
49 |         >>> sentence = 'Rua PEDRO BIAGI 462 Apartamento nº 103, 1º Andar do RESIDENCIAL IMPERIAL. Sertãozinho\nSP'
50 |         >>> new_sentence = _replace_linebreaks_with_tokens(sentence)
51 |         >>> print(new_sentence)
52 |         ' [SENT1] Rua PEDRO BIAGI 462 Apartamento nº 103, 1º Andar do RESIDENCIAL IMPERIAL. Sertãozinho [SENT2] SP'
53 | 
54 |     Returns:
55 |         New string with tokens instead of `\n`
56 |     """
57 |     # Should have at least one SENT token at start
58 |     text = '\n' + text
59 |     text = _replace_brackets_with_parenthesis(text)
60 |     text, num_tokens = _replace_linebreak_with_token_patterns(text)
61 | 
62 |     assert num_tokens <= MAX_SENTENCES, 'Maximum number of sentences violated.'
63 | 
64 |     # token numeration must start from 1
65 |     text = text.format(*range(1, num_tokens + 1))
66 | 
67 |     return text
68 | 
69 | 
70 | def _replace_linebreaks_with_spaces(text: str) -> str:
71 |     r"""Replaces every `\n` in a string with a space.
72 | 
73 |     Examples:
74 |         >>> sentence = 'Rua PEDRO BIAGI 462 Apartamento nº 103, 1º Andar do RESIDENCIAL IMPERIAL. Sertãozinho\nSP'
75 |         >>> new_sentence = _replace_linebreaks_with_spaces(sentence)
76 |         >>> print(new_sentence)
77 |         'Rua PEDRO BIAGI 462 Apartamento nº 103, 1º Andar do RESIDENCIAL IMPERIAL. Sertãozinho SP'
78 |     """
79 |     text = text.replace('\n', ' ')
80 | 
81 |     return text
82 | 
83 | 
84 | def _get_id_based_on_linebreaks(context: str, answer_position: int) -> int:
85 |     """Recovers the sentence-id assuming the context is always partitioned based
86 |     on occurrences of linebreaks.
87 | 
88 |     Args:
89 |         context: text context of the question
90 |         answer_position: index of last character from answer.
91 |     """
92 |     if answer_position == -1:
93 |         return 0
94 | 
95 |     sent_id = Counter(context[:answer_position])['\n'] + 1
96 | 
97 |     return sent_id
98 | 
99 | 
100 | def get_questions_for_chunk(
101 |     qa_id: str = 'matriculas.imovel.comarca', is_compound: bool = False,
102 |     return_dict: bool = False, all_questions: QUESTION_DICT = ALL_QUESTIONS
103 | ) -> Union[List[QUESTION], QUESTION_DICT, SUBQUESTION_DICT]:
104 |     """Returns a list of questions for a specific qa_id, or a dict mapping
105 |     typenames to questions for building compound answers. The function can
106 |     also return all the questions.
107 | 
108 |     Args:
109 |         qa_id: the id of a question-answer, generally represented by dot-separated
110 |             document class, chunks typenames, and possibly subchunks typenames.
111 | use 'all' to get a dictionary containing all the possible questions. 112 | is_compound: if qa_id represents a compound field. 113 | return_dict: if the function should return a dict that is useful to build 114 | compound answers. 115 | all_questions: Dictionary with all the questions and subquestions. 116 | 117 | Examples: 118 | >>> questions = {'question1': {'subquestion1': ['What?']}} 119 | >>> get_questions_for_chunk('all', all_questions=questions) 120 | {'question1': {'subquestion1': ['What?']}} 121 | >>> get_questions_for_chunk('matriculas.question1', all_questions=questions) 122 | {'subquestion1': ['What?']} 123 | 124 | Returns: 125 | List of all questions for a specific field, or dictionary with all the questions 126 | for a compound field. 127 | """ 128 | if qa_id == 'all': 129 | return all_questions 130 | 131 | typenames = qa_id.split('.') 132 | questions = all_questions 133 | for typename in typenames: 134 | questions = questions[typename] 135 | 136 | if is_compound: 137 | questions = questions['compound'] 138 | 139 | assert isinstance(questions, List) != return_dict, ( 140 | f'Shouldn\'t you set "is_compound=True" for the field {qa_id} to get a ' 141 | 'list of questions for a specific compound typename? Or set ' 142 | '"return_signature=True" to get the ordered dict of typenames to build ' 143 | 'a compound answer?') 144 | 145 | return questions 146 | 147 | 148 | def get_qa_ids_recursively( 149 | dict_or_list, base_qa_id, list_of_use_compound_question, 150 | list_of_compound_chunks_to_ignore, list_of_subchunks_to_skip, 151 | qa_ids_list=[] 152 | ) -> List[str]: 153 | """Auxiliar function to get recursively all the possible qa_ids""" 154 | 155 | if isinstance(dict_or_list, List) and not base_qa_id.endswith('compound'): 156 | qa_ids_list.append(base_qa_id) 157 | 158 | if isinstance(dict_or_list, Dict) or isinstance(dict_or_list, OrderedDict): 159 | if base_qa_id in list_of_use_compound_question: 160 | qa_ids_list.append(base_qa_id) 161 | 162 | elif base_qa_id not in list_of_compound_chunks_to_ignore: 163 | for typename, value in dict_or_list.items(): 164 | if typename not in list_of_subchunks_to_skip: 165 | qa_id = f'{base_qa_id}.{typename}' 166 | _ = get_qa_ids_recursively( 167 | value.copy(), qa_id, 168 | list_of_use_compound_question, 169 | list_of_compound_chunks_to_ignore, 170 | list_of_subchunks_to_skip, qa_ids_list) 171 | 172 | return qa_ids_list 173 | 174 | 175 | def get_all_qa_ids( 176 | document_class: Optional[str] = None, 177 | list_of_type_names: List[str] = [], 178 | list_of_use_compound_question: List[str] = [], 179 | list_of_subchunks_to_list: List[str] = [], 180 | list_subchunks_to_complement_siblings: List[str] = [], 181 | list_of_subchunks_to_skip: List[str] = [], 182 | all_questions: QUESTION_DICT = ALL_QUESTIONS 183 | ) -> List[str]: 184 | """Returns a list of all possible qa_ids that will be used to force 185 | the qa, even if the chunk does not exist. 186 | 187 | Args: 188 | document_class: class of documents to extract qa_ids. Use None to 189 | get for all possible document classes. 190 | list_of_typenames: list of type-names. 191 | list_of_use_compound_question: list of compound qa_ids. 192 | list_of_subchunks_to_list: list of listing qa_ids. 193 | list_subchunks_to_complement_sibling_questions: list of subchunks that 194 | will complement siblings, and does not require a qa. 195 | list_of_subchunks_to_skip: list of subchunks that will be skipped. 
196 | 197 | Examples: 198 | >>> typenames = ['matriculas.imovel', 'matriculas.endereco', 'certidoes.resultado'] 199 | >>> use_compound = ['matriculas.endereco'] 200 | >>> get_all_qa_ids('matriculas', typenames, use_compound) 201 | ['matriculas.imovel.no_da_matricula', 'matriculas.imovel.oficio', 'matriculas.imovel.comarca', 202 | 'matriculas.imovel.estado', 'matriculas.endereco'] 203 | 204 | Returns: 205 | List of all possible qa_ids. 206 | """ 207 | all_qa_ids = [] 208 | 209 | # ignore chunks for which one subchunk will complement siblings. 210 | # we cannot force qas, since the question depends on a information 211 | # that is possibly non-annotated. 212 | list_of_compound_chunks_to_ignore = [sc.rsplit('.', 1)[0] 213 | for sc in list_subchunks_to_complement_siblings] 214 | 215 | for doc_class, questions_dict in all_questions.items(): 216 | if document_class is not None and doc_class != document_class: continue 217 | 218 | for typename, list_or_dict in questions_dict.items(): 219 | qa_id = f'{doc_class}.{typename}' 220 | 221 | if qa_id in list_of_type_names: 222 | qa_ids = get_qa_ids_recursively(list_or_dict, qa_id, list_of_use_compound_question, 223 | list_of_compound_chunks_to_ignore, list_of_subchunks_to_skip, []) 224 | for qa_id in qa_ids: 225 | all_qa_ids.append(qa_id) 226 | 227 | # for listing qa_ids, keep only document-class and last subchunk 228 | # with the suffix "_list" 229 | for qa_id in list_of_subchunks_to_list: 230 | typenames = qa_id.split('.') 231 | if document_class is None or document_class == typenames[0]: 232 | qa_id = f'{typenames[0]}.{typenames[-1]}_list' 233 | all_qa_ids.append(qa_id) 234 | 235 | return all_qa_ids 236 | 237 | 238 | def complement_questions_to_require_rawdata( 239 | questions: Union[QUESTION, List[QUESTION]], complement: str = COMPLEMENT 240 | ) -> Union[QUESTION, List[QUESTION]]: 241 | """Add complementary text to a question or questions. 242 | 243 | This indicates to the model it must give a subanswer with part of the 244 | context's raw text. 245 | """ 246 | if isinstance(questions, str): # simple question 247 | questions = questions.replace('?', complement) 248 | if isinstance(questions, list): # list of questions 249 | questions = [q.replace('?', complement) for q in questions] 250 | return questions 251 | 252 | 253 | def generate_t5_input_sentence( 254 | context: str, question: str, use_sentence_id: bool 255 | ) -> str: 256 | """Returns a T5 input sentence based on a question and its context. 257 | 258 | Args: 259 | context: text context of the question 260 | question: the question 261 | use_sentence_id: if True, every newline on the context will be replaced 262 | by a SENT token. Otherwise they are replaced with spaces. 263 | """ 264 | if use_sentence_id: 265 | context = _replace_linebreaks_with_tokens(context) 266 | else: 267 | context = _replace_linebreaks_with_spaces(context) 268 | 269 | t5_sentence = f'question: {question} context: {context}' 270 | return t5_sentence 271 | 272 | 273 | def generate_t5_label_sentence( 274 | answer: str, answer_start: Union[List[int], int], context: str, 275 | use_sentence_id: bool 276 | ) -> str: 277 | """Returns a T5 label sentence for simple or compound answers. 278 | 279 | Args: 280 | answer: answer of the current questions 281 | answer_start: char position of answer starting 282 | context: text context of the question 283 | use_sentence_id: if True, every newline on the context will be replaced 284 | by a SENT token. Otherwise they are replaced with spaces. 
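
    Example:
        An illustrative sketch (the context and answers are hypothetical, not
        taken from the project's datasets): for a three-line context whose
        second line is "500,00" and whose third line is "metro quadrado", the
        simple answer "500,00" with answer_start at the first character of
        line 2 becomes "[SENT2] 500,00", while the compound answer
        "[Valor]: 500,00 [Unidade]: metro quadrado" with answer_start set to
        [start of line 2, start of line 3] becomes
        "[SENT2] [Valor]: 500,00 [SENT3] [Unidade]: metro quadrado".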
285 | """ 286 | if use_sentence_id: 287 | if isinstance(answer_start, list): 288 | # That is a compound_answer, like: "[Valor]: 500,00 [Unidade]: metro_quadrado" 289 | 290 | # Separate the compound answer in sub-answers: --, Valor] 500,00, Unidade] metro_quadrado 291 | # that could be problematic if some sub-answer has brackets, besides COMPLEMENT_TYPE 292 | sub_answers = answer.split('[')[1:] 293 | token_pattern = SENT_TOKEN.strip() 294 | 295 | # Extract sentence-ids for each sub-answer 296 | sent_ids = [] 297 | for sub_answer_start in answer_start: 298 | sent_ids.append(_get_id_based_on_linebreaks(context, 299 | sub_answer_start)) 300 | 301 | # Prepare the final answer with sentence-ids: "[SENTx] [Valor]: 500,00 [SENTy] [Unidade]: metro_quadrado" 302 | answer = '' 303 | for sub_answer in sub_answers: 304 | if sub_answer.startswith(COMPLEMENT_TYPE): 305 | answer = f'{answer}[{sub_answer}' 306 | else: 307 | answer = f'{answer}{token_pattern} [{sub_answer}' 308 | 309 | # Include the sentence-ids 310 | answer = answer.format(*sent_ids) 311 | elif isinstance(answer_start, int): 312 | # That is a simple answer 313 | 314 | sent_id = _get_id_based_on_linebreaks(context, answer_start) 315 | answer = f'[SENT{sent_id}] {answer}' 316 | else: 317 | # That is an occurrence of non-annotated data, as publicacoes (null in squad json) 318 | # [SENTX] is not included 319 | pass 320 | 321 | return answer 322 | -------------------------------------------------------------------------------- /information_extraction_t5/features/questions/__init__.py: -------------------------------------------------------------------------------- 1 | from .questions import * 2 | -------------------------------------------------------------------------------- /information_extraction_t5/features/questions/questions.py: -------------------------------------------------------------------------------- 1 | """Map of all questions, subquestions and its corresponding names. 2 | Compound type-names must be represented as a OrderedDict with 'compound' and 3 | subchunks' type-names as keys (even with empty lists) to keep a signature used 4 | to prepare the compound answers. 5 | 6 | A question name might have associated with it one of two things: 7 | 1. A list of questions. 8 | 2. A dictionary containing names of subquestions, which have their own list of 9 | subquestions. 10 | """ 11 | from collections import OrderedDict 12 | from typing import Dict, List, Union 13 | 14 | SUBQUESTION_NAME = str # Ex.: 'rua' 15 | SUBQUESTION = str # Ex.: 'Qual a rua?' 16 | QUESTION_NAME = str # Ex.: 'endereco' 17 | QUESTION = str # Ex.: 'Qual o endereco?' 18 | SUBQUESTION_DICT = Dict[SUBQUESTION_NAME, List[SUBQUESTION]] 19 | QUESTION_DICT = Dict[QUESTION_NAME, Union[SUBQUESTION_DICT, List[QUESTION]]] 20 | 21 | COMPLEMENT = ' e como aparece no texto?' # or 'and how does it appear in the text?' 
for EN 22 | 23 | _QUESTIONS_FORM = { 24 | 'etiqueta': [ 25 | 'Qual é o número da etiqueta?', 26 | ], 27 | 'agencia': [ 28 | 'Qual é o número da agência?', 29 | ], 30 | 'conta_corrente': [ 31 | 'Qual é o número da conta corrente?', 32 | ], 33 | 'cpf': [ 34 | 'Qual é o CPF/CNPJ?', 35 | 'Qual é o CPF do titular?', 36 | ], 37 | 'nome_completo': [ 38 | 'Qual é o nome?', 39 | 'Qual é o nome completo?', 40 | ], 41 | 'n_doc_serie': [ 42 | 'Qual é o número do documento ou número da série?', 43 | ], 44 | 'orgao_emissor': [ 45 | 'Qual é o órgão emissor?', 46 | ], 47 | 'doc_id_uf': [ 48 | 'Qual é o estado do documento de identificação?', 49 | 'Qual é a UF do documento de identificação?', 50 | ], 51 | 'data_emissao': [ 52 | 'Qual é a data de emissão?', 53 | ], 54 | 'data_nascimento': [ 55 | 'Qual é a data de nascimento?', 56 | ], 57 | 'nome_mae': [ 58 | 'Qual é o nome da mãe?', 59 | ], 60 | 'nome_pai': [ 61 | 'Qual é o nome do pai?', 62 | ], 63 | 'endereco': OrderedDict({ 64 | 'compound': [ 65 | 'Qual o endereço?', 66 | ], 67 | 'logradouro': [ 68 | 'Qual é o logradouro?', 69 | ], 70 | 'numero': [ 71 | 'Qual é o número?', 72 | ], 73 | 'complemento': [ 74 | 'Qual é o complemento?', 75 | ], 76 | 'bairro': [ 77 | 'Qual é o bairro?', 78 | ], 79 | 'cidade': [ 80 | 'Qual é a cidade?', 81 | ], 82 | 'estado': [ 83 | 'Qual é o estado?', 84 | ], 85 | 'cep': [ 86 | 'Qual é o CEP?', 87 | ] 88 | }), 89 | } 90 | 91 | # Include here other pairs (project, questions dict) for new datasets 92 | QUESTIONS = { 93 | 'form': _QUESTIONS_FORM, 94 | } 95 | -------------------------------------------------------------------------------- /information_extraction_t5/features/questions/type_map.py: -------------------------------------------------------------------------------- 1 | """Dictionaries that map type-names to types and vice-versa. The types are 2 | used as clues in brackets in T5 outputs. The type-names are recovered in 3 | post-processing stage. 4 | 5 | Each document class (project) has it own TYPENAME_TO_TYPE dictionary. We 6 | strongly recommend that the types used in all the projects be consistent, and 7 | as generic as possible. For example, using `CPF/CNPJ` for all CPFs and CNPJs, 8 | regardless of being a consultant, current account holder, business partner, 9 | land owner, etc. 10 | """ 11 | COMPLEMENT_TYPE = 'aparece no texto' # or 'appears in the text' for EN 12 | 13 | # Create a _NEWDATASET_TYPENAME_TO_TYPE for each new dataset, and 14 | # update the TYPENAME_TO_TYPE dict. 15 | 16 | _FORM_TYPENAME_TO_TYPE = { 17 | "etiqueta": "Etiqueta", 18 | "agencia": "Agência", 19 | "conta_corrente": "Conta Corrente", 20 | "cpf": "CPF/CNPJ", 21 | "nome_completo": "Nome", 22 | "n_doc_serie": "No do Documento", 23 | "orgao_emissor": "Órgão Emissor", 24 | "data_emissao": "Data de Emissão", 25 | "data_nascimento": "Data de Nascimento", 26 | "nome_mae": "Nome da Mãe", 27 | "nome_pai": "Nome do Pai", 28 | "endereco": "Endereço", 29 | "logradouro": "Logradouro", 30 | "numero": "Número", 31 | "complemento": "Complemento", 32 | "bairro": "Bairro", 33 | "cidade": "Cidade", 34 | "estado": "Estado", 35 | "cep": "CEP" 36 | } 37 | 38 | TYPENAME_TO_TYPE = { 39 | COMPLEMENT_TYPE: COMPLEMENT_TYPE, 40 | } 41 | TYPENAME_TO_TYPE.update(_FORM_TYPENAME_TO_TYPE) 42 | # TYPENAME_TO_TYPE.update(_NEWDATASET_TYPENAME_TO_TYPE) 43 | 44 | # This dict is used to recover the type-name by using the type. It is not 45 | # critical to recover exactly the original type-name (different typenames 46 | # can be mapped to the same type). 
Those type-names will be used in post
47 | # processing after splitting sentences.
48 | TYPE_TO_TYPENAME = {v: k for k, v in TYPENAME_TO_TYPE.items()}
49 | 
--------------------------------------------------------------------------------
/information_extraction_t5/features/sentences.py:
--------------------------------------------------------------------------------
1 | """Auxiliary methods for post-processing T5 input/output sentences."""
2 | import re
3 | from typing import List, Tuple, Union
4 | 
5 | from information_extraction_t5.features.questions.type_map import TYPE_TO_TYPENAME, COMPLEMENT_TYPE
6 | 
7 | SENTENCE_ID_PATTERN = r'\[SENT(.*?)\]'
8 | SUBANSWER_PATTERN = r'([^[\]]+)(?:$|\[)'
9 | TYPE_NAME_PATTERN = r'\[([A-Za-záàâãéèêíïóôõöúçñÁÀÂÃÉÈÍÏÓÔÕÖÚÇѺª_ \/]*?)\]'
10 | 
11 | SENT_TOKEN = ' [SENT{}] '
12 | T5_RAW_CONTEXT = str
13 | 
14 | # Type of a sentence that may have T5 identification tokens
15 | # Example: '[SENT1] [Comarca] Campinas'
16 | T5_SENTENCE = str
17 | 
18 | 
19 | def _has_text(string: str) -> bool:
20 |     """Returns True if a string has non-whitespace text."""
21 |     string_without_whitespace = string.strip()
22 |     return len(string_without_whitespace) > 0
23 | 
24 | 
25 | def _clean_sub_answer(sub_answer: str) -> str:
26 |     """Removes undesired characters from a sub answer.
27 | 
28 |     Removes any `:` and surrounding whitespace from the subanswer.
29 |     """
30 |     sub_answer = sub_answer.replace(':', '')
31 |     sub_answer = sub_answer.strip()
32 | 
33 |     return sub_answer
34 | 
35 | 
36 | def find_sub_answers(prediction_str: T5_SENTENCE) -> List[str]:
37 |     """Returns a list containing the sub answers of a T5 sentence in the order
38 |     they appear.
39 | 
40 |     Examples:
41 |         >>> sentence = '[SENT25] [Tipo de Logradouro]: Rua [SENT25] [Logradouro]: PEDRO BIAGI'
42 |         >>> sub_answers = find_sub_answers(sentence)
43 |         >>> print(sub_answers)
44 |         ['Rua', 'PEDRO BIAGI']
45 |     """
46 |     sub_answer_list = []
47 |     for sub_answer in re.findall(SUBANSWER_PATTERN, prediction_str):
48 |         if _has_text(sub_answer):
49 |             sub_answer = _clean_sub_answer(sub_answer)
50 |             sub_answer_list.append(sub_answer)
51 | 
52 |     return sub_answer_list
53 | 
54 | 
55 | def find_ids_of_sent_tokens(sentence: T5_SENTENCE) -> List[int]:
56 |     """Returns a list containing the IDs of the SENT tokens of a T5 sentence in
57 |     the order they appear.
58 | 
59 |     The ID is the number that follows a SENT token.
60 | 
61 |     Examples:
62 |         >>> sentence = '[SENT1] Campinas'
63 |         >>> ids = find_ids_of_sent_tokens(sentence)
64 |         >>> print(ids)
65 |         [1]
66 |     """
67 |     ids = []
68 |     for sentid in re.findall(SENTENCE_ID_PATTERN, sentence):
69 |         try:
70 |             ids.append(int(sentid))
71 |         except:
72 |             ids.append(sentid)
73 | 
74 |     return ids
75 | 
76 | 
77 | def _convert_name_from_t5_to_type_name(name: str) -> str:
78 |     """Converts a name output by T5 to a type name.
79 | 
80 |     When the model was trained, it learned to output display names from chunks.
81 |     This method replaces the display names with their type name version.
82 |     """
83 |     if name not in TYPE_TO_TYPENAME:
84 |         raise ValueError(f'Unknown type name: {name}')
85 | 
86 |     return TYPE_TO_TYPENAME[name]
87 | 
88 | 
89 | def find_type_names(sentence: T5_SENTENCE, map_type: bool = True) -> List[str]:
90 |     """Returns a list containing the names of the type tokens of a T5 sentence
91 |     in the order they appear.
92 | 
93 |     The name is the text that appears inside the type token.
94 | 
95 |     Examples:
96 |         >>> sentence = '[Logradouro] Campinas'
97 |         >>> type_names = find_type_names(sentence)
98 |         >>> print(type_names)
99 |         ['Logradouro']
100 |     """
101 |     type_names = re.findall(TYPE_NAME_PATTERN, sentence)
102 |     if map_type:
103 |         type_names = [
104 |             _convert_name_from_t5_to_type_name(name) for name in type_names
105 |         ]
106 | 
107 |     return type_names
108 | 
109 | 
110 | def split_context_into_sentences(
111 |     context: T5_RAW_CONTEXT
112 | ) -> List[str]:
113 |     """Splits a question context into multiple sentences.
114 | 
115 |     The splitting criterion is simply every linebreak found.
116 |     """
117 |     return context.split('\n')
118 | 
119 | 
120 | def split_t5_sentence_into_components(
121 |     sentence: T5_SENTENCE,
122 |     map_type: bool = True
123 | ) -> Tuple[List[int], List[str], List[str]]:
124 |     """Splits the string output by T5 into its components.
125 | 
126 |     If no occurrences of a component are found, returns an empty list for it.
127 | 
128 |     Components:
129 |         - sent ids: the ID that follows a SENT token.
130 |         - type names: the name inside an answer type token.
131 |         - sub answers: each answer fragment found.
132 |     Args:
133 |         sentence: a T5 output sentence.
134 | 
135 |     Examples:
136 |         >>> sentence = '[SENT25] [Tipo de Logradouro]: Rua [SENT25] [Logradouro]: PEDRO BIAGI [SENT26] [Número]: 462 [SENT25] [Cidade]: Sertãozinho [SENT0] [Estado]: SP'
137 |         >>> sent_ids, type_names, sub_answers = \
138 |         >>>     split_t5_sentence_into_components(sentence)
139 |         >>> print(sent_ids)
140 |         [25, 25, 26, 25, 0]
141 |         >>> print(type_names)
142 |         ['tipo_de_logradouro', 'logradouro', 'numero', 'cidade', 'estado']
143 |         >>> print(sub_answers)
144 |         ['Rua', 'PEDRO BIAGI', '462', 'Sertãozinho', 'SP']
145 | 
146 |     Returns:
147 |         Sentence ids, type names, answers/sub-answers
148 |     """
149 |     sent_ids = find_ids_of_sent_tokens(sentence)
150 |     type_names = find_type_names(sentence, map_type=map_type)
151 |     sub_answers = find_sub_answers(sentence)
152 | 
153 |     return sent_ids, type_names, sub_answers
154 | 
155 | 
156 | def check_sent_id_is_valid(
157 |     context: T5_RAW_CONTEXT, sent_id: int
158 | ) -> bool:
159 |     """Returns True if a SENT ID is valid.
160 | 
161 |     An ID is valid when it corresponds to the ID of a sentence or its ID is 0.
162 |     """
163 |     if sent_id < 0:
164 |         return False
165 | 
166 |     sentences = split_context_into_sentences(context)
167 | 
168 |     if len(sentences) < sent_id:
169 |         return False
170 | 
171 |     return True
172 | 
173 | 
174 | def deconstruct_answer(
175 |     answer_sentence: T5_SENTENCE = ''
176 | ) -> Tuple[List[T5_SENTENCE], List[str]]:
177 |     """Gets individual answer subsentences from the compound answer sentence.
178 | 
179 |     Args:
180 |         answer_sentence: a T5 output sentence.
181 | 182 | Examples: 183 | >>> sentence = '[SENT25] [Tipo de Logradouro]: Rua [SENT25] [Logradouro]: PEDRO BIAGI [SENT26] [Número]: 462 [SENT25] [Cidade]: Sertãozinho [SENT0] [Estado]: SP [aparece no texto] s paulo' 184 | >>> sub_sentences, type_names = deconstruct_answer(sentence) 185 | >>> print(sub_sentences) 186 | [ 187 | '[SENT25] [tipo_de_logradouro] Rua', 188 | '[SENT25] [logradouro] PEDRO BIAGI', 189 | '[SENT26] [numero] 462', 190 | '[SENT25] [cidade] Sertãozinho', 191 | '[SENT0] [estado] SP [aparece no texto] s paulo' 192 | ] 193 | >>> print(type_names) 194 | ['tipo_de_logradouro', 'logradouro', 'numero', 'cidade', 'estado'] 195 | 196 | Returns: 197 | sub-ansers and type-names 198 | """ 199 | sent_ids, type_names, sub_answers = split_t5_sentence_into_components(answer_sentence) 200 | sub_sentences = [] 201 | all_type_names = [] 202 | 203 | while len(sub_answers) > 0: 204 | sub_sentence = '' 205 | 206 | if len(sent_ids) > 0: 207 | sent_id = sent_ids.pop(0) 208 | sentence_token = SENT_TOKEN.format(sent_id).strip() 209 | sub_sentence += sentence_token + ' ' 210 | 211 | if len(type_names) > 0: 212 | type_name = type_names.pop(0) 213 | sub_sentence += f'[{type_name}]: ' 214 | all_type_names.append(type_name) 215 | 216 | sub_answer = sub_answers.pop(0) 217 | sub_sentence += f'{sub_answer} ' 218 | 219 | if len(type_names) > 0 and len(sub_answers) > 0 and type_names[0] == COMPLEMENT_TYPE: 220 | type_name = type_names.pop(0) 221 | sub_answer = sub_answers.pop(0) 222 | 223 | sub_sentence += f'[{type_name}] {sub_answer} ' 224 | 225 | sub_sentences.append(sub_sentence.strip()) 226 | 227 | return sub_sentences, all_type_names 228 | 229 | 230 | def get_subanswer_from_subsentence(subsentence: T5_SENTENCE) -> T5_SENTENCE: 231 | """Get only the sub-answer from the current subsentence. 232 | 233 | Args: 234 | subsentence: a T5 subsentence. 235 | 236 | Examples: 237 | >>> subsentence = [SENT1] [no_da_matricula] 88975 [aparece no texto] 88.975 238 | >>> subanswer = get_subanswer_from_subsentence(subsentence) 239 | >>> print(subanswer) 240 | [no_da_matricula]: 88975 241 | 242 | Returns: 243 | subanswer that corresponds to subsentence without SENT_TOKEN and COMPLEMENT_TYPE 244 | 245 | """ 246 | _, tn, ans = split_t5_sentence_into_components(subsentence, map_type=False) 247 | 248 | if len(ans) == 0: 249 | return '' 250 | 251 | if len(tn) == 0: 252 | subanswer = ans[0] 253 | else: 254 | subanswer = f'[{tn[0]}]: {ans[0]}' 255 | 256 | return subanswer 257 | 258 | 259 | def get_raw_answer_from_subsentence(subsentence: T5_SENTENCE) -> Union[str, None]: 260 | """Get only the raw-text answer from the current subsentence. 261 | 262 | Args: 263 | subsentence: a T5 subsentence. 264 | 265 | Examples: 266 | >>> subsentence = [SENT1] [no_da_matricula] 88975 [aparece no texto] 88.975 267 | >>> subanswer = get_raw_answer_from_subsentence(subsentence) 268 | >>> print(subanswer) 269 | 88.975 270 | 271 | Returns: 272 | subanswer that corresponds to subsentence without SENT_TOKEN and COMPLEMENT_TYPE 273 | 274 | """ 275 | try: 276 | return subsentence.split(f'[{COMPLEMENT_TYPE}]')[1].strip() 277 | except: 278 | return None 279 | 280 | 281 | def get_clean_answer_from_subanswer(subanswer: T5_SENTENCE) -> List[str]: 282 | """Get the final and pure answer from each sub-answer. 283 | 284 | Args: 285 | subanswer: subanswer extracted with function get_subanswer_from_subsentence. 
286 | 287 | Examples: 288 | >>> subanswer = '[no_da_matricula]: 88975' 289 | >>> answer_ = get_clean_answer_from_subanswer(subanswer) 290 | >>> print(answer) 291 | ['88975'] 292 | 293 | Returns: 294 | clean answers without the clues in square brackets 295 | """ 296 | try: 297 | return find_sub_answers(subanswer) 298 | except: 299 | return [''] 300 | -------------------------------------------------------------------------------- /information_extraction_t5/models/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /information_extraction_t5/models/qa_model.py: -------------------------------------------------------------------------------- 1 | """Model definition based on Pytorh-Lightning.""" 2 | import os 3 | import json 4 | import configargparse 5 | import numpy as np 6 | import pandas as pd 7 | 8 | import torch 9 | import pytorch_lightning as pl 10 | 11 | from deepspeed.ops.adam import DeepSpeedCPUAdam 12 | 13 | from transformers import ( 14 | AutoTokenizer, 15 | T5ForConditionalGeneration, 16 | T5Config, 17 | MT5ForConditionalGeneration, 18 | MT5Config 19 | ) 20 | 21 | from information_extraction_t5.features.postprocess import ( 22 | group_qas, 23 | get_highest_probability_window, 24 | split_compound_labels_and_predictions, 25 | ) 26 | from information_extraction_t5.features.sentences import ( 27 | get_clean_answer_from_subanswer 28 | ) 29 | from information_extraction_t5.utils.metrics import ( 30 | normalize_answer, 31 | t5_qa_evaluate, 32 | compute_exact, 33 | compute_f1 34 | ) 35 | from information_extraction_t5.utils.freeze import freeze_embeds 36 | 37 | class QAClassifier(torch.nn.Module): 38 | def __init__(self, hparams): 39 | super().__init__() 40 | self.hparams.update(vars(hparams)) 41 | 42 | if 'mt5' in self.hparams.config_name: 43 | config = MT5Config.from_pretrained( 44 | self.hparams.config_name if self.hparams.config_name else self.hparams.model_name_or_path, 45 | cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None, 46 | ) 47 | self.model = MT5ForConditionalGeneration.from_pretrained( 48 | self.hparams.model_name_or_path, 49 | from_tf=bool(".ckpt" in self.hparams.model_name_or_path), 50 | config=config, 51 | cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None, 52 | ) 53 | else: 54 | config = T5Config.from_pretrained( 55 | self.hparams.config_name if self.hparams.config_name else self.hparams.model_name_or_path, 56 | cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None, 57 | ) 58 | self.model = T5ForConditionalGeneration.from_pretrained( 59 | self.hparams.model_name_or_path, 60 | from_tf=bool(".ckpt" in self.hparams.model_name_or_path), 61 | config=config, 62 | cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None, 63 | ) 64 | self.tokenizer = AutoTokenizer.from_pretrained( 65 | self.hparams.tokenizer_name if self.hparams.tokenizer_name else self.hparams.model_name_or_path, 66 | do_lower_case=self.hparams.do_lower_case, 67 | use_fast=False, 68 | cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None, 69 | ) 70 | 71 | if 'byt5' in self.hparams.model_name_or_path.lower(): 72 | self.input_max_length = self.hparams.max_size # chars 73 | else: 74 | self.input_max_length = self.hparams.max_seq_length # tokens 75 | 76 | # use for faster training/larger batch size 77 | freeze_embeds(self.model) 78 | 79 | # filename for cache predictions 80 | self.cache_fname = os.path.join( 81 
| self.hparams.data_dir if self.hparams.data_dir else ".", 82 | "cached_predictions_{}.pkl".format( 83 | list(filter(None, self.hparams.model_name_or_path.split("/"))).pop() 84 | ) 85 | ) 86 | 87 | def forward(self, x): 88 | return self.model(x) 89 | 90 | class LitQA(QAClassifier, pl.LightningModule): 91 | 92 | def configure_optimizers(self): 93 | optimizer = self.get_optimizer() 94 | return optimizer 95 | 96 | def training_step(self, batch, batch_idx): 97 | sentences, labels = batch 98 | 99 | sentences_tokens = self.tokenizer.batch_encode_plus( 100 | sentences, padding=True, truncation=True, 101 | max_length=self.input_max_length, return_tensors='pt' 102 | ) 103 | labels = self.tokenizer.batch_encode_plus( 104 | labels, padding=True, truncation=True, 105 | max_length=self.input_max_length, return_tensors='pt' 106 | ) 107 | 108 | inputs = { 109 | "input_ids": sentences_tokens['input_ids'].to(self.device), 110 | "labels": labels['input_ids'].to(self.device), 111 | "attention_mask": sentences_tokens['attention_mask'].to(self.device), 112 | } 113 | 114 | outputs = self.model(**inputs) 115 | 116 | self.log('train_loss', outputs[0], on_step=True, on_epoch=True, 117 | prog_bar=True, batch_size=len(sentences) 118 | ) 119 | return {'loss': outputs[0]} 120 | 121 | def validation_step(self, batch, batch_idx): 122 | sentences, labels, _, _ = batch 123 | 124 | sentences_tokens = self.tokenizer.batch_encode_plus( 125 | sentences, padding=True, truncation=True, 126 | max_length=self.input_max_length, return_tensors='pt' 127 | ) 128 | 129 | inputs = { 130 | "input_ids": sentences_tokens['input_ids'].to(self.device), 131 | "attention_mask": sentences_tokens['attention_mask'].to(self.device), 132 | "max_length": self.hparams.max_length, 133 | } 134 | 135 | outputs = self.model.generate(**inputs) 136 | predictions = self.tokenizer.batch_decode(outputs, skip_special_tokens=True) 137 | 138 | return {'labels': labels, 'preds': predictions} 139 | 140 | def test_step(self, batch, batch_idx): 141 | sentences, labels, document_ids, typename_ids = batch 142 | 143 | # if we are using cached predictions, is not necessary to run steps again 144 | if self.hparams.use_cached_predictions and os.path.exists(self.cache_fname): 145 | return {'labels': [], 'preds': [], 'doc_ids': [], 'tn_ids': [], 'probs': []} 146 | 147 | sentences_tokens = self.tokenizer.batch_encode_plus( 148 | sentences, padding=True, truncation=True, 149 | max_length=self.input_max_length, return_tensors='pt' 150 | ) 151 | 152 | # This is handled differently then the others because of conflicts of 153 | # the previous approach with quantization. 
154 | inputs = { 155 | "input_ids": sentences_tokens['input_ids'].to(self.device).long(), 156 | "attention_mask": sentences_tokens['attention_mask'].to(self.device).long(), 157 | "max_length": self.hparams.max_length, 158 | "num_beams": self.hparams.num_beams, 159 | "early_stopping": True, 160 | } 161 | 162 | outputs = self.model.generate(**inputs) 163 | predictions = self.tokenizer.batch_decode(outputs, skip_special_tokens=True) 164 | 165 | # compute probs 166 | probs = self._compute_probs(sentences, predictions) 167 | 168 | return { 169 | 'labels': labels, 'preds': predictions, 'doc_ids': document_ids, 170 | 'tn_ids': typename_ids, 'probs': probs 171 | } 172 | 173 | def validation_epoch_end(self, outputs): 174 | predictions, labels = [], [] 175 | for output in outputs: 176 | for label, pred in zip(output['labels'], output['preds']): 177 | predictions.append(pred) 178 | labels.append(label) 179 | 180 | results = t5_qa_evaluate(labels, predictions) 181 | exact = torch.tensor(results['exact']) 182 | f1 = torch.tensor(results['f1']) 183 | 184 | log = { 185 | 'val_exact': exact, # for monitoring checkpoint callback 186 | 'val_f1': f1, # for monitoring checkpoint callback 187 | } 188 | self.log_dict(log, logger=True, prog_bar=True, on_epoch=True) 189 | 190 | def test_epoch_end(self, outputs): 191 | predictions, labels, document_ids, typename_ids, probs, window_ids = \ 192 | [], [], [], [], [], [] 193 | 194 | for output in outputs: 195 | for label, pred, doc_id, tn_id, prob in zip( 196 | output['labels'], output['preds'], 197 | output['doc_ids'], output['tn_ids'], output['probs']): 198 | predictions.append(pred) 199 | labels.append(label) 200 | document_ids.append(doc_id) 201 | typename_ids.append(tn_id) 202 | probs.append(prob) 203 | 204 | # cache labels, predictions, document_ids, typename_ids and probs 205 | # so that we can post-process without running again the test_steps 206 | if self.hparams.use_cached_predictions and os.path.exists(self.cache_fname): 207 | print(f'Loading predictions from cached file {self.cache_fname}') 208 | labels, predictions, document_ids, typename_ids, probs = \ 209 | pd.read_pickle(self.cache_fname).T.values.tolist() 210 | else: 211 | self._backup_outputs(labels, predictions, document_ids, typename_ids, probs) 212 | 213 | # pick up the highest-probability prediction for each pair document-typename 214 | if self.hparams.get_highestprob_answer: 215 | ( 216 | labels, 217 | predictions, 218 | document_ids, 219 | typename_ids, 220 | probs, 221 | window_ids, 222 | ) = get_highest_probability_window( 223 | labels, 224 | predictions, 225 | document_ids, 226 | typename_ids, 227 | probs, 228 | use_fewer_NA=True, 229 | ) 230 | 231 | # split compound answers to get metrics to visualize and compute metrics for each subsentence 232 | if self.hparams.split_compound_answers: 233 | ( 234 | labels, 235 | predictions, 236 | document_ids, 237 | typename_ids, 238 | probs, 239 | window_ids, 240 | _, 241 | _, 242 | original_idx, 243 | disjoint_answer_idx_by_doc_class, 244 | ) = split_compound_labels_and_predictions( 245 | labels, 246 | predictions, 247 | document_ids, 248 | typename_ids, 249 | probs, 250 | window_ids, 251 | ) 252 | else: 253 | print('WARNING: We strongly recommend to set --split_compound_answers=True, ' 254 | 'even for datasets without compound qas. 
This is useful to get metrics ' 255 | 'for clean outputs (without sentence-IDs and raw-text).') 256 | original_idx = list(range(len(labels))) 257 | disjoint_answer_idx_by_doc_class = {} 258 | 259 | # for each typename_id or document_id, extract its indexes to get specific metrics 260 | if self.hparams.group_qas: 261 | qid_dict_by_typenames = group_qas(typename_ids, group_by_typenames=True) 262 | qid_dict_by_documents = group_qas(document_ids, group_by_typenames=False) 263 | qid_dict_by_typenames['ORIG'] = original_idx 264 | qid_dict_by_documents['ORIG'] = original_idx 265 | else: 266 | qid_dict_by_typenames = {'ORIG': original_idx} 267 | qid_dict_by_documents = {'ORIG': original_idx} 268 | 269 | # save labels and predictions 270 | self._save_outputs( 271 | labels, predictions, document_ids, 272 | probs, window_ids, qid_dict_by_typenames, 273 | outputs_fname='outputs_by_typenames.txt', 274 | document_classes=list(disjoint_answer_idx_by_doc_class.keys()) 275 | ) 276 | self._save_outputs( 277 | labels, predictions, typename_ids, 278 | probs, window_ids, qid_dict_by_documents, 279 | outputs_fname='outputs_by_documents.txt', 280 | document_classes=list(disjoint_answer_idx_by_doc_class.keys()) 281 | ) 282 | 283 | # For each document class, include the indexes of individual qas, and 284 | # of subsentences of compound qas. This is useful for fair comparison 285 | # of experiments with compound qas and individual qas. 286 | # Also save the disjoint samples in Excel sheets. 287 | all_idx = [] 288 | writer = pd.ExcelWriter('outputs_sheet_client.xlsx') 289 | for document_class, indices in disjoint_answer_idx_by_doc_class.items(): 290 | qid_dict_by_typenames['DISJOINT_' + document_class] = indices 291 | qid_dict_by_documents['DISJOINT_' + document_class] = indices 292 | all_idx += indices 293 | self._save_sheets( 294 | labels, predictions, document_ids, 295 | typename_ids, probs, document_class, indices, writer 296 | ) 297 | writer.close() 298 | qid_dict_by_typenames['DISJOINT_ALL'] = all_idx 299 | qid_dict_by_documents['DISJOINT_ALL'] = all_idx 300 | self._save_sheets( 301 | labels, predictions, document_ids, 302 | typename_ids, probs, 'all', all_idx 303 | ) 304 | 305 | # compute metrics 306 | results_by_typenames = t5_qa_evaluate( 307 | labels, predictions, qid_dict=qid_dict_by_typenames 308 | ) 309 | results_by_documents = t5_qa_evaluate( 310 | labels, predictions, qid_dict=qid_dict_by_documents 311 | ) 312 | exact = torch.tensor(results_by_typenames['exact']) 313 | f1 = torch.tensor(results_by_typenames['f1']) 314 | 315 | # write metric files 316 | with open('metrics_by_typenames.json', 'w') as f: 317 | json.dump(results_by_typenames, f, indent=4) 318 | with open('metrics_by_documents.json', 'w') as f: 319 | json.dump(results_by_documents, f, indent=4) 320 | 321 | log = { 322 | 'exact': exact, 323 | 'f1': f1 324 | } 325 | self.log_dict(log, logger=True, on_epoch=True) 326 | 327 | @torch.no_grad() 328 | def _compute_probs(self, sentences, predictions): 329 | probs = [] 330 | for sentence, prediction in zip(sentences, predictions): 331 | input_ids = self.tokenizer.encode(sentence, truncation=True, 332 | max_length=self.input_max_length, return_tensors="pt").to(self.device).long() 333 | output_ids = self.tokenizer.encode(prediction, truncation=True, 334 | max_length=self.input_max_length, return_tensors="pt").to(self.device).long() 335 | 336 | outputs = self.model(input_ids=input_ids, labels=output_ids) 337 | 338 | loss = outputs[0] 339 | prob = (loss * -1) / output_ids.shape[1] 340 | prob = 
np.exp(prob.cpu().numpy()) 341 | probs.append(prob) 342 | return probs 343 | 344 | def _backup_outputs(self, labels, predictions, document_ids, typename_ids, probs): 345 | arr = np.vstack([np.array(labels, dtype="O"), np.array(predictions, dtype="O"), 346 | np.array(document_ids, dtype="O"), np.array(typename_ids, dtype="O"), 347 | np.array(probs, dtype="O")]).transpose() 348 | df = pd.DataFrame(arr, columns=['labels', 'predictions', 'document_ids', 'typename_ids', 'probs']) 349 | df.to_pickle(self.cache_fname) 350 | 351 | def _save_outputs( 352 | self, labels, predictions, doc_or_tn_ids, probs, window_ids, 353 | qid_dict=None, outputs_fname='outputs.txt', document_classes=["form"] 354 | ): 355 | if qid_dict is None: 356 | qid_dict = {} 357 | 358 | f = open(outputs_fname, 'w') 359 | f.write('{0:<50} | {1:50} | {2:30} | {3} | {4}\n'.format( 360 | 'label', 'prediction', 'uuid', 'prob', 'window')) 361 | if qid_dict == {}: 362 | for label, prediction, doc_or_tn_id, prob, w_id in zip( 363 | labels, predictions, doc_or_tn_ids, probs, window_ids): 364 | lab, pred = label, prediction 365 | if self.hparams.normalize_outputs: 366 | lab, pred = normalize_answer(label), normalize_answer(prediction) 367 | if lab != pred or lab == pred and not self.hparams.only_misprediction_outputs: 368 | f.write('{0:<50} | {1:50} | {2:30} | {3} | {4}\n'.format( 369 | label, prediction, doc_or_tn_id, prob, w_id)) 370 | else: 371 | for (kword, list_indices) in qid_dict.items(): 372 | # do not print for ORIG, DISJOINT* and all samples for a specific project/document class 373 | # those groups are important for metrics, not for outputs visualization 374 | if kword == 'ORIG' or kword.startswith('DISJOINT') or kword in document_classes: 375 | continue 376 | f.write(f'===============\n{kword}\n===============\n') 377 | for idx in list_indices: 378 | label, prediction, doc_or_tn_id, prob, w_id = \ 379 | labels[idx], predictions[idx], doc_or_tn_ids[idx], probs[idx], window_ids[idx] 380 | lab, pred = label, prediction 381 | if self.hparams.normalize_outputs: 382 | lab, pred = normalize_answer(label), normalize_answer(prediction) 383 | if lab != pred or lab == pred and not self.hparams.only_misprediction_outputs: 384 | f.write('{0:<50} | {1:50} | {2:30} | {3} | {4}\n'.format( 385 | label, prediction, doc_or_tn_id, prob, w_id)) 386 | f.close() 387 | 388 | def _save_sheets(self, labels, predictions, document_ids, typename_ids, probs, document_class, indices, writer=None): 389 | # Saving disjoint predictions (splitted and clean) in a dataframe 390 | arr = np.vstack([np.array(document_ids, dtype="O")[indices], 391 | np.array(typename_ids, dtype="O")[indices], 392 | np.array(labels, dtype="O")[indices], 393 | np.array(predictions, dtype="O")[indices], 394 | np.array(probs, dtype="O")[indices]]).transpose() 395 | df = pd.DataFrame(arr, 396 | columns=['document_ids', 'typename_ids', 'labels', 'predictions', 'probs'] 397 | ).reset_index(drop=True) 398 | 399 | if document_class == 'all': 400 | df = df.sort_values(['document_ids', 'typename_ids']) # hack to keep listing outputs together for each document-class 401 | df_all_group_doc = df.set_index('document_ids', append=True).swaplevel(0,1) 402 | df_all_group_doc.to_excel('outputs_sheet.xlsx') 403 | else: 404 | # compute metrics for each pair document_id-typename-id 405 | df['exact'] = df.apply(lambda x: compute_exact(x['labels'], x['predictions']), axis=1) 406 | df['f1'] = df.apply(lambda x: compute_f1(x['labels'], x['predictions']), axis=1) 407 | 408 | # remove clue/prefix into brackets 
409 | df['labels'] = df.apply( 410 | lambda x: ', '.join(get_clean_answer_from_subanswer(x['labels'])), 411 | axis=1 412 | ) 413 | df['predictions'] = df.apply( 414 | lambda x: ', '.join(get_clean_answer_from_subanswer(x['predictions'])), 415 | axis=1 416 | ) 417 | 418 | # use pivot to get a quadruple of columns (labels, predictions, equal, prob) for each typename 419 | pivoted = df.pivot( 420 | index=['document_ids'], 421 | columns=['typename_ids'], 422 | values=['labels', 'predictions', 'exact', 'f1', 'probs'] 423 | ) 424 | pivoted = pivoted.swaplevel(0, 1, axis=1).sort_index(axis=1) # put column (typename_ids) above the values 425 | 426 | # extract typename_ids in the original order (instead of alphanumeric order) 427 | # get the columns from the document-ids that have more samples 428 | cols = df[df['document_ids']==df.document_ids.mode()[0]].typename_ids.tolist() 429 | if len(cols) == len(pivoted.columns) // 5: 430 | pivoted = pivoted[cols] 431 | else: 432 | print('Keeping typenames in alphanumeric order since none of the documents ' 433 | f'have all the possible qa_ids ({len(cols)} != {len(pivoted.columns) // 5})') 434 | 435 | # save sheet 436 | pivoted.to_excel(writer, sheet_name=document_class) 437 | 438 | def get_optimizer(self,) -> torch.optim.Optimizer: 439 | """Define the optimizer""" 440 | optimizer_name = self.hparams.optimizer 441 | lr = self.hparams.lr 442 | weight_decay=self.hparams.weight_decay 443 | optimizer = getattr(torch.optim, optimizer_name) 444 | 445 | # Prepare optimizer and schedule (linear warmup and decay) 446 | no_decay = ["bias", "LayerNorm.weight"] 447 | optimizer_grouped_parameters = [ 448 | { 449 | "params": [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)], 450 | "weight_decay": weight_decay, 451 | }, 452 | {"params": [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0}, 453 | ] 454 | 455 | if self.hparams.deepspeed: 456 | # DeepSpeedCPUAdam provides 5x to 7x speedup over torch.optim.adam(w) 457 | optimizer = DeepSpeedCPUAdam( 458 | optimizer_grouped_parameters, lr=lr, 459 | weight_decay=weight_decay, eps=1e-4, adamw_mode=True 460 | ) 461 | else: 462 | optimizer = optimizer( 463 | optimizer_grouped_parameters, lr=lr, weight_decay=weight_decay 464 | ) 465 | 466 | print(f'=> Using {optimizer_name} optimizer') 467 | 468 | return optimizer 469 | 470 | @staticmethod 471 | def add_model_specific_args(parent_parser): 472 | 473 | parser = configargparse.ArgumentParser(parents=[parent_parser], add_help=False) 474 | 475 | parser.add_argument( 476 | "--model_name_or_path", 477 | default='t5-small', 478 | type=str, 479 | required=True, 480 | help="Path to pretrained model or model identifier from huggingface.co/models", 481 | ) 482 | parser.add_argument( 483 | "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name" 484 | ) 485 | parser.add_argument( 486 | "--tokenizer_name", 487 | default="", 488 | type=str, 489 | help="Pretrained tokenizer name or path if not the same as model_name", 490 | ) 491 | parser.add_argument( 492 | "--cache_dir", 493 | default="", 494 | type=str, 495 | help="Where do you want to store the pre-trained models downloaded from s3", 496 | ) 497 | parser.add_argument( 498 | "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model." 
499 | ) 500 | parser.add_argument( 501 | "--max_seq_length", 502 | default=384, 503 | type=int, 504 | help="The maximum total input sequence length after WordPiece tokenization. Sequences " 505 | "longer than this will be truncated, and sequences shorter than this will be padded.", 506 | ) 507 | parser.add_argument( 508 | "--max_size", 509 | default=1024, 510 | type=int, 511 | help="The maximum input length after char-based tokenization. And also the maximum context " 512 | "size for char-based contexts." 513 | ) 514 | parser.add_argument( 515 | "--max_length", 516 | default=120, 517 | type=int, 518 | help="The maximum total output sequence length generated by the model." 519 | ) 520 | parser.add_argument( 521 | "--num_beams", 522 | default=1, 523 | type=int, 524 | help="Number of beams for beam search. 1 means no beam search." 525 | ) 526 | parser.add_argument( 527 | "--get_highestprob_answer", 528 | action="store_true", 529 | help="If true, get the answer from the sliding-window that gives highest probability." 530 | ) 531 | parser.add_argument( 532 | "--split_compound_answers", 533 | action="store_true", 534 | help="If true, split the T5 outputs into individual answers.", 535 | ) 536 | parser.add_argument( 537 | "--group_qas", 538 | action="store_true", 539 | help="If true, use group qas to get individual metrics ans structured output file for each type-name.", 540 | ) 541 | parser.add_argument( 542 | "--only_misprediction_outputs", 543 | action="store_true", 544 | help="If true, return only mispredictions in the output file.", 545 | ) 546 | parser.add_argument( 547 | "--normalize_outputs", 548 | action="store_true", 549 | help="If true, normalize label and prediction to include in the output file. " 550 | "The normalization is the same applied before computing metrics.", 551 | ) 552 | 553 | return parser 554 | -------------------------------------------------------------------------------- /information_extraction_t5/predict.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | """ Predicting the T5 model finetuned for question-answering on SQuAD.""" 3 | 4 | import configargparse 5 | import glob 6 | 7 | import torch 8 | from pytorch_lightning import Trainer 9 | 10 | from information_extraction_t5.models.qa_model import LitQA 11 | from information_extraction_t5.data.qa_data import QADataModule 12 | 13 | 14 | def main(): 15 | """Predict.""" 16 | 17 | parser = configargparse.ArgParser( 18 | 'Training and evaluation script for training T5 model for QA', 19 | config_file_parser_class=configargparse.YAMLConfigFileParser) 20 | parser.add_argument('-c', '--my-config', required=True, is_config_file=True, 21 | help='config file path') 22 | 23 | parser.add_argument("--seed", type=int, default=42, 24 | help="random seed for initialization") 25 | parser.add_argument('--num_workers', default=8, type=int) 26 | parser.add_argument("--use_cached_predictions", action="store_true", 27 | help="If true, reload the cache to post-process the senteces and compute metrics") 28 | 29 | parser = LitQA.add_model_specific_args(parser) 30 | parser = QADataModule.add_model_specific_args(parser) 31 | args, _ = parser.parse_known_args() 32 | 33 | # Load best checkpoint of the current experiment 34 | ckpt_path = glob.glob('lightning_logs/*ckpt')[0] 35 | print(f'Loading weights from {ckpt_path}') 36 | 37 | model = LitQA.load_from_checkpoint( 38 | checkpoint_path=ckpt_path, 39 | hparams_file='lightning_logs/version_0/hparams.yaml', 40 | map_location=None, 41 | 
hparams=args, 42 | ) 43 | gpus = 1 if torch.cuda.is_available() else 0 44 | if gpus == 0: 45 | model = torch.quantization.quantize_dynamic( 46 | model, {torch.nn.Linear}, dtype=torch.qint8 47 | ) 48 | 49 | dm = QADataModule(args) 50 | dm.setup('test') 51 | 52 | torch.set_num_threads(1) 53 | trainer = Trainer(gpus=gpus) 54 | trainer.test(model, datamodule=dm) 55 | 56 | 57 | if __name__ == "__main__": 58 | main() 59 | -------------------------------------------------------------------------------- /information_extraction_t5/train.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | """ Finetuning the T5 model for question-answering on SQuAD.""" 3 | 4 | import math 5 | import os 6 | import configargparse 7 | 8 | from pytorch_lightning import Trainer, seed_everything 9 | from pytorch_lightning.callbacks import LearningRateMonitor 10 | from pytorch_lightning.callbacks import RichProgressBar 11 | from pytorch_lightning.callbacks import RichModelSummary 12 | from pytorch_lightning.callbacks.model_checkpoint import ModelCheckpoint 13 | from pytorch_lightning.loggers.neptune import NeptuneLogger 14 | from pytorch_lightning.plugins import DeepSpeedPlugin 15 | 16 | from information_extraction_t5.models.qa_model import LitQA 17 | from information_extraction_t5.data.qa_data import QADataModule 18 | 19 | MODEL_DIR = 'lightning_logs' 20 | 21 | def main(): 22 | """Train.""" 23 | 24 | parser = configargparse.ArgParser( 25 | 'Training and evaluation script for training T5 model for QA', 26 | config_file_parser_class=configargparse.YAMLConfigFileParser) 27 | parser.add_argument('-c', '--my-config', required=True, is_config_file=True, 28 | help='config file path') 29 | 30 | # optimizer parameters 31 | parser.add_argument('--optimizer', type=str, default='Adam') 32 | parser.add_argument("--lr", default=5e-5, type=float, 33 | help="The initial learning rate.") 34 | parser.add_argument("--weight_decay", default=0.0, type=float, 35 | help="Weight decay if we apply some.") 36 | 37 | # neptune 38 | parser.add_argument("--neptune", action="store_true", help="If true, use neptune logger.") 39 | parser.add_argument('--neptune_project', type=str, default='ramon.pires/bracis-2021') 40 | parser.add_argument('--experiment_name', type=str, default='experiment01') 41 | parser.add_argument('--tags', action='append') 42 | 43 | parser.add_argument("--deepspeed", action="store_true", help="If true, use deepspeed plugin.") 44 | 45 | parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") 46 | parser.add_argument('--num_workers', default=8, type=int) 47 | 48 | # add all the available trainer options to argparse 49 | parser = Trainer.add_argparse_args(parser) 50 | # add model specific args 51 | parser = LitQA.add_model_specific_args(parser) 52 | # add datamodule specific args 53 | parser = QADataModule.add_model_specific_args(parser) 54 | args, _ = parser.parse_known_args() 55 | 56 | # cached predictions must be used only for predict.py 57 | args.use_cached_predictions = False 58 | 59 | # setting the seed for reproducibility 60 | if args.deterministic: 61 | seed_everything(args.seed) 62 | 63 | # data module 64 | dm = QADataModule(args) 65 | dm.setup('fit') 66 | 67 | # Defining the model 68 | model = LitQA(args) 69 | 70 | # For training larger models, we are not running validation but saving 71 | # checkpoint by train steps. 
72 | # To do this, we set check_val_every_n_epoch > max_epochs
73 | if args.check_val_every_n_epoch > args.max_epochs:
74 | # dataset_size / (batch_size * accum_batches)
75 | every_n_train_steps = math.ceil(
76 | len(dm.train_dataset) / (args.train_batch_size * args.accumulate_grad_batches))
77 | checkpoint_callback = ModelCheckpoint(
78 | dirpath=MODEL_DIR, filename='{epoch}-{train_loss:.4f}',
79 | monitor='train_loss_step', verbose=False, save_last=False, save_top_k=args.max_epochs,
80 | save_weights_only=True, mode='min', every_n_train_steps=every_n_train_steps
81 | )
82 | else:
83 | checkpoint_callback = ModelCheckpoint(
84 | dirpath=MODEL_DIR, filename='{epoch}-{val_exact:.2f}-{val_f1:.2f}',
85 | monitor='val_exact', verbose=False, save_last=False, save_top_k=5,
86 | save_weights_only=False, mode='max', every_n_epochs=1
87 | )
88 | 
89 | # Instantiate LearningRateMonitor Callback
90 | lr_logger = LearningRateMonitor(logging_interval='epoch')
91 | 
92 | # Set neptune logger
93 | if args.neptune:
94 | neptune_logger = NeptuneLogger(
95 | api_key=os.environ.get('NEPTUNE_API_TOKEN'),
96 | project=args.neptune_project,
97 | name=args.experiment_name,
98 | mode='async', # Possible values: "async", "sync", "offline", "debug", "read-only"
99 | run=None, # Set the run's identifier like 'SAN-1' in case of resuming a tracked run
100 | tags=args.tags,
101 | log_model_checkpoints=False,
102 | source_files=["**/*.py", "*.yaml"],
103 | capture_stdout=False,
104 | capture_stderr=False,
105 | capture_hardware_metrics=False,
106 | )
107 | else:
108 | neptune_logger = None
109 | 
110 | if args.deepspeed:
111 | deepspeed_plugin = DeepSpeedPlugin(
112 | stage=2,
113 | offload_optimizer=True,
114 | offload_parameters=True,
115 | allgather_bucket_size=2e8,
116 | reduce_bucket_size=2e8,
117 | allgather_partitions=True,
118 | reduce_scatter=True,
119 | overlap_comm=True,
120 | contiguous_gradients=True,
121 | ## Activation Checkpointing
122 | partition_activations=True,
123 | cpu_checkpointing=True,
124 | contiguous_memory_optimization=True,
125 | )
126 | else:
127 | deepspeed_plugin = None
128 | 
129 | # Defining the Trainer, training...
and finally testing 130 | trainer = Trainer.from_argparse_args( 131 | args, 132 | logger=neptune_logger, 133 | plugins=deepspeed_plugin, 134 | callbacks=[ 135 | lr_logger, 136 | checkpoint_callback, 137 | RichProgressBar(), 138 | RichModelSummary(max_depth=2) 139 | ] 140 | ) 141 | trainer.fit(model, datamodule=dm) 142 | 143 | dm.setup('test') 144 | trainer.test(datamodule=dm) 145 | 146 | # Save checkpoints folder 147 | if args.neptune: 148 | # neptune_logger.experiment.log_artifact(MODEL_DIR) 149 | neptune_logger.log_hyperparams(vars(args)) 150 | neptune_logger.run['training/artifacts/metrics_by_typenames.json'].log('metrics_by_typenames.json') 151 | neptune_logger.run['training/artifacts/metrics_by_documents.json'].log('metrics_by_documents.json') 152 | neptune_logger.run['training/artifacts/outputs_sheet_client.xlsx'].log('outputs_sheet_client.xlsx') 153 | 154 | 155 | if __name__ == "__main__": 156 | main() 157 | -------------------------------------------------------------------------------- /information_extraction_t5/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuralmind-ai/information-extraction-t5/988c589b433d96004139e1f63bfefd6778c0851b/information_extraction_t5/utils/__init__.py -------------------------------------------------------------------------------- /information_extraction_t5/utils/balance_data.py: -------------------------------------------------------------------------------- 1 | """Code to balance the dataset, keeping a negative-positive ratio""" 2 | import numpy as np 3 | import pandas as pd 4 | 5 | from information_extraction_t5.features.sentences import split_t5_sentence_into_components 6 | 7 | 8 | def count_pos_neg(labels, document_ids, example_ids): 9 | """Count the number of positive and negative samples. The counting is also 10 | done for each pair of document_id-example-id returned as a dict.""" 11 | 12 | pos, neg = 0, 0 13 | counter = {} 14 | for label, document_id, example_id in zip(labels, document_ids, example_ids): 15 | if document_id not in counter: 16 | counter[document_id] = {} 17 | 18 | if example_id not in counter[document_id].keys(): 19 | counter[document_id][example_id] = {'pos': 0, 'neg': 0} 20 | 21 | if 'N/A' in label: 22 | counter[document_id][example_id]['neg'] += 1 23 | neg += 1 24 | else: 25 | counter[document_id][example_id]['pos'] += 1 26 | pos += 1 27 | return pos, neg, counter 28 | 29 | 30 | def balance_data(inputs, labels, document_ids, example_ids, negative_ratio): 31 | """Control the number of negative examples (N/A) with respect to the number 32 | of positive examples by negative_ratio. The data balancing is performed for 33 | each pair of document_id-example_id. 34 | 35 | Negative samples are selected with resampling with replacement, as the number 36 | of negative samples can be lower than the number of positives * negative_ratio. 
37 | """
38 | 
39 | n_pos, n_neg, _ = count_pos_neg(labels, document_ids, example_ids)
40 | print(f'>> The negative-positive ratio of the original dataset is {n_neg/n_pos:.2f}.')
41 | 
42 | is_negative = []
43 | for label in labels:
44 | _, _, sub_answers = split_t5_sentence_into_components(label)
45 | if 'N/A' in sub_answers:
46 | is_negative.append(True)
47 | else:
48 | is_negative.append(False)
49 | 
50 | # create the initial dataframe
51 | arr = np.vstack([
52 | np.array(inputs),
53 | np.array(labels),
54 | np.array(document_ids),
55 | np.array(example_ids),
56 | np.array(is_negative, dtype=bool)]).transpose()
57 | df1 = pd.DataFrame(arr, columns=['examples', 'labels', 'document_ids', 'example_ids', 'is_negative'])
58 | 
59 | # Separate positive and negative samples
60 | df_pos = df1.loc[df1['is_negative'] == 'False']
61 | df_neg = df1.loc[df1['is_negative'] == 'True']
62 | 
63 | # create a temporary dataframe with an additional column counting how many
64 | # positive qas we have for each document_id-example_id pair
65 | df_pos_counter = df_pos.groupby(['document_ids', 'example_ids']).size().reset_index(name='counts')
66 | 
67 | # merge the positive dataframe with counter and negative dataframe
68 | df_merge = df_pos_counter.merge(df_neg, on=['document_ids', 'example_ids'], how='outer')
69 | # remove document-example pairs that have only negative qas (no positive qa)
70 | df_merge.dropna(inplace=True)
71 | 
72 | # process the merged dataframe to resample negative cases proportionally
73 | # to the number of positive cases
74 | df_group = df_merge.groupby(['document_ids', 'example_ids'])
75 | frames = []
76 | for group in df_group.groups:
77 | df = df_group.get_group(group)
78 | df = df.sample(int(df['counts'].values[0]) * negative_ratio, replace=True, random_state=42)
79 | frames.append(df)
80 | df_merge = pd.concat(frames)
81 | 
82 | # remove temporary columns
83 | df_merge = df_merge.drop(['counts', 'is_negative'], axis=1)
84 | df_pos = df_pos.drop(['is_negative'], axis=1)
85 | 
86 | # create the final dataframe by concatenating positives and negatives
87 | dfinal = pd.concat([df_pos, df_merge])
88 | 
89 | inputs, labels, document_ids, example_ids = dfinal.T.values.tolist()
90 | 
91 | n_pos, n_neg, _ = count_pos_neg(labels, document_ids, example_ids)
92 | if n_neg/n_pos != negative_ratio:
93 | print(f'>> The resultant negative-positive ratio is {n_neg/n_pos:.2f}. 
' 94 | 'Hint: set "use_missing_answers=False" to get a precise data balancing.') 95 | else: 96 | print(f'>> The resultant negative-positive ratio is {n_neg/n_pos:.2f}.') 97 | 98 | return inputs, labels, document_ids, example_ids 99 | -------------------------------------------------------------------------------- /information_extraction_t5/utils/freeze.py: -------------------------------------------------------------------------------- 1 | """Utilities for freezing parameters and checking whether they are frozen.""" 2 | from torch import nn 3 | 4 | 5 | def freeze_params(model: nn.Module): 6 | """Set requires_grad=False for each of model.parameters()""" 7 | for par in model.parameters(): 8 | par.requires_grad = False 9 | 10 | 11 | def freeze_embeds(model): 12 | """Freeze token embeddings and positional embeddings for bart, just token embeddings for t5.""" 13 | model_type = model.config.model_type 14 | 15 | if model_type in ["t5", "mt5"]: 16 | freeze_params(model.shared) 17 | for d in [model.encoder, model.decoder]: 18 | freeze_params(d.embed_tokens) 19 | elif model_type == "fsmt": 20 | for d in [model.model.encoder, model.model.decoder]: 21 | freeze_params(d.embed_positions) 22 | freeze_params(d.embed_tokens) 23 | else: 24 | freeze_params(model.model.shared) 25 | for d in [model.model.encoder, model.model.decoder]: 26 | freeze_params(d.embed_positions) 27 | freeze_params(d.embed_tokens) 28 | -------------------------------------------------------------------------------- /information_extraction_t5/utils/metrics.py: -------------------------------------------------------------------------------- 1 | """ Very heavily inspired by the official evaluation script for SQuAD version 2 | 2.0 which was modified by XLNet authors to update `find_best_threshold` 3 | scripts for SQuAD V2.0 In addition to basic functionality, we also compute 4 | additional statistics and plot precision-recall curves if an additional 5 | na_prob.json file is provided. This file is expected to map question ID's to 6 | the model's predicted probability that a question is unanswerable. """ 7 | import collections 8 | import re 9 | import string 10 | from typing import Dict, Optional 11 | import unicodedata 12 | 13 | 14 | def normalize_answer(s): 15 | """Lower text and remove punctuation, articles and extra whitespace.""" 16 | 17 | def remove_articles(text): 18 | regex = re.compile(r"\b(a|an|the)\b", re.UNICODE) 19 | # regex = re.compile(r"\b(o|a|os|as|um|uma|uns|umas)\b", re.UNICODE) # portuguese? 
20 | return re.sub(regex, " ", text)
21 | 
22 | def white_space_fix(text):
23 | return " ".join(text.split())
24 | 
25 | def remove_punc(text):
26 | exclude = set(string.punctuation)
27 | return "".join(ch for ch in text if ch not in exclude)
28 | 
29 | def lower(text):
30 | return text.lower()
31 | 
32 | def strip_accents(s):
33 | return ''.join(c for c in unicodedata.normalize('NFD', s)
34 | if unicodedata.category(c) != 'Mn')
35 | 
36 | # return white_space_fix(remove_articles(remove_punc(lower(s))))
37 | return white_space_fix(remove_articles(strip_accents(remove_punc(lower(s)))))
38 | 
39 | 
40 | def get_tokens(s):
41 | if not s:
42 | return []
43 | return normalize_answer(s).split()
44 | 
45 | 
46 | def compute_exact(a_gold, a_pred):
47 | return int(normalize_answer(a_gold) == normalize_answer(a_pred))
48 | 
49 | 
50 | def compute_f1(a_gold, a_pred):
51 | gold_toks = get_tokens(a_gold)
52 | pred_toks = get_tokens(a_pred)
53 | common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
54 | num_same = sum(common.values())
55 | if len(gold_toks) == 0 or len(pred_toks) == 0:
56 | # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
57 | return int(gold_toks == pred_toks)
58 | if num_same == 0:
59 | return 0
60 | precision = 1.0 * num_same / len(pred_toks)
61 | recall = 1.0 * num_same / len(gold_toks)
62 | f1 = (2 * precision * recall) / (precision + recall)
63 | return f1
64 | 
65 | 
66 | def make_eval_dict(exact_scores, f1_scores, qid_list=None):
67 | if not qid_list:
68 | total = len(exact_scores)
69 | return collections.OrderedDict(
70 | [
71 | ("exact", 100.0 * sum(exact_scores.values()) / total),
72 | ("f1", 100.0 * sum(f1_scores.values()) / total),
73 | ("total", total),
74 | ]
75 | )
76 | else:
77 | total = len(qid_list)
78 | return collections.OrderedDict(
79 | [
80 | ("exact",
81 | 100.0 * sum(exact_scores[k] for k in qid_list) / total),
82 | ("f1", 100.0 * sum(f1_scores[k] for k in qid_list) / total),
83 | ("total", total),
84 | ]
85 | )
86 | 
87 | 
88 | def get_raw_scores(answers, preds):
89 | """Computes the exact and f1 scores from the examples and the model
90 | predictions.
91 | 
92 | This version gets the answers and predictions in text format, as returned by T5.
93 | """
94 | exact_scores = {}
95 | f1_scores = {}
96 | 
97 | for i, (answer, pred) in enumerate(zip(answers, preds)):
98 | exact_scores[i] = compute_exact(answer, pred)
99 | f1_scores[i] = compute_f1(answer, pred)
100 | 
101 | return exact_scores, f1_scores
102 | 
103 | 
104 | def t5_qa_evaluate(answers, preds, qid_dict: Optional[Dict] = None):
105 | """Evaluates T5 predictions.
106 | 
107 | This is a simplification of `squad_evaluate` to compute the exact and f1
108 | scores from T5 predictions.
109 | If required, this version returns subdicts with f1 and exact measures for
110 | pre-selected groups of question-answers.
111 | 
112 | Examples:
113 | >>> qid_dict = {
114 | >>> 'matriculas': [0, 4],
115 | >>> 'comarca': [1, 4],
116 | >>> 'estado': [2, 6],
117 | >>> 'oficio': [3, 7]
118 | >>> }
119 | >>> t5_qa_evaluate(answers, preds, qid_dict=qid_dict)
120 | >>> {'exact': x, 'f1': y, 'total': 8, 'matriculas': {'exact': z, 'f1': w, 'total': 2}, ...
} 121 | """ 122 | if qid_dict is None: 123 | qid_dict = {} 124 | 125 | exact, f1 = get_raw_scores(answers, preds) 126 | evaluation = make_eval_dict(exact, f1) 127 | 128 | for (kword, qid_list) in qid_dict.items(): 129 | evaluation[kword] = make_eval_dict(exact, f1, qid_list) 130 | 131 | return evaluation 132 | -------------------------------------------------------------------------------- /information_extraction_t5/utils/processing.py: -------------------------------------------------------------------------------- 1 | """Utility methods for pre and post processing.""" 2 | from collections import OrderedDict 3 | from typing import Dict, List, Tuple 4 | 5 | import regex as re 6 | 7 | 8 | def get_intersection_set(list_a: List, list_b: List) -> set: 9 | """Returns the intersection set of two lists.""" 10 | set_a = set(list_a) 11 | set_b = set(list_b) 12 | intersection = set_a.intersection(set_b) 13 | 14 | return intersection 15 | 16 | 17 | def concat_or_terms(terms, suffix='{e<=1}'): 18 | """Concats a list of terms in an OR regex. 19 | 20 | Example: 21 | >>> concat_or_terms([r'foo', r'bar'], suffix='{e<=1}') 22 | '(?:foo|bar){e<=1}' 23 | 24 | Args: 25 | terms (list): terms to be considered in a regex search group 26 | suffix (str): fuzzy options to use in the search 27 | 28 | Returns: 29 | (str): regex string for group search 30 | 31 | """ 32 | groups = '|'.join(map(str, terms)) 33 | 34 | return r'(?:{}){}'.format(groups, suffix) 35 | 36 | 37 | def expand_composite_char_pattern(text: str) -> str: 38 | """ Replace composable char in the given text for a regex group with all 39 | its composite versions. 40 | 41 | Args: 42 | text: the string to be expanded 43 | 44 | Returns: 45 | a new string with every composable char replaced by its composites 46 | pattern 47 | """ 48 | 49 | composite_char_groups = [ 50 | 'aáàâã', 51 | 'eéê', 52 | 'ií', 53 | 'oóõ', 54 | 'uúü', 55 | 'cç' 56 | ] 57 | 58 | for group in composite_char_groups: 59 | text = re.sub(fr'[{group}]', f'[{group}]', text) 60 | return text 61 | 62 | 63 | def count_k_v(d): 64 | """Count keys and values in nested dictionary.""" 65 | keys, values = 0, 0 66 | if isinstance(d, Dict) or isinstance(d, OrderedDict): 67 | for item in d.keys(): 68 | if isinstance(d[item], (List, Tuple, Dict)): 69 | keys += 1 70 | k, v = count_k_v(d[item]) 71 | values += v 72 | keys += k 73 | else: 74 | keys += 1 75 | values += 1 76 | 77 | elif isinstance(d, (List, Tuple)): 78 | for item in d: 79 | if isinstance(item, (List, Tuple, Dict)): 80 | k, v = count_k_v(item) 81 | values += v 82 | keys += k 83 | else: 84 | values += 1 85 | 86 | return keys, values 87 | -------------------------------------------------------------------------------- /params.yaml: -------------------------------------------------------------------------------- 1 | model_name_or_path: unicamp-dl/ptt5-base-portuguese-vocab 2 | do_lower_case: false 3 | deepspeed: false 4 | 5 | # neptune 6 | neptune: false 7 | neptune_project: ramon.pires/information-extraction-t5 8 | experiment_name: experiment01 9 | tags: [ptt5, compound] 10 | 11 | # optimizer 12 | optimizer: AdamW 13 | lr: 1e-4 14 | weight_decay: 1e-5 15 | 16 | # preprocess dataset 17 | project: [ 18 | form, 19 | ] 20 | raw_data_file: [ 21 | data/raw/sample_train.json 22 | ] 23 | raw_valid_data_file: [ 24 | null, 25 | ] 26 | raw_test_data_file: [ 27 | data/raw/sample_test.json 28 | ] 29 | train_file: data/processed/train-v0.1.json 30 | valid_file: data/processed/dev-v0.1.json 31 | test_file: data/processed/test-v0.1.json 32 | type_names: [ 33 | 
form.etiqueta, 34 | form.agencia, 35 | form.conta_corrente, 36 | form.cpf, 37 | form.nome_completo, 38 | form.n_doc_serie, 39 | form.orgao_emissor, 40 | form.data_emissao, 41 | form.data_nascimento, 42 | form.nome_mae, 43 | form.nome_pai, 44 | form.endereco, 45 | ] 46 | use_compound_question: [ 47 | form.endereco, 48 | ] 49 | return_raw_text: [ 50 | null, 51 | ] 52 | 53 | train_force_qa: true 54 | train_choose_question: first 55 | valid_percent: 0.2 56 | context_content: windows_token 57 | window_overlap: 0.2 58 | max_windows: 3 59 | max_size: 2048 60 | max_seq_length: 512 61 | 62 | # dataset 63 | train_batch_size: 8 64 | val_batch_size: 8 65 | shuffle_train: true 66 | use_sentence_id: false 67 | negative_ratio: -1 68 | 69 | seed: 20210519 70 | num_workers: 6 71 | 72 | # inference and post-processing 73 | num_beams: 5 74 | max_length: 200 75 | get_highestprob_answer: true 76 | split_compound_answers: true 77 | group_qas: true 78 | normalize_outputs: true 79 | only_misprediction_outputs: true 80 | use_cached_predictions: true 81 | 82 | # Trainer 83 | accelerator: auto 84 | devices: auto 85 | max_epochs: 26 86 | deterministic: true 87 | accumulate_grad_batches: 2 88 | amp_backend: native 89 | precision: 32 90 | gradient_clip_val: 1.0 91 | val_check_interval: 1.0 92 | check_val_every_n_epoch: 2 93 | limit_val_batches: 0.5 94 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==0.9.0 2 | appdirs==1.4.4 3 | atpublic==1.0 4 | attrs==19.3.0 5 | boto3==1.20.20 6 | botocore==1.23.20 7 | cachetools==4.1.1 8 | certifi==2020.6.20 9 | chardet==3.0.4 10 | click==7.1.2 11 | colorama==0.4.3 12 | ConfigArgParse==1.2.3 13 | configobj==5.0.6 14 | decorator==4.4.2 15 | dill==0.3.2 16 | distro==1.5.0 17 | docutils==0.15.2 18 | dpath==2.0.1 19 | dvc==1.9.1 20 | filelock==3.0.12 21 | flatten-json==0.1.7 22 | flufl.lock==3.2 23 | funcy==1.14 24 | future==0.18.2 25 | fuzzysearch==0.7.3 26 | fuzzywuzzy==0.18.0 27 | gitdb==4.0.5 28 | GitPython==3.1.7 29 | google-auth==1.18.0 30 | google-auth-oauthlib==0.4.1 31 | googleapis-common-protos==1.52.0 32 | grandalf==0.6 33 | grpcio==1.30.0 34 | idna==2.10 35 | jmespath==0.10.0 36 | joblib==0.16.0 37 | jsonpath-ng==1.5.1 38 | Markdown==3.2.2 39 | nanotime==0.5.2 40 | neptune-client==0.13.3 41 | neptune-contrib==0.28.1 42 | networkx==2.3 43 | numpy==1.19.0 44 | oauthlib==3.1.0 45 | openpyxl==3.0.9 46 | packaging>=19.0 47 | pandas<=1.3.5 48 | pathspec==0.8.0 49 | ply==3.11 50 | promise==2.3 51 | protobuf==3.12.2 52 | psutil==5.8.0 53 | pyasn1==0.4.8 54 | pyasn1-modules==0.2.8 55 | pydot==1.4.1 56 | pygtrie==2.3.2 57 | pyparsing==2.4.7 58 | python-dateutil==2.8.1 59 | python-Levenshtein==0.12.0 60 | pytorch-lightning==1.5.5 61 | PyYAML==5.3.1 62 | regex==2020.6.8 63 | requests==2.24.0 64 | requests-oauthlib==1.3.0 65 | rich==10.15.2 66 | rsa==4.5.0 67 | ruamel.yaml==0.16.10 68 | ruamel.yaml.clib==0.2.0 69 | s3transfer==0.5.0 70 | sacremoses==0.0.43 71 | scikit-learn==0.23.1 72 | scipy==1.5.1 73 | sentencepiece==0.1.91 74 | shortuuid==1.0.1 75 | shtab<2.0.0 76 | six==1.15.0 77 | sklearn==0.0 78 | smmap==3.0.4 79 | tabulate==0.8.7 80 | termcolor==1.1.0 81 | threadpoolctl==2.1.0 82 | tokenizers==0.10.3 83 | torch==1.10.0 84 | tqdm==4.47.0 85 | transformers==4.8.2 86 | urllib3==1.25.9 87 | voluptuous==0.11.7 88 | Werkzeug==1.0.1 89 | wrapt==1.12.1 90 | zc.lockfile==2.0 91 | deepspeed==0.4.3 92 | 
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | """Install information_extraction_t5"""
2 | import os
3 | from setuptools import setup
4 | from setuptools import find_packages
5 | 
6 | pkg_dir = os.path.dirname(__file__)
7 | 
8 | with open(os.path.join(pkg_dir, 'requirements.txt'), 'r', encoding='utf-8') as fd:
9 | requirements = fd.read().splitlines()
10 | 
11 | setup(
12 | name='information_extraction_t5',
13 | version='1.0',
14 | packages=find_packages('.', exclude=['data*',
15 | 'lightning_logs*',
16 | 'models*']),
17 | long_description=open('README.md', 'r', encoding='utf-8').read(),
18 | install_requires=requirements,
19 | )
20 | 
--------------------------------------------------------------------------------
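
The evaluation helper `t5_qa_evaluate` in `information_extraction_t5/utils/metrics.py` can also be exercised on its own, outside the Lightning pipeline. The snippet below is a minimal sketch, not part of the repository; the answer strings, prediction strings, and group names are invented for illustration (the group names merely echo the `type_names` in `params.yaml`), and only the function signature and return structure shown above are assumed.

```python
# Minimal sketch (not part of the repository): calling t5_qa_evaluate directly.
from information_extraction_t5.utils.metrics import t5_qa_evaluate

# Gold answers and model predictions, aligned by position (index i is QA pair i).
answers = ["João da Silva", "12/05/1998", "N/A"]
preds = ["joão da silva", "12/05/1998", "N/A"]

# Optional grouping: map a group name to the QA indices it covers, so the
# report also contains a per-group sub-dict with exact/f1/total.
qid_dict = {
    "form.nome_completo": [0],
    "form.data_nascimento": [1],
    "form.nome_pai": [2],
}

report = t5_qa_evaluate(answers, preds, qid_dict=qid_dict)
print(report["exact"], report["f1"])  # overall scores on a 0-100 scale
print(report["form.nome_completo"])   # e.g. OrderedDict([('exact', 100.0), ('f1', 100.0), ('total', 1)])
```

Passing `qid_dict` is what yields grouped exact/F1 sub-dicts on top of a flat list of predictions, which is presumably how the per-field and per-document breakdowns reported by the pipeline are assembled.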