├── .gitignore ├── LICENSE ├── README.md ├── data └── raw │ ├── sample_test.json │ └── sample_train.json ├── information_extraction_t5 ├── __init__.py ├── data │ ├── __init__.py │ ├── basic_to_squad.py │ ├── convert_dataset_to_squad.py │ ├── convert_squad_to_t5.py │ ├── file_handling.py │ └── qa_data.py ├── features │ ├── __init__.py │ ├── context.py │ ├── highlights.py │ ├── postprocess.py │ ├── preprocess.py │ ├── questions │ │ ├── __init__.py │ │ ├── questions.py │ │ └── type_map.py │ └── sentences.py ├── models │ ├── __init__.py │ └── qa_model.py ├── predict.py ├── train.py └── utils │ ├── __init__.py │ ├── balance_data.py │ ├── freeze.py │ ├── metrics.py │ └── processing.py ├── params.yaml ├── requirements.txt └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | # These are some examples of commonly ignored file patterns. 2 | # You should customize this list as applicable to your project. 3 | # Learn more about .gitignore: 4 | # https://www.atlassian.com/git/tutorials/saving-changes/gitignore 5 | 6 | # Node artifact files 7 | node_modules/ 8 | dist/ 9 | 10 | # Compiled Java class files 11 | *.class 12 | 13 | # Compiled Python bytecode 14 | *.py[cod] 15 | 16 | # Log files 17 | *.log 18 | 19 | # Package files 20 | *.jar 21 | 22 | # Maven 23 | target/ 24 | dist/ 25 | 26 | # JetBrains IDE 27 | .idea/ 28 | 29 | # Unit test reports 30 | TEST*.xml 31 | 32 | # Generated by MacOS 33 | .DS_Store 34 | 35 | # Generated by Windows 36 | Thumbs.db 37 | 38 | # Applications 39 | *.app 40 | *.exe 41 | *.war 42 | 43 | # Large media files 44 | *.mp4 45 | *.tiff 46 | *.avi 47 | *.flv 48 | *.mov 49 | *.wmv 50 | 51 | MANIFEST 52 | *egg-info 53 | 54 | .vscode 55 | 56 | cache* 57 | 58 | data/processed 59 | 60 | lightning_logs/* 61 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 NeuralMind 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Information Extraction using T5 2 | 3 | [![arXiv](https://img.shields.io/badge/arXiv-2201.05658-f9f107.svg)](https://arxiv.org/abs/2201.05658) 4 | 5 | This project provides a solution for training and validating seq2seq models for information extraction. The method can be applied to any text-only type of document, such as legal, registration, or news documents. The project extracts information through question answering. 6 | 7 | In this work, we evaluate sequence-to-sequence models as an alternative to token-level classification methods for extracting information from documents. T5 models are fine-tuned to jointly extract the information and generate the output in a structured format. Post-processing steps are learned during training, eliminating the need for rule-based methods and simplifying the pipeline. 8 | 9 | Neither the model weights nor the datasets can be released, for ethical reasons. However, we release source code that works with different models of the T5 family and can easily be extended to new datasets and languages. 10 | 11 | # Installation 12 | 13 | Clone the repository and install via: 14 | 15 | ```bash 16 | pip install . 17 | ``` 18 | 19 | # Fine-tuning 20 | 21 | Configure the parameters in `params.yaml`. Then, preprocess the datasets by running: 22 | 23 | ```bash 24 | python information_extraction_t5/data/convert_dataset_to_squad.py -c params.yaml 25 | ``` 26 | 27 | Start the fine-tuning experiment via: 28 | 29 | ```bash 30 | python information_extraction_t5/train.py -c params.yaml 31 | ``` 32 | 33 | The code runs the training and then the inference using the checkpoint with the best exact match on the validation set, followed by the complete post-processing, which mainly involves: 34 | 35 | - selecting the most likely answer (the sliding window that produced the highest-probability response), and 36 | - breaking compound answers into clean, individual sub-answers. 37 | 38 | It also computes metrics and generates the following output files: 39 | 40 | - `metrics_by_typenames.json`: JSON file with exact match and F1-score for each *field*, each dataset, and all documents. 41 | - `metrics_by_documents.json`: JSON file with exact match and F1-score for each *document*, each dataset, and all documents. 42 | - `outputs_by_typenames.txt`: TXT file with labels, predictions, *document id*, probability of the selected window, and window id, grouped by *field*. 43 | - `outputs_by_documents.txt`: TXT file with labels, predictions, *field id*, probability of the selected window, and window id, grouped by *document*. 44 | - `output_sheet.xlsl`: Excel file with field ids, labels, predictions, and probabilities, grouped by document. 45 | - `output_sheet_client.xlsl`: Excel file with labels, predictions, probabilities, and metrics, organized as one sheet per dataset. 46 | 47 | Note that, for now, only a [tiny synthetic dataset](data/raw/sample_train.json) is available. To extend the project to new datasets, please consult [this section](#extending-for-new-datasets).
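As a rough illustration of the second post-processing step above, the sketch below splits a compound T5 output with bracketed clues into clean, individual sub-answers. This is only a minimal, regex-based approximation written for this README, not the project's actual `postprocess.py` logic, and the clue names are illustrative ones taken from the sample `form` dataset.

```python
import re

def split_compound_answer(answer: str) -> dict:
    """Break a T5 output such as '[Logradouro]: RUA X [Número]: 78'
    into one sub-answer per bracketed clue (illustrative only)."""
    parts = re.findall(r'\[([^\]]+)\]:\s*(.*?)(?=\s*\[[^\]]+\]:|$)', answer)
    return {clue: value.strip() for clue, value in parts}

print(split_compound_answer('[Logradouro]: RUA PEDRO NOVAIS [Número]: 78 [Bairro]: N/A'))
# {'Logradouro': 'RUA PEDRO NOVAIS', 'Número': '78', 'Bairro': 'N/A'}
```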
48 | 49 | # Inference 50 | 51 | Assuming you have finished training and inference and want to use a different test dataset, an intermediate checkpoint, or even explore a different post-processing function or parameter, just run: 52 | 53 | ```bash 54 | python information_extraction_t5/predict.py -c params.yaml 55 | ``` 56 | 57 | For cases that require a new inference round, remember to set `use_cache_predictions: false` to overwrite the cache. Otherwise, if you only intend to rerun the post-processing, set `use_cache_predictions: true`. 58 | 59 | # Setting the hyperparameters 60 | 61 | To see all the settings related to the pre-processing, training, inference and post-processing stages, please run: 62 | 63 | ```bash 64 | python information_extraction_t5/train.py --help 65 | ``` 66 | 67 | You will find an extensive list of parameters because the parser inherits PyTorch Lightning's Trainer arguments. 68 | Give special attention only to the parameters that are in the `params.yaml` file. 69 | 70 | 71 | 72 | 73 | 74 | # Extending for new datasets 75 | 76 | In this section we explain how to include new datasets for fine-tuning and inference. 77 | It is important to emphasize that the four datasets originally used in the project cannot be released, for ethical reasons. 78 | 79 | ## Preparing the questions and type-map 80 | 81 | There are two preliminary steps when extending the project to new datasets. 82 | 83 | ### Mapping field names to clues *[Mandatory]* 84 | 85 | The original field names of a dataset can be noisy and unnatural. One important step is converting those irregular names into natural ones. The natural names will be used as clues in the answers. 86 | 87 | For each dataset, it is necessary to [map](information_extraction_t5/features/questions/type_map.py) field names (called type-names in the code) to types and vice-versa. The types are used as clues, in brackets, in the T5 outputs. The field names are recovered in the post-processing stage. 88 | 89 | Each dataset has its own `TYPENAME_TO_TYPE` dictionary. We strongly recommend that the types used across all projects be consistent, and as generic as possible. For example, use *CPF/CNPJ* for all CPFs and CNPJs, regardless of whether they belong to a consultant, current account holder, business partner, land owner, etc. 90 | 91 | ### Formulating questions *[Optional]* 92 | 93 | If your dataset does not follow the required [SQuAD format](#format-of-the-dataset) and you intend to use the [pre-processing code](#converting-the-dataset-to-squad-format), it is necessary to formulate the [questions](information_extraction_t5/features/questions/questions.py) before starting the conversion. 94 | 95 | Each dataset has its own dictionary of questions, in which each key is a field name (type-name in the code) and each value is a list of questions. 96 | 97 | HINT: We use a list of questions for each field as a strategy to augment the dataset. You can enable this data augmentation by setting `train_choose_question: all` in `params.yaml`. Use `random` to select one question randomly, or `first` to take the first question of each field. 98 | 99 | If you have compound information (the value is an internal dictionary), we recommend representing the dict as an OrderedDict and using the dictionary keys as the field signature, ensuring that a compound answer always has its sub-answers in an immutable order (see the sketch below).
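To make the two steps above concrete, here is a minimal sketch of what the entries for the sample `form` dataset could look like. It is illustrative only: the real dictionaries live in `type_map.py` and `questions.py`, and their exact layout (in particular the `compound` entry and the extra question variants) may differ from what is shown here.

```python
from collections import OrderedDict

# Map noisy field names (type-names) to the natural types used as
# bracketed clues in the T5 answers, e.g. '[Agência]: 2347'.
TYPENAME_TO_TYPE = {
    'agencia': 'Agência',
    'nome_completo': 'Nome',
}

# One list of questions per field; the extra variants act as data
# augmentation when `train_choose_question: all` is set.
QUESTIONS = {
    'agencia': ['Qual é o número da agência?', 'Qual é a agência?'],
    'nome_completo': ['Qual é o nome?'],
    # A compound field keeps its sub-fields in an OrderedDict so that
    # the sub-answers always appear in the same, immutable order.
    'endereco': OrderedDict([
        ('compound', ['Qual é o endereço?']),
        ('logradouro', ['Qual é o logradouro?']),
        ('numero', ['Qual é o número?']),
    ]),
}
```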
100 | 101 | ## Format of the dataset 102 | 103 | As the project extracts information through QA, we adopt SQuAD as the format of the datasets, with a few adaptations. Below we present an example that illustrates the structure of the dataset file and describes the adaptations made to support sliding windows and to reference each pair [document, field], enabling effective metric computation for each document, dataset and field. 104 | 105 | ```json 106 | { 107 | "data": [ 108 | { 109 | "title": "318", 110 | "paragraphs": [ 111 | { 112 | "context": "Proposta de Abertura de Conta, Contrata\u00e7\u00e3o de Cr\u00e9dito e\nAdes\u00e3o a Produtos e Servi\u00e7os Banc\u00e1rios - Pessoa F\u00edsica\nID00147\nAg\u00eancia N\u00ba\n1234\nConta Corrente 0011-2347-0000809875312\nCondi\u00e7\u00e3o de Movimenta\u00e7\u00e3o da Conta X Individual\nAltera\u00e7\u00e3o cadastral\nAngariador (matr\u00edcula) L\n00098961\nDados B\u00e1sicos do Titular\nCPF\n516.759.760-90\n...", 113 | "qas": [ 114 | { 115 | "answers": [ 116 | { 117 | "answer_start": 157, 118 | "text": "[Ag\u00eancia]: 2347" 119 | } 120 | ], 121 | "question": "Qual \u00e9 o n\u00famero da ag\u00eancia?", 122 | "id": "form.agencia" 123 | }, 124 | { 125 | "answers": [ 126 | { 127 | "answer_start": -1, 128 | "text": "[Nome]: N/A" 129 | } 130 | ], 131 | "question": "Qual \u00e9 o nome?", 132 | "id": "form.nome_completo" 133 | } 134 | ] 135 | }, 136 | { 137 | "context": "...\nNome Completo ANA MADALENA SILVEIRA ALVES\nDocumento de Identifica\u00e7\u00e3o CNH CTPS Entidade de Classe Mercosul Passaporte\nProtocolo Refugiado\nRIC RNE\nCIE Guia de Acolhimento ao Menor Registro Nacional Migrat\u00f3rio\nN\u00b0 Documento / N\u00b0 da S\u00e9rie (CTPS)\n73258674 \u00d3rg\u00e3o Emissor SSP\nUF BA\nData de Emiss\u00e3o 21/07/2018 Data de Vcto (passaporte/CNH).", 138 | "qas": [ 139 | { 140 | "answers": [ 141 | { 142 | "answer_start": -1, 143 | "text": "[Ag\u00eancia]: N/A" 144 | } 145 | ], 146 | "question": "Qual \u00e9 o n\u00famero da ag\u00eancia?", 147 | "id": "form.agencia" 148 | }, 149 | { 150 | "answers": [ 151 | { 152 | "answer_start": 18, 153 | "text": "[Nome]: ANA MADALENA SILVEIRA ALVES" 154 | } 155 | ], 156 | "question": "Qual \u00e9 o nome?", 157 | "id": "form.nome_completo" 158 | } 159 | ] 160 | } 161 | ] 162 | } 163 | ], 164 | "version": "0.1" 165 | } 166 | ``` 167 | The example presented here includes one document, whose `id = 318` and whose context fits into two sliding windows. We adapted the SQuAD format by transforming the list of different documents related to the same theme into a list of different sliding windows of the same document. For each document, referenced by `title`, `paragraphs` is a list of dictionaries, each holding a context and an internal list of QAs. 168 | 169 | The QA dictionaries follow the same intuition as in the SQuAD dataset, but we include in `id` the signature of the QA, which combines the project (dataset name) and the field. This is very important, since it enables computing metrics not only for all the datasets together, but also for each dataset individually as well as for each field.
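Assuming the converted files already exist at the paths configured in `params.yaml` (e.g. `data/processed/test-v0.1.json` after running the conversion script), the short sketch below shows how this adapted structure can be traversed: `title` identifies the document, each entry of `paragraphs` is one sliding window, and the `id` signature (`dataset.field`) is what allows grouping metrics per dataset and per field.

```python
import json
from collections import Counter

# Path is an assumption: adjust to wherever your converted file lives.
with open('data/processed/test-v0.1.json', encoding='utf-8') as f:
    squad = json.load(f)

per_field = Counter()
for document in squad['data']:
    doc_id = document['title']            # document reference, e.g. "318"
    windows = document['paragraphs']      # one entry per sliding window
    for window in windows:
        for qa in window['qas']:
            per_field[qa['id']] += 1      # e.g. 'form.agencia'
    print(f'document {doc_id}: {len(windows)} window(s)')

print(dict(per_field))                    # question-answers per field
```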
170 | 171 | ## Adding the dataset 172 | 173 | Assuming you already have a dataset pre-processed in SQuAD format, to include it in the project you just need to choose a name for it and edit the following parameters in the `params.yaml` file: 174 | 175 | ```yaml 176 | project: [ 177 | form, 178 | ] 179 | train_file: data/processed/train-v0.1.json 180 | valid_file: data/processed/dev-v0.1.json 181 | test_file: data/processed/test-v0.1.json 182 | ``` 183 | 184 | Note that it is possible to include several datasets in the list of projects, but each `{train, valid, test}_file` contains the examples of all the datasets listed in `project`. 185 | 186 | ## Converting the dataset to SQuAD format 187 | 188 | If your dataset is not yet in the more complex SQuAD-like format, with the document divided into sliding windows, the question-answer pairs, and the correct qa-ids, don't worry! We release code to [convert the dataset to the expected format](information_extraction_t5/data/basic_to_squad.py). 189 | 190 | All you need to do is ensure the dataset follows a basic JSON format: a dictionary of documents, in which each key is the document id and each value is an internal dict that must have a "text" key holding the document content, plus other key-value pairs representing the fields of the document. 191 | 192 | You can inspect [here](data/raw/sample_train.json) a raw dataset that is ready to be converted into SQuAD format. NOTE: If you want to extract compound information (using the compound QA feature) for a compound field, such as `address`, the value of the respective key must be another dictionary with the expected information. 193 | 194 | Thus, to generate the SQuAD-like dataset illustrated in the previous subsection, just set the parameters as below: 195 | 196 | ```yaml 197 | project: [ 198 | form, 199 | ] 200 | raw_data_file: [ 201 | data/raw/sample_train.json, 202 | ] 203 | raw_valid_data_file: [ 204 | null, 205 | ] 206 | raw_test_data_file: [ 207 | data/raw/sample_test.json, 208 | ] 209 | train_file: data/processed/train-v0.1.json 210 | valid_file: data/processed/dev-v0.1.json 211 | test_file: data/processed/test-v0.1.json 212 | type_names: [ 213 | form.agencia, 214 | form.nome_completo, 215 | ] 216 | ``` 217 | 218 | The parameter names are intuitive. You can include any number of dataset names and their respective train, validation and test paths (the four lists must have the same number of entries). If any of the datasets does not have a validation subset, just put `null` in its position, and a fraction `valid_percent` of the training set will be moved to the validation set. 219 | 220 | Finally, just run the command below to convert the listed datasets to SQuAD format and save them as `{train, valid, test}_file`: 221 | 222 | ```bash 223 | python information_extraction_t5/data/convert_dataset_to_squad.py -c params.yaml 224 | ``` 225 | 226 | ### Limitation 227 | 228 | The released pre-processing code does not include the `sentence-ids` feature and the `raw-text` format, as they would require a more complex and elaborate raw dataset whose structure includes annotations of positions and texts in both raw and canonical formats. Those features are important only for industrial applications, but, depending on the dataset size, you can manually include `answer_start` for each qa, setting the answer to *N/A* when it does not fit in the window.
For training the model to extract canonical and raw-text information, you can change both the questions and answers as: 229 | 230 | ``` 231 | Q: What is the state? 232 | A: [State]: São Paulo 233 | 234 | Q: What is the state and how does it appear in the text? 235 | A: [State]: SP [appears in the text]: São Paulo 236 | ``` 237 | 238 | # Cite as 239 | 240 | ```bibtex 241 | @inproceedings{pires2022seq2seq, 242 | title = {Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents}, 243 | author = {Pires, Ramon and de Souza, Fábio C. and Rosa, Guilherme and Lotufo, Roberto A. and Nogueira, Rodrigo}, 244 | publisher = {arXiv}, 245 | doi = {10.48550/ARXIV.2201.05658}, 246 | url = {https://arxiv.org/abs/2201.05658}, 247 | year = {2022}, 248 | } 249 | ``` 250 | -------------------------------------------------------------------------------- /data/raw/sample_test.json: -------------------------------------------------------------------------------- 1 | { 2 | "965": { 3 | "etiqueta": "ID00123", 4 | "agencia": "1234", 5 | "conta_corrente": "0093-1234-0000103133931", 6 | "cpf": "675.957.460-51", 7 | "nome_completo": "MARCIO MACEDO SOUZA", 8 | "n_doc_serie": "73258674", 9 | "orgao_emissor": "SSP", 10 | "data_emissao": "05/09/1988", 11 | "data_nascimento": "05/05/1971", 12 | "nome_mae": "MARIA ANTONIETA MACEDO", 13 | "nome_pai": "JOABE SALVADOR SOUZA", 14 | "endereco": { 15 | "logradouro": "RUA PEDRO NOVAIS", 16 | "numero": "78", 17 | "complemento": "Apto 8", 18 | "bairro": "AFOGADOS", 19 | "cidade": "RECIFE", 20 | "estado": "PE", 21 | "cep": "56220-000" 22 | }, 23 | "text": "Proposta de Abertura de Conta, Contrata\u00e7\u00e3o de Cr\u00e9dito e\nAdes\u00e3o a Produtos e Servi\u00e7os Banc\u00e1rios - Pessoa F\u00edsica\nID00123\nAg\u00eancia N\u00ba\n1234\nConta Corrente 0093-1234-0000103133931\nCondi\u00e7\u00e3o de Movimenta\u00e7\u00e3o da Conta X Individual\nAltera\u00e7\u00e3o cadastral\nAngariador (matr\u00edcula) L\n000685631\nDados B\u00e1sicos do Titular\nCPF\n675.957.460-51\nNome Completo MARCIO MACEDO SOUZA\nDocumento de Identifica\u00e7\u00e3o CNH CTPS Entidade de Classe Mercosul Passaporte\nProtocolo Refugiado\nRIC RNE\nCIE Guia de Acolhimento ao Menor Registro Nacional Migrat\u00f3rio\nN\u00b0 Documento / N\u00b0 da S\u00e9rie (CTPS)\n73258674 \u00d3rg\u00e3o Emissor SSP\nUF PE\nData de Emiss\u00e3o 05/09/1988 Data de Vcto (passaporte/CNH)\n | Data de Nascimento 05/05/1971 Sexo F X M\nNacionalidade x Brasileira\nNome da M\u00e3e MARIA ANTONIETA MACEDO\nNome do Pai JOABE SALVADOR SOUZA\nCidadania\nBRASILEIRA\nDomic\u00edlio fiscal\nBRASIL\nEndere\u00e7os\nEndere\u00e7o Residencial\nRua/Av/P\u00e7a/Estrada RUA PEDRO NOVAIS\nN\u00famero\n78 Complemento Apto 8\nBairro AFOGADOS\nMunic\u00edpio RECIFE\nUF PE\nPa\u00eds BRASIL\n56220-000" 24 | } 25 | } -------------------------------------------------------------------------------- /data/raw/sample_train.json: -------------------------------------------------------------------------------- 1 | { 2 | "318": { 3 | "etiqueta": "ID00147", 4 | "agencia": "2347", 5 | "conta_corrente": "0011-2347-0000809875312", 6 | "cpf": "516.759.760-90", 7 | "nome_completo": "ANA MADALENA SILVEIRA ALVES", 8 | "n_doc_serie": "73258674", 9 | "orgao_emissor": "SSP", 10 | "data_emissao": "21/07/2018", 11 | "data_nascimento": "12/04/1992", 12 | "nome_mae": "MADALENA COSTA SILVEIRA", 13 | "nome_pai": "JUNIOR AUGUSTO ALVES", 14 | "endereco": { 15 | "logradouro": "AV. 
CRESCENCIO LISBOA", 16 | "numero": "986", 17 | "complemento": "Apto 3", 18 | "bairro": "BARAUNA", 19 | "cidade": "BARREIRAS", 20 | "estado": "BA", 21 | "cep": "47800-013" 22 | }, 23 | "text": "Proposta de Abertura de Conta, Contrata\u00e7\u00e3o de Cr\u00e9dito e\nAdes\u00e3o a Produtos e Servi\u00e7os Banc\u00e1rios - Pessoa F\u00edsica\nID00147\nAg\u00eancia N\u00ba\n1234\nConta Corrente 0011-2347-0000809875312\nCondi\u00e7\u00e3o de Movimenta\u00e7\u00e3o da Conta X Individual\nAltera\u00e7\u00e3o cadastral\nAngariador (matr\u00edcula) L\n00098961\nDados B\u00e1sicos do Titular\nCPF\n516.759.760-90\nNome Completo ANA MADALENA SILVEIRA ALVES\nDocumento de Identifica\u00e7\u00e3o CNH CTPS Entidade de Classe Mercosul Passaporte\nProtocolo Refugiado\nRIC RNE\nCIE Guia de Acolhimento ao Menor Registro Nacional Migrat\u00f3rio\nN\u00b0 Documento / N\u00b0 da S\u00e9rie (CTPS)\n73258674 \u00d3rg\u00e3o Emissor SSP\nUF BA\nData de Emiss\u00e3o 21/07/2018 Data de Vcto (passaporte/CNH)\n | Data de Nascimento 12/04/1992 Sexo X F M\nNacionalidade x Brasileira\nNome da M\u00e3e MADALENA COSTA SILVEIRA\nNome do Pai JUNIOR AUGUSTO ALVES\nCidadania\nBRASILEIRA\nDomic\u00edlio fiscal\nBRASIL\nEndere\u00e7os\nEndere\u00e7o Residencial\nRua/Av/P\u00e7a/Estrada AV. CRESCENCIO LISBOA\nN\u00famero\n986 Complemento Apto 3\nBairro BARAUNA\nMunic\u00edpio BARREIRAS\nUF BA\nPa\u00eds BRASIL\n47800-013" 24 | }, 25 | "108": { 26 | "etiqueta": "ID00357", 27 | "agencia": "2964", 28 | "conta_corrente": "0071-2964-0000798456556", 29 | "cpf": "096.653.550-23", 30 | "nome_completo": "LUCIANA TRINDADE CARDOSO", 31 | "n_doc_serie": "249878615", 32 | "orgao_emissor": "SSP", 33 | "data_emissao": "12/12/2021", 34 | "data_nascimento": "23/06/1997", 35 | "nome_mae": "AMANDA COSTA TRINDADE", 36 | "nome_pai": "MARCELO MOREIRA CARDOSO", 37 | "endereco": { 38 | "logradouro": "RUA ANDERSON TEIXEIRA", 39 | "numero": "988", 40 | "bairro": "CAONZE", 41 | "cidade": "NOVA IGUACU", 42 | "estado": "RJ", 43 | "cep": "13970-190" 44 | }, 45 | "text": "Proposta de Abertura de Conta, Contrata\u00e7\u00e3o de Cr\u00e9dito e\nAdes\u00e3o a Produtos e Servi\u00e7os Banc\u00e1rios - Pessoa F\u00edsica\nID00357\nAg\u00eancia N\u00ba\n2964\nConta Corrente 0071-2964-0000798456556\nCondi\u00e7\u00e3o de Movimenta\u00e7\u00e3o da Conta X Individual\nAltera\u00e7\u00e3o cadastral\nAngariador (matr\u00edcula) L\n00087238978\nDados B\u00e1sicos do Titular\nCPF\n096.653.550-23\nNome Completo LUCIANA TRINDADE CARDOSO\nDocumento de Identifica\u00e7\u00e3o CNH CTPS Entidade de Classe Mercosul Passaporte\nProtocolo Refugiado\nRIC RNE\nCIE Guia de Acolhimento ao Menor Registro Nacional Migrat\u00f3rio\nN\u00b0 Documento / N\u00b0 da S\u00e9rie (CTPS)\n249878615 \u00d3rg\u00e3o Emissor SSP\nUF RJ\nData de Emiss\u00e3o 12/12/2021 Data de Vcto (passaporte/CNH)\n | Data de Nascimento 23/06/1997 Sexo X F M\nNacionalidade x Brasileira\nNome da M\u00e3e AMANDA COSTA TRINDADE\nNome do Pai MARCELO MOREIRA CARDOSO\nCidadania\nBRASILEIRA\nDomic\u00edlio fiscal\nBRASIL\nEndere\u00e7os\nEndere\u00e7o Residencial\nRua/Av/P\u00e7a/Estrada RUA ANDERSON TEIXEIRA\nN\u00famero\n634 Complemento \nBairro CAONZE\nMunic\u00edpio NOVA IGUACU\nUF RJ\nPa\u00eds BRASIL\n13970-190" 46 | }, 47 | "965": { 48 | "etiqueta": "ID00885", 49 | "agencia": "8875", 50 | "conta_corrente": "0044-8875-000080874526544", 51 | "cpf": "010.442.950-07", 52 | "nome_completo": "CARLOS PRATES SOUZA", 53 | "n_doc_serie": "78945646", 54 | "orgao_emissor": "SSP", 55 | "data_emissao": "05/01/2002", 56 | 
"data_nascimento": "13/06/1985", 57 | "nome_mae": "ANABELLE VIEIRA PRATES", 58 | "nome_pai": "PEDRO ROSA SOUZA", 59 | "endereco": { 60 | "logradouro": "RUA JORGE LIMA", 61 | "numero": "634", 62 | "complemento": "Apto 1", 63 | "bairro": "APARECIDA", 64 | "cidade": "SANTOS", 65 | "estado": "SP", 66 | "cep": "01311-000" 67 | }, 68 | "text": "Proposta de Abertura de Conta, Contrata\u00e7\u00e3o de Cr\u00e9dito e\nAdes\u00e3o a Produtos e Servi\u00e7os Banc\u00e1rios - Pessoa F\u00edsica\nID00885\nAg\u00eancia N\u00ba\n8875\nConta Corrente 0044-8875-000080874526544\nCondi\u00e7\u00e3o de Movimenta\u00e7\u00e3o da Conta X Individual\nAltera\u00e7\u00e3o cadastral\nAngariador (matr\u00edcula) L\n0009861245\nDados B\u00e1sicos do Titular\nCPF\n010.442.950-07\nNome Completo CARLOS PRATES SOUZA\nDocumento de Identifica\u00e7\u00e3o CNH CTPS Entidade de Classe Mercosul Passaporte\nProtocolo Refugiado\nRIC RNE\nCIE Guia de Acolhimento ao Menor Registro Nacional Migrat\u00f3rio\nN\u00b0 Documento / N\u00b0 da S\u00e9rie (CTPS)\n78945646 \u00d3rg\u00e3o Emissor SSP\nUF SP\nData de Emiss\u00e3o 05/01/2002 Data de Vcto (passaporte/CNH)\n | Data de Nascimento 13/06/1985 Sexo F X M\nNacionalidade x Brasileira\nNome da M\u00e3e ANABELLE VIEIRA PRATES\nNome do Pai PEDRO ROSA SOUZA\nCidadania\nBRASILEIRA\nDomic\u00edlio fiscal\nBRASIL\nEndere\u00e7os\nEndere\u00e7o Residencial\nRua/Av/P\u00e7a/Estrada RUA JORGE LIMA\nN\u00famero\n634 Complemento Apto 1\nBairro APARECIDA\nMunic\u00edpio SANTOS\nUF SP\nPa\u00eds BRASIL\n01311-000" 69 | } 70 | } -------------------------------------------------------------------------------- /information_extraction_t5/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuralmind-ai/information-extraction-t5/988c589b433d96004139e1f63bfefd6778c0851b/information_extraction_t5/__init__.py -------------------------------------------------------------------------------- /information_extraction_t5/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuralmind-ai/information-extraction-t5/988c589b433d96004139e1f63bfefd6778c0851b/information_extraction_t5/data/__init__.py -------------------------------------------------------------------------------- /information_extraction_t5/data/basic_to_squad.py: -------------------------------------------------------------------------------- 1 | """Convert a simple JSON dataset into SQuAD format.""" 2 | from typing import Dict, List, Optional 3 | from transformers import T5Tokenizer 4 | import numpy.random as nr 5 | 6 | from information_extraction_t5.features.context import get_context 7 | from information_extraction_t5.features.questions.type_map import TYPENAME_TO_TYPE 8 | from information_extraction_t5.features.preprocess import get_questions_for_chunk 9 | 10 | WARNING_MISSING_TYPENAMES = [] 11 | 12 | 13 | def get_question_answers(document: Dict[str, str], 14 | questions: Optional[List[str]] = None, 15 | qa_id: str = 'publicacoes.instancia', 16 | choose_question: str = 'first'): 17 | """Gets question-answers in SQUAD format for the specified type name. 18 | 19 | The answers encompass only the canonical response (value). 20 | The size of the list is: 21 | - zero: if there's no question-answer with the specified type name or if the 22 | corresponding value of a type name key in the document is not a string. 23 | - one: choose_question is 'first' or 'random'. 
24 | - N: an element for each question passed to the questions parameter. 25 | 26 | Returns: 27 | List of dictionaries where each element is a question and its answers. 28 | """ 29 | if questions is None: 30 | questions = [] 31 | 32 | subanswer = document 33 | qa_id_split = qa_id.split('.') 34 | 35 | for type_name in qa_id_split[1:]: 36 | subanswer = subanswer[type_name] 37 | 38 | # select questions 39 | if choose_question == 'first': 40 | selected_questions = [questions[0]] 41 | elif choose_question == 'random': 42 | idx = nr.randint(len(questions)) 43 | selected_questions = [questions[idx]] 44 | else: 45 | selected_questions = questions 46 | 47 | qas = [] 48 | answer = f"[{TYPENAME_TO_TYPE[type_name]}]: {subanswer}" 49 | for question in selected_questions: 50 | answers = [ 51 | { 52 | "answer_start": -1, # None, 53 | "text": answer, 54 | } 55 | ] 56 | qa = { 57 | "answers": answers, 58 | "question": question, 59 | "id": qa_id, 60 | } 61 | qas.append(qa) 62 | return qas 63 | 64 | 65 | def get_compound_question_answers( 66 | document: Dict[str, str], 67 | questions: Optional[List[str]] = None, 68 | qa_id: str = 'publicacoes.instancia_orgao_tipo', 69 | choose_question: str = 'first') -> List[Dict]: 70 | """Gets question-answers in SQUAD format for the specified type names. 71 | 72 | The answers encompass only the canonical response (value). 73 | The size of the list is: 74 | - zero: if there's no question-answer with the specified type name or if the 75 | corresponding value of a type name key in the document is not a string. 76 | - one: choose_question is 'first' or 'random'. 77 | - N: an element for each question passed to the questions parameter. 78 | 79 | Returns: 80 | List of dictionaries where each element is a question and its answers. 81 | """ 82 | # select questions 83 | if questions is None: 84 | questions = [] 85 | if choose_question == 'first': 86 | selected_questions = [questions[0]] 87 | elif choose_question == 'random': 88 | idx = nr.randint(len(questions)) 89 | selected_questions = [questions[idx]] 90 | else: 91 | selected_questions = questions 92 | 93 | type_name = qa_id.split('.')[1] 94 | 95 | all_type_names = get_questions_for_chunk(qa_id=qa_id, return_dict=True).copy() 96 | for tn in all_type_names.keys(): 97 | if tn == 'compound': 98 | continue 99 | all_type_names[tn] = f'[{TYPENAME_TO_TYPE[tn]}]: N/A' 100 | if 'compound' in all_type_names.keys(): 101 | all_type_names.pop('compound') 102 | 103 | # preparing the compound answer 104 | for tn in document[type_name].keys(): 105 | type = TYPENAME_TO_TYPE[tn] 106 | subanswer = document[type_name][tn] 107 | 108 | if tn in all_type_names.keys(): 109 | all_type_names[tn] = f"[{type}]: {subanswer}" 110 | elif not tn in WARNING_MISSING_TYPENAMES: 111 | print(f'WARNING: type-name {tn} is not in question signature for {type_name}: please add it in the OrderedDict if you want to keep.') 112 | WARNING_MISSING_TYPENAMES.append(tn) 113 | 114 | answer = ' '.join(all_type_names.values()) 115 | 116 | qas = [] 117 | for question in selected_questions: 118 | answers = [ 119 | { 120 | "answer_start": -1, # None, 121 | "text": answer 122 | } 123 | ] 124 | qa = { 125 | "answers": answers, 126 | "question": question, 127 | "id": qa_id, 128 | } 129 | qas.append(qa) 130 | return qas 131 | 132 | 133 | def get_notapplicable_question_answers( 134 | qa_id: str = 'matriculas.endereco', 135 | choose_question: str = 'first', 136 | list_of_use_compound_question: Optional[List[str]] = None): 137 | """ 138 | Return a list of question-answers in SQUAD format 
for non-annotated 139 | type-names. 140 | 141 | The size of the list is: 142 | - one (choose_question as 'first' or 'random') 143 | - the number of questions defined as 'compound' for the current chunk 144 | returned by get_questions_for_chunk(chunk) (choose_question as 'all') 145 | """ 146 | if list_of_use_compound_question is None: 147 | list_of_use_compound_question = [] 148 | 149 | is_compound = qa_id in list_of_use_compound_question 150 | 151 | questions = get_questions_for_chunk(qa_id=qa_id, is_compound=is_compound) 152 | if questions is None: 153 | questions = [] 154 | if choose_question == 'first': 155 | selected_questions = [questions[0]] 156 | elif choose_question == 'random': 157 | idx = nr.randint(len(questions)) 158 | selected_questions = [questions[idx]] 159 | else: 160 | selected_questions = questions 161 | 162 | if is_compound: 163 | # type_name = qa_id.split('.')[1] 164 | all_type_names = get_questions_for_chunk(qa_id=qa_id, return_dict=True).copy() 165 | for tn in all_type_names.keys(): 166 | if tn == 'compound': 167 | continue 168 | all_type_names[tn] = f'[{TYPENAME_TO_TYPE[tn]}]: N/A' 169 | if 'compound' in all_type_names.keys(): 170 | all_type_names.pop('compound') 171 | 172 | answer = ' '.join(all_type_names.values()) 173 | else: 174 | type_name = qa_id.split('.', 1)[1] 175 | type = TYPENAME_TO_TYPE[type_name] 176 | 177 | answer = f"[{type}]: N/A" 178 | 179 | qas = [] 180 | for question in selected_questions: 181 | answers = [ 182 | { 183 | "answer_start": -1, # None, 184 | "text": answer 185 | } 186 | ] 187 | qa = { 188 | "answers": answers, 189 | "question": question, 190 | "id": qa_id, 191 | } 192 | qas.append(qa) 193 | return qas 194 | 195 | 196 | def get_document_data(document: Dict, 197 | document_type: str = 'publicacoes', 198 | all_qa_ids: List[str] = ['publicacoes.orgao'], 199 | max_size: int = 4000, 200 | list_of_use_compound_question: Optional[List[str]] = None, 201 | list_of_type_names: Optional[List[str]] = None, 202 | context_content: str = 'abertura', 203 | window_overlap: float = 0.5, 204 | max_windows: int = 3, 205 | tokenizer: T5Tokenizer = None, 206 | max_tokens: int = 512, 207 | choose_question: str = 'first', 208 | use_sentence_id: bool = False): 209 | # using the document uuid as title 210 | # paragraphs will contain only one dict with context of document and all the 211 | # question-answers 212 | if list_of_type_names is None: 213 | list_of_type_names = [] 214 | if list_of_use_compound_question is None: 215 | list_of_use_compound_question = [] 216 | 217 | # assuming that this is the largest question 218 | largest_question = 'Quais são as principais informações do documento de publicação?' 
219 | 220 | # create dummy document 221 | dummy_document = {} 222 | dummy_document['text'] = document['text'] if 'text' in document.keys() else document['texto'] 223 | dummy_document['uuid'] = document['uuid'] 224 | 225 | # exclude crazy chars 226 | dummy_document['text'] = dummy_document['text'].replace('༡༨/༢','') 227 | 228 | # extract the context(s) and respective offset(s) 229 | contexts, offsets = get_context( 230 | dummy_document, 231 | context_content=context_content, 232 | max_size=max_size, 233 | start_position=0, 234 | proportion_before=0.2, 235 | return_position_offset=True, 236 | use_sentence_id=use_sentence_id, 237 | tokenizer=tokenizer, 238 | max_tokens=max_tokens, 239 | question=largest_question, 240 | window_overlap=window_overlap, 241 | max_windows=max_windows) 242 | if not isinstance(contexts, list): 243 | contexts = [contexts] 244 | offsets = [offsets] 245 | 246 | # document structure in SQuAD format 247 | document_data = { 248 | "title": document['uuid'], 249 | "paragraphs": [] 250 | } 251 | counter_qas = 0 252 | 253 | for context, _ in zip(contexts, offsets): 254 | # create one paragraph for each context. 255 | # it will be unique, except for windows-based context_contents 256 | paragraph = { 257 | "context": context, 258 | "qas": [], 259 | } 260 | paragraph_counter_qas = 0 261 | 262 | # control which of the requested qa_ids were satified. It force not-applicable 263 | # qas for qa_ids whose information does not exist in the dataset. 264 | all_qa_ids_satisfied = [] 265 | 266 | # We will use only the fields listed in list_of_type_names 267 | for qa_id in list_of_type_names: 268 | doc_type = qa_id.split('.')[0] 269 | if doc_type != document_type: 270 | continue 271 | 272 | if qa_id in list_of_use_compound_question: 273 | questions = get_questions_for_chunk(qa_id=qa_id, is_compound=True) 274 | qas = get_compound_question_answers( 275 | document, 276 | questions=questions, 277 | qa_id=qa_id, 278 | choose_question=choose_question) 279 | else: 280 | questions = get_questions_for_chunk(qa_id=qa_id) 281 | qas = get_question_answers(document, 282 | questions=questions, 283 | qa_id=qa_id, 284 | choose_question=choose_question) 285 | 286 | paragraph_counter_qas += len(qas) 287 | 288 | # Include the question-answer of the current type_name (e.g., tipo) 289 | # in the current paragraph of the current document 290 | for qa in qas: 291 | paragraph["qas"].append(qa) 292 | all_qa_ids_satisfied.append(qa_id) 293 | 294 | # extract not-applicable qas for non-existent information. 
295 | add_not_applicable = sorted( 296 | list(set(all_qa_ids) - set(all_qa_ids_satisfied)) 297 | ) 298 | 299 | for qa_id in add_not_applicable: 300 | 301 | qas = get_notapplicable_question_answers( 302 | qa_id=qa_id, 303 | choose_question='first', # avoid using too much negatives 304 | list_of_use_compound_question=list_of_use_compound_question) 305 | 306 | paragraph_counter_qas += len(qas) 307 | 308 | # Include the not-applicable question-answer in the current 309 | # paragraph of the current document 310 | for qa in qas: 311 | paragraph["qas"].append(qa) 312 | all_qa_ids_satisfied.append(qa_id) 313 | 314 | # Add the current paragraph in the structure 315 | if paragraph_counter_qas > 0: 316 | document_data["paragraphs"].append(paragraph) 317 | counter_qas += paragraph_counter_qas 318 | 319 | return document_data, counter_qas 320 | -------------------------------------------------------------------------------- /information_extraction_t5/data/convert_dataset_to_squad.py: -------------------------------------------------------------------------------- 1 | """Converts the dataset into SQuAD format.""" 2 | import json 3 | import os 4 | from typing import List, Tuple 5 | 6 | import configargparse 7 | import numpy.random as nr 8 | from sklearn.model_selection import train_test_split 9 | from transformers import AutoTokenizer 10 | 11 | import information_extraction_t5.data.basic_to_squad as basic_to_squad 12 | from information_extraction_t5.data.file_handling import load_raw_data 13 | from information_extraction_t5.features.preprocess import get_all_qa_ids 14 | 15 | DATA_VERSION = "0.1" 16 | 17 | 18 | def convert_raw_data(documents: List[tuple], 19 | project: str, 20 | all_qa_ids: List[str], 21 | tokenizer: AutoTokenizer, 22 | choose_question: str, 23 | use_sentence_id: bool, 24 | args) -> Tuple[List[dict], int]: 25 | """Loops over the documents and converts to SQuaD format. 
26 | 27 | Args: 28 | documents: list with selected document tuples 29 | project: the project name 30 | tokenizer: T5 Tokenizer instance 31 | choose_question: flag to indicate which questions to use 32 | is_true: True for training data (useful for function build_answer) 33 | args: additional configs 34 | """ 35 | qa_data = [] 36 | qa_counter = 0 37 | 38 | for doc_id, document in documents: 39 | document['uuid'] = doc_id 40 | document_data, count = convert_document( 41 | document, 42 | project=project, 43 | all_qa_ids=all_qa_ids, 44 | max_size=args.max_size, 45 | type_names=args.type_names, 46 | use_compound_question=args.use_compound_question, 47 | return_raw_text=args.return_raw_text, 48 | context_content=args.context_content, 49 | window_overlap=args.window_overlap, 50 | max_windows=args.max_windows, 51 | tokenizer=tokenizer, 52 | max_tokens=args.max_seq_length, 53 | choose_question=choose_question, 54 | use_sentence_id=use_sentence_id) 55 | qa_counter += count 56 | 57 | # To finish a document, include its document_data into the 58 | # qa_json 59 | if count > 0: 60 | qa_data.append(document_data) 61 | 62 | return qa_data, qa_counter 63 | 64 | 65 | def convert_document(document, 66 | project='publicacoes', 67 | all_qa_ids=['publicacoes.tipoPublicacao'], 68 | max_size=4000, 69 | type_names=None, 70 | use_compound_question=None, 71 | return_raw_text=None, 72 | context_content='abertura', 73 | window_overlap=0.5, 74 | max_windows=3, 75 | tokenizer=None, 76 | max_tokens=512, 77 | choose_question='first', 78 | use_sentence_id: bool = False): 79 | """Converts a document and returns it along with the question count.""" 80 | if return_raw_text is None: 81 | return_raw_text = [] 82 | if use_compound_question is None: 83 | use_compound_question = [] 84 | if type_names is None: 85 | type_names = [] 86 | 87 | document_data, count = basic_to_squad.get_document_data( 88 | document, 89 | document_type=project, 90 | all_qa_ids=all_qa_ids, 91 | max_size=max_size, 92 | list_of_use_compound_question=use_compound_question, 93 | list_of_type_names=type_names, 94 | context_content=context_content, 95 | window_overlap=window_overlap, 96 | max_windows=max_windows, 97 | tokenizer=tokenizer, 98 | max_tokens=max_tokens, 99 | choose_question=choose_question, 100 | use_sentence_id=use_sentence_id) 101 | 102 | return document_data, count 103 | 104 | 105 | def main(): 106 | """Preparing data for QA in SQuAD format.""" 107 | parser = configargparse.ArgParser( 108 | 'Preparing data for QA', 109 | config_file_parser_class=configargparse.YAMLConfigFileParser) 110 | parser.add_argument('-c', '--my-config', required=True, 111 | is_config_file=True, 112 | help='config file path') 113 | 114 | parser.add_argument('--project', action='append', required=True, 115 | help='List pointing out the project each train/test ' 116 | 'dataset came from') 117 | parser.add_argument('--raw_data_file', action='append', required=True, 118 | help='List of raw train datasets to use in the ' 119 | 'experiment') 120 | parser.add_argument('--raw_valid_data_file', action='append', 121 | help='List of raw validation datasets to use in the ' 122 | 'experiment') 123 | parser.add_argument('--raw_test_data_file', action='append', 124 | help='List of raw test datasets to use in the ' 125 | 'experiment') 126 | parser.add_argument('--train_file', type=str, 127 | default='data/interim/train-v0.1.json') 128 | parser.add_argument('--valid_file', type=str, 129 | default='data/interim/dev-v0.1.json') 130 | parser.add_argument('--test_file', type=str, 131 | 
default='data/interim/test-v0.1.json') 132 | parser.add_argument('--type_names', nargs='+', default=['matriculas.imovel'], 133 | help='List of first-level chunks (qa_id) to use in the ' 134 | 'experiment') 135 | parser.add_argument('--use_compound_question', nargs='+', 136 | default=['matriculas.area_terreno_comp'], 137 | help='List of fields (qa_id) that must use ' 138 | 'compound question gathering all nested information ' 139 | 'in answer (instead of per-subchunk questions)') 140 | parser.add_argument('--return_raw_text', nargs='+', default=['estado'], 141 | help='List of fields (type_name) that ' 142 | 'require both canonical answer and how it appears in ' 143 | 'the text. Valid to individual and compound questions. NOT IMPLEMENTED.') 144 | 145 | parser.add_argument("--valid_percent", default=0.2, type=float, 146 | help='Percentage of dataset to used as validation') 147 | parser.add_argument("--max_size", default=1024, type=int, 148 | help="The maximum input length after char-based " 149 | "tokenization. And also the maximum context size " 150 | "for char-based contexts.") 151 | parser.add_argument("--context_content", type=str, default='abertura', 152 | help="Definition of context content for generic " 153 | "type-names (max_size, position, token, " 154 | "position_token, windows, or windows_token)") 155 | parser.add_argument("--train_choose_question", type=str, default='all', 156 | help='Choose which question of the list to use for ' 157 | 'training set (first, random, all). ' 158 | 'Validation/test set use first.') 159 | parser.add_argument('--train_force_qa', action="store_true", 160 | help='Set this flag if you want to force not-applicable ' 161 | 'qas for qa_ids that does not exist in the document. ' 162 | 'This is required for test set.') 163 | parser.add_argument("--seed", type=int, default=42, 164 | help="random seed for choose qestion") 165 | 166 | # used to get contexts 167 | parser.add_argument("--model_name_or_path", default='t5-small', type=str, 168 | help="Path to pretrained model or model identifier " 169 | "from huggingface.co/models") 170 | parser.add_argument("--config_name", default="", type=str, 171 | help="Pretrained config name or path if not the same " 172 | "as model_name") 173 | parser.add_argument("--tokenizer_name", default="", type=str, 174 | help="Pretrained tokenizer name or path if not the " 175 | "same as model_name") 176 | parser.add_argument("--do_lower_case", action="store_true", 177 | help="Set this flag if you are using an uncased " 178 | "model.") 179 | parser.add_argument("--max_seq_length", default=384, type=int, 180 | help="The maximum total input sequence length after " 181 | "WordPiece tokenization. 
Sequences longer than this " 182 | "will be truncated, and sequences shorter than this " 183 | "will be padded.") 184 | parser.add_argument("--window_overlap", default=0.5, type=float, 185 | help="Define the overlapping of sliding windows.") 186 | parser.add_argument("--max_windows", default=3, type=int, 187 | help="the maximum number of windows to generate, use -1 " 188 | "to get all the possible windows.") 189 | parser.add_argument("--use_sentence_id", action="store_true", 190 | help="Set this flag if you are using the approach that " 191 | "breaks the contexts into sentences.") 192 | 193 | args, _ = parser.parse_known_args() 194 | 195 | assert len(args.project) == len(args.raw_data_file) == \ 196 | len(args.raw_valid_data_file) == len(args.raw_test_data_file), \ 197 | ('raw_data_file, raw_valid_data_file and raw_test_data_file lists ' 198 | 'must have same size of projects list') 199 | assert args.train_choose_question in ['first', 'random', 'all'], \ 200 | ('train_choose_question must be "first", "random" or "all"') 201 | assert args.context_content in ['max_size', 'position', 'token', 202 | 'position_token', 'windows', 'windows_token'], \ 203 | ('context_content must be "max_size", "position", "token", "position_token", ' 204 | '"windows" or "windows_token"') 205 | 206 | # set tokenizer for context_context based on tokens 207 | tokenizer = AutoTokenizer.from_pretrained( 208 | args.tokenizer_name if args.tokenizer_name 209 | else args.model_name_or_path, 210 | use_fast=False, 211 | do_lower_case=args.do_lower_case 212 | ) 213 | 214 | # setting seed for choose question 215 | nr.seed(args.seed) 216 | 217 | print('>> Using the following fields with respective compound-qa indicator:') 218 | for type_name in args.type_names: 219 | print(f'- {type_name:<43} {type_name in args.use_compound_question}\t') 220 | print(f'>> List of fields that require how answer appears in ' 221 | f'the text: {args.return_raw_text}') 222 | 223 | qa_train_json = {'data': [], 'version': DATA_VERSION} 224 | qa_valid_json = {'data': [], 'version': DATA_VERSION} 225 | qa_test_json = {'data': [], 'version': DATA_VERSION} 226 | 227 | train_qa_counter, valid_qa_counter, test_qa_counter = 0, 0, 0 228 | 229 | for (raw_data_file, raw_valid_data_file, raw_test_data_file, project) in \ 230 | zip(args.raw_data_file, args.raw_valid_data_file, 231 | args.raw_test_data_file, args.project): 232 | 233 | print('\n') 234 | 235 | # Extract the list of all possible qa_ids for the current document class. 
236 | # This forces N/A qas for valid/test, and for train if --train_force_qa 237 | all_qa_ids = get_all_qa_ids( 238 | document_class=project, 239 | list_of_type_names=args.type_names, 240 | list_of_use_compound_question=args.use_compound_question) 241 | 242 | # prepare VALIDATION set (if provided) 243 | has_valid_set = raw_valid_data_file is not None \ 244 | and raw_valid_data_file != 'None' 245 | 246 | if has_valid_set: 247 | 248 | print(f'>> Loading the VALID dataset {raw_valid_data_file} ' 249 | f'({project})...') 250 | _, all_documents, raw_data_fname = load_raw_data( 251 | raw_valid_data_file 252 | ) 253 | 254 | print(f'>> Converting the VALID dataset {raw_valid_data_file} ' 255 | 'into SQuAD format...') 256 | qa_data, qa_counter = convert_raw_data( 257 | documents=all_documents, 258 | project=project, 259 | all_qa_ids=all_qa_ids, 260 | tokenizer=tokenizer, 261 | choose_question='first', 262 | use_sentence_id=args.use_sentence_id, 263 | args=args 264 | ) 265 | 266 | if qa_counter > 0: 267 | print(f'{raw_valid_data_file} (valid) dataset has ' 268 | f'{qa_counter} question-answers') 269 | valid_qa_counter += qa_counter 270 | qa_valid_json['data'].extend(qa_data) 271 | 272 | if raw_valid_data_file.endswith('tar') \ 273 | or raw_valid_data_file.endswith('tar.gz'): 274 | os.unlink(raw_data_fname) 275 | 276 | has_test_set = raw_test_data_file is not None \ 277 | and raw_test_data_file != 'None' 278 | 279 | # prepare TEST set (if provided) 280 | if has_test_set: 281 | 282 | print(f'>> Loading the TEST dataset {raw_test_data_file} ' 283 | f'({project})...') 284 | _, all_documents, raw_data_fname = load_raw_data( 285 | raw_test_data_file 286 | ) 287 | 288 | print(f'>> Converting the TEST dataset {raw_test_data_file} into ' 289 | 'SQuAD format...') 290 | qa_data, qa_counter = convert_raw_data( 291 | documents=all_documents, 292 | project=project, 293 | all_qa_ids=all_qa_ids, 294 | tokenizer=tokenizer, 295 | choose_question='first', 296 | use_sentence_id=args.use_sentence_id, 297 | args=args 298 | ) 299 | 300 | if qa_counter > 0: 301 | print(f'{raw_test_data_file} (test) dataset has ' 302 | f'{qa_counter} question-answers') 303 | test_qa_counter += qa_counter 304 | qa_test_json['data'].extend(qa_data) 305 | 306 | if raw_test_data_file.endswith('tar') \ 307 | or raw_test_data_file.endswith('tar.gz'): 308 | os.unlink(raw_data_fname) 309 | 310 | # prepare TRAIN set 311 | print(f'>> Loading the dataset {raw_data_file} ({project})...') 312 | _, all_documents, raw_data_fname = load_raw_data( 313 | raw_data_file 314 | ) 315 | 316 | if not has_valid_set and 0 < args.valid_percent < 1.0: 317 | documents_train, documents_valid = train_test_split( 318 | all_documents, 319 | test_size=args.valid_percent, 320 | random_state=42) 321 | 322 | qa_data, qa_counter = convert_raw_data( 323 | documents=documents_valid, 324 | project=project, 325 | all_qa_ids=all_qa_ids, 326 | tokenizer=tokenizer, 327 | choose_question='first', 328 | use_sentence_id=args.use_sentence_id, 329 | args=args 330 | ) 331 | 332 | # if a TEST dataset is provided, use the split for VALIDATION only, 333 | # otherwise, use it for both VALIDATION and TEST 334 | if has_test_set: 335 | if qa_counter > 0: 336 | print(f'{raw_data_file} (valid) dataset has {qa_counter} ' 337 | f'question-answers') 338 | valid_qa_counter += qa_counter 339 | qa_valid_json['data'].extend(qa_data) 340 | else: 341 | if qa_counter > 0: 342 | print(f'{raw_data_file} (valid/test) dataset has ' 343 | f'{qa_counter} question-answers') 344 | valid_qa_counter += qa_counter 345 
| qa_valid_json['data'].extend(qa_data) 346 | test_qa_counter += qa_counter 347 | qa_test_json['data'].extend(qa_data) 348 | 349 | else: 350 | documents_train = all_documents 351 | 352 | print( 353 | f'>> Converting the dataset {raw_data_file} into SQuAD format...' 354 | ) 355 | qa_data, qa_counter = convert_raw_data( 356 | documents=documents_train, 357 | project=project, 358 | all_qa_ids=all_qa_ids if args.train_force_qa else [], 359 | tokenizer=tokenizer, 360 | choose_question=args.train_choose_question, 361 | use_sentence_id=args.use_sentence_id, 362 | args=args 363 | ) 364 | print(f'{raw_data_file} (train) dataset has {qa_counter} ' 365 | f'question-answers') 366 | train_qa_counter += qa_counter 367 | qa_train_json['data'].extend(qa_data) 368 | 369 | if raw_data_file.endswith('tar') or raw_data_file.endswith('tar.gz'): 370 | os.unlink(raw_data_fname) 371 | 372 | print(f'\nTRAIN dataset has {train_qa_counter} question-answers') 373 | print(f'VALID dataset has {valid_qa_counter} question-answers') 374 | print(f'TEST dataset has {test_qa_counter} question-answers') 375 | 376 | # Save the train, valid and test processed data 377 | os.makedirs(os.path.dirname(args.train_file), exist_ok=True) 378 | with open(args.train_file, 'w', encoding='utf-8') as outfile: 379 | json.dump(qa_train_json, outfile) 380 | with open(args.valid_file, 'w', encoding='utf-8') as outfile: 381 | json.dump(qa_valid_json, outfile) 382 | with open(args.test_file, 'w', encoding='utf-8') as outfile: 383 | json.dump(qa_test_json, outfile) 384 | 385 | 386 | if __name__ == "__main__": 387 | main() 388 | -------------------------------------------------------------------------------- /information_extraction_t5/data/convert_squad_to_t5.py: -------------------------------------------------------------------------------- 1 | """Converts the dataset from SQuAD format to T5 format.""" 2 | import torch 3 | from rich.progress import track 4 | from typing import List, Union 5 | 6 | from transformers.data.processors.squad import SquadExample 7 | 8 | from information_extraction_t5.features.preprocess import generate_t5_input_sentence, generate_t5_label_sentence 9 | from information_extraction_t5.utils.balance_data import balance_data 10 | 11 | class QADataset(torch.utils.data.Dataset): 12 | """ 13 | Dataset for question-answering. 14 | 15 | Args: 16 | examples: the inputs to the model in T5 format. 17 | labels: the targets in T5 format. 18 | document_ids: the IDs to reference specific documents. 19 | example_ids: the IDs to reference specific pairs dataset-field. 20 | negative_ratios: the resultant negative-positive ratio of the samples. 21 | return_ids: indicates if the dataset will return the document-ids and example_ids. 22 | 23 | Returns: 24 | Dataset 25 | 26 | Ex.: 27 | examples = ['question: When was the Third Assessment Report published? 
context: Another example of scientific research ...'] 28 | labels = ['2011'] 29 | document_ids= ['ec57d59d-972c-40fc-82ff-c7c818d7dd39'] 30 | example_ids = ['reports.third_assessment.publication_data'] 31 | """ 32 | 33 | def __init__(self, examples, labels, document_ids, example_ids, negative_ratio=1.0, return_ids=False): 34 | if negative_ratio >= 1.0: 35 | self.examples, self.labels, self.document_ids, self.example_ids = balance_data( 36 | examples, labels, document_ids, example_ids, negative_ratio=negative_ratio 37 | ) 38 | else: 39 | self.examples = examples 40 | self.labels = labels 41 | self.document_ids = document_ids 42 | self.example_ids = example_ids 43 | self.return_ids = return_ids 44 | 45 | def __len__(self): 46 | return len(self.examples) 47 | 48 | def __getitem__(self, idx): 49 | if self.return_ids: 50 | return self.examples[idx], self.labels[idx], self.document_ids[idx], self.example_ids[idx] 51 | else: 52 | return self.examples[idx], self.labels[idx] 53 | 54 | 55 | def squad_convert_examples_to_t5_format( 56 | examples: List[SquadExample], 57 | use_sentence_id: bool = True, 58 | evaluate: bool = False, 59 | negative_ratio: int = 0, 60 | return_dataset: Union[bool, str] = False, 61 | tqdm_enabled: bool = True, 62 | ): 63 | """Converts a list of examples into a list to the T5 format for 64 | question-answer with prefix question/context. 65 | 66 | Args: 67 | examples: examples to convert to T5 format. 68 | evaluate: True for validation or test dataset. 69 | negative_ratio: balances dataset using negative-positive ratio. 70 | return_dataset: if True, returns a torch.data.TensorDataset. 71 | tqdm_enabled: if True, uses tqdm. 72 | 73 | Returns: 74 | list of examples into a list to the T5 format for 75 | question-answer with prefix question/context. 76 | 77 | Examples: 78 | >>> processor = SquadV2Processor() 79 | >>> examples = processor.get_dev_examples(data_dir) 80 | >>> examples, labels = squad_convert_examples_to_t5_format( 81 | >>> examples=examples) 82 | """ 83 | 84 | examples_t5_format = [] 85 | labels_t5_format = [] 86 | document_ids = [] # which document the example came from? (e.g, 54f94949-0fb4-45e5-81dd-c4385f681e2b) 87 | example_ids = [] # which document-type and type-name does the example belong to? 
(e.g., matriculas.endereco) 88 | 89 | for example in track(examples, description="convert examples to T5 format", disable=not tqdm_enabled): 90 | 91 | # prepare the input 92 | x = generate_t5_input_sentence(example.context_text, example.question_text, use_sentence_id) 93 | 94 | # extract answer and start position (squad-example is in evaluate mode) 95 | y = example.answers[0]['text'] # getting the first answer in the list 96 | answer_start = example.answers[0]['answer_start'] 97 | 98 | # prepate the target 99 | y = generate_t5_label_sentence(y, answer_start, example.context_text, use_sentence_id) 100 | 101 | examples_t5_format.append(x) 102 | labels_t5_format.append(y) 103 | document_ids.append(example.title) 104 | example_ids.append(example.qas_id) 105 | 106 | if return_dataset: 107 | # Create the dataset 108 | dataset = QADataset(examples_t5_format, labels_t5_format, document_ids, 109 | example_ids, negative_ratio=negative_ratio, return_ids=evaluate) 110 | 111 | return examples_t5_format, labels_t5_format, dataset 112 | else: 113 | return examples_t5_format, labels_t5_format -------------------------------------------------------------------------------- /information_extraction_t5/data/file_handling.py: -------------------------------------------------------------------------------- 1 | """Tools for handling dataset files.""" 2 | import glob 3 | import json 4 | import tarfile 5 | from typing import Tuple 6 | 7 | 8 | def decompress(fname): 9 | """Unpack a tar file and return the name of the JSON dataset file. 10 | 11 | Args: 12 | fname: compressed dataset file name 13 | 14 | Returns: 15 | The name of the unpacked JSON raw dataset file. 16 | """ 17 | if fname.endswith("tar.gz"): 18 | tar = tarfile.open(fname, "r:gz") 19 | tar.extractall('data/raw/') 20 | tar.close() 21 | elif fname.endswith("tar"): 22 | tar = tarfile.open(fname, "r:") 23 | tar.extractall('data/raw/') 24 | tar.close() 25 | 26 | fname = glob.glob('data/raw/*json')[-1] 27 | 28 | return fname 29 | 30 | 31 | def load_raw_data(fname: str) -> Tuple[dict, list, str]: 32 | """Loads raw dataset file. 33 | 34 | Args: 35 | fname: the dataset file name 36 | 37 | Returns: 38 | A tuple with the json-like raw data dict and a corresponding list 39 | of tuples with keys and values. 40 | """ 41 | if fname.endswith('tar') or fname.endswith('tar.gz'): 42 | print(f'>> Decompressing dataset file {fname}...') 43 | raw_data_fname = decompress(fname) 44 | else: 45 | raw_data_fname = fname 46 | 47 | with open(raw_data_fname) as f: 48 | raw_data = json.load(f) 49 | documents = list(raw_data.items()) 50 | 51 | return raw_data, documents, raw_data_fname 52 | -------------------------------------------------------------------------------- /information_extraction_t5/data/qa_data.py: -------------------------------------------------------------------------------- 1 | """Implement DataModule""" 2 | import os 3 | from typing import Optional 4 | import configargparse 5 | 6 | import torch 7 | from torch.utils.data import DataLoader, Dataset 8 | import pytorch_lightning as pl 9 | from transformers.data.processors.squad import SquadV1Processor 10 | 11 | from information_extraction_t5.data.convert_squad_to_t5 import squad_convert_examples_to_t5_format 12 | 13 | class QADataModule(pl.LightningDataModule): 14 | 15 | def __init__(self, hparams): 16 | super().__init__() 17 | self.hparams.update(vars(hparams)) 18 | 19 | def setup(self, stage: Optional[str] = None): 20 | input_dir = self.hparams.data_dir if self.hparams.data_dir else "." 
21 | 22 | # Prepare train and valid datasets 23 | if stage == 'fit' or stage is None: 24 | # Load data examples from cache or dataset file 25 | cached_examples_train_file = os.path.join( 26 | input_dir, 27 | f"cached_train_{list(filter(None, self.hparams.model_name_or_path.split('/'))).pop()}" 28 | ) 29 | cached_examples_valid_file = os.path.join( 30 | input_dir, 31 | f"cached_valid_{list(filter(None, self.hparams.model_name_or_path.split('/'))).pop()}" 32 | ) 33 | 34 | # Init examples and dataset from cache if it exists 35 | if os.path.exists(cached_examples_train_file) and \ 36 | os.path.exists(cached_examples_valid_file) and not self.hparams.overwrite_cache: 37 | print("Loading examples from cached files %s and %s" % (cached_examples_train_file, cached_examples_valid_file)) 38 | 39 | examples_and_dataset = torch.load(cached_examples_train_file) 40 | self.train_dataset = examples_and_dataset["dataset"] 41 | examples_and_dataset = torch.load(cached_examples_valid_file) 42 | self.valid_dataset = examples_and_dataset["dataset"] 43 | else: 44 | print("Creating examples from dataset file at %s" % input_dir) 45 | 46 | processor = SquadV1Processor() 47 | 48 | # examples_train = processor.get_train_examples(self.hparams.data_dir, filename=self.hparams.train_file) 49 | examples_train = processor.get_dev_examples( 50 | self.hparams.data_dir, filename=self.hparams.train_file 51 | ) 52 | examples_valid = processor.get_dev_examples( 53 | self.hparams.data_dir, filename=self.hparams.valid_file 54 | ) 55 | 56 | _, _, self.train_dataset = squad_convert_examples_to_t5_format( 57 | examples=examples_train, 58 | use_sentence_id=self.hparams.use_sentence_id, 59 | evaluate=False, 60 | negative_ratio=self.hparams.negative_ratio, 61 | return_dataset=True, 62 | ) 63 | _, _, self.valid_dataset = squad_convert_examples_to_t5_format( 64 | examples=examples_valid, 65 | use_sentence_id=self.hparams.use_sentence_id, 66 | evaluate=True, 67 | negative_ratio=0, 68 | return_dataset=True, 69 | ) 70 | 71 | print(f"Saving examples into cached file {cached_examples_train_file}") 72 | torch.save({"dataset": self.train_dataset}, cached_examples_train_file) 73 | print(f"Saving examples into cached file {cached_examples_valid_file}") 74 | torch.save({"dataset": self.valid_dataset}, cached_examples_valid_file) 75 | 76 | print(f'>> train-dataset: {len(self.train_dataset)} samples') 77 | print(f'>> valid-dataset: {len(self.valid_dataset)} samples') 78 | 79 | # Prepare test dataset 80 | if stage == 'test' or stage is None: 81 | 82 | assert self.hparams.test_file, 'test_file must be specificed' 83 | 84 | cached_examples_test_file = os.path.join( 85 | input_dir, 86 | f"cached_test_{list(filter(None, self.hparams.model_name_or_path.split('/'))).pop()}" 87 | ) 88 | 89 | # Init examples and dataset from cache if it exists 90 | if os.path.exists(cached_examples_test_file) and not self.hparams.overwrite_cache: 91 | 92 | print("Loading examples from cached file %s" % (cached_examples_test_file)) 93 | 94 | examples_and_dataset = torch.load(cached_examples_test_file) 95 | self.test_dataset = examples_and_dataset["dataset"] 96 | else: 97 | print("Creating examples from dataset file at %s" % input_dir) 98 | 99 | processor = SquadV1Processor() 100 | 101 | examples_test = processor.get_dev_examples(self.hparams.data_dir, filename=self.hparams.test_file) 102 | 103 | _, _, self.test_dataset = squad_convert_examples_to_t5_format( 104 | examples=examples_test, 105 | use_sentence_id=self.hparams.use_sentence_id, 106 | evaluate=True, 107 | 
negative_ratio=0, 108 | return_dataset=True, 109 | ) 110 | 111 | print("Saving examples into cached file %s" % cached_examples_test_file) 112 | torch.save({"dataset": self.test_dataset}, cached_examples_test_file) 113 | 114 | print(f'>> test-dataset: {len(self.test_dataset)} samples') 115 | 116 | def get_dataloader(self, dataset: Dataset, batch_size: int, shuffle: bool, num_workers: int) -> DataLoader: 117 | return DataLoader( 118 | dataset, 119 | batch_size=batch_size, 120 | shuffle=shuffle, 121 | num_workers=num_workers 122 | ) 123 | 124 | def train_dataloader(self,) -> DataLoader: 125 | return self.get_dataloader( 126 | self.train_dataset, 127 | batch_size=self.hparams.train_batch_size, 128 | shuffle=self.hparams.shuffle_train, 129 | num_workers=self.hparams.num_workers 130 | ) 131 | 132 | def val_dataloader(self,) -> DataLoader: 133 | return self.get_dataloader( 134 | self.valid_dataset, 135 | batch_size=self.hparams.val_batch_size, 136 | shuffle=False, 137 | num_workers=self.hparams.num_workers 138 | ) 139 | 140 | def test_dataloader(self,) -> DataLoader: 141 | return self.get_dataloader( 142 | self.test_dataset, 143 | batch_size=self.hparams.val_batch_size, 144 | shuffle=False, 145 | num_workers=self.hparams.num_workers 146 | ) 147 | 148 | @staticmethod 149 | def add_model_specific_args(parent_parser): 150 | parser = configargparse.ArgumentParser(parents=[parent_parser], add_help=False) 151 | parser.add_argument( 152 | "--data_dir", 153 | default=None, 154 | type=str, 155 | help="The input data dir. Should contain the .json files for the task." 156 | ) 157 | parser.add_argument( 158 | "--train_file", 159 | default=None, 160 | type=str, 161 | help="The input training file. If a data dir is specified, will look for the file there" 162 | ) 163 | parser.add_argument( 164 | "--valid_file", 165 | default=None, 166 | type=str, 167 | help="The input evaluation file. If a data dir is specified, will look for the file there" 168 | ) 169 | parser.add_argument( 170 | "--test_file", 171 | default=None, 172 | type=str, 173 | help="The input test file. If a data dir is specified, will look for the file there" 174 | ) 175 | parser.add_argument("--train_batch_size", default=8, type=int, 176 | help="Batch size per GPU/CPU for training.") 177 | parser.add_argument("--val_batch_size", default=8, type=int, 178 | help="Batch size per GPU/CPU for evaluation.") 179 | parser.add_argument("--shuffle_train", action="store_true", 180 | help="Shuffle the train dataset") 181 | parser.add_argument("--negative_ratio", default=0, type=int, 182 | help="Set the positive-negative ratio of the training dataset. " 183 | "Data balancing is performed for each pair document-typename. 
If less than one, keep the ratio of the original dataset") 184 | parser.add_argument("--use_sentence_id", action="store_true", 185 | help="Set this flag if you are using the approach that breaks the contexts into sentences") 186 | parser.add_argument("--overwrite_cache", action="store_true", 187 | help="Overwrite the cached training and evaluation sets") 188 | 189 | return parser 190 | -------------------------------------------------------------------------------- /information_extraction_t5/features/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuralmind-ai/information-extraction-t5/988c589b433d96004139e1f63bfefd6778c0851b/information_extraction_t5/features/__init__.py -------------------------------------------------------------------------------- /information_extraction_t5/features/context.py: -------------------------------------------------------------------------------- 1 | import math 2 | import numpy as np 3 | import re 4 | from typing import Any, Dict, List, Optional, Tuple, Union 5 | from transformers import AutoTokenizer, PreTrainedTokenizerBase 6 | 7 | 8 | def get_tokens_and_offsets(text: str, tokenizer: PreTrainedTokenizerBase) -> List[Tuple[Any, int, int]]: 9 | tokens = tokenizer.tokenize(text) 10 | token_lens = [len(token) for token in tokens] 11 | token_lens[0] -= 1 # Ignore first "_" token 12 | token_ends = np.cumsum(token_lens) 13 | token_starts = [0] + token_ends[:-1].tolist() 14 | tokens_and_offsets = list(zip(tokens, token_starts, token_ends)) 15 | return tokens_and_offsets 16 | 17 | 18 | def get_token_id_from_position(tokens_and_offsets: List[Tuple[Any, int, int]], position: int) -> int: 19 | for idx, tok_offs in enumerate(tokens_and_offsets): 20 | _, start, end = tok_offs 21 | if start <= position < end: 22 | return idx 23 | return len(tokens_and_offsets) - 1 24 | 25 | 26 | def get_max_size_context(document: Dict, max_size: int = 4000, question: str = 'Qual?') -> str: 27 | """Returns the first max_size characters of the document_text. 28 | """ 29 | document_text = document['text'] 30 | question_sentence = f'question: {question} context: ' 31 | num_chars_question = len(question_sentence) 32 | remaining_chars = max_size - num_chars_question 33 | 34 | context = document_text[:remaining_chars - 4] 35 | context = context + ' ...' 36 | return context 37 | 38 | 39 | def get_position_context( 40 | document: Dict, 41 | max_size: int = 4000, 42 | start_position: int = 0, 43 | proportion_before: float = 0.2, 44 | question: str = 'Qual?', 45 | use_sentence_id: bool = False, 46 | verbose: bool = False, 47 | ) -> Tuple[str, int]: 48 | """Returns the content around a specific position with size controlled by max_size. 49 | proportion_before indicates the proportion of max_size the must be taken before 50 | the position, while 1 - position_before is after. 
51 | """ 52 | document_text = document['text'] 53 | question_sentence = f'question: {question} context: ' 54 | num_chars_question = len(question_sentence) 55 | 56 | remaining_chars = max_size - num_chars_question 57 | start_reticences, end_reticences = False, False 58 | 59 | start = math.floor(remaining_chars * proportion_before) 60 | start = max(0, start_position - start) 61 | end = min(len(document_text), remaining_chars + start) 62 | 63 | if use_sentence_id: 64 | num_chars_each_sentence_id = len('[SENT1]') 65 | num_chars_sentence_id = (document_text[start: end].count('\n') + 1) * num_chars_each_sentence_id 66 | else: 67 | num_chars_sentence_id = 0 68 | size = end - start 69 | 70 | # remove chars if current size + sentence-ids chars exceed the remaining chars 71 | if size + num_chars_sentence_id > remaining_chars: 72 | to_remove = (size + num_chars_sentence_id) - remaining_chars 73 | 74 | # Chars are removed fractionally (20 times), in order to control the 75 | # expected size, avoiding exaggerated removal. In each iteration, as 76 | # long as the the window size is updated, the num_chars_sentence_id 77 | # is updated as well. 78 | to_remove_fractions = [to_remove // 20] * 20 + [to_remove % 20] 79 | 80 | for to_remove in to_remove_fractions: 81 | if start == start_position: 82 | end -= to_remove 83 | else: 84 | remove_before = math.floor(to_remove * proportion_before) 85 | remove_before = min(remove_before, start_position - start) 86 | remove_after = to_remove - remove_before 87 | start += remove_before 88 | end -= remove_after 89 | 90 | num_chars_sentence_id = (document_text[start: end].count('\n') + 1) * num_chars_each_sentence_id 91 | size = end - start 92 | 93 | # the size satifies remaining_tokens 94 | if size + num_chars_sentence_id <= remaining_chars: 95 | break 96 | 97 | # check if it requires reticences 98 | # if it does, try to find a space before/after the start_position 99 | if start != 0: 100 | start_reticences = True 101 | start = max(start, document_text.find(' ', start, start_position)) 102 | position_offset = start - 3 # reticences 103 | else: 104 | position_offset = start 105 | 106 | if end < len(document_text): 107 | end_reticences = True 108 | end = document_text.rfind(' ', start_position, end) 109 | 110 | if verbose: 111 | print('-- MUST CONTAIN: ' + document_text[start_position: start_position+30]) 112 | print(f'-- start: {start}, end: {end}') 113 | c = document_text[start:end] 114 | print(f'-- len (char): {len(c)}') 115 | print(f'-- context: {c} \n') 116 | 117 | context = ('...' if start_reticences else '') \ 118 | + document_text[start: end] \ 119 | + ('...' if end_reticences else '') 120 | 121 | if verbose: 122 | # it can exceed the expected num of chars because of reticences. 123 | print('--> testing the number of chars:') 124 | t5_input = question_sentence + context 125 | n = len(t5_input) 126 | print(f'>> The input occupies {n} chars. ' 127 | f'It will have additional {num_chars_sentence_id} for sentence-ids. ' 128 | f'Total: {n + num_chars_sentence_id}. Expected: {max_size}.') 129 | 130 | return context, position_offset 131 | 132 | 133 | def get_windows_context( 134 | document: Dict, 135 | max_size: int = 4000, 136 | window_overlap: float = 0.5, 137 | max_windows: int = 3, 138 | question: str = 'Qual?', 139 | use_sentence_id: bool = False, 140 | verbose: bool = False, 141 | ) -> Tuple[List[str], List[int]]: 142 | """Returns a list of window contents with size controlled by max_size, with 143 | overlapping near to 50%. 
144 | """ 145 | document_text = document['text'] 146 | 147 | assert max_windows != 0, ( 148 | 'Set max_windows higher than 0 to get a specific quantity of windows, ' 149 | 'or below to extract all possible ones.') 150 | 151 | contexts, offsets = [], [] 152 | 153 | start_position, position_offset = 0, 0 154 | context = '' 155 | # the offset + current context size surpassing document size means the 156 | # window reached the end of document 157 | while position_offset + len(context) < len(document_text): 158 | 159 | context, position_offset = get_position_context(document, max_size=max_size, 160 | start_position=start_position, proportion_before=0, question=question, 161 | use_sentence_id=use_sentence_id, verbose=verbose) 162 | 163 | contexts.append(context) 164 | offsets.append(position_offset) 165 | 166 | if verbose: 167 | print(f'>>>>>>>>>> WINDOW: start_position = {start_position}, offset = {position_offset}') 168 | 169 | start_position += int(len(context) * (1 - window_overlap)) 170 | 171 | if max_windows > 0 and len(contexts) == max_windows: break 172 | 173 | return contexts, offsets 174 | 175 | 176 | def get_token_context(document: Dict, 177 | tokenizer: Union[None, PreTrainedTokenizerBase] = None, 178 | max_tokens: int = 512, 179 | question: str = 'Qual?', 180 | use_sentence_id: bool = False, 181 | verbose: bool = False, 182 | ) -> Tuple[str, int]: 183 | """Returns the first max_tokens tokens of the document_text. 184 | """ 185 | context, position_offset = get_position_token_context(document, start_position=0, 186 | proportion_before=0, tokenizer=tokenizer, max_tokens=max_tokens, question=question, 187 | use_sentence_id=use_sentence_id, verbose=verbose) 188 | return context, position_offset 189 | 190 | 191 | def get_position_token_context( 192 | document: Dict, 193 | start_position: int = 0, 194 | proportion_before: float = 0.2, 195 | tokenizer: Union[None, PreTrainedTokenizerBase] = None, 196 | max_tokens: int = 512, 197 | tokens_and_offsets: Optional[List[Tuple[Any, int, int]]] = None, 198 | question: str = 'Qual?', 199 | use_sentence_id: bool = False, 200 | verbose: bool = False, 201 | ) -> Tuple[str, int]: 202 | """Returns the content around a specific position, with size controlled by max_tokens. 203 | proportion_before indicates the proportion of max_size the must be taken before the 204 | position, while 1 - position_before is after. 
205 | """ 206 | document_text = document['text'] 207 | question_sentence = f'question: {question} context: ' 208 | num_tokens_question = len(tokenizer.tokenize(question_sentence)) 209 | 210 | remaining_tokens = max_tokens - num_tokens_question 211 | start_reticences, end_reticences = False, False 212 | 213 | if tokens_and_offsets is None: 214 | tokens_and_offsets = get_tokens_and_offsets(text=document_text, tokenizer=tokenizer) 215 | positional_token_id = get_token_id_from_position(tokens_and_offsets=tokens_and_offsets, position=start_position) 216 | start_token_id = max(0, positional_token_id - math.floor(remaining_tokens * proportion_before)) 217 | end_token_id = min(positional_token_id + math.ceil(remaining_tokens * (1-proportion_before)), len(tokens_and_offsets)) 218 | 219 | start = tokens_and_offsets[start_token_id][1] 220 | end = tokens_and_offsets[end_token_id-1][2] 221 | 222 | if use_sentence_id: 223 | num_tokens_each_sentence_id = len(tokenizer.tokenize('[SENT10]')) 224 | num_tokens_sentence_id = (document_text[start: end].count('\n') + 1) * num_tokens_each_sentence_id 225 | else: 226 | num_tokens_sentence_id = 0 227 | size = end_token_id - start_token_id 228 | 229 | # remove tokens if current size + sentence-ids tokens exceed the remaining tokens 230 | if size + num_tokens_sentence_id > remaining_tokens: 231 | to_remove = (size + num_tokens_sentence_id) - remaining_tokens 232 | 233 | # Tokens are removed fractionally (20 times), in order to control the 234 | # expected size, avoiding exaggerated removal. In each iteration, as 235 | # long as the the window size is updated, the num_tokens_sentence_id 236 | # is updated as well. 237 | to_remove_fractions = [to_remove // 20] * 20 + [to_remove % 20] 238 | 239 | for to_remove in to_remove_fractions: 240 | if start == start_position: 241 | end_token_id -= to_remove 242 | else: 243 | remove_before = math.floor(to_remove * proportion_before) 244 | remove_before = min(remove_before, positional_token_id - start_token_id) 245 | remove_after = to_remove - remove_before 246 | start_token_id += remove_before 247 | end_token_id -= remove_after 248 | 249 | start = tokens_and_offsets[start_token_id][1] 250 | end = tokens_and_offsets[end_token_id-1][2] 251 | 252 | num_tokens_sentence_id = (document_text[start: end].count('\n') + 1) * num_tokens_each_sentence_id 253 | size = end_token_id - start_token_id 254 | 255 | # the size satifies remaining_tokens 256 | if size + num_tokens_sentence_id <= remaining_tokens: 257 | break 258 | 259 | # check if it requires reticences 260 | # if it does, try to find a space before/after the start_position 261 | if start != 0: 262 | start_reticences = True 263 | start = max(start, document_text.find(' ', start, start_position)) 264 | position_offset = start - 3 # reticences 265 | else: 266 | position_offset = tokens_and_offsets[start_token_id][1] 267 | 268 | if end < len(document_text): 269 | end_reticences = True 270 | end = document_text.rfind(' ', start_position, end) 271 | 272 | if verbose: 273 | print('-- MUST CONTAIN: ' + document_text[start_position: start_position+30]) 274 | print(f'-- start: {start}, end: {end}') 275 | c = document_text[start: end] 276 | print(f'-- len (char): {len(c)}') 277 | print(f'-- len (toks): {end_token_id - start_token_id}') 278 | print(f'-- context: {c} \n') 279 | 280 | context = ('...' if start_reticences else '') \ 281 | + document_text[start: end] \ 282 | + ('...' 
if end_reticences else '') 283 | 284 | if verbose: 285 | # it can exceed the expected num of tokens because of reticences. 286 | print('--> testing the number of tokens:') 287 | t5_input = question_sentence + context 288 | n = len(tokenizer.tokenize(t5_input)) 289 | print(f'>> The input occupies {n} tokens. ' 290 | f'It will have additional {num_tokens_sentence_id} for sentence-ids. ' 291 | f'Total: {n + num_tokens_sentence_id}. Expected: {max_tokens}.') 292 | 293 | return context, position_offset 294 | 295 | 296 | def get_windows_token_context( 297 | document: Dict, 298 | window_overlap: float = 0.5, 299 | max_windows: int = 3, 300 | tokenizer: Union[None, PreTrainedTokenizerBase] = None, 301 | max_tokens: int = 512, 302 | question: str = 'Qual?', 303 | use_sentence_id: bool = False, 304 | verbose: bool = False, 305 | ) -> Tuple[List[str], List[int]]: 306 | """Returns a list of window contents with size controlled by max_tokens, with 307 | overlapping near to 50%. 308 | """ 309 | document_text = document['text'] 310 | 311 | assert max_windows != 0, ( 312 | 'Set max_windows higher than 0 to get a specific quantity of windows, ' 313 | 'or below to extract all possible ones.') 314 | 315 | contexts, offsets = [], [] 316 | tokens_and_offsets = get_tokens_and_offsets(text=document_text, tokenizer=tokenizer) 317 | 318 | assert len(document_text) == tokens_and_offsets[-1][2], ( 319 | f'The original document ({document["uuid"]}) and the end of last token are not matching: {len(document_text)} != {tokens_and_offsets[-1][2]}') 320 | 321 | start_position, position_offset = 0, 0 322 | context = '' 323 | # the offset + current context size surpassing document size means the 324 | # window reached the end of document 325 | while position_offset + len(context) < len(document_text): 326 | 327 | context, position_offset = get_position_token_context(document, start_position=start_position, 328 | proportion_before=0, tokenizer=tokenizer, max_tokens=max_tokens, tokens_and_offsets=tokens_and_offsets, 329 | question=question, use_sentence_id=use_sentence_id, verbose=verbose) 330 | 331 | contexts.append(context) 332 | offsets.append(position_offset) 333 | 334 | if verbose: 335 | print(f'>>>>>>>>>> WINDOW: start_position = {start_position}, offset = {position_offset}') 336 | 337 | start_position += int(len(context) * (1 - window_overlap)) 338 | 339 | if max_windows > 0 and len(contexts) == max_windows: break 340 | 341 | return contexts, offsets 342 | 343 | 344 | def get_context( 345 | document: Dict, 346 | context_content: str = 'windows_token', 347 | max_size: int = 4000, 348 | start_position: int = 0, 349 | proportion_before: float = 0.2, 350 | return_position_offset: bool = False, 351 | use_sentence_id: bool = False, 352 | tokenizer: Union[None, PreTrainedTokenizerBase] = None, 353 | max_tokens: int = 512, 354 | question: str = 'Qual?', 355 | window_overlap: float = 0.5, 356 | max_windows: int = 3, 357 | verbose: bool = False, 358 | ) -> Union[str, List[str], Tuple[Union[str, List[str]], Union[int, List[int]]]]: 359 | """Returns the context to use in T5 input based on context_content. 360 | 361 | Args: 362 | document: dict with all the information of current document. 363 | context_content: type of context (max_size, position, token, 364 | position_token or windows_token). 365 | - max_size: gets the first max_size characters. 366 | - position: gets a window text limited to max_size characters 367 | around a start_position, respecting a proportion before and after 368 | the position. 
369 | - windows: gets a list of sliding windows of max_size, comprising 370 | the complete document. 371 | - token: gets the first max_tokens tokens. 372 | - position_token: gets a window text limited to max_tokens tokens 373 | around a start_position, respecting a proportion before and after 374 | the position, and penalizing tokens that will be occupied by 375 | question and sentence-ids in the T5 input. 376 | - windows_token: gets a list of sliding windows of max_tokens, 377 | comprising the complete document. 378 | max_size: maximum size of context, in chars (used for max_size and 379 | position). 380 | start_position: char index of a keyword in the original document text 381 | (used for position and position_token). 382 | proportion_before: proportion of the maximum context size (max_size or 383 | max_tokens) that must be before start_position (used for position, 384 | position_token and the variants). 385 | return_position_offset: if True, returns the position of the returned 386 | context with respect to the original document text (used for position, 387 | position_token and the variants). 388 | tokenizer: AutoTokenizer used in the model (used for position_token and 389 | windows_token). 390 | max_tokens: maximum size of context, in tokens (used for position_token 391 | and windows_token). 392 | question: question that will be used along with the context in the T5 393 | input (used for position_token and windows_token). 394 | window_overlap: overlapping between windows (used for windows and 395 | windows_token). 396 | max_windows: the maximum number of windows to generate; use -1 to get 397 | all the possible windows (used for windows and windows_token). 398 | verbose: visualize the processing, tests, and resulting contexts. 399 | 400 | Returns: 401 | - the context. 402 | - the position_offset (optional). 403 | """ 404 | position_offset = 0 405 | 406 | # remove repeated line breaks, repeated spaces/tabs, spaces/tabs before 407 | # line breaks, and line breaks at the start/end of the document text to make the token 408 | # positions match the char positions. These rules avoid incorrect alignments.
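# --- Editor's note (illustrative, not part of the original file) ----------
# A minimal sketch of what the normalization below does, using the same
# regexes as the next lines:
#   >>> import re
#   >>> text = '  Rua  X\t\n\n  Av.  Y  '
#   >>> text = text.replace('\t', ' ')
#   >>> text = re.sub(r'\s*\n+\s*', r'\n', text)
#   >>> text = re.sub(r'(\s)\1+', r'\1', text)
#   >>> text.strip()
#   'Rua X\nAv. Y'
# Tabs become spaces, blank-line runs collapse to a single '\n', and
# repeated whitespace collapses, so char offsets stay aligned with tokens.
# ---------------------------------------------------------------------------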
409 | document['text'] = document['text'].replace('\t', ' ') # '\t' 410 | document['text'] = re.sub(r'\s*\n+\s*', r'\n', document['text']) # space (0 or more) + '\n' (1 or more) + space (0 or more) 411 | document['text'] = re.sub(r'(\s)\1+', r'\1', document['text']) # space (1 or more) 412 | # special characters that causes raw and tokinization texts to desagree 413 | document['text'] = document['text'].replace('´', '\'') # 0 char --> 1 char in tokenization (common in publicacoes) 414 | document['text'] = document['text'].replace('™', 'TM') # 1 char --> 2 chars in tokenization 415 | document['text'] = document['text'].replace('…', '...') # 1 char --> 3 chars in tokenization 416 | document['text'] = document['text'].strip() 417 | 418 | if context_content == 'max_size': 419 | context = get_max_size_context(document, max_size=max_size, question=question) 420 | elif context_content == 'position': 421 | context, position_offset = get_position_context(document, max_size=max_size, 422 | start_position=start_position, proportion_before=proportion_before, 423 | question=question, use_sentence_id=use_sentence_id, verbose=verbose) 424 | elif context_content == 'windows': 425 | context, position_offset = get_windows_context(document, max_size=max_size, 426 | window_overlap=window_overlap, max_windows=max_windows, 427 | question=question, use_sentence_id=use_sentence_id, verbose=verbose) 428 | elif context_content == 'token': 429 | context, position_offset = get_token_context(document, 430 | tokenizer=tokenizer, max_tokens=max_tokens, question=question, 431 | use_sentence_id=use_sentence_id, verbose=verbose) 432 | elif context_content == 'position_token': 433 | context, position_offset = get_position_token_context(document, start_position=start_position, 434 | proportion_before=proportion_before, tokenizer=tokenizer, max_tokens=max_tokens, 435 | question=question, use_sentence_id=use_sentence_id, verbose=verbose) 436 | elif context_content == 'windows_token': 437 | context, position_offset = get_windows_token_context(document, 438 | window_overlap=window_overlap, max_windows=max_windows, tokenizer=tokenizer, 439 | max_tokens=max_tokens, question=question, use_sentence_id=use_sentence_id, verbose=verbose) 440 | else: 441 | return '', position_offset 442 | 443 | if verbose: 444 | if isinstance(context, list): 445 | for (i, cont) in enumerate(context): 446 | print(f'--------\nWINDOW {i}\n--------') 447 | print(f'len: {len(cont)} context: {cont} \n') 448 | else: 449 | print(f'len: {len(context)} context: {context} \n') 450 | 451 | if return_position_offset: 452 | return context, position_offset 453 | else: 454 | return context 455 | 456 | 457 | def main(): 458 | document = {} 459 | document['uuid'] = '1234567' 460 | document['text'] = "Que tal fazer uma poc inicial para vermos a viabilidade e identificarmos as dificuldades?\nA motivação da escolha desse problema " \ 461 | "foi que boa parte dos atos de matrícula passam de 512 tokens, e ainda não temos uma solução definida para fazer treinamento e predições em " \ 462 | "janelas usando o QA.\nEssa limitação dificulta o uso de QA para problemas que não sabemos onde a informação está no documento (por enquanto, " \ 463 | "só aplicamos QA em tarefas que sabemos que a resposta está nos primeiros 512 tokens da matrícula).\nComo esse problema de identificar a proporção " \ 464 | "de cada pessoa são duas tarefas (identificação + relação com uma pessoa), podemos usar a localização da pessoa no texto para selecionar apenas " \ 465 | "uma pedaço do ato de alienação 
pra passar como contexto pro modelo, evitando um pouco essa limitação dos 512 tokens." 466 | document['text'] = "PREFEITURA DE CAUCAIA\nSECRETARIA DE FINAN\u00c7AS,PLANEJAMENTO E OR\u00c7AMENTO\nCERTID\u00c3O NEGATIVA DE TRIBUTOS ECON\u00d4MICOS\nLA SULATE\nN\u00ba 2020000982\nRaz\u00e3o Social\nCOMPASS MINERALS AMERICA DO SUL INDUSTRIA E COMERC\nINSCRI\u00c7\u00c3O ECON\u00d4MICA Documento\nBairro\n00002048159\nC.N.P.J.: 60398138001860\nSITIO SALGADO\nLocalizado ROD CE 422 KM 17, S/N - SALA SUPERIOR 01 CXP - CAUCAIA-CE\nCEP\n61600970\nDADOS DO CONTRIBUINTE OU RESPONS\u00c1VEL\nInscri\u00e7\u00e3o Contribuinte / Nome\n169907 - COMPASS MINERALS AMERICA DO SUL INDUSTRIA E COMERC\nEndere\u00e7o\nROD CE 422 KM 17, S/N SALA SUPERIOR 01 CXP\nDocumento\nC.N.P.J.: 60.398.138/0018-60\nSITIO SALGADO CAUCAIA-CE CEP: 61600970\nNo. Requerimento\n2020000982/2020\nNatureza jur\u00eddica\nPessoa Juridica\nCERTID\u00c3O\nCertificamos para os devidos fins, que revendo os registros dos cadastros da d\u00edvida ativa e de\ninadimplentes desta Secretaria, constata-se - at\u00e9 a presente data \u2013 n\u00e3o existirem em nome do (a)\nrequerente, nenhuma pend\u00eancia relativa a tributos municipais.\nSECRETARIA DE FINAN\u00c7AS, PLANEJAMENTO E OR\u00c7AMENTO se reserva o direito de inscrever e cobrar as\nd\u00edvidas que posteriormente venham a ser apurados. Para Constar, foi lavrada a presente Certid\u00e3o.\nA aceita\u00e7\u00e3o desta certid\u00e3o est\u00e1 condicionada a verifica\u00e7\u00e3o de sua autenticidade na internet, nos\nseguinte endere\u00e7o: http://sefin.caucaia.ce.gov.br/\nCAUCAIA-CE, 03 DE AGOSTO DE 2020\nEsta certid\u00e3o \u00e9 v\u00e1lida por 090 dias contados da data de emiss\u00e3o\nVALIDA AT\u00c9: 31/10/2020\nCOD. VALIDA\u00c7\u00c3O 2020000982" 467 | # document['text'] = "DESAANZ\nJUCESP - Junta Comercial do Estado de S\u00e3o Paulo\nMinist\u00e9rio do Desenvolvimento, Ind\u00fastria e Com\u00e9rcio Exterios\nSamas\nSECRETAR\u00cdA DE DESENVOLVIMENTO\ndo Com\u00e9rcio - DNRC\nECONOMICO, CI\u00caNCIA,\nn\u00f4mico, Ci\u00eancia e Tecnologia\nTECNOLOGIA E INOVA\u00c7\u00c3O\nBestellen\nCERTIFICO O REGISTROFLAVIA REAT BRITTO\nSOB O N\u00daMERO SECRETARIA IGERAL EM EXERC\nda Reguerimento:\n5461/15-7 A LEHET BEDS\nSEQ. DOC.\n15 JAN. 2015\n1\nJUCESP\nHU\nSIP\nJUCESP PROTOCOLO\n0.024.119/15-5\n1\nJunta Comba\nEstado de S\u00e3o Paulo\n14\nJUNTA CON\nNubia Cristina da Silva Cembull\nAssessora T\u00e9cnica do Registro Publico\nR.G.: 36.431.427-3\nDADOS CADASTRAIS\n13 Hd\nCODIGO DE BARRAS (NIRE)\nCNPJ DA SEDE\nNIRE DA SEDE\n3522550861-2\n13.896.623/0001-36\nSEM EXIG\u00caNCIA ANTERIOR\nPROIE\nATO(S)\nAltera\u00e7\u00e3o de Endere\u00e7o; Altera\u00e7\u00e3o de Nome Empresarial; Consolida\u00e7\u00e3o da\nNOME EMPRESARIAL\nRF MOTOR'S Com\u00e9rcio de ve\u00edculos Ltda. 
- ME\n!\nLOGRADOURO\nAvenida Regente Feij\u00f3\nN\u00daMERO\n277\n:\nCEP\nCOMPLEMENTO\nBAIRRO/DISTRITO\nVila Regente Feij\u00f3\nC\u00d3DIGO DO MUNICIPIO\n5433\n03342-000\nUF\nMUNICIPIO\nS\u00e3o Paulo\nSP\nTELEFONE\nCORREIO ELETR\u00d4NICO\nIN, OAB\nU.F.\nNOME DO ADVOGADO\nVALORES RECOLHIDOS IDENTIFICA\u00c7\u00c3O DO REPRESENTANTE DA EMPRESA\nDARE 54,00\nNOME:\nBruno Vinicius Ferreira (S\u00f3cio )\nDARF 21,00\nASSINATURA:\nDATA ASSINATURA:\n12/01/2015\nB\nDECLARO, SOB AS PENAS DA LEI, QUE AS INFORMA\u00c7\u00d5ES CONSTANTES DO REQUERIMENTO/PROCESSO S\u00c3O EXPRESS\u00c3O DA VERDADE.\nControle Internet\n\u0421.\n015755122-9\n12/1/2015 10:19:14 - P\u00e1gina 1 de 2\n\n\n1ERCIAL\npy\nOLO\nINSTRUMENTO PARTICULAR DE ALTERA\u00c7\u00c3O\nCONTRATUAL DE SOCIEDADE EMPRES\u00c1RIA DE FORMA\nLIMITADA:\nREAVEL FERREIRA COM\u00c9RCIO DE VE\u00cdCULOS LTDA. ME\nCNPJ 13.896.623/0001-36\nPelo presente instrumento particular de altera\u00e7\u00e3o\ndo contrato social, os abaixo qualificados e ao final assinados:\nBruno Vinicius Ferreira, brasileiro, solteiro, nascido em 26/10/1985,\nempres\u00e1rio, portador da c\u00e9dula de identidade RG sob n\u00ba. 42.318.703-X/SSP-SP, inscrito no CPF/MF sob n\u00ba. 340.446.998-44, residente e\ndomiciliado no Estado de S\u00e3o Paulo, \u00e0 Rua Altina Penna Botto, 16 -\nCasa 02 - Vila Ivone - CEP 03375-001;\nDiogo Gabriel Ferreira, brasileiro, solteiro, nascido em 08/10/1988,\nempres\u00e1rio, portador da c\u00e9dula de identidade RG sob n\u00ba. 44.476.866-X/SSP-SP, inscrito no CPF/MF sob n\u00ba. 359.085.288-70, residente e\ndomiciliado no Estado de S\u00e3o Paulo, \u00e0 Rua Altina Penna Botto, 16 -\nCasa 02 - Vila Ivone - CEP 03375-001;\n\u00danicos s\u00f3cios da sociedade empres\u00e1ria de forma limitada que gira\na denomina\u00e7\u00e3o social de REAVEL FERREIRA\nCom\u00e9rcio de Ve\u00edculos Ltda. ME, inscrita no CNPJ/MF sob n\u00ba.\n13.896.623/0001-36, com estabelecimento e sede \u00e0 Rua Acuru\u00ed, 508 -\nVila Formosa S\u00e3o Paulo CEP 03355-000 S.P., cujos atos\nconstitutivos encontram-se registrados e arquivados na Junta\nComercial do Estado de S\u00e3o Paulo, com NIRE sob n\u00ba 35.2.25508612,\nem sess\u00e3o de 22 de Junho de 2011, t\u00eam, entre si justos e contratados\npromovem a altera\u00e7\u00e3o contratual e consequente consolida\u00e7\u00e3o da\nempresa que obedecera as clausulas e condi\u00e7\u00f5es adiante descritas:\nB\n\n\nVistoContenido\nRG: 36.430.427-3\nAltera\u00e7\u00e3o Contratual\nCl\u00e1usula 1a:- Altera-se a raz\u00e3o social da empresa que passa a ser\nRF MOTOR'S Com\u00e9rcio de Ve\u00edculos Ltda. - ME com denomina\u00e7\u00e3o\nde fantasia RF MOTOR'S;\nCl\u00e1usula 2a:- Altera-se o endere\u00e7o da sociedade que passa a ser \u00e0\nAv. Regente Feij\u00f3, 277 - Vila Regente Feij\u00f3 - S\u00e3o Paulo CEP\n03342-000 - S.P.;\nCl\u00e1usula 3a:- Face \u00e0s altera\u00e7\u00f5es\nos s\u00f3cios deliberam a\nCONSOLIDA\u00c7\u00c3O CONTRATUAL, conforme segue:\nCONTRATUAL\nCl\u00e1usula 1:- A sociedade girar\u00e1 sob a denomina\u00e7\u00e3o social de RF\nMOTOR'S Com\u00e9rcio Veiculos Ltda. - ME com denomina\u00e7\u00e3o de\nfantasia RF MOTOR'S, e ter\u00e1 a sua sede \u00e0 Av. 
Regente Feij\u00f3, 277 -\nVila Regente Feij\u00f3 - S\u00e3o Paulo - CEP 03342-000 - S.P.;\nCl\u00e1usula 2: sociedade tem por fim e objetivo na forma da\nlegisla\u00e7\u00e3o\nCom\u00e9rcio a varejo de autom\u00f3veis, camionetas e utilit\u00e1rios novos;\nCom\u00e9rcio por atacado de autom\u00f3veis, camionetas e utilit\u00e1rios\nnovos e usados;\nCom\u00e9rcio a varejo de autom\u00f3veis, camionetas e utilit\u00e1rios usados;\nCom\u00e9rcio por atacado de motocicletas e motonetas;\nCom\u00e9rcio a varejo de motocicletas e motone novas;\nCl\u00e1usula 3:- A sociedade teve in\u00edcio em 22 de Junho de 2011 e ter\u00e1\ndura\u00e7\u00e3o por tempo indeterminado;\n8.\n\n\n300 m\nVi\u015fte\nCl\u00e1usula 4:- 0 capital social \u00e9 de R$ 10.000,00 (Dez Mil Reais)\ntotalmente subscrito e integralizado em moeda corrente nacional,\nrepresentado por 10.000 (dez mil) cotas no valor unit\u00e1rio de R$ 1,00\n(Hum Real) cada, assim distribu\u00eddo:-1. Bruno Vinicius Ferreira, 9.900 (nove mil e novecentas) cotas de\nvalor unit\u00e1rio de R$ 1,00 (Hum Real), totalizando R$ 9.900,00 (Nove\nMil e Novecentos Reais), totalmente subscritas e integralizadas em\nmoeda corrente nacional, neste ato;\n2. Diogo Gabriel Ferreira, 100 cotas de valor unit\u00e1rio de R$\n1,00 (Hum Real), totalizando R$ 100,00 (Cem Reais), totalmente\nsubscritas e integralizadas em moeda corrente nacional, neste ato;\nCl\u00e1usula 5:- A responsabilidade dos s\u00f3cios \u00e9 restrita ao valor de suas\ncotas, mas todos\ndo Capital\npela integraliza\u00e7\u00e3o\ndeliberam que a administra\u00e7\u00e3o da sociedade,\nbem como sua representa\u00e7\u00e3o ativa e passiva, judicial ou extrajudicial,\nser\u00e1 exercida pelo s\u00f3cio Bruno Vinicius Ferreira individual e\nisoladamente. Inclusive todos os documentos legais e banc\u00e1rios, que\npoder\u00e1 constituir procuradores com tais poderes.\nPar\u00e1grafo Primeiro:- Os s\u00f3cios ter\u00e3o direito, a uma retirada mensal\na t\u00edtulo de Pr\u00f3-Labore e poder\u00e3o efetuar a distribui\u00e7\u00e3o de lucro, desde\nque, fixado em comum acordo no in\u00edcio de cada exerc\u00edcio.\nPar\u00e1grafo Segundo:- Os s\u00f3cios far\u00e3o uso da firma, podendo assinar\nseparadamente, ficando-lhes vedado, entretanto, o uso da firma em\nneg\u00f3cios alheios aos do objetivo social; e t\u00edtulos de responsabilidade\nsocial de esp\u00e9cie alguma, tais como avais, endossos, fian\u00e7as, etc.\n\n\nVisto\nConitor\n.RG36.430.427-3\nPar\u00e1grafo Terceiro:- A onera\u00e7\u00e3o ou venda de bens im\u00f3veis depende\nda expressa anu\u00eancia de s\u00f3cios que representem pelo menos 75%\n(setenta e cinco por cento) das quotas com direito a voto, respondendo\nos administradores solidariamente perante a sociedade e os terceiros\nprejudicados, por culpa no desempenho de suas fun\u00e7\u00f5es , de acordo\ncom o disposto no art. 1016, da Lei n. 10.406 de 10 de janeiro de\n2002.\nPar\u00e1grafo Quarto:- Depender\u00e1 tamb\u00e9m de expressa anu\u00eancia dos\ns\u00f3cios, conforme o disposto da Lei n. 
10.406 de 10 de janeiro de\n2002, ficando assim solidariamente respons\u00e1vel civil e criminalmente\no s\u00f3cio que infringir o presente artigo:-a) Alienar, onerar ou de qualquer forma dispor de t\u00edtulos imobili\u00e1rios,\nbem como cotas ou a\u00e7\u00f5es de que a sociedade seja titular no capital de\noutras empresas;\nb) Fixar remunera\u00e7\u00e3o dos adminis\nistradores e assessores, sem v\u00ednculo\nempregat\u00edcio, a eles subordinados.\nCl\u00e1usula 7 :- Faculta-se a qualquer dos s\u00f3cios, retirar-se da sociedade\ndesde que o fa\u00e7a mediante aviso pr\u00e9vio de sua resolu\u00e7\u00e3o ao outro\ns\u00f3cio, observado o direito de prefer\u00eancia, com anteced\u00eancia m\u00ednima\nde pelo menos 6 (Seis) meses. Seus haveres lhes ser\u00e3o pagos em 12\n(Doze) meses corrigidos pelo IGPM e o primeiro vencimento \u00e0 partir\nde 60 (sessenta) dias da data do Balan\u00e7o Especial.\nCl\u00e1usula 8:- Os lucros e perdas apurados regularmente em balan\u00e7o\nanual que se realizar\u00e1 no dia 31 de Dezembro de cada ano, ser\u00e3o\ndivididos proporcionalmente ao capital social de cada um dos s\u00f3cios,\nem eventual preju\u00edzo os s\u00f3cios poder\u00e3o optar pelo aumento de capital\npara saldar tais preju\u00edzos.\n\n\nVisto\n15\nConfezidb\nRG: 15.530427-3\nCl\u00e1usula 9:- Em caso de falecimento de qualquer um dos s\u00f3cios, na\nvig\u00eancia do presente contrato, n\u00e3o importa na extin\u00e7\u00e3o da sociedade e\nseus neg\u00f3cios, cabendo ao s\u00f3cio remanescente a apura\u00e7\u00e3o dos haveres\ndo s\u00f3cio ausente segundo balan\u00e7o especial na data do \u00f3bito e, ser\u00e3o\npagos aos herdeiros do falecido em 12 (Doze) presta\u00e7\u00f5es mensais\ncorrigidos pelo IGPM, sendo vedado aos herdeiros poss\u00edvel ingresso\nna sociedade.\nCl\u00e1usula 10\u00b0:- Nenhum dos s\u00f3cios, pessoalmente ou por interposta\npessoa, poder\u00e1 participar ou colaborar a qualquer t\u00edtulo em outra\npessoa jur\u00eddica, que tenha por qualquer forma atividade an\u00e1loga ou\nconcorrente \u00e0 da sociedade, sem expressa anu\u00eancia dos demais.\nCl\u00e1usula 11:- Os administradores declaram, sob as penas da Lei, de\nque n\u00e3o est\u00e3o impedidos de exercerem a administra\u00e7\u00e3o da sociedade,\npor lei especial, ou em virtude de condena\u00e7\u00e3o criminal, ou por se\nencontrarem sob os efeitos dela, a pena que vede, ainda que\ntemporariamente, o acesso a cargos p\u00fablicos; ou por crime falimentar,\nde prevarica\u00e7\u00e3o, peita ou subomo, concuss\u00e3o, peculato, ou contra a\neconomia popular, contra o sistema financeiro nacional, contra\nnormas de defesa da concorr\u00eancia, contra as rela\u00e7\u00f5es de consumo, f\u00e9\np\u00fablica, ou a\n. (art. 1.011, \u00a71\u00baCC/2002).\nCl\u00e1usula 12 :- Para os casos omissos neste contrato, os mesmos ser\u00e3o\nregidos pelas disposi\u00e7\u00f5es legais vigentes atinentes \u00e0 mat\u00e9ria, em\nespecial a Lei n. 
10.406, de 10 de janeiro de 2002.\nCl\u00e1usula 134:- Os s\u00f3cios elegem o foro Central da Comarca da\nCapital, no Estado de S\u00e3o Paulo, para as eventuais quest\u00f5es que\npossam advir.\nB\n\n\n..\n..\nVisto\nConletico\nE, assim, por estarem em tudo justos e contratados, as partes\nassinam o presente instrumento em 03 (tr\u00eas) vias de igual teor e valor\npara um s\u00f3 efeito, tudo, ante duas testemunhas a tudo presentes que\ntamb\u00e9m assinam, devendo em seguida ser encaminhado para registro\ne arquivamento junto a JU ESP - Junta Comercial do Estado de S\u00e3o\nPaulo.\nS\u00e3o Paulo, 12 de Janeiro de 2015.\nGolul tenuina\nBruno Vinicius Ferreira\nDiogo Gabriel Ferreira\nTestemunhas:\nRicardo\nHellon Austina da s Santos\nRicardo Silva Bezerra\nRG n\u00ba 29.074.987-6/SSP-SP\nCPF n\u00ba 213.108.838-82\nHellen Cristina da Silva Santos\nRG n\u00b037965378-3/SSP-SP\nCPF n\u00ba 405.216.528-47\nDO\n15 JAN. 2015\nwww.\nSASA\nSECRETARIA DE DESENVOLVIMENTO\n01/ECON\u00d3MICO, CI\u00caNCIA,\nTECNOLOGIA E INOVA\u00c7\u00c3O\nom\nCERTIFICO O REGISTRO FLAVTA REOTTA eri to\nSOB O NUMERO SECRET\u00c1RIA GERAL EM EXERCICIO\n5.461/15-7 tena MRITH FUIT\n...www.si\n\n\nDocumento B\u00e1sico de Entrada\nPage 1 of 1\n...\nREP\u00daBLICA FERERATIVA DO BRASIL\nCADASTRO NACIONAL.JA PESSOA JUR\u00cdDICA - CNPJ\nPROTOCOLO DE TRANSMISS\u00c3O DA FCP JUOVI30\nA an\u00e1lise e o deferimento deste documento ser\u00e3o efetuados pelo seguinte \u00f3rg\u00e3o:\n\u2022 Junta Comercial do Estado de S\u00e3o Paulo\nC\u00d3DIGO DE ACESSO\nSP.63.05.42.31 - 13.896.623.000.136\n01. IDENTIFICA\u00c7\u00c3O\nNOME EMPRESARIAL (firma ou denomina\u00e7\u00e3o)\nN\u00b0 DE INSCRI\u00c7\u00c3O NO CNPJ\nRF MOTORS COMERCIO DE VEICULOS LTDA.\n13.896.623/0001-36\n02. MOTIVO DO PREENCHIMENTO\nRELA\u00c7\u00c3O DOS EVENTOS SOLICITADOS / DATA DO EVENTO\n203 Exclus\u00e3o do t\u00edtulo do estabelecimento (nome de fantasia) - 12/01/2015\n211 Altera\u00e7\u00e3o de endere\u00e7o dentro do mesmo munic\u00edpio - 12/01/2015\n220 Altera\u00e7\u00e3o do nome empresarial (firma ou denomina\u00e7\u00e3o) - 12/01/2\n03. IDENTIFICA\u00c7\u00c3O DO REPRESENTANTE DA PESSOA JUR\u00cdDICA\nNOME\nBRUNO VINICIUS FERREIRA\nCPF\n340.446.998-44\nILOCAL\nDATA\n12/01/2015\n04. C\u00d3DIGO DE CONTROLE DO CERTIFICADO DIGITAL\nEste documento foi assinado com uso de senha da Sefaz SP\nAprovado pela Instru\u00e7\u00e3o Normativa RFB n\u00ba 1.183, de 19 de agosto de 2011\n12/01/2015\nhttp://www.receita fazenda.gov.br/pessoajuridica/cnpj/fcpj/dbe.asp\n\n\nES\nSP\nGOVERNO DO ESTADO DE S\u00c3O BAULO\nSECRETARIA DE DESENVOLVIMENTO ECONOMICO, CIENCIA E TECNOLOGIA\nJUNTA COMERCIAL DO ESTADO.DE S\u00c3O PAULO: JUCES...\nJUCESP\nAnta Comercial do\nEstado de Sio Pub\nDECLARA\u00c7\u00c3O\n,\nEu, Bruno Vinicius Ferreira, portador da C\u00e9dula de Identidade n\u00ba 42318703-X, inscrito no\nCadastro de Pessoas F\u00edsicas - CPF sob n\u00ba 340.446.998-44, na qualidade de titular, s\u00f3cio ou\nrespons\u00e1vel legal da empresa RF MOTOR'S Com\u00e9rcio de ve\u00edculos Ltda. 
- ME, DECLARO\nestar ciente que o ESTABELECIMENTO situado no(a) Avenida Regente Feij\u00f3, 277 Vila\nRegente Feij\u00f3, S\u00e3o Paulo, S\u00e3o Paulo, CEP 03342-000, N\u00c3O PODER\u00c1 EXERCER suas\natividades sem que obtenha o parecer municipal sobre a viabilidade de sua instala\u00e7\u00e3o e\nfuncionamento no local indicado, conforme diretrizes estabelecidas na legisla\u00e7\u00e3o de uso e\nocupa\u00e7\u00e3o do solo, posturas municipais e restri\u00e7\u00f5es das \u00e1reas de prote\u00e7\u00e3o ambiental, nos\ntermos do art. 24, $2 do Decreto Estadual n\u00ba 55.660/2010 e sem que tenha um CERTIFICADO\nDE LICENCIAMENTO INTEGRADO V\u00c1LIDO, obtido pelo sistema Via R\u00e1pida Empresa\nM\u00f3dulo de Licenciamento Estadual.\nDeclaro ainda estar ciente que qualquer altera\u00e7\u00e3o no endere\u00e7o do estabelecimento, em sua\natividade ou grupo de atividades, ou em qualquer outra das condi\u00e7\u00f5es determinantes \u00e0\nexpedi\u00e7\u00e3o do Certificado de Licenciamento Integrado, implica na perda de sua validade,\nassumindo, desde o momento da altera\u00e7\u00e3o, a obriga\u00e7\u00e3o de renov\u00e1-lo.\nPor fim, declaro estar ciente que a emiss\u00e3o do Certificado de Licenciamento Integrado poder\u00e1\nser solicitada por representante legal devidamente habilitado, presencialmente e no ato da\nretirada das certid\u00f5es relativas ao registro empresarial na Prefeitura, ou pelo titular, s\u00f3cio, ou\ncontabilista vinculado no Cadastro Nacional da Pessoa Jur\u00eddica (CNPJ) diretamente no site da\nJucesp, atrav\u00e9s do m\u00f3dulo de licenciamento, mediante uso da respectiva certifica\u00e7\u00e3o digital.\nBruno Vinicius Ferreira\nRG: 42318703-X\nRF MOTOR'S Com\u00e9rcio de ve\u00edculos Ltda. - ME" 468 | 469 | context_content = 'position_token' 470 | context_content = 'windows_token' 471 | use_sentence_id = True 472 | window_overlap = 0.5 473 | max_windows = 3 474 | 475 | start_position = 158 476 | max_size = 200 477 | 478 | #tokenizer = AutoTokenizer.from_pretrained('models/', do_lower_case=False) 479 | tokenizer = AutoTokenizer.from_pretrained('unicamp-dl/ptt5-base-portuguese-vocab', do_lower_case=False) 480 | max_tokens = 150 481 | question = 'Qual o tipo, a classe, o órgão emissor, a localização e a abrangência?' 
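# --- Editor's note (illustrative, not part of the original file) ----------
# With context_content='windows_token', the get_context call below returns
# a list of up to max_windows window strings plus a list of their character
# offsets into document['text'] (the first offset is 0); with
# context_content='position_token' it returns a single string and a single
# int. Each window is then meant to be wrapped into a T5 input roughly as
#   >>> t5_input = f'question: {question} context: {context[0]}'
# ---------------------------------------------------------------------------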
482 | 483 | context, offset = get_context( 484 | document, 485 | context_content=context_content, 486 | max_size=max_size, 487 | start_position=start_position, 488 | proportion_before=0.2, 489 | return_position_offset=True, 490 | use_sentence_id=use_sentence_id, 491 | tokenizer=tokenizer, 492 | max_tokens=max_tokens, 493 | max_windows=max_windows, 494 | question=question, 495 | window_overlap=window_overlap, 496 | verbose=True) 497 | 498 | print('--> testing the offset:') 499 | if isinstance(context, list): 500 | context, offset = context[-1], offset[-1] # last window 501 | print('>>>>>>>>>> using the offset\n' + document['text'][offset:offset + len(context)]) 502 | print('>>>>>>>>>> returned context\n' + context) 503 | 504 | 505 | if __name__ == "__main__": 506 | main() 507 | -------------------------------------------------------------------------------- /information_extraction_t5/features/highlights.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, Tuple, Union, Dict 2 | from collections import OrderedDict 3 | 4 | from fuzzysearch import find_near_matches 5 | from fuzzywuzzy import process 6 | 7 | from information_extraction_t5.features.sentences import ( 8 | check_sent_id_is_valid, 9 | T5_RAW_CONTEXT, 10 | split_context_into_sentences, 11 | ) 12 | 13 | estados = { 14 | 'AC': 'Acre', 15 | 'AL': 'Alagoas', 16 | 'AP': 'Amapá', 17 | 'AM': 'Amazonas', 18 | 'BA': 'Bahia', 19 | 'CE': 'Ceará', 20 | 'DF': 'Distrito Federal', 21 | 'ES': 'Espírito Santo', 22 | 'GO': 'Goiás', 23 | 'MA': 'Maranhão', 24 | 'MT': 'Mato Grosso', 25 | 'MS': 'Mato Grosso do Sul', 26 | 'MG': 'Minas Gerais', 27 | 'PA': 'Pará', 28 | 'PB': 'Paraíba', 29 | 'PR': 'Paraná', 30 | 'PE': 'Pernambuco', 31 | 'PI': 'Piauí', 32 | 'RJ': 'Rio de Janeiro', 33 | 'RN': 'Rio Grande do Norte', 34 | 'RS': 'Rio Grande do Sul', 35 | 'RO': 'Rondônia', 36 | 'RR': 'Roraima', 37 | 'SC': 'Santa Catarina', 38 | 'SP': 'São Paulo', 39 | 'SE': 'Sergipe', 40 | 'TO': 'Tocantins' 41 | } 42 | 43 | area = { 44 | 'metro_quadrado': ['m²', 'm2', 'metros quadrados'], 45 | 'hectare': ['has', 'hectares'], 46 | 'alq_paulista': ['alqueires paulistas', 'alqueires'] 47 | } 48 | 49 | 50 | def include_variations(query): 51 | """Given a canonical format, include possible variations of how the 52 | information can appear in the text. 53 | """ 54 | if query in estados.keys(): 55 | return [estados[query]] 56 | if query in area.keys(): 57 | return area[query] 58 | return [] 59 | 60 | 61 | def find_sentence_of_sent_id(context: T5_RAW_CONTEXT, sent_id: int) -> str: 62 | """Returns the sentence of number `sent_id` in the context. 63 | 64 | This method assumes the sentence ids are defined by linebreaks and start at 65 | 1. 66 | 67 | Args: 68 | context: question raw context 69 | sent_id: Index of the sentence. Must be greater or equal to 0. 70 | """ 71 | assert sent_id >= 0, ( 72 | f'SENT id must be greater or equal to 0. Received: {sent_id}') 73 | 74 | sentences = split_context_into_sentences(context) 75 | 76 | return sentences[sent_id - 1] 77 | 78 | 79 | def find_indexes_of_sentence( 80 | context: T5_RAW_CONTEXT, sent_id: int 81 | ) -> Union[Tuple[int, int], Tuple[None, None]]: 82 | """Returns character indexes for the start and end of a sentence in the 83 | context. 84 | 85 | This method assumes the sentence ids are defined by linebreaks and start at 86 | 1. 
87 | """ 88 | sentence = find_sentence_of_sent_id(context, sent_id) 89 | # get the start_char and end_char of the sentence 90 | start_char = context.find(sentence) 91 | end_char = context.find('\n', start_char) 92 | 93 | return start_char, end_char 94 | 95 | 96 | def get_levenshtein_dist( 97 | query_string, 98 | levenshtein_dist_dict: Optional[Dict[int, int]] = None 99 | ) -> int: 100 | """Returns a maximum levenshtein distance based on query string length.""" 101 | if levenshtein_dist_dict is None: 102 | levenshtein_dist_dict=OrderedDict({3: 0, 10: 1, 20: 3, 30: 5}) 103 | for str_size, dist in levenshtein_dist_dict.items(): 104 | if len(query_string) < str_size: 105 | return dist 106 | return list(levenshtein_dist_dict.values())[-1] 107 | 108 | 109 | def fuzzy_extract( 110 | query_string: str, large_string: str, score_cutoff: int = 30, 111 | max_levenshtein_dist: Union[int, Dict[int, int]] = -1, 112 | verbose: bool = False 113 | ) -> Union[Tuple[int, int], Tuple[None, None]]: 114 | """Fuzzy matches query string (and its variations) on a large string. 115 | 116 | Args: 117 | query_string: substring to be searched inside another string. 118 | large_string: the string to be searched on. 119 | score_cutoff: fuzzy matches with a score below this one will be ignored. 120 | max_levenshtein_dist: if a Dict, then changes the maximum levenshtein 121 | distance of matches (value) based on `query_string` length (key). 122 | Otherwise, an int should be supplied for a fixed maximum distance. 123 | verbose: When True, prints debug messages to stdout. 124 | 125 | Returns: 126 | Indexes of the start and end characters of the best match. If nothing 127 | is found, returns (None, None) instead. 128 | """ 129 | if max_levenshtein_dist == -1: 130 | OrderedDict({3: 0, 10: 1, 20: 3, 30: 5}) 131 | query_strings = include_variations(query_string) + [query_string] 132 | matches = [] 133 | starts = [] 134 | ends = [] 135 | scores = [] 136 | large_string = large_string.lower() 137 | 138 | for query_string in query_strings: 139 | query_string = query_string.lower() 140 | if verbose: 141 | print(f'query: {query_string}') 142 | 143 | # set dynamic Levenshtein distance 144 | if isinstance(max_levenshtein_dist, dict): 145 | max_l_dist_query = get_levenshtein_dist(query_string, max_levenshtein_dist) 146 | else: 147 | max_l_dist_query = max_levenshtein_dist 148 | 149 | all_matches = process.extractBests(query_string, (large_string,), 150 | score_cutoff=score_cutoff) 151 | for large, _ in all_matches: 152 | if verbose: 153 | print('word::: {}'.format(large)) 154 | for match in find_near_matches(query_string, large, 155 | max_l_dist=max_l_dist_query): 156 | matched = match.matched 157 | start = match.start 158 | end = match.end 159 | score = match.dist 160 | 161 | if verbose: 162 | print(f"match: {matched}\tindex: {start}\tscore: {score}") 163 | 164 | matches.append(matched) 165 | starts.append(start) 166 | ends.append(end) 167 | scores.append(score) 168 | 169 | if len(matches) == 0: 170 | return None, None 171 | 172 | best_id = scores.index(min(scores)) 173 | 174 | return starts[best_id], ends[best_id] 175 | 176 | 177 | def get_answer_highlight( 178 | answer: str, sent_id: int, context: T5_RAW_CONTEXT, 179 | sentence_expansion: int = 0, verbose: bool = False 180 | ) -> Union[Tuple[int, int, str], Tuple[None, None, None]]: 181 | r"""Given a single answer and its SENT ID, returns highlights of its 182 | location within the context. 183 | 184 | Sometimes the answer has line breaks in the middle of it (ex.: São\nPaulo). 
185 | To find the highlight even on these cases, this optionally expands the 186 | highlight window some sentences beyond the original ID. 187 | 188 | Args: 189 | answer: the answer to search on the context. 190 | sent_id: ID of the sentence the answer is in (or starts at). 191 | context: the question raw context. 192 | sentence_expansion: When this is 0, looks for the answer only in the 193 | sentence pointed by SENT ID. If the value is a `N` greater than 0, 194 | then looks for it on the `N` sequences that come after SENT ID, 195 | i.e. the interval `[SENT ID, ..., SENT ID + N]`. 196 | verbose: If True, enables debug prints. 197 | 198 | Examples: 199 | >>> answer = 'Rua Albert Einstein' 200 | >>> sent_id = 3 201 | >>> context = "Campinas\n\nRua 4lbert \nE1nstein 1000" 202 | >>> get_answer_highlight(answer, sent_id, context, sentence_expansion=2) 203 | fuzzy ==> answer: Rua Albert Einstein, sentence: "Rua 4lbert E1nstein 1000" 204 | (10, 30, 'Rua 4lbert \nE1nstein') 205 | """ 206 | sentence = find_sentence_of_sent_id(context, sent_id) 207 | 208 | expanded_sentence = [sentence] 209 | for i in range(1, sentence_expansion + 1): 210 | is_valid = check_sent_id_is_valid(context, sent_id + i) 211 | if not is_valid: 212 | break 213 | 214 | extra_sentence = find_sentence_of_sent_id(context, sent_id + i) 215 | expanded_sentence.append(extra_sentence) 216 | sentence = ' '.join(expanded_sentence) 217 | 218 | if verbose: 219 | print(f'fuzzy ==> answer: {answer}, sentence: "{sentence}"') 220 | 221 | shift, _ = find_indexes_of_sentence(context, sent_id) 222 | start_char, end_char = fuzzy_extract(answer, sentence) 223 | 224 | if start_char is None or end_char is None: 225 | highlight = None 226 | 227 | else: 228 | start_char += shift 229 | end_char += shift 230 | highlight = context[start_char:end_char] 231 | 232 | return start_char, end_char, highlight 233 | -------------------------------------------------------------------------------- /information_extraction_t5/features/postprocess.py: -------------------------------------------------------------------------------- 1 | """Utility methods to post-process model output.""" 2 | from typing import Dict, List, Tuple 3 | 4 | import numpy as np 5 | import pandas as pd 6 | 7 | from information_extraction_t5.features.sentences import ( 8 | T5_SENTENCE, 9 | find_ids_of_sent_tokens, 10 | deconstruct_answer, 11 | get_raw_answer_from_subsentence, 12 | get_subanswer_from_subsentence 13 | ) 14 | 15 | 16 | def group_qas(document_or_example_ids: List[str], group_by_typenames=True) -> Dict[str, List[int]]: 17 | """Groups the sentences according to qa-ids of the examples or documents. 18 | 19 | Args: 20 | sentences: List of qa-ids (strings) 21 | 22 | Returns: 23 | Dict with qa-ids (document-type + type-name) as keys and list of indexes 24 | of grouped sentences as values. 25 | """ 26 | qid_dict = {} 27 | for idx, document_or_example_id in enumerate(document_or_example_ids): 28 | # When grouping by example_ids (pattern document_class.typename), add only the project 29 | # (document class), such as matriculas, certidoes, etc. Ex.: qid_dict['matriculas'] = [0, 1, 2]. 30 | # Must include only for original answers, excluding the ones related to dismembered sub-answers. 
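# --- Editor's note (illustrative trace, not part of the original file) -----
# Assuming group_by_typenames=True and chunked qa-ids suffixed with '_<int>',
# the loop below is expected to yield, e.g.:
#   >>> group_qas(['matriculas.endereco_1', 'matriculas.endereco_2'])
#   {'matriculas': [0, 1], 'matriculas.endereco_1': [0],
#    'matriculas.endereco': [0, 1], 'matriculas.endereco_2': [1]}
# i.e. one key per document class, one per full qa-id, and one per qa-id
# with the chunk suffix removed.
# ---------------------------------------------------------------------------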
31 | if group_by_typenames and '~' not in document_or_example_id: # '~' appears in sub-answers of originally compound answers 32 | proj = document_or_example_id.split('.')[0] 33 | if proj in qid_dict.keys(): 34 | qid_dict[proj].append(idx) 35 | else: 36 | qid_dict[proj] = [idx] 37 | 38 | if document_or_example_id in qid_dict.keys(): 39 | qid_dict[document_or_example_id].append(idx) 40 | else: 41 | qid_dict[document_or_example_id] = [idx] 42 | 43 | # multiple chunks are suffixed with _i. Here we group those cases removing the suffix. 44 | if group_by_typenames: 45 | comp = None 46 | try: 47 | document_or_example_id, comp = document_or_example_id.rsplit('~', 1) 48 | except: 49 | pass 50 | 51 | try: 52 | doc_ex_id, t = document_or_example_id.rsplit('_', 1) 53 | has_asterisk = t.endswith('*') 54 | if comp is None: 55 | if has_asterisk: 56 | t = t[:-1] 57 | t = int(t.strip()) # try to convert suffix to integer 58 | if comp is not None: 59 | doc_ex_id += '~' + comp 60 | elif has_asterisk: 61 | doc_ex_id += '*' 62 | 63 | if doc_ex_id in qid_dict.keys(): 64 | qid_dict[doc_ex_id].append(idx) 65 | else: 66 | qid_dict[doc_ex_id] = [idx] 67 | except: 68 | pass 69 | 70 | return qid_dict 71 | 72 | 73 | def split_compound_labels_and_predictions( 74 | labels: List[T5_SENTENCE], predictions: List[T5_SENTENCE], document_ids: List[str], 75 | example_ids: List[str], probs: List[float], window_ids: List[str], keep_original_compound: bool = True, 76 | keep_disjoint_compound: bool = True 77 | ) -> Tuple[List[T5_SENTENCE], List[T5_SENTENCE], List[str], List[str], 78 | List[float], List[int], List[int], List[str], List[int], Dict]: 79 | """Splits compound answers into individual subsentences (complete sub-answers) 80 | like \"[SENT1] [Estado]: SP [aparece no texto]: São Paulo\", extending the 81 | original label and prediction sets. 82 | 83 | This is useful in inference for getting individual metrics for each 84 | subsentence that composes a compound answer. 85 | 86 | For predictions, the function keeps only the first occurrence of the 87 | type-names that compose the labels. If the prediction has some type-name 88 | that is absent in the label, it is ignored. If the prediction has a 89 | sentence-id or raw-text but the label does not have them, those terms are 90 | considered part of the prediction, and will certainly result in a misprediction. 91 | 92 | If keep_original_compound, the function returns the indices of the original 93 | sentences, ignoring the ones that reference individual subsentences. This is 94 | useful for getting metrics only for the answers as returned by the model. 95 | The metrics will appear as 'ORIG' in the metrics file. 96 | 97 | If keep_disjoint_compound, the function returns, for each document class, 98 | (1) the indices of non-compound answers and (2) the indices of subsentences 99 | of compound answers (ignoring the compound answers). In both cases, the 100 | indices reference the sentences with sentence-ids and raw-text complements 101 | already filtered. This is useful for getting metrics for each document class 102 | that can be compared with other experiments that do not use compound qas 103 | and/or do not use sentence-ids and raw-text complements. The metrics will 104 | appear prefixed by 'DISJOINT_' in the metrics file.
105 | 106 | Examples: 107 | >>> labels = ['[SENT1] [Tipo de Logradouro]: Rua [SENT1] [Logradouro]: Abert Einstein'] 108 | >>> predictions = ['[SENT1] [Tipo de Logradouro]: Rua [SENT1] [Logradouro]: 41bert Ein5tein [SENT1] [Bairro]: Cidade Universitária'] 109 | >>> labels, predictions, document_ids, example_ids, probs, window_ids, sent_ids, raw_texts, _, _ = \ 110 | >>> split_compound_labels_and_predictions(labels, predictions, ['doc_1'], ['matriculas.endereco'], [0.98], ['1 1']) 111 | >>> print(labels) 112 | ['[SENT1] [tipo_de_logradouro]: Rua [SENT1] [fp_logradouro]: Abert Einstein', '[SENT1] [tipo_de_logradouro]: Rua', '[tipo_de_logradouro]: Rua', '[SENT1] [fp_logradouro]: Abert Einstein', '[fp_logradouro]: Abert Einstein'] 113 | >>> print(predictions) 114 | ['[SENT1] [tipo_de_logradouro]: Rua [SENT1] [fp_logradouro]: 41bert Ein5tein [SENT1] [fp_bairro]: Cidade Universitária', '[SENT1] [tipo_de_logradouro]: Rua', '[tipo_de_logradouro]: Rua', '[SENT1] [fp_logradouro]: 41bert Ein5tein', '[fp_logradouro]: 41bert Ein5tein'] 115 | >>> print(document_ids) 116 | ['doc_1', 'doc_1', 'doc_1', 'doc_1', 'doc_1'] 117 | >>> print(example_ids) 118 | ['matriculas.endereco', 'matriculas.endereco~tipo_de_logradouro', 'matriculas.endereco~tipo_de_logradouro*', 'matriculas.endereco~fp_logradouro', 'matriculas.endereco~fp_logradouro*'] 119 | >>> print(probs) 120 | [0.98, 0.0, 0.0, 0.0, 0.0] 121 | >>> print(window_ids) 122 | [[1, 1], [1], [1], [1], [1]] 123 | >>> print(sent_ids) 124 | [None, None, [1], None, [1]] 125 | >>> print(raw_texts) 126 | [None, None, None, None, None] 127 | 128 | Returns: 129 | labels, predictions, document_ids, example_ids, probs, window_ids, sent_ids, raw_texts 130 | """ 131 | labels_new, predictions_new = [], [] 132 | document_ids_new, example_ids_new, probs_new, window_ids_new, sent_ids, raw_texts = \ 133 | [], [], [], [], [], [] 134 | original_idx = [] 135 | disjoint_answer_idx_by_doc_class = {} 136 | 137 | for label, prediction, doc_id, ex_id, prob, window_id in zip( 138 | labels, predictions, document_ids, example_ids, probs, window_ids): 139 | window_id = [ int(w) for w in window_id.split(' ') ] 140 | label_subsentences, label_type_names = deconstruct_answer(label) 141 | prediction_subsentences, prediction_type_names = deconstruct_answer(prediction) 142 | 143 | # this is not compound answer, then get the original label/predicion pair 144 | if len(label_type_names) <= 1 or keep_original_compound: 145 | label = ' '.join(label_subsentences) 146 | prediction = ' '.join(prediction_subsentences) 147 | 148 | labels_new.append(label) 149 | predictions_new.append(prediction) 150 | document_ids_new.append(doc_id) 151 | example_ids_new.append(ex_id) 152 | probs_new.append(prob) 153 | window_ids_new.append(window_id) 154 | sent_ids.append(None) 155 | raw_texts.append(None) 156 | 157 | # indexes to compute the f1 and exact ONLY with original (non-splitted) answers 158 | if keep_original_compound: 159 | idx = len(labels_new) - 1 160 | original_idx.append(idx) 161 | 162 | if len(label_type_names) <= 1: 163 | # remove sent-id and raw-text complement, if the label has, 164 | # in order to get metric only for the response per se. 
165 | label_sa = get_subanswer_from_subsentence(label) 166 | pred_sa = get_subanswer_from_subsentence(prediction) 167 | 168 | raw_text = get_raw_answer_from_subsentence(prediction_subsentences[0]) 169 | sent_id = find_ids_of_sent_tokens(prediction_subsentences[0]) 170 | 171 | ex_id_ = ex_id + '*' 172 | 173 | labels_new.append(label_sa) 174 | predictions_new.append(pred_sa) 175 | document_ids_new.append(doc_id) 176 | example_ids_new.append(ex_id_) 177 | probs_new.append(prob) 178 | window_ids_new.append(window_id) 179 | sent_ids.append(sent_id) 180 | raw_texts.append(raw_text) 181 | 182 | # keep by-document-class the indices of non-compound answers. 183 | # The sent-id and raw-text complement are already filtered. 184 | if keep_disjoint_compound: 185 | idx = len(labels_new) - 1 186 | doc_class = ex_id.split('.')[0] 187 | if doc_class in disjoint_answer_idx_by_doc_class.keys(): 188 | disjoint_answer_idx_by_doc_class[doc_class].append(idx) 189 | else: 190 | disjoint_answer_idx_by_doc_class[doc_class] = [idx] 191 | 192 | if len(label_type_names) > 1: 193 | window_id = window_id[:1] # for compound qa, the window_id is repeated 194 | for label_ss, label_tn in zip(label_subsentences, label_type_names): 195 | 196 | try: 197 | # the same type-name was predicted, get the first occurrence 198 | pred_idx = prediction_type_names.index(label_tn) 199 | pred_ss = prediction_subsentences[pred_idx] 200 | except: 201 | # the same type-name was not predicted, use empty 202 | pred_ss = '' 203 | 204 | ex_id_ = ex_id + '~' + label_tn 205 | 206 | labels_new.append(label_ss) 207 | predictions_new.append(pred_ss) 208 | document_ids_new.append(doc_id) 209 | example_ids_new.append(ex_id_) 210 | probs_new.append(0.0) 211 | window_ids_new.append(window_id) 212 | sent_ids.append(None) 213 | raw_texts.append(None) 214 | 215 | # remove sent-id and raw-text complement, if the label has, 216 | # in order to get metric only for the response per se 217 | label_sa = get_subanswer_from_subsentence(label_ss) 218 | pred_sa = get_subanswer_from_subsentence(pred_ss) 219 | 220 | raw_text = get_raw_answer_from_subsentence(pred_ss) 221 | sent_id = find_ids_of_sent_tokens(pred_ss) 222 | 223 | ex_id_ = ex_id + '~' + label_tn + '*' 224 | 225 | labels_new.append(label_sa) 226 | predictions_new.append(pred_sa) 227 | document_ids_new.append(doc_id) 228 | example_ids_new.append(ex_id_) 229 | probs_new.append(0.0) 230 | window_ids_new.append(window_id) 231 | sent_ids.append(sent_id) 232 | raw_texts.append(raw_text) 233 | 234 | # keep by-document-class only indices of sub-responses for compound answers, 235 | # not the original compound answer. 236 | # The sent-id and raw-text complement are already filtered. 
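# Illustrative sketch (comment added here; the indices are hypothetical): at the end of
# the loop, this mapping might look like {'matriculas': [2, 4, 9]}, where each index
# points at a '*'-suffixed entry appended just above.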
237 | if keep_disjoint_compound: 238 | idx = len(labels_new) - 1 239 | doc_class = ex_id.split('.')[0] 240 | if doc_class in disjoint_answer_idx_by_doc_class.keys(): 241 | disjoint_answer_idx_by_doc_class[doc_class].append(idx) 242 | else: 243 | disjoint_answer_idx_by_doc_class[doc_class] = [idx] 244 | 245 | return labels_new, predictions_new, document_ids_new, example_ids_new, probs_new, \ 246 | window_ids_new, sent_ids, raw_texts, original_idx, disjoint_answer_idx_by_doc_class 247 | 248 | 249 | def get_highest_probability_window( 250 | labels: List[T5_SENTENCE], predictions: List[T5_SENTENCE], 251 | document_ids: List[str], example_ids: List[str], probs: List[float], 252 | use_fewer_NA: bool = False 253 | ) -> Tuple[List[T5_SENTENCE], List[T5_SENTENCE], List[str], List[str], List[float], List[str]]: 254 | """Get the highest-probability components for each pair document, example. 255 | """ 256 | if use_fewer_NA: 257 | na_cases = [ pred.count('N/A') for pred in predictions ] 258 | arr = np.vstack([np.array(labels), np.array(predictions), 259 | np.array(document_ids), np.array(example_ids), 260 | np.array(na_cases), np.array(probs, dtype=object)]).transpose() 261 | df1 = pd.DataFrame(arr, columns=['labels', 'predictions', 'document_ids', 'example_ids', 'na', 'probs']) 262 | else: 263 | arr = np.vstack([np.array(labels), np.array(predictions), 264 | np.array(document_ids), np.array(example_ids), 265 | np.array(probs, dtype=object)]).transpose() 266 | df1 = pd.DataFrame(arr, columns=['labels', 'predictions', 'document_ids', 'example_ids', 'probs']) 267 | 268 | # include windows-id to access which window got the highest probability. 269 | # In case of compound qa, the windows-id is replicated for each 270 | # prediction subsentence 271 | df1['window_ids'] = df1.groupby(['document_ids', 'example_ids']).cumcount().astype(str) 272 | df1['window_ids'] = df1.apply(lambda x: ' '.join([x['window_ids']] * len(deconstruct_answer(x['predictions'])[0]) ), axis=1) 273 | 274 | if use_fewer_NA: 275 | # get the highest-probability sample among cases with fewer number if N/As 276 | # for each pair document-id / example-id 277 | df1 = df1.sort_values(['na', 'probs'], ascending=[True, False]).groupby(['document_ids', 'example_ids']).head(1) 278 | df1.sort_index(inplace=True) 279 | 280 | labels, predictions, document_ids, example_ids, _, probs, window_ids = df1.T.values.tolist() 281 | else: 282 | # get the highest-probability sample for each pair document-id / example-id 283 | df1 = df1.sort_values('probs', ascending=False).groupby(['document_ids', 'example_ids']).head(1) 284 | df1.sort_index(inplace=True) 285 | 286 | labels, predictions, document_ids, example_ids, probs, window_ids = df1.T.values.tolist() 287 | 288 | return labels, predictions, document_ids, example_ids, probs, window_ids 289 | -------------------------------------------------------------------------------- /information_extraction_t5/features/preprocess.py: -------------------------------------------------------------------------------- 1 | """Utility methods to preprocess model input.""" 2 | from collections import Counter 3 | from typing import Dict, List, Optional, OrderedDict, Tuple, Union 4 | 5 | from information_extraction_t5.features.questions import ( 6 | COMPLEMENT, 7 | QUESTION, 8 | QUESTION_DICT, 9 | QUESTIONS as ALL_QUESTIONS, 10 | SUBQUESTION_DICT, 11 | ) 12 | from information_extraction_t5.features.questions.type_map import COMPLEMENT_TYPE 13 | from information_extraction_t5.features.sentences import SENT_TOKEN 14 | 15 | # Large 
number to not let the number of sentences be too large for a model.
16 | MAX_SENTENCES = 9999
17 | 
18 | 
19 | def _replace_brackets_with_parenthesis(text: str) -> str:
20 |     text = text.replace('{', '(')
21 |     text = text.replace('}', ')')
22 | 
23 |     return text
24 | 
25 | 
26 | def _replace_linebreak_with_token_patterns(
27 |     text: str, token_pattern: str = SENT_TOKEN
28 | ) -> Tuple[str, int]:
29 |     """Returns new string with `\n` replaced with the token pattern and the
30 |     number of tokens."""
31 |     num_tokens = text.count('\n')
32 |     text = text.replace('\n', token_pattern)
33 | 
34 |     return text, num_tokens
35 | 
36 | 
37 | def _replace_linebreaks_with_tokens(text: str) -> str:
38 |     r"""Replaces every `\n` in a string with a numbered SENT token.
39 | 
40 |     If the input string has brackets, they will be replaced with parentheses.
41 |     Always adds at least one SENT token at the beginning of the new sentence.
42 |     Tokens are numbered starting from 1.
43 | 
44 |     Args:
45 |         text: string to have `\n` replaced. It can't be split into more than
46 |             MAX_SENTENCES.
47 | 
48 |     Examples:
49 |         >>> sentence = 'Rua PEDRO BIAGI 462 Apartamento nº 103, 1º Andar do RESIDENCIAL IMPERIAL. Sertãozinho\nSP'
50 |         >>> new_sentence = _replace_linebreaks_with_tokens(sentence)
51 |         >>> print(new_sentence)
52 |         ' [SENT1] Rua PEDRO BIAGI 462 Apartamento nº 103, 1º Andar do RESIDENCIAL IMPERIAL. Sertãozinho [SENT2] SP'
53 | 
54 |     Returns:
55 |         New string with tokens instead of `\n`
56 |     """
57 |     # Should have at least one SENT token at start
58 |     text = '\n' + text
59 |     text = _replace_brackets_with_parenthesis(text)
60 |     text, num_tokens = _replace_linebreak_with_token_patterns(text)
61 | 
62 |     assert num_tokens <= MAX_SENTENCES, 'Maximum number of sentences violated.'
63 | 
64 |     # token numeration must start from 1
65 |     text = text.format(*range(1, num_tokens + 1))
66 | 
67 |     return text
68 | 
69 | 
70 | def _replace_linebreaks_with_spaces(text: str) -> str:
71 |     r"""Replaces every `\n` in a string with a space.
72 | 
73 |     Examples:
74 |         >>> sentence = 'Rua PEDRO BIAGI 462 Apartamento nº 103, 1º Andar do RESIDENCIAL IMPERIAL. Sertãozinho\nSP'
75 |         >>> new_sentence = _replace_linebreaks_with_spaces(sentence)
76 |         >>> print(new_sentence)
77 |         'Rua PEDRO BIAGI 462 Apartamento nº 103, 1º Andar do RESIDENCIAL IMPERIAL. Sertãozinho SP'
78 |     """
79 |     text = text.replace('\n', ' ')
80 | 
81 |     return text
82 | 
83 | 
84 | def _get_id_based_on_linebreaks(context: str, answer_position: int) -> int:
85 |     """Recovers the sentence-id assuming the context is always partitioned based
86 |     on occurrences of linebreaks.
87 | 
88 |     Args:
89 |         context: text context of the question
90 |         answer_position: index of last character from answer.
91 |     """
92 |     if answer_position == -1:
93 |         return 0
94 | 
95 |     sent_id = Counter(context[:answer_position])['\n'] + 1
96 | 
97 |     return sent_id
98 | 
99 | 
100 | def get_questions_for_chunk(
101 |     qa_id: str = 'matriculas.imovel.comarca', is_compound: bool = False,
102 |     return_dict: bool = False, all_questions: QUESTION_DICT = ALL_QUESTIONS
103 | ) -> Union[List[QUESTION], QUESTION_DICT, SUBQUESTION_DICT]:
104 |     """Returns a list of questions for a specific qa_id, or a dict mapping
105 |     typenames to questions for building compound answers. The function can
106 |     also return all the questions.
107 | 
108 |     Args:
109 |         qa_id: the id of a question-answer, generally represented by dot-separated
110 |             document class, chunks typenames, and possibly subchunks typenames.
111 | use 'all' to get a dictionary containing all the possible questions. 112 | is_compound: if qa_id represents a compound field. 113 | return_dict: if the function should return a dict that is useful to build 114 | compound answers. 115 | all_questions: Dictionary with all the questions and subquestions. 116 | 117 | Examples: 118 | >>> questions = {'question1': {'subquestion1': ['What?']}} 119 | >>> get_questions_for_chunk('all', all_questions=questions) 120 | {'question1': {'subquestion1': ['What?']}} 121 | >>> get_questions_for_chunk('matriculas.question1', all_questions=questions) 122 | {'subquestion1': ['What?']} 123 | 124 | Returns: 125 | List of all questions for a specific field, or dictionary with all the questions 126 | for a compound field. 127 | """ 128 | if qa_id == 'all': 129 | return all_questions 130 | 131 | typenames = qa_id.split('.') 132 | questions = all_questions 133 | for typename in typenames: 134 | questions = questions[typename] 135 | 136 | if is_compound: 137 | questions = questions['compound'] 138 | 139 | assert isinstance(questions, List) != return_dict, ( 140 | f'Shouldn\'t you set "is_compound=True" for the field {qa_id} to get a ' 141 | 'list of questions for a specific compound typename? Or set ' 142 | '"return_signature=True" to get the ordered dict of typenames to build ' 143 | 'a compound answer?') 144 | 145 | return questions 146 | 147 | 148 | def get_qa_ids_recursively( 149 | dict_or_list, base_qa_id, list_of_use_compound_question, 150 | list_of_compound_chunks_to_ignore, list_of_subchunks_to_skip, 151 | qa_ids_list=[] 152 | ) -> List[str]: 153 | """Auxiliar function to get recursively all the possible qa_ids""" 154 | 155 | if isinstance(dict_or_list, List) and not base_qa_id.endswith('compound'): 156 | qa_ids_list.append(base_qa_id) 157 | 158 | if isinstance(dict_or_list, Dict) or isinstance(dict_or_list, OrderedDict): 159 | if base_qa_id in list_of_use_compound_question: 160 | qa_ids_list.append(base_qa_id) 161 | 162 | elif base_qa_id not in list_of_compound_chunks_to_ignore: 163 | for typename, value in dict_or_list.items(): 164 | if typename not in list_of_subchunks_to_skip: 165 | qa_id = f'{base_qa_id}.{typename}' 166 | _ = get_qa_ids_recursively( 167 | value.copy(), qa_id, 168 | list_of_use_compound_question, 169 | list_of_compound_chunks_to_ignore, 170 | list_of_subchunks_to_skip, qa_ids_list) 171 | 172 | return qa_ids_list 173 | 174 | 175 | def get_all_qa_ids( 176 | document_class: Optional[str] = None, 177 | list_of_type_names: List[str] = [], 178 | list_of_use_compound_question: List[str] = [], 179 | list_of_subchunks_to_list: List[str] = [], 180 | list_subchunks_to_complement_siblings: List[str] = [], 181 | list_of_subchunks_to_skip: List[str] = [], 182 | all_questions: QUESTION_DICT = ALL_QUESTIONS 183 | ) -> List[str]: 184 | """Returns a list of all possible qa_ids that will be used to force 185 | the qa, even if the chunk does not exist. 186 | 187 | Args: 188 | document_class: class of documents to extract qa_ids. Use None to 189 | get for all possible document classes. 190 | list_of_typenames: list of type-names. 191 | list_of_use_compound_question: list of compound qa_ids. 192 | list_of_subchunks_to_list: list of listing qa_ids. 193 | list_subchunks_to_complement_sibling_questions: list of subchunks that 194 | will complement siblings, and does not require a qa. 195 | list_of_subchunks_to_skip: list of subchunks that will be skipped. 
196 | 197 | Examples: 198 | >>> typenames = ['matriculas.imovel', 'matriculas.endereco', 'certidoes.resultado'] 199 | >>> use_compound = ['matriculas.endereco'] 200 | >>> get_all_qa_ids('matriculas', typenames, use_compound) 201 | ['matriculas.imovel.no_da_matricula', 'matriculas.imovel.oficio', 'matriculas.imovel.comarca', 202 | 'matriculas.imovel.estado', 'matriculas.endereco'] 203 | 204 | Returns: 205 | List of all possible qa_ids. 206 | """ 207 | all_qa_ids = [] 208 | 209 | # ignore chunks for which one subchunk will complement siblings. 210 | # we cannot force qas, since the question depends on a information 211 | # that is possibly non-annotated. 212 | list_of_compound_chunks_to_ignore = [sc.rsplit('.', 1)[0] 213 | for sc in list_subchunks_to_complement_siblings] 214 | 215 | for doc_class, questions_dict in all_questions.items(): 216 | if document_class is not None and doc_class != document_class: continue 217 | 218 | for typename, list_or_dict in questions_dict.items(): 219 | qa_id = f'{doc_class}.{typename}' 220 | 221 | if qa_id in list_of_type_names: 222 | qa_ids = get_qa_ids_recursively(list_or_dict, qa_id, list_of_use_compound_question, 223 | list_of_compound_chunks_to_ignore, list_of_subchunks_to_skip, []) 224 | for qa_id in qa_ids: 225 | all_qa_ids.append(qa_id) 226 | 227 | # for listing qa_ids, keep only document-class and last subchunk 228 | # with the suffix "_list" 229 | for qa_id in list_of_subchunks_to_list: 230 | typenames = qa_id.split('.') 231 | if document_class is None or document_class == typenames[0]: 232 | qa_id = f'{typenames[0]}.{typenames[-1]}_list' 233 | all_qa_ids.append(qa_id) 234 | 235 | return all_qa_ids 236 | 237 | 238 | def complement_questions_to_require_rawdata( 239 | questions: Union[QUESTION, List[QUESTION]], complement: str = COMPLEMENT 240 | ) -> Union[QUESTION, List[QUESTION]]: 241 | """Add complementary text to a question or questions. 242 | 243 | This indicates to the model it must give a subanswer with part of the 244 | context's raw text. 245 | """ 246 | if isinstance(questions, str): # simple question 247 | questions = questions.replace('?', complement) 248 | if isinstance(questions, list): # list of questions 249 | questions = [q.replace('?', complement) for q in questions] 250 | return questions 251 | 252 | 253 | def generate_t5_input_sentence( 254 | context: str, question: str, use_sentence_id: bool 255 | ) -> str: 256 | """Returns a T5 input sentence based on a question and its context. 257 | 258 | Args: 259 | context: text context of the question 260 | question: the question 261 | use_sentence_id: if True, every newline on the context will be replaced 262 | by a SENT token. Otherwise they are replaced with spaces. 263 | """ 264 | if use_sentence_id: 265 | context = _replace_linebreaks_with_tokens(context) 266 | else: 267 | context = _replace_linebreaks_with_spaces(context) 268 | 269 | t5_sentence = f'question: {question} context: {context}' 270 | return t5_sentence 271 | 272 | 273 | def generate_t5_label_sentence( 274 | answer: str, answer_start: Union[List[int], int], context: str, 275 | use_sentence_id: bool 276 | ) -> str: 277 | """Returns a T5 label sentence for simple or compound answers. 278 | 279 | Args: 280 | answer: answer of the current questions 281 | answer_start: char position of answer starting 282 | context: text context of the question 283 | use_sentence_id: if True, every newline on the context will be replaced 284 | by a SENT token. Otherwise they are replaced with spaces. 
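
    Example:
        An illustrative sketch (the context and answers are hypothetical, not
        taken from the project's datasets): for a three-line context whose
        second line is "500,00" and whose third line is "metro quadrado", the
        simple answer "500,00" with answer_start at the first character of
        line 2 becomes "[SENT2] 500,00", while the compound answer
        "[Valor]: 500,00 [Unidade]: metro quadrado" with answer_start set to
        [start of line 2, start of line 3] becomes
        "[SENT2] [Valor]: 500,00 [SENT3] [Unidade]: metro quadrado".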
285 | """ 286 | if use_sentence_id: 287 | if isinstance(answer_start, list): 288 | # That is a compound_answer, like: "[Valor]: 500,00 [Unidade]: metro_quadrado" 289 | 290 | # Separate the compound answer in sub-answers: --, Valor] 500,00, Unidade] metro_quadrado 291 | # that could be problematic if some sub-answer has brackets, besides COMPLEMENT_TYPE 292 | sub_answers = answer.split('[')[1:] 293 | token_pattern = SENT_TOKEN.strip() 294 | 295 | # Extract sentence-ids for each sub-answer 296 | sent_ids = [] 297 | for sub_answer_start in answer_start: 298 | sent_ids.append(_get_id_based_on_linebreaks(context, 299 | sub_answer_start)) 300 | 301 | # Prepare the final answer with sentence-ids: "[SENTx] [Valor]: 500,00 [SENTy] [Unidade]: metro_quadrado" 302 | answer = '' 303 | for sub_answer in sub_answers: 304 | if sub_answer.startswith(COMPLEMENT_TYPE): 305 | answer = f'{answer}[{sub_answer}' 306 | else: 307 | answer = f'{answer}{token_pattern} [{sub_answer}' 308 | 309 | # Include the sentence-ids 310 | answer = answer.format(*sent_ids) 311 | elif isinstance(answer_start, int): 312 | # That is a simple answer 313 | 314 | sent_id = _get_id_based_on_linebreaks(context, answer_start) 315 | answer = f'[SENT{sent_id}] {answer}' 316 | else: 317 | # That is an occurrence of non-annotated data, as publicacoes (null in squad json) 318 | # [SENTX] is not included 319 | pass 320 | 321 | return answer 322 | -------------------------------------------------------------------------------- /information_extraction_t5/features/questions/__init__.py: -------------------------------------------------------------------------------- 1 | from .questions import * 2 | -------------------------------------------------------------------------------- /information_extraction_t5/features/questions/questions.py: -------------------------------------------------------------------------------- 1 | """Map of all questions, subquestions and its corresponding names. 2 | Compound type-names must be represented as a OrderedDict with 'compound' and 3 | subchunks' type-names as keys (even with empty lists) to keep a signature used 4 | to prepare the compound answers. 5 | 6 | A question name might have associated with it one of two things: 7 | 1. A list of questions. 8 | 2. A dictionary containing names of subquestions, which have their own list of 9 | subquestions. 10 | """ 11 | from collections import OrderedDict 12 | from typing import Dict, List, Union 13 | 14 | SUBQUESTION_NAME = str # Ex.: 'rua' 15 | SUBQUESTION = str # Ex.: 'Qual a rua?' 16 | QUESTION_NAME = str # Ex.: 'endereco' 17 | QUESTION = str # Ex.: 'Qual o endereco?' 18 | SUBQUESTION_DICT = Dict[SUBQUESTION_NAME, List[SUBQUESTION]] 19 | QUESTION_DICT = Dict[QUESTION_NAME, Union[SUBQUESTION_DICT, List[QUESTION]]] 20 | 21 | COMPLEMENT = ' e como aparece no texto?' # or 'and how does it appear in the text?' 
for EN 22 | 23 | _QUESTIONS_FORM = { 24 | 'etiqueta': [ 25 | 'Qual é o número da etiqueta?', 26 | ], 27 | 'agencia': [ 28 | 'Qual é o número da agência?', 29 | ], 30 | 'conta_corrente': [ 31 | 'Qual é o número da conta corrente?', 32 | ], 33 | 'cpf': [ 34 | 'Qual é o CPF/CNPJ?', 35 | 'Qual é o CPF do titular?', 36 | ], 37 | 'nome_completo': [ 38 | 'Qual é o nome?', 39 | 'Qual é o nome completo?', 40 | ], 41 | 'n_doc_serie': [ 42 | 'Qual é o número do documento ou número da série?', 43 | ], 44 | 'orgao_emissor': [ 45 | 'Qual é o órgão emissor?', 46 | ], 47 | 'doc_id_uf': [ 48 | 'Qual é o estado do documento de identificação?', 49 | 'Qual é a UF do documento de identificação?', 50 | ], 51 | 'data_emissao': [ 52 | 'Qual é a data de emissão?', 53 | ], 54 | 'data_nascimento': [ 55 | 'Qual é a data de nascimento?', 56 | ], 57 | 'nome_mae': [ 58 | 'Qual é o nome da mãe?', 59 | ], 60 | 'nome_pai': [ 61 | 'Qual é o nome do pai?', 62 | ], 63 | 'endereco': OrderedDict({ 64 | 'compound': [ 65 | 'Qual o endereço?', 66 | ], 67 | 'logradouro': [ 68 | 'Qual é o logradouro?', 69 | ], 70 | 'numero': [ 71 | 'Qual é o número?', 72 | ], 73 | 'complemento': [ 74 | 'Qual é o complemento?', 75 | ], 76 | 'bairro': [ 77 | 'Qual é o bairro?', 78 | ], 79 | 'cidade': [ 80 | 'Qual é a cidade?', 81 | ], 82 | 'estado': [ 83 | 'Qual é o estado?', 84 | ], 85 | 'cep': [ 86 | 'Qual é o CEP?', 87 | ] 88 | }), 89 | } 90 | 91 | # Include here other pairs (project, questions dict) for new datasets 92 | QUESTIONS = { 93 | 'form': _QUESTIONS_FORM, 94 | } 95 | -------------------------------------------------------------------------------- /information_extraction_t5/features/questions/type_map.py: -------------------------------------------------------------------------------- 1 | """Dictionaries that map type-names to types and vice-versa. The types are 2 | used as clues in brackets in T5 outputs. The type-names are recovered in 3 | post-processing stage. 4 | 5 | Each document class (project) has it own TYPENAME_TO_TYPE dictionary. We 6 | strongly recommend that the types used in all the projects be consistent, and 7 | as generic as possible. For example, using `CPF/CNPJ` for all CPFs and CNPJs, 8 | regardless of being a consultant, current account holder, business partner, 9 | land owner, etc. 10 | """ 11 | COMPLEMENT_TYPE = 'aparece no texto' # or 'appears in the text' for EN 12 | 13 | # Create a _NEWDATASET_TYPENAME_TO_TYPE for each new dataset, and 14 | # update the TYPENAME_TO_TYPE dict. 15 | 16 | _FORM_TYPENAME_TO_TYPE = { 17 | "etiqueta": "Etiqueta", 18 | "agencia": "Agência", 19 | "conta_corrente": "Conta Corrente", 20 | "cpf": "CPF/CNPJ", 21 | "nome_completo": "Nome", 22 | "n_doc_serie": "No do Documento", 23 | "orgao_emissor": "Órgão Emissor", 24 | "data_emissao": "Data de Emissão", 25 | "data_nascimento": "Data de Nascimento", 26 | "nome_mae": "Nome da Mãe", 27 | "nome_pai": "Nome do Pai", 28 | "endereco": "Endereço", 29 | "logradouro": "Logradouro", 30 | "numero": "Número", 31 | "complemento": "Complemento", 32 | "bairro": "Bairro", 33 | "cidade": "Cidade", 34 | "estado": "Estado", 35 | "cep": "CEP" 36 | } 37 | 38 | TYPENAME_TO_TYPE = { 39 | COMPLEMENT_TYPE: COMPLEMENT_TYPE, 40 | } 41 | TYPENAME_TO_TYPE.update(_FORM_TYPENAME_TO_TYPE) 42 | # TYPENAME_TO_TYPE.update(_NEWDATASET_TYPENAME_TO_TYPE) 43 | 44 | # This dict is used to recover the type-name by using the type. It is not 45 | # critical to recover exactly the original type-name (different typenames 46 | # can be mapped to the same type). 
Those type-names will be used in post
47 | # processing after splitting sentences.
48 | TYPE_TO_TYPENAME = {v: k for k, v in TYPENAME_TO_TYPE.items()}
49 | 
--------------------------------------------------------------------------------
/information_extraction_t5/features/sentences.py:
--------------------------------------------------------------------------------
1 | """Auxiliary methods for post-processing T5 input/output sentences."""
2 | import re
3 | from typing import List, Tuple, Union
4 | 
5 | from information_extraction_t5.features.questions.type_map import TYPE_TO_TYPENAME, COMPLEMENT_TYPE
6 | 
7 | SENTENCE_ID_PATTERN = r'\[SENT(.*?)\]'
8 | SUBANSWER_PATTERN = r'([^[\]]+)(?:$|\[)'
9 | TYPE_NAME_PATTERN = r'\[([A-Za-záàâãéèêíïóôõöúçñÁÀÂÃÉÈÍÏÓÔÕÖÚÇѺª_ \/]*?)\]'
10 | 
11 | SENT_TOKEN = ' [SENT{}] '
12 | T5_RAW_CONTEXT = str
13 | 
14 | # Type of a sentence that may have T5 identification tokens
15 | # Example: '[SENT1] [Comarca] Campinas'
16 | T5_SENTENCE = str
17 | 
18 | 
19 | def _has_text(string: str) -> bool:
20 |     """Returns True if a string has non-whitespace text."""
21 |     string_without_whitespace = string.strip()
22 |     return len(string_without_whitespace) > 0
23 | 
24 | 
25 | def _clean_sub_answer(sub_answer: str) -> str:
26 |     """Removes undesired characters from a sub answer.
27 | 
28 |     Removes any `:` and surrounding whitespace from the subanswer.
29 |     """
30 |     sub_answer = sub_answer.replace(':', '')
31 |     sub_answer = sub_answer.strip()
32 | 
33 |     return sub_answer
34 | 
35 | 
36 | def find_sub_answers(prediction_str: T5_SENTENCE) -> List[str]:
37 |     """Returns a list containing the sub answers of a T5 sentence in the order
38 |     they appear.
39 | 
40 |     Examples:
41 |         >>> sentence = '[SENT25] [Tipo de Logradouro]: Rua [SENT25] [Logradouro]: PEDRO BIAGI'
42 |         >>> sub_answers = find_sub_answers(sentence)
43 |         >>> print(sub_answers)
44 |         ['Rua', 'PEDRO BIAGI']
45 |     """
46 |     sub_answer_list = []
47 |     for sub_answer in re.findall(SUBANSWER_PATTERN, prediction_str):
48 |         if _has_text(sub_answer):
49 |             sub_answer = _clean_sub_answer(sub_answer)
50 |             sub_answer_list.append(sub_answer)
51 | 
52 |     return sub_answer_list
53 | 
54 | 
55 | def find_ids_of_sent_tokens(sentence: T5_SENTENCE) -> List[int]:
56 |     """Returns a list containing the IDs of the SENT tokens of a T5 sentence in
57 |     the order they appear.
58 | 
59 |     The ID is the number that follows a SENT token.
60 | 
61 |     Examples:
62 |         >>> sentence = '[SENT1] Campinas'
63 |         >>> ids = find_ids_of_sent_tokens(sentence)
64 |         >>> print(ids)
65 |         [1]
66 |     """
67 |     ids = []
68 |     for sentid in re.findall(SENTENCE_ID_PATTERN, sentence):
69 |         try:
70 |             ids.append(int(sentid))
71 |         except:
72 |             ids.append(sentid)
73 | 
74 |     return ids
75 | 
76 | 
77 | def _convert_name_from_t5_to_type_name(name: str) -> str:
78 |     """Converts a name output by T5 to a type name.
79 | 
80 |     When the model was trained, it learned to output display names from chunks.
81 |     This method replaces the display names with their type name version.
82 |     """
83 |     if name not in TYPE_TO_TYPENAME:
84 |         raise ValueError(f'Unknown type name: {name}')
85 | 
86 |     return TYPE_TO_TYPENAME[name]
87 | 
88 | 
89 | def find_type_names(sentence: T5_SENTENCE, map_type: bool = True) -> List[str]:
90 |     """Returns a list containing the names of the type tokens of a T5 sentence
91 |     in the order they appear.
92 | 
93 |     The name is the text that appears inside the type token.
94 | 
95 |     Examples:
96 |         >>> sentence = '[Logradouro] Campinas'
97 |         >>> type_names = find_type_names(sentence)
98 |         >>> print(type_names)
99 |         ['Logradouro']
100 |     """
101 |     type_names = re.findall(TYPE_NAME_PATTERN, sentence)
102 |     if map_type:
103 |         type_names = [
104 |             _convert_name_from_t5_to_type_name(name) for name in type_names
105 |         ]
106 | 
107 |     return type_names
108 | 
109 | 
110 | def split_context_into_sentences(
111 |     context: T5_RAW_CONTEXT
112 | ) -> List[str]:
113 |     """Splits a question context into multiple sentences.
114 | 
115 |     The splitting criterion is simply every linebreak found.
116 |     """
117 |     return context.split('\n')
118 | 
119 | 
120 | def split_t5_sentence_into_components(
121 |     sentence: T5_SENTENCE,
122 |     map_type: bool = True
123 | ) -> Tuple[List[int], List[str], List[str]]:
124 |     """Splits the string output by T5 into its components.
125 | 
126 |     If no occurrences of a component are found, returns an empty list for it.
127 | 
128 |     Components:
129 |         - sent ids: the ID that follows a SENT token.
130 |         - type names: the name inside an answer type token.
131 |         - sub answers: each answer fragment found.
132 |     Args:
133 |         sentence: a T5 output sentence.
134 | 
135 |     Examples:
136 |         >>> sentence = '[SENT25] [Tipo de Logradouro]: Rua [SENT25] [Logradouro]: PEDRO BIAGI [SENT26] [Número]: 462 [SENT25] [Cidade]: Sertãozinho [SENT0] [Estado]: SP'
137 |         >>> sent_ids, type_names, sub_answers = \
138 |         >>>     split_t5_sentence_into_components(sentence)
139 |         >>> print(sent_ids)
140 |         [25, 25, 26, 25, 0]
141 |         >>> print(type_names)
142 |         ['tipo_de_logradouro', 'logradouro', 'numero', 'cidade', 'estado']
143 |         >>> print(sub_answers)
144 |         ['Rua', 'PEDRO BIAGI', '462', 'Sertãozinho', 'SP']
145 | 
146 |     Returns:
147 |         Sentence ids, type names, answers/sub-answers
148 |     """
149 |     sent_ids = find_ids_of_sent_tokens(sentence)
150 |     type_names = find_type_names(sentence, map_type=map_type)
151 |     sub_answers = find_sub_answers(sentence)
152 | 
153 |     return sent_ids, type_names, sub_answers
154 | 
155 | 
156 | def check_sent_id_is_valid(
157 |     context: T5_RAW_CONTEXT, sent_id: int
158 | ) -> bool:
159 |     """Returns True if a SENT ID is valid.
160 | 
161 |     An ID is valid when it corresponds to the ID of a sentence or its ID is 0.
162 |     """
163 |     if sent_id < 0:
164 |         return False
165 | 
166 |     sentences = split_context_into_sentences(context)
167 | 
168 |     if len(sentences) < sent_id:
169 |         return False
170 | 
171 |     return True
172 | 
173 | 
174 | def deconstruct_answer(
175 |     answer_sentence: T5_SENTENCE = ''
176 | ) -> Tuple[List[T5_SENTENCE], List[str]]:
177 |     """Gets individual answer subsentences from the compound answer sentence.
178 | 
179 |     Args:
180 |         answer_sentence: a T5 output sentence.
181 | 182 | Examples: 183 | >>> sentence = '[SENT25] [Tipo de Logradouro]: Rua [SENT25] [Logradouro]: PEDRO BIAGI [SENT26] [Número]: 462 [SENT25] [Cidade]: Sertãozinho [SENT0] [Estado]: SP [aparece no texto] s paulo' 184 | >>> sub_sentences, type_names = deconstruct_answer(sentence) 185 | >>> print(sub_sentences) 186 | [ 187 | '[SENT25] [tipo_de_logradouro] Rua', 188 | '[SENT25] [logradouro] PEDRO BIAGI', 189 | '[SENT26] [numero] 462', 190 | '[SENT25] [cidade] Sertãozinho', 191 | '[SENT0] [estado] SP [aparece no texto] s paulo' 192 | ] 193 | >>> print(type_names) 194 | ['tipo_de_logradouro', 'logradouro', 'numero', 'cidade', 'estado'] 195 | 196 | Returns: 197 | sub-ansers and type-names 198 | """ 199 | sent_ids, type_names, sub_answers = split_t5_sentence_into_components(answer_sentence) 200 | sub_sentences = [] 201 | all_type_names = [] 202 | 203 | while len(sub_answers) > 0: 204 | sub_sentence = '' 205 | 206 | if len(sent_ids) > 0: 207 | sent_id = sent_ids.pop(0) 208 | sentence_token = SENT_TOKEN.format(sent_id).strip() 209 | sub_sentence += sentence_token + ' ' 210 | 211 | if len(type_names) > 0: 212 | type_name = type_names.pop(0) 213 | sub_sentence += f'[{type_name}]: ' 214 | all_type_names.append(type_name) 215 | 216 | sub_answer = sub_answers.pop(0) 217 | sub_sentence += f'{sub_answer} ' 218 | 219 | if len(type_names) > 0 and len(sub_answers) > 0 and type_names[0] == COMPLEMENT_TYPE: 220 | type_name = type_names.pop(0) 221 | sub_answer = sub_answers.pop(0) 222 | 223 | sub_sentence += f'[{type_name}] {sub_answer} ' 224 | 225 | sub_sentences.append(sub_sentence.strip()) 226 | 227 | return sub_sentences, all_type_names 228 | 229 | 230 | def get_subanswer_from_subsentence(subsentence: T5_SENTENCE) -> T5_SENTENCE: 231 | """Get only the sub-answer from the current subsentence. 232 | 233 | Args: 234 | subsentence: a T5 subsentence. 235 | 236 | Examples: 237 | >>> subsentence = [SENT1] [no_da_matricula] 88975 [aparece no texto] 88.975 238 | >>> subanswer = get_subanswer_from_subsentence(subsentence) 239 | >>> print(subanswer) 240 | [no_da_matricula]: 88975 241 | 242 | Returns: 243 | subanswer that corresponds to subsentence without SENT_TOKEN and COMPLEMENT_TYPE 244 | 245 | """ 246 | _, tn, ans = split_t5_sentence_into_components(subsentence, map_type=False) 247 | 248 | if len(ans) == 0: 249 | return '' 250 | 251 | if len(tn) == 0: 252 | subanswer = ans[0] 253 | else: 254 | subanswer = f'[{tn[0]}]: {ans[0]}' 255 | 256 | return subanswer 257 | 258 | 259 | def get_raw_answer_from_subsentence(subsentence: T5_SENTENCE) -> Union[str, None]: 260 | """Get only the raw-text answer from the current subsentence. 261 | 262 | Args: 263 | subsentence: a T5 subsentence. 264 | 265 | Examples: 266 | >>> subsentence = [SENT1] [no_da_matricula] 88975 [aparece no texto] 88.975 267 | >>> subanswer = get_raw_answer_from_subsentence(subsentence) 268 | >>> print(subanswer) 269 | 88.975 270 | 271 | Returns: 272 | subanswer that corresponds to subsentence without SENT_TOKEN and COMPLEMENT_TYPE 273 | 274 | """ 275 | try: 276 | return subsentence.split(f'[{COMPLEMENT_TYPE}]')[1].strip() 277 | except: 278 | return None 279 | 280 | 281 | def get_clean_answer_from_subanswer(subanswer: T5_SENTENCE) -> List[str]: 282 | """Get the final and pure answer from each sub-answer. 283 | 284 | Args: 285 | subanswer: subanswer extracted with function get_subanswer_from_subsentence. 
286 | 287 | Examples: 288 | >>> subanswer = '[no_da_matricula]: 88975' 289 | >>> answer_ = get_clean_answer_from_subanswer(subanswer) 290 | >>> print(answer) 291 | ['88975'] 292 | 293 | Returns: 294 | clean answers without the clues in square brackets 295 | """ 296 | try: 297 | return find_sub_answers(subanswer) 298 | except: 299 | return [''] 300 | -------------------------------------------------------------------------------- /information_extraction_t5/models/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /information_extraction_t5/models/qa_model.py: -------------------------------------------------------------------------------- 1 | """Model definition based on Pytorh-Lightning.""" 2 | import os 3 | import json 4 | import configargparse 5 | import numpy as np 6 | import pandas as pd 7 | 8 | import torch 9 | import pytorch_lightning as pl 10 | 11 | from deepspeed.ops.adam import DeepSpeedCPUAdam 12 | 13 | from transformers import ( 14 | AutoTokenizer, 15 | T5ForConditionalGeneration, 16 | T5Config, 17 | MT5ForConditionalGeneration, 18 | MT5Config 19 | ) 20 | 21 | from information_extraction_t5.features.postprocess import ( 22 | group_qas, 23 | get_highest_probability_window, 24 | split_compound_labels_and_predictions, 25 | ) 26 | from information_extraction_t5.features.sentences import ( 27 | get_clean_answer_from_subanswer 28 | ) 29 | from information_extraction_t5.utils.metrics import ( 30 | normalize_answer, 31 | t5_qa_evaluate, 32 | compute_exact, 33 | compute_f1 34 | ) 35 | from information_extraction_t5.utils.freeze import freeze_embeds 36 | 37 | class QAClassifier(torch.nn.Module): 38 | def __init__(self, hparams): 39 | super().__init__() 40 | self.hparams.update(vars(hparams)) 41 | 42 | if 'mt5' in self.hparams.config_name: 43 | config = MT5Config.from_pretrained( 44 | self.hparams.config_name if self.hparams.config_name else self.hparams.model_name_or_path, 45 | cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None, 46 | ) 47 | self.model = MT5ForConditionalGeneration.from_pretrained( 48 | self.hparams.model_name_or_path, 49 | from_tf=bool(".ckpt" in self.hparams.model_name_or_path), 50 | config=config, 51 | cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None, 52 | ) 53 | else: 54 | config = T5Config.from_pretrained( 55 | self.hparams.config_name if self.hparams.config_name else self.hparams.model_name_or_path, 56 | cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None, 57 | ) 58 | self.model = T5ForConditionalGeneration.from_pretrained( 59 | self.hparams.model_name_or_path, 60 | from_tf=bool(".ckpt" in self.hparams.model_name_or_path), 61 | config=config, 62 | cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None, 63 | ) 64 | self.tokenizer = AutoTokenizer.from_pretrained( 65 | self.hparams.tokenizer_name if self.hparams.tokenizer_name else self.hparams.model_name_or_path, 66 | do_lower_case=self.hparams.do_lower_case, 67 | use_fast=False, 68 | cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None, 69 | ) 70 | 71 | if 'byt5' in self.hparams.model_name_or_path.lower(): 72 | self.input_max_length = self.hparams.max_size # chars 73 | else: 74 | self.input_max_length = self.hparams.max_seq_length # tokens 75 | 76 | # use for faster training/larger batch size 77 | freeze_embeds(self.model) 78 | 79 | # filename for cache predictions 80 | self.cache_fname = os.path.join( 81 
| self.hparams.data_dir if self.hparams.data_dir else ".", 82 | "cached_predictions_{}.pkl".format( 83 | list(filter(None, self.hparams.model_name_or_path.split("/"))).pop() 84 | ) 85 | ) 86 | 87 | def forward(self, x): 88 | return self.model(x) 89 | 90 | class LitQA(QAClassifier, pl.LightningModule): 91 | 92 | def configure_optimizers(self): 93 | optimizer = self.get_optimizer() 94 | return optimizer 95 | 96 | def training_step(self, batch, batch_idx): 97 | sentences, labels = batch 98 | 99 | sentences_tokens = self.tokenizer.batch_encode_plus( 100 | sentences, padding=True, truncation=True, 101 | max_length=self.input_max_length, return_tensors='pt' 102 | ) 103 | labels = self.tokenizer.batch_encode_plus( 104 | labels, padding=True, truncation=True, 105 | max_length=self.input_max_length, return_tensors='pt' 106 | ) 107 | 108 | inputs = { 109 | "input_ids": sentences_tokens['input_ids'].to(self.device), 110 | "labels": labels['input_ids'].to(self.device), 111 | "attention_mask": sentences_tokens['attention_mask'].to(self.device), 112 | } 113 | 114 | outputs = self.model(**inputs) 115 | 116 | self.log('train_loss', outputs[0], on_step=True, on_epoch=True, 117 | prog_bar=True, batch_size=len(sentences) 118 | ) 119 | return {'loss': outputs[0]} 120 | 121 | def validation_step(self, batch, batch_idx): 122 | sentences, labels, _, _ = batch 123 | 124 | sentences_tokens = self.tokenizer.batch_encode_plus( 125 | sentences, padding=True, truncation=True, 126 | max_length=self.input_max_length, return_tensors='pt' 127 | ) 128 | 129 | inputs = { 130 | "input_ids": sentences_tokens['input_ids'].to(self.device), 131 | "attention_mask": sentences_tokens['attention_mask'].to(self.device), 132 | "max_length": self.hparams.max_length, 133 | } 134 | 135 | outputs = self.model.generate(**inputs) 136 | predictions = self.tokenizer.batch_decode(outputs, skip_special_tokens=True) 137 | 138 | return {'labels': labels, 'preds': predictions} 139 | 140 | def test_step(self, batch, batch_idx): 141 | sentences, labels, document_ids, typename_ids = batch 142 | 143 | # if we are using cached predictions, is not necessary to run steps again 144 | if self.hparams.use_cached_predictions and os.path.exists(self.cache_fname): 145 | return {'labels': [], 'preds': [], 'doc_ids': [], 'tn_ids': [], 'probs': []} 146 | 147 | sentences_tokens = self.tokenizer.batch_encode_plus( 148 | sentences, padding=True, truncation=True, 149 | max_length=self.input_max_length, return_tensors='pt' 150 | ) 151 | 152 | # This is handled differently then the others because of conflicts of 153 | # the previous approach with quantization. 
154 | inputs = { 155 | "input_ids": sentences_tokens['input_ids'].to(self.device).long(), 156 | "attention_mask": sentences_tokens['attention_mask'].to(self.device).long(), 157 | "max_length": self.hparams.max_length, 158 | "num_beams": self.hparams.num_beams, 159 | "early_stopping": True, 160 | } 161 | 162 | outputs = self.model.generate(**inputs) 163 | predictions = self.tokenizer.batch_decode(outputs, skip_special_tokens=True) 164 | 165 | # compute probs 166 | probs = self._compute_probs(sentences, predictions) 167 | 168 | return { 169 | 'labels': labels, 'preds': predictions, 'doc_ids': document_ids, 170 | 'tn_ids': typename_ids, 'probs': probs 171 | } 172 | 173 | def validation_epoch_end(self, outputs): 174 | predictions, labels = [], [] 175 | for output in outputs: 176 | for label, pred in zip(output['labels'], output['preds']): 177 | predictions.append(pred) 178 | labels.append(label) 179 | 180 | results = t5_qa_evaluate(labels, predictions) 181 | exact = torch.tensor(results['exact']) 182 | f1 = torch.tensor(results['f1']) 183 | 184 | log = { 185 | 'val_exact': exact, # for monitoring checkpoint callback 186 | 'val_f1': f1, # for monitoring checkpoint callback 187 | } 188 | self.log_dict(log, logger=True, prog_bar=True, on_epoch=True) 189 | 190 | def test_epoch_end(self, outputs): 191 | predictions, labels, document_ids, typename_ids, probs, window_ids = \ 192 | [], [], [], [], [], [] 193 | 194 | for output in outputs: 195 | for label, pred, doc_id, tn_id, prob in zip( 196 | output['labels'], output['preds'], 197 | output['doc_ids'], output['tn_ids'], output['probs']): 198 | predictions.append(pred) 199 | labels.append(label) 200 | document_ids.append(doc_id) 201 | typename_ids.append(tn_id) 202 | probs.append(prob) 203 | 204 | # cache labels, predictions, document_ids, typename_ids and probs 205 | # so that we can post-process without running again the test_steps 206 | if self.hparams.use_cached_predictions and os.path.exists(self.cache_fname): 207 | print(f'Loading predictions from cached file {self.cache_fname}') 208 | labels, predictions, document_ids, typename_ids, probs = \ 209 | pd.read_pickle(self.cache_fname).T.values.tolist() 210 | else: 211 | self._backup_outputs(labels, predictions, document_ids, typename_ids, probs) 212 | 213 | # pick up the highest-probability prediction for each pair document-typename 214 | if self.hparams.get_highestprob_answer: 215 | ( 216 | labels, 217 | predictions, 218 | document_ids, 219 | typename_ids, 220 | probs, 221 | window_ids, 222 | ) = get_highest_probability_window( 223 | labels, 224 | predictions, 225 | document_ids, 226 | typename_ids, 227 | probs, 228 | use_fewer_NA=True, 229 | ) 230 | 231 | # split compound answers to get metrics to visualize and compute metrics for each subsentence 232 | if self.hparams.split_compound_answers: 233 | ( 234 | labels, 235 | predictions, 236 | document_ids, 237 | typename_ids, 238 | probs, 239 | window_ids, 240 | _, 241 | _, 242 | original_idx, 243 | disjoint_answer_idx_by_doc_class, 244 | ) = split_compound_labels_and_predictions( 245 | labels, 246 | predictions, 247 | document_ids, 248 | typename_ids, 249 | probs, 250 | window_ids, 251 | ) 252 | else: 253 | print('WARNING: We strongly recommend to set --split_compound_answers=True, ' 254 | 'even for datasets without compound qas. 
This is useful to get metrics ' 255 | 'for clean outputs (without sentence-IDs and raw-text).') 256 | original_idx = list(range(len(labels))) 257 | disjoint_answer_idx_by_doc_class = {} 258 | 259 | # for each typename_id or document_id, extract its indexes to get specific metrics 260 | if self.hparams.group_qas: 261 | qid_dict_by_typenames = group_qas(typename_ids, group_by_typenames=True) 262 | qid_dict_by_documents = group_qas(document_ids, group_by_typenames=False) 263 | qid_dict_by_typenames['ORIG'] = original_idx 264 | qid_dict_by_documents['ORIG'] = original_idx 265 | else: 266 | qid_dict_by_typenames = {'ORIG': original_idx} 267 | qid_dict_by_documents = {'ORIG': original_idx} 268 | 269 | # save labels and predictions 270 | self._save_outputs( 271 | labels, predictions, document_ids, 272 | probs, window_ids, qid_dict_by_typenames, 273 | outputs_fname='outputs_by_typenames.txt', 274 | document_classes=list(disjoint_answer_idx_by_doc_class.keys()) 275 | ) 276 | self._save_outputs( 277 | labels, predictions, typename_ids, 278 | probs, window_ids, qid_dict_by_documents, 279 | outputs_fname='outputs_by_documents.txt', 280 | document_classes=list(disjoint_answer_idx_by_doc_class.keys()) 281 | ) 282 | 283 | # For each document class, include the indexes of individual qas, and 284 | # of subsentences of compound qas. This is useful for fair comparison 285 | # of experiments with compound qas and individual qas. 286 | # Also save the disjoint samples in Excel sheets. 287 | all_idx = [] 288 | writer = pd.ExcelWriter('outputs_sheet_client.xlsx') 289 | for document_class, indices in disjoint_answer_idx_by_doc_class.items(): 290 | qid_dict_by_typenames['DISJOINT_' + document_class] = indices 291 | qid_dict_by_documents['DISJOINT_' + document_class] = indices 292 | all_idx += indices 293 | self._save_sheets( 294 | labels, predictions, document_ids, 295 | typename_ids, probs, document_class, indices, writer 296 | ) 297 | writer.close() 298 | qid_dict_by_typenames['DISJOINT_ALL'] = all_idx 299 | qid_dict_by_documents['DISJOINT_ALL'] = all_idx 300 | self._save_sheets( 301 | labels, predictions, document_ids, 302 | typename_ids, probs, 'all', all_idx 303 | ) 304 | 305 | # compute metrics 306 | results_by_typenames = t5_qa_evaluate( 307 | labels, predictions, qid_dict=qid_dict_by_typenames 308 | ) 309 | results_by_documents = t5_qa_evaluate( 310 | labels, predictions, qid_dict=qid_dict_by_documents 311 | ) 312 | exact = torch.tensor(results_by_typenames['exact']) 313 | f1 = torch.tensor(results_by_typenames['f1']) 314 | 315 | # write metric files 316 | with open('metrics_by_typenames.json', 'w') as f: 317 | json.dump(results_by_typenames, f, indent=4) 318 | with open('metrics_by_documents.json', 'w') as f: 319 | json.dump(results_by_documents, f, indent=4) 320 | 321 | log = { 322 | 'exact': exact, 323 | 'f1': f1 324 | } 325 | self.log_dict(log, logger=True, on_epoch=True) 326 | 327 | @torch.no_grad() 328 | def _compute_probs(self, sentences, predictions): 329 | probs = [] 330 | for sentence, prediction in zip(sentences, predictions): 331 | input_ids = self.tokenizer.encode(sentence, truncation=True, 332 | max_length=self.input_max_length, return_tensors="pt").to(self.device).long() 333 | output_ids = self.tokenizer.encode(prediction, truncation=True, 334 | max_length=self.input_max_length, return_tensors="pt").to(self.device).long() 335 | 336 | outputs = self.model(input_ids=input_ids, labels=output_ids) 337 | 338 | loss = outputs[0] 339 | prob = (loss * -1) / output_ids.shape[1] 340 | prob = 
np.exp(prob.cpu().numpy()) 341 | probs.append(prob) 342 | return probs 343 | 344 | def _backup_outputs(self, labels, predictions, document_ids, typename_ids, probs): 345 | arr = np.vstack([np.array(labels, dtype="O"), np.array(predictions, dtype="O"), 346 | np.array(document_ids, dtype="O"), np.array(typename_ids, dtype="O"), 347 | np.array(probs, dtype="O")]).transpose() 348 | df = pd.DataFrame(arr, columns=['labels', 'predictions', 'document_ids', 'typename_ids', 'probs']) 349 | df.to_pickle(self.cache_fname) 350 | 351 | def _save_outputs( 352 | self, labels, predictions, doc_or_tn_ids, probs, window_ids, 353 | qid_dict=None, outputs_fname='outputs.txt', document_classes=["form"] 354 | ): 355 | if qid_dict is None: 356 | qid_dict = {} 357 | 358 | f = open(outputs_fname, 'w') 359 | f.write('{0:<50} | {1:50} | {2:30} | {3} | {4}\n'.format( 360 | 'label', 'prediction', 'uuid', 'prob', 'window')) 361 | if qid_dict == {}: 362 | for label, prediction, doc_or_tn_id, prob, w_id in zip( 363 | labels, predictions, doc_or_tn_ids, probs, window_ids): 364 | lab, pred = label, prediction 365 | if self.hparams.normalize_outputs: 366 | lab, pred = normalize_answer(label), normalize_answer(prediction) 367 | if lab != pred or lab == pred and not self.hparams.only_misprediction_outputs: 368 | f.write('{0:<50} | {1:50} | {2:30} | {3} | {4}\n'.format( 369 | label, prediction, doc_or_tn_id, prob, w_id)) 370 | else: 371 | for (kword, list_indices) in qid_dict.items(): 372 | # do not print for ORIG, DISJOINT* and all samples for a specific project/document class 373 | # those groups are important for metrics, not for outputs visualization 374 | if kword == 'ORIG' or kword.startswith('DISJOINT') or kword in document_classes: 375 | continue 376 | f.write(f'===============\n{kword}\n===============\n') 377 | for idx in list_indices: 378 | label, prediction, doc_or_tn_id, prob, w_id = \ 379 | labels[idx], predictions[idx], doc_or_tn_ids[idx], probs[idx], window_ids[idx] 380 | lab, pred = label, prediction 381 | if self.hparams.normalize_outputs: 382 | lab, pred = normalize_answer(label), normalize_answer(prediction) 383 | if lab != pred or lab == pred and not self.hparams.only_misprediction_outputs: 384 | f.write('{0:<50} | {1:50} | {2:30} | {3} | {4}\n'.format( 385 | label, prediction, doc_or_tn_id, prob, w_id)) 386 | f.close() 387 | 388 | def _save_sheets(self, labels, predictions, document_ids, typename_ids, probs, document_class, indices, writer=None): 389 | # Saving disjoint predictions (splitted and clean) in a dataframe 390 | arr = np.vstack([np.array(document_ids, dtype="O")[indices], 391 | np.array(typename_ids, dtype="O")[indices], 392 | np.array(labels, dtype="O")[indices], 393 | np.array(predictions, dtype="O")[indices], 394 | np.array(probs, dtype="O")[indices]]).transpose() 395 | df = pd.DataFrame(arr, 396 | columns=['document_ids', 'typename_ids', 'labels', 'predictions', 'probs'] 397 | ).reset_index(drop=True) 398 | 399 | if document_class == 'all': 400 | df = df.sort_values(['document_ids', 'typename_ids']) # hack to keep listing outputs together for each document-class 401 | df_all_group_doc = df.set_index('document_ids', append=True).swaplevel(0,1) 402 | df_all_group_doc.to_excel('outputs_sheet.xlsx') 403 | else: 404 | # compute metrics for each pair document_id-typename-id 405 | df['exact'] = df.apply(lambda x: compute_exact(x['labels'], x['predictions']), axis=1) 406 | df['f1'] = df.apply(lambda x: compute_f1(x['labels'], x['predictions']), axis=1) 407 | 408 | # remove clue/prefix into brackets 
409 | df['labels'] = df.apply( 410 | lambda x: ', '.join(get_clean_answer_from_subanswer(x['labels'])), 411 | axis=1 412 | ) 413 | df['predictions'] = df.apply( 414 | lambda x: ', '.join(get_clean_answer_from_subanswer(x['predictions'])), 415 | axis=1 416 | ) 417 | 418 | # use pivot to get a quadruple of columns (labels, predictions, equal, prob) for each typename 419 | pivoted = df.pivot( 420 | index=['document_ids'], 421 | columns=['typename_ids'], 422 | values=['labels', 'predictions', 'exact', 'f1', 'probs'] 423 | ) 424 | pivoted = pivoted.swaplevel(0, 1, axis=1).sort_index(axis=1) # put column (typename_ids) above the values 425 | 426 | # extract typename_ids in the original order (instead of alphanumeric order) 427 | # get the columns from the document-ids that have more samples 428 | cols = df[df['document_ids']==df.document_ids.mode()[0]].typename_ids.tolist() 429 | if len(cols) == len(pivoted.columns) // 5: 430 | pivoted = pivoted[cols] 431 | else: 432 | print('Keeping typenames in alphanumeric order since none of the documents ' 433 | f'have all the possible qa_ids ({len(cols)} != {len(pivoted.columns) // 5})') 434 | 435 | # save sheet 436 | pivoted.to_excel(writer, sheet_name=document_class) 437 | 438 | def get_optimizer(self,) -> torch.optim.Optimizer: 439 | """Define the optimizer""" 440 | optimizer_name = self.hparams.optimizer 441 | lr = self.hparams.lr 442 | weight_decay=self.hparams.weight_decay 443 | optimizer = getattr(torch.optim, optimizer_name) 444 | 445 | # Prepare optimizer and schedule (linear warmup and decay) 446 | no_decay = ["bias", "LayerNorm.weight"] 447 | optimizer_grouped_parameters = [ 448 | { 449 | "params": [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)], 450 | "weight_decay": weight_decay, 451 | }, 452 | {"params": [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0}, 453 | ] 454 | 455 | if self.hparams.deepspeed: 456 | # DeepSpeedCPUAdam provides 5x to 7x speedup over torch.optim.adam(w) 457 | optimizer = DeepSpeedCPUAdam( 458 | optimizer_grouped_parameters, lr=lr, 459 | weight_decay=weight_decay, eps=1e-4, adamw_mode=True 460 | ) 461 | else: 462 | optimizer = optimizer( 463 | optimizer_grouped_parameters, lr=lr, weight_decay=weight_decay 464 | ) 465 | 466 | print(f'=> Using {optimizer_name} optimizer') 467 | 468 | return optimizer 469 | 470 | @staticmethod 471 | def add_model_specific_args(parent_parser): 472 | 473 | parser = configargparse.ArgumentParser(parents=[parent_parser], add_help=False) 474 | 475 | parser.add_argument( 476 | "--model_name_or_path", 477 | default='t5-small', 478 | type=str, 479 | required=True, 480 | help="Path to pretrained model or model identifier from huggingface.co/models", 481 | ) 482 | parser.add_argument( 483 | "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name" 484 | ) 485 | parser.add_argument( 486 | "--tokenizer_name", 487 | default="", 488 | type=str, 489 | help="Pretrained tokenizer name or path if not the same as model_name", 490 | ) 491 | parser.add_argument( 492 | "--cache_dir", 493 | default="", 494 | type=str, 495 | help="Where do you want to store the pre-trained models downloaded from s3", 496 | ) 497 | parser.add_argument( 498 | "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model." 
499 | ) 500 | parser.add_argument( 501 | "--max_seq_length", 502 | default=384, 503 | type=int, 504 | help="The maximum total input sequence length after WordPiece tokenization. Sequences " 505 | "longer than this will be truncated, and sequences shorter than this will be padded.", 506 | ) 507 | parser.add_argument( 508 | "--max_size", 509 | default=1024, 510 | type=int, 511 | help="The maximum input length after char-based tokenization. And also the maximum context " 512 | "size for char-based contexts." 513 | ) 514 | parser.add_argument( 515 | "--max_length", 516 | default=120, 517 | type=int, 518 | help="The maximum total output sequence length generated by the model." 519 | ) 520 | parser.add_argument( 521 | "--num_beams", 522 | default=1, 523 | type=int, 524 | help="Number of beams for beam search. 1 means no beam search." 525 | ) 526 | parser.add_argument( 527 | "--get_highestprob_answer", 528 | action="store_true", 529 | help="If true, get the answer from the sliding-window that gives highest probability." 530 | ) 531 | parser.add_argument( 532 | "--split_compound_answers", 533 | action="store_true", 534 | help="If true, split the T5 outputs into individual answers.", 535 | ) 536 | parser.add_argument( 537 | "--group_qas", 538 | action="store_true", 539 | help="If true, use group qas to get individual metrics ans structured output file for each type-name.", 540 | ) 541 | parser.add_argument( 542 | "--only_misprediction_outputs", 543 | action="store_true", 544 | help="If true, return only mispredictions in the output file.", 545 | ) 546 | parser.add_argument( 547 | "--normalize_outputs", 548 | action="store_true", 549 | help="If true, normalize label and prediction to include in the output file. " 550 | "The normalization is the same applied before computing metrics.", 551 | ) 552 | 553 | return parser 554 | -------------------------------------------------------------------------------- /information_extraction_t5/predict.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | """ Predicting the T5 model finetuned for question-answering on SQuAD.""" 3 | 4 | import configargparse 5 | import glob 6 | 7 | import torch 8 | from pytorch_lightning import Trainer 9 | 10 | from information_extraction_t5.models.qa_model import LitQA 11 | from information_extraction_t5.data.qa_data import QADataModule 12 | 13 | 14 | def main(): 15 | """Predict.""" 16 | 17 | parser = configargparse.ArgParser( 18 | 'Training and evaluation script for training T5 model for QA', 19 | config_file_parser_class=configargparse.YAMLConfigFileParser) 20 | parser.add_argument('-c', '--my-config', required=True, is_config_file=True, 21 | help='config file path') 22 | 23 | parser.add_argument("--seed", type=int, default=42, 24 | help="random seed for initialization") 25 | parser.add_argument('--num_workers', default=8, type=int) 26 | parser.add_argument("--use_cached_predictions", action="store_true", 27 | help="If true, reload the cache to post-process the senteces and compute metrics") 28 | 29 | parser = LitQA.add_model_specific_args(parser) 30 | parser = QADataModule.add_model_specific_args(parser) 31 | args, _ = parser.parse_known_args() 32 | 33 | # Load best checkpoint of the current experiment 34 | ckpt_path = glob.glob('lightning_logs/*ckpt')[0] 35 | print(f'Loading weights from {ckpt_path}') 36 | 37 | model = LitQA.load_from_checkpoint( 38 | checkpoint_path=ckpt_path, 39 | hparams_file='lightning_logs/version_0/hparams.yaml', 40 | map_location=None, 41 | 
hparams=args, 42 | ) 43 | gpus = 1 if torch.cuda.is_available() else 0 44 | if gpus == 0: 45 | model = torch.quantization.quantize_dynamic( 46 | model, {torch.nn.Linear}, dtype=torch.qint8 47 | ) 48 | 49 | dm = QADataModule(args) 50 | dm.setup('test') 51 | 52 | torch.set_num_threads(1) 53 | trainer = Trainer(gpus=gpus) 54 | trainer.test(model, datamodule=dm) 55 | 56 | 57 | if __name__ == "__main__": 58 | main() 59 | -------------------------------------------------------------------------------- /information_extraction_t5/train.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | """ Finetuning the T5 model for question-answering on SQuAD.""" 3 | 4 | import math 5 | import os 6 | import configargparse 7 | 8 | from pytorch_lightning import Trainer, seed_everything 9 | from pytorch_lightning.callbacks import LearningRateMonitor 10 | from pytorch_lightning.callbacks import RichProgressBar 11 | from pytorch_lightning.callbacks import RichModelSummary 12 | from pytorch_lightning.callbacks.model_checkpoint import ModelCheckpoint 13 | from pytorch_lightning.loggers.neptune import NeptuneLogger 14 | from pytorch_lightning.plugins import DeepSpeedPlugin 15 | 16 | from information_extraction_t5.models.qa_model import LitQA 17 | from information_extraction_t5.data.qa_data import QADataModule 18 | 19 | MODEL_DIR = 'lightning_logs' 20 | 21 | def main(): 22 | """Train.""" 23 | 24 | parser = configargparse.ArgParser( 25 | 'Training and evaluation script for training T5 model for QA', 26 | config_file_parser_class=configargparse.YAMLConfigFileParser) 27 | parser.add_argument('-c', '--my-config', required=True, is_config_file=True, 28 | help='config file path') 29 | 30 | # optimizer parameters 31 | parser.add_argument('--optimizer', type=str, default='Adam') 32 | parser.add_argument("--lr", default=5e-5, type=float, 33 | help="The initial learning rate.") 34 | parser.add_argument("--weight_decay", default=0.0, type=float, 35 | help="Weight decay if we apply some.") 36 | 37 | # neptune 38 | parser.add_argument("--neptune", action="store_true", help="If true, use neptune logger.") 39 | parser.add_argument('--neptune_project', type=str, default='ramon.pires/bracis-2021') 40 | parser.add_argument('--experiment_name', type=str, default='experiment01') 41 | parser.add_argument('--tags', action='append') 42 | 43 | parser.add_argument("--deepspeed", action="store_true", help="If true, use deepspeed plugin.") 44 | 45 | parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") 46 | parser.add_argument('--num_workers', default=8, type=int) 47 | 48 | # add all the available trainer options to argparse 49 | parser = Trainer.add_argparse_args(parser) 50 | # add model specific args 51 | parser = LitQA.add_model_specific_args(parser) 52 | # add datamodule specific args 53 | parser = QADataModule.add_model_specific_args(parser) 54 | args, _ = parser.parse_known_args() 55 | 56 | # cached predictions must be used only for predict.py 57 | args.use_cached_predictions = False 58 | 59 | # setting the seed for reproducibility 60 | if args.deterministic: 61 | seed_everything(args.seed) 62 | 63 | # data module 64 | dm = QADataModule(args) 65 | dm.setup('fit') 66 | 67 | # Defining the model 68 | model = LitQA(args) 69 | 70 | # For training larger models, we are not running validation but saving 71 | # checkpoint by train steps. 
72 | # To do this, we set check_val_every_n_epoch > max_epochs
73 | if args.check_val_every_n_epoch > args.max_epochs:
74 | # dataset_size / (batch_size * accum_batches)
75 | every_n_train_steps = math.ceil(
76 | len(dm.train_dataset) / (args.train_batch_size * args.accumulate_grad_batches))
77 | checkpoint_callback = ModelCheckpoint(
78 | dirpath=MODEL_DIR, filename='{epoch}-{train_loss:.4f}',
79 | monitor='train_loss_step', verbose=False, save_last=False, save_top_k=args.max_epochs,
80 | save_weights_only=True, mode='min', every_n_train_steps=every_n_train_steps
81 | )
82 | else:
83 | checkpoint_callback = ModelCheckpoint(
84 | dirpath=MODEL_DIR, filename='{epoch}-{val_exact:.2f}-{val_f1:.2f}',
85 | monitor='val_exact', verbose=False, save_last=False, save_top_k=5,
86 | save_weights_only=False, mode='max', every_n_epochs=1
87 | )
88 | 
89 | # Instantiate LearningRateMonitor Callback
90 | lr_logger = LearningRateMonitor(logging_interval='epoch')
91 | 
92 | # Set neptune logger
93 | if args.neptune:
94 | neptune_logger = NeptuneLogger(
95 | api_key=os.environ.get('NEPTUNE_API_TOKEN'),
96 | project=args.neptune_project,
97 | name=args.experiment_name,
98 | mode='async', # Possible values: "async", "sync", "offline", "debug", "read-only"
99 | run=None, # Set the run's identifier like 'SAN-1' in case of resuming a tracked run
100 | tags=args.tags,
101 | log_model_checkpoints=False,
102 | source_files=["**/*.py", "*.yaml"],
103 | capture_stdout=False,
104 | capture_stderr=False,
105 | capture_hardware_metrics=False,
106 | )
107 | else:
108 | neptune_logger = None
109 | 
110 | if args.deepspeed:
111 | deepspeed_plugin = DeepSpeedPlugin(
112 | stage=2,
113 | offload_optimizer=True,
114 | offload_parameters=True,
115 | allgather_bucket_size=2e8,
116 | reduce_bucket_size=2e8,
117 | allgather_partitions=True,
118 | reduce_scatter=True,
119 | overlap_comm=True,
120 | contiguous_gradients=True,
121 | ## Activation Checkpointing
122 | partition_activations=True,
123 | cpu_checkpointing=True,
124 | contiguous_memory_optimization=True,
125 | )
126 | else:
127 | deepspeed_plugin = None
128 | 
129 | # Defining the Trainer, training...
and finally testing 130 | trainer = Trainer.from_argparse_args( 131 | args, 132 | logger=neptune_logger, 133 | plugins=deepspeed_plugin, 134 | callbacks=[ 135 | lr_logger, 136 | checkpoint_callback, 137 | RichProgressBar(), 138 | RichModelSummary(max_depth=2) 139 | ] 140 | ) 141 | trainer.fit(model, datamodule=dm) 142 | 143 | dm.setup('test') 144 | trainer.test(datamodule=dm) 145 | 146 | # Save checkpoints folder 147 | if args.neptune: 148 | # neptune_logger.experiment.log_artifact(MODEL_DIR) 149 | neptune_logger.log_hyperparams(vars(args)) 150 | neptune_logger.run['training/artifacts/metrics_by_typenames.json'].log('metrics_by_typenames.json') 151 | neptune_logger.run['training/artifacts/metrics_by_documents.json'].log('metrics_by_documents.json') 152 | neptune_logger.run['training/artifacts/outputs_sheet_client.xlsx'].log('outputs_sheet_client.xlsx') 153 | 154 | 155 | if __name__ == "__main__": 156 | main() 157 | -------------------------------------------------------------------------------- /information_extraction_t5/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuralmind-ai/information-extraction-t5/988c589b433d96004139e1f63bfefd6778c0851b/information_extraction_t5/utils/__init__.py -------------------------------------------------------------------------------- /information_extraction_t5/utils/balance_data.py: -------------------------------------------------------------------------------- 1 | """Code to balance the dataset, keeping a negative-positive ratio""" 2 | import numpy as np 3 | import pandas as pd 4 | 5 | from information_extraction_t5.features.sentences import split_t5_sentence_into_components 6 | 7 | 8 | def count_pos_neg(labels, document_ids, example_ids): 9 | """Count the number of positive and negative samples. The counting is also 10 | done for each pair of document_id-example-id returned as a dict.""" 11 | 12 | pos, neg = 0, 0 13 | counter = {} 14 | for label, document_id, example_id in zip(labels, document_ids, example_ids): 15 | if document_id not in counter: 16 | counter[document_id] = {} 17 | 18 | if example_id not in counter[document_id].keys(): 19 | counter[document_id][example_id] = {'pos': 0, 'neg': 0} 20 | 21 | if 'N/A' in label: 22 | counter[document_id][example_id]['neg'] += 1 23 | neg += 1 24 | else: 25 | counter[document_id][example_id]['pos'] += 1 26 | pos += 1 27 | return pos, neg, counter 28 | 29 | 30 | def balance_data(inputs, labels, document_ids, example_ids, negative_ratio): 31 | """Control the number of negative examples (N/A) with respect to the number 32 | of positive examples by negative_ratio. The data balancing is performed for 33 | each pair of document_id-example_id. 34 | 35 | Negative samples are selected with resampling with replacement, as the number 36 | of negative samples can be lower than the number of positives * negative_ratio. 
37 | """
38 | 
39 | n_pos, n_neg, _ = count_pos_neg(labels, document_ids, example_ids)
40 | print(f'>> The negative-positive ratio of the original dataset is {n_neg/n_pos:.2f}.')
41 | 
42 | is_negative = []
43 | for label in labels:
44 | _, _, sub_answers = split_t5_sentence_into_components(label)
45 | if 'N/A' in sub_answers:
46 | is_negative.append(True)
47 | else:
48 | is_negative.append(False)
49 | 
50 | # create the initial dataframe
51 | arr = np.vstack([
52 | np.array(inputs),
53 | np.array(labels),
54 | np.array(document_ids),
55 | np.array(example_ids),
56 | np.array(is_negative, dtype=bool)]).transpose()
57 | df1 = pd.DataFrame(arr, columns=['examples', 'labels', 'document_ids', 'example_ids', 'is_negative'])
58 | 
59 | # Separate positive and negative samples
60 | df_pos = df1.loc[df1['is_negative'] == 'False']
61 | df_neg = df1.loc[df1['is_negative'] == 'True']
62 | 
63 | # create a temporary dataframe with an additional column counting how many
64 | # positive qas we have for each document_id-example_id pair
65 | df_pos_counter = df_pos.groupby(['document_ids', 'example_ids']).size().reset_index(name='counts')
66 | 
67 | # merge the positive dataframe with counter and negative dataframe
68 | df_merge = df_pos_counter.merge(df_neg, on=['document_ids', 'example_ids'], how='outer')
69 | # remove document-example pairs that have only negative qas (no positive qa)
70 | df_merge.dropna(inplace=True)
71 | 
72 | # process the merged dataframe to resample negative cases proportionally
73 | # to the number of positive cases
74 | df_group = df_merge.groupby(['document_ids', 'example_ids'])
75 | frames = []
76 | for group in df_group.groups:
77 | df = df_group.get_group(group)
78 | df = df.sample(int(df['counts'].values[0]) * negative_ratio, replace=True, random_state=42)
79 | frames.append(df)
80 | df_merge = pd.concat(frames)
81 | 
82 | # remove temporary columns
83 | df_merge = df_merge.drop(['counts', 'is_negative'], axis=1)
84 | df_pos = df_pos.drop(['is_negative'], axis=1)
85 | 
86 | # create the final dataframe by concatenating positives and negatives
87 | dfinal = pd.concat([df_pos, df_merge])
88 | 
89 | inputs, labels, document_ids, example_ids = dfinal.T.values.tolist()
90 | 
91 | n_pos, n_neg, _ = count_pos_neg(labels, document_ids, example_ids)
92 | if n_neg/n_pos != negative_ratio:
93 | print(f'>> The resultant negative-positive ratio is {n_neg/n_pos:.2f}. 
' 94 | 'Hint: set "use_missing_answers=False" to get a precise data balancing.') 95 | else: 96 | print(f'>> The resultant negative-positive ratio is {n_neg/n_pos:.2f}.') 97 | 98 | return inputs, labels, document_ids, example_ids 99 | -------------------------------------------------------------------------------- /information_extraction_t5/utils/freeze.py: -------------------------------------------------------------------------------- 1 | """Utilities for freezing parameters and checking whether they are frozen.""" 2 | from torch import nn 3 | 4 | 5 | def freeze_params(model: nn.Module): 6 | """Set requires_grad=False for each of model.parameters()""" 7 | for par in model.parameters(): 8 | par.requires_grad = False 9 | 10 | 11 | def freeze_embeds(model): 12 | """Freeze token embeddings and positional embeddings for bart, just token embeddings for t5.""" 13 | model_type = model.config.model_type 14 | 15 | if model_type in ["t5", "mt5"]: 16 | freeze_params(model.shared) 17 | for d in [model.encoder, model.decoder]: 18 | freeze_params(d.embed_tokens) 19 | elif model_type == "fsmt": 20 | for d in [model.model.encoder, model.model.decoder]: 21 | freeze_params(d.embed_positions) 22 | freeze_params(d.embed_tokens) 23 | else: 24 | freeze_params(model.model.shared) 25 | for d in [model.model.encoder, model.model.decoder]: 26 | freeze_params(d.embed_positions) 27 | freeze_params(d.embed_tokens) 28 | -------------------------------------------------------------------------------- /information_extraction_t5/utils/metrics.py: -------------------------------------------------------------------------------- 1 | """ Very heavily inspired by the official evaluation script for SQuAD version 2 | 2.0 which was modified by XLNet authors to update `find_best_threshold` 3 | scripts for SQuAD V2.0 In addition to basic functionality, we also compute 4 | additional statistics and plot precision-recall curves if an additional 5 | na_prob.json file is provided. This file is expected to map question ID's to 6 | the model's predicted probability that a question is unanswerable. """ 7 | import collections 8 | import re 9 | import string 10 | from typing import Dict, Optional 11 | import unicodedata 12 | 13 | 14 | def normalize_answer(s): 15 | """Lower text and remove punctuation, articles and extra whitespace.""" 16 | 17 | def remove_articles(text): 18 | regex = re.compile(r"\b(a|an|the)\b", re.UNICODE) 19 | # regex = re.compile(r"\b(o|a|os|as|um|uma|uns|umas)\b", re.UNICODE) # portuguese? 
20 | return re.sub(regex, " ", text)
21 | 
22 | def white_space_fix(text):
23 | return " ".join(text.split())
24 | 
25 | def remove_punc(text):
26 | exclude = set(string.punctuation)
27 | return "".join(ch for ch in text if ch not in exclude)
28 | 
29 | def lower(text):
30 | return text.lower()
31 | 
32 | def strip_accents(s):
33 | return ''.join(c for c in unicodedata.normalize('NFD', s)
34 | if unicodedata.category(c) != 'Mn')
35 | 
36 | # return white_space_fix(remove_articles(remove_punc(lower(s))))
37 | return white_space_fix(remove_articles(strip_accents(remove_punc(lower(s)))))
38 | 
39 | 
40 | def get_tokens(s):
41 | if not s:
42 | return []
43 | return normalize_answer(s).split()
44 | 
45 | 
46 | def compute_exact(a_gold, a_pred):
47 | return int(normalize_answer(a_gold) == normalize_answer(a_pred))
48 | 
49 | 
50 | def compute_f1(a_gold, a_pred):
51 | gold_toks = get_tokens(a_gold)
52 | pred_toks = get_tokens(a_pred)
53 | common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
54 | num_same = sum(common.values())
55 | if len(gold_toks) == 0 or len(pred_toks) == 0:
56 | # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
57 | return int(gold_toks == pred_toks)
58 | if num_same == 0:
59 | return 0
60 | precision = 1.0 * num_same / len(pred_toks)
61 | recall = 1.0 * num_same / len(gold_toks)
62 | f1 = (2 * precision * recall) / (precision + recall)
63 | return f1
64 | 
65 | 
66 | def make_eval_dict(exact_scores, f1_scores, qid_list=None):
67 | if not qid_list:
68 | total = len(exact_scores)
69 | return collections.OrderedDict(
70 | [
71 | ("exact", 100.0 * sum(exact_scores.values()) / total),
72 | ("f1", 100.0 * sum(f1_scores.values()) / total),
73 | ("total", total),
74 | ]
75 | )
76 | else:
77 | total = len(qid_list)
78 | return collections.OrderedDict(
79 | [
80 | ("exact",
81 | 100.0 * sum(exact_scores[k] for k in qid_list) / total),
82 | ("f1", 100.0 * sum(f1_scores[k] for k in qid_list) / total),
83 | ("total", total),
84 | ]
85 | )
86 | 
87 | 
88 | def get_raw_scores(answers, preds):
89 | """Computes the exact and f1 scores from the examples and the model
90 | predictions.
91 | 
92 | This version gets the answers and predictions in text format, as returned by T5.
93 | """
94 | exact_scores = {}
95 | f1_scores = {}
96 | 
97 | for i, (answer, pred) in enumerate(zip(answers, preds)):
98 | exact_scores[i] = compute_exact(answer, pred)
99 | f1_scores[i] = compute_f1(answer, pred)
100 | 
101 | return exact_scores, f1_scores
102 | 
103 | 
104 | def t5_qa_evaluate(answers, preds, qid_dict: Optional[Dict] = None):
105 | """Evaluates T5 predictions.
106 | 
107 | This is a simplification of `squad_evaluate` to compute the exact and f1
108 | scores from T5 predictions.
109 | If required, this version returns subdicts with f1 and exact measures for
110 | pre-selected groups of question-answers.
111 | 
112 | Examples:
113 | >>> qid_dict = {
114 | >>> 'matriculas': [0, 4],
115 | >>> 'comarca': [1, 4],
116 | >>> 'estado': [2, 6],
117 | >>> 'oficio': [3, 7]
118 | >>> }
119 | >>> t5_qa_evaluate(answers, preds, qid_dict=qid_dict)
120 | >>> {'exact': x, 'f1': y, 'total': 8, 'matriculas': {'exact': z, 'f1': w, 'total': 2}, ...
} 121 | """ 122 | if qid_dict is None: 123 | qid_dict = {} 124 | 125 | exact, f1 = get_raw_scores(answers, preds) 126 | evaluation = make_eval_dict(exact, f1) 127 | 128 | for (kword, qid_list) in qid_dict.items(): 129 | evaluation[kword] = make_eval_dict(exact, f1, qid_list) 130 | 131 | return evaluation 132 | -------------------------------------------------------------------------------- /information_extraction_t5/utils/processing.py: -------------------------------------------------------------------------------- 1 | """Utility methods for pre and post processing.""" 2 | from collections import OrderedDict 3 | from typing import Dict, List, Tuple 4 | 5 | import regex as re 6 | 7 | 8 | def get_intersection_set(list_a: List, list_b: List) -> set: 9 | """Returns the intersection set of two lists.""" 10 | set_a = set(list_a) 11 | set_b = set(list_b) 12 | intersection = set_a.intersection(set_b) 13 | 14 | return intersection 15 | 16 | 17 | def concat_or_terms(terms, suffix='{e<=1}'): 18 | """Concats a list of terms in an OR regex. 19 | 20 | Example: 21 | >>> concat_or_terms([r'foo', r'bar'], suffix='{e<=1}') 22 | '(?:foo|bar){e<=1}' 23 | 24 | Args: 25 | terms (list): terms to be considered in a regex search group 26 | suffix (str): fuzzy options to use in the search 27 | 28 | Returns: 29 | (str): regex string for group search 30 | 31 | """ 32 | groups = '|'.join(map(str, terms)) 33 | 34 | return r'(?:{}){}'.format(groups, suffix) 35 | 36 | 37 | def expand_composite_char_pattern(text: str) -> str: 38 | """ Replace composable char in the given text for a regex group with all 39 | its composite versions. 40 | 41 | Args: 42 | text: the string to be expanded 43 | 44 | Returns: 45 | a new string with every composable char replaced by its composites 46 | pattern 47 | """ 48 | 49 | composite_char_groups = [ 50 | 'aáàâã', 51 | 'eéê', 52 | 'ií', 53 | 'oóõ', 54 | 'uúü', 55 | 'cç' 56 | ] 57 | 58 | for group in composite_char_groups: 59 | text = re.sub(fr'[{group}]', f'[{group}]', text) 60 | return text 61 | 62 | 63 | def count_k_v(d): 64 | """Count keys and values in nested dictionary.""" 65 | keys, values = 0, 0 66 | if isinstance(d, Dict) or isinstance(d, OrderedDict): 67 | for item in d.keys(): 68 | if isinstance(d[item], (List, Tuple, Dict)): 69 | keys += 1 70 | k, v = count_k_v(d[item]) 71 | values += v 72 | keys += k 73 | else: 74 | keys += 1 75 | values += 1 76 | 77 | elif isinstance(d, (List, Tuple)): 78 | for item in d: 79 | if isinstance(item, (List, Tuple, Dict)): 80 | k, v = count_k_v(item) 81 | values += v 82 | keys += k 83 | else: 84 | values += 1 85 | 86 | return keys, values 87 | -------------------------------------------------------------------------------- /params.yaml: -------------------------------------------------------------------------------- 1 | model_name_or_path: unicamp-dl/ptt5-base-portuguese-vocab 2 | do_lower_case: false 3 | deepspeed: false 4 | 5 | # neptune 6 | neptune: false 7 | neptune_project: ramon.pires/information-extraction-t5 8 | experiment_name: experiment01 9 | tags: [ptt5, compound] 10 | 11 | # optimizer 12 | optimizer: AdamW 13 | lr: 1e-4 14 | weight_decay: 1e-5 15 | 16 | # preprocess dataset 17 | project: [ 18 | form, 19 | ] 20 | raw_data_file: [ 21 | data/raw/sample_train.json 22 | ] 23 | raw_valid_data_file: [ 24 | null, 25 | ] 26 | raw_test_data_file: [ 27 | data/raw/sample_test.json 28 | ] 29 | train_file: data/processed/train-v0.1.json 30 | valid_file: data/processed/dev-v0.1.json 31 | test_file: data/processed/test-v0.1.json 32 | type_names: [ 33 | 
form.etiqueta, 34 | form.agencia, 35 | form.conta_corrente, 36 | form.cpf, 37 | form.nome_completo, 38 | form.n_doc_serie, 39 | form.orgao_emissor, 40 | form.data_emissao, 41 | form.data_nascimento, 42 | form.nome_mae, 43 | form.nome_pai, 44 | form.endereco, 45 | ] 46 | use_compound_question: [ 47 | form.endereco, 48 | ] 49 | return_raw_text: [ 50 | null, 51 | ] 52 | 53 | train_force_qa: true 54 | train_choose_question: first 55 | valid_percent: 0.2 56 | context_content: windows_token 57 | window_overlap: 0.2 58 | max_windows: 3 59 | max_size: 2048 60 | max_seq_length: 512 61 | 62 | # dataset 63 | train_batch_size: 8 64 | val_batch_size: 8 65 | shuffle_train: true 66 | use_sentence_id: false 67 | negative_ratio: -1 68 | 69 | seed: 20210519 70 | num_workers: 6 71 | 72 | # inference and post-processing 73 | num_beams: 5 74 | max_length: 200 75 | get_highestprob_answer: true 76 | split_compound_answers: true 77 | group_qas: true 78 | normalize_outputs: true 79 | only_misprediction_outputs: true 80 | use_cached_predictions: true 81 | 82 | # Trainer 83 | accelerator: auto 84 | devices: auto 85 | max_epochs: 26 86 | deterministic: true 87 | accumulate_grad_batches: 2 88 | amp_backend: native 89 | precision: 32 90 | gradient_clip_val: 1.0 91 | val_check_interval: 1.0 92 | check_val_every_n_epoch: 2 93 | limit_val_batches: 0.5 94 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==0.9.0 2 | appdirs==1.4.4 3 | atpublic==1.0 4 | attrs==19.3.0 5 | boto3==1.20.20 6 | botocore==1.23.20 7 | cachetools==4.1.1 8 | certifi==2020.6.20 9 | chardet==3.0.4 10 | click==7.1.2 11 | colorama==0.4.3 12 | ConfigArgParse==1.2.3 13 | configobj==5.0.6 14 | decorator==4.4.2 15 | dill==0.3.2 16 | distro==1.5.0 17 | docutils==0.15.2 18 | dpath==2.0.1 19 | dvc==1.9.1 20 | filelock==3.0.12 21 | flatten-json==0.1.7 22 | flufl.lock==3.2 23 | funcy==1.14 24 | future==0.18.2 25 | fuzzysearch==0.7.3 26 | fuzzywuzzy==0.18.0 27 | gitdb==4.0.5 28 | GitPython==3.1.7 29 | google-auth==1.18.0 30 | google-auth-oauthlib==0.4.1 31 | googleapis-common-protos==1.52.0 32 | grandalf==0.6 33 | grpcio==1.30.0 34 | idna==2.10 35 | jmespath==0.10.0 36 | joblib==0.16.0 37 | jsonpath-ng==1.5.1 38 | Markdown==3.2.2 39 | nanotime==0.5.2 40 | neptune-client==0.13.3 41 | neptune-contrib==0.28.1 42 | networkx==2.3 43 | numpy==1.19.0 44 | oauthlib==3.1.0 45 | openpyxl==3.0.9 46 | packaging>=19.0 47 | pandas<=1.3.5 48 | pathspec==0.8.0 49 | ply==3.11 50 | promise==2.3 51 | protobuf==3.12.2 52 | psutil==5.8.0 53 | pyasn1==0.4.8 54 | pyasn1-modules==0.2.8 55 | pydot==1.4.1 56 | pygtrie==2.3.2 57 | pyparsing==2.4.7 58 | python-dateutil==2.8.1 59 | python-Levenshtein==0.12.0 60 | pytorch-lightning==1.5.5 61 | PyYAML==5.3.1 62 | regex==2020.6.8 63 | requests==2.24.0 64 | requests-oauthlib==1.3.0 65 | rich==10.15.2 66 | rsa==4.5.0 67 | ruamel.yaml==0.16.10 68 | ruamel.yaml.clib==0.2.0 69 | s3transfer==0.5.0 70 | sacremoses==0.0.43 71 | scikit-learn==0.23.1 72 | scipy==1.5.1 73 | sentencepiece==0.1.91 74 | shortuuid==1.0.1 75 | shtab<2.0.0 76 | six==1.15.0 77 | sklearn==0.0 78 | smmap==3.0.4 79 | tabulate==0.8.7 80 | termcolor==1.1.0 81 | threadpoolctl==2.1.0 82 | tokenizers==0.10.3 83 | torch==1.10.0 84 | tqdm==4.47.0 85 | transformers==4.8.2 86 | urllib3==1.25.9 87 | voluptuous==0.11.7 88 | Werkzeug==1.0.1 89 | wrapt==1.12.1 90 | zc.lockfile==2.0 91 | deepspeed==0.4.3 92 | 
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | """Install information_extraction_t5"""
2 | import os
3 | from setuptools import setup
4 | from setuptools import find_packages
5 | 
6 | pkg_dir = os.path.dirname(__file__)
7 | 
8 | with open(os.path.join(pkg_dir, 'requirements.txt'), 'r', encoding='utf-8') as fd:
9 | requirements = fd.read().splitlines()
10 | 
11 | setup(
12 | name='information_extraction_t5',
13 | version='1.0',
14 | packages=find_packages('.', exclude=['data*',
15 | 'lightning_logs*',
16 | 'models*']),
17 | long_description=open('README.md', 'r', encoding='utf-8').read(),
18 | install_requires=requirements,
19 | )
20 | 
--------------------------------------------------------------------------------
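
The evaluation helper `t5_qa_evaluate` in `information_extraction_t5/utils/metrics.py` can also be exercised on its own, outside the Lightning pipeline. The snippet below is a minimal sketch, not part of the repository; the answer strings, prediction strings, and group names are invented for illustration (the group names merely echo the `type_names` in `params.yaml`), and only the function signature and return structure shown above are assumed.

```python
# Minimal sketch (not part of the repository): calling t5_qa_evaluate directly.
from information_extraction_t5.utils.metrics import t5_qa_evaluate

# Gold answers and model predictions, aligned by position (index i is QA pair i).
answers = ["João da Silva", "12/05/1998", "N/A"]
preds = ["joão da silva", "12/05/1998", "N/A"]

# Optional grouping: map a group name to the QA indices it covers, so the
# report also contains a per-group sub-dict with exact/f1/total.
qid_dict = {
    "form.nome_completo": [0],
    "form.data_nascimento": [1],
    "form.nome_pai": [2],
}

report = t5_qa_evaluate(answers, preds, qid_dict=qid_dict)
print(report["exact"], report["f1"])  # overall scores on a 0-100 scale
print(report["form.nome_completo"])   # e.g. OrderedDict([('exact', 100.0), ('f1', 100.0), ('total', 1)])
```

Passing `qid_dict` is what yields grouped exact/F1 sub-dicts on top of a flat list of predictions, which is presumably how the per-field and per-document breakdowns reported by the pipeline are assembled.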