├── postagger_elisa.png
└── README.md

/postagger_elisa.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lisaterumi/nlp-portuguese-postagger/HEAD/postagger_elisa.png

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# NLP Portuguese POS-tagger

We fine-tuned the [BERTimbau](https://github.com/neuralmind-ai/portuguese-bert/) model on the [MacMorpho](http://nilc.icmc.usp.br/macmorpho/) corpus for the *POS-tagging* task for 10 epochs, reaching an overall *F1-score* of `0.9826`.

Metrics:

```
              Precision  Recall  F1    Support
accuracy                         0.98  33729
macro avg     0.96       0.95    0.95  33729
weighted avg  0.98       0.98    0.98  33729

F1: 0.9826 Accuracy: 0.9826
```

## Repository

Our model is hosted on the `Hugging Face` Hub and is available at: https://huggingface.co/lisaterumi/postagger-portuguese/

If you like our work, don't forget to *like* the model on `Hugging Face` ❤️

## How to use

To use our model, load the tokenizer and the model as follows:

```
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Download (or load from the local cache) the tokenizer published with the model
tokenizer = AutoTokenizer.from_pretrained("lisaterumi/postagger-portuguese")

# Load the fine-tuned token-classification model
model = AutoModelForTokenClassification.from_pretrained("lisaterumi/postagger-portuguese")
```

The table below lists the grammatical tags returned by the model:

| Tag | Meaning |
| ------------------- | ------------------- |
| ADJ | Adjective |
| ADV | Adverb |
| ADV-KS | Subordinating connective adverb |
| ADV-KS-REL | Subordinating relative adverb |
| ART | Article |
| CUR | Currency |
| IN | Interjection |
| KC | Coordinating conjunction |
| KS | Subordinating conjunction |
| N | Noun |
| NPROP | Proper noun |
| NUM | Number |
| PCP | Participle |
| PDEN | Denotative word |
| PREP | Preposition |
| PROADJ | Adjectival pronoun |
| PRO-KS | Subordinating connective pronoun |
| PRO-KS-REL | Subordinating connective relative pronoun |
| PROPESS | Personal pronoun |
| PROSUB | Nominal pronoun |
| V | Verb |
| VAUX | Auxiliary verb |

More information and examples: http://nilc.icmc.usp.br/macmorpho/macmorpho-manual.pdf

## How to cite

```
@article{Schneider_postagger_2023,
  place={Brasil},
  title={Developing a Transformer-based Clinical Part-of-Speech Tagger for Brazilian Portuguese},
  volume={15},
  url={https://jhi.sbis.org.br/index.php/jhi-sbis/article/view/1086},
  DOI={10.59681/2175-4411.v15.iEspecial.2023.1086},
  abstractNote={Electronic Health Records are a valuable source of information to be extracted by means of natural language processing (NLP) tasks, such as morphosyntactic word tagging. Although there have been significant advances in health NLP, such as the Transformer architecture, languages such as Portuguese are still underrepresented. This paper presents taggers developed for Portuguese texts, fine-tuned using BioBERTpt (clinical/biomedical) and BERTimbau (generic) models on a POS-tagged corpus. We achieved an accuracy of 0.9826, state-of-the-art for the corpus used. In addition, we performed a human-based evaluation of the trained models and others in the literature, using authentic clinical narratives. Our clinical model achieved 0.8145 in accuracy compared to 0.7656 for the generic model. It also showed competitive results compared to models trained specifically with clinical texts, evidencing domain impact on the base model in NLP tasks.},
  number={Especial},
  journal={Journal of Health Informatics},
  author={Schneider, Elisa Terumi Rubel and Gumiel, Yohan Bonescki and Oliveira, Lucas Ferro Antunes de and Montenegro, Carolina de Oliveira and Barzotto, Laura Rubel and Moro, Claudia and Pagano, Adriana and Paraiso, Emerson Cabrera},
  year={2023},
  month={jul.}
}
```

--------------------------------------------------------------------------------
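The loading snippet in the usage section stops once the tokenizer and the model are in memory. The sketch below shows one possible way to go from there to word-level tags. It is not taken from the repository: the function names `merge_wordpieces` and `tag_sentence` are hypothetical, and it assumes that the tokenizer is a BERT-style WordPiece tokenizer (continuation pieces prefixed with `##`) and that the model's `config.id2label` maps class ids to the tags in the table. Assigning each word the tag predicted for its first piece is a common heuristic, not necessarily what the authors used.

```python
from typing import List, Tuple

# Special tokens that a BERT-style tokenizer adds around the sentence.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Collapse WordPiece tokens into words.

    Each word keeps the label predicted for its first piece (a heuristic,
    assumed here for illustration).
    """
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word, keep that word's label.
            previous_word, previous_label = words[-1]
            words[-1] = (previous_word + token[2:], previous_label)
        else:
            words.append((token, label))
    return words


def tag_sentence(text: str, model_name: str = "lisaterumi/postagger-portuguese") -> List[Tuple[str, str]]:
    """Tag one sentence; downloads the model on first use."""
    # Imported lazily so that merge_wordpieces can be used without these packages.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)

    predicted_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    labels = [model.config.id2label[i] for i in predicted_ids]
    return merge_wordpieces(tokens, labels)
```

Calling `tag_sentence("O paciente dorme bem.")` would return a list of `(word, tag)` pairs. The exact tags depend on the model's predictions, so no output is shown here.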