├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | build/
 12 | develop-eggs/
 13 | dist/
 14 | downloads/
 15 | eggs/
 16 | .eggs/
 17 | lib/
 18 | lib64/
 19 | parts/
 20 | sdist/
 21 | var/
 22 | wheels/
 23 | share/python-wheels/
 24 | *.egg-info/
 25 | .installed.cfg
 26 | *.egg
 27 | MANIFEST
 28 | 
 29 | # PyInstaller
 30 | #  Usually these files are written by a python script from a template
 31 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 32 | *.manifest
 33 | *.spec
 34 | 
 35 | # Installer logs
 36 | pip-log.txt
 37 | pip-delete-this-directory.txt
 38 | 
 39 | # Unit test / coverage reports
 40 | htmlcov/
 41 | .tox/
 42 | .nox/
 43 | .coverage
 44 | .coverage.*
 45 | .cache
 46 | nosetests.xml
 47 | coverage.xml
 48 | *.cover
 49 | *.py,cover
 50 | .hypothesis/
 51 | .pytest_cache/
 52 | cover/
 53 | 
 54 | # Translations
 55 | *.mo
 56 | *.pot
 57 | 
 58 | # Django stuff:
 59 | *.log
 60 | local_settings.py
 61 | db.sqlite3
 62 | db.sqlite3-journal
 63 | 
 64 | # Flask stuff:
 65 | instance/
 66 | .webassets-cache
 67 | 
 68 | # Scrapy stuff:
 69 | .scrapy
 70 | 
 71 | # Sphinx documentation
 72 | docs/_build/
 73 | 
 74 | # PyBuilder
 75 | .pybuilder/
 76 | target/
 77 | 
 78 | # Jupyter Notebook
 79 | .ipynb_checkpoints
 80 | 
 81 | # IPython
 82 | profile_default/
 83 | ipython_config.py
 84 | 
 85 | # pyenv
 86 | #   For a library or package, you might want to ignore these files since the code is
 87 | #   intended to run in multiple environments; otherwise, check them in:
 88 | # .python-version
 89 | 
 90 | # pipenv
 91 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 92 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 93 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 94 | #   install all needed dependencies.
 95 | #Pipfile.lock
 96 | 
 97 | # poetry
 98 | #   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
 99 | #   This is especially recommended for binary packages to ensure reproducibility, and is more
100 | #   commonly ignored for libraries.
101 | #   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
102 | #poetry.lock
103 | 
104 | # pdm
105 | #   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
106 | #pdm.lock
107 | #   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
108 | #   in version control.
109 | #   https://pdm.fming.dev/#use-with-ide
110 | .pdm.toml
111 | 
112 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
113 | __pypackages__/
114 | 
115 | # Celery stuff
116 | celerybeat-schedule
117 | celerybeat.pid
118 | 
119 | # SageMath parsed files
120 | *.sage.py
121 | 
122 | # Environments
123 | .env
124 | .venv
125 | env/
126 | venv/
127 | ENV/
128 | env.bak/
129 | venv.bak/
130 | 
131 | # Spyder project settings
132 | .spyderproject
133 | .spyproject
134 | 
135 | # Rope project settings
136 | .ropeproject
137 | 
138 | # mkdocs documentation
139 | /site
140 | 
141 | # mypy
142 | .mypy_cache/
143 | .dmypy.json
144 | dmypy.json
145 | 
146 | # Pyre type checker
147 | .pyre/
148 | 
149 | # pytype static type analyzer
150 | .pytype/
151 | 
152 | # Cython debug symbols
153 | cython_debug/
154 | 
155 | # PyCharm
156 | #  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
157 | #  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
158 | #  and can be added to the global gitignore or merged into this file.  For a more nuclear
159 | #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
160 | #.idea/
161 | 


--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
 1 | # Contribution Guidelines
 2 | 
 3 | ## Adding to this list
 4 | 
 5 | If you have something awesome to contribute, this is how you do it:
 6 | 
 7 | You'll need a [GitHub account](https://github.com/join). It's free and it's obviously quite awesome.
 8 | 
 9 | 1. Once you're registered, come back to this list: https://github.com/AndyTheFactory/romanian-nlp-datasets
10 | 2. Click on the `README.md` file: ![Step 2 Click on Readme.md](https://cloud.githubusercontent.com/assets/170270/9402920/53a7e3ea-480c-11e5-9d81-aecf64be55eb.png)
11 | 3. Now click on the edit icon. ![Step 3 - Click on Edit](https://cloud.githubusercontent.com/assets/170270/9402927/6506af22-480c-11e5-8c18-7ea823530099.png)
12 | 4. You can start editing the text of the file in the in-browser editor. Make sure you follow the guidelines below. You can use [GitHub Flavored Markdown](https://help.github.com/articles/github-flavored-markdown/). ![Step 4 - Edit the file](https://cloud.githubusercontent.com/assets/170270/9402932/7301c3a0-480c-11e5-81f5-7e343b71674f.png)
13 | 5. Say why you're proposing the changes, and then click on "Propose file change". ![Step 5 - Propose Changes](https://cloud.githubusercontent.com/assets/170270/9402937/7dd0652a-480c-11e5-9138-bd14244593d5.png)
14 | 6. Submit the [pull request](https://help.github.com/articles/using-pull-requests/)!
15 | 
16 | If you can, please double-check that your pull request adheres to the following guidelines:
17 | 
18 | - Make an individual pull request for each suggestion.
19 | - Search previous suggestions before making a new one, as yours may be a duplicate.
20 | - Titles should be capitalized.
21 | - Use the following format for links: `[List Name](link)`
22 | - Add a description of the added dataset as a quote, using `>`
23 | - If a paper exists for that dataset, add the link to that paper as a badge (see existing examples)
24 | - Link additions should be added to the bottom of the relevant category.
25 | - New categories or improvements to the existing categorization are welcome.
26 | - Check your spelling and grammar.
27 | - The pull requests and commits should have useful titles.
28 | 
29 | But don't worry too much, we're just happy to have your suggestions!
30 | 
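The formatting guidelines above can be sketched as a small Python helper that assembles one list entry. This is an illustration only and not part of this repository: the function name `format_entry`, its parameters, the `example.org` URLs and the badge color are assumptions made for the example. The badge URL follows the shields.io static badge convention, in which a literal dash is written as `--` and a space as `%20`.

```python
from typing import Optional


def format_entry(name: str, url: str, description: str,
                 paper_url: Optional[str] = None,
                 badge_label: str = "Paper") -> str:
    """Build one list entry: link bullet, quoted description, optional paper badge."""
    # Guideline: titles should be capitalized (only the first character is forced here).
    title = name[:1].upper() + name[1:]
    # Guideline: links use the form [List Name](link).
    lines = [f"* [{title}]({url})"]
    # Guideline: the description is added as a quote, one `>` per line.
    for part in description.strip().splitlines():
        lines.append(f"> {part.strip()}")
    if paper_url:
        # Guideline: if a paper exists, link it as a badge.
        # Assumption: shields.io static badge, with "-" escaped as "--" and " " as "%20".
        label = badge_label.replace("-", "--").replace(" ", "%20")
        lines.append("")
        lines.append(
            f"[![{badge_label}](https://img.shields.io/badge/{label}-6ca1f0)]({paper_url})"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical dataset, used only to show the resulting Markdown.
    print(format_entry(
        "example corpus",
        "https://example.org/corpus",
        "A hypothetical corpus.\nUsed only for illustration.",
        paper_url="https://example.org/paper",
    ))
```

Calling the helper with a name, a link and a short description returns the bullet and the quoted description; passing `paper_url` appends a badge line after a blank line, matching the layout of the existing entries.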


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | CC0 1.0 Universal
 2 | 
 3 | Statement of Purpose
 4 | 
 5 | The laws of most jurisdictions throughout the world automatically confer exclusive Copyright and Related Rights (defined below) upon the creator and subsequent owner(s) (each and all, an "owner") of an original work of authorship and/or a database (each, a "Work").
 6 | 
 7 | Certain owners wish to permanently relinquish those rights to a Work for the purpose of contributing to a commons of creative, cultural and scientific works ("Commons") that the public can reliably and without fear of later claims of infringement build upon, modify, incorporate in other works, reuse and redistribute as freely as possible in any form whatsoever and for any purposes, including without limitation commercial purposes. These owners may contribute to the Commons to promote the ideal of a free culture and the further production of creative, cultural and scientific works, or to gain reputation or greater distribution for their Work in part through the use and efforts of others.
 8 | 
 9 | For these and/or other purposes and motivations, and without any expectation of additional consideration or compensation, the person associating CC0 with a Work (the "Affirmer"), to the extent that he or she is an owner of Copyright and Related Rights in the Work, voluntarily elects to apply CC0 to the Work and publicly distribute the Work under its terms, with knowledge of his or her Copyright and Related Rights in the Work and the meaning and intended legal effect of CC0 on those rights.
10 | 
11 | Copyright and Related Rights. A Work made available under CC0 may be protected by copyright and related or neighboring rights ("Copyright and Related Rights"). Copyright and Related Rights include, but are not limited to, the following:
12 | i. the right to reproduce, adapt, distribute, perform, display, communicate, and translate a Work;
13 | 
14 | ii. moral rights retained by the original author(s) and/or performer(s);
15 | 
16 | iii. publicity and privacy rights pertaining to a person's image or likeness depicted in a Work;
17 | 
18 | iv. rights protecting against unfair competition in regards to a Work, subject to the limitations in paragraph 4(a), below;
19 | 
20 | v. rights protecting the extraction, dissemination, use and reuse of data in a Work;
21 | 
22 | vi. database rights (such as those arising under Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, and under any national implementation thereof, including any amended or successor version of such directive); and
23 | 
24 | vii. other similar, equivalent or corresponding rights throughout the world based on applicable law or treaty, and any national implementations thereof.
25 | 
26 | Waiver. To the greatest extent permitted by, but not in contravention of, applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and unconditionally waives, abandons, and surrenders all of Affirmer's Copyright and Related Rights and associated claims and causes of action, whether now known or unknown (including existing as well as future claims and causes of action), in the Work (i) in all territories worldwide, (ii) for the maximum duration provided by applicable law or treaty (including future time extensions), (iii) in any current or future medium and for any number of copies, and (iv) for any purpose whatsoever, including without limitation commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each member of the public at large and to the detriment of Affirmer's heirs and successors, fully intending that such Waiver shall not be subject to revocation, rescission, cancellation, termination, or any other legal or equitable action to disrupt the quiet enjoyment of the Work by the public as contemplated by Affirmer's express Statement of Purpose.
27 | 
28 | Public License Fallback. Should any part of the Waiver for any reason be judged legally invalid or ineffective under applicable law, then the Waiver shall be preserved to the maximum extent permitted taking into account Affirmer's express Statement of Purpose. In addition, to the extent the Waiver is so judged Affirmer hereby grants to each affected person a royalty-free, non transferable, non sublicensable, non exclusive, irrevocable and unconditional license to exercise Affirmer's Copyright and Related Rights in the Work (i) in all territories worldwide, (ii) for the maximum duration provided by applicable law or treaty (including future time extensions), (iii) in any current or future medium and for any number of copies, and (iv) for any purpose whatsoever, including without limitation commercial, advertising or promotional purposes (the "License"). The License shall be deemed effective as of the date CC0 was applied by Affirmer to the Work. Should any part of the License for any reason be judged legally invalid or ineffective under applicable law, such partial invalidity or ineffectiveness shall not invalidate the remainder of the License, and in such case Affirmer hereby affirms that he or she will not (i) exercise any of his or her remaining Copyright and Related Rights in the Work or (ii) assert any associated claims and causes of action with respect to the Work, in either case contrary to Affirmer's express Statement of Purpose.
29 | 
30 | Limitations and Disclaimers.
31 | 
32 | a. No trademark or patent rights held by Affirmer are waived, abandoned, surrendered, licensed or otherwise affected by this document.
33 | 
34 | b. Affirmer offers the Work as-is and makes no representations or warranties of any kind concerning the Work, express, implied, statutory or otherwise, including without limitation warranties of title, merchantability, fitness for a particular purpose, non infringement, or the absence of latent or other defects, accuracy, or the present or absence of errors, whether or not discoverable, all to the greatest extent permissible under applicable law.
35 | 
36 | c. Affirmer disclaims responsibility for clearing rights of other persons that may apply to the Work or any use thereof, including without limitation any person's Copyright and Related Rights in the Work. Further, Affirmer disclaims responsibility for obtaining any necessary consents, permissions or other rights required for any use of the Work.
37 | 
38 | d. Affirmer understands and acknowledges that Creative Commons is not a party to this document and has no duty or obligation with respect to this CC0 or use of the Work.
39 | 
40 | For more information, please see https://creativecommons.org/publicdomain/zero/1.0
41 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | [![Awesome](https://awesome.re/badge-flat2.svg)](https://awesome.re)
  2 | 
  3 | # A list of Romanian NLP Datasets
  4 | A curated list of open source and open access Romanian Language NLP Datasets.
  5 | For the moment we don't add parallel corpora to the list.
  6 | 
  7 | For additions or any other changes please submit a pull request.
  8 | 
  9 | Table of contents
 10 | =================
 11 | 
 12 | <!--ts-->
 13 |    * [Unlabeled text Corpora](#unlabeled-text-corpora)
 14 |    * [Semantic Textual Similarity / Paraphrasing](#semantic-textual-similarity--paraphrasing)
 15 |    * [Natural Language Inference](#natural-language-inference)
 16 |    * [Summarization](#summarization)
 17 |    * [Dialect and regional speech identification](#dialect-and-regional-speech-identification)
 18 |    * [Named Entity Recognition (NER)](#named-entity-recognition-ner)
 19 |    * [Authorship Attribution](#autorship-attribution)
 20 |    * [Sentiment Analysis](#sentiment-analysis)
 21 |    * [Dependency Parsing](#dependency-parsing)
 22 |    * [Diacritics Restoration / Grammar Correction](#diacritics-restoration--grammar-correction)
 23 |    * [Fake News / Clickbait / Satirical News](#fake-news--clickbait--satirical-news)
 24 |    * [Offensive Language](#offensive-language)
 25 |    * [Questions and Answers](#questions-and-answers)
 26 |    * [Spelling, Dictionaries and Grammatical Errors](#spelling-dictionaries-and-gramatical-errors)
 27 | 
 28 | <!--te-->
 29 | 
 30 | ## Unlabeled text Corpora
 31 | 
 32 | 
 33 | * [❄️FuLG dataset ❄️](https://huggingface.co/datasets/faur-ai/fulg)
 34 | >     The FuLG dataset is a comprehensive Romanian language corpus comprising
 35 | >     150 billion tokens, carefully extracted from Common Crawl. 
 36 | [![arXiv](https://img.shields.io/badge/arXiv-2407.13657-f9f107.svg)](https://arxiv.org/abs/2407.13657)
 37 | 
 38 | * [🌐 Oscar Common Crawl dataset 🌐](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201)
 39 | >     Part of a large multilanguage corpus originated from Common Crawl.
 40 | >     It's a raw, unannotated corpus. It has roughly 50 GB of Romanian text
 41 | >     in 4.5 million documents. For details, check its homepage
 42 | >     and the paper.
 43 |  
 44 | [![arXiv](https://img.shields.io/badge/arXiv-2004.06165-f9f107.svg)](https://arxiv.org/abs/2004.06165)
 45 | [![Homepage](https://img.shields.io/badge/oscar%20homepage-6ca1f0)](https://oscar-project.org/)
 46 | 
 47 | * [📚 CC-100 📚](https://data.statmt.org/cc-100/)
 48 | >      Similar to Oscar, part of a multilingual corpus also based on Common Crawl
 49 | >      from 2018. The Romanian portion is 16 GB.
 50 | 
 51 | [![arXiv](https://img.shields.io/badge/arXiv-1911.00359-f9f107.svg)](https://arxiv.org/abs/1911.00359)
 52 | [![Homepage](https://img.shields.io/badge/cc--100%20homepage-6ca1f0)](https://data.statmt.org/cc-100/)
 53 | 
 54 | * [🌍 Wikipedia Corpus 🌍](https://dumps.wikimedia.org/rowiki/)
 55 | >       Romanian-language Wikipedia dump.
 56 |   
 57 | * [📰⚖️ RoTex Collection 📰⚖️](https://github.com/aleris/ReadME-RoTex-Corpus-Builder)
 58 | >       A collection of various unannotated corpora collected around 2018-2019.
 59 | >       Includes books, scraped newspapers and legal documents.
 60 | 
 61 | * [📖 Romanian Language Repository 📖](https://github.com/lmidriganciochina/romaniancorpus)
 62 | >       A collection of written and spoken text from various
 63 | >       sources: Articles, Fairy tales, Fiction, History, Theatre, News
 64 |  
 65 | * [🏛️ MARCELL Legislative Corpus 🏛️](https://elrc-share.eu/repository/browse/marcell-romanian-legislative-subcorpus-v2/2da548428b9d11eb9c1a00155d026706ce94a6b59ffc4b0e9fb5cd9cebe6889e/)
 66 | >      Romanian national legislation from 1881 to 2021. The corpus
 67 | >      includes mainly: governmental decisions, ministerial orders,
 68 | >      decisions, decrees and laws.
 69 | >      Automatically annotated for Named Entities
 70 | 
 71 | [![ACL](https://img.shields.io/badge/ACL%20Anthology-ed1c24.svg)](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.464.pdf)
 72 | [![Homepage](https://img.shields.io/badge/marcell%20homepage-6ca1f0)](https://marcell-project.eu/)
 73 | 
 74 | * [🦠 COVID-19 Tweets 🐦](https://github.com/UBC-NLP/megacov)
 75 | >    Mega-COV is a billion-scale dataset from Twitter for studying COVID-19. It is
 76 | >    available in more than 100 languages, Romanian being one of them. Tweets need
 77 | >    to be rehydrated.
 78 | 
 79 | [![arXiv](https://img.shields.io/badge/arXiv-2005.06012-f9f107.svg)](https://arxiv.org/abs/2005.06012)
 80 | [![Medium](https://img.shields.io/badge/Medium-12100E?style=for-the-badge&logo=medium&logoColor=white)](https://mumageed.medium.com/billion-scale-investigation-of-covid-19-impact-on-human-communication-in-104-languages-874b5a37beac)
 81 | 
 82 | * [COVIDSentiRO](https://github.com/Alegzandra/KES-2023/tree/main/datasets/COVIDSentiRO)
 83 | >    A corpus of Romanian tweets related to COVID and vaccination against COVID, created and
 84 | >    collected between January 2021 and February 2022. It contains 19319 tweets.
 85 | 
 86 | * [📜 Minutes of the Sittings of the Chamber of Deputies of Romania 📜](https://elrc-share.eu/repository/browse/monolingual-corpus-from-minutes-of-the-sittings-of-the-chamber-of-deputies-of-romania-2016-2018-processed/759806e22e1311e9a4d400155d02670657928f90efa64c9dab1b177c2186bf6c/)
 87 | >     Minutes of the Sittings of the Chamber of Deputies of Romania (2016-2018)
 88 | >     Unannotated corpus
 89 |   
 90 | * [🔊 Minutes of the Sittings of the Romanian Parliament 🔊](https://elrc-share.eu/repository/browse/romanian-parliament-transcripts-1996-2018-processed/779b85aee4de11e9913100155d026706ac0a5e38c6824010a537363b37b6bd0f/)
 91 | >     Contains 500k+ instances of speech from the parliament podium from
 92 | >     1996 to 2018. Sentence splitting and deduplication at sentence level
 93 | >     have been applied as processing steps.
 94 | >     Unannotated corpus
 95 |   
 96 | * [🗣️ Romanian Presidential Discourses 🗣️](https://github.com/grrrrah/RomanianPresidentialDiscourses)
 97 | >     Romanian presidential discourses (1990-2020) split into 4 files,
 98 | >     one for each president. Unannotated corpus
 99 |   
100 |   
101 | * [🎭 Culture Domain Corpus 🎭](https://elrc-share.eu/repository/browse/monolingual-romanian-corpus-in-the-culture-domain-processed/a1d6c98e1d5911e9b7d400155d026706fd71a90bf1df4bacb3f174edccb6e9b9/)
102 | >     Monolingual Romanian corpus, including content from public websites related to culture
103 | 
104 | * [Law Domain Corpus](https://elrc-share.eu/repository/browse/monolingual-romanian-corpus-in-the-law-domain/ee9f6b0289f611e6bfe700155d0205029e7b188412ab4e56bf6fd1d7d9e8b033/)
105 | >    Monolingual (ron) corpus, containing 38063991 tokens and 854096 lexical types in the law domain.
106 | 
107 | * [Public Administration Domain Corpus](https://elrc-share.eu/repository/browse/monolingual-romanian-corpus-in-the-public-administration-domain-processed/32b0c234327311e8b7d400155d026706afa147904c554dd1bad4764bd4a7aaed/)
108 | >    Monolingual Romanian corpus, containing 360833 sentences (9064764 words) in the public administration domain.
109 |   
110 | * [New Civil Procedure Code](https://elrc-share.eu/repository/browse/romanian-new-civil-procedure-code-processed/e4d8e13046ff11e8b7d400155d026706010657f248274e6286aeb6488d8a2ee6/)
111 | >    The New Civil Procedure Code in Romanian (monolingual) comprising 297888 words.
112 | 
113 | * [New Criminal Code](https://elrc-share.eu/repository/browse/noul-cod-penal/100b7e38d25111ea913100155d0267066e48d760d0d24b39a4900b2b09864a02/)
114 | >     The Romanian updated criminal code: text with law content.
115 | 
116 | * [Romanian News Articles Dataset](https://github.com/mhakan20/RomanianNewsArticlesDataset)
117 | >   News articles dataset from Romanian news sites.
118 | >   Includes title, summary and article
119 | 
120 | * [Old Newspapers](https://www.kaggle.com/datasets/alvations/old-newspapers)
121 | >   Multi-language corpus from news sources available online.
122 | >   It also contains 43 million words in Romanian from Twitter, blogs and newspapers
123 | 
124 | [![Homepage](https://img.shields.io/badge/hc%20corpora%20homepage-6ca1f0)](http://corpora.epizy.com/corpora.html?i=1)
125 | 
126 | * [ELTeC-Rom](https://github.com/COST-ELTeC/ELTeC-rom)
127 | >   The Romanian novel collection for ELTeC, the European Literary Text Collection
128 | >   Sources: Biblioteca Metropolitana din Bucuresti, Biblioteca Universitara "Mihai Eminescu" din Iasi,
129 | >   Biblioteca Judeteana din Botosani, personal micro-collections uploaded on Zenodo
130 | >   under the following labels: "Hajduks Library"; "RomanianNovel Library"; "CityMysteries Library"; "BibliotecaDHL_Iasi" 
131 |   
132 | * [RO Business Emails](https://huggingface.co/datasets/readerbench/ro-business-emails)
133 | >   Public dataset of 1447 manually annotated Romanian business-oriented emails.
134 | >   The corpus is annotated with 5 token-related labels, as well as 5 sequence-related classes
135 | 
136 | [![MDPI](https://img.shields.io/badge/MDPI-Information-00813E.svg)](https://www.mdpi.com/2078-2489/14/6/321)
137 | 
138 | 
139 | * [📖RO-Stories📖](https://huggingface.co/datasets/readerbench/ro-stories)
140 | >  The corpus consists of texts written by Romanian authors between the 19th century and the present, representing stories, short stories, fairy tales and sketches.
141 | >  The current version contains 19 authors, 1263 full texts and 12516 paragraphs of around 200 words each, preserving paragraph integrity.
142 | 
143 | * [📕ROST📕](https://www.kaggle.com/datasets/sandamariaavram/rost-romanian-stories-and-other-texts)
144 | >  A dataset containing 400 Romanian texts written by 10 authors
145 | >  The dataset contains stories, short stories, fairy tales, novels, articles, and sketches written by Ion Creangă,
146 | >  Barbu Ştefănescu Delavrancea, Mihai Eminescu, Nicolae Filimon, Emil Gârleanu, Petre Ispirescu, Mihai Oltean, Emilia Plugaru, Liviu Rebreanu, Ioan Slavici.
147 | 
148 | [![MDPI](https://img.shields.io/badge/MDPI-Mathematics-00813E.svg)](https://www.mdpi.com/2227-7390/10/23/4589)
149 | 
150 | * [🍳Romanian Cooking Recipes🍳](https://huggingface.co/datasets/BlackKakapo/recipes-ro)
151 | >  891 Cooking Recipes in Romanian Language
152 | 
153 | ## Semantic Textual Similarity / Paraphrasing
154 | 
155 | * [RO-STS](https://huggingface.co/datasets/ro_sts)
156 | >   Semantic Textual Similarity dataset for the Romanian language
157 | >   RO-STS contains 8,628 sentence pairs with their similarity scores
158 | 
159 | [![NeurIPS](https://img.shields.io/badge/NeurIPS-ed1c24.svg)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/5f93f983524def3dca464469d2cf9f3e-Abstract-round1.html)
160 | 
161 | * [Romanian Bible Paraphrase Corpus](https://huggingface.co/datasets/andyP/ro-paraphrase-bible)
162 | >   A paraphrase corpus created from 10 different Romanian language Bible versions.
163 | >   The final dataset contains 904,815 similar records and 218,977 non matching records, totaling 1,123,927
164 | 
165 | 
166 | * [Romanian paraphrase dataset](https://huggingface.co/datasets/BlackKakapo/paraphrase-ro)
167 | >  Around 100k examples of paraphrases. There is no clear explanation of how the dataset was built
168 | 
169 | * [TaPaCo](https://huggingface.co/datasets/tapaco/viewer/ro/train)
170 | >    A multi-language paraphrase corpus for 73 languages extracted from the Tatoeba database.
171 | >    It has ~2000 Romanian phrases totaling 941 paraphrase groups.
172 | 
173 | [![ACL](https://img.shields.io/badge/ACL%20Anthology-ed1c24.svg)](https://aclanthology.org/2020.lrec-1.848/)
174 | [![Homepage](https://img.shields.io/badge/tapaco%20homepage-6ca1f0)](https://zenodo.org/record/3707949)
175 | 
176 | ## Natural Language Inference
177 | 
178 | * [RONLI](https://github.com/Eduard6421/RONLI)
179 | >   We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs, which are obtained via distant supervision, and 6K validation and test sentence pairs, which are manually annotated with the correct labels.
180 | [![ACL](https://img.shields.io/badge/ACL%20Anthology-ed1c24.svg)](https://aclanthology.org/2024.acl-long.15/)
181 | 
182 | 
183 | 
184 | * [~RO-NLI~](https://github.com/dumitrescustefan/RO-NLI)
185 | >  The repository appears to be only an initial attempt at building the dataset
186 | 
187 | 
188 | ## Summarization
189 | * [RO Text Summarization](https://huggingface.co/datasets/readerbench/ro-text-summarization)
190 | >   Around 72k full texts and their summaries. The source seems to be news websites.
191 | >   No description or explanation available
192 | 
193 | ## Dialect and regional speech identification
194 | * [RoDia](https://github.com/codrut2/RoDia)
195 | > A varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments.
196 | > Around 2800 records labeled with age, gender and type of dialect
197 | 
198 | [![arXiv](https://img.shields.io/badge/arXiv-2309.03378-f9f107.svg)](https://arxiv.org/abs/2309.03378)
199 | 
200 | * [MOROCO](https://github.com/butnaruandrei/MOROCO)
201 | > MOROCO: The Moldavian and Romanian Dialectal Corpus
202 | > The MOROCO data set contains Moldavian and Romanian samples of text collected from the news domain.
203 | > The samples belong to one of the following six topics: culture, finance, politics, science, sports, tech,
204 | > totaling over 32,000 labeled records
205 | 
206 | [![arXiv](https://img.shields.io/badge/arXiv-1901.06543-f9f107.svg)](https://arxiv.org/abs/1901.06543)
207 | 
208 |   
209 | ## Named Entity Recognition (NER)
210 | 
211 | * [LegalNERo](https://huggingface.co/datasets/joelito/legalnero)
212 | * [RONEC](https://huggingface.co/datasets/ronec)
213 | * [WikiAnn](https://huggingface.co/datasets/wikiann)
214 | * [SiMoNERo](https://github.com/UniversalDependencies/UD_Romanian-SiMoNERo)
215 | 
216 | ## Autorship Attribution
217 | 
218 | * [ROST](https://www.kaggle.com/datasets/sandamariaavram/rost-romanian-stories-and-other-texts)
219 | 
220 | 
221 | ## Sentiment Analysis
222 | 
223 | * [RO_Sent](https://huggingface.co/datasets/ro_sent)
224 | * [Senti_Lex](https://huggingface.co/datasets/senti_lex)
225 | * [LaROSeDa](https://huggingface.co/datasets/laroseda)
226 | * [SART](https://github.com/Alegzandra/KES-2023/tree/main/datasets/SART)
227 | * [RED](https://github.com/Alegzandra/RED-Romanian-Emotions-Dataset)
228 | * [Romanian Categorized Web Dataset](https://github.com/bogsio/RomanianCategorizedWebDataset)
229 | * [Romanian Sentiment Movie Reviews](https://www.kaggle.com/datasets/gringoandy/romanian-sentiment-movie-reviews)
230 | 
231 | 
232 | ## Dependency Parsing
233 | 
234 | * [CoNLL 2017 & 2018](https://www.conll.org/previous-tasks)
235 | * [Deep Universal Dependencies](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3720)
236 | * [Curlicat Romanian Corpus](https://elrc-share.eu/repository/browse/curlicat-romanian-corpus/8b6c8dca58ea11ed9c1a00155d026706fb03ef8b4c1847cfbe9cea869a82731e/)
237 | * [HamleDT](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1508)
238 | * [RoWordNet](https://github.com/dumitrescustefan/RoWordNet)
239 | * [RoRefTrees](https://github.com/UniversalDependencies/UD_Romanian-RRT)
240 | 
241 | ## Diacritics Restoration / Grammar Correction
242 | 
243 | * [Corpus for training and evaluating diacritics restoration systems](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2607)
244 | * [RONACC](https://nextcloud.readerbench.com/index.php/s/9pwymesT5sycxoM)
245 | 
246 | ## Fake News / Clickbait / Satirical News
247 | 
248 | * [Fakerom](https://www.tagtog.com/fakerom/fakerom)
249 | * [Clickbait dataset on Romanian SciTech News](https://github.com/ralucaginga/ClickbaitSciTechRO)
250 | * [SaRoCo](https://github.com/MihaelaGaman/SaRoCo)
251 | 
252 | ## Offensive Language
253 | 
254 | * [RO-Offense](https://huggingface.co/datasets/readerbench/ro-offense)
255 |   
256 | * [News RO-Offense](https://huggingface.co/datasets/readerbench/news-ro-offense)
257 | > 4,052 comments from a Romanian local news website, manually annotated
258 | > into one of the following classes: non-offensive, targeted insults,
259 | > racist, homophobic, and sexist.
260 | 
261 | [![RoCHI](https://img.shields.io/badge/UTCluj-RoCHI-ac2820.svg)](http://rochi.utcluj.ro/articole/10/RoCHI2022-Cojocaru-A.pdf)
262 | 
263 | * [FB RO-Offense](https://huggingface.co/datasets/readerbench/ro-fb-offense)
264 | > 4455 organically generated comments from Facebook live broadcasts,
265 | > annotated for both binary and fine-grained offensive language detection
266 | 
267 | [![IEEE](https://img.shields.io/badge/IEEE-xplore-14303e.svg)](https://ieeexplore.ieee.org/document/10130824)
268 | 
269 | * [RO-Offense-Sequences](https://huggingface.co/datasets/readerbench/ro-offense-sequences)
270 | > 4800 Romanian comments annotated with offensive text spans
271 | > Offensive span detection
272 | 
273 | [![MDPI](https://img.shields.io/badge/MDPI-Information-00813e.svg)](https://www.mdpi.com/2078-2489/15/1/8)
274 | 
275 | * [Hate Speech RO](https://github.com/andra-pumnea/hate-speech-ro)
276 | > 3860 labeled hate speech records
277 |   
278 | * [ROFF](https://github.com/guzimanis/ROFF)
279 | > Dataset consists of 5000 tweets, from which 924 were labeled as offensive (18.48 %)
280 | > and 4076 tweets as non-offensive.
281 | 
282 | [![ACL](https://img.shields.io/badge/ACL%20Anthology-ed1c24.svg)](https://aclanthology.org/2021.ranlp-1.102.pdf)
283 |   
284 | * [CoRoSeOf](https://github.com/DianaHoefels/CoRoSeOf)
285 | > The corpus contains 39 245 tweets, annotated by multiple annotators, following the sexist label set of a recent study.
286 | 
287 | [![ACL](https://img.shields.io/badge/ACL%20Anthology-ed1c24.svg)](https://aclanthology.org/2022.lrec-1.243.pdf)
288 | 
289 | ## Questions and Answers
290 | 
291 | * [🧮 GSM8K RO 🧮](https://huggingface.co/datasets/BlackKakapo/gsm8k-ro)
292 | >   This dataset is just the translation of the [gsm8k](https://huggingface.co/datasets/gsm8k) dataset.
293 | >   GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems.
294 | >   There is no information on the quality of the translation
295 | 
296 | * [💻 ROCODE 💻](https://huggingface.co/datasets/cosmadrian/rocode)
297 | >   RoCode, a competitive programming dataset, consisting of 2,642 problems written in Romanian,
298 | >  11k solutions in C, C++ and Python and comprehensive testing suites for each problem. The purpose of RoCode
299 | >  is to provide a benchmark for evaluating the code intelligence of language models trained on
300 | >  Romanian / multilingual text as well as a fine-tuning set for pretrained Romanian models.
301 | 
302 | [![arXiv](https://img.shields.io/badge/arXiv-2402.13222-f9f107.svg)](https://arxiv.org/abs/2402.13222)
303 | 
304 | 
305 | ## Spelling, Dictionaries and Gramatical Errors
306 | 
307 | * [Grammar-RO](https://huggingface.co/datasets/BlackKakapo/grammar-ro)
308 | >  Synthetic dataset with ~1.9M records. Altered and correct statement as columns
309 | 
310 | * [RoAcReL](https://huggingface.co/datasets/fmi-unibuc/RoAcReL)
311 | >  Romanian Archaisms Regionalisms Lexicon containing ~1940 word definitions
312 | 
313 | * [RoRuDi](https://huggingface.co/datasets/fmi-unibuc/RoRuDi)
314 | > Romanian Rules for Dialects - 1940 regionalisms, meanings and the region of provenience
315 | 
316 | 
317 | 
318 | 
319 | 


--------------------------------------------------------------------------------