# awesome-ukrainian-nlp
Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

## News

* 2025/11 -- [UNLP 2026 — The Fifth UNLP Conference](https://unlp.org.ua/call-for-papers/): first call for papers.
* 2024/12 -- [UNLP 2025 Shared Task on Detecting Social Media Manipulation](https://unlp.org.ua/shared-task/) has been announced.
* 2024/01 -- [UNLP 2024 Shared Task on Fine-Tuning LLMs for Ukrainian](https://github.com/unlp-workshop/unlp-2024-shared-task) has been announced.


## 1. Datasets / Corpora

### Monolingual

* [Kobza](https://huggingface.co/datasets/Goader/kobza) — around 1.3TB of uncompressed text, 60 billion tokens across 97 million documents; a deduplicated compilation of CulturaX, Fineweb 2, HPLT 2.0, Ukrainian News and UberText 2.0.
* [Malyuk](https://huggingface.co/datasets/lang-uk/malyuk) — 113GB of text, a compilation of UberText 2.0, OSCAR and Ukrainian News.
* [Brown-UK](https://github.com/brown-uk/corpus) — carefully curated corpus of the modern Ukrainian language with disambiguated tokens, 1 million words.
* [UberText 2.0](https://lang.org.ua/en/ubertext/) — over 5 GB of news, Wikipedia, social, fiction, and legal texts.
* [Wikipedia](https://dumps.wikimedia.org/ukwiki/latest/)
* [OSCAR](https://oscar-corpus.com/) — shuffled sentences extracted from [Common Crawl](https://commoncrawl.org/) and classified with a language detection model. The Ukrainian portion is 28GB deduplicated.
* [CC-100](http://data.statmt.org/cc-100/) — documents extracted from [Common Crawl](https://commoncrawl.org/), automatically classified and filtered. The Ukrainian part is 200M sentences or 10GB of deduplicated text.
* [mC4](https://github.com/allenai/allennlp/discussions/5056) — another filtered Common Crawl extract, 196GB of Ukrainian text.
* [Ukrainian Twitter corpus](https://github.com/saganoren/ukr-twi-corpus) - Ukrainian Twitter corpus for toxic text detection.
* [Ukrainian forums](https://github.com/khrystyna-skopyk/ukr_spell_check/blob/master/data/scraped.txt) — 250k sentences scraped from forums.
* [Ukrainian news headlines](https://huggingface.co/datasets/Yehor/news-headlines-ubercorpus) — 3.98M news headlines.

### Parallel

* [OPUS](https://opus.nlpl.eu/)
* [Tatoeba MT Challenge data sets](https://github.com/Helsinki-NLP/Tatoeba-Challenge/)
* [Polish-Ukrainian Parallel Corpus](https://clarin-pl.eu/dspace/handle/11321/535)
* [Back-translated monolingual Wiki data](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/data/Backtranslations.md)
* [Wiki Edits](https://huggingface.co/datasets/osyvokon/wiki-edits-uk) — 5M sentence edits extracted from the Ukrainian Wikipedia revision history.

See [Helsinki-NLP/UkrainianLT](https://github.com/Helsinki-NLP/UkrainianLT) for more data and links to machine translation resources.

### Labeled

* [ZNO](https://huggingface.co/datasets/osyvokon/zno) — ~4000 text-only questions and answers from the Ukrainian External Independent Testing (ЗНО/ZNO).
* [MMZNO](https://huggingface.co/datasets/lang-uk/MMZNO) — ~4000 multi-modal (text and images) ZNO questions.
* [UA-GEC](https://github.com/grammarly/ua-gec) — grammatical error correction (GEC) and fluency corpus.
* [OmniGEC](https://huggingface.co/collections/lang-uk/omnigec-68095391ebef195ed6c0a5f3) — synthetic GEC datasets, along with models.
* [NER-uk](https://github.com/lang-uk/ner-uk) — Brown-UK labeled for named entities.
* [Yakaboo Book Reviews](https://1drv.ms/f/s!AgoiFOsRix8LcYNBl26rru8wGGo?e=geqLkp) — book reviews, ratings and descriptions.
* [Universal Dependencies](https://github.com/UniversalDependencies/UD_Ukrainian-IU/tree/master) — dependency trees corpus.
* [ua-news](https://github.com/fido-ai/ua-datasets/tree/main/ua_datasets/src/text_classification) — 150k news articles in 5 categories.
* [UA-SQuAD](https://github.com/fido-ai/ua-datasets/tree/main/ua_datasets/src/question_answering) — Ukrainian version of the Stanford Question Answering Dataset.
* [Ukrainian Winograd schema challenge (WSC) Dataset](https://github.com/pkuchmiichuk/ua-coref#ukrainian-wsc-dataset) — manually translated.
* [Ukrainian OntoNotes Dataset](https://github.com/pkuchmiichuk/ua-coref#ukrainian-ontonotes-dataset) — scripts to build a large silver dataset for coreference resolution.

### Dictionaries

* [ВЕСУМ](https://github.com/brown-uk/dict_uk) — POS tag dictionary. Can generate a list of all word forms valid for spelling.
* [Tonal dictionary](https://github.com/lang-uk/tone-dict-uk)
* [Multilingualsentiment, includes Ukrainian](https://sites.google.com/site/datascienceslab/projects/multilingualsentiment) - a list of positive/negative words
* [obscene-ukr](https://github.com/saganoren/obscene-ukr) — profanity dictionary
* [Word stress dictionary](https://github.com/lang-uk/ukrainian-word-stress-dictionary) — word stress for 2.7M word forms. See [ukrainian-word-stress](https://github.com/lang-uk/ukrainian-word-stress)
* [Heteronyms](https://github.com/lang-uk/ukrainian-heteronyms-dictionary) — words that share the same spelling but differ in meaning or pronunciation.
* [Abbreviations](https://github.com/lang-uk/ukrainian-abbreviations-dictionary) — maps abbreviations to their expansions

### Prompts

* [Aya](https://huggingface.co/datasets/CohereForAI/aya_dataset) — crowd-sourced prompts and reference outputs. The Ukrainian part is ~500 prompts.


## 2. Tools

* [tree_stem](https://github.com/amakukha/stemmers_ukrainian) — stemmer
* [pymorphy2](https://github.com/kmike/pymorphy2) + [pymorphy2-dicts-uk](https://pypi.org/project/pymorphy2-dicts-uk/) — POS tagger and lemmatizer
* [LanguageTool](https://languagetool.org/uk/) — grammar, style and spell checker
* [Stanza](https://stanfordnlp.github.io/stanza/) — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER
* [nlp-uk](https://github.com/brown-uk/nlp_uk) — tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation
* [NLP-Cube](https://github.com/adobe/NLP-Cube) - Python package for tokenization, sentence splitting, multi-word-tokenization, lemmatization, part-of-speech tagging and dependency parsing.


## 3. Pretrained models

### Language models

*Autoregressive:*
* [Lapa](https://huggingface.co/collections/lapa-llm/lapa-v012-release) — Gemma-3-12B-based Ukrainian LLM, along with training datasets
* [MamayLM v0.1](https://huggingface.co/collections/INSAIT-Institute/mamaylm-gemma-2-68080b895a949a52b474d5de) - Ukrainian-focused Gemma 2 based 9B model, pre-trained and fine-tuned on large Ukrainian/English corpora (blog in [Ukrainian](https://huggingface.co/blog/INSAIT-Institute/mamaylm-ukr) and [English](https://huggingface.co/blog/INSAIT-Institute/mamaylm))
* [MamayLM v1.0](https://huggingface.co/collections/INSAIT-Institute/mamaylm-v10-gemma-3-68d3fd732b78eaba4886db9d) - Ukrainian-focused Gemma 3 based 4B and 12B multimodal models, pre-trained and fine-tuned on large Ukrainian/English corpora ([blog](https://blog.mamaylm.insait.ai/))

* [aya-101](https://huggingface.co/CohereForAI/aya-101) — massively multilingual LM, 13B parameters
* [pythia-uk](https://huggingface.co/theodotus/pythia-uk) — mT5 finetuned on wiki and oasst1 for chats in Ukrainian.
* [UAlpaca](https://github.com/robinhad/kruk) — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset.
* [XGLM](https://github.com/pytorch/fairseq/blob/main/examples/xglm/README.md) — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian.
* [Tereveni-AI/GPT-2](https://huggingface.co/Tereveni-AI/gpt2-124M-uk-fiction)
* [uk4b](https://github.com/proger/uk4b) and [haloop inference toolkit](https://github.com/proger/haloop/tree/main#pretrained-models) - GPT-2 small, medium and large-style models trained on UberText 2.0 Wikipedia, news and books.

*Masked:*
* [xlm-roberta-base-uk](https://huggingface.co/ukr-models/xlm-roberta-base-uk) — truncated version of XLM-RoBERTa with only Ukrainian and English embeddings left.
* [youscan/ukr-roberta-base](https://huggingface.co/youscan/ukr-roberta-base)
* [Goader/modern-liberta-large](https://huggingface.co/Goader/modern-liberta-large) — ModernBERT Large with Ukrainian tokenizer and 8192 context window, continually pretrained on 160B tokens.

*Mixed:*
* [Electra](https://huggingface.co/lang-uk)

### Machine translation

* [Helsinki-NLP / OPUS-MT models](https://github.com/Helsinki-NLP/UkrainianLT) — Ukrainian to/from 25 languages.
  - [OPUS-MT models at HuggingFace](https://huggingface.co/models?language=uk&pipeline_tag=translation&sort=modified)
  - [OPUS-MT models evaluated on flores101](https://github.com/Helsinki-NLP/UkrainianLT/blob/main/opus-mt-ukr-flores-devtest.md)
* [M2M-100](https://github.com/pytorch/fairseq/tree/master/examples/m2m_100) — Ukrainian to/from 100 languages.
* [Uk-En folktale corpus](https://github.com/Ukrainian-To-English-Corpora/Folktale_corpus) — small sentence-aligned corpus of fairy tales.

See [Helsinki-NLP/UkrainianLT](https://github.com/Helsinki-NLP/UkrainianLT) for more.
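
The OPUS-MT checkpoints on Hugging Face can be run with the `transformers` library. A minimal sketch, assuming `transformers`, `sentencepiece` and `torch` are installed and that the Ukrainian-to-English checkpoint is published as `Helsinki-NLP/opus-mt-uk-en`. The helper names `opus_mt_model_id` and `translate` are illustrative only, and other language pairs may use different ids, so check the model list linked above:

```python
from typing import List


def opus_mt_model_id(src: str, tgt: str) -> str:
    """Build a Hub id following the `Helsinki-NLP/opus-mt-{src}-{tgt}` pattern.

    Assumption: not every language pair is published under this pattern.
    """
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def translate(texts: List[str], src: str = "uk", tgt: str = "en") -> List[str]:
    """Translate a batch of sentences with an OPUS-MT (Marian) checkpoint."""
    # Imported lazily so the helper above works without the heavy dependency.
    from transformers import MarianMTModel, MarianTokenizer

    model_id = opus_mt_model_id(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(model_id)
    model = MarianMTModel.from_pretrained(model_id)

    # Pad to the longest sentence so the batch forms a single tensor.
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate(["Добрий день! Як справи?"])` downloads the checkpoint on first use and returns a list with one English sentence.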

### Sequence-to-sequence models

* [mBART50](https://github.com/pytorch/fairseq/tree/master/examples/multilingual#mbart50-models)
* [mT5](https://github.com/google-research/multilingual-t5)

### Named-entity recognition (NER)

* [MITIE NER Model](https://lang.org.ua/en/models/#anchor1)
* [ukr-models/uk-ner](https://huggingface.co/ukr-models/uk-ner)
* [lang-uk/flair-uk-ner](https://huggingface.co/lang-uk/flair-uk-ner)
* [dchaplinsky/uk_ner_web_trf_large](https://huggingface.co/dchaplinsky/uk_ner_web_trf_large)

### Part-of-speech tagging (POS)

* [lang-uk/flair-uk-pos](https://huggingface.co/lang-uk/flair-uk-pos)

### Word embeddings

* fastText
  - [Official fastText trained on CommonCrawl and Wiki](https://fasttext.cc/docs/en/crawl-vectors.html) — 157 languages, including Ukrainian.
  - [Older official fastText trained on Wiki](https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md) — 294 languages, including Ukrainian.
  - [fastText_multilingual](https://github.com/babylonhealth/fastText_multilingual) — 78 languages, aligned to the same vector space.
  - [fasttext_uk (2023)](https://huggingface.co/dchaplinsky/fasttext_uk) and [cbow](https://huggingface.co/dchaplinsky/fasttext_uk_cbow) — trained on UberText 2.0
* [Word2Vec](https://lang.org.ua/en/models/#anchor4)
* [GloVe](https://lang.org.ua/en/models/#anchor4)
* [LexVec](https://lang.org.ua/en/models/#anchor4)
* [BPEmb: Subword Embeddings, includes Ukrainian](https://nlp.h-its.org/bpemb/) - easy to use with [Flair](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/BYTE_PAIR_EMBEDDINGS.md)
* [Flair](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/FLAIR_EMBEDDINGS.md) — [Ukrainian](https://huggingface.co/lang-uk/flair-uk-forward) added in 2022.
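
The text-format fastText vectors listed above can be read without extra dependencies. A minimal sketch, assuming the standard `.vec` layout (a header line with the vocabulary size and dimension, then one word and its vector per line). The toy vectors and the helper names `load_vec` and `cosine` are made up for illustration and are not part of fastText:

```python
import io
import math
from typing import Dict, List, Optional, TextIO


def load_vec(stream: TextIO, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read word vectors from a fastText text-format (.vec) stream."""
    # Header line: "<vocabulary size> <dimension>".
    _count, dim = (int(x) for x in stream.readline().split())
    vectors: Dict[str, List[float]] = {}
    for line in stream:
        if limit is not None and len(vectors) >= limit:
            break
        # Split from the right so exactly `dim` numbers are peeled off the line.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy 2-dimensional data standing in for a real file (hypothetical values).
TOY_VEC = "3 2\nкіт 1.0 0.0\nкішка 0.9 0.1\nстіл 0.0 1.0\n"

vectors = load_vec(io.StringIO(TOY_VEC))
print(cosine(vectors["кіт"], vectors["кішка"]))
print(cosine(vectors["кіт"], vectors["стіл"]))
```

For a real file, such as the `cc.uk.300.vec.gz` download from the fastText page (file name assumed from that page's naming scheme), pass `gzip.open(path, "rt", encoding="utf-8")` as the stream and set `limit` to keep memory use down.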

### Other

* [uk-punctcase](https://huggingface.co/ukr-models/uk-punctcase) — punctuation and case restoration model based on XLM-RoBERTa-Uk.
* [punctuation_uk_bert](https://huggingface.co/dchaplinsky/punctuation_uk_bert) — another punctuation and case restoration model based on bert-base-multilingual-cased.
* [ukrainian-word-stress](https://github.com/lang-uk/ukrainian-word-stress) — adds word stress.

## 4. Paid

* [LORELEI Ukrainian Representative Language Pack](https://catalog.ldc.upenn.edu/LDC2020T24) - Ukrainian monolingual text, Ukrainian-English parallel text, partially annotated for named entities


## 5. Other resources and links

* [Helsinki-NLP/UkrainianLT](https://github.com/Helsinki-NLP/UkrainianLT) — another collection of links to Ukrainian language tools.
* [egorsmkv / speech-recognition-uk](https://github.com/egorsmkv/speech-recognition-uk) — speech recognition and text-to-speech models and datasets

## 6. Workshops and conferences

* [Ukrainian Natural Language Processing Workshop](https://unlp.org.ua/)
* UNLP 2023 Shared Task — shared task (competition) in grammatical error correction for Ukrainian
  - [Training data and evaluation scripts](https://github.com/osyvokon/unlp-2023-shared-task)
  - [Public leaderboard](https://codalab.lisn.upsaclay.fr/competitions/10740)
* [UNLP 2024 Shared Task](https://github.com/unlp-workshop/unlp-2024-shared-task) — shared task (competition) on fine-tuning large language models (LLMs) for Ukrainian
* [UNLP 2025 Shared Task on Detecting Social Media Manipulation](https://unlp.org.ua/shared-task/)