└── README.md


/README.md:
--------------------------------------------------------------------------------
  1 | # awesome-ukrainian-nlp
  2 | Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)
  3 | 
  4 | ## News
  5 | 
  6 | * 2025/11 -- [UNLP 2026 — The Fifth UNLP Conference](https://unlp.org.ua/call-for-papers/) first call for papers has been announced.
  7 | * 2024/12 -- [UNLP 2025 Shared Task on Detecting Social Media Manipulation](https://unlp.org.ua/shared-task/) has been announced.
  8 | * 2024/01 -- [UNLP 2024 Shared Task on Fine-Tuning LLMs for Ukrainian](https://github.com/unlp-workshop/unlp-2024-shared-task) has been announced.
  9 | 
 10 | 
 11 | ## 1. Datasets / Corpora
 12 | 
 13 | ### Monolingual
 14 | 
 15 | * [Kobza](https://huggingface.co/datasets/Goader/kobza) — around 1.3TB of uncompressed text, 60 billion tokens across 97 million documents, deduplicated compilation of CulturaX, Fineweb 2, HPLT 2.0, Ukrainian News and UberText 2.0.
 16 | * [Malyuk](https://huggingface.co/datasets/lang-uk/malyuk) — 113GB of text, compilation of UberText 2.0, OSCAR, Ukrainian News. 
 17 | * [Brown-UK](https://github.com/brown-uk/corpus) — carefully curated corpus of modern Ukrainian language with disambiguated tokens, 1 million words
 18 | * [UberText 2.0](https://lang.org.ua/en/ubertext/) — over 5 GB of news, Wikipedia, social, fiction, and legal texts
 19 | * [Wikipedia](https://dumps.wikimedia.org/ukwiki/latest/)
 20 | * [OSCAR](https://oscar-corpus.com/) — shuffled sentences extracted from [Common Crawl](https://commoncrawl.org/) and classified with a language detection model. The Ukrainian portion is 28GB deduplicated.
 21 | * [CC-100](http://data.statmt.org/cc-100/) — documents extracted from [Common Crawl](https://commoncrawl.org/), automatically classified and filtered. The Ukrainian part is 200M sentences or 10GB of deduplicated text.
 22 | * [mC4](https://github.com/allenai/allennlp/discussions/5056) — filtered CommonCrawl again, 196GB of Ukrainian text. 
 23 | * [Ukrainian Twitter corpus](https://github.com/saganoren/ukr-twi-corpus) - Ukrainian Twitter corpus for toxic text detection.
 24 | * [Ukrainian forums](https://github.com/khrystyna-skopyk/ukr_spell_check/blob/master/data/scraped.txt) — 250k sentences scraped from forums.
 25 | * [Ukrainian news headlines](https://huggingface.co/datasets/Yehor/news-headlines-ubercorpus) — 3.98M news headlines.
 26 | 
 27 | ### Parallel
 28 | 
 29 | * [OPUS](https://opus.nlpl.eu/)
 30 | * [Tatoeba MT Challenge data sets](https://github.com/Helsinki-NLP/Tatoeba-Challenge/)
 31 | * [Polish-Ukrainian Parallel Corpus](https://clarin-pl.eu/dspace/handle/11321/535) 
 32 | * [Back-translated monolingual Wiki data](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/data/Backtranslations.md)
 33 | * [Wiki Edits](https://huggingface.co/datasets/osyvokon/wiki-edits-uk) — 5M sentence edits extracted from the Ukrainian Wikipedia revision history.
 34 | 
 35 | See [Helsinki-NLP/UkrainianLT](https://github.com/Helsinki-NLP/UkrainianLT) for more data and machine translation resources links.
 36 | 
 37 | ### Labeled
 38 | 
 39 | * [ZNO](https://huggingface.co/datasets/osyvokon/zno) — ~4000 text-only questions and answers from the Ukrainian External Independent Testing (ЗНО/ZNO).
 40 | * [MMZNO](https://huggingface.co/datasets/lang-uk/MMZNO) — ~4000 multi-modal (text and images) ZNO questions
 41 | * [UA-GEC](https://github.com/grammarly/ua-gec) — grammatical error correction (GEC) and fluency corpus.
 42 | * [OmniGEC](https://huggingface.co/collections/lang-uk/omnigec-68095391ebef195ed6c0a5f3) — synthetic GEC datasets, along with models.
 43 | * [NER-uk](https://github.com/lang-uk/ner-uk) — Brown-UK labeled for named entities.
 44 | * [Yakaboo Book Reviews](https://1drv.ms/f/s!AgoiFOsRix8LcYNBl26rru8wGGo?e=geqLkp) — book reviews, ratings and descriptions.
 45 | * [Universal Dependencies](https://github.com/UniversalDependencies/UD_Ukrainian-IU/tree/master) — dependency trees corpus.
 46 | * [ua-news](https://github.com/fido-ai/ua-datasets/tree/main/ua_datasets/src/text_classification) — 150k news articles in 5 categories.
 47 | * [UA-SQuAD](https://github.com/fido-ai/ua-datasets/tree/main/ua_datasets/src/question_answering) — Ukrainian version of Stanford Question Answering Dataset.
 48 | * [Ukrainian Winograd schema challenge (WSC) Dataset](https://github.com/pkuchmiichuk/ua-coref#ukrainian-wsc-dataset) — manually translated.
 49 | * [Ukrainian OntoNotes Dataset](https://github.com/pkuchmiichuk/ua-coref#ukrainian-ontonotes-dataset) — scripts to build a large silver dataset for coreference resolution.
 50 |  
 51 | ### Dictionaries
 52 | 
 53 | * [ВЕСУМ](https://github.com/brown-uk/dict_uk) — POS tag dictionary. Can generate a list of all word forms valid for spelling.
 54 | * [Tonal dictionary](https://github.com/lang-uk/tone-dict-uk)
 55 | * [Multilingualsentiment, includes Ukrainian](https://sites.google.com/site/datascienceslab/projects/multilingualsentiment) - a list of positive/negative words
 56 | * [obscene-ukr](https://github.com/saganoren/obscene-ukr) — profanity dictionary
 57 | * [Word stress dictionary](https://github.com/lang-uk/ukrainian-word-stress-dictionary) — word stress for 2.7M word forms. See [ukrainian-word-stress](https://github.com/lang-uk/ukrainian-word-stress) 
 58 | * [Heteronyms](https://github.com/lang-uk/ukrainian-heteronyms-dictionary) — words that share the same spelling but have different meaning/pronunciation.
 59 | * [Abbreviations](https://github.com/lang-uk/ukrainian-abbreviations-dictionary) — maps abbreviations to their expansions
 60 | 
 61 | ### Prompts
 62 | 
 63 | * [Aya](https://huggingface.co/datasets/CohereForAI/aya_dataset) — crowd-sourced prompts and reference outputs. Ukrainian part is ~500 prompts.
 64 | 
 65 | 
 66 | ## 2. Tools
 67 | 
 68 | * [tree_stem](https://github.com/amakukha/stemmers_ukrainian) — stemmer
 69 | * [pymorphy2](https://github.com/kmike/pymorphy2) + [pymorphy2-dicts-uk](https://pypi.org/project/pymorphy2-dicts-uk/) — POS tagger and lemmatizer
 70 | * [LanguageTool](https://languagetool.org/uk/) — grammar, style and spell checker
 71 | * [Stanza](https://stanfordnlp.github.io/stanza/) — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER
 72 | * [nlp-uk](https://github.com/brown-uk/nlp_uk) — Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation
 73 | * [NLP-Cube](https://github.com/adobe/NLP-Cube) - Python package for tokenization, sentence splitting, multi-word-tokenization, lemmatization, part-of-speech tagging and dependency parsing.
 74 | 
 75 |  
 76 | 
 77 | ## 3. Pretrained models
 78 | 
 79 | ### Language models
 80 | 
 81 | *Autoregressive:*
 82 | * [Lapa](https://huggingface.co/collections/lapa-llm/lapa-v012-release) — Gemma-3-12B-based Ukrainian LLM, along with training datasets
 83 | * [MamayLM v0.1](https://huggingface.co/collections/INSAIT-Institute/mamaylm-gemma-2-68080b895a949a52b474d5de) - Ukrainian-focused Gemma 2 based 9B model, pre-trained and fine-tuned on large Ukrainian/English corpora (blog in [Ukrainian](https://huggingface.co/blog/INSAIT-Institute/mamaylm-ukr) and [English](https://huggingface.co/blog/INSAIT-Institute/mamaylm))
 84 | * [MamayLM v1.0](https://huggingface.co/collections/INSAIT-Institute/mamaylm-v10-gemma-3-68d3fd732b78eaba4886db9d) - Ukrainian-focused Gemma 3 based 4B and 12B multimodal models, pre-trained and fine-tuned on large Ukrainian/English corpora ([blog](https://blog.mamaylm.insait.ai/))
 85 | 
 86 | 
 87 | * [aya-101](https://huggingface.co/CohereForAI/aya-101) — massively multilingual LM, 13B parameters
 88 | * [pythia-uk](https://huggingface.co/theodotus/pythia-uk) — Pythia fine-tuned on wiki and oasst1 for chats in Ukrainian.
 89 | * [UAlpaca](https://github.com/robinhad/kruk) — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset.
 90 | * [XGLM](https://github.com/pytorch/fairseq/blob/main/examples/xglm/README.md) — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian.
 91 | * [Tereveni-AI/GPT-2](https://huggingface.co/Tereveni-AI/gpt2-124M-uk-fiction)
 92 | * [uk4b](https://github.com/proger/uk4b) and [haloop inference toolkit](https://github.com/proger/haloop/tree/main#pretrained-models) - GPT-2 small, medium and large-style models trained on UberText 2.0 Wikipedia, news and books.
 93 | 
 94 | *Masked:*
 95 | * [xlm-roberta-base-uk](https://huggingface.co/ukr-models/xlm-roberta-base-uk) — truncated version of XLM-RoBERTa with only Ukrainian and English embeddings left.
 96 | * [youscan/ukr-roberta-base](https://huggingface.co/youscan/ukr-roberta-base)
 97 | * [Goader/modern-liberta-large](https://huggingface.co/Goader/modern-liberta-large) — ModernBERT Large with Ukrainian tokenizer and 8192 context window, continually pretrained on 160B tokens.
 98 | 
 99 | *Mixed:*
100 | * [Electra](https://huggingface.co/lang-uk)
101 | 
102 | ### Machine translation
103 | 
104 | * [Helsinki-NLP / OPUS-MT models](https://github.com/Helsinki-NLP/UkrainianLT) — Ukrainian to/from 25 languages.
105 |   - [OPUS-MT models at HuggingFace](https://huggingface.co/models?language=uk&pipeline_tag=translation&sort=modified)
106 |   - [OPUS-MT models evaluated on flores101](https://github.com/Helsinki-NLP/UkrainianLT/blob/main/opus-mt-ukr-flores-devtest.md)
107 | * [M2M-100](https://github.com/pytorch/fairseq/tree/master/examples/m2m_100) — Ukrainian to/from 100 languages.
108 | * [Uk-En folktale corpus](https://github.com/Ukrainian-To-English-Corpora/Folktale_corpus) — small sentence-aligned corpus of fairy tales.
109 | 
110 | See [Helsinki-NLP/UkrainianLT](https://github.com/Helsinki-NLP/UkrainianLT) for more.
111 | 
112 | ### Sequence-to-sequence models
113 | 
114 | * [mBART50](https://github.com/pytorch/fairseq/tree/master/examples/multilingual#mbart50-models)
115 | * [mT5](https://github.com/google-research/multilingual-t5)
116 | 
117 | ### Named-entity recognition (NER)
118 | 
119 | * [MITIE NER Model](https://lang.org.ua/en/models/#anchor1)
120 | * [ukr-models/uk-ner](https://huggingface.co/ukr-models/uk-ner)
121 | * [lang-uk/flair-uk-ner](https://huggingface.co/lang-uk/flair-uk-ner)
122 | * [dchaplinsky/uk_ner_web_trf_large](https://huggingface.co/dchaplinsky/uk_ner_web_trf_large)
123 | 
124 | ### Part-of-speech tagging (POS)
125 | 
126 | * [lang-uk/flair-uk-pos](https://huggingface.co/lang-uk/flair-uk-pos)
127 | 
128 | ### Word embeddings
129 | 
130 | * fastText
131 |   - [Official fastText trained on CommonCrawl and Wiki](https://fasttext.cc/docs/en/crawl-vectors.html) — 157 languages, including Ukrainian.
132 |   - [Older official fastText trained on Wiki](https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md) — 294 languages, including Ukrainian.
133 |   - [fastText_multilingual](https://github.com/babylonhealth/fastText_multilingual) — 78 languages, aligned to the same vector space.
134 |   - [fasttext_uk (2023)](https://huggingface.co/dchaplinsky/fasttext_uk) and [cbow](https://huggingface.co/dchaplinsky/fasttext_uk_cbow) — trained on UberText 2.0
135 | * [Word2Vec](https://lang.org.ua/en/models/#anchor4)
136 | * [GloVe](https://lang.org.ua/en/models/#anchor4)
137 | * [LexVec](https://lang.org.ua/en/models/#anchor4)
138 | * [BPEmb: Subword Embeddings, includes Ukrainian](https://nlp.h-its.org/bpemb/) - easy to use with [Flair](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/BYTE_PAIR_EMBEDDINGS.md)
139 | * [Flair](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/FLAIR_EMBEDDINGS.md) — [Ukrainian](https://huggingface.co/lang-uk/flair-uk-forward) added in 2022. 
140 | 
141 | ### Other
142 | 
143 | * [uk-punctcase](https://huggingface.co/ukr-models/uk-punctcase) — punctuation and case restoration model based on XLM-RoBERTa-Uk.
144 | * [punctuation_uk_bert](https://huggingface.co/dchaplinsky/punctuation_uk_bert) — another punctuation and case restoration model based on bert-base-multilingual-cased.
145 | * [ukrainian-word-stress](https://github.com/lang-uk/ukrainian-word-stress) — adds word stress.
146 | 
147 | ## 4. Paid
148 | 
149 | * [LORELEI Ukrainian Representative Language Pack](https://catalog.ldc.upenn.edu/LDC2020T24) - Ukrainian monolingual text, Ukrainian-English parallel text, partially annotated for named entities
150 | 
151 | 
152 | ## 5. Other resources and links
153 | 
154 | * [Helsinki-NLP/UkrainianLT](https://github.com/Helsinki-NLP/UkrainianLT) — another collection of links to Ukrainian language tools.
155 | * [egorsmkv / speech-recognition-uk](https://github.com/egorsmkv/speech-recognition-uk) — speech recognition and text-to-speech models and datasets
156 | 
157 | ## 6. Workshops and conferences
158 | 
159 | * [Ukrainian Natural Language Processing Workshop](https://unlp.org.ua/)
160 | * UNLP 2023 Shared Task — shared task (competition) in grammatical error correction for Ukrainian 
161 |   - [Training data and evaluation scripts](https://github.com/osyvokon/unlp-2023-shared-task) 
162 |   - [Public leaderboard](https://codalab.lisn.upsaclay.fr/competitions/10740)
163 | * [UNLP 2024 Shared Task](https://github.com/unlp-workshop/unlp-2024-shared-task) — shared task (competition) on fine-tuning large language models (LLMs) for Ukrainian
164 | * [UNLP 2025 Shared Task on Detecting Social Media Manipulation](https://unlp.org.ua/shared-task/)
165 | 


--------------------------------------------------------------------------------