├── CITATION.cff
├── LICENSE
└── README.md


/CITATION.cff:
--------------------------------------------------------------------------------
 1 | cff-version: 1.2.0
 2 | message: "If you use resources in this repository, please cite it as below."
 3 | authors:
 4 |   - family-names: Dadas
 5 |     given-names: Sławomir
 6 |     orcid: https://orcid.org/0000-0002-9177-6685
 7 | title: "Polish NLP resources"
 8 | version: 1.0
 9 | date-released: 2019-08-01
10 | url: "https://github.com/sdadas/polish-nlp-resources"
11 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 |                    GNU LESSER GENERAL PUBLIC LICENSE
  2 |                        Version 3, 29 June 2007
  3 | 
  4 |  Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
  5 |  Everyone is permitted to copy and distribute verbatim copies
  6 |  of this license document, but changing it is not allowed.
  7 | 
  8 | 
  9 |   This version of the GNU Lesser General Public License incorporates
 10 | the terms and conditions of version 3 of the GNU General Public
 11 | License, supplemented by the additional permissions listed below.
 12 | 
 13 |   0. Additional Definitions.
 14 | 
 15 |   As used herein, "this License" refers to version 3 of the GNU Lesser
 16 | General Public License, and the "GNU GPL" refers to version 3 of the GNU
 17 | General Public License.
 18 | 
 19 |   "The Library" refers to a covered work governed by this License,
 20 | other than an Application or a Combined Work as defined below.
 21 | 
 22 |   An "Application" is any work that makes use of an interface provided
 23 | by the Library, but which is not otherwise based on the Library.
 24 | Defining a subclass of a class defined by the Library is deemed a mode
 25 | of using an interface provided by the Library.
 26 | 
 27 |   A "Combined Work" is a work produced by combining or linking an
 28 | Application with the Library.  The particular version of the Library
 29 | with which the Combined Work was made is also called the "Linked
 30 | Version".
 31 | 
 32 |   The "Minimal Corresponding Source" for a Combined Work means the
 33 | Corresponding Source for the Combined Work, excluding any source code
 34 | for portions of the Combined Work that, considered in isolation, are
 35 | based on the Application, and not on the Linked Version.
 36 | 
 37 |   The "Corresponding Application Code" for a Combined Work means the
 38 | object code and/or source code for the Application, including any data
 39 | and utility programs needed for reproducing the Combined Work from the
 40 | Application, but excluding the System Libraries of the Combined Work.
 41 | 
 42 |   1. Exception to Section 3 of the GNU GPL.
 43 | 
 44 |   You may convey a covered work under sections 3 and 4 of this License
 45 | without being bound by section 3 of the GNU GPL.
 46 | 
 47 |   2. Conveying Modified Versions.
 48 | 
 49 |   If you modify a copy of the Library, and, in your modifications, a
 50 | facility refers to a function or data to be supplied by an Application
 51 | that uses the facility (other than as an argument passed when the
 52 | facility is invoked), then you may convey a copy of the modified
 53 | version:
 54 | 
 55 |    a) under this License, provided that you make a good faith effort to
 56 |    ensure that, in the event an Application does not supply the
 57 |    function or data, the facility still operates, and performs
 58 |    whatever part of its purpose remains meaningful, or
 59 | 
 60 |    b) under the GNU GPL, with none of the additional permissions of
 61 |    this License applicable to that copy.
 62 | 
 63 |   3. Object Code Incorporating Material from Library Header Files.
 64 | 
 65 |   The object code form of an Application may incorporate material from
 66 | a header file that is part of the Library.  You may convey such object
 67 | code under terms of your choice, provided that, if the incorporated
 68 | material is not limited to numerical parameters, data structure
 69 | layouts and accessors, or small macros, inline functions and templates
 70 | (ten or fewer lines in length), you do both of the following:
 71 | 
 72 |    a) Give prominent notice with each copy of the object code that the
 73 |    Library is used in it and that the Library and its use are
 74 |    covered by this License.
 75 | 
 76 |    b) Accompany the object code with a copy of the GNU GPL and this license
 77 |    document.
 78 | 
 79 |   4. Combined Works.
 80 | 
 81 |   You may convey a Combined Work under terms of your choice that,
 82 | taken together, effectively do not restrict modification of the
 83 | portions of the Library contained in the Combined Work and reverse
 84 | engineering for debugging such modifications, if you also do each of
 85 | the following:
 86 | 
 87 |    a) Give prominent notice with each copy of the Combined Work that
 88 |    the Library is used in it and that the Library and its use are
 89 |    covered by this License.
 90 | 
 91 |    b) Accompany the Combined Work with a copy of the GNU GPL and this license
 92 |    document.
 93 | 
 94 |    c) For a Combined Work that displays copyright notices during
 95 |    execution, include the copyright notice for the Library among
 96 |    these notices, as well as a reference directing the user to the
 97 |    copies of the GNU GPL and this license document.
 98 | 
 99 |    d) Do one of the following:
100 | 
101 |        0) Convey the Minimal Corresponding Source under the terms of this
102 |        License, and the Corresponding Application Code in a form
103 |        suitable for, and under terms that permit, the user to
104 |        recombine or relink the Application with a modified version of
105 |        the Linked Version to produce a modified Combined Work, in the
106 |        manner specified by section 6 of the GNU GPL for conveying
107 |        Corresponding Source.
108 | 
109 |        1) Use a suitable shared library mechanism for linking with the
110 |        Library.  A suitable mechanism is one that (a) uses at run time
111 |        a copy of the Library already present on the user's computer
112 |        system, and (b) will operate properly with a modified version
113 |        of the Library that is interface-compatible with the Linked
114 |        Version.
115 | 
116 |    e) Provide Installation Information, but only if you would otherwise
117 |    be required to provide such information under section 6 of the
118 |    GNU GPL, and only to the extent that such information is
119 |    necessary to install and execute a modified version of the
120 |    Combined Work produced by recombining or relinking the
121 |    Application with a modified version of the Linked Version. (If
122 |    you use option 4d0, the Installation Information must accompany
123 |    the Minimal Corresponding Source and Corresponding Application
124 |    Code. If you use option 4d1, you must provide the Installation
125 |    Information in the manner specified by section 6 of the GNU GPL
126 |    for conveying Corresponding Source.)
127 | 
128 |   5. Combined Libraries.
129 | 
130 |   You may place library facilities that are a work based on the
131 | Library side by side in a single library together with other library
132 | facilities that are not Applications and are not covered by this
133 | License, and convey such a combined library under terms of your
134 | choice, if you do both of the following:
135 | 
136 |    a) Accompany the combined library with a copy of the same work based
137 |    on the Library, uncombined with any other library facilities,
138 |    conveyed under the terms of this License.
139 | 
140 |    b) Give prominent notice with the combined library that part of it
141 |    is a work based on the Library, and explaining where to find the
142 |    accompanying uncombined form of the same work.
143 | 
144 |   6. Revised Versions of the GNU Lesser General Public License.
145 | 
146 |   The Free Software Foundation may publish revised and/or new versions
147 | of the GNU Lesser General Public License from time to time. Such new
148 | versions will be similar in spirit to the present version, but may
149 | differ in detail to address new problems or concerns.
150 | 
151 |   Each version is given a distinguishing version number. If the
152 | Library as you received it specifies that a certain numbered version
153 | of the GNU Lesser General Public License "or any later version"
154 | applies to it, you have the option of following the terms and
155 | conditions either of that published version or of any later version
156 | published by the Free Software Foundation. If the Library as you
157 | received it does not specify a version number of the GNU Lesser
158 | General Public License, you may choose any version of the GNU Lesser
159 | General Public License ever published by the Free Software Foundation.
160 | 
161 |   If the Library as you received it specifies that a proxy can decide
162 | whether future versions of the GNU Lesser General Public License shall
163 | apply, that proxy's public statement of acceptance of any version is
164 | permanent authorization for you to choose that version for the
165 | Library.
166 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Polish NLP resources
  2 | 
  3 | This repository contains pre-trained models and language resources for Natural Language Processing in Polish created during my research. Some of the models are also available on **[Huggingface Hub](https://huggingface.co/sdadas)**.
  4 | 
  5 | If you'd like to use any of those resources in your research please cite:
  6 | 
  7 | ```bibtex
  8 | @Misc{polish-nlp-resources,
  9 |   author =       {S{\l}awomir Dadas},
 10 |   title =        {A repository of Polish {NLP} resources},
 11 |   howpublished = {Github},
 12 |   year =         {2019},
 13 |   url =          {https://github.com/sdadas/polish-nlp-resources/}
 14 | }
 15 | ```
 16 | 
 17 | ## Contents
 18 | 
 19 | - [Word embeddings](#word-embeddings)
 20 |   - [Word2Vec](#word2vec)
 21 |   - [FastText](#fasttext)
 22 |   - [GloVe](#glove)
 23 |   - [High dimensional word vectors](#high-dimensional-word-vectors)
 24 |   - [Compressed Word2Vec](#compressed-word2vec)
 25 |   - [Wikipedia2Vec](#wikipedia2vec)
 26 | - [Language models](#language-models)
 27 |   - [ELMo](#elmo)
 28 |   - [RoBERTa](#roberta)
 29 |   - [BART](#bart)
 30 |   - [GPT-2](#gpt-2)
 31 |   - [Longformer](#longformer)
 32 | - [Text encoders](#text-encoders)
 33 | - [Machine translation models](#machine-translation-models)
 34 |   - [Convolutional models for Fairseq](#convolutional-models-for-fairseq)
 35 |   - [T5-based models](#t5-based-models)
 36 | - [Fine-tuned models](#fine-tuned-models)
 37 | - [Dictionaries and lexicons](#dictionaries-and-lexicons)
 38 | - [Links to external resources](#links-to-external-resources)
 39 |   - [Repositories of linguistic tools and resources](#repositories-of-linguistic-tools-and-resources)
 40 |   - [Publicly available large Polish text corpora (> 1GB)](#publicly-available-large-polish-text-corpora--1gb)
 41 |   - [Models supporting Polish language](#models-supporting-polish-language)
 42 | 
 43 | 
 44 | ## Word embeddings
 45 | 
 46 | The following section includes pre-trained word embeddings for Polish. Each model was trained on a corpus consisting of Polish Wikipedia dump, Polish books and articles, 1.5 billion tokens at total. 
 47 | 
 48 | ### Word2Vec
 49 | 
 50 | Word2Vec trained with [Gensim](https://radimrehurek.com/gensim/). 100 dimensions, negative sampling, contains lemmatized words with 3 or more ocurrences in the corpus and additionally a set of pre-defined punctuation symbols, all numbers from 0 to 10'000, Polish forenames and lastnames. The archive contains embedding in gensim binary format. Example of usage:
 51 | 
 52 | ```python
 53 | from gensim.models import KeyedVectors
 54 | 
 55 | if __name__ == '__main__':
 56 |     word2vec = KeyedVectors.load("word2vec_100_3_polish.bin")
 57 |     print(word2vec.similar_by_word("bierut"))
 58 |     
 59 | # [('cyrankiewicz', 0.818274736404419), ('gomułka', 0.7967918515205383), ('raczkiewicz', 0.7757788896560669), ('jaruzelski', 0.7737460732460022), ('pużak', 0.7667238712310791)]
 60 | ```
 61 | 
 62 | [Download (GitHub)](https://github.com/sdadas/polish-nlp-resources/releases/download/v1.0/word2vec.zip)
 63 | 
 64 | ### FastText
 65 | 
 66 | FastText trained with [Gensim](https://radimrehurek.com/gensim/). Vocabulary and dimensionality is identical to Word2Vec model. The archive contains embedding in gensim binary format. Example of usage:
 67 | 
 68 | ```python
 69 | from gensim.models import KeyedVectors
 70 | 
 71 | if __name__ == '__main__':
 72 |     word2vec = KeyedVectors.load("fasttext_100_3_polish.bin")
 73 |     print(word2vec.similar_by_word("bierut"))
 74 |     
 75 | # [('bieruty', 0.9290274381637573), ('gierut', 0.8921363353729248), ('bieruta', 0.8906412124633789), ('bierutow', 0.8795544505119324), ('bierutowsko', 0.839280366897583)]
 76 | ```
 77 | 
 78 | [Download (OneDrive)](https://witedupl-my.sharepoint.com/:u:/g/personal/dadass_wit_edu_pl/EeoDV_cq0KtAupMa0E9iIlEBMTVvw4OzABbPuAxUMFD8EA?e=5naF5z)
 79 | 
 80 | ### GloVe
 81 | 
 82 | Global Vectors for Word Representation (GloVe) trained using the reference implementation from Stanford NLP. 100 dimensions, contains lemmatized words with 3 or more ocurrences in the corpus. Example of usage:
 83 | 
 84 | ```python
 85 | from gensim.models import KeyedVectors
 86 | 
 87 | if __name__ == '__main__':
 88 |     word2vec = KeyedVectors.load_word2vec_format("glove_100_3_polish.txt")
 89 |     print(word2vec.similar_by_word("bierut"))
 90 |     
 91 | # [('cyrankiewicz', 0.8335597515106201), ('gomułka', 0.7793121337890625), ('bieruta', 0.7118682861328125), ('jaruzelski', 0.6743760108947754), ('minc', 0.6692837476730347)]
 92 | ```
 93 | 
 94 | [Download (GitHub)](https://github.com/sdadas/polish-nlp-resources/releases/download/v1.0/glove.zip)
 95 | 
 96 | ### High dimensional word vectors
 97 | Pre-trained vectors using the same vocabulary as above but with higher dimensionality. These vectors are more suitable for representing larger chunks of text such as sentences or documents using simple word aggregation methods (averaging, max pooling etc.) as more semantic information is preserved this way.
 98 | 
 99 | **GloVe** - **300d:** [Part 1 (GitHub)](https://github.com/sdadas/polish-nlp-resources/releases/download/glove-hd/glove_300_3_polish.zip.001), **500d:** [Part 1 (GitHub)](https://github.com/sdadas/polish-nlp-resources/releases/download/glove-hd/glove_500_3_polish.zip.001) [Part 2 (GitHub)](https://github.com/sdadas/polish-nlp-resources/releases/download/glove-hd/glove_500_3_polish.zip.002), **800d:** [Part 1 (GitHub)](https://github.com/sdadas/polish-nlp-resources/releases/download/glove-hd/glove_800_3_polish.zip.001) [Part 2 (GitHub)](https://github.com/sdadas/polish-nlp-resources/releases/download/glove-hd/glove_800_3_polish.zip.002) [Part 3 (GitHub)](https://github.com/sdadas/polish-nlp-resources/releases/download/glove-hd/glove_800_3_polish.zip.003) 
100 | 
101 | **Word2Vec** - [300d (OneDrive)](https://witedupl-my.sharepoint.com/:u:/g/personal/dadass_wit_edu_pl/EQ7QA6PkPupBtZYyP8kaafMB0z9FdHfME7kxm_tcRWh9hA?e=RGekMu), 
102 | [500d (OneDrive)](https://witedupl-my.sharepoint.com/:u:/g/personal/dadass_wit_edu_pl/EfBT7WvY7eVHuQZIuUpnJzsBXTN2L896ldVvhRBiCmUH_A?e=F0LGVc), [800d (OneDrive)](https://witedupl-my.sharepoint.com/:u:/g/personal/dadass_wit_edu_pl/Eda0vUkicNpNk4oMf2eoLZkBTJbMTmymKqqZ_yoEXw98TA?e=rKu4pP)
103 | 
104 | **FastText** - [300d (OneDrive)](https://witedupl-my.sharepoint.com/:u:/g/personal/dadass_wit_edu_pl/ESj0xTXmTK5Jhiocp5Oxt7IBUUmaEjczFWvQn17c2QNgcg?e=9aory9), 
105 | [500d (OneDrive)](https://witedupl-my.sharepoint.com/:u:/g/personal/dadass_wit_edu_pl/EViVRrF38fJMv1ihX2ARDNEBFFOE-MLSDHCcMG49IqEcCQ?e=g36NJ7), [800d (OneDrive)](https://witedupl-my.sharepoint.com/:u:/g/personal/dadass_wit_edu_pl/ESHkEJ7jLGlHoAIKiYdL0NkB_Z8VJyFcEHx3TpE7L1kNFg?e=FkoBgA)
106 | 
107 | ### Compressed Word2Vec
108 | 
109 | This is a compressed version of the Word2Vec embedding model described above. For compression, we used the method described in [Compressing Word Embeddings via Deep Compositional Code Learning](https://arxiv.org/abs/1711.01068) by Shu and Nakayama. Compressed embeddings are suited for deployment on storage-poor devices such as mobile phones. The model weights 38MB, only 4.4% size of the original Word2Vec embeddings. Although the authors of the article claimed that compressing with their method doesn't hurt model performance, we noticed a slight but acceptable drop of accuracy when using compressed version of embeddings. Sample decoder class with usage:
110 | 
111 | ```python
112 | import gzip
113 | from typing import Dict, Callable
114 | import numpy as np
115 | 
116 | class CompressedEmbedding(object):
117 | 
118 |     def __init__(self, vocab_path: str, embedding_path: str, to_lowercase: bool=True):
119 |         self.vocab_path: str = vocab_path
120 |         self.embedding_path: str = embedding_path
121 |         self.to_lower: bool = to_lowercase
122 |         self.vocab: Dict[str, int] = self.__load_vocab(vocab_path)
123 |         embedding = np.load(embedding_path)
124 |         self.codes: np.ndarray = embedding[embedding.files[0]]
125 |         self.codebook: np.ndarray = embedding[embedding.files[1]]
126 |         self.m = self.codes.shape[1]
127 |         self.k = int(self.codebook.shape[0] / self.m)
128 |         self.dim: int = self.codebook.shape[1]
129 | 
130 |     def __load_vocab(self, vocab_path: str) -> Dict[str, int]:
131 |         open_func: Callable = gzip.open if vocab_path.endswith(".gz") else open
132 |         with open_func(vocab_path, "rt", encoding="utf-8") as input_file:
133 |             return {line.strip():idx for idx, line in enumerate(input_file)}
134 | 
135 |     def vocab_vector(self, word: str):
136 |         if word == "<pad>": return np.zeros(self.dim)
137 |         val: str = word.lower() if self.to_lower else word
138 |         index: int = self.vocab.get(val, self.vocab["<unk>"])
139 |         codes = self.codes[index]
140 |         code_indices = np.array([idx * self.k + offset for idx, offset in enumerate(np.nditer(codes))])
141 |         return np.sum(self.codebook[code_indices], axis=0)
142 | 
143 | if __name__ == '__main__':
144 |     word2vec = CompressedEmbedding("word2vec_100_3.vocab.gz", "word2vec_100_3.compressed.npz")
145 |     print(word2vec.vocab_vector("bierut"))
146 | ```
147 | 
148 | [Download (GitHub)](https://github.com/sdadas/polish-nlp-resources/releases/download/v1.0/compressed.zip)
149 | 
150 | ### Wikipedia2Vec
151 | [Wikipedia2Vec](https://wikipedia2vec.github.io/) is a toolkit for learning joint representations of words and Wikipedia entities. We share Polish embeddings learned using a modified version of the library in which we added lemmatization and fixed some issues regarding parsing wiki dumps for languages other than English. Embedding models are available in sizes from 100 to 800 dimensions. A simple example: 
152 | 
153 | ```python
154 | from wikipedia2vec import Wikipedia2Vec
155 | 
156 | wiki2vec = Wikipedia2Vec.load("wiki2vec-plwiki-100.bin")
157 | print(wiki2vec.most_similar(wiki2vec.get_entity("Bolesław Bierut")))
158 | # (<Entity Bolesław Bierut>, 1.0), (<Word bierut>, 0.75790733), (<Word gomułka>, 0.7276504),
159 | # (<Entity Krajowa Rada Narodowa>, 0.7081445), (<Entity Władysław Gomułka>, 0.7043667) [...]
160 | ```
161 | 
162 | Download embeddings: [100d](https://witedupl-my.sharepoint.com/:u:/g/personal/dadass_wit_edu_pl/Ee_DFnilujxCiHmfUjRsqzUBBPst44eyCtmAnpB-Tq-ykw?e=ZzWIuf), [300d](https://witedupl-my.sharepoint.com/:u:/g/personal/dadass_wit_edu_pl/EWBzb1a89YJJku3vzFPObTUB5wTNaqIsznKT2AaKSP6xDQ?e=hhxSf0), [500d](https://witedupl-my.sharepoint.com/:u:/g/personal/dadass_wit_edu_pl/ERYsJUEo_DlKpUBBV_A86-0BrHDB88TJGr--WtzbKxhfJg?e=BPjH80), [800d](https://witedupl-my.sharepoint.com/:u:/g/personal/dadass_wit_edu_pl/EQJT8QyrMLFEqtC_1ZdOI54BzOQXIlvoQIbhra9EuIoV7w?e=SLfQrI). 
163 | 
164 | ## Language models
165 | 
166 | ### ELMo
167 | 
168 | Embeddings from Language Models (ELMo) is a contextual embedding presented in [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) by Peters et al. Sample usage with PyTorch below, for a more detailed instructions for integrating ELMo with your model please refer to the official repositories [github.com/allenai/bilm-tf](https://github.com/allenai/bilm-tf) (Tensorflow) and [github.com/allenai/allennlp](https://github.com/allenai/allennlp) (PyTorch).
169 | 
170 | ```python
171 | from allennlp.commands.elmo import ElmoEmbedder
172 | 
173 | elmo = ElmoEmbedder("options.json", "weights.hdf5")
174 | print(elmo.embed_sentence(["Zażółcić", "gęślą", "jaźń"]))
175 | ```
176 | 
177 | [Download (GitHub)](https://github.com/sdadas/polish-nlp-resources/releases/download/v1.0/elmo.zip)
178 | 
179 | ### RoBERTa
180 | 
181 | Language model for Polish based on popular transformer architecture. We provide weights for improved BERT language model introduced in [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692.pdf). We provide two RoBERTa models for Polish - base and large model. A summary of pre-training parameters for each model is shown in the table below. We release two version of the each model: one in the [Fairseq](https://github.com/pytorch/fairseq) format and the other in the [HuggingFace Transformers](https://github.com/huggingface/transformers) format. More information about the models can be found in a [separate repository](https://github.com/sdadas/polish-roberta).
182 | 
183 | <table>
184 | <thead>
185 | <th>Model</th>
186 | <th>L / H / A*</th>
187 | <th>Batch size</th>
188 | <th>Update steps</th>
189 | <th>Corpus size</th>
190 | <th>Fairseq</th>
191 | <th>Transformers</th>
192 | </thead>
193 | <tr>
194 |   <td>RoBERTa&nbsp;(base)</td>
195 |   <td>12&nbsp;/&nbsp;768&nbsp;/&nbsp;12</td>
196 |   <td>8k</td>
197 |   <td>125k</td>
198 |   <td>~20GB</td>
199 |   <td>
200 |   <a href="https://github.com/sdadas/polish-roberta/releases/download/models/roberta_base_fairseq.zip">v0.9.0</a>
201 |   </td>
202 |   <td>
203 |   <a href="https://github.com/sdadas/polish-roberta/releases/download/models-transformers-v3.4.0/roberta_base_transformers.zip">v3.4</a>
204 |   </td>
205 | </tr>
206 | <tr>
207 |   <td>RoBERTa&#8209;v2&nbsp;(base)</td>
208 |   <td>12&nbsp;/&nbsp;768&nbsp;/&nbsp;12</td>
209 |   <td>8k</td>
210 |   <td>400k</td>
211 |   <td>~20GB</td>
212 |   <td>
213 |   <a href="https://github.com/sdadas/polish-roberta/releases/download/models-v2/roberta_base_fairseq.zip">v0.10.1</a>
214 |   </td>
215 |   <td>
216 |   <a href="https://github.com/sdadas/polish-roberta/releases/download/models-v2/roberta_base_transformers.zip">v4.4</a>
217 |   </td>
218 | </tr>
219 | <tr>
220 |   <td>RoBERTa&nbsp;(large)</td>
221 |   <td>24&nbsp;/&nbsp;1024&nbsp;/&nbsp;16</td>
222 |   <td>30k</td>
223 |   <td>50k</td>
224 |   <td>~135GB</td>
225 |   <td>
226 |   <a href="https://github.com/sdadas/polish-roberta/releases/download/models/roberta_large_fairseq.zip">v0.9.0</a>
227 |   </td>
228 |   <td>
229 |   <a href="https://github.com/sdadas/polish-roberta/releases/download/models-transformers-v3.4.0/roberta_large_transformers.zip">v3.4</a>
230 |   </td>
231 | </tr>
232 | <tr>
233 |   <td>RoBERTa&#8209;v2&nbsp;(large)</td>
234 |   <td>24&nbsp;/&nbsp;1024&nbsp;/&nbsp;16</td>
235 |   <td>2k</td>
236 |   <td>400k</td>
237 |   <td>~200GB</td>
238 |   <td>
239 |   <a href="https://github.com/sdadas/polish-roberta/releases/download/models-v2/roberta_large_fairseq.zip">v0.10.2</a>
240 |   </td>
241 |   <td>
242 |   <a href="https://github.com/sdadas/polish-roberta/releases/download/models-v2/roberta_large_transformers.zip">v4.14</a>
243 |   </td>
244 | </tr>
245 |   </tr>
246 |   <tr>
247 |   <td>DistilRoBERTa</td>
248 |   <td>6&nbsp;/&nbsp;768&nbsp;/&nbsp;12</td>
249 |   <td>1k</td>
250 |   <td>10ep.</td>
251 |   <td>~20GB</td>
252 |   <td>
253 |   n/a
254 |   </td>
255 |   <td>
256 |   <a href="https://github.com/sdadas/polish-roberta/releases/download/models-v2/distilroberta_transformers.zip">v4.13</a>
257 |   </td>
258 | </tr>
259 | </table>
260 | 
261 | \* L - the number of encoder blocks, H - hidden size, A - the number of attention heads <br/>
262 | 
263 | Example in Fairseq:
264 | 
265 | ```python
266 | import os
267 | from fairseq.models.roberta import RobertaModel, RobertaHubInterface
268 | from fairseq import hub_utils
269 | 
270 | model_path = "roberta_large_fairseq"
271 | loaded = hub_utils.from_pretrained(
272 |     model_name_or_path=model_path,
273 |     data_name_or_path=model_path,
274 |     bpe="sentencepiece",
275 |     sentencepiece_vocab=os.path.join(model_path, "sentencepiece.bpe.model"),
276 |     load_checkpoint_heads=True,
277 |     archive_map=RobertaModel.hub_models(),
278 |     cpu=True
279 | )
280 | roberta = RobertaHubInterface(loaded['args'], loaded['task'], loaded['models'][0])
281 | roberta.eval()
282 | roberta.fill_mask('Druga wojna światowa zakończyła się w <mask> roku.', topk=1)
283 | roberta.fill_mask('Ludzie najbardziej boją się <mask>.', topk=1)
284 | #[('Druga wojna światowa zakończyła się w 1945 roku.', 0.9345270991325378, ' 1945')]
285 | #[('Ludzie najbardziej boją się śmierci.', 0.14140743017196655, ' śmierci')]
286 | ```
287 | 
288 | It is recommended to use the above models, but it is still possible to download [our old model](https://github.com/sdadas/polish-nlp-resources/releases/download/roberta/roberta.zip), trained on smaller batch size (2K) and smaller corpus (15GB).
289 | 
290 | ### BART
291 | 
292 | BART is a transformer-based sequence to sequence model trained with a denoising objective. Can be used for fine-tuning on prediction tasks, just like regular BERT, as well as various text generation tasks such as machine translation, summarization, paraphrasing etc. We provide a Polish version of BART base model, trained on a large corpus of texts extracted from Common Crawl (200+ GB). More information on the BART architecture can be found in [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461). Example in HugginFace Transformers:
293 | 
294 | ```python
295 | import os
296 | from transformers import BartForConditionalGeneration, PreTrainedTokenizerFast
297 | 
298 | model_dir = "bart_base_transformers"
299 | tok = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
300 | model = BartForConditionalGeneration.from_pretrained(model_dir)
301 | sent = "Druga<mask>światowa zakończyła się w<mask>roku kapitulacją hitlerowskich<mask>"
302 | batch = tok(sent, return_tensors='pt')
303 | generated_ids = model.generate(batch['input_ids'])
304 | print(tok.batch_decode(generated_ids, skip_special_tokens=True))
305 | # ['Druga wojna światowa zakończyła się w 1945 roku kapitulacją hitlerowskich Niemiec.']
306 | ```
307 | 
308 | Download for [Fairseq v0.10](https://github.com/sdadas/polish-nlp-resources/releases/download/bart-base/bart_base_fairseq.zip) or [HuggingFace Transformers v4.0](https://github.com/sdadas/polish-nlp-resources/releases/download/bart-base/bart_base_transformers.zip).
309 | 
310 | ### GPT-2
311 | 
312 | GPT-2 is a unidirectional transformer-based language model trained with an auto-regressive objective, originally introduced in the [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) paper. The original English GPT-2 was released in four sizes differing by the number of parameters: Small (112M), Medium (345M), Large (774M), XL (1.5B). 
313 | 
314 | #### Models for Huggingface Transformers
315 | 
316 | We provide Polish GPT-2 models for Huggingface Transformers. The models have been trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) library and then converted to the Huggingface format. The released checkpoints support longer contexts than the original GPT-2 by OpenAI. Small and medium models support up to 2048 tokens, twice as many as GPT-2 models and the same as GPT-3. Large and XL models support up to 1536 tokens. Example in Transformers:
317 | 
318 | ```python
319 | from transformers import pipeline
320 | 
321 | generator = pipeline("text-generation",  model="sdadas/polish-gpt2-medium")
322 | results = generator("Policja skontrolowała trzeźwość kierowców",
323 |   max_new_tokens=1024,  do_sample=True, repetition_penalty = 1.2,
324 |   num_return_sequences=1, num_beams=1,  temperature=0.95,top_k=50, top_p=0.95
325 | )
326 | print(results[0].get("generated_text"))
327 | # Policja skontrolowała trzeźwość kierowców. Teraz policjanci przypominają kierowcom o zachowaniu 
328 | # bezpiecznej odległości i środkach ostrożności związanych z pandemią. - Kierujący po spożyciu 
329 | # alkoholu są bardziej wyczuleni na innych uczestników ruchu drogowego oraz mają większą skłonność 
330 | # do brawury i ryzykownego zachowania zwłaszcza wobec pieszych. Dodatkowo nie zawsze pamiętają oni 
331 | # zasady obowiązujących u nas przepisów prawa regulujących kwestie dotyczące odpowiedzialności [...]
332 | ```
333 | [Small](https://huggingface.co/sdadas/polish-gpt2-small), [Medium](https://huggingface.co/sdadas/polish-gpt2-medium), [Large](https://huggingface.co/sdadas/polish-gpt2-large), and [XL](https://huggingface.co/sdadas/polish-gpt2-xl) models are available on the Huggingface Hub
334 | 
335 | #### Models for Fairseq
336 | 
337 | We provide Polish versions of the medium and large GPT-2 models trained using Fairseq library. Example in Fairseq:
338 | 
339 | ```python
340 | import os
341 | from fairseq import hub_utils
342 | from fairseq.models.transformer_lm import TransformerLanguageModel
343 | 
344 | model_dir = "gpt2_medium_fairseq"
345 | loaded = hub_utils.from_pretrained(
346 |     model_name_or_path=model_dir,
347 |     checkpoint_file="model.pt",
348 |     data_name_or_path=model_dir,
349 |     bpe="hf_byte_bpe",
350 |     bpe_merges=os.path.join(model_dir, "merges.txt"),
351 |     bpe_vocab=os.path.join(model_dir, "vocab.json"),
352 |     load_checkpoint_heads=True,
353 |     archive_map=TransformerLanguageModel.hub_models()
354 | )
355 | model = hub_utils.GeneratorHubInterface(loaded["args"], loaded["task"], loaded["models"])
356 | model.eval()
357 | result = model.sample(
358 |     ["Policja skontrolowała trzeźwość kierowców"],
359 |     beam=5, sampling=True, sampling_topk=50, sampling_topp=0.95,
360 |     temperature=0.95, max_len_a=1, max_len_b=100, no_repeat_ngram_size=3
361 | )
362 | print(result[0])
363 | # Policja skontrolowała trzeźwość kierowców pojazdów. Wszystko działo się na drodze gminnej, między Radwanowem 
364 | # a Boguchowem. - Około godziny 12.30 do naszego komisariatu zgłosił się kierowca, którego zaniepokoiło 
365 | # zachowanie kierującego w chwili wjazdu na tą drogę. Prawdopodobnie nie miał zapiętych pasów - informuje st. asp. 
366 | # Anna Węgrzyniak z policji w Brzezinach. Okazało się, że kierujący był pod wpływem alkoholu. [...]
367 | ```
368 | 
369 | Download [medium](https://github.com/sdadas/polish-nlp-resources/releases/download/gpt-2/gpt2_medium_fairseq.7z) or [large](https://github.com/sdadas/polish-nlp-resources/releases/download/gpt-2/gpt2_large_fairseq.7z) model for Fairseq v0.10.
370 | 
371 | ### Longformer
372 | 
373 | One of the main constraints of standard Transformer architectures is the limitation on the number of input tokens. There are several known models that allow processing of long documents, one of the popular ones being Longformer, introduced in the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150). We provide base and large versions of Polish Longformer model. The models were initialized with Polish RoBERTa (v2) weights and then fine-tuned on a corpus of long documents, ranging from 1024 to 4096 tokens. Example in Huggingface Transformers:
374 | 
375 | ```python
376 | from transformers import pipeline
377 | fill_mask = pipeline('fill-mask', model='sdadas/polish-longformer-base-4096')
378 | fill_mask('Stolica oraz największe miasto Francji to <mask>.')
379 | ```
380 | [Base](https://huggingface.co/sdadas/polish-longformer-base-4096) and [large](https://huggingface.co/sdadas/polish-longformer-large-4096) models are available on the Huggingface Hub
381 | 
382 | ## Text encoders
383 | The purpose of text encoders is to produce a fixed-length vector representation for chunks of text, such as sentences or paragraphs. These models are used in semantic search, question answering, document clustering, dataset augmentation, plagiarism detection, and other tasks which involve measuring semantic similarity or relatedness between text passages.
384 | 
385 | ### Paraphrase mining and semantic textual similarity
386 | We share two models based on the [Sentence-Transformers](https://www.sbert.net/) library, trained using distillation method described in the paper [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813). A corpus of 100 million parallel Polish-English sentence pairs from the [OPUS](https://opus.nlpl.eu/) project was used to train the models. You can download them from the Hugginface Hub using the links below.
387 | 
388 | <table>
389 | <thead>
390 | <th>Student model</th>
391 | <th>Teacher model</th>
392 | <th>Download</th>
393 | </thead>
394 | <tr>
395 |   <td>polish-roberta-base-v2</td>
396 |   <td>paraphrase-distilroberta-base-v2</td>
397 |   <td><a href="https://huggingface.co/sdadas/st-polish-paraphrase-from-distilroberta">st-polish-paraphrase-from-distilroberta</a></td>
398 | </tr>
399 | <tr>
400 |   <td>polish-roberta-base-v2</td>
401 |   <td>paraphrase-mpnet-base-v2</td>
402 |   <td><a href="https://huggingface.co/sdadas/st-polish-paraphrase-from-mpnet">st-polish-paraphrase-from-mpnet</a></td>
403 | </tr>
404 | </table>
405 | 
406 | A simple example in Sentence-Transformers library:
407 | 
408 | ```python
409 | from sentence_transformers import SentenceTransformer
410 | from sentence_transformers.util import cos_sim
411 | 
412 | sentences = ["Bardzo lubię jeść słodycze.", "Uwielbiam zajadać się słodkościami."]
413 | model = SentenceTransformer("sdadas/st-polish-paraphrase-from-mpnet")
414 | results = model.encode(sentences, convert_to_tensor=True, show_progress_bar=False)
415 | print(cos_sim(results[0], results[1]))
416 | # tensor([[0.9794]], device='cuda:0')
417 | ```
418 | ### MMLW 
419 | MMLW (muszę mieć lepszą wiadomość) is a set of text encoders trained using [multilingual knowledge distillation method](https://arxiv.org/abs/2004.09813) on a diverse corpus of 60 million Polish-English text pairs, which included both sentence and paragraph aligned translations. The encoders are available in [Sentence-Transformers](https://www.sbert.net/) format. We used a two-step process to train the models. In the first step, the encoders were initialized with Polish RoBERTa and multilingual E5 checkpoints, and then distilled utilising English BGE as a teacher model. The resulting models from the distillation step can be used as general-purpose embeddings with applications in various tasks such as text similarity, document clustering, or fuzzy deduplication. The second step involved fine-tuning the obtained models on [Polish MS MARCO](https://huggingface.co/datasets/clarin-knext/msmarco-pl) dataset with contrastrive loss. The second stage models are adapted specifically for information retrieval tasks.
420 | 
421 | We provide a total of ten text encoders, five distilled and five fine-tuned for information retrieval. In the table below, we present the details of the released models.
422 | 
423 | <table>
424 | <thead>
425 | <tr>
426 |   <th colspan="2">Base models</th>
427 |   <th colspan="2">Stage 1: Distilled models</th>
428 |   <th colspan="2">Stage 2: Retrieval models</th>
429 | </tr>
430 | <tr>
431 |   <th>Student model</th>
432 |   <th>Teacher model</th>
433 |   <th><a href="https://huggingface.co/spaces/mteb/leaderboard">PL-MTEB</a><br/>Score</th>
434 |   <th>Download</th>
435 |   <th><a href="https://huggingface.co/spaces/sdadas/pirb">PIRB</a><br/>NDCG@10</th>
436 |   <th>Download</th>
437 | </tr>
438 | </thead>
439 | <tr>
440 |   <td colspan="6"><strong>Encoders based on Polish RoBERTa</strong></td>
441 | </tr>
442 | <tr>
443 |   <td><a href="https://huggingface.co/sdadas/polish-roberta-base-v2">polish-roberta-base-v2</a></td>
444 |   <td><a href="https://huggingface.co/BAAI/bge-base-en">bge-base-en</a></td>
445 |   <td>61.05</td>
446 |   <td><a href="https://huggingface.co/sdadas/mmlw-roberta-base">mmlw-roberta-base</a></td>
447 |   <td>56.38</td>
448 |   <td><a href="https://huggingface.co/sdadas/mmlw-retrieval-roberta-base">mmlw-retrieval-roberta-base</a></td>
449 | </tr>
450 | <tr>
451 |   <td><a href="https://huggingface.co/sdadas/polish-roberta-large-v2">polish-roberta-large-v2</a></td>
452 |   <td><a href="https://huggingface.co/BAAI/bge-large-en">bge-large-en</a></td>
453 |   <td>63.23</td>
454 |   <td><a href="https://huggingface.co/sdadas/mmlw-roberta-large">mmlw-roberta-large</a></td>
455 |   <td>58.46</td>
456 |   <td><a href="https://huggingface.co/sdadas/mmlw-retrieval-roberta-large">mmlw-retrieval-roberta-large</a></td>
457 | </tr>
458 | <tr>
459 |   <td colspan="6"><strong>Encoders based on Multilingual E5</strong></td>
460 | </tr>
461 | <tr>
462 |   <td><a href="https://huggingface.co/intfloat/multilingual-e5-small">multilingual-e5-small</a></td>
463 |   <td><a href="https://huggingface.co/BAAI/bge-small-en">bge-small-en</a></td>
464 |   <td>55.84</td>
465 |   <td><a href="https://huggingface.co/sdadas/mmlw-e5-small">mmlw-e5-small</a></td>
466 |   <td>52.34</td>
467 |   <td><a href="https://huggingface.co/sdadas/mmlw-retrieval-e5-small">mmlw-retrieval-e5-small</a></td>
468 | </tr>
469 | <tr>
470 |   <td><a href="https://huggingface.co/intfloat/multilingual-e5-base">multilingual-e5-base</a></td>
471 |   <td><a href="https://huggingface.co/BAAI/bge-base-en">bge-base-en</a></td>
472 |   <td>59.71</td>
473 |   <td><a href="https://huggingface.co/sdadas/mmlw-e5-base">mmlw-e5-base</a></td>
474 |   <td>56.09</td>
475 |   <td><a href="https://huggingface.co/sdadas/mmlw-retrieval-e5-base">mmlw-retrieval-e5-base</a></td>
476 | </tr>
477 | <tr>
478 |   <td><a href="https://huggingface.co/intfloat/multilingual-e5-large">multilingual-e5-large</a></td>
479 |   <td><a href="https://huggingface.co/BAAI/bge-large-en">bge-large-en</a></td>
480 |   <td>61.17</td>
481 |   <td><a href="https://huggingface.co/sdadas/mmlw-e5-large">mmlw-e5-large</a></td>
482 |   <td>58.30</td>
483 |   <td><a href="https://huggingface.co/sdadas/mmlw-retrieval-e5-large">mmlw-retrieval-e5-large</a></td>
484 | </tr>
485 | </table>
486 | 
487 | Please note that the developed models require the use of specific prefixes and suffixes when encoding texts. For RoBERTa-based encoders, each query should be preceded by the prefix "zapytanie: ", and no prefix is needed for passages. For E5-based models, queries should be prefixed with "query: " and passages with "passage: ". An example of how to use the models: 
488 | 
489 | ```python
490 | from sentence_transformers import SentenceTransformer
491 | from sentence_transformers.util import cos_sim
492 | 
493 | query_prefix = "zapytanie: "      # "zapytanie: " for roberta, "query: " for e5
494 | answer_prefix = ""                # empty for roberta, "passage: " for e5
495 | queries = [query_prefix + "Jak dożyć 100 lat?"]
496 | answers = [
497 |     answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
498 |     answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
499 |     answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
500 | ]
501 | model = SentenceTransformer("sdadas/mmlw-retrieval-roberta-base")
502 | queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
503 | answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
504 | 
505 | best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
506 | print(answers[best_answer])
507 | # Trzeba zdrowo się odżywiać i uprawiać sport.
508 | ```
509 | 
510 | ## Machine translation models
511 | 
512 | This section includes pre-trained machine translation models.
513 | 
514 | ### Convolutional models for Fairseq
515 | 
516 | We provide Polish-English and English-Polish convolutional neural machine translation models trained using [Fairseq](https://github.com/pytorch/fairseq) sequence modeling toolkit. Both models were trained on a parallel corpus of more than 40 million sentence pairs taken from [Opus](http://opus.nlpl.eu/) collection. Example of usage (`fairseq`, `sacremoses` and `subword-nmt` python packages are required to run this example):
517 | 
518 | ```python
519 | from fairseq.models import BaseFairseqModel
520 | 
521 | model_path = "/polish-english/"
522 | model = BaseFairseqModel.from_pretrained(
523 |     model_name_or_path=model_path,
524 |     checkpoint_file="checkpoint_best.pt",
525 |     data_name_or_path=model_path,
526 |     tokenizer="moses",
527 |     bpe="subword_nmt",
528 |     bpe_codes="code",
529 |     cpu=True
530 | )
531 | print(model.translate(sentence="Zespół astronomów odkrył w konstelacji Panny niezwykłą planetę.", beam=5))
532 | # A team of astronomers discovered an extraordinary planet in the constellation of Virgo.
533 | ```
534 | 
535 | **Polish-English convolutional model:** [Download (GitHub)](https://github.com/sdadas/polish-nlp-resources/releases/download/nmt-models-conv/polish-english-conv.zip) \
536 | **English-Polish convolutional model:** [Download (GitHub)](https://github.com/sdadas/polish-nlp-resources/releases/download/nmt-models-conv/english-polish-conv.zip)
537 | 
538 | ### T5-based models
539 | 
540 | We share MT5 and Flan-T5 models fine-tuned for Polish-English and English-Polish translation. The models were trained on 70 million sentence pairs from [OPUS](http://opus.nlpl.eu/). You can download them from the Hugginface Hub using the links below. An example of how to use the models:
541 | 
542 | ```python
543 | from transformers import pipeline
544 | generator = pipeline("translation", model="sdadas/flan-t5-base-translator-en-pl")
545 | sentence = "A team of astronomers discovered an extraordinary planet in the constellation of Virgo."
546 | print(generator(sentence, max_length=512))
547 | # [{'translation_text': 'Zespół astronomów odkrył niezwykłą planetę w gwiazdozbiorze Panny.'}]
548 | ```
549 | 
550 | The following models are available on the Huggingface Hub: [mt5-base-translator-en-pl](https://huggingface.co/sdadas/mt5-base-translator-en-pl), [mt5-base-translator-pl-en](https://huggingface.co/sdadas/mt5-base-translator-pl-en), [flan-t5-base-translator-en-pl](https://huggingface.co/sdadas/flan-t5-base-translator-en-pl)
551 | 
552 | ## Fine-tuned models
553 | 
554 | ### ByT5-text-correction
555 | 
556 | A small multilingual utility model intended for simple text correction. It is designed to improve the quality of texts from the web, often lacking punctuation or proper word capitalization. The model was trained to perform three types of corrections: restoring punctuation in sentences, restoring word capitalization, and restoring diacritical marks for languages that include them.
557 | 
558 | The following languages are supported: Belarusian (be), Danish (da), German (de), Greek (el), English (en), Spanish (es), French (fr), Italian (it), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Slovak (sk), Swedish (sv), Ukrainian (uk). The model takes as input a sentence preceded by a language code prefix. For example:
559 | 
560 | ```python
561 | from transformers import pipeline
562 | generator = pipeline("text2text-generation", model="sdadas/byt5-text-correction")
563 | sentences = [
564 |     "<pl> ciekaw jestem na co licza onuce stawiajace na sykulskiego w nadziei na zwrot ku rosji",
565 |     "<de> die frage die sich die europäer stellen müssen lautet ist es in unserem interesse die krise auf taiwan zu beschleunigen",
566 |     "<ru> при своём рождении 26 августа 1910 года тереза получила имя агнес бояджиу"
567 | ]
568 | generator(sentences, max_length=512)
569 | # Ciekaw jestem na co liczą onuce stawiające na Sykulskiego w nadziei na zwrot ku Rosji.
570 | # Die Frage, die sich die Europäer stellen müssen, lautet: Ist es in unserem Interesse, die Krise auf Taiwan zu beschleunigen?
571 | # При своём рождении 26 августа 1910 года Тереза получила имя Агнес Бояджиу.
572 | ```
573 | 
574 | The model is available on the Huggingface Hub: [byt5-text-correction](https://huggingface.co/sdadas/byt5-text-correction)
575 | 
576 | ### Text ranking models
577 | 
578 | We provide a set of text ranking models than can be used in the reranking phase of retrieval augmented generation (RAG) pipelines. Our goal was to build efficient models that combine high accuracy with relatively low computational complexity. We employed Polish RoBERTa language models, fine-tuning them for text ranking task on a large dataset consisting of 1.4 million queries and 10 million documents. The models were trained using two knowledge distillation methods: a standard technique based on mean squared loss (MSE) and the RankNet algorithm that enforces sorting lists of documents according to their relevance to the query. The RankNet metod has proven to be more effective. Below is a summary of the released models:
579 | 
580 | <table>
581 | <thead>
582 | <th>Model</th>
583 | <th>Parameters</th>
584 | <th>Training method</th>
585 | <th><a href="https://huggingface.co/spaces/sdadas/pirb">PIRB</a><br/>NDCG@10</th>
586 | </thead>
587 | <tr>
588 |   <td><a href="https://huggingface.co/sdadas/polish-reranker-base-ranknet">polish-reranker-base-ranknet</a></td>
589 |   <td>124M</td>
590 |   <td>RankNet</td>
591 |   <td>60.32</td>
592 | </tr>
593 | <tr>
594 |   <td><a href="https://huggingface.co/sdadas/polish-reranker-large-ranknet">polish-reranker-large-ranknet</a></td>
595 |   <td>435M</td>
596 |   <td>RankNet</td>
597 |   <td>62.65</td>
598 | </tr>
599 | <tr>
600 |   <td><a href="https://huggingface.co/sdadas/polish-reranker-base-mse">polish-reranker-base-mse</a></td>
601 |   <td>124M</td>
602 |   <td>MSE</td>
603 |   <td>57.50</td>
604 | </tr>
605 | <tr>
606 |   <td><a href="https://huggingface.co/sdadas/polish-reranker-large-mse">polish-reranker-large-mse</a></td>
607 |   <td>435M</td>
608 |   <td>MSE</td>
609 |   <td>60.27</td>
610 | </tr>
611 | </table>
612 | 
613 | The models can be used with sentence-transformers library:
614 | 
615 | ```python
616 | from sentence_transformers import CrossEncoder
617 | import torch.nn
618 | 
619 | query = "Jak dożyć 100 lat?"
620 | answers = [
621 |     "Trzeba zdrowo się odżywiać i uprawiać sport.",
622 |     "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
623 |     "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
624 | ]
625 | 
626 | model = CrossEncoder(
627 |     "sdadas/polish-reranker-large-ranknet",
628 |     default_activation_function=torch.nn.Identity(),
629 |     max_length=512,
630 |     device="cuda" if torch.cuda.is_available() else "cpu"
631 | )
632 | pairs = [[query, answer] for answer in answers]
633 | results = model.predict(pairs)
634 | print(results.tolist())
635 | ```
636 | 
637 | 
638 | ## Dictionaries and lexicons
639 | 
640 | ### Polish, English and foreign person names
641 | 
642 | This lexicon contains 346 thousand forenames and lastnames labeled as Polish, English or Foreign (other) crawled from multiple Internet sources.
643 | Possible labels are: `P-N` (Polish forename), `P-L` (Polish lastname), `E-N` (English forename), `E-L` (English lastname), `F` (foreign / other). 
644 | For each word, there is an additional flag indicating whether this name is also used as a common word in Polish (`C` for common, `U` for uncommon).
645 | 
646 | [Download (GitHub)](lexicons/names)
647 | 
648 | ### Named entities extracted from SJP.PL
649 | 
650 | This dictionary consists mostly of the names of settlements, geographical regions, countries, continents and words derived from them (relational adjectives and inhabitant names). 
651 | Besides that, it also contains names of popular brands, companies and common abbreviations of institutions' names.
652 | This resource was created in a semi-automatic way, by extracting the words and their forms from SJP.PL using a set of heuristic rules and then manually filtering out words that weren't named entities.
653 | 
654 | [Download (GitHub)](lexicons/named-sjp)
655 | 
656 | ## Links to external resources
657 | 
658 | ### Repositories of linguistic tools and resources
659 | 
660 | - [Computational Linguistics in Poland - IPI PAN](http://clip.ipipan.waw.pl/LRT)
661 | - [G4.19 Research Group, Wroclaw University of Technology](http://nlp.pwr.wroc.pl/narzedzia-i-zasoby)
662 | - [CLARIN - repository of linguistic resources](https://clarin-pl.eu/dspace/)
663 | - [Gonito.net - evaluation platform with some challenges for Polish](https://gonito.net)
664 | - [Awesome NLP Polish (ksopyla)](https://github.com/ksopyla/awesome-nlp-polish)
665 | - [A catalog of Polish speech corpora (AMU-CAI)](https://github.com/goodmike31/pl-asr-speech-data-survey)
666 | 
667 | ### Publicly available large Polish text corpora (> 1GB)
668 | 
669 | - [OSCAR Corpus (Common Crawl extract)](https://oscar-corpus.com/)
670 | - [CC-100 Web Crawl Data (Common Crawl extract)](http://data.statmt.org/cc-100/)
671 | - [The Polish Parliamentary Corpus](http://clip.ipipan.waw.pl/PPC)
672 | - [Redistributable subcorpora of the National Corpus of Polish](http://zil.ipipan.waw.pl/DistrNKJP)
673 | - [Polish Wikipedia Dumps](https://dumps.wikimedia.org/plwiki/)
674 | - [OPUS Parallel Corpora](https://opus.nlpl.eu/)
675 | - [Corpus from PolEval 2018 Language Modeling Task](http://2018.poleval.pl/index.php/tasks/)
676 | - [C4 and mC4 corpora (contains ~180GB of compressed Polish text)](https://huggingface.co/datasets/allenai/c4)
677 | - [NLLB parallel corpus (1613 language pairs of which 43 include Polish)](https://huggingface.co/datasets/allenai/nllb)
678 | - [CulturaX - a combination of mC4 and OSCAR corpora, cleaned and deduplicated](https://huggingface.co/datasets/uonlp/CulturaX)
679 | - [MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level)](https://huggingface.co/datasets/allenai/MADLAD-400)
680 | - [SpeakLeash - An initiative to build a large corpus for training Polish LLMs](https://github.com/speakleash/speakleash)
681 | 
682 | ### Models supporting Polish language
683 | 
684 | #### Sentence analysis (tokenization, lemmatization, POS tagging etc.)
685 | 
686 | - [SpaCy](https://spacy.io/models/pl) - A popular library for NLP in Python which includes Polish models for sentence analysis.
687 | - [Stanza](https://stanfordnlp.github.io/stanza/) - A collection of neural NLP models for many languages from StndordNLP.
688 | - [Trankit](https://github.com/nlp-uoregon/trankit) - A light-weight transformer-based python toolkit for multilingual natural language processing by the University of Oregon.
689 | - [KRNNT](https://github.com/kwrobel-nlp/krnnt) and [KFTT](https://github.com/kwrobel-nlp/kftt) - Neural morphosyntactic taggers for Polish.
690 | - [Morfeusz](http://morfeusz.sgjp.pl/) - A classic Polish morphosyntactic tagger.
691 | - [Language Tool](https://github.com/languagetool-org/languagetool) - Java-based open source proofreading software for many languages with sentence analysis tools included.
692 | - [Stempel](https://github.com/dzieciou/pystempel) - Algorythmic stemmer for Polish.
693 | - [PoLemma](https://huggingface.co/amu-cai/polemma-large) - plT5-based lemmatizer of named entities and multi-word expressions for Polish, available in [small](https://huggingface.co/amu-cai/polemma-small), [base](https://huggingface.co/amu-cai/polemma-base) and [large](https://huggingface.co/amu-cai/polemma-large) sizes.
694 | 
695 | #### Machine translation
696 | - [Marian-NMT](https://marian-nmt.github.io/) - An efficient C++ based implementation of neural translation models. Many pre-trained models are available, including those supporting Polish: [pl-de](https://huggingface.co/Helsinki-NLP/opus-mt-pl-de), [pl-en](https://huggingface.co/Helsinki-NLP/opus-mt-pl-en), [pl-es](https://huggingface.co/Helsinki-NLP/opus-mt-pl-es), [pl-fr](https://huggingface.co/Helsinki-NLP/opus-mt-pl-fr), [pl-sv](https://huggingface.co/Helsinki-NLP/opus-mt-pl-sv), [de-pl](https://huggingface.co/Helsinki-NLP/opus-mt-de-pl), [es-pl](https://huggingface.co/Helsinki-NLP/opus-mt-es-pl), [fr-pl](https://huggingface.co/Helsinki-NLP/opus-mt-fr-pl).
697 | - [M2M](https://github.com/pytorch/fairseq/tree/master/examples/m2m_100) (2021) - A single massive machine translation architecture supporting direct translation for any pair from the list of 100 languages. Details in the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/pdf/2010.11125.pdf).
698 | - [mBART-50](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) (2021) - A multilingual BART model fine-tuned for machine translation in 50 languages. Three machine translation models were published: [many-to-many](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt), [English-to-many](https://huggingface.co/facebook/mbart-large-50-one-to-many-mmt), and [many-to-English](https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt). For more information see [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401).
699 | - [NLLB](https://github.com/facebookresearch/fairseq/tree/nllb) (2022) - NLLB (No Language Left Behind) is a project by Meta AI aiming to provide machine translation models for over 200 languages. A set of multilingual neural models ranging from 600M to 54.5B parameters is available for download. For more details see [No Language Left Behind: Scaling Human-Centered Machine Translation](https://research.facebook.com/publications/no-language-left-behind/).
700 | - [MADLAD](https://huggingface.co/google/madlad400-10b-mt) (2023) - MADLAD is a series of multilingual machine translation models released by Google. The models are based on the T5 architecture and were trained on the MADLAD-400 corpus covering over 450 languages. For more details see [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662).
701 | 
702 | #### Language models
703 | - [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) (2018) - BERT (Bidirectional Encoder Representations from Transformers) is a model for generating contextual word representations. Multilingual cased model provided by Google supports 104 languages including Polish.
704 | - [XLM-RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/xlmr) (2019) - Cross lingual language model trained on 2.5 terabytes of data from CommonCrawl and Wikipedia. Supports 100 languages including Polish. See [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/pdf/1911.02116.pdf) for details.
705 | - [Slavic BERT](https://github.com/deepmipt/Slavic-BERT-NER#slavic-bert) (2019) - Multilingual BERT model supporting Bulgarian (bg), Czech (cs), Polish (pl) and Russian (ru) languages.
706 | - [mT5](https://github.com/google-research/multilingual-t5) (2020) - Google's text-to-text transformer for 101 languages based on the T5 architecture. Details in the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934).
707 | - [HerBERT](https://huggingface.co/allegro) (2020) - Polish BERT-based language model trained by Allegro for HuggingFace Transformers in [base](https://huggingface.co/allegro/herbert-base-cased) and [large](https://huggingface.co/allegro/herbert-large-cased) variant.
708 | - [plT5](https://huggingface.co/allegro/plt5-large) (2021) - Polish version of the T5 model available in [small](https://huggingface.co/allegro/plt5-small), [base](https://huggingface.co/allegro/plt5-base) and [large](https://huggingface.co/allegro/plt5-large) sizes.
709 | - [ByT5](https://huggingface.co/docs/transformers/model_doc/byt5) (2021) - A multilignual sequence to sequence model similar to T5, but using raw byte sequences as inputs instead of subword tokens. Introduced in the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626).
710 | - [XLM-RoBERTa-XL and XXL](https://github.com/pytorch/fairseq/blob/main/examples/xlmr/README.md) (2021) - Large-scale versions of XLM-RoBERTa models with 3.5 and 10.7 billion parameters respectively. For more information see [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/pdf/2105.00572.pdf).
711 | - [mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke) (2021) - A multilingual version of LUKE, Transformer-based language model enriched with entity metadata. The model supports 24 languages including Polish. For more information see [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/pdf/2110.08151.pdf).
712 | - [XGLM](https://huggingface.co/facebook/xglm-4.5B) (2021) - A GPT style autoregressive Transformer language model trained on a large-scale multilingual corpus. The model was published in several sizes, but only the 4.5B variant includes Polish language. For more information see [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668).
713 | - [PapuGaPT2](https://huggingface.co/flax-community/papuGaPT2) (2021) - Polish GPT-like autoregressive models available in [base](https://huggingface.co/flax-community/papuGaPT2) and [large](https://huggingface.co/flax-community/papuGaPT2-large) sizes.
714 | - [mGPT](https://huggingface.co/sberbank-ai/mGPT) (2022) - Another multilingual GPT style model with 1.3B parameters, covering 60 languages. The model has been trained by Sberbank AI. For more information see [mGPT: Few-Shot Learners Go Multilingual](https://arxiv.org/abs/2204.07580). 
715 | - [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) (2022) - An improved version of T5 model, fine-tuned on a broad set of downstream tasks in multiple languages. Flan-T5 models can be used in zero-shot and few-shot scenarios, they can also be further fine-tuned for specific task. For more information see [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf). 
716 | - [XLM-V](https://huggingface.co/facebook/xlm-v-base) (2023) - A multilingual transformer-based language model utilising large vocabulary of 1 million tokens, which brings significant improvements on downstream tasks for some languages. Apart from a larger vocabulary, the model's architecture is similar to previously published XLM-R models. For more information see [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472). 
717 | - [umT5](https://console.cloud.google.com/storage/browser/t5-data/pretrained_models/t5x/umt5_xxl) (2023) - An improved mt5 model trained using a more uniform language distribution. For more information see [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://arxiv.org/pdf/2304.09151.pdf).
718 | - [mLongT5](https://console.cloud.google.com/storage/browser/t5-data/pretrained_models/t5x/mlongt5) (2023) - A multilingual version of LongT5 which is an extension of the T5 model that handles long inputs of up to 16k tokens. Supports 101 languages including Polish. For more information see [mLongT5: A Multilingual and Efficient Text-To-Text Transformer for
719 | Longer Sequences](https://arxiv.org/pdf/2305.11129.pdf).
720 | 
721 | #### Sentence encoders
722 | - [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/1) (2019) - USE (Universal Sentence Encoder) generates sentence level langauge representations. Pre-trained multilingual model supports 16 langauges (Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian).
723 | - [LASER Language-Agnostic SEntence Representations](https://github.com/facebookresearch/LASER) (2019) - A multilingual sentence encoder by Facebook Research, supporting 93 languages.
724 | - [LaBSE](https://tfhub.dev/google/LaBSE/1) (2020) - Language-agnostic BERT sentence embedding model supporting 109 languages. See [Language-agnostic BERT Sentence Embedding](https://arxiv.org/abs/2007.01852) for details.
725 | - [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (2020) - Sentence-level models based on the transformer architecture. The library includes multilingual models supporting Polish. More information on multilingual knowledge distillation method used by the authors can be found in [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813).
726 | - [LASER2 and LASER3](https://github.com/facebookresearch/LASER/blob/main/nllb/README.md) (2022) - New versions of the LASER sentence encoder by Meta AI, developed as a part of the NLLB (No Language Left Behind) project. LASER2 supports the same set of languages as the first version of the encoder, which includes Polish. LASER3 adds support to less common languages, mostly low-resource African languages. See [Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages](https://arxiv.org/pdf/2205.12654.pdf) for more details.
727 | - [E5](https://huggingface.co/intfloat/multilingual-e5-large) (2022) - A general purpose text encoder which can be applied to a variety of tasks such as information retrieval, semantic textual similarity, text reranking, or clustering. The models were trained using a large dataset of text pairs extracted from CommonCrawl. Recently, multilingual E5 models supporting Polish have been published in [small](https://huggingface.co/intfloat/multilingual-e5-small), [base](https://huggingface.co/intfloat/multilingual-e5-base), and [large](https://huggingface.co/intfloat/multilingual-e5-large) versions. See [Text Embeddings by Weakly-Supervised Contrastive Pre-training](https://arxiv.org/abs/2212.03533) for more details.
728 | 
729 | #### Optical character recognition (OCR)
730 | - [Easy OCR](https://github.com/JaidedAI/EasyOCR) - Optical character recognition toolkit with pre-trained models for over 40 languages, including Polish.
731 | - [Tesseract](https://github.com/tesseract-ocr/tesseract) - Popular OCR software developed since 1980s, supporting over 100 languages. For integration with Python, wrappers such as [PyTesseract](https://github.com/madmaze/pytesseract) or [OCRMyPDF](https://github.com/ocrmypdf/OCRmyPDF) can be used. 
732 | 
733 | #### Speech processing (speech recognition, text-to-speech, voice cloning etc.)
734 | - [Quartznet - Nvidia NeMo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_pl_quartznet15x5) (2021) - Nvidia NeMo is a toolkit for building conversational AI models. Apart from the framework itself, Nvidia also published many models trained using their code, which includes a speech recognition model for Polish based on Quartznet architecture.
735 | - [XLS-R](https://huggingface.co/facebook/wav2vec2-xls-r-300m) (2021) - XLS-R is a multilingual version of Wav2Vec 2.0 model by Meta AI, which is a large-scale pre-trained model for speech processing. The model is trained in a self-supervised way, so it needs to be fine-tuned for solving specific tasks such as ASR. Several fine-tuned checkpoints for Polish speech recognition exist on the HuggingFace Hub e.g. [wav2vec2-large-xlsr-53-polish](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-polish)
736 | - [M-CTC-T](https://huggingface.co/speechbrain/m-ctc-t-large) (2021) - A speech recognition model from Meta AI, supporting 60 languages including Polish. For more information see [Pseudo-Labeling for Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161).
737 | - [Whisper](https://github.com/openai/whisper/) (2022) - Whisper is a model released by OpenAI for ASR and other speech-related tasks, supporting 82 languages. The model is available in five sizes: tiny (39M params), base (74M), small (244M), medium (769M), and large (1.5B). More information can be found in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf).
738 | - [MMS](https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md) (2023) - Massively Multilingual Speech (MMS) is a large-scale multilingual speech foundation model published by Meta AI. Along with the pre-trained models, they also released checkpoints fine-tuned for specific tasks such as [speech recognition](https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md#asr), [text to speech](https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md#tts), and [language identification](https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md#lid). For more information see [Scaling Speech Technology to 1,000+ Languages](https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/).
739 | - [SeamlessM4T](https://github.com/facebookresearch/seamless_communication) (2023) - Multilingual and multitask model trained on text and speech data. It covers almost 100 languages including Polish and can perform automatic speech recognition (ASR) as well as multimodal translation tasks across languages: speech-to-text, text-to-speech, speech-to-speech, text-to-text. See [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) for more details.
740 | - [SONAR](https://github.com/facebookresearch/SONAR#supported-languages-and-download-links) (2023) - Multilingual embeddings for speech and text with a set of additional models fine-tuned for specific tasks such as text translation, speech-to-text translation, or cross-lingual semantic similarity. See [SONAR: Sentence-Level Multimodal and Language-Agnostic Representations](https://ai.meta.com/research/publications/sonar-sentence-level-multimodal-and-language-agnostic-representations/) for more details.
741 | - [XTTS](https://huggingface.co/coqui/XTTS-v1) (2023) - A text-to-speech model that allows voice cloning by using just a 3-second audio sample of the target voice. Supports 13 languages including Polish.
742 | 
743 | #### Multimodal models
744 | - [Multilingual CLIP (SBert)](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1) (2021) - CLIP (Contrastive Language-Image Pre-Training) is a neural network introducted by [OpenAI](https://github.com/openai/CLIP) which enables joint vector representations for images and text. It can be used for building image search engines. This is a multilingual version of CLIP trained by the authors of the [Sentence-Transformers](https://www.sbert.net/) library.
745 | - [Multilingual CLIP (M-CLIP)](https://huggingface.co/M-CLIP/M-BERT-Base-ViT-B) (2021) - This is yet another multilingual version of CLIP supporting Polish language, trained by the Swedish Institute of Computer Science (SICS).
746 | - [LayoutXLM](https://huggingface.co/microsoft/layoutxlm-base) (2021) - A multilingual version of [LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2) model, pre-trained on 30 million  documents in 53 languages. The model combines visual, spatial, and textual modalities to solve prediction problems on visually-rich documents, such as PDFs or DOCs. See [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) and [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) for details.
747 | 


--------------------------------------------------------------------------------