# GC4LM: A Colossal (Biased) language model for German

This repository presents a colossal (and biased) language model for German trained on the recently released
["German colossal, clean Common Crawl corpus"](https://german-nlp-group.github.io/projects/gc4-corpus.html) (GC4),
with a total dataset size of ~844GB.

---

**Disclaimer**: the language models presented and trained in this repository are for **research purposes only**.
The GC4 corpus, which was used for training, contains crawled texts from the internet. The language models can
therefore be considered highly biased, encoding stereotypical associations along gender, race,
ethnicity and disability status. Before using and working with the released checkpoints, it is highly recommended
to read:

[On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?](https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf)

by Emily M. Bender, Timnit Gebru, Angelina McMillan-Major and Shmargaret Shmitchell.

The aim of the released checkpoints is to boost research on large pre-trained language models for German, especially
on identifying biases and how to prevent them, as most such research is currently done for English only.

---

Please use the new GitHub Discussions feature to discuss or present further research questions.
Feel free to use `#gc4lm` on Twitter 🐦.

# Changelog

* 02.05.2021: Initial version

# Preprocessing

After downloading the complete `HEAD` and `MIDDLE` parts of GC4, we unpack the downloaded archives and extract the
raw content (incl. language score filtering) with the
[Gist](https://gist.github.com/Phil1108/e1821fec6eb746edc8e04ef5f76d23f1) provided by the GC4 team.

In a second pre-processing script we perform sentence splitting of the whole pre-training corpus. One of the fastest
solutions is to use NLTK (with its German model) instead of e.g. spaCy.

After extraction, language score filtering and sentence splitting, the resulting dataset size is **844GB**.
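A minimal sketch of this sentence-splitting step with NLTK's German Punkt model could look like the following
(file names are placeholders, not the actual scripts used for pre-training):

```python
import sys

import nltk

# The German Punkt sentence splitter ships with the standard "punkt" package.
nltk.download("punkt")


def split_sentences(input_path: str, output_path: str) -> None:
    """Write one sentence per line, using NLTK's German sentence splitter."""
    with open(input_path, encoding="utf-8") as f_in, \
         open(output_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            line = line.strip()
            if not line:
                continue
            for sentence in nltk.sent_tokenize(line, language="german"):
                f_out.write(sentence + "\n")


if __name__ == "__main__":
    # Usage: python split_sentences.py <extracted_corpus.txt> <corpus_sentences.txt>
    split_sentences(sys.argv[1], sys.argv[2])
```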
After sentence splitting, the next step is to create an ELECTRA-compatible vocab, which is described in the next section.

# Vocab generation

The vocab generation workflow is mainly inspired by a blog post from Judit Ács, ["Exploring BERT's Vocabulary"](https://juditacs.github.io/2019/02/19/bert-tokenization-stats.html),
and the recently released paper ["How Good is Your Tokenizer?"](https://arxiv.org/abs/2012.15613)
by Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder and Iryna Gurevych.

We mainly focus on calculating the subword fertility on the training and development data of popular downstream
tasks such as named entity recognition (NER), PoS tagging and text classification. For that purpose we use the
tokenized training and development data from:

* [GermEval 2014](https://sites.google.com/site/germeval2014ner/data)
* [GermEval 2018](https://projects.fzai.h-da.de/iggsa/germeval-2018/) (spaCy is used for tokenization)
* [Universal Dependencies - German HDT](https://github.com/UniversalDependencies/UD_German-HDT)

and calculate the subword fertility and portion of unknown (sub)words for various released German language models:

| Model name                     | Subword fertility | `UNK` portion
| ------------------------------ | ----------------- | -------------
| `bert-base-german-cased`       | 1.4433            | 0.0083%
| `bert-base-german-dbmdz-cased` | 1.4070            | 0.0050%
| This work (32k)                | 1.3955            | 0.0011%
| This work (64k)                | 1.3050            | 0.0011%

We then decided to create a new vocabulary based on the `HEAD` and `MIDDLE` parts of GC4 and selected the following archives to generate the new vocab on:

* `0000_2015-48` (from `HEAD`, 2.5GB)
* `0004_2016-44` (from `HEAD`, 2.1GB) and `0006_2016-44` (from `MIDDLE`, 861MB)
* `0003_2017-30` (from `HEAD`, 2.4GB) and `0007_2017-51` (from `MIDDLE`, 1.1GB)
* `0007_2018-30` (from `HEAD`, 409MB) and `0007_2018-51` (from `MIDDLE`, 4.9GB)
* `0006_2019-09` (from `HEAD`, 1.8GB) and `0008_2019-30` (from `MIDDLE`, 2.2GB)
* `0003_2020-10` (from `HEAD`, 4.5GB) and `0007_2020-10` (from `MIDDLE`, 4.0GB)

This results in a 27GB corpus that is used for vocab generation.

We decided to generate both a 32k and a 64k vocabulary, using the awesome Hugging Face [Tokenizers](https://github.com/huggingface/tokenizers) library.
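The exact tokenizer settings used for the released vocabularies are not documented here; a minimal sketch of training a
cased, ELECTRA/BERT-compatible WordPiece vocab with the Tokenizers library could look like this (file and directory
names are placeholders):

```python
import os

from tokenizers import BertWordPieceTokenizer

# Cased vocab: keep case information and German umlauts intact.
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=False,
    lowercase=False,
)

# "vocab_corpus.txt" is a placeholder for the 27GB corpus described above.
tokenizer.train(
    files=["vocab_corpus.txt"],
    vocab_size=64_000,  # 32_000 for the smaller vocab
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes a BERT/ELECTRA-compatible vocab.txt into the output directory.
os.makedirs("vocab-64k", exist_ok=True)
tokenizer.save_model("vocab-64k")
```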
# GC4ELECTRA

The first large pre-trained language model on the GC4 corpus is an ELECTRA-based model: *GC4ELECTRA*. It was trained
on a v3-32 TPU with the same parameters as the Turkish ELECTRA model and uses the **64k** vocabulary (a 32k model is currently training).

**Notice**: we do not release just **one** model. Instead, we release all model checkpoints (in 100k-step intervals) to enable more research possibilities.

The following checkpoints are available from the Hugging Face Model Hub. Thanks to Hugging Face for providing this amazing infrastructure!

We also include the original TensorFlow checkpoint in each model repository on the Hub.

## Discriminator & generator checkpoints

| Model Hub Name | Checkpoint (Step)
| -------------- | -----------------
| [`electra-base-gc4-64k-0-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-0-cased-discriminator) - [`electra-base-gc4-64k-0-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-0-cased-generator) | 0 (Initial)
| [`electra-base-gc4-64k-100000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-100000-cased-discriminator) - [`electra-base-gc4-64k-100000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-100000-cased-generator) | 100,000 steps
| [`electra-base-gc4-64k-200000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-200000-cased-discriminator) - [`electra-base-gc4-64k-200000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-200000-cased-generator) | 200,000 steps
| [`electra-base-gc4-64k-300000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-300000-cased-discriminator) - [`electra-base-gc4-64k-300000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-300000-cased-generator) | 300,000 steps
| [`electra-base-gc4-64k-400000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-400000-cased-discriminator) - [`electra-base-gc4-64k-400000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-400000-cased-generator) | 400,000 steps
| [`electra-base-gc4-64k-500000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-500000-cased-discriminator) - [`electra-base-gc4-64k-500000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-500000-cased-generator) | 500,000 steps
| [`electra-base-gc4-64k-600000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-600000-cased-discriminator) - [`electra-base-gc4-64k-600000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-600000-cased-generator) | 600,000 steps
| [`electra-base-gc4-64k-700000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-700000-cased-discriminator) - [`electra-base-gc4-64k-700000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-700000-cased-generator) | 700,000 steps
| [`electra-base-gc4-64k-800000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-800000-cased-discriminator) - [`electra-base-gc4-64k-800000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-800000-cased-generator) | 800,000 steps
| [`electra-base-gc4-64k-900000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-900000-cased-discriminator) - [`electra-base-gc4-64k-900000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-900000-cased-generator) | 900,000 steps
| [`electra-base-gc4-64k-1000000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-1000000-cased-discriminator) - [`electra-base-gc4-64k-1000000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-1000000-cased-generator) | 1M steps

**Notice**: You should use the generator models for MLM tasks like masked token prediction. The discriminator models
should be used for fine-tuning on downstream tasks like NER, PoS tagging, text classification and many more.
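A minimal usage sketch with the [Transformers](https://github.com/huggingface/transformers) library, based on the final
(1M steps) checkpoints from the table above (the example sentence and the number of labels are purely illustrative):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Masked token prediction with the generator checkpoint
# (assumes the generator config exposes a masked-LM head; example sentence is illustrative).
fill_mask = pipeline(
    "fill-mask",
    model="stefan-it/electra-base-gc4-64k-1000000-cased-generator",
)
print(fill_mask("Die Hauptstadt von Bayern ist [MASK]."))

# Fine-tuning on a downstream task (e.g. NER) starts from the discriminator checkpoint.
model_name = "stefan-it/electra-base-gc4-64k-1000000-cased-discriminator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=9,  # placeholder label count, e.g. for a CoNLL-style NER tag set
)
```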
## Training Loss

The following plot shows the training loss curve over 1M steps:

![GC4ELECTRA - training loss curve](figures/gc4electra_64k_loss.png)

# License

All models are licensed under [MIT](LICENSE).

# Contact (Bugs, Feedback, Contribution and more)

Please use the new [GitHub Discussions](https://github.com/stefan-it/gc4-lms/discussions) for feedback, or just file a PR for suggestions/corrections.

# Acknowledgments

Thanks to [Philip May](https://github.com/PhilipMay), [Philipp Reißel](https://github.com/Phil1108) and the Institute of Information Systems (iisys) at Hof University
for releasing and hosting the "German colossal, cleaned Common Crawl corpus" (GC4).

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Thanks for providing access to the TFRC ❤️

Thanks to the generous support of the [Hugging Face](https://huggingface.co/) team,
it is possible to store and download all checkpoints from their Model Hub 🤗