# GC4LM: A Colossal (Biased) language model for German

This repository presents a colossal (and biased) language model for German trained on the recently released
["German colossal, clean Common Crawl corpus"](https://german-nlp-group.github.io/projects/gc4-corpus.html) (GC4),
with a total dataset size of ~844GB.

---

**Disclaimer**: the language models presented and trained in this repository are for **research purposes only**.
The GC4 corpus, which was used for training, contains crawled texts from the internet. The language models can
therefore be considered highly biased, encoding stereotypical associations along gender, race,
ethnicity and disability status. Before using and working with the released checkpoints, it is highly recommended
to read:

[On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?](https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf)

by Emily M. Bender, Timnit Gebru, Angelina McMillan-Major and Shmargaret Shmitchell.

The aim of the released checkpoints is to boost research on large pre-trained language models for German, especially
on identifying biases and how to prevent them, as most such research is currently done for English only.

---

Please use the new GitHub Discussions feature to discuss or present further research questions.
Feel free to use `#gc4lm` on Twitter 🐦.

# Changelog

* 02.05.2021: Initial version

# Preprocessing

After downloading the complete `HEAD` and `MIDDLE` parts of GC4, we unpack the downloaded archives and extract the
raw content (incl. language score filtering) with the
[Gist](https://gist.github.com/Phil1108/e1821fec6eb746edc8e04ef5f76d23f1) provided by the GC4 team.

In a second pre-processing script we perform sentence splitting of the whole pre-training corpus. One of the fastest
solutions is to use NLTK (with its German model) instead of e.g. spaCy.

After extraction, language score filtering and sentence splitting, the resulting dataset size is **844GB**.
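A minimal sketch of this sentence-splitting step with NLTK's German Punkt model could look like the following
(file names are placeholders, not the actual scripts used for pre-training):

```python
import sys

import nltk

# The German Punkt sentence splitter ships with the standard "punkt" package.
nltk.download("punkt")


def split_sentences(input_path: str, output_path: str) -> None:
    """Write one sentence per line, using NLTK's German sentence splitter."""
    with open(input_path, encoding="utf-8") as f_in, \
         open(output_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            line = line.strip()
            if not line:
                continue
            for sentence in nltk.sent_tokenize(line, language="german"):
                f_out.write(sentence + "\n")


if __name__ == "__main__":
    # Usage: python split_sentences.py <extracted_corpus.txt> <corpus_sentences.txt>
    split_sentences(sys.argv[1], sys.argv[2])
```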
After sentence splitting, the next step is to create an ELECTRA-compatible vocab, which is described in the next section.

# Vocab generation

The vocab generation workflow is mainly inspired by a blog post from Judit Ács, ["Exploring BERT's Vocabulary"](https://juditacs.github.io/2019/02/19/bert-tokenization-stats.html),
and the recently released paper ["How Good is Your Tokenizer?"](https://arxiv.org/abs/2012.15613)
by Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder and Iryna Gurevych.

We mainly focus on calculating the subword fertility on the training and development data of popular downstream
tasks such as named entity recognition (NER), PoS tagging and text classification. For that purpose we use the
tokenized training and development data from:

* [GermEval 2014](https://sites.google.com/site/germeval2014ner/data)
* [GermEval 2018](https://projects.fzai.h-da.de/iggsa/germeval-2018/) (spaCy is used for tokenization)
* [Universal Dependencies - German HDT](https://github.com/UniversalDependencies/UD_German-HDT)

and calculate the subword fertility and portion of unknown (sub)words for various released German language models:

| Model name                     | Subword fertility | `UNK` portion
| ------------------------------ | ----------------- | -------------
| `bert-base-german-cased`       | 1.4433            | 0.0083%
| `bert-base-german-dbmdz-cased` | 1.4070            | 0.0050%
| This work (32k)                | 1.3955            | 0.0011%
| This work (64k)                | 1.3050            | 0.0011%

We then decided to create a new vocabulary based on the `HEAD` and `MIDDLE` parts of GC4 and selected the following archives to generate the new vocab on:

* `0000_2015-48` (from `HEAD`, 2.5GB)
* `0004_2016-44` (from `HEAD`, 2.1GB) and `0006_2016-44` (from `MIDDLE`, 861MB)
* `0003_2017-30` (from `HEAD`, 2.4GB) and `0007_2017-51` (from `MIDDLE`, 1.1GB)
* `0007_2018-30` (from `HEAD`, 409MB) and `0007_2018-51` (from `MIDDLE`, 4.9GB)
* `0006_2019-09` (from `HEAD`, 1.8GB) and `0008_2019-30` (from `MIDDLE`, 2.2GB)
* `0003_2020-10` (from `HEAD`, 4.5GB) and `0007_2020-10` (from `MIDDLE`, 4.0GB)

This results in a 27GB corpus that is used for vocab generation.

We decided to generate both a 32k and a 64k vocabulary, using the awesome Hugging Face [Tokenizers](https://github.com/huggingface/tokenizers) library.
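The exact tokenizer settings used for the released vocabularies are not documented here; a minimal sketch of training a
cased, ELECTRA/BERT-compatible WordPiece vocab with the Tokenizers library could look like this (file and directory
names are placeholders):

```python
import os

from tokenizers import BertWordPieceTokenizer

# Cased vocab: keep case information and German umlauts intact.
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=False,
    lowercase=False,
)

# "vocab_corpus.txt" is a placeholder for the 27GB corpus described above.
tokenizer.train(
    files=["vocab_corpus.txt"],
    vocab_size=64_000,  # 32_000 for the smaller vocab
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes a BERT/ELECTRA-compatible vocab.txt into the output directory.
os.makedirs("vocab-64k", exist_ok=True)
tokenizer.save_model("vocab-64k")
```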
# GC4ELECTRA

The first large pre-trained language model on the GC4 corpus is an ELECTRA-based model: *GC4ELECTRA*. It was trained
on a v3-32 TPU with the same parameters as the Turkish ELECTRA model and uses the **64k** vocabulary (a 32k model is currently training).

**Notice**: we do not release just **one** model. Instead, we release all model checkpoints (in 100k-step intervals) to enable more research possibilities.

The following checkpoints are available from the Hugging Face Model Hub. Thanks to Hugging Face for providing this amazing infrastructure!

We also include the original TensorFlow checkpoint in each model repository on the Hub.

## Discriminator & generator checkpoints

| Model Hub Name | Checkpoint (Step)
| -------------- | -----------------
| [`electra-base-gc4-64k-0-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-0-cased-discriminator) - [`electra-base-gc4-64k-0-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-0-cased-generator) | 0 (Initial)
| [`electra-base-gc4-64k-100000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-100000-cased-discriminator) - [`electra-base-gc4-64k-100000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-100000-cased-generator) | 100,000 steps
| [`electra-base-gc4-64k-200000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-200000-cased-discriminator) - [`electra-base-gc4-64k-200000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-200000-cased-generator) | 200,000 steps
| [`electra-base-gc4-64k-300000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-300000-cased-discriminator) - [`electra-base-gc4-64k-300000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-300000-cased-generator) | 300,000 steps
| [`electra-base-gc4-64k-400000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-400000-cased-discriminator) - [`electra-base-gc4-64k-400000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-400000-cased-generator) | 400,000 steps
| [`electra-base-gc4-64k-500000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-500000-cased-discriminator) - [`electra-base-gc4-64k-500000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-500000-cased-generator) | 500,000 steps
| [`electra-base-gc4-64k-600000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-600000-cased-discriminator) - [`electra-base-gc4-64k-600000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-600000-cased-generator) | 600,000 steps
| [`electra-base-gc4-64k-700000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-700000-cased-discriminator) - [`electra-base-gc4-64k-700000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-700000-cased-generator) | 700,000 steps
| [`electra-base-gc4-64k-800000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-800000-cased-discriminator) - [`electra-base-gc4-64k-800000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-800000-cased-generator) | 800,000 steps
| [`electra-base-gc4-64k-900000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-900000-cased-discriminator) - [`electra-base-gc4-64k-900000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-900000-cased-generator) | 900,000 steps
| [`electra-base-gc4-64k-1000000-cased-discriminator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-1000000-cased-discriminator) - [`electra-base-gc4-64k-1000000-cased-generator`](https://huggingface.co/stefan-it/electra-base-gc4-64k-1000000-cased-generator) | 1M steps

**Notice**: You should use the generator models for MLM tasks like masked token prediction. The discriminator models
should be used for fine-tuning on downstream tasks like NER, PoS tagging, text classification and many more.
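A minimal usage sketch with the [Transformers](https://github.com/huggingface/transformers) library, based on the final
(1M steps) checkpoints from the table above (the example sentence and the number of labels are purely illustrative):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Masked token prediction with the generator checkpoint
# (assumes the generator config exposes a masked-LM head; example sentence is illustrative).
fill_mask = pipeline(
    "fill-mask",
    model="stefan-it/electra-base-gc4-64k-1000000-cased-generator",
)
print(fill_mask("Die Hauptstadt von Bayern ist [MASK]."))

# Fine-tuning on a downstream task (e.g. NER) starts from the discriminator checkpoint.
model_name = "stefan-it/electra-base-gc4-64k-1000000-cased-discriminator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=9,  # placeholder label count, e.g. for a CoNLL-style NER tag set
)
```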
## Training Loss

The following plot shows the training loss curve over 1M steps:

![GC4ELECTRA - training loss curve](figures/gc4electra_64k_loss.png)

# License

All models are licensed under [MIT](LICENSE).

# Contact (Bugs, Feedback, Contribution and more)

Please use the new [GitHub Discussions](https://github.com/stefan-it/gc4-lms/discussions) for feedback, or just file a PR for suggestions/corrections.

# Acknowledgments

Thanks to [Philip May](https://github.com/PhilipMay), [Philipp Reißel](https://github.com/Phil1108) and the Institute of Information Systems (iisys) at Hof University
for releasing and hosting the "German colossal, cleaned Common Crawl corpus" (GC4).

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Thanks for providing access to the TFRC ❤️

Thanks to the generous support of the [Hugging Face](https://huggingface.co/) team,
it is possible to store and download all checkpoints from their Model Hub 🤗