├── danish_itos.pkl
├── finnish_itos.pkl
├── norwegian_itos.pkl
├── .gitattributes
├── danish_enc.h5
├── danish_enc.pth
├── finnish_enc.h5
├── finnish_enc.pth
├── norwegian_enc.h5
├── norwegian_enc.pth
└── README.md

/danish_itos.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mollerhoj/Scandinavian-ULMFiT/HEAD/danish_itos.pkl
--------------------------------------------------------------------------------
/finnish_itos.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mollerhoj/Scandinavian-ULMFiT/HEAD/finnish_itos.pkl
--------------------------------------------------------------------------------
/norwegian_itos.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mollerhoj/Scandinavian-ULMFiT/HEAD/norwegian_itos.pkl
--------------------------------------------------------------------------------
/.gitattributes:
--------------------------------------------------------------------------------
1 | *.h5 filter=lfs diff=lfs merge=lfs -text
2 | *.pth filter=lfs diff=lfs merge=lfs -text
3 |
--------------------------------------------------------------------------------
/danish_enc.h5:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:f2a7820b7ed76d2e0346eb6ed0e5fe04897e610a8f3b3bed7d3a25d0087249ec
3 | size 128851518
4 |
--------------------------------------------------------------------------------
/danish_enc.pth:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:181ba8e616d7e0326a879b6a9cd00a25e4b4a68cd78f7592928a0c0d652aaf7f
3 | size 128851663
4 |
--------------------------------------------------------------------------------
/finnish_enc.h5:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:abdf858f780a3f34ee244a21432fd085146786fdfd4738141407166be3a06e76
3 | size 128852062
4 |
--------------------------------------------------------------------------------
/finnish_enc.pth:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:d68a6383c1646b0e6a29aa9df966366f3acce7c2a1f0ed390f921ef917db5a26
3 | size 128851761
4 |
--------------------------------------------------------------------------------
/norwegian_enc.h5:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:5fb3ed4f2fc56d0c987af9c6a6d21073afaa7dd79ba6ec3ebb73bd8c4aaddd6f
3 | size 128852067
4 |
--------------------------------------------------------------------------------
/norwegian_enc.pth:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:21d838d8d8a6492a7429203010b99cd7262171baee279fb3be01935e307ed942
3 | size 128851761
4 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | We're in the process of releasing BERT models as well. Get the first one here: https://github.com/mollerhoj/danish_bert
2 |
3 | # Scandinavian ULMFiT
4 |
5 | Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch.
6 |
7 | This repository contains the weights for the embedding layer of a ULMFiT language model that can be used as the first step when fine-tuning a model for any natural language processing task.
8 |
9 | The weights were trained on 90% of the text in the corresponding language's Wikipedia as of 3 July 2018. The remaining 10% was used for validation.
10 |
11 | # Supported Languages:
12 |
13 | - Danish
14 |
15 | Trained on 78,373,122 tokens, and validated on 7,837,310 tokens. We achieve a perplexity of 30.9.
16 | Download files: [Link](https://www.dropbox.com/s/mipfzhj71ecptbd/danish.zip?dl=0)
17 |
18 | - Norwegian
19 |
20 | Trained on 80,284,231 tokens, and validated on 8,920,387 tokens. We achieve a perplexity of 26.31.
21 | Download files: [Link](https://www.dropbox.com/s/lwr5kvbxri1gvv9/norwegian.zip?dl=0)
22 |
23 | - Finnish
24 |
25 | Trained on 68,775,370 tokens, and validated on 7,641,571 tokens. We achieve a perplexity of 27.66.
26 | Download files: [Link](https://www.dropbox.com/s/3wl620c603ewvgo/finnish.zip?dl=0)
27 |
28 | Training higher-performing models is possible, but requires more (costly) training time. If you need a model with higher performance, feel free to contact us.
29 |
30 | Our servers crashed while training the Swedish model, but if you need it, contact us and we can train it for you.
31 |
32 | ### Paper
33 |
34 | See Universal Language Model Fine-tuning for Text Classification, Jeremy Howard, Sebastian Ruder, https://arxiv.org/abs/1801.06146
35 |
36 | ### File descriptions
37 |
38 | - enc.h5 contains the weights in Hierarchical Data Format (HDF5)
39 |
40 | - enc.pth contains the weights in PyTorch model format
41 |
42 | - itos.pkl (Integers to Strings) contains the vocabulary mapping from ids (0 - 30000) to strings
43 |
44 | ### Sponsor
45 |
46 | This work was sponsored by the Danish chatbot company BotXO
47 | http://www.botxo.co/
48 |
49 | ### Thanks
50 |
51 | Thanks to Tobias Lindberg from Damvad Analytics for converting the vectors to pth-format.
52 |
--------------------------------------------------------------------------------
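
The file descriptions in the README, together with the Git LFS pointer files listed earlier, suggest two checks a user might run after downloading: reading the vocabulary and confirming that a weight file matches its pointer. The following is a minimal sketch using only the Python standard library, not the authors' own tooling. The function names (`load_vocab`, `parse_lfs_pointer`, `matches_pointer`) are hypothetical, and the sketch assumes that `itos.pkl` unpickles to a plain list of token strings whose list index is the token id, which the README implies but does not state outright.

```python
import hashlib
import pickle
from pathlib import Path


def load_vocab(itos_path):
    """Load an itos.pkl file and build the reverse (string -> id) mapping.

    Assumption: the pickle holds a list of token strings indexed by id.
    Unpickling can execute arbitrary code, so only load files you trust.
    """
    with open(itos_path, "rb") as f:
        itos = pickle.load(f)
    stoi = {token: idx for idx, token in enumerate(itos)}
    return itos, stoi


def parse_lfs_pointer(text):
    """Parse Git LFS pointer text into (sha256 hex digest, size in bytes)."""
    # Each pointer line is "<key> <value>", e.g. "size 128851518".
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    if algo != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return digest, int(fields["size"])


def matches_pointer(weights_path, pointer_text, chunk_size=1 << 20):
    """Return True if the file has the size and SHA-256 named in the pointer."""
    expected_digest, expected_size = parse_lfs_pointer(pointer_text)
    path = Path(weights_path)
    if path.stat().st_size != expected_size:
        return False
    sha = hashlib.sha256()
    with path.open("rb") as f:
        # Hash in chunks so a ~128 MB weight file is never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest() == expected_digest
```

With the files in the working directory, `load_vocab("danish_itos.pkl")` would return the id-to-string list and its inverse, and `matches_pointer("danish_enc.pth", pointer_text)` would compare a downloaded weight file against the three-line pointer text shown for `/danish_enc.pth` (without the `N |` line-number prefixes of this listing). Note that a clone made without Git LFS installed leaves only the pointer text in place of each `.h5` and `.pth` file, so a size mismatch is the first thing to expect in that case.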