├── .gitignore └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | # Created by https://www.gitignore.io/api/osx 3 | 4 | ### OSX ### 5 | *.DS_Store 6 | .AppleDouble 7 | .LSOverride 8 | 9 | # Icon must end with two \r 10 | Icon 11 | 12 | # Thumbnails 13 | ._* 14 | 15 | # Files that might appear in the root of a volume 16 | .DocumentRevisions-V100 17 | .fseventsd 18 | .Spotlight-V100 19 | .TemporaryItems 20 | .Trashes 21 | .VolumeIcon.icns 22 | .com.apple.timemachine.donotpresent 23 | 24 | # Directories potentially created on remote AFP share 25 | .AppleDB 26 | .AppleDesktop 27 | Network Trash Folder 28 | Temporary Items 29 | .apdisk 30 | 31 | 32 | # End of https://www.gitignore.io/api/osx 33 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Indonesian NLP resources 2 | 3 | ## Language modeling 4 | 5 | 1. [Kompas online collection](http://ilps.science.uva.nl/ilps/wp-content/uploads/sites/6/files/bahasaindonesia/kompas.zip). 6 | This corpus contains [Kompas online](http://www.kompas.com/) news articles from 2001-2002. See 7 | [here](http://ilps.science.uva.nl/resources/bahasa/) for more info and citations. 8 | 1. [Tempo online collection](http://ilps.science.uva.nl/ilps/wp-content/uploads/sites/6/files/bahasaindonesia/tempo.zip). 9 | This corpus contains [Tempo online](https://www.tempo.co/) news articles from 2000-2002. See 10 | [here](http://ilps.science.uva.nl/resources/bahasa/) for more info and citations. 11 | 1. [OSCAR](https://traces1.inria.fr/oscar/#corpus). This large corpus contains articles from many sources crawled by 12 | [CommonCrawl](https://commoncrawl.org/) and extracted by [ALMAnaCH](https://team.inria.fr/almanach/). In total there are 13 | 4B word tokens and 2B word types. (NOTE: Contains strong language, mostly coming from gambling sites.) 14 | 1. 
[Leipzig corpora collection](https://corpora.uni-leipzig.de/en?corpusId=ind_mixed_2013). Indonesian mixed corpus 15 | based on material from 2013. Sentences: 74,329,815 - Types: 7,964,109 - Tokens: 1,206,281,985. From news materials, randomly chosen websites, and Wikipedia dumps. 16 | 1. [CC-100](http://data.statmt.org/cc-100/). This large corpus contains articles from many sources crawled by [CommonCrawl](https://commoncrawl.org/) and extracted by [FAIR](https://github.com/facebookresearch). For Bahasa Indonesia, in total there are around 4.8B sentences and 6B sentence piece tokens. See [here](https://www.aclweb.org/anthology/2020.lrec-1.494.pdf) for more info and citations. 17 | 1. [IndoNLU Benchmark](https://www.indobenchmark.com/). A collective effort made by researchers and practitioners from Gojek, Institut Teknologi Bandung, HKUST, Universitas Multimedia Nusantara, Prosa.ai, and Universitas Indonesia. 18 | They provide pre-trained BERT/ALBERT [language models](https://huggingface.co/indobenchmark) 19 | that were trained on a large [corpus](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/dataset/preprocessed/dataset_all_uncased_blankline.txt.xz) of 4B words (250M sentences). They also provide single-sentence and sentence-pair [datasets](https://github.com/indobenchmark/indonlu) for evaluating classification and sequence-tagging tasks. 20 | 1. [Indonesian News Corpus](https://data.mendeley.com/datasets/2zpbjs22k3/1). 21 | This corpus contains 150,466 news articles crawled from various Indonesian news portals from the second half of 2015. 22 | 23 | ## POS tagging 24 | 25 | 1. [IDN tagged corpus](https://github.com/famrashel/idn-tagged-corpus). This corpus contains 26 | 10K sentences and 250K word tokens. The POS tags are annotated manually. 27 | 28 | ## Sentiment analysis 29 | 30 | 1. [Aspect and Opinion Terms Extraction for Hotel Reviews](https://github.com/jordhy97/final_project). 
31 | The corpus consists of 5000 hotel reviews from [Airy](https://www.airyrooms.com/) (78K tokens) with 5 labels. The paper is available on [arXiv](https://arxiv.org/abs/1908.04899). 32 | 1. [Aspect-Based Sentiment Analysis](https://github.com/annisanurulazhar/absa-playground). 33 | A text classification resource for multi-label aspect categorization. 34 | 35 | ## Syntactic parsing 36 | 37 | 1. [Indonesian Treebank](https://github.com/famrashel/idn-treebank). This corpus contains 1K parsed 38 | sentences. (constituency parsing) 39 | 1. [UD Indonesian](https://github.com/UniversalDependencies/UD_Indonesian-GSD). This corpus is 40 | provided by [Universal Dependencies](http://universaldependencies.org/). Training, development, 41 | and test splits are already provided. (dependency parsing) 42 | 43 | ## Machine translation 44 | 45 | 1. [OPUS (Open Parallel Corpus)](http://opus.nlpl.eu/). This site contains parallel corpora of Indonesian and other languages 46 | based on openly available resources (e.g., OpenSubtitles). 47 | 1. [IDENTICv1.0](https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0005-BF85-F?show=full) [[paper](http://www.lrec-conf.org/proceedings/lrec2012/pdf/644_Paper.pdf)]. 48 | Indonesian (ID)-English (EN). 45k sentences/~1M tokens (ID). Domains: science, sport, international, economy, news articles, movie subtitles. It may overlap with the PANL10N corpus. The dataset is available as raw sentences, as tokenized sentences, and in CoNLL format. 49 | 1. [IWSLT2017](https://wit3.fbk.eu/mt.php?release=2017-01-more) [[paper](https://wit3.fbk.eu/papers/WIT3-EAMT2012.pdf)]. 50 | ID-EN. ~100K sentences. TED talk subtitles (spoken language). 51 | NOTE: the provided test set tst2017-plus contains a small part of the training data (as mentioned [here](https://www.aclweb.org/anthology/P19-2043.pdf)). 52 | 1. 
[Asian Language Treebank](http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/) [[paper](http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/ALT-Parallel-Corpus-20171201/ALT-O-COCOSDA.pdf)]. 53 | ID, EN, and some Asian languages (mostly South East Asian). 20K sentences. Domain: News. 54 | 55 | ## Word normalization 56 | 57 | 1. [Colloquial Indonesian Lexicon](https://github.com/nasalsabila/kamus-alay). 58 | This lexicon consists of 3592 unique colloquial tokens that are mapped onto 1742 unique lemmas. A full description of this lexicon is given in the [paper](https://ieeexplore.ieee.org/abstract/document/8629151). 59 | 60 | ## Text summarization 61 | 62 | 1. [IndoSum](https://github.com/kata-ai/indosum). 63 | A collection of 20K online news article-summary pairs belonging to 6 categories and 10 sources. 64 | It has both abstractive summaries and extractive labels. 65 | 66 | ## Text classification 67 | 68 | 1. [SMS Spam](https://drive.google.com/file/d/1-stKadfTgJLtYsHWqXhGO3nTjKVFxm_Q/view). 69 | This corpus contains 1143 sentences, each labeled as normal message, fraud, or promotion. It is provided by Yudi Wibisono. 70 | 1. [Hate Speech Detection](https://github.com/ialfina/id-hatespeech-detection). 71 | This dataset consists of 713 tweets in the Indonesian language: 453 non-hate-speech tweets and 260 hate-speech tweets. 72 | 1. [Abusive Language Detection](https://github.com/okkyibrohim/id-abusive-language-detection). 73 | A collection of tweets for abusive language detection in Indonesian social media. It has two labeling schemes: abusive/not abusive, and not abusive/abusive but not offensive/offensive. It also has its own colloquial Indonesian lexicon. 74 | 75 | ## Speech recognition 76 | 77 | 1. [TITML-IDN speech corpus](http://research.nii.ac.jp/src/en/TITML-IDN.html). 78 | The corpus contains 20 speakers (11 male and 9 female), each of whom speaks 343 utterances. 79 | The utterances are phonetically balanced. 
80 | The corpus is free for academic/non-commercial use, but interested parties should make a formal request to the institution via email. 81 | The procedure is listed [here](http://research.nii.ac.jp/src/en/register.html). 82 | 1. [Indonesian Speech Recognition](https://github.com/frankydotid/Indonesian-Speech-Recognition). 83 | A small corpus of 50 utterances by a single male speaker. Disclaimer: This is a school project; do not use it for any important tasks. The author is not responsible for any undesired results of using the data provided here. 84 | 1. [CMU Wilderness Multilingual Speech Dataset](https://github.com/festvox/datasets-CMU_Wilderness). 85 | A dataset covering over 700 different languages that provides audio, aligned texts, and word pronunciations. 86 | One of the languages is Indonesian. The utterances are Bible readings recorded by [bible.is](bible.is). 87 | 88 | ## Paraphrase identification 89 | 90 | 1. [Translated PAWS](https://github.com/Wikidepia/indonesia_dataset/tree/master/paraphrase/PAWS). 91 | This dataset is a translation of [PAWS](https://github.com/google-research-datasets/paws). The dataset was translated using Google Translate 92 | and contains 100K human-labeled examples that highlight the importance of modeling structure, context, and word order information for the problem 93 | of paraphrase identification. 94 | --------------------------------------------------------------------------------