├── .gitignore
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
 1 | 
 2 | # Created by https://www.gitignore.io/api/osx
 3 | 
 4 | ### OSX ###
 5 | *.DS_Store
 6 | .AppleDouble
 7 | .LSOverride
 8 | 
 9 | # Icon must end with two \r
10 | Icon
11 | 
12 | # Thumbnails
13 | ._*
14 | 
15 | # Files that might appear in the root of a volume
16 | .DocumentRevisions-V100
17 | .fseventsd
18 | .Spotlight-V100
19 | .TemporaryItems
20 | .Trashes
21 | .VolumeIcon.icns
22 | .com.apple.timemachine.donotpresent
23 | 
24 | # Directories potentially created on remote AFP share
25 | .AppleDB
26 | .AppleDesktop
27 | Network Trash Folder
28 | Temporary Items
29 | .apdisk
30 | 
31 | 
32 | # End of https://www.gitignore.io/api/osx
33 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Indonesian NLP resources
 2 | 
 3 | ## Language modeling
 4 | 
 5 | 1. [Kompas online collection](http://ilps.science.uva.nl/ilps/wp-content/uploads/sites/6/files/bahasaindonesia/kompas.zip).
 6 |    This corpus contains [Kompas online](http://www.kompas.com/) news articles from 2001-2002. See
 7 |    [here](http://ilps.science.uva.nl/resources/bahasa/) for more info and citations.
 8 | 1. [Tempo online collection](http://ilps.science.uva.nl/ilps/wp-content/uploads/sites/6/files/bahasaindonesia/tempo.zip).
 9 |    This corpus contains [Tempo online](https://www.tempo.co/) news articles from 2000-2002. See
10 |    [here](http://ilps.science.uva.nl/resources/bahasa/) for more info and citations.
11 | 1. [OSCAR](https://traces1.inria.fr/oscar/#corpus). This large corpus contains articles from many sources crawled by
12 |    [CommonCrawl](https://commoncrawl.org/) and extracted by [ALMAnaCH](https://team.inria.fr/almanach/). In total there are
13 |    4B word tokens and 2B word types. (NOTE: Contains strong language, mostly coming from gambling sites.)
14 | 1. [Leipzig corpora collection](https://corpora.uni-leipzig.de/en?corpusId=ind_mixed_2013). Indonesian mixed corpus
15 |    based on material from 2013. Sentences: 74,329,815 - Types: 7,964,109 - Tokens: 1,206,281,985. From news materials, randomly chosen websites, and Wikipedia dumps.
16 | 1. [CC-100](http://data.statmt.org/cc-100/). This large corpus contains articles from many sources crawled by [CommonCrawl](https://commoncrawl.org/) and extracted by [FAIR](https://github.com/facebookresearch). For Bahasa Indonesia, in total there are around 4.8B sentences and 6B sentence piece tokens. See [here](https://www.aclweb.org/anthology/2020.lrec-1.494.pdf) for more info and citations.
17 | 1. [IndoNLU Benchmark](https://www.indobenchmark.com/). A collective effort by researchers and practitioners from Gojek, Institut Teknologi Bandung, HKUST, Universitas Multimedia Nusantara, Prosa.ai, and Universitas Indonesia.
18 |    They provide pre-trained BERT/ALBERT [language models](https://huggingface.co/indobenchmark)
19 |    that were trained on a large [corpus](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/dataset/preprocessed/dataset_all_uncased_blankline.txt.xz) of 4B words (250M sentences). They also provide single-sentence and sentence-pair [datasets](https://github.com/indobenchmark/indonlu) for evaluating classification and sequence-tagging tasks.
20 | 1. [Indonesian News Corpus](https://data.mendeley.com/datasets/2zpbjs22k3/1).
21 |    This corpus contains 150,466 news articles crawled from various Indonesian news portals from the second half of 2015.
22 | 
23 | ## POS tagging
24 | 
25 | 1. [IDN tagged corpus](https://github.com/famrashel/idn-tagged-corpus). This corpus contains
26 |    10K sentences and 250K word tokens. The POS tags are annotated manually.
27 | 
28 | ## Sentiment analysis
29 | 
30 | 1. [Aspect and Opinion Terms Extraction for Hotel Reviews](https://github.com/jordhy97/final_project).
31 |     The corpus consists of 5000 hotel reviews from [Airy](https://www.airyrooms.com/) (78K tokens) with 5 labels. The paper is available on [arXiv](https://arxiv.org/abs/1908.04899).
32 | 1. [Aspect-Based Sentiment Analysis](https://github.com/annisanurulazhar/absa-playground).
33 |     A text classification resource for multi-label aspect categorization.
34 | 
35 | ## Syntactic parsing
36 | 
37 | 1. [Indonesian Treebank](https://github.com/famrashel/idn-treebank). This corpus contains 1K parsed
38 |    sentences. (constituency parsing)
39 | 1. [UD Indonesian](https://github.com/UniversalDependencies/UD_Indonesian-GSD). This corpus is
40 |    provided by [Universal Dependencies](http://universaldependencies.org/). Training, development,
41 |    and test splits are already provided. (dependency parsing)
42 | 
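Universal Dependencies treebanks such as UD Indonesian are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, and a blank line between sentences. The following is a minimal sketch of reading that format with the standard library only. The function name `read_conllu` and the sample sentence are illustrative assumptions; the sentence is hand-written and not taken from the treebank.

```python
from typing import Dict, Iterable, Iterator, List

# The ten column names defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        else:
            cols = line.split("\t")
            # Keep plain tokens; skip multiword ranges ("1-2") and empty nodes ("1.1").
            if cols[0].isdigit():
                sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Hypothetical, hand-written example sentence (not taken from the treebank).
SAMPLE = "\n".join([
    "# text = Saya makan nasi",
    "1\tSaya\tsaya\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tmakan\tmakan\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tnasi\tnasi\tNOUN\t_\t_\t2\tobj\t_\t_",
    "",
])

sentences = list(read_conllu(SAMPLE.split("\n")))
heads = [(tok["form"], tok["head"], tok["deprel"]) for tok in sentences[0]]
```

For a real file, passing an open file handle (for example `open(path, encoding="utf-8")`) to `read_conllu` should work the same way, since the function only iterates over lines.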
43 | ## Machine translation
44 | 
45 | 1. [OPUS (Open Parallel Corpus)](http://opus.nlpl.eu/). This site contains parallel corpora of Indonesian and other languages
46 |    based on openly available resources (e.g., OpenSubtitles).
47 | 1. [IDENTICv1.0](https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0005-BF85-F?show=full) [[paper](http://www.lrec-conf.org/proceedings/lrec2012/pdf/644_Paper.pdf)].
48 |     Indonesian (ID)-English (EN). 45k sentences/~1M tokens (ID). Domains: science, sports, international, economy, news articles, and movie subtitles. It may overlap with the PANL10N corpus. The dataset is available as raw sentences, as tokenized sentences, and in CoNLL format.
49 | 1. [IWSLT2017](https://wit3.fbk.eu/mt.php?release=2017-01-more) [[paper](https://wit3.fbk.eu/papers/WIT3-EAMT2012.pdf)].
50 |     ID-EN. ~100K sentences. TED talk subtitles (spoken language).
51 |     NOTE: the provided test set tst2017-plus contains a small part of the training data (as mentioned [here](https://www.aclweb.org/anthology/P19-2043.pdf)).
52 | 1. [Asian Language Treebank](http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/) [[paper](http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/ALT-Parallel-Corpus-20171201/ALT-O-COCOSDA.pdf)].
53 |     ID, EN, and some Asian languages (mostly South East Asian). 20K sentences. Domain: News.
54 | 
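The IWSLT2017 note above mentions that the provided test set overlaps with the training data. One possible way to guard against this kind of leakage is to drop test pairs whose source sentence also occurs in the training set. The sketch below assumes the corpus is already loaded as lists of (Indonesian, English) string pairs; the helper names and the toy sentences are hypothetical and not part of any dataset's tooling.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (Indonesian sentence, English sentence)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not hide duplicates."""
    return " ".join(text.lower().split())


def remove_train_overlap(train: List[Pair], test: List[Pair]) -> List[Pair]:
    """Return the test pairs whose source sentence does not occur in the training data."""
    seen: Set[str] = {normalize(src) for src, _ in train}
    return [(src, tgt) for src, tgt in test if normalize(src) not in seen]


# Hypothetical toy data for illustration only.
train = [("Selamat pagi .", "Good morning ."), ("Terima kasih .", "Thank you .")]
test = [("selamat  pagi .", "Good morning ."), ("Apa kabar ?", "How are you ?")]

clean_test = remove_train_overlap(train, test)
```

Matching on the normalized source side only is a design choice; stricter or looser criteria (exact pair match, or fuzzy matching) may suit other setups better.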
55 | ## Word normalization
56 | 
57 | 1. [Colloquial Indonesian Lexicon](https://github.com/nasalsabila/kamus-alay).
58 |     This lexicon consists of 3592 unique colloquial tokens that are mapped onto 1742 unique lemmas. The full description of this lexicon can be seen in the [paper](https://ieeexplore.ieee.org/abstract/document/8629151).
59 | 
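A lexicon that maps colloquial tokens onto lemmas, as described above, can be applied as a simple dictionary lookup. The sketch below uses a few hypothetical entries for illustration; the real entries should be loaded from the linked repository, whose file layout is not assumed here.

```python
from typing import Dict, List

# Hypothetical entries for illustration only; they are not copied from the lexicon.
LEXICON: Dict[str, str] = {
    "gk": "tidak",
    "yg": "yang",
    "bgt": "banget",
}


def normalize_tokens(tokens: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Replace each token found in the lexicon by its lemma; keep other tokens as-is."""
    return [lexicon.get(tok.lower(), tok) for tok in tokens]


normalized = normalize_tokens("aku gk suka film yg itu".split(), LEXICON)
```

Tokens that are missing from the lexicon pass through unchanged, so the function is safe to run on text that is already standard.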
60 | ## Text summarization
61 | 
62 | 1. [IndoSum](https://github.com/kata-ai/indosum).
63 |     A collection of 20K online news article-summary pairs belonging to 6 categories and 10 sources.
64 |     It has both abstractive summaries and extractive labels.
65 | 
66 | ## Text classification
67 | 
68 | 1. [SMS Spam](https://drive.google.com/file/d/1-stKadfTgJLtYsHWqXhGO3nTjKVFxm_Q/view).
69 |    This corpus contains 1143 sentences that have been labeled as normal message, fraud, or promotion. It is provided by Yudi Wibisono.
70 | 1. [Hate Speech Detection](https://github.com/ialfina/id-hatespeech-detection).
71 |     This dataset consists of 713 tweets in the Indonesian language: 453 non-hate-speech tweets and 260 hate-speech tweets.
72 | 1. [Abusive Language Detection](https://github.com/okkyibrohim/id-abusive-language-detection).
73 |     A collection of tweets for abusive language detection in Indonesian social media. It has two labeling schemes: a binary one (abusive/not abusive) and a three-class one (not abusive/abusive but not offensive/offensive). It also has its own colloquial Indonesian lexicon.
74 | 
75 | ## Speech recognition
76 | 
77 | 1. [TITML-IDN speech corpus](http://research.nii.ac.jp/src/en/TITML-IDN.html).
78 |    The corpus contains 20 speakers (11 male and 9 female), each of whom speaks 343 utterances.
79 |    The utterances are phonetically balanced.
80 |    The corpus itself is free to use for academic/non-commercial purposes, but interested parties should make a formal request via email to the institution.
81 |    The procedure is listed [here](http://research.nii.ac.jp/src/en/register.html).
82 | 1. [Indonesian Speech Recognition](https://github.com/frankydotid/Indonesian-Speech-Recognition).
83 |    A small corpus of 50 utterances by a single male speaker. Disclaimer: This is a school project; do not use it for any important tasks. The author is not responsible for any undesired results of using the data provided here.
84 | 1. [CMU Wilderness Multilingual Speech Dataset](https://github.com/festvox/datasets-CMU_Wilderness).
85 |    A dataset of over 700 different languages providing audio, aligned texts, and word pronunciations.
86 |    One of the languages is Indonesian. The utterances are from the Bible, recorded by [bible.is](https://bible.is).
87 | 
88 | ## Paraphrase identification
89 | 
90 | 1. [Translated PAWS](https://github.com/Wikidepia/indonesia_dataset/tree/master/paraphrase/PAWS).
91 |    This dataset is a translation of [PAWS](https://github.com/google-research-datasets/paws). The dataset is translated using Google Translate
92 |    and contains 100K human-labeled pairs that feature the importance of modeling structure, context, and word order information for the problem
93 |    of paraphrase identification.
94 | 


--------------------------------------------------------------------------------