├── .gitignore
├── LICENSE
├── README.md
├── collected.txt
├── img
    └── logo.png
├── requirements.txt
├── src
    ├── __init__.py
    └── merge.py
└── stopwords_twitter.csv


/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
.vscode

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2022 Braincore.id

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
IndoTWEEST

Indonesian Tweet Stopwords

This project was initiated by Braincore.id as a contribution to building a stopword dataset for Twitter, to make research on data from that platform easier and to advance Indonesian NLP.

# Contributing
Contribution guidelines can be found in this document.

1. First, clone this repository with `git clone https://github.com/Braincore-id/IndoTWEEST.git`

2. **If you already have a set of stopwords to add, you can skip step 2.** Run this [Colab](https://colab.research.google.com/drive/1bSHsso2kJfvbxH5Gs70RtrLQQdz13J6T?usp=sharing) notebook as a reference for deciding which *stopwords* to include

3. Once you have your set of *stopwords*, put them in a **.txt** file, one stopword per line (as in `collected.txt`), in the following format
```
...
...
```
4. Run `python src/merge.py --collected_stopwords collected.txt`, replacing `collected.txt` with the path to your **.txt** file. To see every available argument, run `python src/merge.py --help`

5. Open a *pull request* so that your stopwords are added to the final stopword list


# Task List
- [ ] Support CSV and other file formats for adding stopwords
- [ ] Automatic stopword counter in README.md
- [ ] Automatic update of the *Last updated* field in README.md

# Contributors

--------------------------------------------------------------------------------
/collected.txt:
--------------------------------------------------------------------------------
copi
copii
copee
gopei
gipewey
gopay
dana
ovo
nderr
see
i
gasih
bosquee
info
gasi
udah
bokap
nyokap
monyet
nyet
monyeet
bsk
besok
ayo

--------------------------------------------------------------------------------
/img/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Braincore-id/IndoTWEEST/69b28614464964ed31370983c20f18e313be61eb/img/logo.png

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Braincore-id/IndoTWEEST/69b28614464964ed31370983c20f18e313be61eb/requirements.txt

--------------------------------------------------------------------------------
/src/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Braincore-id/IndoTWEEST/69b28614464964ed31370983c20f18e313be61eb/src/__init__.py

--------------------------------------------------------------------------------
/src/merge.py:
--------------------------------------------------------------------------------
import argparse

import pandas as pd


def merge(old, new):
    """Merge newly collected stopwords into the existing stopword CSV."""
    # Both files hold one stopword per line and have no header row.
    old_stopwords = pd.read_csv(old, names=["stopword"])
    new_stopwords = pd.read_csv(new, names=["stopword"])

    # Keep only the collected words that are not in the old list yet.
    # Compare the columns by value: DataFrame.isin(DataFrame) aligns on the
    # index, so it would compare row i with row i instead of testing membership.
    is_known = new_stopwords["stopword"].isin(old_stopwords["stopword"])
    stopwords_diff = new_stopwords[~is_known]

    stopwords_full = pd.concat([old_stopwords, stopwords_diff])
    stopwords_full = stopwords_full.dropna().drop_duplicates()
    stopwords_full = stopwords_full.sort_values(by="stopword", ascending=True)
    stopwords_full.reset_index(drop=True, inplace=True)

    # Write without a header: the file is read back with names=["stopword"],
    # so a header row would be re-read as a stopword on the next run.
    stopwords_full.to_csv("stopwords_twitter.csv", index=False, header=False)


if __name__ == "__main__":
    # Parse arguments only when run as a script, not on import.
    ap = argparse.ArgumentParser()
    ap.add_argument("--old_csv", type=str, help="CSV file that stores the existing stopwords", default="./stopwords_twitter.csv")
    # Not marked required, otherwise the default below could never apply.
    ap.add_argument("--collected_stopwords", type=str, help="Text file with the newly collected stopwords", default="./collected.txt")
    args = vars(ap.parse_args())
    merge(args["old_csv"], args["collected_stopwords"])

--------------------------------------------------------------------------------
/stopwords_twitter.csv:
--------------------------------------------------------------------------------
33books
a thread
affh iyyh
aja
ajah
ajg
alter
an
anjay
anjir
anjrit
arenabet88
at
au
avail
awokawokawok
axistogel
ayo
bacot
banget
bangsat
bapack
bct
begonoh
berujung
besok
bestie
bg
bgt
bikin
Bio
bokap
bosku
bosque
bosquee
bray
bsk
btw
buruan
casino
cek
chinajuli
cina
cm
cmiiw
cmn
cok
com
copee
copi
copii
cot
cuman
cuy
dah
dana
dari
dech
deh
dek
depan
deposit
dg
dgn
di
dia
dih
dm
doang
dom
dom
dpn
dri
drop
eh
elus
emang
emberan
emng
engga
eror
fafifu
GA
ga
gais
gaje
gak
gaksi
gaksih
gamau
gan
gapapa
gasi
gasih
gasin
gass
gassin
gausah
ges
gimana
gini
gipewey
gitulho
gituloh
give away
giveaway
gk
gopay
gopei
gpp
gt
gua
gue
guys
gw
hadeh
haha
hehe
hiyahiya
hm
hmm
hmmm
hoki
i
info
ingfo
jaehyun
jajan
jbjb
jejek
jg
juga
jugaa
kaa
kagak
kak
kalo
kamu nanyea
kek
kk
kl
knp
kntl
krn
ktl
kuy
kyaa
lagih
lg
link
lo
loe
lu
ma
maaf
maseh
mazeeh
Mention
min
mjb
monyeet
monyet
moots
mulu
murah
nahlo
nahloh
nak
nder
nderr
netflix
ngab
ngabs
ngak
ngga
nggak
nggk
ni
nih
ninuninu
nitip
no
noh
noob
nu
ny
nya
nyet
nyimak
nyokap
oh
ohh
ok
oke
okee
open po
order
oren
outfit
ovo
oyen
paansi
pack
padahal
pansos
paragames
pdhal
pdhl
penjilat
pinjol
poker
pp
preloved
premium
qrt
ready
rekomendasi
rezim
rt
sampe
sealed
see
seh
semalem
sender
shopee
sih
sis
skrg
skuy
slebew
slot
smpe
sohotogel
spotify
ss
stopword
stopword
tags
tai
tapikan
tau
tbl
tdk
tele
temen
teros
teross
thank you
thanks
thx
TIA
tida
tidaa
tk
tl
TL
togel
togelchina
tokped
tolong
tp
tpi
tu
tuh
tukan
udah
udh
via
vpn
wa
wasweswos
wes
wkwkwk
wleowleo
wts
y
ya
yaelah
yah
yak
yakali
yakan
yg
ygy
yoi
yuk

--------------------------------------------------------------------------------
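
The merge step in `src/merge.py` and the intended use of the resulting list can be sketched with the standard library alone. This is a minimal illustration, not code from the repository: the function names `merge_stopwords` and `remove_stopwords` are hypothetical, the sample words are a handful of entries copied from the two files above (plus a deliberate duplicate and a blank entry), and matching is assumed to be case-insensitive on whitespace-separated tokens.

```python
def merge_stopwords(old_words, new_words):
    """Return the sorted, de-duplicated union of two stopword lists."""
    # Hypothetical helper: strips surrounding whitespace and ignores blanks.
    cleaned = {w.strip() for w in list(old_words) + list(new_words)}
    cleaned.discard("")
    return sorted(cleaned)


def remove_stopwords(text, stopwords):
    """Drop whitespace-separated tokens that appear in the stopword list."""
    # Assumption: comparison is case-insensitive.
    lookup = {w.lower() for w in stopwords}
    return " ".join(t for t in text.split() if t.lower() not in lookup)


# A few entries from stopwords_twitter.csv.
old = ["aja", "banget", "gak"]
# Two entries from collected.txt, plus a duplicate and a blank for illustration.
new = ["gopay", "udah", "banget", ""]

merged = merge_stopwords(old, new)
print(merged)  # ['aja', 'banget', 'gak', 'gopay', 'udah']
print(remove_stopwords("udah makan banget gak enak", merged))  # makan enak
```

Token-by-token filtering like this does not catch multi-word entries such as `a thread` or `give away`; handling those would need phrase matching before the text is split.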