├── .gitignore
├── LICENSE
├── README.md
├── collected.txt
├── img
│   └── logo.png
├── requirements.txt
├── src
│   ├── __init__.py
│   └── merge.py
└── stopwords_twitter.csv
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | pip-wheel-metadata/
24 | share/python-wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .nox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | *.py,cover
51 | .hypothesis/
52 | .pytest_cache/
53 |
54 | # Translations
55 | *.mo
56 | *.pot
57 |
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 |
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 |
68 | # Scrapy stuff:
69 | .scrapy
70 |
71 | # Sphinx documentation
72 | docs/_build/
73 |
74 | # PyBuilder
75 | target/
76 |
77 | # Jupyter Notebook
78 | .ipynb_checkpoints
79 |
80 | # IPython
81 | profile_default/
82 | ipython_config.py
83 |
84 | # pyenv
85 | .python-version
86 |
87 | # pipenv
88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
91 | # install all needed dependencies.
92 | #Pipfile.lock
93 |
94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
95 | __pypackages__/
96 |
97 | # Celery stuff
98 | celerybeat-schedule
99 | celerybeat.pid
100 |
101 | # SageMath parsed files
102 | *.sage.py
103 |
104 | # Environments
105 | .env
106 | .venv
107 | env/
108 | venv/
109 | ENV/
110 | env.bak/
111 | venv.bak/
112 |
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 |
117 | # Rope project settings
118 | .ropeproject
119 |
120 | # mkdocs documentation
121 | /site
122 |
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 |
128 | # Pyre type checker
129 | .pyre/
130 | .vscode
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Braincore.id
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # IndoTWEEST
4 |
5 | Indonesian Tweet Stopwords
6 |
7 |
8 |
9 |
10 |
11 | This project was initiated by Braincore.id as a contribution to building a stopword dataset for the social media platform Twitter, to make research that uses data from that platform easier and to advance Indonesian NLP.
12 |
13 | # Contributing
14 | The contribution procedure is described in the following document
15 |
16 | 1. First, clone this repository with the command `git clone https://github.com/Braincore-id/IndoTWEEST.git`
17 |
18 | 2. **If you already have a collection of stopwords to add, you can skip step 2**. Run the following [Colab](https://colab.research.google.com/drive/1bSHsso2kJfvbxH5Gs70RtrLQQdz13J6T?usp=sharing) notebook and use it as a reference for which *stopwords* to add
19 |
20 | 3. Once you have a collection of *stopwords*, put them in a **.txt** file in the following format
21 | ```
22 |
23 |
24 |
25 | ...
26 | ...
27 |
28 | ```
29 | 4. Run the command `python src/merge.py --collected_stopwords ` followed by the path to that **.txt** file. To see which *argparse* options are available, run `python src/merge.py --help`
30 |
31 | 5. Open a *pull request* so that your stopwords are added to the final stopword list
32 |
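Steps 3 and 4 can be sketched with the standard library alone. This is a minimal illustration, not the repository's script (which uses pandas): it assumes the **.txt** file holds one stopword per line, and the helper names `read_stopwords` and `merge_stopwords`, along with the sample words, are hypothetical and chosen only for this example.

```python
from pathlib import Path
import tempfile


def read_stopwords(path):
    """Read one stopword per line, trimming whitespace and skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def merge_stopwords(old_path, new_path):
    """Return the sorted union of both lists with duplicates removed."""
    return sorted(set(read_stopwords(old_path)) | set(read_stopwords(new_path)))


# Sample data written to a temporary directory (illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "stopwords_twitter.csv"
    new_file = Path(tmp) / "collected.txt"
    old_file.write_text("aja\nbesok\nudah\n", encoding="utf-8")
    new_file.write_text("gopay\nudah\n\nayo\n", encoding="utf-8")

    merged = merge_stopwords(old_file, new_file)
    # Write the merged list back, one stopword per line and no header row.
    old_file.write_text("\n".join(merged) + "\n", encoding="utf-8")
    print(merged)
```

Because the merge compares stopwords by value, an entry that already exists in the old list (`udah` here) appears only once in the result, wherever it sits in either file.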
33 |
34 | # Task List
35 | - [ ] *Support* csv and other file formats for adding stopwords
36 | - [ ] Automatic stopword counter in README.md
37 | - [ ] Automatic update of the *Last updated* field in README.md
38 |
39 | # Contributors
40 |
41 |
42 |
43 |
44 |
--------------------------------------------------------------------------------
/collected.txt:
--------------------------------------------------------------------------------
1 | copi
2 | copii
3 | copee
4 | gopei
5 | gipewey
6 | gopay
7 | dana
8 | ovo
9 | nderr
10 | see
11 | i
12 | gasih
13 | bosquee
14 | info
15 | gasi
16 | udah
17 | bokap
18 | nyokap
19 | monyet
20 | nyet
21 | monyeet
22 | bsk
23 | besok
24 | ayo
--------------------------------------------------------------------------------
/img/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Braincore-id/IndoTWEEST/69b28614464964ed31370983c20f18e313be61eb/img/logo.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Braincore-id/IndoTWEEST/69b28614464964ed31370983c20f18e313be61eb/requirements.txt
--------------------------------------------------------------------------------
/src/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Braincore-id/IndoTWEEST/69b28614464964ed31370983c20f18e313be61eb/src/__init__.py
--------------------------------------------------------------------------------
/src/merge.py:
--------------------------------------------------------------------------------
import argparse

import pandas as pd


def merge(old, new):
    """Merge newly collected stopwords into the existing stopword list."""
    # Both files hold one stopword per line with no header row.
    old_stopwords = pd.read_csv(old, names=["stopword"])
    new_stopwords = pd.read_csv(new, names=["stopword"])

    # Keep only the new stopwords whose value is not already in the old list.
    # Series.isin tests membership by value, whereas DataFrame.isin(DataFrame)
    # compares cells only where index and column labels match.
    is_known = new_stopwords["stopword"].isin(old_stopwords["stopword"])
    stopwords_diff = new_stopwords[~is_known]

    stopwords_full = pd.concat([old_stopwords, stopwords_diff])
    stopwords_full = stopwords_full.drop_duplicates(ignore_index=True)
    stopwords_full = stopwords_full.sort_values(by="stopword", ascending=True)
    stopwords_full = stopwords_full.reset_index(drop=True)

    # Write without a header row so the file can be read back with names=[...]
    # and the column name never ends up among the stopwords.
    stopwords_full.to_csv("stopwords_twitter.csv", index=False, header=False)


if __name__ == "__main__":
    # Both arguments are optional and fall back to files in the current directory.
    ap = argparse.ArgumentParser()
    ap.add_argument("--old_csv", type=str, default="./stopwords_twitter.csv",
                    help="CSV file that stores the existing stopwords")
    ap.add_argument("--collected_stopwords", type=str, default="./collected.txt",
                    help="Newly collected stopwords")
    args = vars(ap.parse_args())

    merge(args["old_csv"], args["collected_stopwords"])
--------------------------------------------------------------------------------
/stopwords_twitter.csv:
--------------------------------------------------------------------------------
1 | 33books
2 | a thread
3 | affh iyyh
4 | aja
5 | ajah
6 | ajg
7 | alter
8 | an
9 | anjay
10 | anjir
11 | anjrit
12 | arenabet88
13 | at
14 | au
15 | avail
16 | awokawokawok
17 | axistogel
18 | ayo
19 | bacot
20 | banget
21 | bangsat
22 | bapack
23 | bct
24 | begonoh
25 | berujung
26 | besok
27 | bestie
28 | bg
29 | bgt
30 | bikin
31 | Bio
32 | bokap
33 | bosku
34 | bosque
35 | bosquee
36 | bray
37 | bsk
38 | btw
39 | buruan
40 | casino
41 | cek
42 | chinajuli
43 | cina
44 | cm
45 | cmiiw
46 | cmn
47 | cok
48 | com
49 | copee
50 | copi
51 | copii
52 | cot
53 | cuman
54 | cuy
55 | dah
56 | dana
57 | dari
58 | dech
59 | deh
60 | dek
61 | depan
62 | deposit
63 | dg
64 | dgn
65 | di
66 | dia
67 | dih
68 | dm
69 | doang
70 | dom
71 | dom
72 | dpn
73 | dri
74 | drop
75 | eh
76 | elus
77 | emang
78 | emberan
79 | emng
80 | engga
81 | eror
82 | fafifu
83 | GA
84 | ga
85 | gais
86 | gaje
87 | gak
88 | gaksi
89 | gaksih
90 | gamau
91 | gan
92 | gapapa
93 | gasi
94 | gasih
95 | gasin
96 | gass
97 | gassin
98 | gausah
99 | ges
100 | gimana
101 | gini
102 | gipewey
103 | gitulho
104 | gituloh
105 | give away
106 | giveaway
107 | gk
108 | gopay
109 | gopei
110 | gpp
111 | gt
112 | gua
113 | gue
114 | guys
115 | gw
116 | hadeh
117 | haha
118 | hehe
119 | hiyahiya
120 | hm
121 | hmm
122 | hmmm
123 | hoki
124 | i
125 | info
126 | ingfo
127 | jaehyun
128 | jajan
129 | jbjb
130 | jejek
131 | jg
132 | juga
133 | jugaa
134 | kaa
135 | kagak
136 | kak
137 | kalo
138 | kamu nanyea
139 | kek
140 | kk
141 | kl
142 | knp
143 | kntl
144 | krn
145 | ktl
146 | kuy
147 | kyaa
148 | lagih
149 | lg
150 | link
151 | lo
152 | loe
153 | lu
154 | ma
155 | maaf
156 | maseh
157 | mazeeh
158 | Mention
159 | min
160 | mjb
161 | monyeet
162 | monyet
163 | moots
164 | mulu
165 | murah
166 | nahlo
167 | nahloh
168 | nak
169 | nder
170 | nderr
171 | netflix
172 | ngab
173 | ngabs
174 | ngak
175 | ngga
176 | nggak
177 | nggk
178 | ni
179 | nih
180 | ninuninu
181 | nitip
182 | no
183 | noh
184 | noob
185 | nu
186 | ny
187 | nya
188 | nyet
189 | nyimak
190 | nyokap
191 | oh
192 | ohh
193 | ok
194 | oke
195 | okee
196 | open po
197 | order
198 | oren
199 | outfit
200 | ovo
201 | oyen
202 | paansi
203 | pack
204 | padahal
205 | pansos
206 | paragames
207 | pdhal
208 | pdhl
209 | penjilat
210 | pinjol
211 | poker
212 | pp
213 | preloved
214 | premium
215 | qrt
216 | ready
217 | rekomendasi
218 | rezim
219 | rt
220 | sampe
221 | sealed
222 | see
223 | seh
224 | semalem
225 | sender
226 | shopee
227 | sih
228 | sis
229 | skrg
230 | skuy
231 | slebew
232 | slot
233 | smpe
234 | sohotogel
235 | spotify
236 | ss
237 | stopword
238 | stopword
239 | tags
240 | tai
241 | tapikan
242 | tau
243 | tbl
244 | tdk
245 | tele
246 | temen
247 | teros
248 | teross
249 | thank you
250 | thanks
251 | thx
252 | TIA
253 | tida
254 | tidaa
255 | tk
256 | tl
257 | TL
258 | togel
259 | togelchina
260 | tokped
261 | tolong
262 | tp
263 | tpi
264 | tu
265 | tuh
266 | tukan
267 | udah
268 | udh
269 | via
270 | vpn
271 | wa
272 | wasweswos
273 | wes
274 | wkwkwk
275 | wleowleo
276 | wts
277 | y
278 | ya
279 | yaelah
280 | yah
281 | yak
282 | yakali
283 | yakan
284 | yg
285 | ygy
286 | yoi
287 | yuk
288 |
--------------------------------------------------------------------------------