├── .gitattributes
├── .gitignore
├── .idea
│   ├── RNN-LM.iml
│   ├── inspectionProfiles
│   │   └── profiles_settings.xml
│   ├── misc.xml
│   ├── modules.xml
│   └── vcs.xml
├── LICENSE
├── README.md
├── data
│   ├── .ipynb_checkpoints
│   │   └── Data-Analysis-checkpoint.ipynb
│   ├── Data-Analysis.ipynb
│   └── train_labels.csv
├── main.py
├── model.py
├── package
│   ├── __pycache__
│   │   ├── config.cpython-37.pyc
│   │   ├── data_loader.cpython-37.pyc
│   │   ├── definition.cpython-37.pyc
│   │   ├── evaluator.cpython-37.pyc
│   │   ├── loss.cpython-37.pyc
│   │   ├── trainer.cpython-37.pyc
│   │   └── utils.cpython-37.pyc
│   ├── config.py
│   ├── data_loader.py
│   ├── definition.py
│   ├── evaluator.py
│   ├── loss.py
│   ├── trainer.py
│   └── utils.py
└── preprocess
    ├── .ipynb_checkpoints
    │   └── Preprocess-checkpoint.ipynb
    ├── KoNPron.ipynb
    ├── Preprocess.ipynb
    ├── konpron.py
    ├── special.csv
    └── train_labels.csv
/.gitattributes:
--------------------------------------------------------------------------------
1 | docs/* linguist-vendored
2 | sphinx/* linguist-vendored
3 | preprocess/* linguist-vendored
4 | data/* linguist-vendored
5 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.bin
2 | *.zip
3 | *.idea
4 | __pycache__/
5 | *.pyc
6 | .idea
--------------------------------------------------------------------------------
/.idea/RNN-LM.iml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
--------------------------------------------------------------------------------
/.idea/inspectionProfiles/profiles_settings.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
--------------------------------------------------------------------------------
/.idea/misc.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
--------------------------------------------------------------------------------
/.idea/modules.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 SooHwan Kim
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Character RNN Language Model
2 |
3 | ### Recurrent Neural Network Language Model in PyTorch
4 | [PyTorch](https://pytorch.org/)
5 |
6 | ## Intro
7 |
8 | This project implements a character-level recurrent neural network language model (RNN-LM) in [PyTorch](https://pytorch.org/).
9 | The language model can be combined with tasks such as speech recognition, machine translation, and image captioning.
10 | We appreciate any kind of feedback or contribution.
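
As a rough illustration of what "character-level" means here: the data-analysis notebook in this repository stores each sentence next to a space-separated string of integer ids, one id per character, and drops sequences longer than 150 tokens. The sketch below mimics that layout in plain Python. The helper names (`build_vocab`, `encode`, `filter_by_length`) and the order-of-appearance id assignment are assumptions made for illustration; the repository's actual character-to-id mapping is not shown here and will produce different ids.

```python
# Hypothetical helpers for illustration only; not the repository's own code.

def build_vocab(sentences):
    """Assign each distinct character an integer id, in order of first appearance."""
    char2id = {}
    for sentence in sentences:
        for ch in sentence:
            if ch not in char2id:
                char2id[ch] = len(char2id)
    return char2id


def encode(sentence, char2id):
    """Turn a sentence into a space-separated string of character ids."""
    return " ".join(str(char2id[ch]) for ch in sentence)


def filter_by_length(sentences, targets, max_len=150):
    """Keep only (sentence, target) pairs whose id sequence has at most max_len tokens."""
    kept = [(s, t) for s, t in zip(sentences, targets) if len(t.split()) <= max_len]
    return [s for s, _ in kept], [t for _, t in kept]


corpus = ["나는 좋은 선생님이야.", "머시"]
char2id = build_vocab(corpus)
targets = [encode(s, char2id) for s in corpus]
short_sentences, short_targets = filter_by_length(corpus, targets, max_len=5)
```

Spaces and punctuation receive ids like any other character, so the id sequence always has the same length as the sentence.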
11 |
12 |
13 |
14 | ## Installation
15 | Python 3.7 or higher is recommended.
16 | I recommend creating a new virtual environment for this project (using virtualenv or conda).
17 |
18 | ### Prerequisites
19 |
20 | * NumPy: `pip install numpy` (see [here](https://github.com/numpy/numpy) if you have problems installing NumPy).
21 | * Pandas: `pip install pandas` (see [here](https://github.com/pandas-dev/pandas) if you have problems installing Pandas).
22 | * PyTorch: see the [PyTorch website](http://pytorch.org/) to install the version that matches your environment.
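
After installing, a quick way to confirm that the prerequisites are visible to the active environment is to look them up without importing them. This is a minimal sketch using only the standard library; the function name `missing_packages` is an assumption made for illustration and is not part of this project.

```python
import importlib.util

# Import names of the prerequisites listed above.
REQUIRED = ("numpy", "pandas", "torch")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing packages:", missing_packages() or "none")
```

An empty result only means the packages can be located; it does not check their versions.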
23 |
24 |
25 | ## Troubleshooting and Contributing
26 |
27 | If you have any questions, bug reports, or feature requests, please [open an issue](https://github.com/sooftware/RNN-LM/issues) on GitHub
28 | or contact sh951011@gmail.com.
29 |
30 | I appreciate any kind of feedback or contribution. Feel free to proceed with small issues such as bug fixes and documentation improvements. For major contributions and new features, please discuss them with the collaborators in the corresponding issues.
31 |
32 | ### Code Style
33 |
34 | We follow PEP 8 for code style. The docstring style is especially important because it is used to generate documentation.
35 |
36 | ## Reference
37 |
38 | [[1] IBM pytorch-seq2seq](https://github.com/IBM/pytorch-seq2seq)
39 | [[2] Character-unit based End-to-End Korean Speech Recognition](https://github.com/sooftware/End-to-End-Korean-Speech-Recognition)
40 | [[3] 「An analysis of incorporating an external language model into a sequence-to-sequence model」 Paper](https://arxiv.org/abs/1712.01996)
41 |
42 | ## Author
43 |
44 | * Soohwan Kim [@sooftware](https://github.com/sooftware)
45 | * Contact: sh951011@gmail.com
46 |
--------------------------------------------------------------------------------
/data/.ipynb_checkpoints/Data-Analysis-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import pickle\n",
10 | "import pandas as pd\n",
11 | "import matplotlib.pyplot as plt\n",
12 | "import seaborn as sns"
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": 2,
18 | "metadata": {},
19 | "outputs": [
20 | {
21 | "data": {
22 | "text/html": [
23 | "
\n",
24 | "\n",
37 | "
\n",
38 | " \n",
39 | " \n",
40 | " \n",
41 | " ko \n",
42 | " id \n",
43 | " \n",
44 | " \n",
45 | " \n",
46 | " \n",
47 | " 0 \n",
48 | " 어디 보자... \n",
49 | " 8 190 0 42 45 1 1 1 \n",
50 | " \n",
51 | " \n",
52 | " 1 \n",
53 | " 칠대 왕국 종족 중에 오크는 처음 듣는다 \n",
54 | " 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n",
55 | " \n",
56 | " \n",
57 | " 2 \n",
58 | " 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n",
59 | " 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n",
60 | " \n",
61 | " \n",
62 | " 3 \n",
63 | " 라니스터 가의 \n",
64 | " 32 20 79 162 0 6 130 \n",
65 | " \n",
66 | " \n",
67 | " 4 \n",
68 | " 별 희한한 생각이 다 떠오르곤 하죠 \n",
69 | " 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... \n",
70 | " \n",
71 | " \n",
72 | " ... \n",
73 | " ... \n",
74 | " ... \n",
75 | " \n",
76 | " \n",
77 | " 2992115 \n",
78 | " 네 오늘 아침 범죄현장에 있었죠 \n",
79 | " 96 0 57 275 0 5 373 0 560 700 350 109 17 0 26 ... \n",
80 | " \n",
81 | " \n",
82 | " 2992116 \n",
83 | " 머시 \n",
84 | " 235 47 \n",
85 | " \n",
86 | " \n",
87 | " 2992117 \n",
88 | " 눈은 풀려 있었고 입에선 연신 침이 흘러 나왔다. \n",
89 | " 351 23 0 449 108 0 26 62 7 0 219 17 194 0 147 ... \n",
90 | " \n",
91 | " \n",
92 | " 2992118 \n",
93 | " 나는 좋은 선생님이야. \n",
94 | " 13 4 0 94 23 0 194 71 216 3 25 1 \n",
95 | " \n",
96 | " \n",
97 | " 2992119 \n",
98 | " 다만 취업 준비를 지원한다는 제도 성격을 고려해 유흥.도박.성인 용품 등 용도나 고... \n",
99 | " 15 46 0 331 207 0 245 122 55 0 10 82 27 15 4 0... \n",
100 | " \n",
101 | " \n",
102 | "
\n",
103 | "
2992120 rows × 2 columns
\n",
104 | "
"
105 | ],
106 | "text/plain": [
107 | " ko \\\n",
108 | "0 어디 보자... \n",
109 | "1 칠대 왕국 종족 중에 오크는 처음 듣는다 \n",
110 | "2 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n",
111 | "3 라니스터 가의 \n",
112 | "4 별 희한한 생각이 다 떠오르곤 하죠 \n",
113 | "... ... \n",
114 | "2992115 네 오늘 아침 범죄현장에 있었죠 \n",
115 | "2992116 머시 \n",
116 | "2992117 눈은 풀려 있었고 입에선 연신 침이 흘러 나왔다. \n",
117 | "2992118 나는 좋은 선생님이야. \n",
118 | "2992119 다만 취업 준비를 지원한다는 제도 성격을 고려해 유흥.도박.성인 용품 등 용도나 고... \n",
119 | "\n",
120 | " id \n",
121 | "0 8 190 0 42 45 1 1 1 \n",
122 | "1 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n",
123 | "2 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n",
124 | "3 32 20 79 162 0 6 130 \n",
125 | "4 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... \n",
126 | "... ... \n",
127 | "2992115 96 0 57 275 0 5 373 0 560 700 350 109 17 0 26 ... \n",
128 | "2992116 235 47 \n",
129 | "2992117 351 23 0 449 108 0 26 62 7 0 219 17 194 0 147 ... \n",
130 | "2992118 13 4 0 94 23 0 194 71 216 3 25 1 \n",
131 | "2992119 15 46 0 331 207 0 245 122 55 0 10 82 27 15 4 0... \n",
132 | "\n",
133 | "[2992120 rows x 2 columns]"
134 | ]
135 | },
136 | "execution_count": 2,
137 | "metadata": {},
138 | "output_type": "execute_result"
139 | }
140 | ],
141 | "source": [
142 | "corpus_df = None\n",
143 | "\n",
144 | "with open('corpus_df.bin', 'rb') as f:\n",
145 | " corpus_df = pickle.load(f)\n",
146 | " \n",
147 | "corpus_df"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 4,
153 | "metadata": {},
154 | "outputs": [],
155 | "source": [
156 | "targets = corpus_df['id']\n",
157 | "target_lengths = list()\n",
158 | "\n",
159 | "for target in targets:\n",
160 | " tokens = target.split()\n",
161 | " target_lengths.append(len(tokens))"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 5,
167 | "metadata": {},
168 | "outputs": [
169 | {
170 | "data": {
171 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAEaCAYAAAAv2I3rAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAcb0lEQVR4nO3de5QV1YHv8e8P2hdDIgpCDD5ag47iGI3gI6MxSjBB9GYymSQ3xoQmGjVzIxLHuVGjS2CUic6V+MBM4ttGHR3nJiZeIT6I4E0yiQ5E8QXGNsEHENEGVAIxt3HfP/Y+bVGc7j4N3b274fdZ66w+VbXPrl37nPPrql3V1QohYGZmPa9f7gaYmW2rHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2DokKUj6cifK16fXHNOd7eqtJE2U1JK7HUWSbpM0N3c7OtLZz1oN9c2XdFNX1dfVelUA95UPSW8g6bj0YX1H0pDSsu0kvdbVH+auJqlO0iRJj0t6W9Kbkp6QdJGkXXK3ry+SdEx63+u7uN5RkjZI+k1X1tsDPgv8Q+5GtKVXBbBtlj8AE0rz/hZYn6EtNZO0HTAbmA7cA4wBDgEuAo4CGrag7u27oo22kbOA7wP1kkbnbkytQgirQghv5W5HW/pUAEv6kqTH0p7SG5JmS9q/sLxy6PsFSf9H0jpJv5P0lVI9+0h6SNKfJL0s6RvlQxVJSyVdXHrdTZLmF6ZPSK9bldr0qKQjNmNddZKmSvp9KvespLNq7JabgTNK885M88v9t7ukuyWtkbQ+tWN0qczxkp5K7XhK0vFV6hmWjlZeT3uuv5R0bI3trTgHOAH4VAjhyhDCf4UQloYQ5oQQ/hvQWFhfg6Tn0t7+q5Iuk1RXWD5f0s2SLpW0AliW5i+VND29b2+lz8wVkvoVXlvL+3xM2sa302ORpE91ZmPTHuRDktamfvuRpL0Ly6dKapL0N5KWSPqjpHmSPlSq5xRJL6b35z8lnZw+88ekvd6fp6K/T/Pnl15/pqSXUn/8RNJuNbT9fcApwA3A3cTPV7lMkPQ/JN2e+ugVSd8qlWn3+1ulzkZJD1WZP0/Sben5HpJ+mOpbr/h9/5+FsuXv2ha/l12pTwUwsANwKXAY8cu7AZitTfd4LgduBz5M3Lu6VdJ+AJIE3AvsDBwLfBo4CfjIZrRnIPA94h7bXwMvAA9IGtzJdd1EPFQ6CzgQ+CfgCkmn19CGu4HdlcZb0xf248AtxUKpLT8GDgBOBo4AXgMeVhrCkPRB4H5gIbGPzwOuKdWzEzAPeB9wYtqWOameA2tob8VXgEdCCL+qtjCEsDqt76S0LbcDB6c2fQOYUnrJF4DdgE8Q96YrJgHLgcOBc4GzgW/W2khJ/YH7gMeIfXIYMBVY14k6RgKPAr8CRqf2bSD22Y6ForsDfw+cSvw8DaLwPkoaBdwJ3EU8WvgX4OrC618B/iY9PyLV99nC8sOB44mfwXHAocCVNWzCqcALIYSngNuAUyQNrFJuCvB/U73/i/gZLv4Cr/X7W/EDYKykfSozCp/vG9OsfyV+v8YSvzunA69Wq6wr3ssuF0LoNQ/imzu3E+V3BQJwdJquT9P/UChTB6wFzkrTJ6QyI0r1rANuKsxbClxcWt9NwPx22tMPWA2cWuu6gH2Ad4EDSnVdAjzZzrqOS3XvQfwQNqb5lwP3pecB+HJ6/ok0PbJQxw7ACuCSNH0Z8BJQVyhzcqmeicQPeF2pPY8AV5feh2Paaf864Noa3uOfA/eU5k0mDrFsn6bnA78F+pXKLQV+Xpr3z8Crtb7PwC5pW47rxOdyItBS+lzfXSqzQ+qDz6TpqUALsFuhzBfTZ2PHNH1nle35erGvgWPSdH2V79brwA6FeRcAK2rYnt8AkwvTzwJnlsqE8vsJLAG+U+v3t/yZTdNPAZcVpr8DPFuYXgRMbWcd83nvu9bp97K7H31q
D1jSoZLuVTxUfxt4OS3au1T0ycqTEEILcU9vWJo1EngjhNBUKLMKeH4z2rNPOuRqkvQW8Bbxt3GlPbWsazQgYEE6PF0raS3wbWC/GptyPfD5dDg5kff2DooOAppDCM8V2vIOcW/goEJ7H099VvGLUj2HAx8A1pTa+7FOtBfiNtdyJ6iDiHtVRY8COwLFw/OFIYR3q7y+vIf9S2C4pPfX0sgQ98RvAh6U9FNJF0j6y1peW3A48Lel/mpO21Dss+UhhNcL08uI/TQ0TY8Efl2qu+oRRBsWp/e8WP+wtgoDKA6pHQz8W2F2I1WGISh876rV34nvb9H1wFcl9VccdprIxp/vq4Fvp6GNK9TOUFgXvZddqq7jIr2DpAHAQ8RAOI148gnib+PyIcyfS9OBjYdbavniv0v88BdtV5q+H3iDeEj8SlrvL0rt6WhdlXb9NZseCtV0q7oQwiJJzxAPTVuIQwJVi1aZVwzCaqFYnu4HLCae6CvrzKHc87wX/B0pt0FV5v+xxrrK72mH73MI4QxJ1wCfJB7VXCrp7BDC9TWusx9xCOXyKsuaC8+rfW4rry/P2xzV6i9ve9mZxJxYEUexIL2mn6TDQgjFqyLa/N518vtbdDtwBXHYpB9xL3ZW6wpCuFXSA8QhleOBn0q6N4RQ9eqfLngvu1Rf2gM+kDjGd1EIYV4IYTHxzejoA1T2HLCbpBGVGYqXPJVPBqwEPlia1zp2m8Z5RwKXhxAeTHuWf+K9vZVa17Uw/dwrhNBUerzYie26njjMcEsIYUOV5c8CQ9J4ZKUtOxDHCp8tlDkyjZVVlK/lXQDsC7xVpb3LO9HeO4Axkj5abaHeuwztWeKYX9GxxCGI39WwnqNK0x8l7mlWzoy3+z5XhBCeCSF8N4RwIvEEZ7U9wLYsIJ6PeLFKn63uRD3PpfYXlbevEoL92ULpKOGLxB2MQwuPQ4jnATrTB5v1/U3v093EE81nAD9MR5HFMitCCLeGECYQx4BPbe8IZwvfyy7VGwN4YDpUKT4OII5NvgNMkvQhSZ8gniDq7B7BXOK40SxJh0s6hPhbtqVU11zgv0v6pKS/lHQVGx8qrSaOqZ0haf8UJHex8eVfHa4rDU/cAtwo6SuSRkg6RNJpks7vxHbdRvyAX9rG8keAx4F/k3S0pL8i7knsSLy8iPRzN+AGSQemPp5equdO4PfEkyefVLzy5EhJF0r6TCfaew3wM+Lh4D9KGi1pb0njJP2Y9y6t+w7wd+lwcX9JXyCOl84IIZT3uKo5VPEKg/0lfYk4fnxVYXm773N6P65QPHu+d3qfP0YMw1r9MzGA7pB0RBq6Ol7SNZL27UQ93wWOlvRPaXs+TTwpCe99dl8i7tWPlzRU0s6dqL/sy6neW1NotT6Iv0C/JOkvaqxrS76/1xNP+H6KeCVGK0nXSRqf6jyIeNLxFeDtciVd9F52rdyD0MUHMURClceStPxzxCsN/gQ8QdwzagEmpuX1VDn5AzRRGKgnnvh6ONXzCvE3/OPAzEKZ9xHDcjVxL2kqpZNwaf2LUj3PA3+3mevqD3yLeNLiz8RhjUeBz7fTV8elbd2jnTLlExq7E/cm1hB/UTwKjC695hPA08QvyzPEM/blegYTw3pZau8y4tUeH2nvfajSvjpiIC4gDiG8ld7XbwODCuUaiMMelXVNZ+MThfMpnEAtzF+ayt6a6l5FPDvfv9b3OfXZj4gnHt8hXlFxI7BzO9s1kcJJuDTvYOAnaT3r0+fkBmDXtHwq0FR6zSYn1IiXg72Y2vIr4tUfARhVKPOt1E8bCttxG6UT3KSAbWc7ngTuamPZLun9+Fq1z1qaNxe4rTDd7ve3rXrS/CeA56vM/x7xBOx64nDObOCgap+NzXkvu/uh1LBtmuJ1jq8Sz4bP3FrWta2TtJT45bssd1u6i6QJxF8wg0MIa3K3pzukk28vAd8NIczI3Z6u1GdOwnWldOjWQtyrGkq8fjEQrxnus+uyrZ+kfySOv64i
Xl1xBfAfW2P4Kv7BzFDi9fEDiUcmW5VtMoCBAcTrbOuJh74LiYfLr/XxddnW78PEcd9diUNad7DpH6VsLfYinm9YAXw1hPBm5vZ0OQ9BmJll0huvgjAz2yZ0aghiyJAhob6+vpuaYma2dVq4cOEbIYRNbnzUqQCur69nwYIFXdcqM7NtgKSXqs33EISZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmXTqf8J1p5kzZ9LU1NRhuWXLlgEwfPjwbm3PiBEjmDRpUreuw8y2bb0mgJuamnjymcVsGLBru+X6r3sTgD+8031N779uVbfVbWZW0WsCGGDDgF1Zf8D4dsvstGQOQIfltkRlHWZm3cljwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpn0SADPnDmTmTNn9sSqrMR9b9Z71fXESpqamnpiNVaF+96s9/IQhJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy6QudwOs+y1atIjjjjsudzM2MXToUFauXLnRvP79+zN48GBWrlzJoEGDWLNmDXV1dQwbNozly5dz5ZVXMmrUKACam5s599xzefnllwGYMWMGo0aNorm5mWnTpnHOOecwY8YMJHHaaadxySWXMHz4cC6//HIApk2bxpQpUxg8ePBGbWhubub8889n+fLlXHrppTQ2Nm5SrriOa6+9ttPL21J+XflnrfX0BpVt6UttrqY7t8N7wJZNOXwBNmzY0Dp/zZo1ALS0tLBs2TJCCEyZMqW1bGNjY2v4Aq3LGhsbefrpp7nssstYvHgxzz33HFOnTmXdunW88MILzJo1q7XMrFmzNmlDY2MjTU1NrFu3jilTplQtV1zH5ixvS/l15Z+11tMbtNfHfUl3bocDeCu3aNGi3E3oUmvXrmXhwoU0Nzcze/bsTZbNmzePBx54gBACS5cu3WhZxezZs1vLPPDAAzQ3N7cua25uZs6cORu9rlyuubl5o3V0dnlbqr2u/LOWenqD4rb0lTZX093b0SNDEMuWLWP9+vVMnjy5zTJNTU30+3PoieZ0qN+f3qKp6e1222v5TJkyhTFjxrBhw4ZNlk2fPr3D17e0tCAJiHvcs2bN4txzzwXi3k5LS8smrymWa2xs5N13393s5W2p9rr22tGbFbelr7S5mu7ejg73gCWdKWmBpAWvv/56l63YbHOtXbuWuXPnVl3W0tJSNUDLQgit5R9++OHW+e3VWyk3d+7cTdbRmeVtqfa69trRmxW3pa+0uZru3o4O94BDCDcANwCMHj16s3ZRhw8fDsA111zTZpnJkyez8HevbU71Xe7dHd/PiH2HtdvevqI3nnzbUgMHDmTMmDHcd999myyrq4sf6Y6CTBIhBOrq6jjhhBNa548dO7bNeivlxo4dy5w5czZaR2eWt6Xa69prR29W3Ja+0uZquns7PAZsfc60adNoaGig
f//+myy76KKL6Nev/Y91XV0d2223HRCvupgwYULrsoaGhtYQLyqWa2ho2GQdnVnelmqva68dvVlxW/pKm6vp7u1wAG/lDjnkkNxN6FIDBw5k1KhRDB48mJNOOmmTZccffzzjxo1DEvX19RstqzjppJNay4wbN26jS4sGDx7M+PHjN3pdudzgwYM3Wkdnl7el2uvKP2uppzcobktfaXM13b0dDmDLZujQoZvM69+/f+v8QYMGAXGPdfjw4Uhi2rRprWUbGhrYa6+9WqcryxoaGjj44IO5+OKLOfDAAxk5ciRTp05lwIAB7LfffkyYMKG1TLU9moaGBkaMGMGAAQOYNm1a1XLFdWzO8raUX1f+2Zf2JNvr476kO7dDlZMRtRg9enRYsGBBp1dSuZqgljHg9QeMb7MMwE5L4iVCHZXbEjstmcOorWQMuJa+N7PuJWlhCGF0eb73gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSZ1PbGSESNG9MRqrAr3vVnv1SMBPGnSpJ5YjVXhvjfrvTwEYWaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8ukLncDivqvW8VOS+Z0UKYZoMNyW9oOGNZt9ZuZQS8K4BEjRtRUbtmyFgCGD+/OgBxWc3vMzDZXrwngSZMm5W6CmVmP8hiwmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMlEIofbC0uvAS5u5riHAG5v52m2B+6dt7pv2uX/a1lv6Zu8Qwm7lmZ0K4C0haUEIYXSPrKwPcv+0zX3TPvdP23p733gIwswsEwewmVkmPRnAN/Tguvoi90/b3Dftc/+0rVf3TY+NAZuZ2cY8BGFmlokD2Mwsk24PYEnjJD0vqUnSBd29vt5I0i2SVkp6pjBvV0kPS3oh/dwlzZeka1N/PSXpsHwt736S9pQ0T9JiSc9Kmpzmu38ASTtKelzSotQ/09L8fSQ9lvrn3yVtn+bvkKab0vL6nO3vCZL6S3pC0v1pus/0TbcGsKT+wPeAE4GRwCmSRnbnOnup24BxpXkXAD8LIewH/CxNQ+yr/dLjTOD7PdTGXFqA80IIBwJHAd9InxH3T/QOMCaEcAhwKDBO0lHAFcBVqX9WA6en8qcDq0MII4CrUrmt3WRgcWG67/RNCKHbHsBHgQcL0xcCF3bnOnvrA6gHnilMPw/snp7vDjyfnl8PnFKt3LbwAH4CnOD+qdo3A4DfAEcS/7qrLs1v/Z4BDwIfTc/rUjnlbns39skexF/QY4D7AfWlvunuIYjhwCuF6VfTPINhIYQVAOnn0DR/m+2zdEj4EeAx3D+t0iH2k8BK4GHgRWBNCKElFSn2QWv/pOVvAoN7tsU96mrgW8C7aXowfahvujuAVWWer3tr3zbZZ5IGAj8EvhlCeKu9olXmbdX9E0LYEEI4lLi3dwRwYLVi6ec20z+STgZW
hhAWFmdXKdpr+6a7A/hVYM/C9B7A8m5eZ1/xmqTdAdLPlWn+NtdnkrYjhu+dIYQfpdnun5IQwhpgPnGsfJCkurSo2Aet/ZOW7wys6tmW9pijgU9LWgrcTRyGuJo+1DfdHcD/BeyXzkpuD3wRuK+b19lX3Ac0pOcNxLHPyvwJ6Wz/UcCblUPxrZEkATcDi0MI3y0scv8AknaTNCg93wkYSzzhNA/4XCpW7p9Kv30OeCSkQc+tTQjhwhDCHiGEemK2PBJCOJW+1Dc9MEg+HvgtcdzqotyD9jkewF3ACuD/EX8Ln04ce/oZ8EL6uWsqK+KVIy8CTwOjc7e/m/vmGOJh4FPAk+kx3v3T2j8fBp5I/fMMcEmavy/wONAE/AewQ5q/Y5puSsv3zb0NPdRPxwH397W+8Z8im5ll4r+EMzPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmGUkKkkbkbofl4QDehkk6RtJ/SnpT0ipJv5R0eO529YQcwSdpvqSv9eQ6rXer67iIbY0kvZ9496i/B+4Btgc+Rrz9oZn1AO8Bb7v2Bwgh3BXizV7WhxAeCiE8VSkg6bR0o/TVkh6UtHdh2QmSlqS95+skPVrZu5M0VdIdhbL1aY+zLk3vLOlmSSskLZN0Wbp3NJImSvqFpCvTen8v6cRCXbtKulXS8rT8x4VlJ0t6UtKatGf/4c3pmA62O0j6errZ92pJ30t/Tl25a9kMSW+kdp9d2W5J04m/4K6TtFbSdYVVjq1Wn239HMDbrt8CGyQ1SjpR6T9OVEj6DPBt4LPAbsDPiX9SjaQhxJvnXAwMIf5Z8NGdWHcj8UbsI4i3n/wkUDw0P5J4n98hwL8ANxdC6XbifXEPIt6i8qrUpsOAW4CziH/GfD1wn6QdOtGudre74GTgcOAQ4AvAp9L8M4g3jD8UOAz4TOUFIYSLUl1nhxAGhhDOrqE+29rl/ltoP/I9iLc1vI14f4oW4s1KhqVlPwVOL5TtB6wD9gYmAL8uLFOq42tpeipwR2F5PfF+D3XAMOIwx06F5acA89LziUBTYdmA9NoPEG/M/i6wS5Vt+T5waWne88DH29j2AIyoMr/N7S687pjC8nuAC9LzR4CzCsvGVrY7Tc+v9FGpHVXr82Prf3gPeBsWQlgcQpgYQtgD+Cvgg8Tb+UEM2mvS4fwa4m37RLyp9Qcp3BQ9xOR4hdrsDWwHrCjUfT3v3XAd4A+FutelpwOJtxJcFUJY3Ua951XqTPXumdraGe1t9ybtI4bzwPR8o36h9j5pqz7byvkknAEQQlgi6TbiITzE8JgeQrizXFbSfhTuyZuGB4r36P0jcc+14gOF568Q94CHhPf+a0GtXgF2lTQoxHvjlpdNDyFM72Sd1dZRdbtrsIJ4/9mKPUvLfecr24j3gLdRkg6QdJ6kPdL0nsShgF+nIj8ALpR0UFq+s6TPp2WzgYMkfTadWDuHjUP2SeBYSXtJ2pn4vwCB1n8v9BAwQ9L7JfWT9CFJH++ozem1PwX+VdIukraTdGxafCPwdUlHKvoLSSdJel87VW6v+F+HK4/+HWx3R+4BJksarngP3/NLy18j3irRDHAAb8veJp7sekzSH4nB+wxwHkAI4V7if429W9JbadmJadkbwOeBy4Fm4n8o/mWl4hDCw8C/E+9hu5B4uVvRBOJlb88R/2vt/yaO79biK8T7Ki8h/peMb6Z1LiCeBLsu1dlEHE9uz7PA+sLjq+1tdw1uJP5yeYp4D985xLH1DWn5NcDn0tUO19ZYp23FfD9g6xKS5hNPvN2Uuy29Rbp87gchhL07LGzbJO8Bm3URSTtJGp+u+x0OTAHuzd0u670cwGZdR8A04hDIE8T/3XZJ1hZZr+YhCDOzTLwHbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZ/H++yD/HUsXeOAAAAABJRU5ErkJggg==\n",
172 | "text/plain": [
173 | ""
174 | ]
175 | },
176 | "metadata": {
177 | "needs_background": "light"
178 | },
179 | "output_type": "display_data"
180 | }
181 | ],
182 | "source": [
183 | "sns.boxplot(target_lengths)\n",
184 | "plt.title('Language Model Corpus length Analysis', fontsize='x-large')\n",
185 | "plt.xlabel('Sequence Length', fontsize='large')\n",
186 | "plt.show()"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 6,
192 | "metadata": {},
193 | "outputs": [],
194 | "source": [
195 | "sentences = corpus_df['ko']\n",
196 | "\n",
197 | "new_sentences = list()\n",
198 | "new_targets = list()\n",
199 | "\n",
200 | "for (sentence, target) in zip(sentences, targets):\n",
201 | " if len(target.split()) < 151:\n",
202 | " new_sentences.append(sentence)\n",
203 | " new_targets.append(target)"
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": 8,
209 | "metadata": {},
210 | "outputs": [
211 | {
212 | "name": "stdout",
213 | "output_type": "stream",
214 | "text": [
215 | "2991682\n",
216 | "2991682\n"
217 | ]
218 | }
219 | ],
220 | "source": [
221 | "print(len(new_sentences))\n",
222 | "print(len(new_targets))"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": 11,
228 | "metadata": {},
229 | "outputs": [
230 | {
231 | "data": {
232 | "text/html": [
233 | "\n",
234 | "\n",
247 | "
\n",
248 | " \n",
249 | " \n",
250 | " \n",
251 | " ko \n",
252 | " id \n",
253 | " \n",
254 | " \n",
255 | " \n",
256 | " \n",
257 | " 0 \n",
258 | " 어디 보자... \n",
259 | " 8 190 0 42 45 1 1 1 \n",
260 | " \n",
261 | " \n",
262 | " 1 \n",
263 | " 칠대 왕국 종족 중에 오크는 처음 듣는다 \n",
264 | " 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n",
265 | " \n",
266 | " \n",
267 | " 2 \n",
268 | " 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n",
269 | " 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n",
270 | " \n",
271 | " \n",
272 | " 3 \n",
273 | " 라니스터 가의 \n",
274 | " 32 20 79 162 0 6 130 \n",
275 | " \n",
276 | " \n",
277 | " 4 \n",
278 | " 별 희한한 생각이 다 떠오르곤 하죠 \n",
279 | " 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... \n",
280 | " \n",
281 | " \n",
282 | "
\n",
283 | "
"
284 | ],
285 | "text/plain": [
286 | " ko \\\n",
287 | "0 어디 보자... \n",
288 | "1 칠대 왕국 종족 중에 오크는 처음 듣는다 \n",
289 | "2 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n",
290 | "3 라니스터 가의 \n",
291 | "4 별 희한한 생각이 다 떠오르곤 하죠 \n",
292 | "\n",
293 | " id \n",
294 | "0 8 190 0 42 45 1 1 1 \n",
295 | "1 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n",
296 | "2 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n",
297 | "3 32 20 79 162 0 6 130 \n",
298 | "4 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... "
299 | ]
300 | },
301 | "execution_count": 11,
302 | "metadata": {},
303 | "output_type": "execute_result"
304 | }
305 | ],
306 | "source": [
307 | "corpus_dict = {'ko' : new_sentences,\n",
308 | " 'id' : new_targets}\n",
309 | "corpus_df = pd.DataFrame(corpus_dict)\n",
310 | "corpus_df.head()"
311 | ]
312 | },
313 | {
314 | "cell_type": "code",
315 | "execution_count": 12,
316 | "metadata": {},
317 | "outputs": [],
318 | "source": [
319 | "with open('corpus_df.bin', 'wb') as f:\n",
320 | " pickle.dump(corpus_df, f)"
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {},
327 | "outputs": [],
328 | "source": []
329 | }
330 | ],
331 | "metadata": {
332 | "kernelspec": {
333 | "display_name": "Python 3",
334 | "language": "python",
335 | "name": "python3"
336 | },
337 | "language_info": {
338 | "codemirror_mode": {
339 | "name": "ipython",
340 | "version": 3
341 | },
342 | "file_extension": ".py",
343 | "mimetype": "text/x-python",
344 | "name": "python",
345 | "nbconvert_exporter": "python",
346 | "pygments_lexer": "ipython3",
347 | "version": "3.7.6"
348 | }
349 | },
350 | "nbformat": 4,
351 | "nbformat_minor": 4
352 | }
353 |
--------------------------------------------------------------------------------
/data/Data-Analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import pickle\n",
10 | "import pandas as pd\n",
11 | "import matplotlib.pyplot as plt\n",
12 | "import seaborn as sns"
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": 2,
18 | "metadata": {},
19 | "outputs": [
20 | {
21 | "data": {
22 | "text/html": [
23 | "\n",
24 | "\n",
37 | "
\n",
38 | " \n",
39 | " \n",
40 | " \n",
41 | " ko \n",
42 | " id \n",
43 | " \n",
44 | " \n",
45 | " \n",
46 | " \n",
47 | " 0 \n",
48 | " 어디 보자... \n",
49 | " 8 190 0 42 45 1 1 1 \n",
50 | " \n",
51 | " \n",
52 | " 1 \n",
53 | " 칠대 왕국 종족 중에 오크는 처음 듣는다 \n",
54 | " 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n",
55 | " \n",
56 | " \n",
57 | " 2 \n",
58 | " 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n",
59 | " 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n",
60 | " \n",
61 | " \n",
62 | " 3 \n",
63 | " 라니스터 가의 \n",
64 | " 32 20 79 162 0 6 130 \n",
65 | " \n",
66 | " \n",
67 | " 4 \n",
68 | " 별 희한한 생각이 다 떠오르곤 하죠 \n",
69 | " 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... \n",
70 | " \n",
71 | " \n",
72 | " ... \n",
73 | " ... \n",
74 | " ... \n",
75 | " \n",
76 | " \n",
77 | " 2992115 \n",
78 | " 네 오늘 아침 범죄현장에 있었죠 \n",
79 | " 96 0 57 275 0 5 373 0 560 700 350 109 17 0 26 ... \n",
80 | " \n",
81 | " \n",
82 | " 2992116 \n",
83 | " 머시 \n",
84 | " 235 47 \n",
85 | " \n",
86 | " \n",
87 | " 2992117 \n",
88 | " 눈은 풀려 있었고 입에선 연신 침이 흘러 나왔다. \n",
89 | " 351 23 0 449 108 0 26 62 7 0 219 17 194 0 147 ... \n",
90 | " \n",
91 | " \n",
92 | " 2992118 \n",
93 | " 나는 좋은 선생님이야. \n",
94 | " 13 4 0 94 23 0 194 71 216 3 25 1 \n",
95 | " \n",
96 | " \n",
97 | " 2992119 \n",
98 | " 다만 취업 준비를 지원한다는 제도 성격을 고려해 유흥.도박.성인 용품 등 용도나 고... \n",
99 | " 15 46 0 331 207 0 245 122 55 0 10 82 27 15 4 0... \n",
100 | " \n",
101 | " \n",
102 | "
\n",
103 | "
2992120 rows × 2 columns
\n",
104 | "
"
105 | ],
106 | "text/plain": [
107 | " ko \\\n",
108 | "0 어디 보자... \n",
109 | "1 칠대 왕국 종족 중에 오크는 처음 듣는다 \n",
110 | "2 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n",
111 | "3 라니스터 가의 \n",
112 | "4 별 희한한 생각이 다 떠오르곤 하죠 \n",
113 | "... ... \n",
114 | "2992115 네 오늘 아침 범죄현장에 있었죠 \n",
115 | "2992116 머시 \n",
116 | "2992117 눈은 풀려 있었고 입에선 연신 침이 흘러 나왔다. \n",
117 | "2992118 나는 좋은 선생님이야. \n",
118 | "2992119 다만 취업 준비를 지원한다는 제도 성격을 고려해 유흥.도박.성인 용품 등 용도나 고... \n",
119 | "\n",
120 | " id \n",
121 | "0 8 190 0 42 45 1 1 1 \n",
122 | "1 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n",
123 | "2 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n",
124 | "3 32 20 79 162 0 6 130 \n",
125 | "4 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... \n",
126 | "... ... \n",
127 | "2992115 96 0 57 275 0 5 373 0 560 700 350 109 17 0 26 ... \n",
128 | "2992116 235 47 \n",
129 | "2992117 351 23 0 449 108 0 26 62 7 0 219 17 194 0 147 ... \n",
130 | "2992118 13 4 0 94 23 0 194 71 216 3 25 1 \n",
131 | "2992119 15 46 0 331 207 0 245 122 55 0 10 82 27 15 4 0... \n",
132 | "\n",
133 | "[2992120 rows x 2 columns]"
134 | ]
135 | },
136 | "execution_count": 2,
137 | "metadata": {},
138 | "output_type": "execute_result"
139 | }
140 | ],
141 | "source": [
142 | "corpus_df = None\n",
143 | "\n",
144 | "with open('corpus_df.bin', 'rb') as f:\n",
145 | " corpus_df = pickle.load(f)\n",
146 | " \n",
147 | "corpus_df"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 4,
153 | "metadata": {},
154 | "outputs": [],
155 | "source": [
156 | "targets = corpus_df['id']\n",
157 | "target_lengths = list()\n",
158 | "\n",
159 | "for target in targets:\n",
160 | " tokens = target.split()\n",
161 | " target_lengths.append(len(tokens))"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 5,
167 | "metadata": {},
168 | "outputs": [
169 | {
170 | "data": {
171 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAEaCAYAAAAv2I3rAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAcb0lEQVR4nO3de5QV1YHv8e8P2hdDIgpCDD5ag47iGI3gI6MxSjBB9GYymSQ3xoQmGjVzIxLHuVGjS2CUic6V+MBM4ttGHR3nJiZeIT6I4E0yiQ5E8QXGNsEHENEGVAIxt3HfP/Y+bVGc7j4N3b274fdZ66w+VbXPrl37nPPrql3V1QohYGZmPa9f7gaYmW2rHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2DokKUj6cifK16fXHNOd7eqtJE2U1JK7HUWSbpM0N3c7OtLZz1oN9c2XdFNX1dfVelUA95UPSW8g6bj0YX1H0pDSsu0kvdbVH+auJqlO0iRJj0t6W9Kbkp6QdJGkXXK3ry+SdEx63+u7uN5RkjZI+k1X1tsDPgv8Q+5GtKVXBbBtlj8AE0rz/hZYn6EtNZO0HTAbmA7cA4wBDgEuAo4CGrag7u27oo22kbOA7wP1kkbnbkytQgirQghv5W5HW/pUAEv6kqTH0p7SG5JmS9q/sLxy6PsFSf9H0jpJv5P0lVI9+0h6SNKfJL0s6RvlQxVJSyVdXHrdTZLmF6ZPSK9bldr0qKQjNmNddZKmSvp9KvespLNq7JabgTNK885M88v9t7ukuyWtkbQ+tWN0qczxkp5K7XhK0vFV6hmWjlZeT3uuv5R0bI3trTgHOAH4VAjhyhDCf4UQloYQ5oQQ/hvQWFhfg6Tn0t7+q5Iuk1RXWD5f0s2SLpW0AliW5i+VND29b2+lz8wVkvoVXlvL+3xM2sa302ORpE91ZmPTHuRDktamfvuRpL0Ly6dKapL0N5KWSPqjpHmSPlSq5xRJL6b35z8lnZw+88ekvd6fp6K/T/Pnl15/pqSXUn/8RNJuNbT9fcApwA3A3cTPV7lMkPQ/JN2e+ugVSd8qlWn3+1ulzkZJD1WZP0/Sben5HpJ+mOpbr/h9/5+FsuXv2ha/l12pTwUwsANwKXAY8cu7AZitTfd4LgduBz5M3Lu6VdJ+AJIE3AvsDBwLfBo4CfjIZrRnIPA94h7bXwMvAA9IGtzJdd1EPFQ6CzgQ+CfgCkmn19CGu4HdlcZb0xf248AtxUKpLT8GDgBOBo4AXgMeVhrCkPRB4H5gIbGPzwOuKdWzEzAPeB9wYtqWOameA2tob8VXgEdCCL+qtjCEsDqt76S0LbcDB6c2fQOYUnrJF4DdgE8Q96YrJgHLgcOBc4GzgW/W2khJ/YH7gMeIfXIYMBVY14k6RgKPAr8CRqf2bSD22Y6ForsDfw+cSvw8DaLwPkoaBdwJ3EU8WvgX4OrC618B/iY9PyLV99nC8sOB44mfwXHAocCVNWzCqcALIYSngNuAUyQNrFJuCvB/U73/i/gZLv4Cr/X7W/EDYKykfSozCp/vG9OsfyV+v8YSvzunA69Wq6wr3ssuF0LoNQ/imzu3E+V3BQJwdJquT9P/UChTB6wFzkrTJ6QyI0r1rANuKsxbClxcWt9NwPx22tMPWA2cWuu6gH2Ad4EDSnVdAjzZzrqOS3XvQfwQNqb5lwP3pecB+HJ6/ok0PbJQxw7ACuCSNH0Z8BJQVyhzcqmeicQPeF2pPY8AV5feh2Paaf864Noa3uOfA/eU5k0mDrFsn6bnA78F+pXKLQV+Xpr3z8Crtb7PwC5pW47rxOdyItBS+lzfXSqzQ+qDz6TpqUALsFuhzBfTZ2PHNH1nle35erGvgWPSdH2V79brwA6FeRcAK2rYnt8AkwvTzwJnlsqE8vsJLAG+U+v3t/yZTdNPAZcVpr8DPFuYXgRMbWcd83nvu9bp97K7H31q
D1jSoZLuVTxUfxt4OS3au1T0ycqTEEILcU9vWJo1EngjhNBUKLMKeH4z2rNPOuRqkvQW8Bbxt3GlPbWsazQgYEE6PF0raS3wbWC/GptyPfD5dDg5kff2DooOAppDCM8V2vIOcW/goEJ7H099VvGLUj2HAx8A1pTa+7FOtBfiNtdyJ6iDiHtVRY8COwLFw/OFIYR3q7y+vIf9S2C4pPfX0sgQ98RvAh6U9FNJF0j6y1peW3A48Lel/mpO21Dss+UhhNcL08uI/TQ0TY8Efl2qu+oRRBsWp/e8WP+wtgoDKA6pHQz8W2F2I1WGISh876rV34nvb9H1wFcl9VccdprIxp/vq4Fvp6GNK9TOUFgXvZddqq7jIr2DpAHAQ8RAOI148gnib+PyIcyfS9OBjYdbavniv0v88BdtV5q+H3iDeEj8SlrvL0rt6WhdlXb9NZseCtV0q7oQwiJJzxAPTVuIQwJVi1aZVwzCaqFYnu4HLCae6CvrzKHc87wX/B0pt0FV5v+xxrrK72mH73MI4QxJ1wCfJB7VXCrp7BDC9TWusx9xCOXyKsuaC8+rfW4rry/P2xzV6i9ve9mZxJxYEUexIL2mn6TDQgjFqyLa/N518vtbdDtwBXHYpB9xL3ZW6wpCuFXSA8QhleOBn0q6N4RQ9eqfLngvu1Rf2gM+kDjGd1EIYV4IYTHxzejoA1T2HLCbpBGVGYqXPJVPBqwEPlia1zp2m8Z5RwKXhxAeTHuWf+K9vZVa17Uw/dwrhNBUerzYie26njjMcEsIYUOV5c8CQ9J4ZKUtOxDHCp8tlDkyjZVVlK/lXQDsC7xVpb3LO9HeO4Axkj5abaHeuwztWeKYX9GxxCGI39WwnqNK0x8l7mlWzoy3+z5XhBCeCSF8N4RwIvEEZ7U9wLYsIJ6PeLFKn63uRD3PpfYXlbevEoL92ULpKOGLxB2MQwuPQ4jnATrTB5v1/U3v093EE81nAD9MR5HFMitCCLeGECYQx4BPbe8IZwvfyy7VGwN4YDpUKT4OII5NvgNMkvQhSZ8gniDq7B7BXOK40SxJh0s6hPhbtqVU11zgv0v6pKS/lHQVGx8qrSaOqZ0haf8UJHex8eVfHa4rDU/cAtwo6SuSRkg6RNJpks7vxHbdRvyAX9rG8keAx4F/k3S0pL8i7knsSLy8iPRzN+AGSQemPp5equdO4PfEkyefVLzy5EhJF0r6TCfaew3wM+Lh4D9KGi1pb0njJP2Y9y6t+w7wd+lwcX9JXyCOl84IIZT3uKo5VPEKg/0lfYk4fnxVYXm773N6P65QPHu+d3qfP0YMw1r9MzGA7pB0RBq6Ol7SNZL27UQ93wWOlvRPaXs+TTwpCe99dl8i7tWPlzRU0s6dqL/sy6neW1NotT6Iv0C/JOkvaqxrS76/1xNP+H6KeCVGK0nXSRqf6jyIeNLxFeDtciVd9F52rdyD0MUHMURClceStPxzxCsN/gQ8QdwzagEmpuX1VDn5AzRRGKgnnvh6ONXzCvE3/OPAzEKZ9xHDcjVxL2kqpZNwaf2LUj3PA3+3mevqD3yLeNLiz8RhjUeBz7fTV8elbd2jnTLlExq7E/cm1hB/UTwKjC695hPA08QvyzPEM/blegYTw3pZau8y4tUeH2nvfajSvjpiIC4gDiG8ld7XbwODCuUaiMMelXVNZ+MThfMpnEAtzF+ayt6a6l5FPDvfv9b3OfXZj4gnHt8hXlFxI7BzO9s1kcJJuDTvYOAnaT3r0+fkBmDXtHwq0FR6zSYn1IiXg72Y2vIr4tUfARhVKPOt1E8bCttxG6UT3KSAbWc7ngTuamPZLun9+Fq1z1qaNxe4rTDd7ve3rXrS/CeA56vM/x7xBOx64nDObOCgap+NzXkvu/uh1LBtmuJ1jq8Sz4bP3FrWta2TtJT45bssd1u6i6QJxF8wg0MIa3K3pzukk28vAd8NIczI3Z6u1GdOwnWldOjWQtyrGkq8fjEQrxnus+uyrZ+kfySOv64i
Xl1xBfAfW2P4Kv7BzFDi9fEDiUcmW5VtMoCBAcTrbOuJh74LiYfLr/XxddnW78PEcd9diUNad7DpH6VsLfYinm9YAXw1hPBm5vZ0OQ9BmJll0huvgjAz2yZ0aghiyJAhob6+vpuaYma2dVq4cOEbIYRNbnzUqQCur69nwYIFXdcqM7NtgKSXqs33EISZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmXTqf8J1p5kzZ9LU1NRhuWXLlgEwfPjwbm3PiBEjmDRpUreuw8y2bb0mgJuamnjymcVsGLBru+X6r3sTgD+8031N779uVbfVbWZW0WsCGGDDgF1Zf8D4dsvstGQOQIfltkRlHWZm3cljwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpn0SADPnDmTmTNn9sSqrMR9b9Z71fXESpqamnpiNVaF+96s9/IQhJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy6QudwOs+y1atIjjjjsudzM2MXToUFauXLnRvP79+zN48GBWrlzJoEGDWLNmDXV1dQwbNozly5dz5ZVXMmrUKACam5s599xzefnllwGYMWMGo0aNorm5mWnTpnHOOecwY8YMJHHaaadxySWXMHz4cC6//HIApk2bxpQpUxg8ePBGbWhubub8889n+fLlXHrppTQ2Nm5SrriOa6+9ttPL21J+XflnrfX0BpVt6UttrqY7t8N7wJZNOXwBNmzY0Dp/zZo1ALS0tLBs2TJCCEyZMqW1bGNjY2v4Aq3LGhsbefrpp7nssstYvHgxzz33HFOnTmXdunW88MILzJo1q7XMrFmzNmlDY2MjTU1NrFu3jilTplQtV1zH5ixvS/l15Z+11tMbtNfHfUl3bocDeCu3aNGi3E3oUmvXrmXhwoU0Nzcze/bsTZbNmzePBx54gBACS5cu3WhZxezZs1vLPPDAAzQ3N7cua25uZs6cORu9rlyuubl5o3V0dnlbqr2u/LOWenqD4rb0lTZX093b0SNDEMuWLWP9+vVMnjy5zTJNTU30+3PoieZ0qN+f3qKp6e1222v5TJkyhTFjxrBhw4ZNlk2fPr3D17e0tCAJiHvcs2bN4txzzwXi3k5LS8smrymWa2xs5N13393s5W2p9rr22tGbFbelr7S5mu7ejg73gCWdKWmBpAWvv/56l63YbHOtXbuWuXPnVl3W0tJSNUDLQgit5R9++OHW+e3VWyk3d+7cTdbRmeVtqfa69trRmxW3pa+0uZru3o4O94BDCDcANwCMHj16s3ZRhw8fDsA111zTZpnJkyez8HevbU71Xe7dHd/PiH2HtdvevqI3nnzbUgMHDmTMmDHcd999myyrq4sf6Y6CTBIhBOrq6jjhhBNa548dO7bNeivlxo4dy5w5czZaR2eWt6Xa69prR29W3Ja+0uZquns7PAZsfc60adNoaGig
f//+myy76KKL6Nev/Y91XV0d2223HRCvupgwYULrsoaGhtYQLyqWa2ho2GQdnVnelmqva68dvVlxW/pKm6vp7u1wAG/lDjnkkNxN6FIDBw5k1KhRDB48mJNOOmmTZccffzzjxo1DEvX19RstqzjppJNay4wbN26jS4sGDx7M+PHjN3pdudzgwYM3Wkdnl7el2uvKP2uppzcobktfaXM13b0dDmDLZujQoZvM69+/f+v8QYMGAXGPdfjw4Uhi2rRprWUbGhrYa6+9WqcryxoaGjj44IO5+OKLOfDAAxk5ciRTp05lwIAB7LfffkyYMKG1TLU9moaGBkaMGMGAAQOYNm1a1XLFdWzO8raUX1f+2Zf2JNvr476kO7dDlZMRtRg9enRYsGBBp1dSuZqgljHg9QeMb7MMwE5L4iVCHZXbEjstmcOorWQMuJa+N7PuJWlhCGF0eb73gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSZ1PbGSESNG9MRqrAr3vVnv1SMBPGnSpJ5YjVXhvjfrvTwEYWaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2MwsEwewmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMnEAm5ll4gA2M8ukLncDivqvW8VOS+Z0UKYZoMNyW9oOGNZt9ZuZQS8K4BEjRtRUbtmyFgCGD+/OgBxWc3vMzDZXrwngSZMm5W6CmVmP8hiwmVkmDmAzs0wcwGZmmTiAzcwycQCbmWXiADYzy8QBbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZOIDNzDJxAJuZZeIANjPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmZpk4gM3MMlEIofbC0uvAS5u5riHAG5v52m2B+6dt7pv2uX/a1lv6Zu8Qwm7lmZ0K4C0haUEIYXSPrKwPcv+0zX3TPvdP23p733gIwswsEwewmVkmPRnAN/Tguvoi90/b3Dftc/+0rVf3TY+NAZuZ2cY8BGFmlokD2Mwsk24PYEnjJD0vqUnSBd29vt5I0i2SVkp6pjBvV0kPS3oh/dwlzZeka1N/PSXpsHwt736S9pQ0T9JiSc9Kmpzmu38ASTtKelzSotQ/09L8fSQ9lvrn3yVtn+bvkKab0vL6nO3vCZL6S3pC0v1pus/0TbcGsKT+wPeAE4GRwCmSRnbnOnup24BxpXkXAD8LIewH/CxNQ+yr/dLjTOD7PdTGXFqA80IIBwJHAd9InxH3T/QOMCaEcAhwKDBO0lHAFcBVqX9WA6en8qcDq0MII4CrUrmt3WRgcWG67/RNCKHbHsBHgQcL0xcCF3bnOnvrA6gHnilMPw/snp7vDjyfnl8PnFKt3LbwAH4CnOD+qdo3A4DfAEcS/7qrLs1v/Z4BDwIfTc/rUjnlbns39skexF/QY4D7AfWlvunuIYjhwCuF6VfTPINhIYQVAOnn0DR/m+2zdEj4EeAx3D+t0iH2k8BK4GHgRWBNCKElFSn2QWv/pOVvAoN7tsU96mrgW8C7aXowfahvujuAVWWer3tr3zbZZ5IGAj8EvhlCeKu9olXmbdX9E0LYEEI4lLi3dwRwYLVi6ec20z+STgZW
hhAWFmdXKdpr+6a7A/hVYM/C9B7A8m5eZ1/xmqTdAdLPlWn+NtdnkrYjhu+dIYQfpdnun5IQwhpgPnGsfJCkurSo2Aet/ZOW7wys6tmW9pijgU9LWgrcTRyGuJo+1DfdHcD/BeyXzkpuD3wRuK+b19lX3Ac0pOcNxLHPyvwJ6Wz/UcCblUPxrZEkATcDi0MI3y0scv8AknaTNCg93wkYSzzhNA/4XCpW7p9Kv30OeCSkQc+tTQjhwhDCHiGEemK2PBJCOJW+1Dc9MEg+HvgtcdzqotyD9jkewF3ACuD/EX8Ln04ce/oZ8EL6uWsqK+KVIy8CTwOjc7e/m/vmGOJh4FPAk+kx3v3T2j8fBp5I/fMMcEmavy/wONAE/AewQ5q/Y5puSsv3zb0NPdRPxwH397W+8Z8im5ll4r+EMzPLxAFsZpaJA9jMLBMHsJlZJg5gM7NMHMBmGUkKkkbkbofl4QDehkk6RtJ/SnpT0ipJv5R0eO529YQcwSdpvqSv9eQ6rXer67iIbY0kvZ9496i/B+4Btgc+Rrz9oZn1AO8Bb7v2Bwgh3BXizV7WhxAeCiE8VSkg6bR0o/TVkh6UtHdh2QmSlqS95+skPVrZu5M0VdIdhbL1aY+zLk3vLOlmSSskLZN0Wbp3NJImSvqFpCvTen8v6cRCXbtKulXS8rT8x4VlJ0t6UtKatGf/4c3pmA62O0j6errZ92pJ30t/Tl25a9kMSW+kdp9d2W5J04m/4K6TtFbSdYVVjq1Wn239HMDbrt8CGyQ1SjpR6T9OVEj6DPBt4LPAbsDPiX9SjaQhxJvnXAwMIf5Z8NGdWHcj8UbsI4i3n/wkUDw0P5J4n98hwL8ANxdC6XbifXEPIt6i8qrUpsOAW4CziH/GfD1wn6QdOtGudre74GTgcOAQ4AvAp9L8M4g3jD8UOAz4TOUFIYSLUl1nhxAGhhDOrqE+29rl/ltoP/I9iLc1vI14f4oW4s1KhqVlPwVOL5TtB6wD9gYmAL8uLFOq42tpeipwR2F5PfF+D3XAMOIwx06F5acA89LziUBTYdmA9NoPEG/M/i6wS5Vt+T5waWne88DH29j2AIyoMr/N7S687pjC8nuAC9LzR4CzCsvGVrY7Tc+v9FGpHVXr82Prf3gPeBsWQlgcQpgYQtgD+Cvgg8Tb+UEM2mvS4fwa4m37RLyp9Qcp3BQ9xOR4hdrsDWwHrCjUfT3v3XAd4A+FutelpwOJtxJcFUJY3Ua951XqTPXumdraGe1t9ybtI4bzwPR8o36h9j5pqz7byvkknAEQQlgi6TbiITzE8JgeQrizXFbSfhTuyZuGB4r36P0jcc+14gOF568Q94CHhPf+a0GtXgF2lTQoxHvjlpdNDyFM72Sd1dZRdbtrsIJ4/9mKPUvLfecr24j3gLdRkg6QdJ6kPdL0nsShgF+nIj8ALpR0UFq+s6TPp2WzgYMkfTadWDuHjUP2SeBYSXtJ2pn4vwCB1n8v9BAwQ9L7JfWT9CFJH++ozem1PwX+VdIukraTdGxafCPwdUlHKvoLSSdJel87VW6v+F+HK4/+HWx3R+4BJksarngP3/NLy18j3irRDHAAb8veJp7sekzSH4nB+wxwHkAI4V7if429W9JbadmJadkbwOeBy4Fm4n8o/mWl4hDCw8C/E+9hu5B4uVvRBOJlb88R/2vt/yaO79biK8T7Ki8h/peMb6Z1LiCeBLsu1dlEHE9uz7PA+sLjq+1tdw1uJP5yeYp4D985xLH1DWn5NcDn0tUO19ZYp23FfD9g6xKS5hNPvN2Uuy29Rbp87gchhL07LGzbJO8Bm3URSTtJGp+u+x0OTAHuzd0u670cwGZdR8A04hDIE8T/3XZJ1hZZr+YhCDOzTLwHbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZ/H++yD/HUsXeOAAAAABJRU5ErkJggg==\n",
172 | "text/plain": [
173 | ""
174 | ]
175 | },
176 | "metadata": {
177 | "needs_background": "light"
178 | },
179 | "output_type": "display_data"
180 | }
181 | ],
182 | "source": [
183 | "sns.boxplot(x=target_lengths)\n",
184 | "plt.title('Language Model Corpus Length Analysis', fontsize='x-large')\n",
185 | "plt.xlabel('Sequence Length', fontsize='large')\n",
186 | "plt.show()"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 6,
192 | "metadata": {},
193 | "outputs": [],
194 | "source": [
195 | "sentences = corpus_df['ko']\n",
196 | "\n",
197 | "new_sentences = list()\n",
198 | "new_targets = list()\n",
199 | "\n",
200 | "for (sentence, target) in zip(sentences, targets):\n",
201 | "    if len(target.split()) < 151:\n",
202 | "        new_sentences.append(sentence)\n",
203 | "        new_targets.append(target)"
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": 8,
209 | "metadata": {},
210 | "outputs": [
211 | {
212 | "name": "stdout",
213 | "output_type": "stream",
214 | "text": [
215 | "2991682\n",
216 | "2991682\n"
217 | ]
218 | }
219 | ],
220 | "source": [
221 | "print(len(new_sentences))\n",
222 | "print(len(new_targets))"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": 11,
228 | "metadata": {},
229 | "outputs": [
230 | {
231 | "data": {
232 | "text/html": [
233 | "<div>\n",
234 | "\n",
247 | "<table border=\"1\" class=\"dataframe\">\n",
248 | "  <thead>\n",
249 | "    <tr style=\"text-align: right;\">\n",
250 | "      <th></th>\n",
251 | "      <th>ko</th>\n",
252 | "      <th>id</th>\n",
253 | "    </tr>\n",
254 | "  </thead>\n",
255 | "  <tbody>\n",
256 | "    <tr>\n",
257 | "      <th>0</th>\n",
258 | "      <td>어디 보자...</td>\n",
259 | "      <td>8 190 0 42 45 1 1 1</td>\n",
260 | "    </tr>\n",
261 | "    <tr>\n",
262 | "      <th>1</th>\n",
263 | "      <td>칠대 왕국 종족 중에 오크는 처음 듣는다</td>\n",
264 | "      <td>318 50 0 576 170 0 363 401 0 129 17 0 57 238 4...</td>\n",
265 | "    </tr>\n",
266 | "    <tr>\n",
267 | "      <th>2</th>\n",
268 | "      <td>이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.</td>\n",
269 | "      <td>3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22...</td>\n",
270 | "    </tr>\n",
271 | "    <tr>\n",
272 | "      <th>3</th>\n",
273 | "      <td>라니스터 가의</td>\n",
274 | "      <td>32 20 79 162 0 6 130</td>\n",
275 | "    </tr>\n",
276 | "    <tr>\n",
277 | "      <th>4</th>\n",
278 | "      <td>별 희한한 생각이 다 떠오르곤 하죠</td>\n",
279 | "      <td>233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5...</td>\n",
280 | "    </tr>\n",
281 | "  </tbody>\n",
282 | "</table>\n",
283 | "</div>"
284 | ],
285 | "text/plain": [
286 | " ko \\\n",
287 | "0 어디 보자... \n",
288 | "1 칠대 왕국 종족 중에 오크는 처음 듣는다 \n",
289 | "2 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n",
290 | "3 라니스터 가의 \n",
291 | "4 별 희한한 생각이 다 떠오르곤 하죠 \n",
292 | "\n",
293 | " id \n",
294 | "0 8 190 0 42 45 1 1 1 \n",
295 | "1 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n",
296 | "2 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n",
297 | "3 32 20 79 162 0 6 130 \n",
298 | "4 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... "
299 | ]
300 | },
301 | "execution_count": 11,
302 | "metadata": {},
303 | "output_type": "execute_result"
304 | }
305 | ],
306 | "source": [
307 | "corpus_dict = {'ko': new_sentences,\n",
308 | "               'id': new_targets}\n",
309 | "corpus_df = pd.DataFrame(corpus_dict)\n",
310 | "corpus_df.head()"
311 | ]
312 | },
313 | {
314 | "cell_type": "code",
315 | "execution_count": 12,
316 | "metadata": {},
317 | "outputs": [],
318 | "source": [
319 | "with open('corpus_df.bin', 'wb') as f:\n",
320 | "    pickle.dump(corpus_df, f)"
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {},
327 | "outputs": [],
328 | "source": []
329 | }
330 | ],
331 | "metadata": {
332 | "kernelspec": {
333 | "display_name": "Python 3",
334 | "language": "python",
335 | "name": "python3"
336 | },
337 | "language_info": {
338 | "codemirror_mode": {
339 | "name": "ipython",
340 | "version": 3
341 | },
342 | "file_extension": ".py",
343 | "mimetype": "text/x-python",
344 | "name": "python",
345 | "nbconvert_exporter": "python",
346 | "pygments_lexer": "ipython3",
347 | "version": "3.7.6"
348 | }
349 | },
350 | "nbformat": 4,
351 | "nbformat_minor": 4
352 | }
353 |
--------------------------------------------------------------------------------
/data/train_labels.csv:
--------------------------------------------------------------------------------
1 | id,char,freq
2 | 0, ,5774462
3 | 1,.,640924
4 | 2,그,556373
5 | 3,이,509291
6 | 4,는,374559
7 | 5,아,370444
8 | 6,가,369698
9 | 7,고,356378
10 | 8,어,333842
11 | 9,거,306987
12 | 10,지,276453
13 | 11,데,249269
14 | 12,?,235024
15 | 13,나,229646
16 | 14,하,226136
17 | 15,다,221216
18 | 16,서,211193
19 | 17,에,204330
20 | 18,도,190561
21 | 19,게,177140
22 | 20,니,173284
23 | 21,",",152938
24 | 22,기,149467
25 | 23,은,144674
26 | 24,면,142025
27 | 25,야,137553
28 | 26,있,133155
29 | 27,한,121564
30 | 28,을,121048
31 | 29,까,119483
32 | 30,해,115148
33 | 31,리,111855
34 | 32,라,111479
35 | 33,래,105784
36 | 34,사,100533
37 | 35,근,99781
38 | 36,들,99447
39 | 37,안,97043
40 | 38,로,91847
41 | 39,일,88319
42 | 40,뭐,87328
43 | 41,내,85968
44 | 42,보,82911
45 | 43,제,80874
46 | 44,같,79626
47 | 45,자,76298
48 | 46,만,76093
49 | 47,시,72836
50 | 48,런,70919
51 | 49,너,69192
52 | 50,대,68756
53 | 51,때,67179
54 | 52,되,66237
55 | 53,으,66106
56 | 54,진,62831
57 | 55,를,61802
58 | 56,잖,61455
59 | 57,오,60782
60 | 58,러,60629
61 | 59,인,60234
62 | 60,막,59994
63 | 61,무,58705
64 | 62,었,58385
65 | 63,구,57294
66 | 64,했,57209
67 | 65,수,56787
68 | 66,간,55275
69 | 67,애,54476
70 | 68,우,53539
71 | 69,요,53234
72 | 70,마,53125
73 | 71,생,52815
74 | 72,렇,50798
75 | 73,냥,49989
76 | 74,짜,49581
77 | 75,주,48969
78 | 76,없,48392
79 | 77,말,47929
80 | 78,학,46285
81 | 79,스,46225
82 | 80,더,44487
83 | 81,많,43607
84 | 82,원,41379
85 | 83,음,41348
86 | 84,정,39775
87 | 85,겠,39691
88 | 86,여,39203
89 | 87,먹,39194
90 | 88,금,38720
91 | 89,든,38476
92 | 90,부,38398
93 | 91,할,38262
94 | 92,전,36575
95 | 93,번,36375
96 | 94,좋,36363
97 | 95,랑,36081
98 | 96,네,35514
99 | 97,람,33799
100 | 98,약,33412
101 | 99,건,33371
102 | 100,각,32167
103 | 101,좀,31738
104 | 102,알,30893
105 | 103,잘,30132
106 | 104,걸,29634
107 | 105,모,29629
108 | 106,것,28482
109 | 107,상,28247
110 | 108,려,28218
111 | 109,장,27856
112 | 110,히,27705
113 | 111,않,27305
114 | 112,맞,27202
115 | 113,던,27082
116 | 114,르,26286
117 | 115,교,26116
118 | 116,바,25994
119 | 117,냐,25742
120 | 118,드,25702
121 | 119,십,25654
122 | 120,날,25556
123 | 121,치,25287
124 | 122,비,25278
125 | 123,단,25129
126 | 124,동,25047
127 | 125,또,24720
128 | 126,못,24528
129 | 127,저,24074
130 | 128,얘,23990
131 | 129,중,23851
132 | 130,의,23607
133 | 131,난,23318
134 | 132,엄,23057
135 | 133,봤,22930
136 | 134,걔,22732
137 | 135,화,22593
138 | 136,응,22254
139 | 137,싶,21756
140 | 138,갔,21628
141 | 139,았,21052
142 | 140,집,20850
143 | 141,왜,20801
144 | 142,계,20757
145 | 143,공,20620
146 | 144,긴,20547
147 | 145,신,20371
148 | 146,적,20244
149 | 147,연,20225
150 | 148,직,20061
151 | 149,실,19467
152 | 150,영,19454
153 | 151,미,19444
154 | 152,봐,18931
155 | 153,분,18893
156 | 154,테,18829
157 | 155,년,18669
158 | 156,트,18654
159 | 157,문,18230
160 | 158,와,18114
161 | 159,돼,18028
162 | 160,물,17889
163 | 161,예,17864
164 | 162,터,17722
165 | 163,세,17719
166 | 164,럼,17521
167 | 165,청,17479
168 | 166,차,17455
169 | 167,친,17355
170 | 168,개,17355
171 | 169,삼,17242
172 | 170,국,17224
173 | 171,두,17129
174 | 172,소,17125
175 | 173,살,16893
176 | 174,재,16635
177 | 175,운,15949
178 | 176,쫌,15780
179 | 177,유,15516
180 | 178,속,15255
181 | 179,명,15190
182 | 180,랬,15155
183 | 181,본,15148
184 | 182,갈,15084
185 | 183,방,15069
186 | 184,돈,14998
187 | 185,타,14919
188 | 186,처,14908
189 | 187,빠,14851
190 | 188,과,14843
191 | 189,식,14743
192 | 190,디,14633
193 | 191,배,14403
194 | 192,피,14093
195 | 193,뭔,14093
196 | 194,선,13929
197 | 195,남,13909
198 | 196,경,13868
199 | 197,달,13787
200 | 198,언,13579
201 | 199,받,13519
202 | 200,심,13409
203 | 201,월,13367
204 | 202,버,13339
205 | 203,왔,13265
206 | 204,느,13223
207 | 205,점,12960
208 | 206,올,12923
209 | 207,업,12861
210 | 208,른,12801
211 | 209,성,12717
212 | 210,회,12591
213 | 211,조,12570
214 | 212,워,12424
215 | 213,따,12410
216 | 214,행,12350
217 | 215,반,12158
218 | 216,님,11998
219 | 217,딱,11908
220 | 218,관,11828
221 | 219,입,11275
222 | 220,카,11235
223 | 221,당,11068
224 | 222,였,10977
225 | 223,케,10576
226 | 224,쪽,10430
227 | 225,천,10384
228 | 226,작,10381
229 | 227,누,10336
230 | 228,열,10252
231 | 229,얼,10250
232 | 230,울,10246
233 | 231,찮,10231
234 | 232,된,10191
235 | 233,별,10159
236 | 234,떻,10108
237 | 235,머,9876
238 | 236,쓰,9853
239 | 237,위,9841
240 | 238,크,9838
241 | 239,노,9799
242 | 240,괜,9735
243 | 241,강,9698
244 | 242,편,9668
245 | 243,몰,9623
246 | 244,맛,9382
247 | 245,준,9342
248 | 246,줄,9294
249 | 247,파,9282
250 | 248,백,9252
251 | 249,매,9181
252 | 250,산,9160
253 | 251,술,9142
254 | 252,힘,9056
255 | 253,프,9019
256 | 254,즘,8997
257 | 255,임,8969
258 | 256,체,8888
259 | 257,형,8790
260 | 258,몇,8742
261 | 259,맨,8712
262 | 260,새,8623
263 | 261,용,8571
264 | 262,키,8547
265 | 263,통,8410
266 | 264,양,8363
267 | 265,끝,8361
268 | 266,싸,8328
269 | 267,볼,8188
270 | 268,혼,8179
271 | 269,온,8132
272 | 270,등,8123
273 | 271,길,8067
274 | 272,될,8033
275 | 273,밌,7998
276 | 274,육,7924
277 | 275,늘,7909
278 | 276,슨,7835
279 | 277,됐,7738
280 | 278,놀,7707
281 | 279,외,7608
282 | 280,팔,7601
283 | 281,져,7551
284 | 282,레,7485
285 | 283,억,7461
286 | 284,발,7450
287 | 285,결,7412
288 | 286,초,7290
289 | 287,감,7180
290 | 288,군,7174
291 | 289,호,7173
292 | 290,름,7146
293 | 291,솔,7079
294 | 292,닌,7051
295 | 293,밖,7013
296 | 294,불,7007
297 | 295,밥,6784
298 | 296,포,6676
299 | 297,싫,6631
300 | 298,완,6582
301 | 299,갖,6511
302 | 300,겨,6468
303 | 301,질,6453
304 | 302,토,6448
305 | 303,험,6417
306 | 304,색,6371
307 | 305,떤,6352
308 | 306,역,6340
309 | 307,티,6319
310 | 308,갑,6316
311 | 309,목,6262
312 | 310,린,6256
313 | 311,추,6204
314 | 312,격,6174
315 | 313,후,6119
316 | 314,확,6095
317 | 315,루,6079
318 | 316,민,6024
319 | 317,끼,6023
320 | 318,칠,6019
321 | 319,돌,5997
322 | 320,찍,5956
323 | 321,쪼,5946
324 | 322,깐,5788
325 | 323,필,5786
326 | 324,빨,5693
327 | 325,났,5657
328 | 326,락,5561
329 | 327,박,5537
330 | 328,끔,5483
331 | 329,낌,5403
332 | 330,럴,5397
333 | 331,취,5344
334 | 332,복,5315
335 | 333,둘,5264
336 | 334,페,5217
337 | 335,렸,5198
338 | 336,써,5197
339 | 337,줘,5173
340 | 338,급,5067
341 | 339,력,5065
342 | 340,잡,5030
343 | 341,씩,5006
344 | 342,찾,4990
345 | 343,놓,4987
346 | 344,최,4894
347 | 345,코,4891
348 | 346,넘,4870
349 | 347,졌,4803
350 | 348,섯,4799
351 | 349,브,4793
352 | 350,현,4764
353 | 351,눈,4760
354 | 352,항,4751
355 | 353,귀,4708
356 | 354,설,4688
357 | 355,벌,4666
358 | 356,담,4647
359 | 357,앞,4640
360 | 358,책,4630
361 | 359,절,4629
362 | 360,플,4523
363 | 361,폰,4513
364 | 362,태,4496
365 | 363,종,4487
366 | 364,옛,4450
367 | 365,증,4413
368 | 366,튼,4411
369 | 367,글,4408
370 | 368,습,4383
371 | 369,병,4377
372 | 370,론,4373
373 | 371,출,4364
374 | 372,능,4354
375 | 373,침,4345
376 | 374,순,4339
377 | 375,줬,4308
378 | 376,평,4303
379 | 377,메,4287
380 | 378,똑,4281
381 | 379,커,4261
382 | 380,엔,4248
383 | 381,꾸,4230
384 | 382,란,4194
385 | 383,듣,4083
386 | 384,씨,4009
387 | 385,큰,4002
388 | 386,표,3995
389 | 387,잠,3942
390 | 388,먼,3942
391 | 389,쁘,3840
392 | 390,활,3820
393 | 391,합,3787
394 | 392,접,3732
395 | 393,럽,3722
396 | 394,옷,3705
397 | 395,쳐,3690
398 | 396,손,3689
399 | 397,붙,3645
400 | 398,망,3640
401 | 399,죽,3609
402 | 400,투,3606
403 | 401,족,3603
404 | 402,셨,3589
405 | 403,참,3572
406 | 404,떨,3567
407 | 405,웃,3533
408 | 406,졸,3516
409 | 407,쉬,3492
410 | 408,뭘,3447
411 | 409,변,3406
412 | 410,릴,3374
413 | 411,웠,3293
414 | 412,홍,3267
415 | 413,즈,3265
416 | 414,랐,3245
417 | 415,독,3243
418 | 416,충,3239
419 | 417,짝,3217
420 | 418,떡,3197
421 | 419,뒤,3195
422 | 420,휴,3161
423 | 421,셔,3142
424 | 422,넣,3135
425 | 423,쨌,3075
426 | 424,악,3073
427 | 425,패,3049
428 | 426,빼,3041
429 | 427,슬,2983
430 | 428,특,2975
431 | 429,꺼,2970
432 | 430,숙,2951
433 | 431,쯤,2934
434 | 432,텐,2905
435 | 433,창,2901
436 | 434,겼,2888
437 | 435,굴,2869
438 | 436,판,2863
439 | 437,죠,2851
440 | 438,답,2820
441 | 439,희,2816
442 | 440,허,2815
443 | 441,옆,2798
444 | 442,료,2791
445 | 443,닐,2790
446 | 444,택,2769
447 | 445,림,2760
448 | 446,읽,2742
449 | 447,핸,2733
450 | 448,축,2730
451 | 449,풀,2716
452 | 450,틀,2712
453 | 451,몸,2694
454 | 452,골,2690
455 | 453,황,2635
456 | 454,켜,2635
457 | 455,익,2625
458 | 456,베,2613
459 | 457,북,2600
460 | 458,법,2578
461 | 459,늦,2578
462 | 460,함,2568
463 | 461,랜,2555
464 | 462,꼬,2555
465 | 463,향,2547
466 | 464,석,2541
467 | 465,환,2533
468 | 466,슷,2529
469 | 467,품,2518
470 | 468,혀,2513
471 | 469,블,2512
472 | 470,쓸,2503
473 | 471,채,2472
474 | 472,며,2470
475 | 473,욕,2463
476 | 474,권,2450
477 | 475,검,2445
478 | 476,굳,2428
479 | 477,록,2425
480 | 478,톡,2408
481 | 479,김,2408
482 | 480,넌,2383
483 | 481,깨,2375
484 | 482,션,2374
485 | 483,캐,2369
486 | 484,송,2339
487 | 485,녀,2336
488 | 486,탈,2327
489 | 487,광,2321
490 | 488,혹,2313
491 | 489,퍼,2300
492 | 490,뽑,2278
493 | 491,철,2265
494 | 492,째,2249
495 | 493,움,2232
496 | 494,밤,2231
497 | 495,꼭,2226
498 | 496,샀,2224
499 | 497,끊,2212
500 | 498,땐,2203
501 | 499,깔,2179
502 | 500,멀,2145
503 | 501,높,2141
504 | 502,께,2140
505 | 503,큼,2121
506 | 504,녁,2104
507 | 505,곳,2082
508 | 506,잔,2070
509 | 507,쉽,2070
510 | 508,짐,2067
511 | 509,암,2063
512 | 510,극,2061
513 | 511,련,2056
514 | 512,떠,2056
515 | 513,벽,2049
516 | 514,헤,2047
517 | 515,C,2040
518 | 516,끄,2024
519 | 517,곱,2015
520 | 518,승,2011
521 | 519,봉,2009
522 | 520,착,2006
523 | 521,촌,1990
524 | 522,껴,1986
525 | 523,딩,1983
526 | 524,류,1978
527 | 525,뜨,1970
528 | 526,넷,1941
529 | 527,놨,1922
530 | 528,궁,1894
531 | 529,논,1882
532 | 530,곤,1875
533 | 531,클,1869
534 | 532,싼,1859
535 | 533,앉,1854
536 | 534,컴,1849
537 | 535,맥,1841
538 | 536,팀,1830
539 | 537,썼,1818
540 | 538,낫,1801
541 | 539,튜,1788
542 | 540,걱,1786
543 | 541,쁜,1770
544 | 542,킨,1760
545 | 543,빌,1752
546 | 544,쿠,1748
547 | 545,찌,1738
548 | 546,쌤,1719
549 | 547,T,1715
550 | 548,밀,1711
551 | 549,빵,1702
552 | 550,냈,1702
553 | 551,센,1691
554 | 552,딴,1688
555 | 553,쩌,1678
556 | 554,딸,1678
557 | 555,걍,1596
558 | 556,획,1588
559 | 557,씬,1582
560 | 558,챙,1541
561 | 559,첫,1536
562 | 560,범,1530
563 | 561,핑,1519
564 | 562,굉,1519
565 | 563,쩔,1514
566 | 564,팅,1507
567 | 565,긍,1486
568 | 566,탄,1471
569 | 567,덟,1470
570 | 568,퇴,1469
571 | 569,뛰,1469
572 | 570,층,1467
573 | 571,춰,1454
574 | 572,훨,1447
575 | 573,찬,1439
576 | 574,듯,1424
577 | 575,S,1396
578 | 576,왕,1392
579 | 577,텔,1385
580 | 578,뉴,1382
581 | 579,렌,1377
582 | 580,탕,1374
583 | 581,짓,1371
584 | 582,밑,1365
585 | 583,헬,1358
586 | 584,존,1339
587 | 585,립,1323
588 | 586,녔,1318
589 | 587,꼈,1305
590 | 588,빡,1304
591 | 589,낮,1287
592 | 590,견,1282
593 | 591,링,1281
594 | 592,볶,1271
595 | 593,낙,1271
596 | 594,릭,1267
597 | 595,젠,1263
598 | 596,퓨,1262
599 | 597,츠,1256
600 | 598,맘,1252
601 | 599,놔,1247
602 | 600,렵,1241
603 | 601,땜,1232
604 | 602,쇼,1224
605 | 603,값,1215
606 | 604,닭,1203
607 | 605,깝,1200
608 | 606,픈,1194
609 | 607,탁,1183
610 | 608,쓴,1179
611 | 609,농,1172
612 | 610,량,1166
613 | 611,염,1156
614 | 612,홉,1144
615 | 613,척,1130
616 | 614,겁,1129
617 | 615,콘,1127
618 | 616,섭,1125
619 | 617,냄,1125
620 | 618,P,1125
621 | 619,효,1124
622 | 620,규,1124
623 | 621,꿈,1121
624 | 622,곡,1093
625 | 623,액,1090
626 | 624,쎄,1077
627 | 625,덜,1075
628 | 626,턴,1065
629 | 627,킹,1061
630 | 628,훈,1057
631 | 629,쳤,1054
632 | 630,널,1047
633 | 631,멋,1037
634 | 632,꿀,1034
635 | 633,깜,1019
636 | 634,짧,1014
637 | 635,롤,1013
638 | 636,낼,1012
639 | 637,꽤,1003
640 | 638,총,984
641 | 639,램,984
642 | 640,덕,980
643 | 641,믄,974
644 | 642,믿,972
645 | 643,흥,970
646 | 644,롱,967
647 | 645,뜻,962
648 | 646,짤,958
649 | 647,쌍,957
650 | 648,컨,953
651 | 649,셋,952
652 | 650,잤,950
653 | 651,닥,950
654 | 652,웬,946
655 | 653,엽,944
656 | 654,혜,939
657 | 655,찰,935
658 | 656,뻐,935
659 | 657,뿌,934
660 | 658,빈,934
661 | 659,꿔,934
662 | 660,낸,932
663 | 661,뻔,928
664 | 662,쌓,926
665 | 663,즐,919
666 | 664,튀,914
667 | 665,겹,909
668 | 666,득,899
669 | 667,끌,896
670 | 668,M,880
671 | 669,V,877
672 | 670,녹,876
673 | 671,푸,870
674 | 672,쭉,863
675 | 673,싱,858
676 | 674,팬,857
677 | 675,A,854
678 | 676,!,841
679 | 677,념,836
680 | 678,맡,825
681 | 679,쟁,814
682 | 680,엑,810
683 | 681,켓,809
684 | 682,뀌,808
685 | 683,털,803
686 | 684,풍,802
687 | 685,웨,799
688 | 686,땡,792
689 | 687,롯,791
690 | 688,롭,788
691 | 689,젊,781
692 | 690,넓,778
693 | 691,멘,777
694 | 692,냉,772
695 | 693,칼,771
696 | 694,잉,768
697 | 695,빙,768
698 | 696,뿐,767
699 | 697,옮,761
700 | 698,젤,760
701 | 699,B,757
702 | 700,죄,756
703 | 701,탔,752
704 | 702,샤,746
705 | 703,홀,745
706 | 704,떼,743
707 | 705,줌,738
708 | 706,징,734
709 | 707,폭,727
710 | 708,G,721
711 | 709,킬,713
712 | 710,흔,712
713 | 711,딜,711
714 | 712,슈,703
715 | 713,율,700
716 | 714,즌,697
717 | 715,씀,694
718 | 716,앙,689
719 | 717,눠,688
720 | 718,콩,686
721 | 719,얻,684
722 | 720,숨,682
723 | 721,닝,673
724 | 722,꽃,668
725 | 723,쌀,667
726 | 724,컬,666
727 | 725,춤,666
728 | 726,c,664
729 | 727,뚫,663
730 | 728,엠,661
731 | 729,몬,659
732 | 730,D,658
733 | 731,흐,656
734 | 732,앤,646
735 | 733,똥,645
736 | 734,콜,644
737 | 735,델,635
738 | 736,렀,628
739 | 737,폐,627
740 | 738,엘,624
741 | 739,쁠,623
742 | 740,랄,622
743 | 741,걘,621
744 | 742,벤,619
745 | 743,봄,611
746 | 744,왠,609
747 | 745,씻,609
748 | 746,률,608
749 | 747,켰,600
750 | 748,짱,600
751 | 749,웹,599
752 | 750,압,599
753 | 751,럭,596
754 | 752,땅,596
755 | 753,멍,595
756 | 754,랩,595
757 | 755,댓,595
758 | 756,깊,595
759 | 757,뮤,592
760 | 758,령,590
761 | 759,릿,589
762 | 760,낀,589
763 | 761,윤,586
764 | 762,옥,584
765 | 763,룸,582
766 | 764,딘,579
767 | 765,객,578
768 | 766,댄,576
769 | 767,컵,574
770 | 768,폴,573
771 | 769,쟤,570
772 | 770,뷰,569
773 | 771,템,568
774 | 772,덴,567
775 | 773,눌,559
776 | 774,캠,558
777 | 775,홈,557
778 | 776,삶,557
779 | 777,삭,555
780 | 778,벨,555
781 | 779,엉,552
782 | 780,헐,549
783 | 781,벅,545
784 | 782,벗,544
785 | 783,혈,543
786 | 784,밍,539
787 | 785,셀,536
788 | 786,낭,534
789 | 787,춥,533
790 | 788,릉,533
791 | 789,t,533
792 | 790,잃,532
793 | 791,I,529
794 | 792,놈,526
795 | 793,춘,524
796 | 794,찜,520
797 | 795,R,519
798 | 796,걷,518
799 | 797,삐,515
800 | 798,헌,510
801 | 799,딨,510
802 | 800,빛,504
803 | 801,흘,503
804 | 802,닫,502
805 | 803,균,502
806 | 804,p,495
807 | 805,L,494
808 | 806,좌,492
809 | 807,껄,491
810 | 808,펜,489
811 | 809,N,487
812 | 810,싹,486
813 | 811,탑,485
814 | 812,쏘,483
815 | 813,O,482
816 | 814,픽,480
817 | 815,덩,477
818 | 816,햄,476
819 | 817,큐,473
820 | 818,힐,472
821 | 819,곧,471
822 | 820,낳,470
823 | 821,힌,468
824 | 822,팩,468
825 | 823,뒷,468
826 | 824,툰,467
827 | 825,섬,465
828 | 826,꽂,463
829 | 827,례,462
830 | 828,핫,460
831 | 829,섞,460
832 | 830,촬,458
833 | 831,흰,457
834 | 832,둥,455
835 | 833,K,450
836 | 834,괴,449
837 | 835,s,448
838 | 836,핀,446
839 | 837,꿨,444
840 | 838,틱,441
841 | 839,밝,441
842 | 840,랙,440
843 | 841,땠,440
844 | 842,둔,440
845 | 843,슴,439
846 | 844,첨,438
847 | 845,밴,432
848 | 846,렁,431
849 | 847,칭,429
850 | 848,묻,428
851 | 849,뜬,425
852 | 850,깎,424
853 | 851,엇,423
854 | 852,컸,421
855 | 853,퀴,420
856 | 854,납,418
857 | 855,협,417
858 | 856,몽,416
859 | 857,꼐,415
860 | 858,떴,414
861 | 859,썰,410
862 | 860,찐,407
863 | 861,꼴,407
864 | 862,갠,406
865 | 863,턱,405
866 | 864,틴,398
867 | 865,낄,398
868 | 866,뒀,397
869 | 867,끗,396
870 | 868,꼼,395
871 | 869,F,395
872 | 870,샵,394
873 | 871,휘,392
874 | 872,뼈,390
875 | 873,뚜,389
876 | 874,쩍,388
877 | 875,팡,386
878 | 876,멜,386
879 | 877,톤,385
880 | 878,앨,385
881 | 879,탐,384
882 | 880,칸,384
883 | 881,끓,383
884 | 882,뚱,381
885 | 883,닮,378
886 | 884,깃,375
887 | 885,짬,374
888 | 886,빤,371
889 | 887,측,370
890 | 888,혔,369
891 | 889,꽁,369
892 | 890,펴,368
893 | 891,앴,368
894 | 892,겸,368
895 | 893,쿨,367
896 | 894,릇,363
897 | 895,얀,362
898 | 896,쿄,358
899 | 897,컷,358
900 | 898,팠,356
901 | 899,끈,355
902 | 900,렴,354
903 | 901,잊,352
904 | 902,덤,350
905 | 903,갤,342
906 | 904,븐,340
907 | 905,흡,337
908 | 906,덮,337
909 | 907,씹,335
910 | 908,뽀,335
911 | 909,뚝,335
912 | 910,갚,335
913 | 911,찔,334
914 | 912,댔,333
915 | 913,혁,332
916 | 914,띠,328
917 | 915,벼,327
918 | 916,얇,324
919 | 917,뺐,324
920 | 918,팝,323
921 | 919,잇,322
922 | 920,왼,322
923 | 921,낚,321
924 | 922,칙,316
925 | 923,겉,316
926 | 924,뜯,313
927 | 925,닦,312
928 | 926,짠,311
929 | 927,썹,310
930 | 928,뷔,310
931 | 929,묶,310
932 | 930,꾼,306
933 | 931,빅,305
934 | 932,땄,305
935 | 933,캡,304
936 | 934,묘,304
937 | 935,샘,303
938 | 936,묵,303
939 | 937,a,302
940 | 938,쭈,300
941 | 939,b,300
942 | 940,겪,299
943 | 941,둬,298
944 | 942,J,298
945 | 943,쫄,296
946 | 944,랫,296
947 | 945,뀐,296
948 | 946,흑,295
949 | 947,댕,295
950 | 948,꽉,295
951 | 949,곰,294
952 | 950,붕,293
953 | 951,땀,292
954 | 952,릎,290
955 | 953,뽕,289
956 | 954,쥐,288
957 | 955,렉,287
958 | 956,숭,283
959 | 957,샐,283
960 | 958,v,282
961 | 959,렛,281
962 | 960,녕,281
963 | 961,힙,280
964 | 962,쫙,279
965 | 963,촉,278
966 | 964,쩜,277
967 | 965,긋,277
968 | 966,샌,276
969 | 967,o,275
970 | 968,쫓,273
971 | 969,쩐,273
972 | 970,헷,272
973 | 971,X,268
974 | 972,웅,267
975 | 973,뺏,267
976 | 974,쵸,266
977 | 975,쪘,266
978 | 976,랍,266
979 | 977,E,266
980 | 978,좁,265
981 | 979,앱,265
982 | 980,썸,264
983 | 981,냅,264
984 | 982,펙,263
985 | 983,늙,263
986 | 984,껌,261
987 | 985,n,261
988 | 986,e,261
989 | 987,랭,260
990 | 988,귤,260
991 | 989,찢,259
992 | 990,닿,259
993 | 991,띄,258
994 | 992,긁,255
995 | 993,귄,253
996 | 994,굽,253
997 | 995,갓,253
998 | 996,캔,252
999 | 997,멈,252
1000 | 998,욱,250
1001 | 999,뺄,250
1002 | 1000,뇌,250
1003 | 1001,팟,249
1004 | 1002,쌌,248
1005 | 1003,룹,248
1006 | 1004,덥,248
1007 | 1005,폼,246
1008 | 1006,톱,244
1009 | 1007,듬,244
1010 | 1008,껍,244
1011 | 1009,흠,243
1012 | 1010,팍,243
1013 | 1011,맹,243
1014 | 1012,쉴,242
1015 | 1013,썩,240
1016 | 1014,밟,240
1017 | 1015,맵,237
1018 | 1016,돋,236
1019 | 1017,콤,235
1020 | 1018,맙,234
1021 | 1019,뱅,233
1022 | 1020,쫍,231
1023 | 1021,윗,229
1024 | 1022,뜩,229
1025 | 1023,찝,228
1026 | 1024,뺀,227
1027 | 1025,닷,226
1028 | 1026,넨,226
1029 | 1027,쌈,225
1030 | 1028,쩨,224
1031 | 1029,붓,224
1032 | 1030,쩡,223
1033 | 1031,믹,223
1034 | 1032,잼,221
1035 | 1033,r,221
1036 | 1034,쭐,220
1037 | 1035,엊,219
1038 | 1036,g,219
1039 | 1037,췄,217
1040 | 1038,룩,217
1041 | 1039,텀,215
1042 | 1040,쇠,213
1043 | 1041,숫,212
1044 | 1042,풋,210
1045 | 1043,쌩,208
1046 | 1044,쾌,207
1047 | 1045,볍,207
1048 | 1046,뤄,207
1049 | 1047,겐,207
1050 | 1048,m,207
1051 | 1049,펌,206
1052 | 1050,쪄,206
1053 | 1051,뻥,206
1054 | 1052,i,206
1055 | 1053,뻤,205
1056 | 1054,k,204
1057 | 1055,핵,203
1058 | 1056,셉,200
1059 | 1057,듀,198
1060 | 1058,닉,198
1061 | 1059,략,197
1062 | 1060,넉,197
1063 | 1061,딪,196
1064 | 1062,낯,195
1065 | 1063,텍,194
1066 | 1064,뱃,194
1067 | 1065,멤,194
1068 | 1066,윈,192
1069 | 1067,엎,192
1070 | 1068,뭉,192
1071 | 1069,젝,191
1072 | 1070,셜,190
1073 | 1071,빴,190
1074 | 1072,룰,190
1075 | 1073,앗,189
1076 | 1074,궈,189
1077 | 1075,윙,188
1078 | 1076,엥,188
1079 | 1077,d,185
1080 | 1078,꼽,183
1081 | 1079,챔,182
1082 | 1080,쉐,182
1083 | 1081,봇,182
1084 | 1082,푼,180
1085 | 1083,댁,179
1086 | 1084,칵,178
1087 | 1085,뿔,177
1088 | 1086,뺑,176
1089 | 1087,탱,175
1090 | 1088,쿼,174
1091 | 1089,갱,174
1092 | 1090,퉁,173
1093 | 1091,빔,173
1094 | 1092,썬,172
1095 | 1093,빽,172
1096 | 1094,둑,172
1097 | 1095,헛,169
1098 | 1096,빗,168
1099 | 1097,탓,167
1100 | 1098,륵,165
1101 | 1099,꼰,165
1102 | 1100,쎈,163
1103 | 1101,쥬,162
1104 | 1102,깡,162
1105 | 1103,퀄,161
1106 | 1104,빚,161
1107 | 1105,즉,160
1108 | 1106,삿,160
1109 | 1107,밭,160
1110 | 1108,u,160
1111 | 1109,혐,159
1112 | 1110,햇,159
1113 | 1111,툭,159
1114 | 1112,탠,158
1115 | 1113,샷,158
1116 | 1114,맣,157
1117 | 1115,껏,157
1118 | 1116,핏,156
1119 | 1117,앵,155
1120 | 1118,뜰,155
1121 | 1119,굿,155
1122 | 1120,U,155
1123 | 1121,섹,154
1124 | 1122,펑,153
1125 | 1123,맻,153
1126 | 1124,뀔,153
1127 | 1125,깥,152
1128 | 1126,뱀,150
1129 | 1127,뢰,150
1130 | 1128,껀,150
1131 | 1129,뉘,149
1132 | 1130,흉,146
1133 | 1131,틈,146
1134 | 1132,쏟,146
1135 | 1133,훔,143
1136 | 1134,쇄,142
1137 | 1135,뎅,142
1138 | 1136,칩,141
1139 | 1137,띵,140
1140 | 1138,푹,139
1141 | 1139,넥,139
1142 | 1140,퀘,137
1143 | 1141,훅,136
1144 | 1142,융,136
1145 | 1143,멸,136
1146 | 1144,냠,136
1147 | 1145,횟,134
1148 | 1146,찼,134
1149 | 1147,Y,134
1150 | 1148,룬,133
1151 | 1149,귈,133
1152 | 1150,H,133
1153 | 1151,젓,132
1154 | 1152,쏠,132
1155 | 1153,숲,132
1156 | 1154,냬,132
1157 | 1155,l,132
1158 | 1156,짰,130
1159 | 1157,멕,130
1160 | 1158,뇨,130
1161 | 1159,팽,128
1162 | 1160,깼,127
1163 | 1161,숏,126
1164 | 1162,굔,125
1165 | 1163,슐,123
1166 | 1164,쉰,123
1167 | 1165,얄,122
1168 | 1166,뱉,122
1169 | 1167,렬,122
1170 | 1168,굶,122
1171 | 1169,팁,121
1172 | 1170,츄,121
1173 | 1171,뭣,121
1174 | 1172,륙,121
1175 | 1173,횡,120
1176 | 1174,옹,119
1177 | 1175,뻘,117
1178 | 1176,옵,116
1179 | 1177,옴,116
1180 | 1178,얹,116
1181 | 1179,쑥,116
1182 | 1180,깄,116
1183 | 1181,므,115
1184 | 1182,찡,112
1185 | 1183,젖,112
1186 | 1184,꽈,112
1187 | 1185,틸,111
1188 | 1186,콕,111
1189 | 1187,첩,111
1190 | 1188,똘,111
1191 | 1189,쿵,110
1192 | 1190,왤,110
1193 | 1191,괌,110
1194 | 1192,밸,109
1195 | 1193,녜,109
1196 | 1194,갸,109
1197 | 1195,펀,108
1198 | 1196,칫,108
1199 | 1197,맺,108
1200 | 1198,탭,107
1201 | 1199,쁨,106
1202 | 1200,폈,105
1203 | 1201,펼,105
1204 | 1202,첼,105
1205 | 1203,숱,105
1206 | 1204,섰,105
1207 | 1205,킥,104
1208 | 1206,맑,103
1209 | 1207,랗,103
1210 | 1208,펐,102
1211 | 1209,넛,102
1212 | 1210,솜,101
1213 | 1211,벙,101
1214 | 1212,껑,101
1215 | 1213,f,101
1216 | 1214,룡,100
1217 | 1215,훌,99
1218 | 1216,x,98
1219 | 1217,쓱,97
1220 | 1218,늬,97
1221 | 1219,곽,97
1222 | 1220,y,97
1223 | 1221,욘,96
1224 | 1222,돔,96
1225 | 1223,겄,96
1226 | 1224,텝,95
1227 | 1225,훠,94
1228 | 1226,텅,94
1229 | 1227,씌,94
1230 | 1228,꺾,94
1231 | 1229,벚,93
1232 | 1230,렷,93
1233 | 1231,귓,93
1234 | 1232,찹,92
1235 | 1233,툴,91
1236 | 1234,깅,91
1237 | 1235,쭤,90
1238 | 1236,욜,90
1239 | 1237,얌,90
1240 | 1238,짖,89
1241 | 1239,옳,89
1242 | 1240,벳,88
1243 | 1241,뛸,88
1244 | 1242,깠,88
1245 | 1243,퍽,87
1246 | 1244,퀸,87
1247 | 1245,엮,87
1248 | 1246,삽,87
1249 | 1247,겟,87
1250 | 1248,왓,86
1251 | 1249,댈,86
1252 | 1250,샴,85
1253 | 1251,뻗,85
1254 | 1252,됨,84
1255 | 1253,얜,83
1256 | 1254,굵,83
1257 | 1255,눕,82
1258 | 1256,갇,82
1259 | 1257,셰,81
1260 | 1258,늫,81
1261 | 1259,텨,80
1262 | 1260,숍,80
1263 | 1261,뻑,80
1264 | 1262,됩,80
1265 | 1263,잎,79
1266 | 1264,뭇,79
1267 | 1265,퐁,78
1268 | 1266,팸,78
1269 | 1267,쯔,78
1270 | 1268,넜,78
1271 | 1269,깍,78
1272 | 1270,쌔,77
1273 | 1271,셈,77
1274 | 1272,읍,76
1275 | 1273,픔,75
1276 | 1274,펫,75
1277 | 1275,콧,75
1278 | 1276,얗,75
1279 | 1277,눅,75
1280 | 1278,j,74
1281 | 1279,쬐,73
1282 | 1280,렙,73
1283 | 1281,닙,73
1284 | 1282,슥,72
1285 | 1283,흙,71
1286 | 1284,쭝,71
1287 | 1285,짭,71
1288 | 1286,샹,71
1289 | 1287,릏,71
1290 | 1288,럿,71
1291 | 1289,덧,71
1292 | 1290,즙,70
1293 | 1291,늑,70
1294 | 1292,괄,70
1295 | 1293,킷,68
1296 | 1294,쿡,68
1297 | 1295,캉,68
1298 | 1296,둡,68
1299 | 1297,톨,67
1300 | 1298,엣,67
1301 | 1299,숟,67
1302 | 1300,낑,67
1303 | 1301,펭,66
1304 | 1302,왁,66
1305 | 1303,쏴,66
1306 | 1304,쏙,66
1307 | 1305,봅,66
1308 | 1306,멧,66
1309 | 1307,줏,65
1310 | 1308,뵈,65
1311 | 1309,쫑,64
1312 | 1310,륨,64
1313 | 1311,h,64
1314 | 1312,펄,63
1315 | 1313,짼,63
1316 | 1314,짚,63
1317 | 1315,껐,63
1318 | 1316,겜,63
1319 | 1317,싯,62
1320 | 1318,붐,62
1321 | 1319,렐,62
1322 | 1320,돗,62
1323 | 1321,팥,61
1324 | 1322,웰,61
1325 | 1323,륜,61
1326 | 1324,잣,60
1327 | 1325,슝,60
1328 | 1326,붉,60
1329 | 1327,윽,59
1330 | 1328,삘,59
1331 | 1329,딲,59
1332 | 1330,갯,58
1333 | 1331,횐,57
1334 | 1332,헨,57
1335 | 1333,캘,57
1336 | 1334,쩰,57
1337 | 1335,뤘,57
1338 | 1336,랴,57
1339 | 1337,껜,57
1340 | 1338,펠,56
1341 | 1339,킵,56
1342 | 1340,컹,56
1343 | 1341,렘,56
1344 | 1342,뛴,56
1345 | 1343,헝,55
1346 | 1344,씽,55
1347 | 1345,뮬,55
1348 | 1346,젯,54
1349 | 1347,샜,54
1350 | 1348,뿜,54
1351 | 1349,뒹,54
1352 | 1350,뎌,54
1353 | 1351,깬,54
1354 | 1352,챠,53
1355 | 1353,왈,53
1356 | 1354,뾰,53
1357 | 1355,뚤,53
1358 | 1356,꾹,53
1359 | 1357,갛,53
1360 | 1358,잌,52
1361 | 1359,엿,52
1362 | 1360,솥,52
1363 | 1361,벡,52
1364 | 1362,룻,52
1365 | 1363,꿍,52
1366 | 1364,곈,52
1367 | 1365,팜,51
1368 | 1366,튕,51
1369 | 1367,컥,51
1370 | 1368,첸,51
1371 | 1369,줍,51
1372 | 1370,섀,51
1373 | 1371,몫,51
1374 | 1372,뜸,51
1375 | 1373,깁,51
1376 | 1374,핬,50
1377 | 1375,쭘,50
1378 | 1376,쌰,50
1379 | 1377,넬,50
1380 | 1378,큘,49
1381 | 1379,쾅,49
1382 | 1380,캄,48
1383 | 1381,괘,48
1384 | 1382,쟀,47
1385 | 1383,윌,47
1386 | 1384,엌,47
1387 | 1385,앓,47
1388 | 1386,씁,47
1389 | 1387,륭,47
1390 | 1388,W,47
1391 | 1389,쑤,46
1392 | 1390,삥,46
1393 | 1391,돕,46
1394 | 1392,깰,46
1395 | 1393,핍,45
1396 | 1394,텃,45
1397 | 1395,슛,45
1398 | 1396,맸,45
1399 | 1397,롬,45
1400 | 1398,갭,45
1401 | 1399,얽,44
1402 | 1400,쏭,44
1403 | 1401,랠,44
1404 | 1402,겔,44
1405 | 1403,Q,44
1406 | 1404,핥,43
1407 | 1405,킴,43
1408 | 1406,읏,43
1409 | 1407,앚,43
1410 | 1408,숯,43
1411 | 1409,밋,43
1412 | 1410,뽐,42
1413 | 1411,뻣,42
1414 | 1412,눴,42
1415 | 1413,잭,41
1416 | 1414,뽈,41
1417 | 1415,뗐,41
1418 | 1416,꽝,41
1419 | 1417,훑,40
1420 | 1418,캥,40
1421 | 1419,쫀,40
1422 | 1420,뵙,40
1423 | 1421,홋,39
1424 | 1422,펍,39
1425 | 1423,뺨,39
1426 | 1424,됬,39
1427 | 1425,끽,39
1428 | 1426,빕,38
1429 | 1427,밉,38
1430 | 1428,꿋,38
1431 | 1429,헉,37
1432 | 1430,캣,37
1433 | 1431,촘,37
1434 | 1432,셌,37
1435 | 1433,삔,37
1436 | 1434,삑,37
1437 | 1435,뽂,37
1438 | 1436,뮌,37
1439 | 1437,뗄,37
1440 | 1438,텼,36
1441 | 1439,탬,36
1442 | 1440,쨍,36
1443 | 1441,웜,36
1444 | 1442,앰,36
1445 | 1443,맴,36
1446 | 1444,띡,36
1447 | 1445,꿇,36
1448 | 1446,걀,36
1449 | 1447,흩,35
1450 | 1448,쥴,35
1451 | 1449,씸,35
1452 | 1450,낡,35
1453 | 1451,곁,35
1454 | 1452,휙,34
1455 | 1453,쿤,34
1456 | 1454,켄,34
1457 | 1455,츤,34
1458 | 1456,얕,34
1459 | 1457,썪,34
1460 | 1458,둠,34
1461 | 1459,촛,33
1462 | 1460,챌,33
1463 | 1461,죙,33
1464 | 1462,쟈,33
1465 | 1463,잰,33
1466 | 1464,뵀,33
1467 | 1465,w,33
1468 | 1466,푠,32
1469 | 1467,폿,32
1470 | 1468,쨈,32
1471 | 1469,밲,32
1472 | 1470,랖,32
1473 | 1471,떵,32
1474 | 1472,찻,31
1475 | 1473,옇,31
1476 | 1474,뽁,31
1477 | 1475,둣,31
1478 | 1476,닳,31
1479 | 1477,긱,31
1480 | 1478,곶,31
1481 | 1479,휠,30
1482 | 1480,춧,30
1483 | 1481,쐈,30
1484 | 1482,썽,30
1485 | 1483,뼛,29
1486 | 1484,떳,29
1487 | 1485,굘,29
1488 | 1486,훗,28
1489 | 1487,퀵,28
1490 | 1488,봬,28
1491 | 1489,릅,28
1492 | 1490,꺠,28
1493 | 1491,슉,27
1494 | 1492,눔,27
1495 | 1493,끙,27
1496 | 1494,궐,27
1497 | 1495,2,27
1498 | 1496,촥,26
1499 | 1497,젬,26
1500 | 1498,솟,26
1501 | 1499,맷,26
1502 | 1500,룐,26
1503 | 1501,뎃,26
1504 | 1502,깽,26
1505 | 1503,툼,25
1506 | 1504,쎌,25
1507 | 1505,쉼,25
1508 | 1506,쉘,25
1509 | 1507,숑,25
1510 | 1508,뎀,25
1511 | 1509,냔,25
1512 | 1510,쫘,24
1513 | 1511,쎘,24
1514 | 1512,싣,24
1515 | 1513,섣,24
1516 | 1514,샾,24
1517 | 1515,맬,24
1518 | 1516,뗀,24
1519 | 1517,꾀,24
1520 | 1518,헹,23
1521 | 1519,햐,23
1522 | 1520,톰,23
1523 | 1521,췌,23
1524 | 1522,챈,23
1525 | 1523,봔,23
1526 | 1524,밧,23
1527 | 1525,맏,23
1528 | 1526,딥,23
1529 | 1527,늠,23
1530 | 1528,낵,23
1531 | 1529,낱,23
1532 | 1530,꺄,23
1533 | 1531,갬,23
1534 | 1532,훼,22
1535 | 1533,핼,22
1536 | 1534,튠,22
1537 | 1535,웩,22
1538 | 1536,쏜,22
1539 | 1537,뿅,22
1540 | 1538,빰,22
1541 | 1539,딤,22
1542 | 1540,꿉,22
1543 | 1541,걜,22
1544 | 1542,1,22
1545 | 1543,짙,21
1546 | 1544,얍,21
1547 | 1545,샛,21
1548 | 1546,뗘,21
1549 | 1547,듭,21
1550 | 1548,챘,20
1551 | 1549,쯧,20
1552 | 1550,짹,20
1553 | 1551,잦,20
1554 | 1552,옐,20
1555 | 1553,빳,20
1556 | 1554,몹,20
1557 | 1555,몄,20
1558 | 1556,똔,20
1559 | 1557,딧,20
1560 | 1558,놉,20
1561 | 1559,궜,20
1562 | 1560,굼,20
1563 | 1561,헥,19
1564 | 1562,캬,19
1565 | 1563,챕,19
1566 | 1564,쟨,19
1567 | 1565,멓,19
1568 | 1566,똠,19
1569 | 1567,댐,19
1570 | 1568,텁,18
1571 | 1569,켈,18
1572 | 1570,첵,18
1573 | 1571,숄,18
1574 | 1572,띨,18
1575 | 1573,듦,18
1576 | 1574,궤,18
1577 | 1575,곗,18
1578 | 1576,튈,17
1579 | 1577,좆,17
1580 | 1578,윷,17
1581 | 1579,옅,17
1582 | 1580,얏,17
1583 | 1581,믈,17
1584 | 1582,룽,17
1585 | 1583,띃,17
1586 | 1584,딕,17
1587 | 1585,뎁,17
1588 | 1586,닛,17
1589 | 1587,냑,17
1590 | 1588,겅,17
1591 | 1589,휩,16
1592 | 1590,팎,16
1593 | 1591,틋,16
1594 | 1592,콸,16
1595 | 1593,콥,16
1596 | 1594,잴,16
1597 | 1595,웁,16
1598 | 1596,슘,16
1599 | 1597,멱,16
1600 | 1598,랏,16
1601 | 1599,떄,16
1602 | 1600,뒨,16
1603 | 1601,꿰,16
1604 | 1602,깻,16
1605 | 1603,긌,16
1606 | 1604,젼,15
1607 | 1605,윰,15
1608 | 1606,웍,15
1609 | 1607,앳,15
1610 | 1608,샬,15
1611 | 1609,샥,15
1612 | 1610,볕,15
1613 | 1611,멩,15
1614 | 1612,넹,15
1615 | 1613,넙,15
1616 | 1614,끕,15
1617 | 1615,휜,14
1618 | 1616,텄,14
1619 | 1617,쫒,14
1620 | 1618,쩝,14
1621 | 1619,쨋,14
1622 | 1620,윳,14
1623 | 1621,쉭,14
1624 | 1622,쇽,14
1625 | 1623,셧,14
1626 | 1624,뵐,14
1627 | 1625,땔,14
1628 | 1626,덱,14
1629 | 1627,댑,14
1630 | 1628,꺽,14
1631 | 1629,곪,14
1632 | 1630,켔,13
1633 | 1631,츰,13
1634 | 1632,읜,13
1635 | 1633,쑈,13
1636 | 1634,볐,13
1637 | 1635,및,13
1638 | 1636,롷,13
1639 | 1637,딛,13
1640 | 1638,냇,13
1641 | 1639,3,13
1642 | 1640,혓,12
1643 | 1641,팻,12
1644 | 1642,팰,12
1645 | 1643,킁,12
1646 | 1644,촤,12
1647 | 1645,쨰,12
1648 | 1646,잿,12
1649 | 1647,옌,12
1650 | 1648,쐬,12
1651 | 1649,쌋,12
1652 | 1650,슁,12
1653 | 1651,쉑,12
1654 | 1652,빢,12
1655 | 1653,뵌,12
1656 | 1654,뭄,12
1657 | 1655,묽,12
1658 | 1656,뎠,12
1659 | 1657,늪,12
1660 | 1658,뇽,12
1661 | 1659,훤,11
1662 | 1660,횔,11
1663 | 1661,홧,11
1664 | 1662,햅,11
1665 | 1663,푯,11
1666 | 1664,팼,11
1667 | 1665,탯,11
1668 | 1666,탤,11
1669 | 1667,큭,11
1670 | 1668,짢,11
1671 | 1669,읊,11
1672 | 1670,롸,11
1673 | 1671,띤,11
1674 | 1672,놋,11
1675 | 1673,넴,11
1676 | 1674,귿,11
1677 | 1675,q,11
1678 | 1676,휑,10
1679 | 1677,퓸,10
1680 | 1678,튿,10
1681 | 1679,튄,10
1682 | 1680,촐,10
1683 | 1681,쭌,10
1684 | 1682,짊,10
1685 | 1683,숴,10
1686 | 1684,숀,10
1687 | 1685,뿡,10
1688 | 1686,뻬,10
1689 | 1687,렜,10
1690 | 1688,뗬,10
1691 | 1689,늄,10
1692 | 1690,끍,10
1693 | 1691,곌,10
1694 | 1692,갰,10
1695 | 1693,폄,9
1696 | 1694,콰,9
1697 | 1695,캇,9
1698 | 1696,캅,9
1699 | 1697,늉,9
1700 | 1698,뉜,9
1701 | 1699,큔,8
1702 | 1700,콱,8
1703 | 1701,켠,8
1704 | 1702,쳔,8
1705 | 1703,쳇,8
1706 | 1704,읔,8
1707 | 1705,읎,8
1708 | 1706,욤,8
1709 | 1707,뼘,8
1710 | 1708,롹,8
1711 | 1709,렝,8
1712 | 1710,뚠,8
1713 | 1711,땋,8
1714 | 1712,덨,8
1715 | 1713,넵,8
1716 | 1714,넝,8
1717 | 1715,넋,8
1718 | 1716,꿩,8
1719 | 1717,꼿,8
1720 | 1718,깟,8
1721 | 1719,곯,8
1722 | 1720,4,8
1723 | 1721,힝,7
1724 | 1722,헀,7
1725 | 1723,푤,7
1726 | 1724,쿱,7
1727 | 1725,캤,7
1728 | 1726,챗,7
1729 | 1727,쯩,7
1730 | 1728,쮸,7
1731 | 1729,읗,7
1732 | 1730,윅,7
1733 | 1731,얠,7
1734 | 1732,씰,7
1735 | 1733,썅,7
1736 | 1734,쌜,7
1737 | 1735,쌉,7
1738 | 1736,슌,7
1739 | 1737,쉿,7
1740 | 1738,쇳,7
1741 | 1739,셍,7
1742 | 1740,뿍,7
1743 | 1741,뼌,7
1744 | 1742,뺌,7
1745 | 1743,밈,7
1746 | 1744,룟,7
1747 | 1745,룔,7
1748 | 1746,뜀,7
1749 | 1747,끅,7
1750 | 1748,꾜,7
1751 | 1749,겡,7
1752 | 1750,Z,7
1753 | 1751,횰,6
1754 | 1752,헴,6
1755 | 1753,핌,6
1756 | 1754,펩,6
1757 | 1755,펨,6
1758 | 1756,켸,6
1759 | 1757,쩠,6
1760 | 1758,잽,6
1761 | 1759,엡,6
1762 | 1760,앝,6
1763 | 1761,쓕,6
1764 | 1762,썜,6
1765 | 1763,쌘,6
1766 | 1764,삣,6
1767 | 1765,빻,6
1768 | 1766,몀,6
1769 | 1767,뤼,6
1770 | 1768,롄,6
1771 | 1769,떫,6
1772 | 1770,덫,6
1773 | 1771,뉸,6
1774 | 1772,꽥,6
1775 | 1773,궂,6
1776 | 1774,괭,6
1777 | 1775,0,6
1778 | 1776,힉,5
1779 | 1777,휸,5
1780 | 1778,퓰,5
1781 | 1779,퉤,5
1782 | 1780,퀭,5
1783 | 1781,켐,5
1784 | 1782,췻,5
1785 | 1783,챡,5
1786 | 1784,쨀,5
1787 | 1785,젱,5
1788 | 1786,읒,5
1789 | 1787,웡,5
1790 | 1788,옙,5
1791 | 1789,얬,5
1792 | 1790,앎,5
1793 | 1791,씼,5
1794 | 1792,쑨,5
1795 | 1793,쐐,5
1796 | 1794,쌨,5
1797 | 1795,쉈,5
1798 | 1796,숩,5
1799 | 1797,셤,5
1800 | 1798,삠,5
1801 | 1799,뱄,5
1802 | 1800,뭥,5
1803 | 1801,멨,5
1804 | 1802,먀,5
1805 | 1803,랒,5
1806 | 1804,땟,5
1807 | 1805,땁,5
1808 | 1806,뉠,5
1809 | 1807,꽌,5
1810 | 1808,귐,5
1811 | 1809,굥,5
1812 | 1810,굣,5
1813 | 1811,겋,5
1814 | 1812,8,5
1815 | 1813,5,5
1816 | 1814,훙,4
1817 | 1815,혠,4
1818 | 1816,폔,4
1819 | 1817,튬,4
1820 | 1818,퉜,4
1821 | 1819,탉,4
1822 | 1820,큽,4
1823 | 1821,퀀,4
1824 | 1822,쿰,4
1825 | 1823,켤,4
1826 | 1824,켕,4
1827 | 1825,칡,4
1828 | 1826,쯘,4
1829 | 1827,짯,4
1830 | 1828,짔,4
1831 | 1829,쥔,4
1832 | 1830,줜,4
1833 | 1831,죗,4
1834 | 1832,욧,4
1835 | 1833,쒸,4
1836 | 1834,쏼,4
1837 | 1835,쌂,4
1838 | 1836,셴,4
1839 | 1837,샨,4
1840 | 1838,뽜,4
1841 | 1839,뻠,4
1842 | 1840,빝,4
1843 | 1841,봥,4
1844 | 1842,뫼,4
1845 | 1843,맜,4
1846 | 1844,맀,4
1847 | 1845,뤠,4
1848 | 1846,띌,4
1849 | 1847,띈,4
1850 | 1848,뛌,4
1851 | 1849,뚸,4
1852 | 1850,떽,4
1853 | 1851,떰,4
1854 | 1852,떈,4
1855 | 1853,땍,4
1856 | 1854,돠,4
1857 | 1855,뎄,4
1858 | 1856,늣,4
1859 | 1857,뇸,4
1860 | 1858,놥,4
1861 | 1859,녈,4
1862 | 1860,넒,4
1863 | 1861,낏,4
1864 | 1862,갉,4
1865 | 1863,z,4
1866 | 1864,힛,3
1867 | 1865,휀,3
1868 | 1866,헙,3
1869 | 1867,픗,3
1870 | 1868,퐈,3
1871 | 1869,팹,3
1872 | 1870,틔,3
1873 | 1871,퇸,3
1874 | 1872,텟,3
1875 | 1873,탰,3
1876 | 1874,퀼,3
1877 | 1875,칬,3
1878 | 1876,췰,3
1879 | 1877,쳥,3
1880 | 1878,챂,3
1881 | 1879,찧,3
1882 | 1880,쭙,3
1883 | 1881,쬬,3
1884 | 1882,쩄,3
1885 | 1883,쨉,3
1886 | 1884,줴,3
1887 | 1885,죵,3
1888 | 1886,좍,3
1889 | 1887,좃,3
1890 | 1888,읐,3
1891 | 1889,읃,3
1892 | 1890,웟,3
1893 | 1891,왯,3
1894 | 1892,옭,3
1895 | 1893,씅,3
1896 | 1894,쒀,3
1897 | 1895,쑹,3
1898 | 1896,쏵,3
1899 | 1897,쎔,3
1900 | 1898,쎅,3
1901 | 1899,슾,3
1902 | 1900,쉣,3
1903 | 1901,솨,3
1904 | 1902,셸,3
1905 | 1903,셩,3
1906 | 1904,삯,3
1907 | 1905,뿟,3
1908 | 1906,뾱,3
1909 | 1907,뽝,3
1910 | 1908,뻰,3
1911 | 1909,뵤,3
1912 | 1910,볏,3
1913 | 1911,뱌,3
1914 | 1912,뱁,3
1915 | 1913,밨,3
1916 | 1914,밎,3
1917 | 1915,믐,3
1918 | 1916,뭡,3
1919 | 1917,뭠,3
1920 | 1918,뭍,3
1921 | 1919,묭,3
1922 | 1920,뫠,3
1923 | 1921,멏,3
1924 | 1922,멎,3
1925 | 1923,맽,3
1926 | 1924,뙤,3
1927 | 1925,떔,3
1928 | 1926,땃,3
1929 | 1927,덷,3
1930 | 1928,댜,3
1931 | 1929,늰,3
1932 | 1930,놘,3
1933 | 1931,냘,3
1934 | 1932,뀜,3
1935 | 1933,꽹,3
1936 | 1934,꽐,3
1937 | 1935,꼇,3
1938 | 1936,껸,3
1939 | 1937,껨,3
1940 | 1938,꺅,3
1941 | 1939,곘,3
1942 | 1940,겝,3
1943 | 1941,흣,2
1944 | 1942,흝,2
1945 | 1943,흄,2
1946 | 1944,헸,2
1947 | 1945,헜,2
1948 | 1946,햘,2
1949 | 1947,햑,2
1950 | 1948,핟,2
1951 | 1949,픙,2
1952 | 1950,픕,2
1953 | 1951,푀,2
1954 | 1952,퐉,2
1955 | 1953,틑,2
1956 | 1954,틍,2
1957 | 1955,큣,2
1958 | 1956,퀏,2
1959 | 1957,콴,2
1960 | 1958,켁,2
1961 | 1959,칟,2
1962 | 1960,츳,2
1963 | 1961,츨,2
1964 | 1962,츈,2
1965 | 1963,쵝,2
1966 | 1964,촙,2
1967 | 1965,찦,2
1968 | 1966,찟,2
1969 | 1967,쭁,2
1970 | 1968,쫏,2
1971 | 1969,쩬,2
1972 | 1970,쩧,2
1973 | 1971,짆,2
1974 | 1972,쥰,2
1975 | 1973,쥘,2
1976 | 1974,좐,2
1977 | 1975,졍,2
1978 | 1976,쟝,2
1979 | 1977,잧,2
1980 | 1978,잍,2
1981 | 1979,욍,2
1982 | 1980,왬,2
1983 | 1981,옫,2
1984 | 1982,옜,2
1985 | 1983,옉,2
1986 | 1984,엤,2
1987 | 1985,얐,2
1988 | 1986,얉,2
1989 | 1987,쒯,2
1990 | 1988,쑐,2
1991 | 1989,쏸,2
1992 | 1990,썻,2
1993 | 1991,쌕,2
1994 | 1992,싷,2
1995 | 1993,솝,2
1996 | 1994,솎,2
1997 | 1995,샅,2
1998 | 1996,삻,2
1999 | 1997,쁩,2
2000 | 1998,빘,2
2001 | 1999,뷸,2
2002 | 2000,뷴,2
2003 | 2001,봽,2
2004 | 2002,봈,2
2005 | 2003,볌,2
2006 | 2004,벋,2
2007 | 2005,믕,2
2008 | 2006,뮴,2
2009 | 2007,뭬,2
2010 | 2008,뭑,2
2011 | 2009,묜,2
2012 | 2010,맇,2
2013 | 2011,릈,2
2014 | 2012,륑,2
2015 | 2013,랮,2
2016 | 2014,랟,2
2017 | 2015,띔,2
2018 | 2016,떱,2
2019 | 2017,듈,2
2020 | 2018,됴,2
2021 | 2019,됫,2
2022 | 2020,됙,2
2023 | 2021,됑,2
2024 | 2022,댤,2
2025 | 2023,닯,2
2026 | 2024,늗,2
2027 | 2025,뉩,2
2028 | 2026,눟,2
2029 | 2027,눗,2
2030 | 2028,넚,2
2031 | 2029,냡,2
2032 | 2030,낢,2
2033 | 2031,꾿,2
2034 | 2032,꼳,2
2035 | 2033,꼄,2
2036 | 2034,겊,2
2037 | 2035,갼,2
2038 | 2036,6,2
2039 | 2037,,0
2040 | 2038, ,0
2041 | 2039,_,0
2042 |
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 | import torch
4 | import torch.nn as nn
5 | import queue
6 | import math
7 | from torch import optim
8 | from package.config import Config
9 | from package.definition import char2id, logger, SOS_token, EOS_token, PAD_token
10 | from package.data_loader import CustomDataset, load_corpus, CustomDataLoader
11 | from package.evaluator import evaluate
12 | from package.loss import Perplexity
13 | from package.trainer import supervised_train
14 | from model import LanguageModel
15 |
16 | # Character-level Recurrent Neural Network Language Model implemented in PyTorch
17 | # https://github.com/sooftware/char-rnnlm
18 |
19 | if __name__ == '__main__':
20 |     os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # if you use Multi-GPU, delete this line
21 |     # get_device_name() raises without a CUDA device, so only query it when CUDA is available
22 |     if torch.cuda.is_available():
23 |         logger.info("device : %s" % torch.cuda.get_device_name(0))
24 |     logger.info("CUDA is available : %s" % (torch.cuda.is_available()))
25 |     logger.info("CUDA version : %s" % torch.version.cuda)
26 |     logger.info("PyTorch version : %s" % torch.__version__)
27 |
28 |     config = Config(
29 |         use_cuda=True,
30 |         hidden_size=512,
31 |         dropout_p=0.5,
32 |         n_layers=4,
33 |         batch_size=16,
34 |         max_epochs=40,
35 |         lr=0.0001,
36 |         teacher_forcing_ratio=1.0,
37 |         seed=1,
38 |         max_len=428,
39 |         worker_num=1
40 |     )
41 |
42 |     torch.manual_seed(config.seed)
43 |     torch.cuda.manual_seed_all(config.seed)
44 |     cuda = config.use_cuda and torch.cuda.is_available()
45 |     device = torch.device('cuda' if cuda else 'cpu')
46 |
47 |     model = LanguageModel(
48 |         n_class=len(char2id),
49 |         n_layers=config.n_layers,
50 |         rnn_cell='lstm',
51 |         hidden_size=config.hidden_size,
52 |         dropout_p=config.dropout_p,
53 |         max_length=config.max_len,
54 |         sos_id=SOS_token,
55 |         eos_id=EOS_token,
56 |         device=device
57 |     )
58 |     model.flatten_parameters()
59 |     model = nn.DataParallel(model).to(device)
60 |
61 |     # Initialize every parameter uniformly in [-0.08, 0.08]
62 |     for param in model.parameters():
63 |         param.data.uniform_(-0.08, 0.08)
64 |
65 |     # Prepare loss
66 |     weight = torch.ones(len(char2id)).to(device)
67 |     perplexity = Perplexity(weight, PAD_token, device)
68 |     optimizer = optim.Adam(model.module.parameters(), lr=config.lr)
69 |
70 |     corpus = load_corpus('./data/corpus_df.bin')
71 |     total_time_step = math.ceil(len(corpus) / config.batch_size)
72 |
73 |     # The last 10,000 sentences are held out for validation
74 |     train_set = CustomDataset(corpus[:-10000], SOS_token, EOS_token, config.batch_size)
75 |     valid_set = CustomDataset(corpus[-10000:], SOS_token, EOS_token, config.batch_size)
76 |
77 |     logger.info('start')
78 |     train_begin = time.time()
79 |
80 |     for epoch in range(config.max_epochs):
81 |         train_queue = queue.Queue(config.worker_num << 1)
82 |         train_set.shuffle()
83 |
84 |         train_loader = CustomDataLoader(train_set, train_queue, config.batch_size, 0)
85 |         train_loader.start()
86 |
87 |         train_loss = supervised_train(
88 |             model=model,
89 |             queue=train_queue,
90 |             epoch=epoch,
91 |             total_time_step=total_time_step,
92 |             train_begin=train_begin,
93 |             perplexity=perplexity,
94 |             optimizer=optimizer,
95 |             device=device,
96 |             print_every=10,
97 |             teacher_forcing_ratio=config.teacher_forcing_ratio,
98 |             worker_num=config.worker_num
99 |         )
100 |
101 |         torch.save(model, "./data/epoch%s.pt" % str(epoch))
102 |         logger.info('Epoch %d (Training) Loss %0.4f' % (epoch, train_loss))
103 |         train_loader.join()
104 |
105 |         valid_queue = queue.Queue(config.worker_num << 1)
106 |         valid_loader = CustomDataLoader(valid_set, valid_queue, config.batch_size, 0)
107 |         valid_loader.start()
108 |
109 |         valid_loss = evaluate(model, valid_queue, perplexity, device)
110 |         valid_loader.join()
111 |
112 |         logger.info('Epoch %d (Evaluate) Loss %0.4f' % (epoch, valid_loss))
--------------------------------------------------------------------------------
/model.py:
--------------------------------------------------------------------------------
1 | import random
2 | import torch
3 | import torch.nn as nn
4 | import torch.nn.functional as F
5 |
6 |
7 | class LanguageModel(nn.Module):
8 |     def __init__(self, n_class, n_layers, rnn_cell, hidden_size, dropout_p, max_length, sos_id, eos_id, device):
9 |
10 |         super(LanguageModel, self).__init__()
11 |         assert rnn_cell.lower() in ('lstm', 'gru', 'rnn')
12 |
13 |         self.rnn_cell = nn.LSTM if rnn_cell.lower() == 'lstm' else nn.GRU if rnn_cell.lower() == 'gru' else nn.RNN
14 |         self.rnn = self.rnn_cell(hidden_size, hidden_size, n_layers, batch_first=True, dropout=dropout_p).to(device)
15 |         self.max_length = max_length
16 |         self.eos_id = eos_id
17 |         self.sos_id = sos_id
18 |         self.hidden_size = hidden_size
19 |         self.embedding = nn.Embedding(n_class, hidden_size)
20 |         self.n_layers = n_layers
21 |         self.input_dropout = nn.Dropout(p=dropout_p)
22 |         self.out = nn.Linear(self.hidden_size, n_class)
23 |         self.device = device
24 |
25 |     def forward_step(self, input, hidden, function=F.log_softmax):
26 |         """ forward one time step """
27 |         batch_size = input.size(0)
28 |         seq_length = input.size(1)
29 |
30 |         embedded = self.embedding(input).to(self.device)
31 |         embedded = self.input_dropout(embedded)
32 |
33 |         if self.training:
34 |             self.rnn.flatten_parameters()
35 |
36 |         output, hidden = self.rnn(embedded, hidden)
37 |
38 |         predicted_softmax = function(self.out(output.contiguous().view(-1, self.hidden_size)), dim=1)
39 |         predicted_softmax = predicted_softmax.view(batch_size, seq_length, -1)
40 |
41 |         return predicted_softmax, hidden
42 |
43 |     def forward(self, inputs, teacher_forcing_ratio=1.0, function=F.log_softmax):
44 |         batch_size = inputs.size(0)
45 |         max_length = inputs.size(1) - 1  # minus the start of sequence symbol
46 |
47 |         outputs = list()
48 |         use_teacher_forcing = random.random() < teacher_forcing_ratio
49 |
50 |         hidden = self._init_state(batch_size)
51 |
52 |         if use_teacher_forcing:
53 |             # Feed the whole ground-truth sequence through the RNN in one call
54 |             inputs = inputs[inputs != self.eos_id].view(batch_size, -1)
55 |             predicted_softmax, hidden = self.forward_step(
56 |                 input=inputs,
57 |                 hidden=hidden,
58 |                 function=function
59 |             )
60 |
61 |             for di in range(predicted_softmax.size(1)):
62 |                 step_output = predicted_softmax[:, di, :]
63 |                 outputs.append(step_output)
64 |
65 |         else:
66 |             # Feed the model's own top-1 prediction back in, one step at a time
67 |             input = inputs[:, 0].unsqueeze(1)
68 |             for di in range(max_length):
69 |                 predicted_softmax, hidden = self.forward_step(
70 |                     input=input,
71 |                     hidden=hidden,
72 |                     function=function
73 |                 )
74 |
75 |                 step_output = predicted_softmax.squeeze(1)
76 |                 outputs.append(step_output)
77 |                 input = outputs[-1].topk(1)[1]
78 |
79 |         return outputs
80 |
81 |     def _init_state(self, batch_size):
82 |         if isinstance(self.rnn, nn.LSTM):
83 |             h_0 = torch.zeros(self.n_layers, batch_size, self.hidden_size).to(self.device)
84 |             c_0 = torch.zeros(self.n_layers, batch_size, self.hidden_size).to(self.device)
85 |             hidden = (h_0, c_0)
86 |
87 |         else:
88 |             hidden = torch.zeros(self.n_layers, batch_size, self.hidden_size).to(self.device)
89 |
90 |         return hidden
91 |
92 |     def flatten_parameters(self):
93 |         self.rnn.flatten_parameters()
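The two branches of `forward` differ only in what each step receives as input: the ground-truth previous token (teacher forcing) or the model's own previous prediction (free running). The sketch below illustrates that difference in plain Python. It is not the repository's code: `decode`, `next_token`, and the lookup-table "model" are hypothetical stand-ins chosen so the example runs without PyTorch.

```python
def decode(next_token, first_token, targets, teacher_forcing):
    """Predict len(targets) tokens, one step at a time.

    next_token: hypothetical stand-in for one model step (previous id -> predicted id).
    teacher_forcing: if True, each step is fed the ground-truth token of the
                     previous step; otherwise it is fed the previous prediction.
    """
    predictions = []
    prev = first_token
    for target in targets:
        pred = next_token(prev)
        predictions.append(pred)
        # This single line is the whole difference between the two modes.
        prev = target if teacher_forcing else pred
    return predictions


# Toy "model": a lookup table that is wrong about what follows token 1.
table = {0: 1, 1: 3, 2: 3, 3: 0}
targets = [1, 2, 3]  # ground-truth continuation after the start token 0

forced = decode(table.get, 0, targets, teacher_forcing=True)
free = decode(table.get, 0, targets, teacher_forcing=False)
print(forced, free)
```

With teacher forcing the mistake at step two does not change the input to step three, whereas in free-running mode the wrong prediction is fed back in and the sequences diverge.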
--------------------------------------------------------------------------------
/package/__pycache__/config.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/config.cpython-37.pyc
--------------------------------------------------------------------------------
/package/__pycache__/data_loader.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/data_loader.cpython-37.pyc
--------------------------------------------------------------------------------
/package/__pycache__/definition.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/definition.cpython-37.pyc
--------------------------------------------------------------------------------
/package/__pycache__/evaluator.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/evaluator.cpython-37.pyc
--------------------------------------------------------------------------------
/package/__pycache__/loss.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/loss.cpython-37.pyc
--------------------------------------------------------------------------------
/package/__pycache__/trainer.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/trainer.cpython-37.pyc
--------------------------------------------------------------------------------
/package/__pycache__/utils.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/utils.cpython-37.pyc
--------------------------------------------------------------------------------
/package/config.py:
--------------------------------------------------------------------------------
1 | class Config:
2 |     def __init__(self,
3 |                  use_cuda=True,
4 |                  hidden_size=512,
5 |                  dropout_p=0.5,
6 |                  n_layers=4,
7 |                  batch_size=32,
8 |                  max_epochs=40,
9 |                  lr=0.0001,
10 |                  teacher_forcing_ratio=1.0,
11 |                  seed=1,
12 |                  max_len=428,
13 |                  worker_num=1
14 |                  ):
15 |         self.use_cuda = use_cuda
16 |         self.hidden_size = hidden_size
17 |         self.dropout_p = dropout_p
18 |         self.n_layers = n_layers
19 |         self.batch_size = batch_size
20 |         self.max_epochs = max_epochs
21 |         self.lr = lr
22 |         self.teacher_forcing_ratio = teacher_forcing_ratio
23 |         self.seed = seed
24 |         self.max_len = max_len
25 |         self.worker_num = worker_num
26 |
--------------------------------------------------------------------------------
/package/data_loader.py:
--------------------------------------------------------------------------------
1 | import threading
2 | import csv
3 | import random
4 | import pandas as pd
5 | import torch
6 | import math
7 | import pickle
8 | from torch.utils.data import Dataset
9 | from package.definition import logger
10 | from package.utils import get_label, get_input
11 |
12 |
13 | class CustomDataLoader(threading.Thread):
14 |     def __init__(self, dataset, queue, batch_size, thread_id):
15 |         threading.Thread.__init__(self)
16 |         self.collate_fn = _collate_fn
17 |         self.dataset = dataset
18 |         self.batch_size = batch_size
19 |         self.queue = queue
20 |         self.index = 0
21 |         self.thread_id = thread_id
22 |         self.dataset_count = dataset.count()
23 |
24 |     def create_empty_batch(self):
25 |         # An empty batch (shape[0] == 0) signals the consumer that this loader is done
26 |         sequences = torch.zeros(0, 0, 0).to(torch.long)
27 |         targets = torch.zeros(0, 0, 0).to(torch.long)
28 |
29 |         sequence_lengths = list()
30 |         target_lengths = list()
31 |
32 |         return sequences, targets, sequence_lengths, target_lengths
33 |
34 |     def run(self):
35 |         logger.debug('loader %d start' % self.thread_id)
36 |         while True:
37 |             items = list()
38 |
39 |             for _ in range(self.batch_size):
40 |                 if self.index >= self.dataset_count:
41 |                     break
42 |
43 |                 input, label = self.dataset.get_item(self.index)
44 |
45 |                 if input is not None:
46 |                     items.append((input, label))
47 |
48 |                 self.index += 1
49 |
50 |             if len(items) == 0:
51 |                 batch = self.create_empty_batch()
52 |                 self.queue.put(batch)
53 |                 break
54 |
55 |             random.shuffle(items)
56 |
57 |             batch = self.collate_fn(items)
58 |             self.queue.put(batch)
59 |
60 |         logger.debug('loader %d stop' % self.thread_id)
61 |
62 |
63 | def _collate_fn(batch):
64 |     """ Pad every input and target in the batch to the longest one in that batch """
65 |
66 |     def seq_length_(p):
67 |         return len(p[0])
68 |
69 |     def target_length_(p):
70 |         return len(p[1])
71 |
72 |     seq_lengths = [len(s[0]) for s in batch]
73 |     target_lengths = [len(s[1]) for s in batch]
74 |
75 |     max_seq_sample = max(batch, key=seq_length_)[0]
76 |     max_target_sample = max(batch, key=target_length_)[1]
77 |
78 |     # len() works whether the samples are lists or 1-D tensors
79 |     max_seq_size = len(max_seq_sample)
80 |     max_target_size = len(max_target_sample)
81 |     batch_size = len(batch)
82 |
83 |     seqs = torch.zeros(batch_size, max_seq_size).to(torch.long)
84 |     targets = torch.zeros(batch_size, max_target_size).to(torch.long)
85 |
86 |     # Imported here rather than at module level because package.definition imports this module
87 |     from package.definition import PAD_token
88 |     targets.fill_(PAD_token)
89 |     seqs.fill_(PAD_token)
90 |
91 |     for idx in range(batch_size):
92 |         sample = batch[idx]
93 |         tensor = sample[0]
94 |         target = sample[1]
95 |
96 |         seqs[idx].narrow(0, 0, len(tensor)).copy_(torch.LongTensor(tensor))
97 |         targets[idx].narrow(0, 0, len(target)).copy_(torch.LongTensor(target))
98 |
99 |     return seqs, targets, seq_lengths, target_lengths
100 |
101 |
102 | class CustomDataset(Dataset):
103 |     def __init__(self, corpus, sos_id, eos_id, batch_size):
104 |         self.corpus = corpus
105 |         self.sos_id = sos_id
106 |         self.eos_id = eos_id
107 |         self.batch_size = batch_size
108 |
109 |     def get_item(self, index):
110 |         input = get_input(self.corpus[index], self.sos_id)
111 |         label = get_label(self.corpus[index], self.eos_id)
112 |
113 |         return input, label
114 |
115 |     def shuffle(self):
116 |         random.shuffle(self.corpus)
117 |
118 |     def count(self):
119 |         return len(self.corpus)
120 |
121 |
122 | def load_label(label_path, encoding='utf-8'):
123 |     char2id = dict()
124 |     id2char = dict()
125 |
126 |     with open(label_path, 'r', encoding=encoding) as f:
127 |         labels = csv.reader(f, delimiter=',')
128 |         next(labels)  # skip the header row
129 |
130 |         for row in labels:
131 |             # char2id keeps the id as a string; callers convert with int()
132 |             char2id[row[1]] = row[0]
133 |             id2char[int(row[0])] = row[1]
134 |
135 |     return char2id, id2char
136 |
137 |
138 | def load_corpus(filepath):
139 |     with open(filepath, 'rb') as f:
140 |         corpus = pickle.load(f)
141 |         corpus = list(corpus['id'])
142 |
143 |     return corpus
144 |
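The padding done by `_collate_fn` can be illustrated without tensors. The following is a minimal sketch in plain Python, assuming sequences are lists of integer ids; `pad_batch` and the sample values are hypothetical and not part of the repository.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id to the length of the longest one.

    Returns the padded batch and the original lengths, mirroring the
    (seqs, seq_lengths) part of what a collate function hands to the trainer.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


batch, lengths = pad_batch([[5, 6, 7], [8], [9, 10]], pad_id=0)
print(batch, lengths)
```

Keeping the original lengths alongside the padded batch lets later stages ignore the padding positions, which is also why the loss masks the padding id.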
--------------------------------------------------------------------------------
/package/definition.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import sys
3 | logger = logging.getLogger('root')
4 | FORMAT = "[%(asctime)s %(filename)s:%(lineno)s - %(funcName)s()] %(message)s"
5 | logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format=FORMAT)
6 | logger.setLevel(logging.INFO)
7 | from package.data_loader import load_label
8 | char2id, id2char = load_label('./data/train_labels.csv', encoding='utf-8')
9 | SOS_token = int(char2id[''])
10 | EOS_token = int(char2id[' '])
11 | PAD_token = int(char2id['_'])
12 | train_dict = {'loss': [], 'cer': []}
13 | valid_dict = {'loss': [], 'cer': []}
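The vocabulary lookup built by `load_label` can be tried on an in-memory CSV. This is a sketch under stated assumptions: the header names and the three sample rows are made up for illustration, and `load_label_rows` is a hypothetical variant that stores ids as integers, whereas the repository's `load_label` keeps them as strings and converts with `int()` at the call site.

```python
import csv
import io


def load_label_rows(lines):
    """Build char -> id and id -> char tables from rows of (id, char, freq)."""
    reader = csv.reader(lines, delimiter=',')
    next(reader)  # skip the header row
    char2id, id2char = {}, {}
    for idx, char, _freq in reader:
        char2id[char] = int(idx)
        id2char[int(idx)] = char
    return char2id, id2char


# Hypothetical sample data; the real file has roughly two thousand rows.
sample = io.StringIO("id,char,freq\n0,a,10\n1,b,7\n2,_,0\n")
char2id, id2char = load_label_rows(sample)
print(char2id, id2char)
```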
--------------------------------------------------------------------------------
/package/evaluator.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from package.definition import logger
3 |
4 |
5 | def evaluate(model, queue, perplexity, device):
6 |     logger.info('evaluate() start')
7 |
8 |     total_loss = 0
9 |     total_num = 0
10 |
11 |     model.eval()
12 |
13 |     with torch.no_grad():
14 |         while True:
15 |             loss = perplexity
16 |
17 |             inputs, targets, input_lengths, target_lengths = queue.get()
18 |
19 |             # An empty batch marks the end of the validation data
20 |             if inputs.shape[0] == 0:
21 |                 break
22 |
23 |             inputs = inputs.to(device)
24 |             targets = targets.to(device)
25 |
26 |             model.module.flatten_parameters()
27 |             outputs = model(inputs, teacher_forcing_ratio=0.0)
28 |
29 |             loss.reset()
30 |             for step, step_output in enumerate(outputs):
31 |                 batch_size = targets.size(0)
32 |                 loss.eval_batch(step_output.contiguous().view(batch_size, -1), targets[:, step])
33 |
34 |             loss = loss.get_loss()
35 |
36 |             total_loss += loss
37 |             # Fixed: the unpacked name is input_lengths (input_lens was undefined here)
38 |             total_num += sum(input_lengths)
39 |
40 |     logger.info('evaluate() completed')
41 |
42 |     return total_loss / total_num
43 |
--------------------------------------------------------------------------------
/package/loss.py:
--------------------------------------------------------------------------------
1 | # Refer to IBM/pytorch-seq2seq
2 | # https://github.com/IBM/pytorch-seq2seq/blob/master/seq2seq/loss/loss.py
3 |
4 | import torch
5 | import math
6 | import torch.nn as nn
7 | import numpy as np
8 |
9 |
10 | class Loss(object):
11 |     """ Base class for encapsulation of the loss functions.
12 |     This class defines interfaces that are commonly used with loss functions
13 |     in training and inference. For information regarding individual loss
14 |     functions, please refer to http://pytorch.org/docs/master/nn.html#loss-functions
15 |     Note:
16 |         Do not use this class directly, use one of the sub classes.
17 |     Args:
18 |         name (str): name of the loss function used by logging messages.
19 |         criterion (torch.nn._Loss): one of PyTorch's loss functions. Refer
20 |             to http://pytorch.org/docs/master/nn.html#loss-functions for
21 |             a list of them.
22 |     Attributes:
23 |         name (str): name of the loss function used by logging messages.
24 |         criterion (torch.nn._Loss): one of PyTorch's loss functions. Refer
25 |             to http://pytorch.org/docs/master/nn.html#loss-functions for
26 |             a list of them. Implementation depends on individual
27 |             sub-classes.
28 |         acc_loss (int or torch.Tensor): variable that stores accumulated loss.
29 |         norm_term (float): normalization term that can be used to calculate
30 |             the loss of multiple batches. Implementation depends on individual
31 |             sub-classes.
32 |     """
33 |
34 |     def __init__(self, name, criterion):
35 |         self.name = name
36 |         self.criterion = criterion
37 |         if not issubclass(type(self.criterion), nn.modules.loss._Loss):
38 |             raise ValueError("Criterion has to be a subclass of torch.nn._Loss")
39 |         # accumulated loss
40 |         self.acc_loss = 0
41 |         # normalization term
42 |         self.norm_term = 0
43 |
44 |     def reset(self):
45 |         """ Reset the accumulated loss. """
46 |         self.acc_loss = 0
47 |         self.norm_term = 0
48 |
49 |     def get_loss(self):
50 |         """ Get the loss.
51 |         This method defines how to calculate the averaged loss given the
52 |         accumulated loss and the normalization term. Override to define your
53 |         own logic.
54 |         Returns:
55 |             loss (float): value of the loss.
56 |         """
57 |         raise NotImplementedError
58 |
59 |     def eval_batch(self, outputs, target):
60 |         """ Evaluate and accumulate loss given outputs and expected results.
61 |         This method is called after each batch with the batch outputs and
62 |         the target (expected) results. The loss and normalization term are
63 |         accumulated in this method. Override it to define your own accumulation
64 |         method.
65 |         Args:
66 |             outputs (torch.Tensor): outputs of a batch.
67 |             target (torch.Tensor): expected output of a batch.
68 |         """
69 |         raise NotImplementedError
70 |
71 |     def cuda(self):
72 |         self.criterion.cuda()
73 |
74 |     def backward(self):
75 |         if type(self.acc_loss) is int:
76 |             raise ValueError("No loss to back propagate.")
77 |         self.acc_loss.backward()
78 |
79 |
80 | class NLLLoss(Loss):
81 |     """ Batch averaged negative log-likelihood loss.
82 |     Args:
83 |         weight (torch.Tensor, optional): refer to http://pytorch.org/docs/master/nn.html#nllloss
84 |         mask (int, optional): index of masked token, i.e. weight[mask] = 0.
85 |         size_average (bool, optional): refer to http://pytorch.org/docs/master/nn.html#nllloss
86 |     """
87 |
88 |     _NAME = "Avg NLLLoss"
89 |
90 |     def __init__(self, weight=None, mask=None, size_average=True):
91 |         self.mask = mask
92 |         self.size_average = size_average
93 |         if mask is not None:
94 |             if weight is None:
95 |                 raise ValueError("Must provide weight with a mask.")
96 |             weight[mask] = 0
97 |
98 |         super(NLLLoss, self).__init__(
99 |             self._NAME,
100 |             nn.NLLLoss(weight=weight, reduction='sum'))
101 |
102 |     def get_loss(self):
103 |         if isinstance(self.acc_loss, int):
104 |             return 0
105 |         # total loss for all batches
106 |         loss = self.acc_loss.data.item()
107 |         if self.size_average:
108 |             # average loss per batch
109 |             loss /= self.norm_term
110 |         return loss
111 |
112 |     def eval_batch(self, outputs, target):
113 |         self.acc_loss += self.criterion(outputs, target)
114 |         self.norm_term += 1
115 |
116 |
117 | class Perplexity(NLLLoss):
118 |     """ Language model perplexity loss.
119 |     Perplexity is the token averaged likelihood. When the averaging options are the
120 |     same, it is the exponential of negative log-likelihood.
121 |     Args:
122 |         weight (torch.Tensor, optional): refer to http://pytorch.org/docs/master/nn.html#nllloss
123 |         mask (int, optional): index of masked token, i.e. weight[mask] = 0.
124 |     """
125 |
126 |     _NAME = "Perplexity"
127 |     _MAX_EXP = 100
128 |
129 |     def __init__(self, weight=None, mask=None, device=None):
130 |         super(Perplexity, self).__init__(weight=weight, mask=mask, size_average=False)
131 |         self.device = device
132 |
133 |     def eval_batch(self, outputs, target):
134 |         self.acc_loss += self.criterion(outputs, target).to(self.device)
135 |         # Normalize by the number of tokens, excluding masked (padding) positions
136 |         if self.mask is None:
137 |             self.norm_term += np.prod(target.size())
138 |         else:
139 |             self.norm_term += target.data.ne(self.mask).sum()
140 |
141 |     def get_loss(self):
142 |         nll = super(Perplexity, self).get_loss()
143 |         nll /= self.norm_term.item()
144 |         if nll > Perplexity._MAX_EXP:
145 |             print("WARNING: Loss exceeded maximum value, capping to e^100")
146 |             return math.exp(Perplexity._MAX_EXP)
147 |         return math.exp(nll)
148 |
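The quantity that `Perplexity` reports is the exponential of the summed negative log-likelihood divided by the number of non-masked tokens, capped at e^100. The sketch below works through that arithmetic in plain Python. The function name `perplexity`, its argument layout, and the NaN return for an all-padding input are assumptions made for illustration, not the repository's API.

```python
import math


def perplexity(log_probs, targets, pad_id, max_exp=100.0):
    """Perplexity of a token sequence, ignoring padding positions.

    log_probs[t][k] is the log-probability of class k at position t;
    targets[t] is the true class id at position t.
    """
    nll = 0.0
    count = 0
    for step_log_probs, target in zip(log_probs, targets):
        if target == pad_id:
            continue  # masked positions add neither loss nor count
        nll -= step_log_probs[target]
        count += 1
    if count == 0:
        return float("nan")
    # Cap the exponent so a diverged model cannot overflow math.exp
    return math.exp(min(nll / count, max_exp))


# A model that is uniform over 4 classes has perplexity 4 on any target.
uniform = [math.log(0.25)] * 4
ppl = perplexity([uniform, uniform, uniform], targets=[1, 2, 0], pad_id=0)
print(ppl)
```

A uniform model over V classes always scores a perplexity of V, which makes this a convenient sanity check for a loss implementation.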
--------------------------------------------------------------------------------
/package/trainer.py:
--------------------------------------------------------------------------------
1 | import time
2 | import torch
3 | from package.definition import logger
4 |
5 | train_step_result = {'loss': [], 'cer': []}
6 |
7 |
8 | def supervised_train(model, queue, perplexity, optimizer, device, print_every, epoch,
9 |                      teacher_forcing_ratio, worker_num, total_time_step, train_begin):
10 |     print_loss_total = 0  # Reset every print_every
11 |     epoch_loss_total = 0  # Reset every epoch
12 |     total_num = 0
13 |     time_step = 0
14 |
15 |     model.train()
16 |     begin = epoch_begin = time.time()
17 |
18 |     while True:
19 |         loss = perplexity
20 |         inputs, targets, input_lens, target_lens = queue.get()
21 |
22 |         if inputs.shape[0] == 0:
23 |             # empty feats means closing one loader
24 |             worker_num -= 1
25 |             logger.debug('left train_loader: %d' % worker_num)
26 |
27 |             if worker_num == 0:
28 |                 break
29 |             else:
30 |                 continue
31 |
32 |         inputs = inputs.to(device)
33 |         targets = targets.to(device)
34 |
35 |         model.module.flatten_parameters()
36 |         outputs = model(inputs, teacher_forcing_ratio=teacher_forcing_ratio)
37 |
38 |         # Get loss
39 |         loss.reset()
40 |         for step, step_output in enumerate(outputs):
41 |             batch_size = targets.size(0)
42 |             loss.eval_batch(step_output.contiguous().view(batch_size, -1), targets[:, step])
43 |         # Backpropagation
44 |         model.zero_grad()
45 |         loss.backward()
46 |         optimizer.step()
47 |         loss = loss.get_loss()
48 |
49 |         epoch_loss_total += loss
50 |         print_loss_total += loss
51 |         total_num += sum(input_lens)
52 |
53 |         time_step += 1
54 |         torch.cuda.empty_cache()
55 |
56 |         if time_step % print_every == 0:
57 |             current = time.time()
58 |             elapsed = current - begin
59 |             epoch_elapsed = (current - epoch_begin) / 60.0
60 |             train_elapsed = (current - train_begin) / 3600.0
61 |
62 |             logger.info('timestep: {:4d}/{:4d}, perplexity: {:.4f}, elapsed: {:.2f}s {:.2f}m {:.2f}h'.format(
63 |                 time_step,
64 |                 total_time_step,
65 |                 print_loss_total / print_every,
66 |                 elapsed, epoch_elapsed, train_elapsed
67 |             ))
68 |             print_loss_total = 0
69 |             begin = time.time()
70 |
71 |         if time_step % 50000 == 0:
72 |             torch.save(model, "./data/epoch%s_%s.pt" % (str(epoch), str(time_step)))
73 |
74 |     logger.info('train() completed')
75 |
76 |     return epoch_loss_total / total_num
77 |
--------------------------------------------------------------------------------
/package/utils.py:
--------------------------------------------------------------------------------
1 | import Levenshtein as Lev
2 | import torch
3 |
4 |
5 | def get_label(script, eos_id):
6 |     tokens = script.split()
7 |
8 |     label = list()
9 |     for token in tokens:
10 |         label.append(int(token))
11 |     label.append(int(eos_id))
12 |
13 |     return label
14 |
15 |
16 | def get_input(script, sos_id):
17 |     tokens = script.split()
18 |
19 |     label = list()
20 |     label.append(int(sos_id))
21 |     for token in tokens:
22 |         label.append(int(token))
23 |
24 |     return torch.LongTensor(label)
25 |
26 |
27 | def label_to_string(labels, id2char, eos_id):
28 |     if len(labels.shape) == 1:
29 |         sentence = str()
30 |         for label in labels:
31 |             if label.item() == eos_id:
32 |                 break
33 |             sentence += id2char[label.item()]
34 |         return sentence
35 |
36 |     elif len(labels.shape) == 2:
37 |         sentences = list()
38 |         for batch in labels:
39 |             sentence = str()
40 |             for label in batch:
41 |                 if label.item() == eos_id:
42 |                     break
43 |                 sentence += id2char[label.item()]
44 |             sentences.append(sentence)
45 |         return sentences
46 |
47 |     else:
48 |         raise ValueError("shape Error !!")
49 |
50 |
51 | def char_distance(target, y_hat):
52 |     target = target.replace(' ', '')
53 |     y_hat = y_hat.replace(' ', '')
54 |
55 |     dist = Lev.distance(y_hat, target)
56 |     length = len(target)
57 |
58 |     return dist, length
59 |
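The `(dist, length)` pair returned by `char_distance` is what a character error rate (CER) is built from. The sketch below is an illustration under stated assumptions: `edit_distance` is a pure-Python stand-in for `Lev.distance` so the example runs without the `Levenshtein` package, and `char_error_rate` is a hypothetical helper that does not exist in this repository.

```python
def edit_distance(a, b):
    """Levenshtein distance via a row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def char_error_rate(target, y_hat):
    """Edit distance divided by target length, with spaces ignored."""
    target = target.replace(' ', '')
    y_hat = y_hat.replace(' ', '')
    if not target:
        raise ValueError("target must contain at least one non-space character")
    return edit_distance(y_hat, target) / len(target)


# One substituted character out of five target characters.
print(char_error_rate("안녕 하세요", "안녕하세용"))
```

When aggregating over a dataset, summing the distances and lengths separately and dividing once at the end weights long utterances correctly, which is presumably why `char_distance` returns both values instead of a ratio.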
--------------------------------------------------------------------------------
/preprocess/KoNPron.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# KoNPron"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "class KoNPron:\n",
17 | " def __init__(self):\n",
18 | " self.base_digit = ['0','1','2','3','4','5','6','7','8','9']\n",
19 | " self.super_digit = ['⁰','¹','²','³','⁴','⁵','⁶','⁷','⁸','⁹']\n",
20 | " self.small_scale = ['','십','백','천']\n",
21 | " self.large_scale = ['','만 ','억 ','조 ','경 ','해 ']\n",
22 | " self.literal = ['영','일','이','삼','사','오','육','칠','팔','구']\n",
23 | " self.spoken_unit = ['', '하나', '둘', '셋', '넷', '다섯', '여섯', '일곱', '여덟', '아홉']\n",
24 | " self.spoke_tens = ['','열', '스물', '서른', '마흔', '쉰', '예순', '일흔', '여든', '아흔']\n",
25 | " self.sentence = str()\n",
26 | " \n",
27 | " def _detect(self, sentence):\n",
28 | " self.sentence = sentence\n",
29 | " detection_data = list()\n",
30 | " tmp = str()\n",
31 | " \n",
32 | " total_len = len(sentence)\n",
33 | " point_count = 0\n",
34 | " continuous_count = 0\n",
35 | " digit_type = 'vanilla'\n",
36 | " \n",
37 | " detected = False\n",
38 | " zero_started = False\n",
39 | " \n",
40 | " for idx, char in enumerate(sentence):\n",
41 | " if char in self.base_digit:\n",
42 | " if not detected:\n",
43 | " detected = True\n",
44 | " if char == '0':\n",
45 | " zero_started = True\n",
46 | " tmp += char\n",
47 | " continuous_count += 1\n",
48 | " if zero_started and continuous_count>8:\n",
49 | " digit_type = 'telephone/none'\n",
50 | " if continuous_count>20:\n",
51 | " digit_type = 'enormous/none' \n",
52 | " else:\n",
53 | " continuous_count = 0\n",
54 | " zero_started = False\n",
55 | " if char == ',':\n",
56 | " if idx+1 < total_len and idx > 0:\n",
57 | " if sentence[idx-1] in self.base_digit and sentence[idx+1] in self.base_digit:\n",
58 | " tmp += char\n",
59 | " elif char == '.':\n",
60 | " if idx+1 < total_len and idx > 0:\n",
61 | " if sentence[idx-1] in self.base_digit and sentence[idx+1] in self.base_digit:\n",
62 | " point_count += 1\n",
63 | " if point_count == 1:\n",
64 | " digit_type = 'fraction'\n",
65 | " if point_count > 1:\n",
66 | " digit_type = 'version'\n",
67 | " tmp += char\n",
68 | " elif char == '^':\n",
69 | " if idx+1 < total_len and idx > 0:\n",
70 | " if sentence[idx-1] in self.base_digit and sentence[idx+1] in self.base_digit:\n",
71 | " digit_type += '/square'\n",
72 | " tmp += char\n",
73 | " else:\n",
74 | " if digit_type != 'exception/none':\n",
75 | " digit_type = 'exception/none'\n",
76 | " tmp += char\n",
77 | " elif char in self.super_digit:\n",
78 | " if idx > 0:\n",
79 | " if sentence[idx-1] in self.base_digit or sentence[idx-1] in self.super_digit:\n",
80 | " if not 'square' in digit_type:\n",
81 | " digit_type += '/square'\n",
82 | " tmp += char\n",
83 | " else:\n",
84 | " if digit_type != 'exception/none':\n",
85 | " digit_type = 'exception/none'\n",
86 | " tmp += char\n",
87 | " elif char == '·':\n",
88 | " if idx+1 < total_len and idx > 0:\n",
89 | " if sentence[idx-1] in self.base_digit and sentence[idx+1] in self.base_digit:\n",
90 | " digit_type = 'date'\n",
91 | " tmp += char\n",
92 | " else:\n",
93 | " if detected:\n",
94 | " detected = False\n",
95 | " if '/' not in digit_type:\n",
96 | " digit_type += '/none'\n",
97 | " detection_data.append((digit_type, tmp))\n",
98 | " tmp = str()\n",
99 | " digit_type = 'vanilla'\n",
100 | " point_count = 0\n",
101 | " continuous_count = 0\n",
102 | " else:\n",
103 | " if tmp:\n",
104 | " detection_data.append((digit_type, tmp))\n",
105 | " tmp = str()\n",
106 | " digit_type = 'vanilla'\n",
107 | " point_count = 0\n",
108 | " continuous_count = 0\n",
109 | " \n",
110 | " if detected:\n",
111 | " if '/' not in digit_type:\n",
112 | " digit_type += '/none'\n",
113 | " detection_data.append((digit_type, tmp))\n",
114 | " \n",
115 | " elif tmp:\n",
116 | " detection_data.append((digit_type, tmp))\n",
117 | " return detection_data\n",
118 | " \n",
119 | " def _preprocess(self, detection_data):\n",
120 | " preprocessed_data = list()\n",
121 | " \n",
122 | " for digit_type, target in detection_data:\n",
123 | " original = target\n",
124 | " target_seq = list()\n",
125 | " reading_method = list()\n",
126 | " target_len = len(target)\n",
127 | "\n",
128 | " main_type, sub_type = digit_type.split('/')\n",
129 | " if main_type == 'exception':\n",
130 | " return None\n",
131 | " \n",
132 | " if main_type == 'version':\n",
133 | " splited = target.split('.')\n",
134 | " for count, frag in enumerate(splited):\n",
135 | " target_seq.append(frag)\n",
136 | " reading_method.append('individual')\n",
137 | " if count < len(splited) - 1:\n",
138 | " target_seq.append('.')\n",
139 | " reading_method.append('point')\n",
140 | " preprocessed_data.append((reading_method, target_seq, original))\n",
141 | " \n",
142 | " if main_type == 'date':\n",
143 | " target = target.split('·')\n",
144 | " for frag in target:\n",
145 | " target_seq.append(frag)\n",
146 | " reading_method.append('individual')\n",
147 | " preprocessed_data.append((reading_method, target_seq, original))\n",
148 | " \n",
149 | " if main_type == 'telephone' or main_type == 'enormous':\n",
150 | " target = target.replace('0', '_')\n",
151 | " target_seq.append(target)\n",
152 | " reading_method.append('individual')\n",
153 | " preprocessed_data.append((reading_method, target_seq, original))\n",
154 | " \n",
155 | " if sub_type != 'none':\n",
156 | " if main_type == 'vanilla':\n",
157 | " target = [target.replace(',','')]\n",
158 | " if main_type == 'fraction':\n",
159 | " target = target.split('.')\n",
160 | " \n",
161 | " if sub_type == 'square':\n",
162 | " if '^' not in target[-1]:\n",
163 | " super_part = str()\n",
164 | " for idx, digit in enumerate(target[-1]):\n",
165 | " if digit not in self.base_digit:\n",
166 | " super_idx = idx\n",
167 | " break\n",
168 | " super_num = list(target[-1][idx:])\n",
169 | " for idx in range(len(super_num)):\n",
170 | " super_num[idx] = str(self.super_digit.index(super_num[idx]))\n",
171 | " super_num = ''.join(super_num)\n",
172 | " super_part += target[-1][:super_idx]+'^'+super_num\n",
173 | " if main_type == 'vanilla':\n",
174 | " target = [super_part]\n",
175 | " else:\n",
176 | " target = target[:1]+[super_part]\n",
177 | " if main_type == 'fraction':\n",
178 | " target_seq.append(target[0])\n",
179 | " reading_method.append('literal')\n",
180 | " target_seq.append('.')\n",
181 | " reading_method.append('point')\n",
182 | "\n",
183 | " tmp_len = len(target_seq)\n",
184 | " tmp = target[-1].split('^')\n",
185 | " for seq in tmp:\n",
186 | " target_seq.append(seq)\n",
187 | " if 'point' in reading_method:\n",
188 | " if 'of' in reading_method:\n",
189 | " reading_method.append('literal')\n",
190 | " else:\n",
191 | " reading_method.append('individual')\n",
192 | " else:\n",
193 | " reading_method.append('literal')\n",
194 | " if len(target_seq) == tmp_len+1:\n",
195 | " target_seq.append(\"^\")\n",
196 | " reading_method.append(\"of\")\n",
197 | " elif len(target_seq) == tmp_len+3:\n",
198 | " target_seq.append(\"^\")\n",
199 | " reading_method.append(\"super\") \n",
200 | " preprocessed_data.append((reading_method, target_seq, original))\n",
201 | " \n",
202 | " else:\n",
203 | " if main_type == 'vanilla':\n",
204 | " target = target.replace(',','')\n",
205 | " target_seq.append(target)\n",
206 | " reading_method.append('literal')\n",
207 | " preprocessed_data.append((reading_method, target_seq, original))\n",
208 | " \n",
209 | " if main_type == 'fraction':\n",
210 | " target = target.split('.')\n",
211 | " for frag in target:\n",
212 | " target_seq.append(frag)\n",
213 | " if 'point' in reading_method:\n",
214 | " reading_method.append('individual')\n",
215 | " else:\n",
216 | " reading_method.append('literal')\n",
217 | " if len(target_seq) == 1:\n",
218 | " target_seq.append(\".\")\n",
219 | " reading_method.append('point')\n",
220 | " \n",
221 | " preprocessed_data.append((reading_method, target_seq, original))\n",
222 | " \n",
223 | " return preprocessed_data\n",
224 | " \n",
225 | " \n",
226 | " def _read(self, preprocessed_data, mode = 'informal'):\n",
227 | " def __literal_read(self, frag, mode = 'informal'):\n",
228 | " korean = str()\n",
229 | " tmp = str()\n",
230 | " length = len(frag)\n",
231 | " for idx, digit in enumerate(frag):\n",
232 | " digit = int(digit)\n",
233 | " inversed_idx = length-idx-1\n",
234 | " if mode == 'formal':\n",
235 | " if inversed_idx%4:\n",
236 | " if digit:\n",
237 | " tmp += self.literal[digit]\n",
238 | " tmp += self.small_scale[inversed_idx%4]\n",
239 | " else:\n",
240 | " if digit or length == 1:\n",
241 | " tmp += self.literal[digit]\n",
242 | " if tmp:\n",
243 | " tmp += self.large_scale[inversed_idx//4]\n",
244 | " korean += tmp\n",
245 | " tmp = str()\n",
246 | " \n",
247 | " elif mode == 'informal':\n",
248 | " if inversed_idx%4:\n",
249 | " if digit>1:\n",
250 | " tmp += self.literal[digit]\n",
251 | " if digit:\n",
252 | " tmp += self.small_scale[inversed_idx%4]\n",
253 | " else:\n",
254 | " if digit or length == 1:\n",
255 | " tmp += self.literal[digit]\n",
256 | " if tmp:\n",
257 | " tmp += self.large_scale[inversed_idx//4]\n",
258 | " if length == 5 and digit == 1 and inversed_idx == 4:\n",
259 | " tmp = tmp[1:]\n",
260 | " korean += tmp\n",
261 | " tmp = str()\n",
262 | " korean += tmp\n",
263 | " return korean\n",
264 | " \n",
265 | " def __individual_read(self, frag):\n",
266 | " korean = str()\n",
267 | " for digit in frag:\n",
268 | " if digit == '_':\n",
269 | " korean += '공'\n",
270 | " else:\n",
271 | " korean += self.literal[int(digit)]\n",
272 | " return korean\n",
273 | " \n",
274 | " \n",
275 | " result = self.sentence\n",
276 | " if preprocessed_data is None:\n",
277 | " return None\n",
278 | " \n",
279 | " for each in preprocessed_data:\n",
280 | " reading_method, target_seq, original = each\n",
281 | " readed = str()\n",
282 | " for idx, frag in enumerate(target_seq):\n",
283 | " if reading_method[idx] == 'literal':\n",
284 | " readed += __literal_read(self, frag)\n",
285 | " if reading_method[idx] == 'individual':\n",
286 | " readed += __individual_read(self, frag)\n",
287 | " if reading_method[idx] == 'point':\n",
288 | " readed += ' 점 '\n",
289 | " if reading_method[idx] == 'of':\n",
290 | " readed += '의 '\n",
291 | " if reading_method[idx] == 'super':\n",
292 | " readed += ' 승'\n",
293 | " result = result.replace(original, readed, 1)\n",
294 | " return result\n",
295 | " \n",
296 | " def convert(self, sentence):\n",
297 | " return self._read(self._preprocess(self._detect(sentence)))"
298 | ]
299 | },
300 | {
301 | "cell_type": "markdown",
302 | "metadata": {},
303 | "source": [
304 | "## Check the intermediate output values"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 12,
310 | "metadata": {},
311 | "outputs": [
312 | {
313 | "name": "stdout",
314 | "output_type": "stream",
315 | "text": [
316 | "Original : 13,000, 3.14, 1.4^2, 8·15, 1.1.7\n",
317 | "<>\n",
318 | "[('vanilla/none', '13,000'), ('fraction/none', '3.14'), ('fraction/square', '1.4^2'), ('date/none', '8·15'), ('version/none', '1.1.7')]\n",
319 | "<>\n",
320 | "[(['literal'], ['13000'], '13,000'), (['literal', 'point', 'individual'], ['3', '.', '14'], '3.14'), (['literal', 'point', 'individual', 'of', 'literal', 'super'], ['1', '.', '4', '^', '2', '^'], '1.4^2'), (['individual', 'individual'], ['8', '15'], '8·15'), (['individual', 'point', 'individual', 'point', 'individual'], ['1', '.', '1', '.', '7'], '1.1.7')]\n",
321 | "<>\n",
322 | "만 삼천, 삼 점 일사, 일 점 사의 이 승, 팔일오, 일 점 일 점 칠\n"
323 | ]
324 | }
325 | ],
326 | "source": [
327 | "test = KoNPron()\n",
328 | "sentence = \"01066883225, 420076049241990172380904740935\"\n",
329 | "sentence = \"13,000, 3.14, 1.4^2, 8·15, 1.1.7\"\n",
330 | "print(\"Original : \",sentence)\n",
331 | "detect = test._detect(sentence)\n",
332 | "print(\"<>\")\n",
333 | "print(detect)\n",
334 | "preprocessed = test._preprocess(detect)\n",
335 | "print(\"<>\")\n",
336 | "print(preprocessed)\n",
337 | "print(\"<>\")\n",
338 | "print(test._read(preprocessed))"
339 | ]
340 | },
341 | {
342 | "cell_type": "markdown",
343 | "metadata": {},
344 | "source": [
345 | "## For exceptional cases, the intermediate and final outputs are set to None"
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": 3,
351 | "metadata": {},
352 | "outputs": [
353 | {
354 | "name": "stdout",
355 | "output_type": "stream",
356 | "text": [
357 | "[('exception/none', '⁴')]\n",
358 | "None\n",
359 | "None\n"
360 | ]
361 | }
362 | ],
363 | "source": [
364 | "exception = \"아니⁴ 이런 일이?\"\n",
365 | "detect = test._detect(exception)\n",
366 | "print(detect)\n",
367 | "preprocessed = test._preprocess(detect)\n",
368 | "print(preprocessed)\n",
369 | "result = test._read(preprocessed)\n",
370 | "print(result)"
371 | ]
372 | },
373 | {
374 | "cell_type": "markdown",
375 | "metadata": {},
376 | "source": [
377 | "## Tell exception sentences apart from normal ones and print only the normal ones"
378 | ]
379 | },
380 | {
381 | "cell_type": "code",
382 | "execution_count": 6,
383 | "metadata": {},
384 | "outputs": [
385 | {
386 | "name": "stdout",
387 | "output_type": "stream",
388 | "text": [
389 | "팔일오 행사를 맞아 광장에 사만 구천여 명의 시민들이 모였다. 이 중 오십육 점 이이퍼센트는 태극기를 들고 있었다.\n",
390 | "버전 일 점 영 점 칠, 삼천만큼 사랑해, 십육 점 칠의 삼 승과 만 오천삼백삼십이 점 칠팔삼의 백십 승\n",
391 | "제 핸드폰 번호는 공일공일이삼사오육칠팔, 집 전화번호는 공이일이삼사오육칠이고 송장번호는 사이공공칠육공사구이사일구구공일칠이삼팔공구공사칠사공구삼오\n"
392 | ]
393 | }
394 | ],
395 | "source": [
396 | "sentence1 = \"8·15 행사를 맞아 광장에 4,9000여 명의 시민들이 모였다. 이 중 56.22퍼센트는 태극기를 들고 있었다.\"\n",
397 | "sentence2 = \"아니 벌써³ 4월이라니\"\n",
398 | "exception1 = \"그러게요^^\"\n",
399 | "exception2 = \"버전 1.0.7, 3000만큼 사랑해, 16.7³과 15332.783^110\"\n",
400 | "sentence3 = \"제 핸드폰 번호는 01012345678, 집 전화번호는 021234567이고 송장번호는 420076049241990172380904740935\"\n",
401 | "\n",
402 | "data = [sentence1, exception1, exception2, sentence2, sentence3]\n",
403 | "\n",
404 | "for sentence in data:\n",
405 | " ret = test.convert(sentence)\n",
406 | " if ret is not None:\n",
407 | " print(ret)"
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "execution_count": null,
413 | "metadata": {},
414 | "outputs": [],
415 | "source": []
416 | }
417 | ],
418 | "metadata": {
419 | "kernelspec": {
420 | "display_name": "Python 3",
421 | "language": "python",
422 | "name": "python3"
423 | },
424 | "language_info": {
425 | "codemirror_mode": {
426 | "name": "ipython",
427 | "version": 3
428 | },
429 | "file_extension": ".py",
430 | "mimetype": "text/x-python",
431 | "name": "python",
432 | "nbconvert_exporter": "python",
433 | "pygments_lexer": "ipython3",
434 | "version": "3.7.3"
435 | }
436 | },
437 | "nbformat": 4,
438 | "nbformat_minor": 2
439 | }
440 |
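The informal reading produced by `__literal_read` in the notebook above (digits grouped in fours, with the leading `일` dropped before `십`, `백`, `천`, and before a bare `만`) can be sketched on plain integers. This is a simplified illustration and not the notebook's code: `read_chunk` and `read_number` are hypothetical names, the input is an `int` rather than a digit string, and the scale table stops at `조`.

```python
LITERAL = ['영', '일', '이', '삼', '사', '오', '육', '칠', '팔', '구']
SMALL_SCALE = ['', '십', '백', '천']
LARGE_SCALE = ['', '만', '억', '조']


def read_chunk(chunk):
    """Read a value in 1..9999, omitting '일' before 십/백/천."""
    text = ''
    for place in (3, 2, 1, 0):
        digit = chunk // 10 ** place % 10
        if digit == 0:
            continue
        if digit > 1 or place == 0:
            text += LITERAL[digit]
        text += SMALL_SCALE[place]
    return text


def read_number(number):
    """Informal Sino-Korean reading of an integer in 0..10**16 - 1."""
    if 0 > number or number >= 10 ** 16:
        raise ValueError("number out of supported range")
    if number == 0:
        return LITERAL[0]
    parts = []
    group = 0
    while number > 0:
        number, chunk = divmod(number, 10000)
        if chunk:
            text = read_chunk(chunk)
            if group == 1 and chunk == 1:
                text = ''  # 10000 is read '만', not '일만'
            parts.append(text + LARGE_SCALE[group])
        group += 1
    return ' '.join(reversed(parts))


print(read_number(13000))
print(read_number(15332))
```

Splitting into four-digit groups first keeps the two scale tables independent: `SMALL_SCALE` is applied inside a group and `LARGE_SCALE` once per group, which matches how the notebook uses `inversed_idx % 4` and `inversed_idx // 4`.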
--------------------------------------------------------------------------------
/preprocess/Preprocess.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Preprocess for Language Model"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Merge the collected corpora and preprocess them with a set of rules"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "* Import the required libraries"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "import pandas as pd\n",
31 | "import numpy as np\n",
32 | "import random\n",
33 | "import pickle"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "* Merge the collected corpora"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 2,
46 | "metadata": {},
47 | "outputs": [
48 | {
49 | "ename": "FileNotFoundError",
50 | "evalue": "[Errno 2] No such file or directory: '../data/OpenSubtitles.csv'",
51 | "output_type": "error",
52 | "traceback": [
53 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
54 | "\u001b[1;31mFileNotFoundError\u001b[0m Traceback (most recent call last)",
55 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mkor_corpus\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'../data/OpenSubtitles.csv'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mencoding\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'utf-8'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2\u001b[0m \u001b[0msub_corpus\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'../data/new_kor.csv'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mencoding\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'cp949'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[0mkor_corpus\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mconcat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mkor_corpus\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0msub_corpus\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[0msub_corpus\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'../data/tatoeba.csv'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mencoding\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'utf-8'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
56 | "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py\u001b[0m in \u001b[0;36mparser_f\u001b[1;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)\u001b[0m\n\u001b[0;32m 674\u001b[0m )\n\u001b[0;32m 675\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 676\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0m_read\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 677\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 678\u001b[0m \u001b[0mparser_f\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m__name__\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mname\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
57 | "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py\u001b[0m in \u001b[0;36m_read\u001b[1;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[0;32m 446\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 447\u001b[0m \u001b[1;31m# Create the parser.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 448\u001b[1;33m \u001b[0mparser\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mTextFileReader\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfp_or_buf\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 449\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 450\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mchunksize\u001b[0m \u001b[1;32mor\u001b[0m \u001b[0miterator\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
58 | "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[0;32m 878\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0moptions\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m\"has_index_names\"\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mkwds\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m\"has_index_names\"\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 879\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 880\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_make_engine\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 881\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 882\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mclose\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
59 | "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py\u001b[0m in \u001b[0;36m_make_engine\u001b[1;34m(self, engine)\u001b[0m\n\u001b[0;32m 1112\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m_make_engine\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mengine\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m\"c\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1113\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mengine\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;34m\"c\"\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1114\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_engine\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mCParserWrapper\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mf\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0moptions\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1115\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1116\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mengine\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;34m\"python\"\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
60 | "\u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, src, **kwds)\u001b[0m\n\u001b[0;32m 1872\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mkwds\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"compression\"\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mNone\u001b[0m \u001b[1;32mand\u001b[0m \u001b[0mencoding\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1873\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msrc\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstr\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1874\u001b[1;33m \u001b[0msrc\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msrc\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m\"rb\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1875\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mhandles\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msrc\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1876\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
61 | "\u001b[1;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '../data/OpenSubtitles.csv'"
62 | ]
63 | }
64 | ],
65 | "source": [
66 | "kor_corpus = pd.read_csv('../data/OpenSubtitles.csv', encoding='utf-8')\n",
67 | "sub_corpus = pd.read_csv('../data/new_kor.csv', encoding='cp949')\n",
68 | "kor_corpus = pd.concat([kor_corpus, sub_corpus])\n",
69 | "\n",
70 | "sub_corpus = pd.read_csv('../data/tatoeba.csv', encoding='utf-8')\n",
71 | "kor_corpus = pd.concat([kor_corpus, sub_corpus])\n",
72 | "\n",
73 | "sub_corpus = pd.read_csv('../data/네이버 해커톤 데이터셋.csv', encoding='utf-8', engine='python')\n",
74 | "kor_corpus = pd.concat([kor_corpus, sub_corpus])\n",
75 | "\n",
76 | "for i in range(10):\n",
77 | " sub_corpus = pd.read_csv('../data/AI Hub 한-영 데이터셋[%s].csv' % i, encoding='utf-8', engine='python')\n",
78 | " kor_corpus = pd.concat([kor_corpus, sub_corpus])"
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
85 | "* Check the corpus size"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 3,
91 | "metadata": {},
92 | "outputs": [
93 | {
94 | "data": {
95 | "text/plain": [
96 | "3048264"
97 | ]
98 | },
99 | "execution_count": 3,
100 | "metadata": {},
101 | "output_type": "execute_result"
102 | }
103 | ],
104 | "source": [
105 | "len(kor_corpus)"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "* Inspect the corpus contents"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": 4,
118 | "metadata": {},
119 | "outputs": [
120 | {
121 | "data": {
169 | "text/plain": [
170 | " ko\n",
171 | "0 폭설이 내리고 우박, 진눈깨비가 퍼부어도 눈보라가 몰아쳐도 강풍이 불고 비바람이 휘...\n",
172 | "1 우리의 한결같은 심부름꾼 황새 아저씨 가는 길을 그 누가 막으랴!\n",
173 | "2 황새 아저씨를 기다리세요\n",
174 | "3 찾아와 선물을 주실 거예요\n",
175 | "4 가난하든 부자이든 상관이 없답니다"
176 | ]
177 | },
178 | "execution_count": 4,
179 | "metadata": {},
180 | "output_type": "execute_result"
181 | }
182 | ],
183 | "source": [
184 | "kor_corpus.head()"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "* Convert to a list and shuffle (the corpora were simply concatenated, so they need to be mixed)"
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": 5,
197 | "metadata": {},
198 | "outputs": [
199 | {
200 | "data": {
201 | "text/plain": [
202 | "['어디 보자...',\n",
203 | " '7대 왕국 종족 중에 오크는 처음 듣는다',\n",
204 | " '이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.',\n",
205 | " '라니스터 가의',\n",
206 | " '별 희한한 생각이 다 떠오르곤 하죠',\n",
207 | " '아!',\n",
208 | " '회의 전에 마실 것 좀 드릴까요? 커피와 차가 있습니다.',\n",
209 | " '나는 그런 일에는 전혀 흥미가 없어요.',\n",
210 | " '일자리가 있고 제대로 된 회사가 있다면 늘 사람들이 붐비고 활력이 넘치는 도시로 자리매김할 수 있다.',\n",
211 | " '카카오가 많이 함유된 다크 초콜릿이 더 좋다는 것을 잊지 말자.',\n",
212 | " '비오 13세는 누구인가?',\n",
213 | " '네, 어..',\n",
214 | " '치밀한 사람이긴 하지만...',\n",
215 | " '계속 절 따라오셨죠',\n",
216 | " '세계 4위 특허출원 강국이지만 지식재산 심사 품질과 보호 수준이 낮아 가치를 제대로 인정받지 못하고 있다.',\n",
217 | " '나는 전부터 이 학교를 잘 알고 있었어.',\n",
218 | " '-뭔가 꿍꿍이를 꾸미는게 틀림없어',\n",
219 | " '어쩔 수 없이 과학 수업에 늦었습니다.',\n",
220 | " '가르시아 에르난데스, 토니 그란데, 하비에르 미냐노 등 축구대표팀의 외국인 코치들이 8일 오후 (현지시간) 오스트리아 레오강의 대표팀 숙소에서 인터뷰에 응하고있다.',\n",
221 | " '검찰은 “(증인신문 내용이) 다 녹음 됐으니까 (이 전 대통령이 한 말에 대해) 따지려면 따져볼 수 있다”고 말하기도 했다.']"
222 | ]
223 | },
224 | "execution_count": 5,
225 | "metadata": {},
226 | "output_type": "execute_result"
227 | }
228 | ],
229 | "source": [
230 | "kor_corpus = list(kor_corpus['ko'])\n",
231 | "random.shuffle(kor_corpus)\n",
232 | "kor_corpus[:20]"
233 | ]
234 | },
235 | {
236 | "cell_type": "markdown",
237 | "metadata": {},
238 | "source": [
239 | "* Filter out parenthesized content such as ( ). (The language model is used for Speech-To-Text, so parentheses and their contents are not needed.)"
240 | ]
241 | },
242 | {
243 | "cell_type": "code",
244 | "execution_count": 6,
245 | "metadata": {},
246 | "outputs": [],
247 | "source": [
248 | "new_corpus = list()\n",
249 | "\n",
250 | "for sentence in kor_corpus:\n",
251 | "    new_sentence = str()\n",
252 | "    flag = False  # reset per sentence so an unclosed '(' cannot swallow later sentences\n",
253 | "    for ch in sentence:\n",
254 | "        if ch == '(':\n",
255 | "            flag = True\n",
256 | "            continue\n",
257 | "        elif ch == ')':\n",
258 | "            flag = False\n",
259 | "            continue\n",
260 | "        elif flag == False:\n",
261 | "            new_sentence += ch\n",
262 | "        else:\n",
263 | "            continue\n",
264 | "    \n",
265 | "    new_corpus.append(new_sentence)"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "* Check that the parentheses were filtered correctly"
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": 7,
278 | "metadata": {},
279 | "outputs": [
280 | {
281 | "data": {
282 | "text/plain": [
283 | "['어디 보자...',\n",
284 | " '7대 왕국 종족 중에 오크는 처음 듣는다',\n",
285 | " '이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.',\n",
286 | " '라니스터 가의',\n",
287 | " '별 희한한 생각이 다 떠오르곤 하죠',\n",
288 | " '아!',\n",
289 | " '회의 전에 마실 것 좀 드릴까요? 커피와 차가 있습니다.',\n",
290 | " '나는 그런 일에는 전혀 흥미가 없어요.',\n",
291 | " '일자리가 있고 제대로 된 회사가 있다면 늘 사람들이 붐비고 활력이 넘치는 도시로 자리매김할 수 있다.',\n",
292 | " '카카오가 많이 함유된 다크 초콜릿이 더 좋다는 것을 잊지 말자.',\n",
293 | " '비오 13세는 누구인가?',\n",
294 | " '네, 어..',\n",
295 | " '치밀한 사람이긴 하지만...',\n",
296 | " '계속 절 따라오셨죠',\n",
297 | " '세계 4위 특허출원 강국이지만 지식재산 심사 품질과 보호 수준이 낮아 가치를 제대로 인정받지 못하고 있다.',\n",
298 | " '나는 전부터 이 학교를 잘 알고 있었어.',\n",
299 | " '-뭔가 꿍꿍이를 꾸미는게 틀림없어',\n",
300 | " '어쩔 수 없이 과학 수업에 늦었습니다.',\n",
301 | " '가르시아 에르난데스, 토니 그란데, 하비에르 미냐노 등 축구대표팀의 외국인 코치들이 8일 오후 오스트리아 레오강의 대표팀 숙소에서 인터뷰에 응하고있다.',\n",
302 | " '검찰은 “ 다 녹음 됐으니까 따지려면 따져볼 수 있다”고 말하기도 했다.']"
303 | ]
304 | },
305 | "execution_count": 7,
306 | "metadata": {},
307 | "output_type": "execute_result"
308 | }
309 | ],
310 | "source": [
311 | "kor_corpus = new_corpus\n",
312 | "kor_corpus[:20]"
313 | ]
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "* Check which special characters appear in the current corpus"
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": 8,
325 | "metadata": {},
326 | "outputs": [],
327 | "source": [
328 | "special_ch = list()\n",
329 | "\n",
330 | "for sentence in kor_corpus:\n",
331 | "    for ch in sentence:\n",
332 | "        if ch.isdigit() == False and ch.isalpha() == False and ch not in special_ch:\n",
333 | "            special_ch.append(ch)"
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "* Count the distinct special characters found in the corpus"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 9,
346 | "metadata": {},
347 | "outputs": [
348 | {
349 | "data": {
350 | "text/plain": [
351 | "209"
352 | ]
353 | },
354 | "execution_count": 9,
355 | "metadata": {},
356 | "output_type": "execute_result"
357 | }
358 | ],
359 | "source": [
360 | "len(special_ch)"
361 | ]
362 | },
363 | {
364 | "cell_type": "markdown",
365 | "metadata": {},
366 | "source": [
367 | "* Save the special characters to a CSV file and map each one to an appropriate phonetic transcription"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 10,
373 | "metadata": {},
374 | "outputs": [
375 | {
376 | "data": {
430 | "text/plain": [
431 | " special replace\n",
432 | "0 \n",
433 | "1 . .\n",
434 | "2 “ \n",
435 | "3 \" \n",
436 | "4 - "
437 | ]
438 | },
439 | "execution_count": 10,
440 | "metadata": {},
441 | "output_type": "execute_result"
442 | }
443 | ],
444 | "source": [
445 | "special_df = pd.read_csv('special.csv', encoding='utf-8')\n",
446 | "special_df = special_df.fillna('')\n",
447 | "special_df.head()"
448 | ]
449 | },
450 | {
451 | "cell_type": "markdown",
452 | "metadata": {},
453 | "source": [
454 | "* Build special_dict (maps special => replace)"
455 | ]
456 | },
457 | {
458 | "cell_type": "code",
459 | "execution_count": 11,
460 | "metadata": {},
461 | "outputs": [],
462 | "source": [
463 | "special_dict = dict()\n",
464 | "\n",
465 | "for (special, replace) in zip(special_df['special'], special_df['replace']):\n",
466 | "    special_dict[special] = replace\n",
467 | "\n",
468 | "special_dict[' '] = ' '"
469 | ]
470 | },
471 | {
472 | "cell_type": "markdown",
473 | "metadata": {},
474 | "source": [
475 | "* Store as lists so membership can be checked with the in operator"
476 | ]
477 | },
478 | {
479 | "cell_type": "code",
480 | "execution_count": 12,
481 | "metadata": {},
482 | "outputs": [],
483 | "source": [
484 | "specials = list(special_df['special'])\n",
485 | "replaces = list(special_df['replace'])"
486 | ]
487 | },
488 | {
489 | "cell_type": "markdown",
490 | "metadata": {},
491 | "source": [
492 | "* Replace special characters with the desired phonetic transcription"
493 | ]
494 | },
495 | {
496 | "cell_type": "code",
497 | "execution_count": 13,
498 | "metadata": {},
499 | "outputs": [],
500 | "source": [
501 | "new_corpus = list()\n",
502 | "\n",
503 | "for sentence in kor_corpus:\n",
504 | "    new_sentence = str()\n",
505 | "    for ch in sentence:\n",
506 | "        if ch in specials:\n",
507 | "            new_sentence += special_dict[ch]\n",
508 | "        else:\n",
509 | "            new_sentence += ch\n",
510 | "    \n",
511 | "    new_corpus.append(new_sentence)"
512 | ]
513 | },
514 | {
515 | "cell_type": "markdown",
516 | "metadata": {},
517 | "source": [
518 | "* Check that the special characters were replaced correctly"
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": 14,
524 | "metadata": {},
525 | "outputs": [
526 | {
527 | "data": {
528 | "text/plain": [
529 | "['어디 보자...',\n",
530 | " '7대 왕국 종족 중에 오크는 처음 듣는다',\n",
531 | " '이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.',\n",
532 | " '라니스터 가의',\n",
533 | " '별 희한한 생각이 다 떠오르곤 하죠',\n",
534 | " '아!',\n",
535 | " '회의 전에 마실 것 좀 드릴까요? 커피와 차가 있습니다.',\n",
536 | " '나는 그런 일에는 전혀 흥미가 없어요.',\n",
537 | " '일자리가 있고 제대로 된 회사가 있다면 늘 사람들이 붐비고 활력이 넘치는 도시로 자리매김할 수 있다.',\n",
538 | " '카카오가 많이 함유된 다크 초콜릿이 더 좋다는 것을 잊지 말자.',\n",
539 | " '비오 13세는 누구인가?',\n",
540 | " '네 어..',\n",
541 | " '치밀한 사람이긴 하지만...',\n",
542 | " '계속 절 따라오셨죠',\n",
543 | " '세계 4위 특허출원 강국이지만 지식재산 심사 품질과 보호 수준이 낮아 가치를 제대로 인정받지 못하고 있다.',\n",
544 | " '나는 전부터 이 학교를 잘 알고 있었어.',\n",
545 | " '뭔가 꿍꿍이를 꾸미는게 틀림없어',\n",
546 | " '어쩔 수 없이 과학 수업에 늦었습니다.',\n",
547 | " '가르시아 에르난데스 토니 그란데 하비에르 미냐노 등 축구대표팀의 외국인 코치들이 8일 오후 오스트리아 레오강의 대표팀 숙소에서 인터뷰에 응하고있다.',\n",
548 | " '검찰은 다 녹음 됐으니까 따지려면 따져볼 수 있다고 말하기도 했다.']"
549 | ]
550 | },
551 | "execution_count": 14,
552 | "metadata": {},
553 | "output_type": "execute_result"
554 | }
555 | ],
556 | "source": [
557 | "new_corpus[:20]"
558 | ]
559 | },
560 | {
561 | "cell_type": "markdown",
562 | "metadata": {},
563 | "source": [
564 | "* Update the corpus"
565 | ]
566 | },
567 | {
568 | "cell_type": "code",
569 | "execution_count": 15,
570 | "metadata": {},
571 | "outputs": [],
572 | "source": [
573 | "kor_corpus = new_corpus"
574 | ]
575 | },
576 | {
577 | "cell_type": "markdown",
578 | "metadata": {},
579 | "source": [
580 | "* Filter out alphabet letters (lowercase)"
581 | ]
582 | },
583 | {
584 | "cell_type": "code",
585 | "execution_count": 16,
586 | "metadata": {},
587 | "outputs": [],
588 | "source": [
589 | "alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']\n",
590 | "new_corpus = list()\n",
591 | "\n",
592 | "for sentence in kor_corpus:\n",
593 | "    new_sentence = str()\n",
594 | "    for ch in sentence:\n",
595 | "        if ch in alphabet:\n",
596 | "            continue\n",
597 | "        else:\n",
598 | "            new_sentence += ch\n",
599 | "    \n",
600 | "    new_corpus.append(new_sentence)"
601 | ]
602 | },
603 | {
604 | "cell_type": "code",
605 | "execution_count": 17,
606 | "metadata": {},
607 | "outputs": [
608 | {
609 | "data": {
610 | "text/plain": [
611 | "['어디 보자...',\n",
612 | " '7대 왕국 종족 중에 오크는 처음 듣는다',\n",
613 | " '이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.',\n",
614 | " '라니스터 가의',\n",
615 | " '별 희한한 생각이 다 떠오르곤 하죠',\n",
616 | " '아!',\n",
617 | " '회의 전에 마실 것 좀 드릴까요? 커피와 차가 있습니다.',\n",
618 | " '나는 그런 일에는 전혀 흥미가 없어요.',\n",
619 | " '일자리가 있고 제대로 된 회사가 있다면 늘 사람들이 붐비고 활력이 넘치는 도시로 자리매김할 수 있다.',\n",
620 | " '카카오가 많이 함유된 다크 초콜릿이 더 좋다는 것을 잊지 말자.',\n",
621 | " '비오 13세는 누구인가?',\n",
622 | " '네 어..',\n",
623 | " '치밀한 사람이긴 하지만...',\n",
624 | " '계속 절 따라오셨죠',\n",
625 | " '세계 4위 특허출원 강국이지만 지식재산 심사 품질과 보호 수준이 낮아 가치를 제대로 인정받지 못하고 있다.',\n",
626 | " '나는 전부터 이 학교를 잘 알고 있었어.',\n",
627 | " '뭔가 꿍꿍이를 꾸미는게 틀림없어',\n",
628 | " '어쩔 수 없이 과학 수업에 늦었습니다.',\n",
629 | " '가르시아 에르난데스 토니 그란데 하비에르 미냐노 등 축구대표팀의 외국인 코치들이 8일 오후 오스트리아 레오강의 대표팀 숙소에서 인터뷰에 응하고있다.',\n",
630 | " '검찰은 다 녹음 됐으니까 따지려면 따져볼 수 있다고 말하기도 했다.']"
631 | ]
632 | },
633 | "execution_count": 17,
634 | "metadata": {},
635 | "output_type": "execute_result"
636 | }
637 | ],
638 | "source": [
639 | "kor_corpus = new_corpus\n",
640 | "kor_corpus[:20]"
641 | ]
642 | },
643 | {
644 | "cell_type": "markdown",
645 | "metadata": {},
646 | "source": [
647 | "* Handle uppercase letters (transcribe them by pronunciation, e.g. K리그 => 케이리그)"
648 | ]
649 | },
650 | {
651 | "cell_type": "code",
652 | "execution_count": 18,
653 | "metadata": {},
654 | "outputs": [],
655 | "source": [
656 | "new_corpus = list()\n",
657 | "upper = ['A', 'B', 'C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']\n",
658 | "upper_dict = {'A' : '에이', 'B' : '비', 'C' : '씨','D' : '디','E' : '이','F' : '에프','G' : '쥐',\n",
659 | " 'H' : '에이취','I' : '아이','J' : '제이','K' : '케이','L' : '엘','M' : '엠','N' : '엔',\n",
660 | " 'O' : '오','P' : '피','Q' : '큐','R' : '알','S' : '에스','T' : '티','U' : '유','V' : '브이','W' : '떠블유',\n",
661 | " 'X' : '엑스','Y' : '와이','Z' : '지'}\n",
662 | "\n",
663 | "for sentence in kor_corpus:\n",
664 | "    new_sentence = str()\n",
665 | "    for idx, ch in enumerate(sentence):\n",
666 | "        if ch in upper:\n",
667 | "            if idx == 0 and (idx == len(sentence) - 1 or sentence[idx + 1] == ' '):\n",
668 | "                continue\n",
669 | "            \n",
670 | "            elif idx != 0 and idx < len(sentence) - 1 and sentence[idx + 1] == ' ' and sentence[idx - 1] == ' ':\n",
671 | "                continue\n",
672 | "            \n",
673 | "            elif idx != 0 and sentence[idx - 1] == ' ' and idx + 1 == len(sentence):\n",
674 | "                continue\n",
675 | "            \n",
676 | "            else:\n",
677 | "                new_sentence += upper_dict[ch]\n",
678 | "            \n",
679 | "        else:\n",
680 | "            new_sentence += ch\n",
681 | "    \n",
682 | "    new_corpus.append(new_sentence)"
683 | ]
684 | },
685 | {
686 | "cell_type": "code",
687 | "execution_count": 19,
688 | "metadata": {},
689 | "outputs": [
690 | {
691 | "data": {
692 | "text/plain": [
693 | "['어디 보자...',\n",
694 | " '7대 왕국 종족 중에 오크는 처음 듣는다',\n",
695 | " '이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.',\n",
696 | " '라니스터 가의',\n",
697 | " '별 희한한 생각이 다 떠오르곤 하죠',\n",
698 | " '아!',\n",
699 | " '회의 전에 마실 것 좀 드릴까요? 커피와 차가 있습니다.',\n",
700 | " '나는 그런 일에는 전혀 흥미가 없어요.',\n",
701 | " '일자리가 있고 제대로 된 회사가 있다면 늘 사람들이 붐비고 활력이 넘치는 도시로 자리매김할 수 있다.',\n",
702 | " '카카오가 많이 함유된 다크 초콜릿이 더 좋다는 것을 잊지 말자.',\n",
703 | " '비오 13세는 누구인가?',\n",
704 | " '네 어..',\n",
705 | " '치밀한 사람이긴 하지만...',\n",
706 | " '계속 절 따라오셨죠',\n",
707 | " '세계 4위 특허출원 강국이지만 지식재산 심사 품질과 보호 수준이 낮아 가치를 제대로 인정받지 못하고 있다.',\n",
708 | " '나는 전부터 이 학교를 잘 알고 있었어.',\n",
709 | " '뭔가 꿍꿍이를 꾸미는게 틀림없어',\n",
710 | " '어쩔 수 없이 과학 수업에 늦었습니다.',\n",
711 | " '가르시아 에르난데스 토니 그란데 하비에르 미냐노 등 축구대표팀의 외국인 코치들이 8일 오후 오스트리아 레오강의 대표팀 숙소에서 인터뷰에 응하고있다.',\n",
712 | " '검찰은 다 녹음 됐으니까 따지려면 따져볼 수 있다고 말하기도 했다.']"
713 | ]
714 | },
715 | "execution_count": 19,
716 | "metadata": {},
717 | "output_type": "execute_result"
718 | }
719 | ],
720 | "source": [
721 | "new_corpus[:20]"
722 | ]
723 | },
724 | {
725 | "cell_type": "markdown",
726 | "metadata": {},
727 | "source": [
728 | "* Update the dataset"
729 | ]
730 | },
731 | {
732 | "cell_type": "code",
733 | "execution_count": 20,
734 | "metadata": {},
735 | "outputs": [],
736 | "source": [
737 | "kor_corpus = new_corpus"
738 | ]
739 | },
740 | {
741 | "cell_type": "markdown",
742 | "metadata": {},
743 | "source": [
744 | "* Long whitespace runs => ' '"
745 | ]
746 | },
747 | {
748 | "cell_type": "code",
749 | "execution_count": 21,
750 | "metadata": {},
751 | "outputs": [],
752 | "source": [
753 | "new_corpus = list()\n",
754 | "\n",
755 | "for sentence in kor_corpus:\n",
756 | "    new_sentence = str()\n",
757 | "    tokens = sentence.split()\n",
758 | "    for idx, token in enumerate(tokens):\n",
759 | "        if idx == 0:\n",
760 | "            new_sentence += token + ' '\n",
761 | "        elif idx == len(tokens) - 1:\n",
762 | "            new_sentence += token\n",
763 | "        else:\n",
764 | "            new_sentence += token + ' '\n",
765 | "    \n",
766 | "    new_corpus.append(new_sentence)"
767 | ]
768 | },
769 | {
770 | "cell_type": "code",
771 | "execution_count": 22,
772 | "metadata": {},
773 | "outputs": [
774 | {
775 | "data": {
776 | "text/plain": [
777 | "['어디 보자...',\n",
778 | " '7대 왕국 종족 중에 오크는 처음 듣는다',\n",
779 | " '이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.',\n",
780 | " '라니스터 가의',\n",
781 | " '별 희한한 생각이 다 떠오르곤 하죠',\n",
782 | " '아! ',\n",
783 | " '회의 전에 마실 것 좀 드릴까요? 커피와 차가 있습니다.',\n",
784 | " '나는 그런 일에는 전혀 흥미가 없어요.',\n",
785 | " '일자리가 있고 제대로 된 회사가 있다면 늘 사람들이 붐비고 활력이 넘치는 도시로 자리매김할 수 있다.',\n",
786 | " '카카오가 많이 함유된 다크 초콜릿이 더 좋다는 것을 잊지 말자.',\n",
787 | " '비오 13세는 누구인가?',\n",
788 | " '네 어..',\n",
789 | " '치밀한 사람이긴 하지만...',\n",
790 | " '계속 절 따라오셨죠',\n",
791 | " '세계 4위 특허출원 강국이지만 지식재산 심사 품질과 보호 수준이 낮아 가치를 제대로 인정받지 못하고 있다.',\n",
792 | " '나는 전부터 이 학교를 잘 알고 있었어.',\n",
793 | " '뭔가 꿍꿍이를 꾸미는게 틀림없어',\n",
794 | " '어쩔 수 없이 과학 수업에 늦었습니다.',\n",
795 | " '가르시아 에르난데스 토니 그란데 하비에르 미냐노 등 축구대표팀의 외국인 코치들이 8일 오후 오스트리아 레오강의 대표팀 숙소에서 인터뷰에 응하고있다.',\n",
796 | " '검찰은 다 녹음 됐으니까 따지려면 따져볼 수 있다고 말하기도 했다.']"
797 | ]
798 | },
799 | "execution_count": 22,
800 | "metadata": {},
801 | "output_type": "execute_result"
802 | }
803 | ],
804 | "source": [
805 | "new_corpus[:20]"
806 | ]
807 | },
808 | {
809 | "cell_type": "markdown",
810 | "metadata": {},
811 | "source": [
812 | "* Update the dataset"
813 | ]
814 | },
815 | {
816 | "cell_type": "code",
817 | "execution_count": 23,
818 | "metadata": {},
819 | "outputs": [],
820 | "source": [
821 | "kor_corpus = new_corpus"
822 | ]
823 | },
824 | {
825 | "cell_type": "markdown",
826 | "metadata": {},
827 | "source": [
828 | "* Check the dataset size"
829 | ]
830 | },
831 | {
832 | "cell_type": "code",
833 | "execution_count": 24,
834 | "metadata": {},
835 | "outputs": [
836 | {
837 | "data": {
838 | "text/plain": [
839 | "3048264"
840 | ]
841 | },
842 | "execution_count": 24,
843 | "metadata": {},
844 | "output_type": "execute_result"
845 | }
846 | ],
847 | "source": [
848 | "len(kor_corpus)"
849 | ]
850 | },
851 | {
852 | "cell_type": "markdown",
853 | "metadata": {},
854 | "source": [
855 | "* Exclude entries that became empty strings after filtering"
856 | ]
857 | },
858 | {
859 | "cell_type": "code",
860 | "execution_count": 25,
861 | "metadata": {},
862 | "outputs": [],
863 | "source": [
864 | "new_corpus = list()\n",
865 | "\n",
866 | "for sentence in kor_corpus:\n",
867 | "    if len(sentence) == 0:\n",
868 | "        continue\n",
869 | "    new_corpus.append(sentence)"
870 | ]
871 | },
872 | {
873 | "cell_type": "markdown",
874 | "metadata": {},
875 | "source": [
876 | "* About 25,000 entries filtered out"
877 | ]
878 | },
879 | {
880 | "cell_type": "code",
881 | "execution_count": 26,
882 | "metadata": {},
883 | "outputs": [
884 | {
885 | "data": {
886 | "text/plain": [
887 | "3023296"
888 | ]
889 | },
890 | "execution_count": 26,
891 | "metadata": {},
892 | "output_type": "execute_result"
893 | }
894 | ],
895 | "source": [
896 | "len(new_corpus)"
897 | ]
898 | },
899 | {
900 | "cell_type": "markdown",
901 | "metadata": {},
902 | "source": [
903 | "* Update the dataset"
904 | ]
905 | },
906 | {
907 | "cell_type": "code",
908 | "execution_count": 27,
909 | "metadata": {},
910 | "outputs": [],
911 | "source": [
912 | "kor_corpus = new_corpus"
913 | ]
914 | },
915 | {
916 | "cell_type": "markdown",
917 | "metadata": {},
918 | "source": [
919 | "* Spot-check an arbitrary slice of the corpus"
920 | ]
921 | },
922 | {
923 | "cell_type": "code",
924 | "execution_count": 28,
925 | "metadata": {},
926 | "outputs": [
927 | {
928 | "data": {
929 | "text/plain": [
930 | "['이란 세계식량계획 사무소 대표 및 국가 담당국장 쥐는 2018년 올해 주이란 한국대사관 측이 220만 달러를 기부했을 때 한국 국민과 정부에 감사의 뜻을 전하기도 했다.',\n",
931 | " '조심해요 오염된 거예요',\n",
932 | " '그건 그렇고 메이즈는 어때?',\n",
933 | " '간에 있는 덩어리가..',\n",
934 | " '. ',\n",
935 | " '물렸어요? ',\n",
936 | " '안 보이나 보네',\n",
937 | " '그 전엔 아무것도 쓸 것 이 없었어요. 닥터 그레이.',\n",
938 | " '하루만 머무를 거요',\n",
939 | " '그런건 내 일이 아니라서 다른 도움이 필요하면 말해 난 쇼파에 있을꺼야',\n",
940 | " '고마워 ',\n",
941 | " '취업준비생들은 현실적으로 청년취업이 어려운 상황에서 아쉽다는 반응이었다.',\n",
942 | " '시장을 떠나는 순간 덮치겠음',\n",
943 | " '그리고 디자인이랑',\n",
944 | " '저항권이란 위헌적인 국가권력의 행사 때문에 헌법이 침해 또는 파괴되는 경우 주권자인 국민이 헌법을 보장하기 위해 최후 비상수단으로 국가권력에 실력으로 저항하는 것입니다.',\n",
945 | " '현대모비스 양동근도 정규 리그에서 상대해 본 결과 함지훈을 막을 선수가 없다는 게 전자랜드의 약점이라고 지적했다.',\n",
946 | " '아사히는 꾸준히 한국어 공부를 하면서 내가 하고 싶은 노래를 만들기 위해 항상 들고 다닌다고 말해 아티스트로서 더욱 발전할 수 있는 가능성을 보여줬다.',\n",
947 | " '서양마사지 방식으로는 스웨디시 마사지 에스알엔 그리고 홀리스틱 마사지가 있다.',\n",
948 | " '내 말을 들었더라면',\n",
949 | " '앞으로 이렇게 수도권 광역 경제발전위원회가 자주 열려서 다양한 이야기를 해야 한다.']"
950 | ]
951 | },
952 | "execution_count": 28,
953 | "metadata": {},
954 | "output_type": "execute_result"
955 | }
956 | ],
957 | "source": [
958 | "kor_corpus[6000:6020]"
959 | ]
960 | },
961 | {
962 | "cell_type": "markdown",
963 | "metadata": {},
964 | "source": [
965 | "* Strip the trailing space from strings that end with one"
966 | ]
967 | },
968 | {
969 | "cell_type": "code",
970 | "execution_count": 29,
971 | "metadata": {},
972 | "outputs": [],
973 | "source": [
974 | "for idx, sentence in enumerate(kor_corpus):\n",
975 | "    if sentence[-1] == ' ':\n",
976 | "        kor_corpus[idx] = sentence[:-1]"
977 | ]
978 | },
979 | {
980 | "cell_type": "code",
981 | "execution_count": 30,
982 | "metadata": {},
983 | "outputs": [
984 | {
985 | "data": {
986 | "text/plain": [
987 | "['이란 세계식량계획 사무소 대표 및 국가 담당국장 쥐는 2018년 올해 주이란 한국대사관 측이 220만 달러를 기부했을 때 한국 국민과 정부에 감사의 뜻을 전하기도 했다.',\n",
988 | " '조심해요 오염된 거예요',\n",
989 | " '그건 그렇고 메이즈는 어때?',\n",
990 | " '간에 있는 덩어리가..',\n",
991 | " '.',\n",
992 | " '물렸어요?',\n",
993 | " '안 보이나 보네',\n",
994 | " '그 전엔 아무것도 쓸 것 이 없었어요. 닥터 그레이.',\n",
995 | " '하루만 머무를 거요',\n",
996 | " '그런건 내 일이 아니라서 다른 도움이 필요하면 말해 난 쇼파에 있을꺼야',\n",
997 | " '고마워',\n",
998 | " '취업준비생들은 현실적으로 청년취업이 어려운 상황에서 아쉽다는 반응이었다.',\n",
999 | " '시장을 떠나는 순간 덮치겠음',\n",
1000 | " '그리고 디자인이랑',\n",
1001 | " '저항권이란 위헌적인 국가권력의 행사 때문에 헌법이 침해 또는 파괴되는 경우 주권자인 국민이 헌법을 보장하기 위해 최후 비상수단으로 국가권력에 실력으로 저항하는 것입니다.',\n",
1002 | " '현대모비스 양동근도 정규 리그에서 상대해 본 결과 함지훈을 막을 선수가 없다는 게 전자랜드의 약점이라고 지적했다.',\n",
1003 | " '아사히는 꾸준히 한국어 공부를 하면서 내가 하고 싶은 노래를 만들기 위해 항상 들고 다닌다고 말해 아티스트로서 더욱 발전할 수 있는 가능성을 보여줬다.',\n",
1004 | " '서양마사지 방식으로는 스웨디시 마사지 에스알엔 그리고 홀리스틱 마사지가 있다.',\n",
1005 | " '내 말을 들었더라면',\n",
1006 | " '앞으로 이렇게 수도권 광역 경제발전위원회가 자주 열려서 다양한 이야기를 해야 한다.']"
1007 | ]
1008 | },
1009 | "execution_count": 30,
1010 | "metadata": {},
1011 | "output_type": "execute_result"
1012 | }
1013 | ],
1014 | "source": [
1015 | "kor_corpus[6000:6020]"
1016 | ]
1017 | },
1018 | {
1019 | "cell_type": "markdown",
1020 | "metadata": {},
1021 | "source": [
1022 | "* Check the current dataset size"
1023 | ]
1024 | },
1025 | {
1026 | "cell_type": "code",
1027 | "execution_count": 31,
1028 | "metadata": {},
1029 | "outputs": [
1030 | {
1031 | "data": {
1032 | "text/plain": [
1033 | "3023296"
1034 | ]
1035 | },
1036 | "execution_count": 31,
1037 | "metadata": {},
1038 | "output_type": "execute_result"
1039 | }
1040 | ],
1041 | "source": [
1042 | "len(kor_corpus)"
1043 | ]
1044 | },
1045 | {
1046 | "cell_type": "markdown",
1047 | "metadata": {},
1048 | "source": [
1049 | "* Filter out strings that consist only of '?', '!', or '.'"
1050 | ]
1051 | },
1052 | {
1053 | "cell_type": "code",
1054 | "execution_count": 32,
1055 | "metadata": {},
1056 | "outputs": [
1057 | {
1058 | "data": {
1059 | "text/plain": [
1060 | "3001447"
1061 | ]
1062 | },
1063 | "execution_count": 32,
1064 | "metadata": {},
1065 | "output_type": "execute_result"
1066 | }
1067 | ],
1068 | "source": [
1069 | "new_corpus = list()\n",
1070 | "\n",
1071 | "for sentence in kor_corpus:\n",
1072 | "    if sentence == '?' or sentence == '!' or sentence == '.':\n",
1073 | "        continue\n",
1074 | "    \n",
1075 | "    else:\n",
1076 | "        new_corpus.append(sentence)\n",
1077 | " \n",
1078 | "len(new_corpus)"
1079 | ]
1080 | },
1081 | {
1082 | "cell_type": "markdown",
1083 | "metadata": {},
1084 | "source": [
1085 | "* About 20,000 entries filtered out; update the dataset"
1086 | ]
1087 | },
1088 | {
1089 | "cell_type": "code",
1090 | "execution_count": 33,
1091 | "metadata": {},
1092 | "outputs": [],
1093 | "source": [
1094 | "kor_corpus = new_corpus"
1095 | ]
1096 | },
1097 | {
1098 | "cell_type": "markdown",
1099 | "metadata": {},
1100 | "source": [
1101 | "* import **KoNPron** (Korean Number Pronunciation) => converts digits to their phonetic transcription"
1102 | ]
1103 | },
1104 | {
1105 | "cell_type": "code",
1106 | "execution_count": 34,
1107 | "metadata": {},
1108 | "outputs": [],
1109 | "source": [
1110 | "from konpron import KoNPron\n",
1111 | "\n",
1112 | "kpr = KoNPron()"
1113 | ]
1114 | },
1115 | {
1116 | "cell_type": "markdown",
1117 | "metadata": {},
1118 | "source": [
1119 | "* Test **KoNPron**"
1120 | ]
1121 | },
1122 | {
1123 | "cell_type": "code",
1124 | "execution_count": 35,
1125 | "metadata": {},
1126 | "outputs": [
1127 | {
1128 | "name": "stdout",
1129 | "output_type": "stream",
1130 | "text": [
1131 | "천이백삼십사랑, 십이 점 삼사랑, 천이백삼십사랑, 백이십삼의 사 승랑, 일의 이백삼십사 승랑, 일 점 이삼의 사 승랑, 십이 점 삼의 사 승랑, 일이삼사랑, 이억 구천삼백삼십만 천삼백 점 오오삼영일의 사 승\n"
1132 | ]
1133 | }
1134 | ],
1135 | "source": [
1136 | "sentence = \"1234랑, 12.34랑, 1,234랑, 123^4랑, 1²³⁴랑, 1.23^4랑, 12.3⁴랑, 12·34랑, 293301300.55301⁴\"\n",
1137 | "\n",
1138 | "ret = kpr.convert(sentence)\n",
1139 | "if ret is not None:\n",
1140 | "    print(ret)"
1141 | ]
1142 | },
1143 | {
1144 | "cell_type": "markdown",
1145 | "metadata": {},
1146 | "source": [
1147 | "* Convert digits => phonetic transcription with **KoNPron**"
1148 | ]
1149 | },
1150 | {
1151 | "cell_type": "code",
1152 | "execution_count": 36,
1153 | "metadata": {},
1154 | "outputs": [],
1155 | "source": [
1156 | "new_corpus = list()\n",
1157 | "\n",
1158 | "for sentence in kor_corpus:\n",
1159 | "    try:\n",
1160 | "        ret = kpr.convert(sentence)\n",
1161 | "        if ret is not None:\n",
1162 | "            new_corpus.append(ret)\n",
1163 | "        else:\n",
1164 | "            new_corpus.append(sentence)\n",
1165 | "    except Exception:\n",
1166 | "        print(sentence)"
1167 | ]
1168 | },
1169 | {
1170 | "cell_type": "code",
1171 | "execution_count": 37,
1172 | "metadata": {},
1173 | "outputs": [],
1174 | "source": [
1175 | "kor_corpus = new_corpus"
1176 | ]
1177 | },
1178 | {
1179 | "cell_type": "markdown",
1180 | "metadata": {},
1181 | "source": [
1182 | "* Count the sentences made up only of the 2,040 character labels used by the acoustic model"
1183 | ]
1184 | },
1185 | {
1186 | "cell_type": "code",
1187 | "execution_count": 38,
1188 | "metadata": {},
1189 | "outputs": [
1190 | {
1191 | "name": "stdout",
1192 | "output_type": "stream",
1193 | "text": [
1194 | "2992120\n"
1195 | ]
1196 | }
1197 | ],
1198 | "source": [
1199 | "acoustic_labels = list(pd.read_csv('train_labels.csv')['char'])\n",
1200 | "count = 0\n",
1201 | "\n",
1202 | "for sentence in kor_corpus:\n",
1203 | "    for ch in sentence:\n",
1204 | "        if ch not in acoustic_labels:\n",
1205 | "            count += 1\n",
1206 | "            break\n",
1207 | "\n",
1208 | "print(len(kor_corpus) - count)"
1209 | ]
1210 | },
1211 | {
1212 | "cell_type": "markdown",
1213 | "metadata": {},
1214 | "source": [
1215 | "* Build the final corpus from the 2,992,120 sentences that satisfy the condition"
1216 | ]
1217 | },
1218 | {
1219 | "cell_type": "code",
1220 | "execution_count": 41,
1221 | "metadata": {},
1222 | "outputs": [
1223 | {
1224 | "data": {
1225 | "text/plain": [
1226 | "2992120"
1227 | ]
1228 | },
1229 | "execution_count": 41,
1230 | "metadata": {},
1231 | "output_type": "execute_result"
1232 | }
1233 | ],
1234 | "source": [
1235 | "final_corpus = list()\n",
1236 | "\n",
1237 | "for sentence in kor_corpus:\n",
1238 | "    flag = True\n",
1239 | "    for ch in sentence:\n",
1240 | "        if ch not in acoustic_labels:\n",
1241 | "            flag = False\n",
1242 | "            break\n",
1243 | "    if flag:\n",
1244 | "        final_corpus.append(sentence)\n",
1245 | "\n",
1246 | "len(final_corpus)"
1247 | ]
1248 | },
1249 | {
1250 | "cell_type": "code",
1251 | "execution_count": 44,
1252 | "metadata": {},
1253 | "outputs": [],
1254 | "source": [
1255 | "import csv\n",
1256 | "\n",
1257 | "def load_label(label_path, encoding='utf-8'):\n",
1258 | "    char2id = dict()\n",
1259 | "    id2char = dict()\n",
1260 | "\n",
1261 | "    with open(label_path, 'r', encoding=encoding) as f:\n",
1262 | "        labels = csv.reader(f, delimiter=',')\n",
1263 | "        next(labels)\n",
1264 | "\n",
1265 | "        for row in labels:\n",
1266 | "            char2id[row[1]] = row[0]\n",
1267 | "            id2char[int(row[0])] = row[1]\n",
1268 | "\n",
1269 | "    return char2id, id2char\n",
1270 | "\n",
1271 | "char2id, id2char = load_label('train_labels.csv')"
1272 | ]
1273 | },
1274 | {
1275 | "cell_type": "code",
1276 | "execution_count": 49,
1277 | "metadata": {},
1278 | "outputs": [
1279 | {
1280 | "name": "stdout",
1281 | "output_type": "stream",
1282 | "text": [
1283 | "2992120\n",
1284 | "2992120\n"
1285 | ]
1286 | }
1287 | ],
1288 | "source": [
1289 | "corpus_labels = list()\n",
1290 | "\n",
1291 | "for sentence in final_corpus:\n",
1292 | "    label = str()\n",
1293 | "    for idx, ch in enumerate(sentence):\n",
1294 | "        if idx == 0:\n",
1295 | "            label += char2id[ch] + ' '\n",
1296 | "        elif idx == len(sentence) - 1:\n",
1297 | "            label += char2id[ch]\n",
1298 | "        else:\n",
1299 | "            label += char2id[ch] + ' '\n",
1300 | "    \n",
1301 | "    corpus_labels.append(label)\n",
1302 | " \n",
1303 | "print(len(final_corpus))\n",
1304 | "print(len(corpus_labels))"
1305 | ]
1306 | },
1307 | {
1308 | "cell_type": "code",
1309 | "execution_count": 50,
1310 | "metadata": {},
1311 | "outputs": [
1312 | {
1313 | "data": {
1367 | "text/plain": [
1368 | " ko \\\n",
1369 | "0 어디 보자... \n",
1370 | "1 칠대 왕국 종족 중에 오크는 처음 듣는다 \n",
1371 | "2 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n",
1372 | "3 라니스터 가의 \n",
1373 | "4 별 희한한 생각이 다 떠오르곤 하죠 \n",
1374 | "\n",
1375 | " id \n",
1376 | "0 8 190 0 42 45 1 1 1 \n",
1377 | "1 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n",
1378 | "2 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n",
1379 | "3 32 20 79 162 0 6 130 \n",
1380 | "4 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... "
1381 | ]
1382 | },
1383 | "execution_count": 50,
1384 | "metadata": {},
1385 | "output_type": "execute_result"
1386 | }
1387 | ],
1388 | "source": [
1389 | "corpus_dict = {'ko' : final_corpus,\n",
1390 | " 'id' : corpus_labels}\n",
1391 | "corpus_df = pd.DataFrame(corpus_dict)\n",
1392 | "\n",
1393 | "corpus_df.head()"
1394 | ]
1395 | },
1396 | {
1397 | "cell_type": "code",
1398 | "execution_count": 53,
1399 | "metadata": {},
1400 | "outputs": [
1401 | {
1402 | "data": {
1403 | "text/plain": [
1404 | "2992120"
1405 | ]
1406 | },
1407 | "execution_count": 53,
1408 | "metadata": {},
1409 | "output_type": "execute_result"
1410 | }
1411 | ],
1412 | "source": [
1413 | "len(corpus_df)"
1414 | ]
1415 | },
1416 | {
1417 | "cell_type": "markdown",
1418 | "metadata": {},
1419 | "source": [
1420 | "* Save the dataset as a pickle"
1421 | ]
1422 | },
1423 | {
1424 | "cell_type": "code",
1425 | "execution_count": 52,
1426 | "metadata": {},
1427 | "outputs": [],
1428 | "source": [
1429 | "import pickle\n",
1430 | "\n",
1431 | "with open('corpus_df.bin', 'wb') as f:\n",
1432 | "    pickle.dump(corpus_df, f)"
1433 | ]
1434 | }
1435 | ],
1436 | "metadata": {
1437 | "kernelspec": {
1438 | "display_name": "Python 3",
1439 | "language": "python",
1440 | "name": "python3"
1441 | },
1442 | "language_info": {
1443 | "codemirror_mode": {
1444 | "name": "ipython",
1445 | "version": 3
1446 | },
1447 | "file_extension": ".py",
1448 | "mimetype": "text/x-python",
1449 | "name": "python",
1450 | "nbconvert_exporter": "python",
1451 | "pygments_lexer": "ipython3",
1452 | "version": "3.7.6"
1453 | }
1454 | },
1455 | "nbformat": 4,
1456 | "nbformat_minor": 2
1457 | }
1458 |
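
The notebook above strips parenthesized text, collapses whitespace, and finally encodes each sentence as space-separated label ids. A minimal sketch of those three steps, written with `re` and `str.join` rather than the notebook's character loops, might look like the following. The helper names `clean_sentence` and `encode_sentence` and the three-entry `toy_char2id` are assumptions made for illustration; the toy ids mirror the first row shown by `corpus_df.head()`, but the real mapping comes from `train_labels.csv`. One behavioral difference to note: an unmatched `(` is left in place here, whereas the notebook's flag-based loop drops everything after it.

```python
import re

# Matches one innermost "( ... )" group, i.e. one that contains no other parentheses.
_PAREN = re.compile(r'\([^()]*\)')


def clean_sentence(sentence):
    """Drop parenthesized spans, then collapse whitespace runs (hypothetical helper)."""
    previous = None
    while previous != sentence:  # repeat so nested groups are removed inside-out
        previous = sentence
        sentence = _PAREN.sub('', sentence)
    return ' '.join(sentence.split())


def encode_sentence(sentence, char2id):
    """Return space-separated label ids, or None if any character is out of vocabulary."""
    if any(ch not in char2id for ch in sentence):
        return None
    return ' '.join(char2id[ch] for ch in sentence)


# Toy vocabulary for illustration only; the real mapping is loaded from train_labels.csv.
toy_char2id = {' ': '0', '어': '8', '디': '190'}

cleaned = clean_sentence('어디  (현지시간)   어디')
print(cleaned)
print(encode_sentence(cleaned, toy_char2id))
print(encode_sentence('어디!', toy_char2id))
```

Because `char2id` is a dict, the membership test is a hash lookup, which avoids scanning the 2,040-entry `acoustic_labels` list for every character. `' '.join(...)` also never leaves a trailing space, so no separate cleanup pass is needed for single-token sentences.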
--------------------------------------------------------------------------------
/preprocess/konpron.py:
--------------------------------------------------------------------------------
1 | class KoNPron:
2 | def __init__(self):
3 | self.base_digit = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
4 | self.super_digit = ['⁰', '¹', '²', '³', '⁴', '⁵', '⁶', '⁷', '⁸', '⁹']
5 | self.small_scale = ['', '십', '백', '천']
6 | self.large_scale = ['', '만 ', '억 ', '조 ', '경 ', '해 ']
7 | self.literal = ['영', '일', '이', '삼', '사', '오', '육', '칠', '팔', '구']
8 | self.spoken_unit = ['', '하나', '둘', '셋', '넷', '다섯', '여섯', '일곱', '여덟', '아홉']
9 | self.spoke_tens = ['', '열', '스물', '서른', '마흔', '쉰', '예순', '일흔', '여든', '아흔']
10 | self.sentence = str()
11 |
12 | def _detect(self, sentence):
13 | self.sentence = sentence
14 | detection_data = list()
15 | tmp = str()
16 |
17 | total_len = len(sentence)
18 | point_count = 0
19 | continuous_count = 0
20 | digit_type = 'vanilla'
21 |
22 | detected = False
23 | zero_started = False
24 |
25 | for idx, char in enumerate(sentence):
26 | if char in self.base_digit:
27 | if not detected:
28 | detected = True
29 | if char == '0':
30 | zero_started = True
31 | tmp += char
32 | continuous_count += 1
33 | if zero_started and continuous_count > 8:
34 | digit_type = 'telephone/none'
35 | if continuous_count > 20:
36 | digit_type = 'enormous/none'
37 | else:
38 | continuous_count = 0
39 | zero_started = False
40 | if char == ',':
41 | if idx + 1 < total_len and idx > 0:
42 | if sentence[idx - 1] in self.base_digit and sentence[idx + 1] in self.base_digit:
43 | tmp += char
44 | elif char == '.':
45 | if idx + 1 < total_len and idx > 0:
46 | if sentence[idx - 1] in self.base_digit and sentence[idx + 1] in self.base_digit:
47 | point_count += 1
48 | if point_count == 1:
49 | digit_type = 'fraction'
50 | if point_count > 1:
51 | digit_type = 'version'
52 | tmp += char
53 | elif char == '^':
54 | if idx + 1 < total_len and idx > 0:
55 | if sentence[idx - 1] in self.base_digit and sentence[idx + 1] in self.base_digit:
56 | digit_type += '/square'
57 | tmp += char
58 | else:
59 | if digit_type != 'exception/none':
60 | digit_type = 'exception/none'
61 | tmp += char
62 | elif char in self.super_digit:
63 | if idx > 0:
64 | if sentence[idx - 1] in self.base_digit or sentence[idx - 1] in self.super_digit:
65 | if not 'square' in digit_type:
66 | digit_type += '/square'
67 | tmp += char
68 | else:
69 | if digit_type != 'exception/none':
70 | digit_type = 'exception/none'
71 | tmp += char
72 | elif char == '·':
73 | if idx + 1 < total_len and idx > 0:
74 | if sentence[idx - 1] in self.base_digit and sentence[idx + 1] in self.base_digit:
75 | digit_type = 'date'
76 | tmp += char
77 | else:
78 | if detected:
79 | detected = False
80 | if '/' not in digit_type:
81 | digit_type += '/none'
82 | detection_data.append((digit_type, tmp))
83 | tmp = str()
84 | digit_type = 'vanilla'
85 | point_count = 0
86 | continuous_count = 0
87 | else:
88 | if tmp:
89 | detection_data.append((digit_type, tmp))
90 | tmp = str()
91 | digit_type = 'vanilla'
92 | point_count = 0
93 | continuous_count = 0
94 |
95 | if detected:
96 | if '/' not in digit_type:
97 | digit_type += '/none'
98 | detection_data.append((digit_type, tmp))
99 |
100 | elif tmp:
101 | detection_data.append((digit_type, tmp))
102 | return detection_data
103 |
104 | def _preprocess(self, detection_data):
105 | preprocessed_data = list()
106 |
107 | for digit_type, target in detection_data:
108 | original = target
109 | target_seq = list()
110 | reading_method = list()
111 | target_len = len(target)
112 |
113 | main_type, sub_type = digit_type.split('/')
114 | if main_type == 'exception':
115 | return None
116 |
117 | if main_type == 'version':
118 | splited = target.split('.')
119 | for count, frag in enumerate(splited):
120 | target_seq.append(frag)
121 | reading_method.append('individual')
122 | if count < len(splited) - 1:
123 | target_seq.append('.')
124 | reading_method.append('point')
125 | preprocessed_data.append((reading_method, target_seq, original))
126 |
127 | if main_type == 'date':
128 | target = target.split('·')
129 | for frag in target:
130 | target_seq.append(frag)
131 | reading_method.append('individual')
132 | preprocessed_data.append((reading_method, target_seq, original))
133 |
134 | if main_type == 'telephone' or main_type == 'enormous':
135 | target = target.replace('0', '_')
136 | target_seq.append(target)
137 | reading_method.append('individual')
138 | preprocessed_data.append((reading_method, target_seq, original))
139 |
140 | if sub_type != 'none':
141 | if main_type == 'vanilla':
142 | target = [target.replace(',', '')]
143 | if main_type == 'fraction':
144 | target = target.split('.')
145 |
146 | if sub_type == 'square':
147 | if '^' not in target[-1]:
148 | super_part = str()
149 | for idx, digit in enumerate(target[-1]):
150 | if digit not in self.base_digit:
151 | super_idx = idx
152 | break
153 | super_num = list(target[-1][super_idx:])
154 | for idx in range(len(super_num)):
155 | super_num[idx] = str(self.super_digit.index(super_num[idx]))
156 | super_num = ''.join(super_num)
157 | super_part += target[-1][:super_idx] + '^' + super_num
158 | if main_type == 'vanilla':
159 | target = [super_part]
160 | else:
161 | target = target[:1] + [super_part]
162 | if main_type == 'fraction':
163 | target_seq.append(target[0])
164 | reading_method.append('literal')
165 | target_seq.append('.')
166 | reading_method.append('point')
167 |
168 | tmp_len = len(target_seq)
169 | tmp = target[-1].split('^')
170 | for seq in tmp:
171 | target_seq.append(seq)
172 | if 'point' in reading_method:
173 | if 'of' in reading_method:
174 | reading_method.append('literal')
175 | else:
176 | reading_method.append('individual')
177 | else:
178 | reading_method.append('literal')
179 | if len(target_seq) == tmp_len + 1:
180 | target_seq.append("^")
181 | reading_method.append("of")
182 | elif len(target_seq) == tmp_len + 3:
183 | target_seq.append("^")
184 | reading_method.append("super")
185 | preprocessed_data.append((reading_method, target_seq, original))
186 |
187 | else:
188 | if main_type == 'vanilla':
189 | target = target.replace(',', '')
190 | target_seq.append(target)
191 | reading_method.append('literal')
192 | preprocessed_data.append((reading_method, target_seq, original))
193 |
194 | if main_type == 'fraction':
195 | target = target.split('.')
196 | for frag in target:
197 | target_seq.append(frag)
198 | if 'point' in reading_method:
199 | reading_method.append('individual')
200 | else:
201 | reading_method.append('literal')
202 | if len(target_seq) == 1:
203 | target_seq.append(".")
204 | reading_method.append('point')
205 |
206 | preprocessed_data.append((reading_method, target_seq, original))
207 |
208 | return preprocessed_data
209 |
210 | def _read(self, preprocessed_data, mode='informal'):
211 | def literal_read(self, frag, mode='informal'):
212 | korean = str()
213 | tmp = str()
214 | length = len(frag)
215 | for idx, digit in enumerate(frag):
216 | digit = int(digit)
217 | inversed_idx = length - idx - 1
218 | if mode == 'formal':
219 | if inversed_idx % 4:
220 | if digit:
221 | tmp += self.literal[digit]
222 | tmp += self.small_scale[inversed_idx % 4]
223 | else:
224 | if digit or length == 1:
225 | tmp += self.literal[digit]
226 | if tmp:
227 | tmp += self.large_scale[inversed_idx // 4]
228 | korean += tmp
229 | tmp = str()
230 |
231 | elif mode == 'informal':
232 | if inversed_idx % 4:
233 | if digit > 1:
234 | tmp += self.literal[digit]
235 | if digit:
236 | tmp += self.small_scale[inversed_idx % 4]
237 | else:
238 | if digit or length == 1:
239 | tmp += self.literal[digit]
240 | if tmp:
241 | tmp += self.large_scale[inversed_idx // 4]
242 | if length == 5 and digit == 1 and inversed_idx == 4:
243 | tmp = tmp[1:]
244 | korean += tmp
245 | tmp = str()
246 | korean += tmp
247 | return korean
248 |
249 | def individual_read(self, frag):
250 | korean = str()
251 | for digit in frag:
252 | if digit == '_':
253 | korean += '공'
254 | else:
255 | korean += self.literal[int(digit)]
256 | return korean
257 |
258 | result = self.sentence
259 | if preprocessed_data is None:
260 | return None
261 |
262 | for each in preprocessed_data:
263 | reading_method, target_seq, original = each
264 | readed = str()
265 | for idx, frag in enumerate(target_seq):
266 | if reading_method[idx] == 'literal':
267 | readed += literal_read(self, frag, mode)
268 | if reading_method[idx] == 'individual':
269 | readed += individual_read(self, frag)
270 | if reading_method[idx] == 'point':
271 | readed += ' 점 '
272 | if reading_method[idx] == 'of':
273 | readed += '의 '
274 | if reading_method[idx] == 'super':
275 | readed += ' 승'
276 | result = result.replace(original, readed, 1)
277 | return result
278 |
279 | def convert(self, sentence):
280 | return self._read(self._preprocess(self._detect(sentence)))
--------------------------------------------------------------------------------
/preprocess/special.csv:
--------------------------------------------------------------------------------
1 | special,replace
2 | ,
3 | .,.
4 | “,
5 | """",
6 | -,
7 | ",",
8 | ·,.
9 | ”,
10 | ㎏,킬로그램
11 | ㎞,킬로미터
12 | ?,?
13 | ‘,
14 | ’,
15 | ;,
16 | %,퍼센트
17 | !,!
18 | ㎓,기가헤르츠
19 | ',
20 | ♪,
21 | ∼,
22 | [,
23 | ],
24 | &,
25 | ~,
26 | /,
27 | <,
28 | >,
29 | 「,
30 | 」,
31 | $,달러
32 | :,
33 | _,
34 | 《,
35 | 》,
36 | ㈜,
37 | Ⅱ,
38 | ㎡,제곱미터
39 | ¾,사 분의 삼
40 | ,
41 | …,
42 | ㎜,밀리미터
43 | =,는
44 | #,샾
45 | 『,
46 | 』,
47 | ‧,
48 | ♬,
49 | ㏊,
50 | @,앳
51 | ℉,화씨
52 | ㎥,세제곱미터
53 | Ⅲ,
54 | →,
55 | 〈,
56 | 〉,
57 | ㎝,센티미터
58 | {,
59 | \,원
60 | },
61 | ¥,
62 | ㎾,킬로와트
63 | Ⅰ,
64 | ․,.
65 | ㈔,
66 | ㎖,밀리리터
67 | ○,
68 | ㎍,마이크로그램
69 | ,
70 | ㎿,메가와트
71 | ℃,도
72 | ㎳,밀리세컨드
73 | +,플러스
74 | ™,
75 | <,
76 | >,
77 | *,
78 | ㎢,제곱킬로미터
79 | ⅓,삼 분의 일
80 | &,
81 | ^,
82 | ㎚,나노미터
83 | ,
84 | √,
85 | ㎎,밀리그램
86 | 〮,
87 | ♡,
88 | ∙,
89 | ㎛,마이크로미터
90 | °,도
91 | –,
92 | ,
93 | Ⅹ,
94 | ″,
95 | ±,플러스 마이너스
96 | ㎘,킬로리터
97 | ≪,
98 | ≫,
99 | £,
100 | ×,
101 | �,
102 | ㎉,킬로칼로리
103 | ★,
104 | □,
105 | ˚,
106 | ㎧,미터 퍼 세크
107 | ㎈,칼로리
108 | •,
109 | ′,
110 | ¡,
111 | 〓,
112 | 【,
113 | 】,
114 | 〔,
115 | 〕,
116 | ・,
117 | ˙,
118 | ㏈,데시벨
119 | Ⅵ,
120 | Ⅳ,
121 | ?,?
122 | ⅔,삼 분의 이
123 | 、,
124 | 〜,
125 | ♥,
126 | `,
127 | ®,
128 | 〇,
129 | ㎠,제곱센티미터
130 | ㏄,씨씨
131 | ▒,
132 | ,
133 | —,
134 | 「,
135 | 」,
136 | !,!
137 | ―,
138 | ♩,
139 | ⓛ,
140 | ㎒,메가헤르츠
141 | ▶,
142 | :,
143 | ˜,
144 | ´,
145 | ㎽,밀리와트
146 | ,
147 | ~,
148 | ┃,
149 | ·,
150 | ⼐,
151 | ㎃,밀리암페어
152 | %,퍼센트
153 | ◆,
154 | -,
155 | ⇒,
156 | ‰,퍼밀
157 | (,
158 | ),
159 | ㍱,헥토파스칼
160 | ━,
161 | ㎐,헤르츠
162 | ≡,
163 | ¿,
164 | 。,
165 | €,
166 | ¶,
167 | ・,
168 | ㎔,테라헤르츠
169 | ※,
170 | Ⅸ,
171 | ⑯,
172 | ,
173 | ,,
174 | ↗,
175 | ↘,
176 | *,
177 | ㎸,킬로볼트
178 | ㉦,
179 | ½,이 분의 일
180 | ⟪,
181 | ⟫,
182 | ,
183 |
,
184 | Ⅶ,
185 | .,
186 | §,
187 | ㎨,미터 퍼 세크 제곱
188 | △,
189 | ☆,
190 | ÷,나누기
191 | ∪,
192 | ㎩,파스칼
193 | ,
194 | 〃,
195 | ㎫,메가파스칼
196 | £,
197 | ㎗,
198 | ■,
199 | ,
200 | ,
201 | ‵,
202 | ⓧ,
203 | ,
204 | ¢,
205 | ♫,
206 | ,
207 | ㎟,제곱밀리미터
208 | ➡,
209 | ⃗,
210 | ㎑,킬로헤르츠
211 | ,
212 |
--------------------------------------------------------------------------------
/preprocess/train_labels.csv:
--------------------------------------------------------------------------------
1 | id,char,freq
2 | 0, ,5774462
3 | 1,.,640924
4 | 2,그,556373
5 | 3,이,509291
6 | 4,는,374559
7 | 5,아,370444
8 | 6,가,369698
9 | 7,고,356378
10 | 8,어,333842
11 | 9,거,306987
12 | 10,지,276453
13 | 11,데,249269
14 | 12,?,235024
15 | 13,나,229646
16 | 14,하,226136
17 | 15,다,221216
18 | 16,서,211193
19 | 17,에,204330
20 | 18,도,190561
21 | 19,게,177140
22 | 20,니,173284
23 | 21,",",152938
24 | 22,기,149467
25 | 23,은,144674
26 | 24,면,142025
27 | 25,야,137553
28 | 26,있,133155
29 | 27,한,121564
30 | 28,을,121048
31 | 29,까,119483
32 | 30,해,115148
33 | 31,리,111855
34 | 32,라,111479
35 | 33,래,105784
36 | 34,사,100533
37 | 35,근,99781
38 | 36,들,99447
39 | 37,안,97043
40 | 38,로,91847
41 | 39,일,88319
42 | 40,뭐,87328
43 | 41,내,85968
44 | 42,보,82911
45 | 43,제,80874
46 | 44,같,79626
47 | 45,자,76298
48 | 46,만,76093
49 | 47,시,72836
50 | 48,런,70919
51 | 49,너,69192
52 | 50,대,68756
53 | 51,때,67179
54 | 52,되,66237
55 | 53,으,66106
56 | 54,진,62831
57 | 55,를,61802
58 | 56,잖,61455
59 | 57,오,60782
60 | 58,러,60629
61 | 59,인,60234
62 | 60,막,59994
63 | 61,무,58705
64 | 62,었,58385
65 | 63,구,57294
66 | 64,했,57209
67 | 65,수,56787
68 | 66,간,55275
69 | 67,애,54476
70 | 68,우,53539
71 | 69,요,53234
72 | 70,마,53125
73 | 71,생,52815
74 | 72,렇,50798
75 | 73,냥,49989
76 | 74,짜,49581
77 | 75,주,48969
78 | 76,없,48392
79 | 77,말,47929
80 | 78,학,46285
81 | 79,스,46225
82 | 80,더,44487
83 | 81,많,43607
84 | 82,원,41379
85 | 83,음,41348
86 | 84,정,39775
87 | 85,겠,39691
88 | 86,여,39203
89 | 87,먹,39194
90 | 88,금,38720
91 | 89,든,38476
92 | 90,부,38398
93 | 91,할,38262
94 | 92,전,36575
95 | 93,번,36375
96 | 94,좋,36363
97 | 95,랑,36081
98 | 96,네,35514
99 | 97,람,33799
100 | 98,약,33412
101 | 99,건,33371
102 | 100,각,32167
103 | 101,좀,31738
104 | 102,알,30893
105 | 103,잘,30132
106 | 104,걸,29634
107 | 105,모,29629
108 | 106,것,28482
109 | 107,상,28247
110 | 108,려,28218
111 | 109,장,27856
112 | 110,히,27705
113 | 111,않,27305
114 | 112,맞,27202
115 | 113,던,27082
116 | 114,르,26286
117 | 115,교,26116
118 | 116,바,25994
119 | 117,냐,25742
120 | 118,드,25702
121 | 119,십,25654
122 | 120,날,25556
123 | 121,치,25287
124 | 122,비,25278
125 | 123,단,25129
126 | 124,동,25047
127 | 125,또,24720
128 | 126,못,24528
129 | 127,저,24074
130 | 128,얘,23990
131 | 129,중,23851
132 | 130,의,23607
133 | 131,난,23318
134 | 132,엄,23057
135 | 133,봤,22930
136 | 134,걔,22732
137 | 135,화,22593
138 | 136,응,22254
139 | 137,싶,21756
140 | 138,갔,21628
141 | 139,았,21052
142 | 140,집,20850
143 | 141,왜,20801
144 | 142,계,20757
145 | 143,공,20620
146 | 144,긴,20547
147 | 145,신,20371
148 | 146,적,20244
149 | 147,연,20225
150 | 148,직,20061
151 | 149,실,19467
152 | 150,영,19454
153 | 151,미,19444
154 | 152,봐,18931
155 | 153,분,18893
156 | 154,테,18829
157 | 155,년,18669
158 | 156,트,18654
159 | 157,문,18230
160 | 158,와,18114
161 | 159,돼,18028
162 | 160,물,17889
163 | 161,예,17864
164 | 162,터,17722
165 | 163,세,17719
166 | 164,럼,17521
167 | 165,청,17479
168 | 166,차,17455
169 | 167,친,17355
170 | 168,개,17355
171 | 169,삼,17242
172 | 170,국,17224
173 | 171,두,17129
174 | 172,소,17125
175 | 173,살,16893
176 | 174,재,16635
177 | 175,운,15949
178 | 176,쫌,15780
179 | 177,유,15516
180 | 178,속,15255
181 | 179,명,15190
182 | 180,랬,15155
183 | 181,본,15148
184 | 182,갈,15084
185 | 183,방,15069
186 | 184,돈,14998
187 | 185,타,14919
188 | 186,처,14908
189 | 187,빠,14851
190 | 188,과,14843
191 | 189,식,14743
192 | 190,디,14633
193 | 191,배,14403
194 | 192,피,14093
195 | 193,뭔,14093
196 | 194,선,13929
197 | 195,남,13909
198 | 196,경,13868
199 | 197,달,13787
200 | 198,언,13579
201 | 199,받,13519
202 | 200,심,13409
203 | 201,월,13367
204 | 202,버,13339
205 | 203,왔,13265
206 | 204,느,13223
207 | 205,점,12960
208 | 206,올,12923
209 | 207,업,12861
210 | 208,른,12801
211 | 209,성,12717
212 | 210,회,12591
213 | 211,조,12570
214 | 212,워,12424
215 | 213,따,12410
216 | 214,행,12350
217 | 215,반,12158
218 | 216,님,11998
219 | 217,딱,11908
220 | 218,관,11828
221 | 219,입,11275
222 | 220,카,11235
223 | 221,당,11068
224 | 222,였,10977
225 | 223,케,10576
226 | 224,쪽,10430
227 | 225,천,10384
228 | 226,작,10381
229 | 227,누,10336
230 | 228,열,10252
231 | 229,얼,10250
232 | 230,울,10246
233 | 231,찮,10231
234 | 232,된,10191
235 | 233,별,10159
236 | 234,떻,10108
237 | 235,머,9876
238 | 236,쓰,9853
239 | 237,위,9841
240 | 238,크,9838
241 | 239,노,9799
242 | 240,괜,9735
243 | 241,강,9698
244 | 242,편,9668
245 | 243,몰,9623
246 | 244,맛,9382
247 | 245,준,9342
248 | 246,줄,9294
249 | 247,파,9282
250 | 248,백,9252
251 | 249,매,9181
252 | 250,산,9160
253 | 251,술,9142
254 | 252,힘,9056
255 | 253,프,9019
256 | 254,즘,8997
257 | 255,임,8969
258 | 256,체,8888
259 | 257,형,8790
260 | 258,몇,8742
261 | 259,맨,8712
262 | 260,새,8623
263 | 261,용,8571
264 | 262,키,8547
265 | 263,통,8410
266 | 264,양,8363
267 | 265,끝,8361
268 | 266,싸,8328
269 | 267,볼,8188
270 | 268,혼,8179
271 | 269,온,8132
272 | 270,등,8123
273 | 271,길,8067
274 | 272,될,8033
275 | 273,밌,7998
276 | 274,육,7924
277 | 275,늘,7909
278 | 276,슨,7835
279 | 277,됐,7738
280 | 278,놀,7707
281 | 279,외,7608
282 | 280,팔,7601
283 | 281,져,7551
284 | 282,레,7485
285 | 283,억,7461
286 | 284,발,7450
287 | 285,결,7412
288 | 286,초,7290
289 | 287,감,7180
290 | 288,군,7174
291 | 289,호,7173
292 | 290,름,7146
293 | 291,솔,7079
294 | 292,닌,7051
295 | 293,밖,7013
296 | 294,불,7007
297 | 295,밥,6784
298 | 296,포,6676
299 | 297,싫,6631
300 | 298,완,6582
301 | 299,갖,6511
302 | 300,겨,6468
303 | 301,질,6453
304 | 302,토,6448
305 | 303,험,6417
306 | 304,색,6371
307 | 305,떤,6352
308 | 306,역,6340
309 | 307,티,6319
310 | 308,갑,6316
311 | 309,목,6262
312 | 310,린,6256
313 | 311,추,6204
314 | 312,격,6174
315 | 313,후,6119
316 | 314,확,6095
317 | 315,루,6079
318 | 316,민,6024
319 | 317,끼,6023
320 | 318,칠,6019
321 | 319,돌,5997
322 | 320,찍,5956
323 | 321,쪼,5946
324 | 322,깐,5788
325 | 323,필,5786
326 | 324,빨,5693
327 | 325,났,5657
328 | 326,락,5561
329 | 327,박,5537
330 | 328,끔,5483
331 | 329,낌,5403
332 | 330,럴,5397
333 | 331,취,5344
334 | 332,복,5315
335 | 333,둘,5264
336 | 334,페,5217
337 | 335,렸,5198
338 | 336,써,5197
339 | 337,줘,5173
340 | 338,급,5067
341 | 339,력,5065
342 | 340,잡,5030
343 | 341,씩,5006
344 | 342,찾,4990
345 | 343,놓,4987
346 | 344,최,4894
347 | 345,코,4891
348 | 346,넘,4870
349 | 347,졌,4803
350 | 348,섯,4799
351 | 349,브,4793
352 | 350,현,4764
353 | 351,눈,4760
354 | 352,항,4751
355 | 353,귀,4708
356 | 354,설,4688
357 | 355,벌,4666
358 | 356,담,4647
359 | 357,앞,4640
360 | 358,책,4630
361 | 359,절,4629
362 | 360,플,4523
363 | 361,폰,4513
364 | 362,태,4496
365 | 363,종,4487
366 | 364,옛,4450
367 | 365,증,4413
368 | 366,튼,4411
369 | 367,글,4408
370 | 368,습,4383
371 | 369,병,4377
372 | 370,론,4373
373 | 371,출,4364
374 | 372,능,4354
375 | 373,침,4345
376 | 374,순,4339
377 | 375,줬,4308
378 | 376,평,4303
379 | 377,메,4287
380 | 378,똑,4281
381 | 379,커,4261
382 | 380,엔,4248
383 | 381,꾸,4230
384 | 382,란,4194
385 | 383,듣,4083
386 | 384,씨,4009
387 | 385,큰,4002
388 | 386,표,3995
389 | 387,잠,3942
390 | 388,먼,3942
391 | 389,쁘,3840
392 | 390,활,3820
393 | 391,합,3787
394 | 392,접,3732
395 | 393,럽,3722
396 | 394,옷,3705
397 | 395,쳐,3690
398 | 396,손,3689
399 | 397,붙,3645
400 | 398,망,3640
401 | 399,죽,3609
402 | 400,투,3606
403 | 401,족,3603
404 | 402,셨,3589
405 | 403,참,3572
406 | 404,떨,3567
407 | 405,웃,3533
408 | 406,졸,3516
409 | 407,쉬,3492
410 | 408,뭘,3447
411 | 409,변,3406
412 | 410,릴,3374
413 | 411,웠,3293
414 | 412,홍,3267
415 | 413,즈,3265
416 | 414,랐,3245
417 | 415,독,3243
418 | 416,충,3239
419 | 417,짝,3217
420 | 418,떡,3197
421 | 419,뒤,3195
422 | 420,휴,3161
423 | 421,셔,3142
424 | 422,넣,3135
425 | 423,쨌,3075
426 | 424,악,3073
427 | 425,패,3049
428 | 426,빼,3041
429 | 427,슬,2983
430 | 428,특,2975
431 | 429,꺼,2970
432 | 430,숙,2951
433 | 431,쯤,2934
434 | 432,텐,2905
435 | 433,창,2901
436 | 434,겼,2888
437 | 435,굴,2869
438 | 436,판,2863
439 | 437,죠,2851
440 | 438,답,2820
441 | 439,희,2816
442 | 440,허,2815
443 | 441,옆,2798
444 | 442,료,2791
445 | 443,닐,2790
446 | 444,택,2769
447 | 445,림,2760
448 | 446,읽,2742
449 | 447,핸,2733
450 | 448,축,2730
451 | 449,풀,2716
452 | 450,틀,2712
453 | 451,몸,2694
454 | 452,골,2690
455 | 453,황,2635
456 | 454,켜,2635
457 | 455,익,2625
458 | 456,베,2613
459 | 457,북,2600
460 | 458,법,2578
461 | 459,늦,2578
462 | 460,함,2568
463 | 461,랜,2555
464 | 462,꼬,2555
465 | 463,향,2547
466 | 464,석,2541
467 | 465,환,2533
468 | 466,슷,2529
469 | 467,품,2518
470 | 468,혀,2513
471 | 469,블,2512
472 | 470,쓸,2503
473 | 471,채,2472
474 | 472,며,2470
475 | 473,욕,2463
476 | 474,권,2450
477 | 475,검,2445
478 | 476,굳,2428
479 | 477,록,2425
480 | 478,톡,2408
481 | 479,김,2408
482 | 480,넌,2383
483 | 481,깨,2375
484 | 482,션,2374
485 | 483,캐,2369
486 | 484,송,2339
487 | 485,녀,2336
488 | 486,탈,2327
489 | 487,광,2321
490 | 488,혹,2313
491 | 489,퍼,2300
492 | 490,뽑,2278
493 | 491,철,2265
494 | 492,째,2249
495 | 493,움,2232
496 | 494,밤,2231
497 | 495,꼭,2226
498 | 496,샀,2224
499 | 497,끊,2212
500 | 498,땐,2203
501 | 499,깔,2179
502 | 500,멀,2145
503 | 501,높,2141
504 | 502,께,2140
505 | 503,큼,2121
506 | 504,녁,2104
507 | 505,곳,2082
508 | 506,잔,2070
509 | 507,쉽,2070
510 | 508,짐,2067
511 | 509,암,2063
512 | 510,극,2061
513 | 511,련,2056
514 | 512,떠,2056
515 | 513,벽,2049
516 | 514,헤,2047
517 | 515,C,2040
518 | 516,끄,2024
519 | 517,곱,2015
520 | 518,승,2011
521 | 519,봉,2009
522 | 520,착,2006
523 | 521,촌,1990
524 | 522,껴,1986
525 | 523,딩,1983
526 | 524,류,1978
527 | 525,뜨,1970
528 | 526,넷,1941
529 | 527,놨,1922
530 | 528,궁,1894
531 | 529,논,1882
532 | 530,곤,1875
533 | 531,클,1869
534 | 532,싼,1859
535 | 533,앉,1854
536 | 534,컴,1849
537 | 535,맥,1841
538 | 536,팀,1830
539 | 537,썼,1818
540 | 538,낫,1801
541 | 539,튜,1788
542 | 540,걱,1786
543 | 541,쁜,1770
544 | 542,킨,1760
545 | 543,빌,1752
546 | 544,쿠,1748
547 | 545,찌,1738
548 | 546,쌤,1719
549 | 547,T,1715
550 | 548,밀,1711
551 | 549,빵,1702
552 | 550,냈,1702
553 | 551,센,1691
554 | 552,딴,1688
555 | 553,쩌,1678
556 | 554,딸,1678
557 | 555,걍,1596
558 | 556,획,1588
559 | 557,씬,1582
560 | 558,챙,1541
561 | 559,첫,1536
562 | 560,범,1530
563 | 561,핑,1519
564 | 562,굉,1519
565 | 563,쩔,1514
566 | 564,팅,1507
567 | 565,긍,1486
568 | 566,탄,1471
569 | 567,덟,1470
570 | 568,퇴,1469
571 | 569,뛰,1469
572 | 570,층,1467
573 | 571,춰,1454
574 | 572,훨,1447
575 | 573,찬,1439
576 | 574,듯,1424
577 | 575,S,1396
578 | 576,왕,1392
579 | 577,텔,1385
580 | 578,뉴,1382
581 | 579,렌,1377
582 | 580,탕,1374
583 | 581,짓,1371
584 | 582,밑,1365
585 | 583,헬,1358
586 | 584,존,1339
587 | 585,립,1323
588 | 586,녔,1318
589 | 587,꼈,1305
590 | 588,빡,1304
591 | 589,낮,1287
592 | 590,견,1282
593 | 591,링,1281
594 | 592,볶,1271
595 | 593,낙,1271
596 | 594,릭,1267
597 | 595,젠,1263
598 | 596,퓨,1262
599 | 597,츠,1256
600 | 598,맘,1252
601 | 599,놔,1247
602 | 600,렵,1241
603 | 601,땜,1232
604 | 602,쇼,1224
605 | 603,값,1215
606 | 604,닭,1203
607 | 605,깝,1200
608 | 606,픈,1194
609 | 607,탁,1183
610 | 608,쓴,1179
611 | 609,농,1172
612 | 610,량,1166
613 | 611,염,1156
614 | 612,홉,1144
615 | 613,척,1130
616 | 614,겁,1129
617 | 615,콘,1127
618 | 616,섭,1125
619 | 617,냄,1125
620 | 618,P,1125
621 | 619,효,1124
622 | 620,규,1124
623 | 621,꿈,1121
624 | 622,곡,1093
625 | 623,액,1090
626 | 624,쎄,1077
627 | 625,덜,1075
628 | 626,턴,1065
629 | 627,킹,1061
630 | 628,훈,1057
631 | 629,쳤,1054
632 | 630,널,1047
633 | 631,멋,1037
634 | 632,꿀,1034
635 | 633,깜,1019
636 | 634,짧,1014
637 | 635,롤,1013
638 | 636,낼,1012
639 | 637,꽤,1003
640 | 638,총,984
641 | 639,램,984
642 | 640,덕,980
643 | 641,믄,974
644 | 642,믿,972
645 | 643,흥,970
646 | 644,롱,967
647 | 645,뜻,962
648 | 646,짤,958
649 | 647,쌍,957
650 | 648,컨,953
651 | 649,셋,952
652 | 650,잤,950
653 | 651,닥,950
654 | 652,웬,946
655 | 653,엽,944
656 | 654,혜,939
657 | 655,찰,935
658 | 656,뻐,935
659 | 657,뿌,934
660 | 658,빈,934
661 | 659,꿔,934
662 | 660,낸,932
663 | 661,뻔,928
664 | 662,쌓,926
665 | 663,즐,919
666 | 664,튀,914
667 | 665,겹,909
668 | 666,득,899
669 | 667,끌,896
670 | 668,M,880
671 | 669,V,877
672 | 670,녹,876
673 | 671,푸,870
674 | 672,쭉,863
675 | 673,싱,858
676 | 674,팬,857
677 | 675,A,854
678 | 676,!,841
679 | 677,념,836
680 | 678,맡,825
681 | 679,쟁,814
682 | 680,엑,810
683 | 681,켓,809
684 | 682,뀌,808
685 | 683,털,803
686 | 684,풍,802
687 | 685,웨,799
688 | 686,땡,792
689 | 687,롯,791
690 | 688,롭,788
691 | 689,젊,781
692 | 690,넓,778
693 | 691,멘,777
694 | 692,냉,772
695 | 693,칼,771
696 | 694,잉,768
697 | 695,빙,768
698 | 696,뿐,767
699 | 697,옮,761
700 | 698,젤,760
701 | 699,B,757
702 | 700,죄,756
703 | 701,탔,752
704 | 702,샤,746
705 | 703,홀,745
706 | 704,떼,743
707 | 705,줌,738
708 | 706,징,734
709 | 707,폭,727
710 | 708,G,721
711 | 709,킬,713
712 | 710,흔,712
713 | 711,딜,711
714 | 712,슈,703
715 | 713,율,700
716 | 714,즌,697
717 | 715,씀,694
718 | 716,앙,689
719 | 717,눠,688
720 | 718,콩,686
721 | 719,얻,684
722 | 720,숨,682
723 | 721,닝,673
724 | 722,꽃,668
725 | 723,쌀,667
726 | 724,컬,666
727 | 725,춤,666
728 | 726,c,664
729 | 727,뚫,663
730 | 728,엠,661
731 | 729,몬,659
732 | 730,D,658
733 | 731,흐,656
734 | 732,앤,646
735 | 733,똥,645
736 | 734,콜,644
737 | 735,델,635
738 | 736,렀,628
739 | 737,폐,627
740 | 738,엘,624
741 | 739,쁠,623
742 | 740,랄,622
743 | 741,걘,621
744 | 742,벤,619
745 | 743,봄,611
746 | 744,왠,609
747 | 745,씻,609
748 | 746,률,608
749 | 747,켰,600
750 | 748,짱,600
751 | 749,웹,599
752 | 750,압,599
753 | 751,럭,596
754 | 752,땅,596
755 | 753,멍,595
756 | 754,랩,595
757 | 755,댓,595
758 | 756,깊,595
759 | 757,뮤,592
760 | 758,령,590
761 | 759,릿,589
762 | 760,낀,589
763 | 761,윤,586
764 | 762,옥,584
765 | 763,룸,582
766 | 764,딘,579
767 | 765,객,578
768 | 766,댄,576
769 | 767,컵,574
770 | 768,폴,573
771 | 769,쟤,570
772 | 770,뷰,569
773 | 771,템,568
774 | 772,덴,567
775 | 773,눌,559
776 | 774,캠,558
777 | 775,홈,557
778 | 776,삶,557
779 | 777,삭,555
780 | 778,벨,555
781 | 779,엉,552
782 | 780,헐,549
783 | 781,벅,545
784 | 782,벗,544
785 | 783,혈,543
786 | 784,밍,539
787 | 785,셀,536
788 | 786,낭,534
789 | 787,춥,533
790 | 788,릉,533
791 | 789,t,533
792 | 790,잃,532
793 | 791,I,529
794 | 792,놈,526
795 | 793,춘,524
796 | 794,찜,520
797 | 795,R,519
798 | 796,걷,518
799 | 797,삐,515
800 | 798,헌,510
801 | 799,딨,510
802 | 800,빛,504
803 | 801,흘,503
804 | 802,닫,502
805 | 803,균,502
806 | 804,p,495
807 | 805,L,494
808 | 806,좌,492
809 | 807,껄,491
810 | 808,펜,489
811 | 809,N,487
812 | 810,싹,486
813 | 811,탑,485
814 | 812,쏘,483
815 | 813,O,482
816 | 814,픽,480
817 | 815,덩,477
818 | 816,햄,476
819 | 817,큐,473
820 | 818,힐,472
821 | 819,곧,471
822 | 820,낳,470
823 | 821,힌,468
824 | 822,팩,468
825 | 823,뒷,468
826 | 824,툰,467
827 | 825,섬,465
828 | 826,꽂,463
829 | 827,례,462
830 | 828,핫,460
831 | 829,섞,460
832 | 830,촬,458
833 | 831,흰,457
834 | 832,둥,455
835 | 833,K,450
836 | 834,괴,449
837 | 835,s,448
838 | 836,핀,446
839 | 837,꿨,444
840 | 838,틱,441
841 | 839,밝,441
842 | 840,랙,440
843 | 841,땠,440
844 | 842,둔,440
845 | 843,슴,439
846 | 844,첨,438
847 | 845,밴,432
848 | 846,렁,431
849 | 847,칭,429
850 | 848,묻,428
851 | 849,뜬,425
852 | 850,깎,424
853 | 851,엇,423
854 | 852,컸,421
855 | 853,퀴,420
856 | 854,납,418
857 | 855,협,417
858 | 856,몽,416
859 | 857,꼐,415
860 | 858,떴,414
861 | 859,썰,410
862 | 860,찐,407
863 | 861,꼴,407
864 | 862,갠,406
865 | 863,턱,405
866 | 864,틴,398
867 | 865,낄,398
868 | 866,뒀,397
869 | 867,끗,396
870 | 868,꼼,395
871 | 869,F,395
872 | 870,샵,394
873 | 871,휘,392
874 | 872,뼈,390
875 | 873,뚜,389
876 | 874,쩍,388
877 | 875,팡,386
878 | 876,멜,386
879 | 877,톤,385
880 | 878,앨,385
881 | 879,탐,384
882 | 880,칸,384
883 | 881,끓,383
884 | 882,뚱,381
885 | 883,닮,378
886 | 884,깃,375
887 | 885,짬,374
888 | 886,빤,371
889 | 887,측,370
890 | 888,혔,369
891 | 889,꽁,369
892 | 890,펴,368
893 | 891,앴,368
894 | 892,겸,368
895 | 893,쿨,367
896 | 894,릇,363
897 | 895,얀,362
898 | 896,쿄,358
899 | 897,컷,358
900 | 898,팠,356
901 | 899,끈,355
902 | 900,렴,354
903 | 901,잊,352
904 | 902,덤,350
905 | 903,갤,342
906 | 904,븐,340
907 | 905,흡,337
908 | 906,덮,337
909 | 907,씹,335
910 | 908,뽀,335
911 | 909,뚝,335
912 | 910,갚,335
913 | 911,찔,334
914 | 912,댔,333
915 | 913,혁,332
916 | 914,띠,328
917 | 915,벼,327
918 | 916,얇,324
919 | 917,뺐,324
920 | 918,팝,323
921 | 919,잇,322
922 | 920,왼,322
923 | 921,낚,321
924 | 922,칙,316
925 | 923,겉,316
926 | 924,뜯,313
927 | 925,닦,312
928 | 926,짠,311
929 | 927,썹,310
930 | 928,뷔,310
931 | 929,묶,310
932 | 930,꾼,306
933 | 931,빅,305
934 | 932,땄,305
935 | 933,캡,304
936 | 934,묘,304
937 | 935,샘,303
938 | 936,묵,303
939 | 937,a,302
940 | 938,쭈,300
941 | 939,b,300
942 | 940,겪,299
943 | 941,둬,298
944 | 942,J,298
945 | 943,쫄,296
946 | 944,랫,296
947 | 945,뀐,296
948 | 946,흑,295
949 | 947,댕,295
950 | 948,꽉,295
951 | 949,곰,294
952 | 950,붕,293
953 | 951,땀,292
954 | 952,릎,290
955 | 953,뽕,289
956 | 954,쥐,288
957 | 955,렉,287
958 | 956,숭,283
959 | 957,샐,283
960 | 958,v,282
961 | 959,렛,281
962 | 960,녕,281
963 | 961,힙,280
964 | 962,쫙,279
965 | 963,촉,278
966 | 964,쩜,277
967 | 965,긋,277
968 | 966,샌,276
969 | 967,o,275
970 | 968,쫓,273
971 | 969,쩐,273
972 | 970,헷,272
973 | 971,X,268
974 | 972,웅,267
975 | 973,뺏,267
976 | 974,쵸,266
977 | 975,쪘,266
978 | 976,랍,266
979 | 977,E,266
980 | 978,좁,265
981 | 979,앱,265
982 | 980,썸,264
983 | 981,냅,264
984 | 982,펙,263
985 | 983,늙,263
986 | 984,껌,261
987 | 985,n,261
988 | 986,e,261
989 | 987,랭,260
990 | 988,귤,260
991 | 989,찢,259
992 | 990,닿,259
993 | 991,띄,258
994 | 992,긁,255
995 | 993,귄,253
996 | 994,굽,253
997 | 995,갓,253
998 | 996,캔,252
999 | 997,멈,252
1000 | 998,욱,250
1001 | 999,뺄,250
1002 | 1000,뇌,250
1003 | 1001,팟,249
1004 | 1002,쌌,248
1005 | 1003,룹,248
1006 | 1004,덥,248
1007 | 1005,폼,246
1008 | 1006,톱,244
1009 | 1007,듬,244
1010 | 1008,껍,244
1011 | 1009,흠,243
1012 | 1010,팍,243
1013 | 1011,맹,243
1014 | 1012,쉴,242
1015 | 1013,썩,240
1016 | 1014,밟,240
1017 | 1015,맵,237
1018 | 1016,돋,236
1019 | 1017,콤,235
1020 | 1018,맙,234
1021 | 1019,뱅,233
1022 | 1020,쫍,231
1023 | 1021,윗,229
1024 | 1022,뜩,229
1025 | 1023,찝,228
1026 | 1024,뺀,227
1027 | 1025,닷,226
1028 | 1026,넨,226
1029 | 1027,쌈,225
1030 | 1028,쩨,224
1031 | 1029,붓,224
1032 | 1030,쩡,223
1033 | 1031,믹,223
1034 | 1032,잼,221
1035 | 1033,r,221
1036 | 1034,쭐,220
1037 | 1035,엊,219
1038 | 1036,g,219
1039 | 1037,췄,217
1040 | 1038,룩,217
1041 | 1039,텀,215
1042 | 1040,쇠,213
1043 | 1041,숫,212
1044 | 1042,풋,210
1045 | 1043,쌩,208
1046 | 1044,쾌,207
1047 | 1045,볍,207
1048 | 1046,뤄,207
1049 | 1047,겐,207
1050 | 1048,m,207
1051 | 1049,펌,206
1052 | 1050,쪄,206
1053 | 1051,뻥,206
1054 | 1052,i,206
1055 | 1053,뻤,205
1056 | 1054,k,204
1057 | 1055,핵,203
1058 | 1056,셉,200
1059 | 1057,듀,198
1060 | 1058,닉,198
1061 | 1059,략,197
1062 | 1060,넉,197
1063 | 1061,딪,196
1064 | 1062,낯,195
1065 | 1063,텍,194
1066 | 1064,뱃,194
1067 | 1065,멤,194
1068 | 1066,윈,192
1069 | 1067,엎,192
1070 | 1068,뭉,192
1071 | 1069,젝,191
1072 | 1070,셜,190
1073 | 1071,빴,190
1074 | 1072,룰,190
1075 | 1073,앗,189
1076 | 1074,궈,189
1077 | 1075,윙,188
1078 | 1076,엥,188
1079 | 1077,d,185
1080 | 1078,꼽,183
1081 | 1079,챔,182
1082 | 1080,쉐,182
1083 | 1081,봇,182
1084 | 1082,푼,180
1085 | 1083,댁,179
1086 | 1084,칵,178
1087 | 1085,뿔,177
1088 | 1086,뺑,176
1089 | 1087,탱,175
1090 | 1088,쿼,174
1091 | 1089,갱,174
1092 | 1090,퉁,173
1093 | 1091,빔,173
1094 | 1092,썬,172
1095 | 1093,빽,172
1096 | 1094,둑,172
1097 | 1095,헛,169
1098 | 1096,빗,168
1099 | 1097,탓,167
1100 | 1098,륵,165
1101 | 1099,꼰,165
1102 | 1100,쎈,163
1103 | 1101,쥬,162
1104 | 1102,깡,162
1105 | 1103,퀄,161
1106 | 1104,빚,161
1107 | 1105,즉,160
1108 | 1106,삿,160
1109 | 1107,밭,160
1110 | 1108,u,160
1111 | 1109,혐,159
1112 | 1110,햇,159
1113 | 1111,툭,159
1114 | 1112,탠,158
1115 | 1113,샷,158
1116 | 1114,맣,157
1117 | 1115,껏,157
1118 | 1116,핏,156
1119 | 1117,앵,155
1120 | 1118,뜰,155
1121 | 1119,굿,155
1122 | 1120,U,155
1123 | 1121,섹,154
1124 | 1122,펑,153
1125 | 1123,맻,153
1126 | 1124,뀔,153
1127 | 1125,깥,152
1128 | 1126,뱀,150
1129 | 1127,뢰,150
1130 | 1128,껀,150
1131 | 1129,뉘,149
1132 | 1130,흉,146
1133 | 1131,틈,146
1134 | 1132,쏟,146
1135 | 1133,훔,143
1136 | 1134,쇄,142
1137 | 1135,뎅,142
1138 | 1136,칩,141
1139 | 1137,띵,140
1140 | 1138,푹,139
1141 | 1139,넥,139
1142 | 1140,퀘,137
1143 | 1141,훅,136
1144 | 1142,융,136
1145 | 1143,멸,136
1146 | 1144,냠,136
1147 | 1145,횟,134
1148 | 1146,찼,134
1149 | 1147,Y,134
1150 | 1148,룬,133
1151 | 1149,귈,133
1152 | 1150,H,133
1153 | 1151,젓,132
1154 | 1152,쏠,132
1155 | 1153,숲,132
1156 | 1154,냬,132
1157 | 1155,l,132
1158 | 1156,짰,130
1159 | 1157,멕,130
1160 | 1158,뇨,130
1161 | 1159,팽,128
1162 | 1160,깼,127
1163 | 1161,숏,126
1164 | 1162,굔,125
1165 | 1163,슐,123
1166 | 1164,쉰,123
1167 | 1165,얄,122
1168 | 1166,뱉,122
1169 | 1167,렬,122
1170 | 1168,굶,122
1171 | 1169,팁,121
1172 | 1170,츄,121
1173 | 1171,뭣,121
1174 | 1172,륙,121
1175 | 1173,횡,120
1176 | 1174,옹,119
1177 | 1175,뻘,117
1178 | 1176,옵,116
1179 | 1177,옴,116
1180 | 1178,얹,116
1181 | 1179,쑥,116
1182 | 1180,깄,116
1183 | 1181,므,115
1184 | 1182,찡,112
1185 | 1183,젖,112
1186 | 1184,꽈,112
1187 | 1185,틸,111
1188 | 1186,콕,111
1189 | 1187,첩,111
1190 | 1188,똘,111
1191 | 1189,쿵,110
1192 | 1190,왤,110
1193 | 1191,괌,110
1194 | 1192,밸,109
1195 | 1193,녜,109
1196 | 1194,갸,109
1197 | 1195,펀,108
1198 | 1196,칫,108
1199 | 1197,맺,108
1200 | 1198,탭,107
1201 | 1199,쁨,106
1202 | 1200,폈,105
1203 | 1201,펼,105
1204 | 1202,첼,105
1205 | 1203,숱,105
1206 | 1204,섰,105
1207 | 1205,킥,104
1208 | 1206,맑,103
1209 | 1207,랗,103
1210 | 1208,펐,102
1211 | 1209,넛,102
1212 | 1210,솜,101
1213 | 1211,벙,101
1214 | 1212,껑,101
1215 | 1213,f,101
1216 | 1214,룡,100
1217 | 1215,훌,99
1218 | 1216,x,98
1219 | 1217,쓱,97
1220 | 1218,늬,97
1221 | 1219,곽,97
1222 | 1220,y,97
1223 | 1221,욘,96
1224 | 1222,돔,96
1225 | 1223,겄,96
1226 | 1224,텝,95
1227 | 1225,훠,94
1228 | 1226,텅,94
1229 | 1227,씌,94
1230 | 1228,꺾,94
1231 | 1229,벚,93
1232 | 1230,렷,93
1233 | 1231,귓,93
1234 | 1232,찹,92
1235 | 1233,툴,91
1236 | 1234,깅,91
1237 | 1235,쭤,90
1238 | 1236,욜,90
1239 | 1237,얌,90
1240 | 1238,짖,89
1241 | 1239,옳,89
1242 | 1240,벳,88
1243 | 1241,뛸,88
1244 | 1242,깠,88
1245 | 1243,퍽,87
1246 | 1244,퀸,87
1247 | 1245,엮,87
1248 | 1246,삽,87
1249 | 1247,겟,87
1250 | 1248,왓,86
1251 | 1249,댈,86
1252 | 1250,샴,85
1253 | 1251,뻗,85
1254 | 1252,됨,84
1255 | 1253,얜,83
1256 | 1254,굵,83
1257 | 1255,눕,82
1258 | 1256,갇,82
1259 | 1257,셰,81
1260 | 1258,늫,81
1261 | 1259,텨,80
1262 | 1260,숍,80
1263 | 1261,뻑,80
1264 | 1262,됩,80
1265 | 1263,잎,79
1266 | 1264,뭇,79
1267 | 1265,퐁,78
1268 | 1266,팸,78
1269 | 1267,쯔,78
1270 | 1268,넜,78
1271 | 1269,깍,78
1272 | 1270,쌔,77
1273 | 1271,셈,77
1274 | 1272,읍,76
1275 | 1273,픔,75
1276 | 1274,펫,75
1277 | 1275,콧,75
1278 | 1276,얗,75
1279 | 1277,눅,75
1280 | 1278,j,74
1281 | 1279,쬐,73
1282 | 1280,렙,73
1283 | 1281,닙,73
1284 | 1282,슥,72
1285 | 1283,흙,71
1286 | 1284,쭝,71
1287 | 1285,짭,71
1288 | 1286,샹,71
1289 | 1287,릏,71
1290 | 1288,럿,71
1291 | 1289,덧,71
1292 | 1290,즙,70
1293 | 1291,늑,70
1294 | 1292,괄,70
1295 | 1293,킷,68
1296 | 1294,쿡,68
1297 | 1295,캉,68
1298 | 1296,둡,68
1299 | 1297,톨,67
1300 | 1298,엣,67
1301 | 1299,숟,67
1302 | 1300,낑,67
1303 | 1301,펭,66
1304 | 1302,왁,66
1305 | 1303,쏴,66
1306 | 1304,쏙,66
1307 | 1305,봅,66
1308 | 1306,멧,66
1309 | 1307,줏,65
1310 | 1308,뵈,65
1311 | 1309,쫑,64
1312 | 1310,륨,64
1313 | 1311,h,64
1314 | 1312,펄,63
1315 | 1313,짼,63
1316 | 1314,짚,63
1317 | 1315,껐,63
1318 | 1316,겜,63
1319 | 1317,싯,62
1320 | 1318,붐,62
1321 | 1319,렐,62
1322 | 1320,돗,62
1323 | 1321,팥,61
1324 | 1322,웰,61
1325 | 1323,륜,61
1326 | 1324,잣,60
1327 | 1325,슝,60
1328 | 1326,붉,60
1329 | 1327,윽,59
1330 | 1328,삘,59
1331 | 1329,딲,59
1332 | 1330,갯,58
1333 | 1331,횐,57
1334 | 1332,헨,57
1335 | 1333,캘,57
1336 | 1334,쩰,57
1337 | 1335,뤘,57
1338 | 1336,랴,57
1339 | 1337,껜,57
1340 | 1338,펠,56
1341 | 1339,킵,56
1342 | 1340,컹,56
1343 | 1341,렘,56
1344 | 1342,뛴,56
1345 | 1343,헝,55
1346 | 1344,씽,55
1347 | 1345,뮬,55
1348 | 1346,젯,54
1349 | 1347,샜,54
1350 | 1348,뿜,54
1351 | 1349,뒹,54
1352 | 1350,뎌,54
1353 | 1351,깬,54
1354 | 1352,챠,53
1355 | 1353,왈,53
1356 | 1354,뾰,53
1357 | 1355,뚤,53
1358 | 1356,꾹,53
1359 | 1357,갛,53
1360 | 1358,잌,52
1361 | 1359,엿,52
1362 | 1360,솥,52
1363 | 1361,벡,52
1364 | 1362,룻,52
1365 | 1363,꿍,52
1366 | 1364,곈,52
1367 | 1365,팜,51
1368 | 1366,튕,51
1369 | 1367,컥,51
1370 | 1368,첸,51
1371 | 1369,줍,51
1372 | 1370,섀,51
1373 | 1371,몫,51
1374 | 1372,뜸,51
1375 | 1373,깁,51
1376 | 1374,핬,50
1377 | 1375,쭘,50
1378 | 1376,쌰,50
1379 | 1377,넬,50
1380 | 1378,큘,49
1381 | 1379,쾅,49
1382 | 1380,캄,48
1383 | 1381,괘,48
1384 | 1382,쟀,47
1385 | 1383,윌,47
1386 | 1384,엌,47
1387 | 1385,앓,47
1388 | 1386,씁,47
1389 | 1387,륭,47
1390 | 1388,W,47
1391 | 1389,쑤,46
1392 | 1390,삥,46
1393 | 1391,돕,46
1394 | 1392,깰,46
1395 | 1393,핍,45
1396 | 1394,텃,45
1397 | 1395,슛,45
1398 | 1396,맸,45
1399 | 1397,롬,45
1400 | 1398,갭,45
1401 | 1399,얽,44
1402 | 1400,쏭,44
1403 | 1401,랠,44
1404 | 1402,겔,44
1405 | 1403,Q,44
1406 | 1404,핥,43
1407 | 1405,킴,43
1408 | 1406,읏,43
1409 | 1407,앚,43
1410 | 1408,숯,43
1411 | 1409,밋,43
1412 | 1410,뽐,42
1413 | 1411,뻣,42
1414 | 1412,눴,42
1415 | 1413,잭,41
1416 | 1414,뽈,41
1417 | 1415,뗐,41
1418 | 1416,꽝,41
1419 | 1417,훑,40
1420 | 1418,캥,40
1421 | 1419,쫀,40
1422 | 1420,뵙,40
1423 | 1421,홋,39
1424 | 1422,펍,39
1425 | 1423,뺨,39
1426 | 1424,됬,39
1427 | 1425,끽,39
1428 | 1426,빕,38
1429 | 1427,밉,38
1430 | 1428,꿋,38
1431 | 1429,헉,37
1432 | 1430,캣,37
1433 | 1431,촘,37
1434 | 1432,셌,37
1435 | 1433,삔,37
1436 | 1434,삑,37
1437 | 1435,뽂,37
1438 | 1436,뮌,37
1439 | 1437,뗄,37
1440 | 1438,텼,36
1441 | 1439,탬,36
1442 | 1440,쨍,36
1443 | 1441,웜,36
1444 | 1442,앰,36
1445 | 1443,맴,36
1446 | 1444,띡,36
1447 | 1445,꿇,36
1448 | 1446,걀,36
1449 | 1447,흩,35
1450 | 1448,쥴,35
1451 | 1449,씸,35
1452 | 1450,낡,35
1453 | 1451,곁,35
1454 | 1452,휙,34
1455 | 1453,쿤,34
1456 | 1454,켄,34
1457 | 1455,츤,34
1458 | 1456,얕,34
1459 | 1457,썪,34
1460 | 1458,둠,34
1461 | 1459,촛,33
1462 | 1460,챌,33
1463 | 1461,죙,33
1464 | 1462,쟈,33
1465 | 1463,잰,33
1466 | 1464,뵀,33
1467 | 1465,w,33
1468 | 1466,푠,32
1469 | 1467,폿,32
1470 | 1468,쨈,32
1471 | 1469,밲,32
1472 | 1470,랖,32
1473 | 1471,떵,32
1474 | 1472,찻,31
1475 | 1473,옇,31
1476 | 1474,뽁,31
1477 | 1475,둣,31
1478 | 1476,닳,31
1479 | 1477,긱,31
1480 | 1478,곶,31
1481 | 1479,휠,30
1482 | 1480,춧,30
1483 | 1481,쐈,30
1484 | 1482,썽,30
1485 | 1483,뼛,29
1486 | 1484,떳,29
1487 | 1485,굘,29
1488 | 1486,훗,28
1489 | 1487,퀵,28
1490 | 1488,봬,28
1491 | 1489,릅,28
1492 | 1490,꺠,28
1493 | 1491,슉,27
1494 | 1492,눔,27
1495 | 1493,끙,27
1496 | 1494,궐,27
1497 | 1495,2,27
1498 | 1496,촥,26
1499 | 1497,젬,26
1500 | 1498,솟,26
1501 | 1499,맷,26
1502 | 1500,룐,26
1503 | 1501,뎃,26
1504 | 1502,깽,26
1505 | 1503,툼,25
1506 | 1504,쎌,25
1507 | 1505,쉼,25
1508 | 1506,쉘,25
1509 | 1507,숑,25
1510 | 1508,뎀,25
1511 | 1509,냔,25
1512 | 1510,쫘,24
1513 | 1511,쎘,24
1514 | 1512,싣,24
1515 | 1513,섣,24
1516 | 1514,샾,24
1517 | 1515,맬,24
1518 | 1516,뗀,24
1519 | 1517,꾀,24
1520 | 1518,헹,23
1521 | 1519,햐,23
1522 | 1520,톰,23
1523 | 1521,췌,23
1524 | 1522,챈,23
1525 | 1523,봔,23
1526 | 1524,밧,23
1527 | 1525,맏,23
1528 | 1526,딥,23
1529 | 1527,늠,23
1530 | 1528,낵,23
1531 | 1529,낱,23
1532 | 1530,꺄,23
1533 | 1531,갬,23
1534 | 1532,훼,22
1535 | 1533,핼,22
1536 | 1534,튠,22
1537 | 1535,웩,22
1538 | 1536,쏜,22
1539 | 1537,뿅,22
1540 | 1538,빰,22
1541 | 1539,딤,22
1542 | 1540,꿉,22
1543 | 1541,걜,22
1544 | 1542,1,22
1545 | 1543,짙,21
1546 | 1544,얍,21
1547 | 1545,샛,21
1548 | 1546,뗘,21
1549 | 1547,듭,21
1550 | 1548,챘,20
1551 | 1549,쯧,20
1552 | 1550,짹,20
1553 | 1551,잦,20
1554 | 1552,옐,20
1555 | 1553,빳,20
1556 | 1554,몹,20
1557 | 1555,몄,20
1558 | 1556,똔,20
1559 | 1557,딧,20
1560 | 1558,놉,20
1561 | 1559,궜,20
1562 | 1560,굼,20
1563 | 1561,헥,19
1564 | 1562,캬,19
1565 | 1563,챕,19
1566 | 1564,쟨,19
1567 | 1565,멓,19
1568 | 1566,똠,19
1569 | 1567,댐,19
1570 | 1568,텁,18
1571 | 1569,켈,18
1572 | 1570,첵,18
1573 | 1571,숄,18
1574 | 1572,띨,18
1575 | 1573,듦,18
1576 | 1574,궤,18
1577 | 1575,곗,18
1578 | 1576,튈,17
1579 | 1577,좆,17
1580 | 1578,윷,17
1581 | 1579,옅,17
1582 | 1580,얏,17
1583 | 1581,믈,17
1584 | 1582,룽,17
1585 | 1583,띃,17
1586 | 1584,딕,17
1587 | 1585,뎁,17
1588 | 1586,닛,17
1589 | 1587,냑,17
1590 | 1588,겅,17
1591 | 1589,휩,16
1592 | 1590,팎,16
1593 | 1591,틋,16
1594 | 1592,콸,16
1595 | 1593,콥,16
1596 | 1594,잴,16
1597 | 1595,웁,16
1598 | 1596,슘,16
1599 | 1597,멱,16
1600 | 1598,랏,16
1601 | 1599,떄,16
1602 | 1600,뒨,16
1603 | 1601,꿰,16
1604 | 1602,깻,16
1605 | 1603,긌,16
1606 | 1604,젼,15
1607 | 1605,윰,15
1608 | 1606,웍,15
1609 | 1607,앳,15
1610 | 1608,샬,15
1611 | 1609,샥,15
1612 | 1610,볕,15
1613 | 1611,멩,15
1614 | 1612,넹,15
1615 | 1613,넙,15
1616 | 1614,끕,15
1617 | 1615,휜,14
1618 | 1616,텄,14
1619 | 1617,쫒,14
1620 | 1618,쩝,14
1621 | 1619,쨋,14
1622 | 1620,윳,14
1623 | 1621,쉭,14
1624 | 1622,쇽,14
1625 | 1623,셧,14
1626 | 1624,뵐,14
1627 | 1625,땔,14
1628 | 1626,덱,14
1629 | 1627,댑,14
1630 | 1628,꺽,14
1631 | 1629,곪,14
1632 | 1630,켔,13
1633 | 1631,츰,13
1634 | 1632,읜,13
1635 | 1633,쑈,13
1636 | 1634,볐,13
1637 | 1635,및,13
1638 | 1636,롷,13
1639 | 1637,딛,13
1640 | 1638,냇,13
1641 | 1639,3,13
1642 | 1640,혓,12
1643 | 1641,팻,12
1644 | 1642,팰,12
1645 | 1643,킁,12
1646 | 1644,촤,12
1647 | 1645,쨰,12
1648 | 1646,잿,12
1649 | 1647,옌,12
1650 | 1648,쐬,12
1651 | 1649,쌋,12
1652 | 1650,슁,12
1653 | 1651,쉑,12
1654 | 1652,빢,12
1655 | 1653,뵌,12
1656 | 1654,뭄,12
1657 | 1655,묽,12
1658 | 1656,뎠,12
1659 | 1657,늪,12
1660 | 1658,뇽,12
1661 | 1659,훤,11
1662 | 1660,횔,11
1663 | 1661,홧,11
1664 | 1662,햅,11
1665 | 1663,푯,11
1666 | 1664,팼,11
1667 | 1665,탯,11
1668 | 1666,탤,11
1669 | 1667,큭,11
1670 | 1668,짢,11
1671 | 1669,읊,11
1672 | 1670,롸,11
1673 | 1671,띤,11
1674 | 1672,놋,11
1675 | 1673,넴,11
1676 | 1674,귿,11
1677 | 1675,q,11
1678 | 1676,휑,10
1679 | 1677,퓸,10
1680 | 1678,튿,10
1681 | 1679,튄,10
1682 | 1680,촐,10
1683 | 1681,쭌,10
1684 | 1682,짊,10
1685 | 1683,숴,10
1686 | 1684,숀,10
1687 | 1685,뿡,10
1688 | 1686,뻬,10
1689 | 1687,렜,10
1690 | 1688,뗬,10
1691 | 1689,늄,10
1692 | 1690,끍,10
1693 | 1691,곌,10
1694 | 1692,갰,10
1695 | 1693,폄,9
1696 | 1694,콰,9
1697 | 1695,캇,9
1698 | 1696,캅,9
1699 | 1697,늉,9
1700 | 1698,뉜,9
1701 | 1699,큔,8
1702 | 1700,콱,8
1703 | 1701,켠,8
1704 | 1702,쳔,8
1705 | 1703,쳇,8
1706 | 1704,읔,8
1707 | 1705,읎,8
1708 | 1706,욤,8
1709 | 1707,뼘,8
1710 | 1708,롹,8
1711 | 1709,렝,8
1712 | 1710,뚠,8
1713 | 1711,땋,8
1714 | 1712,덨,8
1715 | 1713,넵,8
1716 | 1714,넝,8
1717 | 1715,넋,8
1718 | 1716,꿩,8
1719 | 1717,꼿,8
1720 | 1718,깟,8
1721 | 1719,곯,8
1722 | 1720,4,8
1723 | 1721,힝,7
1724 | 1722,헀,7
1725 | 1723,푤,7
1726 | 1724,쿱,7
1727 | 1725,캤,7
1728 | 1726,챗,7
1729 | 1727,쯩,7
1730 | 1728,쮸,7
1731 | 1729,읗,7
1732 | 1730,윅,7
1733 | 1731,얠,7
1734 | 1732,씰,7
1735 | 1733,썅,7
1736 | 1734,쌜,7
1737 | 1735,쌉,7
1738 | 1736,슌,7
1739 | 1737,쉿,7
1740 | 1738,쇳,7
1741 | 1739,셍,7
1742 | 1740,뿍,7
1743 | 1741,뼌,7
1744 | 1742,뺌,7
1745 | 1743,밈,7
1746 | 1744,룟,7
1747 | 1745,룔,7
1748 | 1746,뜀,7
1749 | 1747,끅,7
1750 | 1748,꾜,7
1751 | 1749,겡,7
1752 | 1750,Z,7
1753 | 1751,횰,6
1754 | 1752,헴,6
1755 | 1753,핌,6
1756 | 1754,펩,6
1757 | 1755,펨,6
1758 | 1756,켸,6
1759 | 1757,쩠,6
1760 | 1758,잽,6
1761 | 1759,엡,6
1762 | 1760,앝,6
1763 | 1761,쓕,6
1764 | 1762,썜,6
1765 | 1763,쌘,6
1766 | 1764,삣,6
1767 | 1765,빻,6
1768 | 1766,몀,6
1769 | 1767,뤼,6
1770 | 1768,롄,6
1771 | 1769,떫,6
1772 | 1770,덫,6
1773 | 1771,뉸,6
1774 | 1772,꽥,6
1775 | 1773,궂,6
1776 | 1774,괭,6
1777 | 1775,0,6
1778 | 1776,힉,5
1779 | 1777,휸,5
1780 | 1778,퓰,5
1781 | 1779,퉤,5
1782 | 1780,퀭,5
1783 | 1781,켐,5
1784 | 1782,췻,5
1785 | 1783,챡,5
1786 | 1784,쨀,5
1787 | 1785,젱,5
1788 | 1786,읒,5
1789 | 1787,웡,5
1790 | 1788,옙,5
1791 | 1789,얬,5
1792 | 1790,앎,5
1793 | 1791,씼,5
1794 | 1792,쑨,5
1795 | 1793,쐐,5
1796 | 1794,쌨,5
1797 | 1795,쉈,5
1798 | 1796,숩,5
1799 | 1797,셤,5
1800 | 1798,삠,5
1801 | 1799,뱄,5
1802 | 1800,뭥,5
1803 | 1801,멨,5
1804 | 1802,먀,5
1805 | 1803,랒,5
1806 | 1804,땟,5
1807 | 1805,땁,5
1808 | 1806,뉠,5
1809 | 1807,꽌,5
1810 | 1808,귐,5
1811 | 1809,굥,5
1812 | 1810,굣,5
1813 | 1811,겋,5
1814 | 1812,8,5
1815 | 1813,5,5
1816 | 1814,훙,4
1817 | 1815,혠,4
1818 | 1816,폔,4
1819 | 1817,튬,4
1820 | 1818,퉜,4
1821 | 1819,탉,4
1822 | 1820,큽,4
1823 | 1821,퀀,4
1824 | 1822,쿰,4
1825 | 1823,켤,4
1826 | 1824,켕,4
1827 | 1825,칡,4
1828 | 1826,쯘,4
1829 | 1827,짯,4
1830 | 1828,짔,4
1831 | 1829,쥔,4
1832 | 1830,줜,4
1833 | 1831,죗,4
1834 | 1832,욧,4
1835 | 1833,쒸,4
1836 | 1834,쏼,4
1837 | 1835,쌂,4
1838 | 1836,셴,4
1839 | 1837,샨,4
1840 | 1838,뽜,4
1841 | 1839,뻠,4
1842 | 1840,빝,4
1843 | 1841,봥,4
1844 | 1842,뫼,4
1845 | 1843,맜,4
1846 | 1844,맀,4
1847 | 1845,뤠,4
1848 | 1846,띌,4
1849 | 1847,띈,4
1850 | 1848,뛌,4
1851 | 1849,뚸,4
1852 | 1850,떽,4
1853 | 1851,떰,4
1854 | 1852,떈,4
1855 | 1853,땍,4
1856 | 1854,돠,4
1857 | 1855,뎄,4
1858 | 1856,늣,4
1859 | 1857,뇸,4
1860 | 1858,놥,4
1861 | 1859,녈,4
1862 | 1860,넒,4
1863 | 1861,낏,4
1864 | 1862,갉,4
1865 | 1863,z,4
1866 | 1864,힛,3
1867 | 1865,휀,3
1868 | 1866,헙,3
1869 | 1867,픗,3
1870 | 1868,퐈,3
1871 | 1869,팹,3
1872 | 1870,틔,3
1873 | 1871,퇸,3
1874 | 1872,텟,3
1875 | 1873,탰,3
1876 | 1874,퀼,3
1877 | 1875,칬,3
1878 | 1876,췰,3
1879 | 1877,쳥,3
1880 | 1878,챂,3
1881 | 1879,찧,3
1882 | 1880,쭙,3
1883 | 1881,쬬,3
1884 | 1882,쩄,3
1885 | 1883,쨉,3
1886 | 1884,줴,3
1887 | 1885,죵,3
1888 | 1886,좍,3
1889 | 1887,좃,3
1890 | 1888,읐,3
1891 | 1889,읃,3
1892 | 1890,웟,3
1893 | 1891,왯,3
1894 | 1892,옭,3
1895 | 1893,씅,3
1896 | 1894,쒀,3
1897 | 1895,쑹,3
1898 | 1896,쏵,3
1899 | 1897,쎔,3
1900 | 1898,쎅,3
1901 | 1899,슾,3
1902 | 1900,쉣,3
1903 | 1901,솨,3
1904 | 1902,셸,3
1905 | 1903,셩,3
1906 | 1904,삯,3
1907 | 1905,뿟,3
1908 | 1906,뾱,3
1909 | 1907,뽝,3
1910 | 1908,뻰,3
1911 | 1909,뵤,3
1912 | 1910,볏,3
1913 | 1911,뱌,3
1914 | 1912,뱁,3
1915 | 1913,밨,3
1916 | 1914,밎,3
1917 | 1915,믐,3
1918 | 1916,뭡,3
1919 | 1917,뭠,3
1920 | 1918,뭍,3
1921 | 1919,묭,3
1922 | 1920,뫠,3
1923 | 1921,멏,3
1924 | 1922,멎,3
1925 | 1923,맽,3
1926 | 1924,뙤,3
1927 | 1925,떔,3
1928 | 1926,땃,3
1929 | 1927,덷,3
1930 | 1928,댜,3
1931 | 1929,늰,3
1932 | 1930,놘,3
1933 | 1931,냘,3
1934 | 1932,뀜,3
1935 | 1933,꽹,3
1936 | 1934,꽐,3
1937 | 1935,꼇,3
1938 | 1936,껸,3
1939 | 1937,껨,3
1940 | 1938,꺅,3
1941 | 1939,곘,3
1942 | 1940,겝,3
1943 | 1941,흣,2
1944 | 1942,흝,2
1945 | 1943,흄,2
1946 | 1944,헸,2
1947 | 1945,헜,2
1948 | 1946,햘,2
1949 | 1947,햑,2
1950 | 1948,핟,2
1951 | 1949,픙,2
1952 | 1950,픕,2
1953 | 1951,푀,2
1954 | 1952,퐉,2
1955 | 1953,틑,2
1956 | 1954,틍,2
1957 | 1955,큣,2
1958 | 1956,퀏,2
1959 | 1957,콴,2
1960 | 1958,켁,2
1961 | 1959,칟,2
1962 | 1960,츳,2
1963 | 1961,츨,2
1964 | 1962,츈,2
1965 | 1963,쵝,2
1966 | 1964,촙,2
1967 | 1965,찦,2
1968 | 1966,찟,2
1969 | 1967,쭁,2
1970 | 1968,쫏,2
1971 | 1969,쩬,2
1972 | 1970,쩧,2
1973 | 1971,짆,2
1974 | 1972,쥰,2
1975 | 1973,쥘,2
1976 | 1974,좐,2
1977 | 1975,졍,2
1978 | 1976,쟝,2
1979 | 1977,잧,2
1980 | 1978,잍,2
1981 | 1979,욍,2
1982 | 1980,왬,2
1983 | 1981,옫,2
1984 | 1982,옜,2
1985 | 1983,옉,2
1986 | 1984,엤,2
1987 | 1985,얐,2
1988 | 1986,얉,2
1989 | 1987,쒯,2
1990 | 1988,쑐,2
1991 | 1989,쏸,2
1992 | 1990,썻,2
1993 | 1991,쌕,2
1994 | 1992,싷,2
1995 | 1993,솝,2
1996 | 1994,솎,2
1997 | 1995,샅,2
1998 | 1996,삻,2
1999 | 1997,쁩,2
2000 | 1998,빘,2
2001 | 1999,뷸,2
2002 | 2000,뷴,2
2003 | 2001,봽,2
2004 | 2002,봈,2
2005 | 2003,볌,2
2006 | 2004,벋,2
2007 | 2005,믕,2
2008 | 2006,뮴,2
2009 | 2007,뭬,2
2010 | 2008,뭑,2
2011 | 2009,묜,2
2012 | 2010,맇,2
2013 | 2011,릈,2
2014 | 2012,륑,2
2015 | 2013,랮,2
2016 | 2014,랟,2
2017 | 2015,띔,2
2018 | 2016,떱,2
2019 | 2017,듈,2
2020 | 2018,됴,2
2021 | 2019,됫,2
2022 | 2020,됙,2
2023 | 2021,됑,2
2024 | 2022,댤,2
2025 | 2023,닯,2
2026 | 2024,늗,2
2027 | 2025,뉩,2
2028 | 2026,눟,2
2029 | 2027,눗,2
2030 | 2028,넚,2
2031 | 2029,냡,2
2032 | 2030,낢,2
2033 | 2031,꾿,2
2034 | 2032,꼳,2
2035 | 2033,꼄,2
2036 | 2034,겊,2
2037 | 2035,갼,2
2038 | 2036,6,2
2039 | 2037,,0
2040 | 2038, ,0
2041 | 2039,_,0
2042 |
--------------------------------------------------------------------------------