├── .gitattributes ├── .gitignore ├── .idea ├── RNN-LM.iml ├── inspectionProfiles │ └── profiles_settings.xml ├── misc.xml ├── modules.xml └── vcs.xml ├── LICENSE ├── README.md ├── data ├── .ipynb_checkpoints │ └── Data-Analysis-checkpoint.ipynb ├── Data-Analysis.ipynb └── train_labels.csv ├── main.py ├── model.py ├── package ├── __pycache__ │ ├── config.cpython-37.pyc │ ├── data_loader.cpython-37.pyc │ ├── definition.cpython-37.pyc │ ├── evaluator.cpython-37.pyc │ ├── loss.cpython-37.pyc │ ├── trainer.cpython-37.pyc │ └── utils.cpython-37.pyc ├── config.py ├── data_loader.py ├── definition.py ├── evaluator.py ├── loss.py ├── trainer.py └── utils.py └── preprocess ├── .ipynb_checkpoints └── Preprocess-checkpoint.ipynb ├── KoNPron.ipynb ├── Preprocess.ipynb ├── konpron.py ├── special.csv └── train_labels.csv /.gitattributes: -------------------------------------------------------------------------------- 1 | docs/* linguist-vendored 2 | sphinx/* linguist-vendored 3 | preprocess/* linguist-vendored 4 | data/* linguist-vendored 5 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.bin 2 | *.zip 3 | *.idea 4 | __pycache__/ 5 | *.pyc 6 | .idea -------------------------------------------------------------------------------- /.idea/RNN-LM.iml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 12 | 13 | 15 | -------------------------------------------------------------------------------- /.idea/inspectionProfiles/profiles_settings.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 6 | -------------------------------------------------------------------------------- /.idea/misc.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 
-------------------------------------------------------------------------------- /.idea/modules.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | -------------------------------------------------------------------------------- /.idea/vcs.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 SooHwan Kim 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Character RNN Language Model 2 | 3 | ### Recurrent Neural Network Language Model in PyTorch 4 | [](https://pytorch.org/) 5 | 6 | ## Intro 7 | 8 | This project implements a character-level recurrent neural network language model (RNNLM) in [PyTorch](pytorch.org). 9 | This language model can be combined with tasks such as speech recognition, machine translation, and image captioning. 10 | We appreciate any kind of feedback or contribution. 11 | 12 | 13 | 14 | ## Installation 15 | Python 3.7 or higher is recommended. 16 | I recommend creating a new virtual environment for this project (using virtualenv or conda). 17 | 18 | ### Prerequisites 19 | 20 | * NumPy: `pip install numpy` (see [here](https://github.com/numpy/numpy) if you have problems installing NumPy). 21 | * Pandas: `pip install pandas` (see [here](https://github.com/pandas-dev/pandas) if you have problems installing Pandas). 22 | * PyTorch: see the [PyTorch website](http://pytorch.org/) to install the version that matches your environment. 23 | 24 | 25 | ## Troubleshooting and Contributing 26 | 27 | If you have any questions, bug reports, or feature requests, please [open an issue](https://github.com/sooftware/RNN-LM/issues) on GitHub 28 | or contact sh951011@gmail.com. 29 | 30 | I appreciate any kind of feedback or contribution. Feel free to proceed with small issues such as bug fixes and documentation improvements. For major contributions and new features, please discuss them with the collaborators in the corresponding issues. 31 | 32 | ### Code Style 33 | 34 | We follow PEP-8 for code style. Docstring style is especially important because the documentation is generated from docstrings. 
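As an illustration of the docstring convention mentioned above, here is a minimal sketch. The function name `filter_by_length`, its parameters, and the Google-style `Args`/`Returns` layout are assumptions made for this example and are not taken from the package; the logic mirrors the length filter used in `data/Data-Analysis.ipynb`.

```python
from typing import List, Sequence, Tuple


def filter_by_length(
    sentences: Sequence[str],
    targets: Sequence[str],
    max_length: int = 150,
) -> Tuple[List[str], List[str]]:
    """Keep only the pairs whose target has at most ``max_length`` tokens.

    Hypothetical helper, written for illustration only.

    Args:
        sentences: Raw text sentences.
        targets: Space-separated token-id strings, aligned with ``sentences``.
        max_length: Largest number of tokens a target may contain.

    Returns:
        A ``(sentences, targets)`` tuple holding the pairs that were kept.
    """
    kept_sentences, kept_targets = [], []
    for sentence, target in zip(sentences, targets):
        # A target such as "8 190 0 42" is tokenized on whitespace.
        if len(target.split()) <= max_length:
            kept_sentences.append(sentence)
            kept_targets.append(target)
    return kept_sentences, kept_targets
```

With the default `max_length=150`, this keeps the same rows as the notebook's `len(target.split()) < 151` check.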
35 | 36 | ## Reference 37 | 38 | [[1] IBM pytorch-seq2seq](https://github.com/IBM/pytorch-seq2seq) 39 | [[2] Character-unit based End-to-End Korean Speech Recognition](https://github.com/sooftware/End-to-End-Korean-Speech-Recognition) 40 | [[3] 「An analysis of incorporating an external language model into a sequence-to-sequence model」 Paper](https://arxiv.org/abs/1712.01996) 41 | 42 | ## Author 43 | 44 | * Soohwan Kim [@sooftware](https://github.com/sooftware) 45 | * Contacts: sh951011@gmail.com 46 | -------------------------------------------------------------------------------- /data/.ipynb_checkpoints/Data-Analysis-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pickle\n", 10 | "import pandas as pd\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "import seaborn as sns" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 2, 18 | "metadata": {}, 19 | "outputs": [ 20 | { 21 | "data": { 22 | "text/html": [ 23 | "
\n", 24 | "\n", 37 | "\n", 38 | " \n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | "
| | ko | id |
|---|---|---|
| 0 | 어디 보자... | 8 190 0 42 45 1 1 1 |
| 1 | 칠대 왕국 종족 중에 오크는 처음 듣는다 | 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... |
| 2 | 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. | 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... |
| 3 | 라니스터 가의 | 32 20 79 162 0 6 130 |
| 4 | 별 희한한 생각이 다 떠오르곤 하죠 | 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... |
| ... | ... | ... |
| 2992115 | 네 오늘 아침 범죄현장에 있었죠 | 96 0 57 275 0 5 373 0 560 700 350 109 17 0 26 ... |
| 2992116 | 머시 | 235 47 |
| 2992117 | 눈은 풀려 있었고 입에선 연신 침이 흘러 나왔다. | 351 23 0 449 108 0 26 62 7 0 219 17 194 0 147 ... |
| 2992118 | 나는 좋은 선생님이야. | 13 4 0 94 23 0 194 71 216 3 25 1 |
| 2992119 | 다만 취업 준비를 지원한다는 제도 성격을 고려해 유흥.도박.성인 용품 등 용도나 고... | 15 46 0 331 207 0 245 122 55 0 10 82 27 15 4 0... |
\n", 103 | "

2992120 rows × 2 columns

\n", 104 | "
" 105 | ], 106 | "text/plain": [ 107 | " ko \\\n", 108 | "0 어디 보자... \n", 109 | "1 칠대 왕국 종족 중에 오크는 처음 듣는다 \n", 110 | "2 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n", 111 | "3 라니스터 가의 \n", 112 | "4 별 희한한 생각이 다 떠오르곤 하죠 \n", 113 | "... ... \n", 114 | "2992115 네 오늘 아침 범죄현장에 있었죠 \n", 115 | "2992116 머시 \n", 116 | "2992117 눈은 풀려 있었고 입에선 연신 침이 흘러 나왔다. \n", 117 | "2992118 나는 좋은 선생님이야. \n", 118 | "2992119 다만 취업 준비를 지원한다는 제도 성격을 고려해 유흥.도박.성인 용품 등 용도나 고... \n", 119 | "\n", 120 | " id \n", 121 | "0 8 190 0 42 45 1 1 1 \n", 122 | "1 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n", 123 | "2 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n", 124 | "3 32 20 79 162 0 6 130 \n", 125 | "4 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... \n", 126 | "... ... \n", 127 | "2992115 96 0 57 275 0 5 373 0 560 700 350 109 17 0 26 ... \n", 128 | "2992116 235 47 \n", 129 | "2992117 351 23 0 449 108 0 26 62 7 0 219 17 194 0 147 ... \n", 130 | "2992118 13 4 0 94 23 0 194 71 216 3 25 1 \n", 131 | "2992119 15 46 0 331 207 0 245 122 55 0 10 82 27 15 4 0... 
\n", 132 | "\n", 133 | "[2992120 rows x 2 columns]" 134 | ] 135 | }, 136 | "execution_count": 2, 137 | "metadata": {}, 138 | "output_type": "execute_result" 139 | } 140 | ], 141 | "source": [ 142 | "corpus_df = None\n", 143 | "\n", 144 | "with open('corpus_df.bin', 'rb') as f:\n", 145 | "    corpus_df = pickle.load(f)\n", 146 | " \n", 147 | "corpus_df" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 4, 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [ 156 | "targets = corpus_df['id']\n", 157 | "target_lengths = list()\n", 158 | "\n", 159 | "for target in targets:\n", 160 | "    tokens = target.split()\n", 161 | "    target_lengths.append(len(tokens))" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 5, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "data": { 171 | "image/png": "[base64 PNG data omitted: box plot 'Language Model Corpus length Analysis', x-axis 'Sequence Length']\n", 172 | "text/plain": [ 173 | "
" 174 | ] 175 | }, 176 | "metadata": { 177 | "needs_background": "light" 178 | }, 179 | "output_type": "display_data" 180 | } 181 | ], 182 | "source": [ 183 | "sns.boxplot(target_lengths)\n", 184 | "plt.title('Language Model Corpus length Analysis', fontsize='x-large')\n", 185 | "plt.xlabel('Sequence Length', fontsize='large')\n", 186 | "plt.show()" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 6, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "sentences = corpus_df['ko']\n", 196 | "\n", 197 | "new_sentences = list()\n", 198 | "new_targets = list()\n", 199 | "\n", 200 | "for (sentence, target) in zip(sentences, targets):\n", 201 | " if len(target.split()) < 151:\n", 202 | " new_sentences.append(sentence)\n", 203 | " new_targets.append(target)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 8, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "2991682\n", 216 | "2991682\n" 217 | ] 218 | } 219 | ], 220 | "source": [ 221 | "print(len(new_sentences))\n", 222 | "print(len(new_targets))" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": 11, 228 | "metadata": {}, 229 | "outputs": [ 230 | { 231 | "data": { 232 | "text/html": [ 233 | "
\n", 234 | "\n", 247 | "\n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | "
| | ko | id |
|---|---|---|
| 0 | 어디 보자... | 8 190 0 42 45 1 1 1 |
| 1 | 칠대 왕국 종족 중에 오크는 처음 듣는다 | 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... |
| 2 | 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. | 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... |
| 3 | 라니스터 가의 | 32 20 79 162 0 6 130 |
| 4 | 별 희한한 생각이 다 떠오르곤 하죠 | 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... |
\n", 283 | "
" 284 | ], 285 | "text/plain": [ 286 | " ko \\\n", 287 | "0 어디 보자... \n", 288 | "1 칠대 왕국 종족 중에 오크는 처음 듣는다 \n", 289 | "2 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n", 290 | "3 라니스터 가의 \n", 291 | "4 별 희한한 생각이 다 떠오르곤 하죠 \n", 292 | "\n", 293 | " id \n", 294 | "0 8 190 0 42 45 1 1 1 \n", 295 | "1 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n", 296 | "2 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n", 297 | "3 32 20 79 162 0 6 130 \n", 298 | "4 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... " 299 | ] 300 | }, 301 | "execution_count": 11, 302 | "metadata": {}, 303 | "output_type": "execute_result" 304 | } 305 | ], 306 | "source": [ 307 | "corpus_dict = {'ko' : new_sentences,\n", 308 | " 'id' : new_targets}\n", 309 | "corpus_df = pd.DataFrame(corpus_dict)\n", 310 | "corpus_df.head()" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": 12, 316 | "metadata": {}, 317 | "outputs": [], 318 | "source": [ 319 | "with open('corpus_df.bin', 'wb') as f:\n", 320 | " pickle.dump(corpus_df, f)" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [] 329 | } 330 | ], 331 | "metadata": { 332 | "kernelspec": { 333 | "display_name": "Python 3", 334 | "language": "python", 335 | "name": "python3" 336 | }, 337 | "language_info": { 338 | "codemirror_mode": { 339 | "name": "ipython", 340 | "version": 3 341 | }, 342 | "file_extension": ".py", 343 | "mimetype": "text/x-python", 344 | "name": "python", 345 | "nbconvert_exporter": "python", 346 | "pygments_lexer": "ipython3", 347 | "version": "3.7.6" 348 | } 349 | }, 350 | "nbformat": 4, 351 | "nbformat_minor": 4 352 | } 353 | -------------------------------------------------------------------------------- /data/Data-Analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | 
"outputs": [], 8 | "source": [ 9 | "import pickle\n", 10 | "import pandas as pd\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "import seaborn as sns" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 2, 18 | "metadata": {}, 19 | "outputs": [ 20 | { 21 | "data": { 22 | "text/html": [ 23 | "
\n", 24 | "\n", 37 | "\n", 38 | " \n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | "
| | ko | id |
|---|---|---|
| 0 | 어디 보자... | 8 190 0 42 45 1 1 1 |
| 1 | 칠대 왕국 종족 중에 오크는 처음 듣는다 | 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... |
| 2 | 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. | 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... |
| 3 | 라니스터 가의 | 32 20 79 162 0 6 130 |
| 4 | 별 희한한 생각이 다 떠오르곤 하죠 | 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... |
| ... | ... | ... |
| 2992115 | 네 오늘 아침 범죄현장에 있었죠 | 96 0 57 275 0 5 373 0 560 700 350 109 17 0 26 ... |
| 2992116 | 머시 | 235 47 |
| 2992117 | 눈은 풀려 있었고 입에선 연신 침이 흘러 나왔다. | 351 23 0 449 108 0 26 62 7 0 219 17 194 0 147 ... |
| 2992118 | 나는 좋은 선생님이야. | 13 4 0 94 23 0 194 71 216 3 25 1 |
| 2992119 | 다만 취업 준비를 지원한다는 제도 성격을 고려해 유흥.도박.성인 용품 등 용도나 고... | 15 46 0 331 207 0 245 122 55 0 10 82 27 15 4 0... |
\n", 103 | "

2992120 rows × 2 columns

\n", 104 | "
" 105 | ], 106 | "text/plain": [ 107 | " ko \\\n", 108 | "0 어디 보자... \n", 109 | "1 칠대 왕국 종족 중에 오크는 처음 듣는다 \n", 110 | "2 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n", 111 | "3 라니스터 가의 \n", 112 | "4 별 희한한 생각이 다 떠오르곤 하죠 \n", 113 | "... ... \n", 114 | "2992115 네 오늘 아침 범죄현장에 있었죠 \n", 115 | "2992116 머시 \n", 116 | "2992117 눈은 풀려 있었고 입에선 연신 침이 흘러 나왔다. \n", 117 | "2992118 나는 좋은 선생님이야. \n", 118 | "2992119 다만 취업 준비를 지원한다는 제도 성격을 고려해 유흥.도박.성인 용품 등 용도나 고... \n", 119 | "\n", 120 | " id \n", 121 | "0 8 190 0 42 45 1 1 1 \n", 122 | "1 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n", 123 | "2 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n", 124 | "3 32 20 79 162 0 6 130 \n", 125 | "4 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... \n", 126 | "... ... \n", 127 | "2992115 96 0 57 275 0 5 373 0 560 700 350 109 17 0 26 ... \n", 128 | "2992116 235 47 \n", 129 | "2992117 351 23 0 449 108 0 26 62 7 0 219 17 194 0 147 ... \n", 130 | "2992118 13 4 0 94 23 0 194 71 216 3 25 1 \n", 131 | "2992119 15 46 0 331 207 0 245 122 55 0 10 82 27 15 4 0... 
\n", 132 | "\n", 133 | "[2992120 rows x 2 columns]" 134 | ] 135 | }, 136 | "execution_count": 2, 137 | "metadata": {}, 138 | "output_type": "execute_result" 139 | } 140 | ], 141 | "source": [ 142 | "corpus_df = None\n", 143 | "\n", 144 | "with open('corpus_df.bin', 'rb') as f:\n", 145 | " corpus_df = pickle.load(f)\n", 146 | " \n", 147 | "corpus_df" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 4, 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [ 156 | "targets = corpus_df['id']\n", 157 | "target_lengths = list()\n", 158 | "\n", 159 | "for target in targets:\n", 160 | " tokens = target.split()\n", 161 | " target_lengths.append(len(tokens))" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 5, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "data": { 171 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAEaCAYAAAAv2I3rAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAcb0lEQVR4nO3de5QV1YHv8e8P2hdDIgpCDD5ag47iGI3gI6MxSjBB9GYymSQ3xoQmGjVzIxLHuVGjS2CUic6V+MBM4ttGHR3nJiZeIT6I4E0yiQ5E8QXGNsEHENEGVAIxt3HfP/Y+bVGc7j4N3b274fdZ66w+VbXPrl37nPPrql3V1QohYGZmPa9f7gaYmW2rHMBmZpk4gM3MMnEAm5ll4gA2M8vEAWxmlokD2DokKUj6cifK16fXHNOd7eqtJE2U1JK7HUWSbpM0N3c7OtLZz1oN9c2XdFNX1dfVelUA95UPSW8g6bj0YX1H0pDSsu0kvdbVH+auJqlO0iRJj0t6W9Kbkp6QdJGkXXK3ry+SdEx63+u7uN5RkjZI+k1X1tsDPgv8Q+5GtKVXBbBtlj8AE0rz/hZYn6EtNZO0HTAbmA7cA4wBDgEuAo4CGrag7u27oo22kbOA7wP1kkbnbkytQgirQghv5W5HW/pUAEv6kqTH0p7SG5JmS9q/sLxy6PsFSf9H0jpJv5P0lVI9+0h6SNKfJL0s6RvlQxVJSyVdXHrdTZLmF6ZPSK9bldr0qKQjNmNddZKmSvp9KvespLNq7JabgTNK885M88v9t7ukuyWtkbQ+tWN0qczxkp5K7XhK0vFV6hmWjlZeT3uuv5R0bI3trTgHOAH4VAjhyhDCf4UQloYQ5oQQ/hvQWFhfg6Tn0t7+q5Iuk1RXWD5f0s2SLpW0AliW5i+VND29b2+lz8wVkvoVXlvL+3xM2sa302ORpE91ZmPTHuRDktamfvuRpL0Ly6dKapL0N5KWSPqjpHmSPlSq5xRJL6b35z8lnZw+88ekvd6fp6K/T/Pnl15/pqSXUn/8RNJuNbT9fcApwA3A3cTPV7lMkPQ/JN2e+ugVSd8qlWn3+1ulzkZJD1WZP0/Sben5HpJ+mOpbr/h9/5+FsuXv2ha/l12pTwUw
sANwKXAY8cu7AZitTfd4LgduBz5M3Lu6VdJ+AJIE3AvsDBwLfBo4CfjIZrRnIPA94h7bXwMvAA9IGtzJdd1EPFQ6CzgQ+CfgCkmn19CGu4HdlcZb0xf248AtxUKpLT8GDgBOBo4AXgMeVhrCkPRB4H5gIbGPzwOuKdWzEzAPeB9wYtqWOameA2tob8VXgEdCCL+qtjCEsDqt76S0LbcDB6c2fQOYUnrJF4DdgE8Q96YrJgHLgcOBc4GzgW/W2khJ/YH7gMeIfXIYMBVY14k6RgKPAr8CRqf2bSD22Y6ForsDfw+cSvw8DaLwPkoaBdwJ3EU8WvgX4OrC618B/iY9PyLV99nC8sOB44mfwXHAocCVNWzCqcALIYSngNuAUyQNrFJuCvB/U73/i/gZLv4Cr/X7W/EDYKykfSozCp/vG9OsfyV+v8YSvzunA69Wq6wr3ssuF0LoNQ/imzu3E+V3BQJwdJquT9P/UChTB6wFzkrTJ6QyI0r1rANuKsxbClxcWt9NwPx22tMPWA2cWuu6gH2Ad4EDSnVdAjzZzrqOS3XvQfwQNqb5lwP3pecB+HJ6/ok0PbJQxw7ACuCSNH0Z8BJQVyhzcqmeicQPeF2pPY8AV5feh2Paaf864Noa3uOfA/eU5k0mDrFsn6bnA78F+pXKLQV+Xpr3z8Crtb7PwC5pW47rxOdyItBS+lzfXSqzQ+qDz6TpqUALsFuhzBfTZ2PHNH1nle35erGvgWPSdH2V79brwA6FeRcAK2rYnt8AkwvTzwJnlsqE8vsJLAG+U+v3t/yZTdNPAZcVpr8DPFuYXgRMbWcd83nvu9bp97K7H31qD1jSoZLuVTxUfxt4OS3au1T0ycqTEEILcU9vWJo1EngjhNBUKLMKeH4z2rNPOuRqkvQW8Bbxt3GlPbWsazQgYEE6PF0raS3wbWC/GptyPfD5dDg5kff2DooOAppDCM8V2vIOcW/goEJ7H099VvGLUj2HAx8A1pTa+7FOtBfiNtdyJ6iDiHtVRY8COwLFw/OFIYR3q7y+vIf9S2C4pPfX0sgQ98RvAh6U9FNJF0j6y1peW3A48Lel/mpO21Dss+UhhNcL08uI/TQ0TY8Efl2qu+oRRBsWp/e8WP+wtgoDKA6pHQz8W2F2I1WGISh876rV34nvb9H1wFcl9VccdprIxp/vq4Fvp6GNK9TOUFgXvZddqq7jIr2DpAHAQ8RAOI148gnib+PyIcyfS9OBjYdbavniv0v88BdtV5q+H3iDeEj8SlrvL0rt6WhdlXb9NZseCtV0q7oQwiJJzxAPTVuIQwJVi1aZVwzCaqFYnu4HLCae6CvrzKHc87wX/B0pt0FV5v+xxrrK72mH73MI4QxJ1wCfJB7VXCrp7BDC9TWusx9xCOXyKsuaC8+rfW4rry/P2xzV6i9ve9mZxJxYEUexIL2mn6TDQgjFqyLa/N518vtbdDtwBXHYpB9xL3ZW6wpCuFXSA8QhleOBn0q6N4RQ9eqfLngvu1Rf2gM+kDjGd1EIYV4IYTHxzejoA1T2HLCbpBGVGYqXPJVPBqwEPlia1zp2m8Z5RwKXhxAeTHuWf+K9vZVa17Uw/dwrhNBUerzYie26njjMcEsIYUOV5c8CQ9J4ZKUtOxDHCp8tlDkyjZVVlK/lXQDsC7xVpb3LO9HeO4Axkj5abaHeuwztWeKYX9GxxCGI39WwnqNK0x8l7mlWzoy3+z5XhBCeCSF8N4RwIvEEZ7U9wLYsIJ6PeLFKn63uRD3PpfYXlbevEoL92ULpKOGLxB2MQwuPQ4jnATrTB5v1/U3v093EE81nAD9MR5HFMitCCLeGECYQx4BPbe8IZwvfyy7VGwN4YDpUKT4OII5NvgNMkvQhSZ8gniDq7B7BXOK40SxJh0s6hPhbtqVU11zgv0v6pKS/lHQVGx8qrSaOqZ0haf8UJHex8eVfHa4rDU/cAtwo6SuSRkg6RNJpks7vxHbdRvyAX9rG8keAx4F/k3S0pL8i
g8Tb+UEM2mvS4fwa4m37RLyp9Qcp3BQ9xOR4hdrsDWwHrCjUfT3v3XAd4A+FutelpwOJtxJcFUJY3Ua951XqTPXumdraGe1t9ybtI4bzwPR8o36h9j5pqz7byvkknAEQQlgi6TbiITzE8JgeQrizXFbSfhTuyZuGB4r36P0jcc+14gOF568Q94CHhPf+a0GtXgF2lTQoxHvjlpdNDyFM72Sd1dZRdbtrsIJ4/9mKPUvLfecr24j3gLdRkg6QdJ6kPdL0nsShgF+nIj8ALpR0UFq+s6TPp2WzgYMkfTadWDuHjUP2SeBYSXtJ2pn4vwCB1n8v9BAwQ9L7JfWT9CFJH++ozem1PwX+VdIukraTdGxafCPwdUlHKvoLSSdJel87VW6v+F+HK4/+HWx3R+4BJksarngP3/NLy18j3irRDHAAb8veJp7sekzSH4nB+wxwHkAI4V7if429W9JbadmJadkbwOeBy4Fm4n8o/mWl4hDCw8C/E+9hu5B4uVvRBOJlb88R/2vt/yaO79biK8T7Ki8h/peMb6Z1LiCeBLsu1dlEHE9uz7PA+sLjq+1tdw1uJP5yeYp4D985xLH1DWn5NcDn0tUO19ZYp23FfD9g6xKS5hNPvN2Uuy29Rbp87gchhL07LGzbJO8Bm3URSTtJGp+u+x0OTAHuzd0u670cwGZdR8A04hDIE8T/3XZJ1hZZr+YhCDOzTLwHbGaWiQPYzCwTB7CZWSYOYDOzTBzAZmaZ/H++yD/HUsXeOAAAAABJRU5ErkJggg==\n", 172 | "text/plain": [ 173 | "
" 174 | ] 175 | }, 176 | "metadata": { 177 | "needs_background": "light" 178 | }, 179 | "output_type": "display_data" 180 | } 181 | ], 182 | "source": [ 183 | "sns.boxplot(target_lengths)\n", 184 | "plt.title('Language Model Corpus length Analysis', fontsize='x-large')\n", 185 | "plt.xlabel('Sequence Length', fontsize='large')\n", 186 | "plt.show()" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 6, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "sentences = corpus_df['ko']\n", 196 | "\n", 197 | "new_sentences = list()\n", 198 | "new_targets = list()\n", 199 | "\n", 200 | "for (sentence, target) in zip(sentences, targets):\n", 201 | " if len(target.split()) < 151:\n", 202 | " new_sentences.append(sentence)\n", 203 | " new_targets.append(target)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 8, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "2991682\n", 216 | "2991682\n" 217 | ] 218 | } 219 | ], 220 | "source": [ 221 | "print(len(new_sentences))\n", 222 | "print(len(new_targets))" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": 11, 228 | "metadata": {}, 229 | "outputs": [ 230 | { 231 | "data": { 232 | "text/html": [ 233 | "
\n", 234 | "\n", 247 | "\n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | "
   ko    id
0  어디 보자...    8 190 0 42 45 1 1 1
1  칠대 왕국 종족 중에 오크는 처음 듣는다    318 50 0 576 170 0 363 401 0 129 17 0 57 238 4...
2  이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.    3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22...
3  라니스터 가의    32 20 79 162 0 6 130
4  별 희한한 생각이 다 떠오르곤 하죠    233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5...
\n", 283 | "
" 284 | ], 285 | "text/plain": [ 286 | " ko \\\n", 287 | "0 어디 보자... \n", 288 | "1 칠대 왕국 종족 중에 오크는 처음 듣는다 \n", 289 | "2 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n", 290 | "3 라니스터 가의 \n", 291 | "4 별 희한한 생각이 다 떠오르곤 하죠 \n", 292 | "\n", 293 | " id \n", 294 | "0 8 190 0 42 45 1 1 1 \n", 295 | "1 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n", 296 | "2 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n", 297 | "3 32 20 79 162 0 6 130 \n", 298 | "4 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... " 299 | ] 300 | }, 301 | "execution_count": 11, 302 | "metadata": {}, 303 | "output_type": "execute_result" 304 | } 305 | ], 306 | "source": [ 307 | "corpus_dict = {'ko' : new_sentences,\n", 308 | " 'id' : new_targets}\n", 309 | "corpus_df = pd.DataFrame(corpus_dict)\n", 310 | "corpus_df.head()" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": 12, 316 | "metadata": {}, 317 | "outputs": [], 318 | "source": [ 319 | "with open('corpus_df.bin', 'wb') as f:\n", 320 | " pickle.dump(corpus_df, f)" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [] 329 | } 330 | ], 331 | "metadata": { 332 | "kernelspec": { 333 | "display_name": "Python 3", 334 | "language": "python", 335 | "name": "python3" 336 | }, 337 | "language_info": { 338 | "codemirror_mode": { 339 | "name": "ipython", 340 | "version": 3 341 | }, 342 | "file_extension": ".py", 343 | "mimetype": "text/x-python", 344 | "name": "python", 345 | "nbconvert_exporter": "python", 346 | "pygments_lexer": "ipython3", 347 | "version": "3.7.6" 348 | } 349 | }, 350 | "nbformat": 4, 351 | "nbformat_minor": 4 352 | } 353 | -------------------------------------------------------------------------------- /data/train_labels.csv: -------------------------------------------------------------------------------- 1 | id,char,freq 2 | 0, ,5774462 3 | 1,.,640924 4 | 2,그,556373 5 | 3,이,509291 6 | 4,는,374559 7 | 5,아,370444 8 
| 6,가,369698 9 | 7,고,356378 10 | 8,어,333842 11 | 9,거,306987 12 | 10,지,276453 13 | 11,데,249269 14 | 12,?,235024 15 | 13,나,229646 16 | 14,하,226136 17 | 15,다,221216 18 | 16,서,211193 19 | 17,에,204330 20 | 18,도,190561 21 | 19,게,177140 22 | 20,니,173284 23 | 21,",",152938 24 | 22,기,149467 25 | 23,은,144674 26 | 24,면,142025 27 | 25,야,137553 28 | 26,있,133155 29 | 27,한,121564 30 | 28,을,121048 31 | 29,까,119483 32 | 30,해,115148 33 | 31,리,111855 34 | 32,라,111479 35 | 33,래,105784 36 | 34,사,100533 37 | 35,근,99781 38 | 36,들,99447 39 | 37,안,97043 40 | 38,로,91847 41 | 39,일,88319 42 | 40,뭐,87328 43 | 41,내,85968 44 | 42,보,82911 45 | 43,제,80874 46 | 44,같,79626 47 | 45,자,76298 48 | 46,만,76093 49 | 47,시,72836 50 | 48,런,70919 51 | 49,너,69192 52 | 50,대,68756 53 | 51,때,67179 54 | 52,되,66237 55 | 53,으,66106 56 | 54,진,62831 57 | 55,를,61802 58 | 56,잖,61455 59 | 57,오,60782 60 | 58,러,60629 61 | 59,인,60234 62 | 60,막,59994 63 | 61,무,58705 64 | 62,었,58385 65 | 63,구,57294 66 | 64,했,57209 67 | 65,수,56787 68 | 66,간,55275 69 | 67,애,54476 70 | 68,우,53539 71 | 69,요,53234 72 | 70,마,53125 73 | 71,생,52815 74 | 72,렇,50798 75 | 73,냥,49989 76 | 74,짜,49581 77 | 75,주,48969 78 | 76,없,48392 79 | 77,말,47929 80 | 78,학,46285 81 | 79,스,46225 82 | 80,더,44487 83 | 81,많,43607 84 | 82,원,41379 85 | 83,음,41348 86 | 84,정,39775 87 | 85,겠,39691 88 | 86,여,39203 89 | 87,먹,39194 90 | 88,금,38720 91 | 89,든,38476 92 | 90,부,38398 93 | 91,할,38262 94 | 92,전,36575 95 | 93,번,36375 96 | 94,좋,36363 97 | 95,랑,36081 98 | 96,네,35514 99 | 97,람,33799 100 | 98,약,33412 101 | 99,건,33371 102 | 100,각,32167 103 | 101,좀,31738 104 | 102,알,30893 105 | 103,잘,30132 106 | 104,걸,29634 107 | 105,모,29629 108 | 106,것,28482 109 | 107,상,28247 110 | 108,려,28218 111 | 109,장,27856 112 | 110,히,27705 113 | 111,않,27305 114 | 112,맞,27202 115 | 113,던,27082 116 | 114,르,26286 117 | 115,교,26116 118 | 116,바,25994 119 | 117,냐,25742 120 | 118,드,25702 121 | 119,십,25654 122 | 120,날,25556 123 | 121,치,25287 124 | 122,비,25278 125 | 123,단,25129 126 | 124,동,25047 127 | 125,또,24720 
128 | 126,못,24528 129 | 127,저,24074 130 | 128,얘,23990 131 | 129,중,23851 132 | 130,의,23607 133 | 131,난,23318 134 | 132,엄,23057 135 | 133,봤,22930 136 | 134,걔,22732 137 | 135,화,22593 138 | 136,응,22254 139 | 137,싶,21756 140 | 138,갔,21628 141 | 139,았,21052 142 | 140,집,20850 143 | 141,왜,20801 144 | 142,계,20757 145 | 143,공,20620 146 | 144,긴,20547 147 | 145,신,20371 148 | 146,적,20244 149 | 147,연,20225 150 | 148,직,20061 151 | 149,실,19467 152 | 150,영,19454 153 | 151,미,19444 154 | 152,봐,18931 155 | 153,분,18893 156 | 154,테,18829 157 | 155,년,18669 158 | 156,트,18654 159 | 157,문,18230 160 | 158,와,18114 161 | 159,돼,18028 162 | 160,물,17889 163 | 161,예,17864 164 | 162,터,17722 165 | 163,세,17719 166 | 164,럼,17521 167 | 165,청,17479 168 | 166,차,17455 169 | 167,친,17355 170 | 168,개,17355 171 | 169,삼,17242 172 | 170,국,17224 173 | 171,두,17129 174 | 172,소,17125 175 | 173,살,16893 176 | 174,재,16635 177 | 175,운,15949 178 | 176,쫌,15780 179 | 177,유,15516 180 | 178,속,15255 181 | 179,명,15190 182 | 180,랬,15155 183 | 181,본,15148 184 | 182,갈,15084 185 | 183,방,15069 186 | 184,돈,14998 187 | 185,타,14919 188 | 186,처,14908 189 | 187,빠,14851 190 | 188,과,14843 191 | 189,식,14743 192 | 190,디,14633 193 | 191,배,14403 194 | 192,피,14093 195 | 193,뭔,14093 196 | 194,선,13929 197 | 195,남,13909 198 | 196,경,13868 199 | 197,달,13787 200 | 198,언,13579 201 | 199,받,13519 202 | 200,심,13409 203 | 201,월,13367 204 | 202,버,13339 205 | 203,왔,13265 206 | 204,느,13223 207 | 205,점,12960 208 | 206,올,12923 209 | 207,업,12861 210 | 208,른,12801 211 | 209,성,12717 212 | 210,회,12591 213 | 211,조,12570 214 | 212,워,12424 215 | 213,따,12410 216 | 214,행,12350 217 | 215,반,12158 218 | 216,님,11998 219 | 217,딱,11908 220 | 218,관,11828 221 | 219,입,11275 222 | 220,카,11235 223 | 221,당,11068 224 | 222,였,10977 225 | 223,케,10576 226 | 224,쪽,10430 227 | 225,천,10384 228 | 226,작,10381 229 | 227,누,10336 230 | 228,열,10252 231 | 229,얼,10250 232 | 230,울,10246 233 | 231,찮,10231 234 | 232,된,10191 235 | 233,별,10159 236 | 234,떻,10108 237 | 235,머,9876 238 | 236,쓰,9853 239 
| 237,위,9841 240 | 238,크,9838 241 | 239,노,9799 242 | 240,괜,9735 243 | 241,강,9698 244 | 242,편,9668 245 | 243,몰,9623 246 | 244,맛,9382 247 | 245,준,9342 248 | 246,줄,9294 249 | 247,파,9282 250 | 248,백,9252 251 | 249,매,9181 252 | 250,산,9160 253 | 251,술,9142 254 | 252,힘,9056 255 | 253,프,9019 256 | 254,즘,8997 257 | 255,임,8969 258 | 256,체,8888 259 | 257,형,8790 260 | 258,몇,8742 261 | 259,맨,8712 262 | 260,새,8623 263 | 261,용,8571 264 | 262,키,8547 265 | 263,통,8410 266 | 264,양,8363 267 | 265,끝,8361 268 | 266,싸,8328 269 | 267,볼,8188 270 | 268,혼,8179 271 | 269,온,8132 272 | 270,등,8123 273 | 271,길,8067 274 | 272,될,8033 275 | 273,밌,7998 276 | 274,육,7924 277 | 275,늘,7909 278 | 276,슨,7835 279 | 277,됐,7738 280 | 278,놀,7707 281 | 279,외,7608 282 | 280,팔,7601 283 | 281,져,7551 284 | 282,레,7485 285 | 283,억,7461 286 | 284,발,7450 287 | 285,결,7412 288 | 286,초,7290 289 | 287,감,7180 290 | 288,군,7174 291 | 289,호,7173 292 | 290,름,7146 293 | 291,솔,7079 294 | 292,닌,7051 295 | 293,밖,7013 296 | 294,불,7007 297 | 295,밥,6784 298 | 296,포,6676 299 | 297,싫,6631 300 | 298,완,6582 301 | 299,갖,6511 302 | 300,겨,6468 303 | 301,질,6453 304 | 302,토,6448 305 | 303,험,6417 306 | 304,색,6371 307 | 305,떤,6352 308 | 306,역,6340 309 | 307,티,6319 310 | 308,갑,6316 311 | 309,목,6262 312 | 310,린,6256 313 | 311,추,6204 314 | 312,격,6174 315 | 313,후,6119 316 | 314,확,6095 317 | 315,루,6079 318 | 316,민,6024 319 | 317,끼,6023 320 | 318,칠,6019 321 | 319,돌,5997 322 | 320,찍,5956 323 | 321,쪼,5946 324 | 322,깐,5788 325 | 323,필,5786 326 | 324,빨,5693 327 | 325,났,5657 328 | 326,락,5561 329 | 327,박,5537 330 | 328,끔,5483 331 | 329,낌,5403 332 | 330,럴,5397 333 | 331,취,5344 334 | 332,복,5315 335 | 333,둘,5264 336 | 334,페,5217 337 | 335,렸,5198 338 | 336,써,5197 339 | 337,줘,5173 340 | 338,급,5067 341 | 339,력,5065 342 | 340,잡,5030 343 | 341,씩,5006 344 | 342,찾,4990 345 | 343,놓,4987 346 | 344,최,4894 347 | 345,코,4891 348 | 346,넘,4870 349 | 347,졌,4803 350 | 348,섯,4799 351 | 349,브,4793 352 | 350,현,4764 353 | 351,눈,4760 354 | 352,항,4751 355 | 353,귀,4708 356 | 
354,설,4688 357 | 355,벌,4666 358 | 356,담,4647 359 | 357,앞,4640 360 | 358,책,4630 361 | 359,절,4629 362 | 360,플,4523 363 | 361,폰,4513 364 | 362,태,4496 365 | 363,종,4487 366 | 364,옛,4450 367 | 365,증,4413 368 | 366,튼,4411 369 | 367,글,4408 370 | 368,습,4383 371 | 369,병,4377 372 | 370,론,4373 373 | 371,출,4364 374 | 372,능,4354 375 | 373,침,4345 376 | 374,순,4339 377 | 375,줬,4308 378 | 376,평,4303 379 | 377,메,4287 380 | 378,똑,4281 381 | 379,커,4261 382 | 380,엔,4248 383 | 381,꾸,4230 384 | 382,란,4194 385 | 383,듣,4083 386 | 384,씨,4009 387 | 385,큰,4002 388 | 386,표,3995 389 | 387,잠,3942 390 | 388,먼,3942 391 | 389,쁘,3840 392 | 390,활,3820 393 | 391,합,3787 394 | 392,접,3732 395 | 393,럽,3722 396 | 394,옷,3705 397 | 395,쳐,3690 398 | 396,손,3689 399 | 397,붙,3645 400 | 398,망,3640 401 | 399,죽,3609 402 | 400,투,3606 403 | 401,족,3603 404 | 402,셨,3589 405 | 403,참,3572 406 | 404,떨,3567 407 | 405,웃,3533 408 | 406,졸,3516 409 | 407,쉬,3492 410 | 408,뭘,3447 411 | 409,변,3406 412 | 410,릴,3374 413 | 411,웠,3293 414 | 412,홍,3267 415 | 413,즈,3265 416 | 414,랐,3245 417 | 415,독,3243 418 | 416,충,3239 419 | 417,짝,3217 420 | 418,떡,3197 421 | 419,뒤,3195 422 | 420,휴,3161 423 | 421,셔,3142 424 | 422,넣,3135 425 | 423,쨌,3075 426 | 424,악,3073 427 | 425,패,3049 428 | 426,빼,3041 429 | 427,슬,2983 430 | 428,특,2975 431 | 429,꺼,2970 432 | 430,숙,2951 433 | 431,쯤,2934 434 | 432,텐,2905 435 | 433,창,2901 436 | 434,겼,2888 437 | 435,굴,2869 438 | 436,판,2863 439 | 437,죠,2851 440 | 438,답,2820 441 | 439,희,2816 442 | 440,허,2815 443 | 441,옆,2798 444 | 442,료,2791 445 | 443,닐,2790 446 | 444,택,2769 447 | 445,림,2760 448 | 446,읽,2742 449 | 447,핸,2733 450 | 448,축,2730 451 | 449,풀,2716 452 | 450,틀,2712 453 | 451,몸,2694 454 | 452,골,2690 455 | 453,황,2635 456 | 454,켜,2635 457 | 455,익,2625 458 | 456,베,2613 459 | 457,북,2600 460 | 458,법,2578 461 | 459,늦,2578 462 | 460,함,2568 463 | 461,랜,2555 464 | 462,꼬,2555 465 | 463,향,2547 466 | 464,석,2541 467 | 465,환,2533 468 | 466,슷,2529 469 | 467,품,2518 470 | 468,혀,2513 471 | 469,블,2512 472 | 470,쓸,2503 473 | 471,채,2472 
474 | 472,며,2470 475 | 473,욕,2463 476 | 474,권,2450 477 | 475,검,2445 478 | 476,굳,2428 479 | 477,록,2425 480 | 478,톡,2408 481 | 479,김,2408 482 | 480,넌,2383 483 | 481,깨,2375 484 | 482,션,2374 485 | 483,캐,2369 486 | 484,송,2339 487 | 485,녀,2336 488 | 486,탈,2327 489 | 487,광,2321 490 | 488,혹,2313 491 | 489,퍼,2300 492 | 490,뽑,2278 493 | 491,철,2265 494 | 492,째,2249 495 | 493,움,2232 496 | 494,밤,2231 497 | 495,꼭,2226 498 | 496,샀,2224 499 | 497,끊,2212 500 | 498,땐,2203 501 | 499,깔,2179 502 | 500,멀,2145 503 | 501,높,2141 504 | 502,께,2140 505 | 503,큼,2121 506 | 504,녁,2104 507 | 505,곳,2082 508 | 506,잔,2070 509 | 507,쉽,2070 510 | 508,짐,2067 511 | 509,암,2063 512 | 510,극,2061 513 | 511,련,2056 514 | 512,떠,2056 515 | 513,벽,2049 516 | 514,헤,2047 517 | 515,C,2040 518 | 516,끄,2024 519 | 517,곱,2015 520 | 518,승,2011 521 | 519,봉,2009 522 | 520,착,2006 523 | 521,촌,1990 524 | 522,껴,1986 525 | 523,딩,1983 526 | 524,류,1978 527 | 525,뜨,1970 528 | 526,넷,1941 529 | 527,놨,1922 530 | 528,궁,1894 531 | 529,논,1882 532 | 530,곤,1875 533 | 531,클,1869 534 | 532,싼,1859 535 | 533,앉,1854 536 | 534,컴,1849 537 | 535,맥,1841 538 | 536,팀,1830 539 | 537,썼,1818 540 | 538,낫,1801 541 | 539,튜,1788 542 | 540,걱,1786 543 | 541,쁜,1770 544 | 542,킨,1760 545 | 543,빌,1752 546 | 544,쿠,1748 547 | 545,찌,1738 548 | 546,쌤,1719 549 | 547,T,1715 550 | 548,밀,1711 551 | 549,빵,1702 552 | 550,냈,1702 553 | 551,센,1691 554 | 552,딴,1688 555 | 553,쩌,1678 556 | 554,딸,1678 557 | 555,걍,1596 558 | 556,획,1588 559 | 557,씬,1582 560 | 558,챙,1541 561 | 559,첫,1536 562 | 560,범,1530 563 | 561,핑,1519 564 | 562,굉,1519 565 | 563,쩔,1514 566 | 564,팅,1507 567 | 565,긍,1486 568 | 566,탄,1471 569 | 567,덟,1470 570 | 568,퇴,1469 571 | 569,뛰,1469 572 | 570,층,1467 573 | 571,춰,1454 574 | 572,훨,1447 575 | 573,찬,1439 576 | 574,듯,1424 577 | 575,S,1396 578 | 576,왕,1392 579 | 577,텔,1385 580 | 578,뉴,1382 581 | 579,렌,1377 582 | 580,탕,1374 583 | 581,짓,1371 584 | 582,밑,1365 585 | 583,헬,1358 586 | 584,존,1339 587 | 585,립,1323 588 | 586,녔,1318 589 | 587,꼈,1305 590 | 588,빡,1304 591 | 
589,낮,1287 592 | 590,견,1282 593 | 591,링,1281 594 | 592,볶,1271 595 | 593,낙,1271 596 | 594,릭,1267 597 | 595,젠,1263 598 | 596,퓨,1262 599 | 597,츠,1256 600 | 598,맘,1252 601 | 599,놔,1247 602 | 600,렵,1241 603 | 601,땜,1232 604 | 602,쇼,1224 605 | 603,값,1215 606 | 604,닭,1203 607 | 605,깝,1200 608 | 606,픈,1194 609 | 607,탁,1183 610 | 608,쓴,1179 611 | 609,농,1172 612 | 610,량,1166 613 | 611,염,1156 614 | 612,홉,1144 615 | 613,척,1130 616 | 614,겁,1129 617 | 615,콘,1127 618 | 616,섭,1125 619 | 617,냄,1125 620 | 618,P,1125 621 | 619,효,1124 622 | 620,규,1124 623 | 621,꿈,1121 624 | 622,곡,1093 625 | 623,액,1090 626 | 624,쎄,1077 627 | 625,덜,1075 628 | 626,턴,1065 629 | 627,킹,1061 630 | 628,훈,1057 631 | 629,쳤,1054 632 | 630,널,1047 633 | 631,멋,1037 634 | 632,꿀,1034 635 | 633,깜,1019 636 | 634,짧,1014 637 | 635,롤,1013 638 | 636,낼,1012 639 | 637,꽤,1003 640 | 638,총,984 641 | 639,램,984 642 | 640,덕,980 643 | 641,믄,974 644 | 642,믿,972 645 | 643,흥,970 646 | 644,롱,967 647 | 645,뜻,962 648 | 646,짤,958 649 | 647,쌍,957 650 | 648,컨,953 651 | 649,셋,952 652 | 650,잤,950 653 | 651,닥,950 654 | 652,웬,946 655 | 653,엽,944 656 | 654,혜,939 657 | 655,찰,935 658 | 656,뻐,935 659 | 657,뿌,934 660 | 658,빈,934 661 | 659,꿔,934 662 | 660,낸,932 663 | 661,뻔,928 664 | 662,쌓,926 665 | 663,즐,919 666 | 664,튀,914 667 | 665,겹,909 668 | 666,득,899 669 | 667,끌,896 670 | 668,M,880 671 | 669,V,877 672 | 670,녹,876 673 | 671,푸,870 674 | 672,쭉,863 675 | 673,싱,858 676 | 674,팬,857 677 | 675,A,854 678 | 676,!,841 679 | 677,념,836 680 | 678,맡,825 681 | 679,쟁,814 682 | 680,엑,810 683 | 681,켓,809 684 | 682,뀌,808 685 | 683,털,803 686 | 684,풍,802 687 | 685,웨,799 688 | 686,땡,792 689 | 687,롯,791 690 | 688,롭,788 691 | 689,젊,781 692 | 690,넓,778 693 | 691,멘,777 694 | 692,냉,772 695 | 693,칼,771 696 | 694,잉,768 697 | 695,빙,768 698 | 696,뿐,767 699 | 697,옮,761 700 | 698,젤,760 701 | 699,B,757 702 | 700,죄,756 703 | 701,탔,752 704 | 702,샤,746 705 | 703,홀,745 706 | 704,떼,743 707 | 705,줌,738 708 | 706,징,734 709 | 707,폭,727 710 | 708,G,721 711 | 709,킬,713 712 | 710,흔,712 713 
| 711,딜,711 714 | 712,슈,703 715 | 713,율,700 716 | 714,즌,697 717 | 715,씀,694 718 | 716,앙,689 719 | 717,눠,688 720 | 718,콩,686 721 | 719,얻,684 722 | 720,숨,682 723 | 721,닝,673 724 | 722,꽃,668 725 | 723,쌀,667 726 | 724,컬,666 727 | 725,춤,666 728 | 726,c,664 729 | 727,뚫,663 730 | 728,엠,661 731 | 729,몬,659 732 | 730,D,658 733 | 731,흐,656 734 | 732,앤,646 735 | 733,똥,645 736 | 734,콜,644 737 | 735,델,635 738 | 736,렀,628 739 | 737,폐,627 740 | 738,엘,624 741 | 739,쁠,623 742 | 740,랄,622 743 | 741,걘,621 744 | 742,벤,619 745 | 743,봄,611 746 | 744,왠,609 747 | 745,씻,609 748 | 746,률,608 749 | 747,켰,600 750 | 748,짱,600 751 | 749,웹,599 752 | 750,압,599 753 | 751,럭,596 754 | 752,땅,596 755 | 753,멍,595 756 | 754,랩,595 757 | 755,댓,595 758 | 756,깊,595 759 | 757,뮤,592 760 | 758,령,590 761 | 759,릿,589 762 | 760,낀,589 763 | 761,윤,586 764 | 762,옥,584 765 | 763,룸,582 766 | 764,딘,579 767 | 765,객,578 768 | 766,댄,576 769 | 767,컵,574 770 | 768,폴,573 771 | 769,쟤,570 772 | 770,뷰,569 773 | 771,템,568 774 | 772,덴,567 775 | 773,눌,559 776 | 774,캠,558 777 | 775,홈,557 778 | 776,삶,557 779 | 777,삭,555 780 | 778,벨,555 781 | 779,엉,552 782 | 780,헐,549 783 | 781,벅,545 784 | 782,벗,544 785 | 783,혈,543 786 | 784,밍,539 787 | 785,셀,536 788 | 786,낭,534 789 | 787,춥,533 790 | 788,릉,533 791 | 789,t,533 792 | 790,잃,532 793 | 791,I,529 794 | 792,놈,526 795 | 793,춘,524 796 | 794,찜,520 797 | 795,R,519 798 | 796,걷,518 799 | 797,삐,515 800 | 798,헌,510 801 | 799,딨,510 802 | 800,빛,504 803 | 801,흘,503 804 | 802,닫,502 805 | 803,균,502 806 | 804,p,495 807 | 805,L,494 808 | 806,좌,492 809 | 807,껄,491 810 | 808,펜,489 811 | 809,N,487 812 | 810,싹,486 813 | 811,탑,485 814 | 812,쏘,483 815 | 813,O,482 816 | 814,픽,480 817 | 815,덩,477 818 | 816,햄,476 819 | 817,큐,473 820 | 818,힐,472 821 | 819,곧,471 822 | 820,낳,470 823 | 821,힌,468 824 | 822,팩,468 825 | 823,뒷,468 826 | 824,툰,467 827 | 825,섬,465 828 | 826,꽂,463 829 | 827,례,462 830 | 828,핫,460 831 | 829,섞,460 832 | 830,촬,458 833 | 831,흰,457 834 | 832,둥,455 835 | 833,K,450 836 | 834,괴,449 837 | 835,s,448 838 
| 836,핀,446 839 | 837,꿨,444 840 | 838,틱,441 841 | 839,밝,441 842 | 840,랙,440 843 | 841,땠,440 844 | 842,둔,440 845 | 843,슴,439 846 | 844,첨,438 847 | 845,밴,432 848 | 846,렁,431 849 | 847,칭,429 850 | 848,묻,428 851 | 849,뜬,425 852 | 850,깎,424 853 | 851,엇,423 854 | 852,컸,421 855 | 853,퀴,420 856 | 854,납,418 857 | 855,협,417 858 | 856,몽,416 859 | 857,꼐,415 860 | 858,떴,414 861 | 859,썰,410 862 | 860,찐,407 863 | 861,꼴,407 864 | 862,갠,406 865 | 863,턱,405 866 | 864,틴,398 867 | 865,낄,398 868 | 866,뒀,397 869 | 867,끗,396 870 | 868,꼼,395 871 | 869,F,395 872 | 870,샵,394 873 | 871,휘,392 874 | 872,뼈,390 875 | 873,뚜,389 876 | 874,쩍,388 877 | 875,팡,386 878 | 876,멜,386 879 | 877,톤,385 880 | 878,앨,385 881 | 879,탐,384 882 | 880,칸,384 883 | 881,끓,383 884 | 882,뚱,381 885 | 883,닮,378 886 | 884,깃,375 887 | 885,짬,374 888 | 886,빤,371 889 | 887,측,370 890 | 888,혔,369 891 | 889,꽁,369 892 | 890,펴,368 893 | 891,앴,368 894 | 892,겸,368 895 | 893,쿨,367 896 | 894,릇,363 897 | 895,얀,362 898 | 896,쿄,358 899 | 897,컷,358 900 | 898,팠,356 901 | 899,끈,355 902 | 900,렴,354 903 | 901,잊,352 904 | 902,덤,350 905 | 903,갤,342 906 | 904,븐,340 907 | 905,흡,337 908 | 906,덮,337 909 | 907,씹,335 910 | 908,뽀,335 911 | 909,뚝,335 912 | 910,갚,335 913 | 911,찔,334 914 | 912,댔,333 915 | 913,혁,332 916 | 914,띠,328 917 | 915,벼,327 918 | 916,얇,324 919 | 917,뺐,324 920 | 918,팝,323 921 | 919,잇,322 922 | 920,왼,322 923 | 921,낚,321 924 | 922,칙,316 925 | 923,겉,316 926 | 924,뜯,313 927 | 925,닦,312 928 | 926,짠,311 929 | 927,썹,310 930 | 928,뷔,310 931 | 929,묶,310 932 | 930,꾼,306 933 | 931,빅,305 934 | 932,땄,305 935 | 933,캡,304 936 | 934,묘,304 937 | 935,샘,303 938 | 936,묵,303 939 | 937,a,302 940 | 938,쭈,300 941 | 939,b,300 942 | 940,겪,299 943 | 941,둬,298 944 | 942,J,298 945 | 943,쫄,296 946 | 944,랫,296 947 | 945,뀐,296 948 | 946,흑,295 949 | 947,댕,295 950 | 948,꽉,295 951 | 949,곰,294 952 | 950,붕,293 953 | 951,땀,292 954 | 952,릎,290 955 | 953,뽕,289 956 | 954,쥐,288 957 | 955,렉,287 958 | 956,숭,283 959 | 957,샐,283 960 | 958,v,282 961 | 959,렛,281 962 | 960,녕,281 963 
| 961,힙,280 964 | 962,쫙,279 965 | 963,촉,278 966 | 964,쩜,277 967 | 965,긋,277 968 | 966,샌,276 969 | 967,o,275 970 | 968,쫓,273 971 | 969,쩐,273 972 | 970,헷,272 973 | 971,X,268 974 | 972,웅,267 975 | 973,뺏,267 976 | 974,쵸,266 977 | 975,쪘,266 978 | 976,랍,266 979 | 977,E,266 980 | 978,좁,265 981 | 979,앱,265 982 | 980,썸,264 983 | 981,냅,264 984 | 982,펙,263 985 | 983,늙,263 986 | 984,껌,261 987 | 985,n,261 988 | 986,e,261 989 | 987,랭,260 990 | 988,귤,260 991 | 989,찢,259 992 | 990,닿,259 993 | 991,띄,258 994 | 992,긁,255 995 | 993,귄,253 996 | 994,굽,253 997 | 995,갓,253 998 | 996,캔,252 999 | 997,멈,252 1000 | 998,욱,250 1001 | 999,뺄,250 1002 | 1000,뇌,250 1003 | 1001,팟,249 1004 | 1002,쌌,248 1005 | 1003,룹,248 1006 | 1004,덥,248 1007 | 1005,폼,246 1008 | 1006,톱,244 1009 | 1007,듬,244 1010 | 1008,껍,244 1011 | 1009,흠,243 1012 | 1010,팍,243 1013 | 1011,맹,243 1014 | 1012,쉴,242 1015 | 1013,썩,240 1016 | 1014,밟,240 1017 | 1015,맵,237 1018 | 1016,돋,236 1019 | 1017,콤,235 1020 | 1018,맙,234 1021 | 1019,뱅,233 1022 | 1020,쫍,231 1023 | 1021,윗,229 1024 | 1022,뜩,229 1025 | 1023,찝,228 1026 | 1024,뺀,227 1027 | 1025,닷,226 1028 | 1026,넨,226 1029 | 1027,쌈,225 1030 | 1028,쩨,224 1031 | 1029,붓,224 1032 | 1030,쩡,223 1033 | 1031,믹,223 1034 | 1032,잼,221 1035 | 1033,r,221 1036 | 1034,쭐,220 1037 | 1035,엊,219 1038 | 1036,g,219 1039 | 1037,췄,217 1040 | 1038,룩,217 1041 | 1039,텀,215 1042 | 1040,쇠,213 1043 | 1041,숫,212 1044 | 1042,풋,210 1045 | 1043,쌩,208 1046 | 1044,쾌,207 1047 | 1045,볍,207 1048 | 1046,뤄,207 1049 | 1047,겐,207 1050 | 1048,m,207 1051 | 1049,펌,206 1052 | 1050,쪄,206 1053 | 1051,뻥,206 1054 | 1052,i,206 1055 | 1053,뻤,205 1056 | 1054,k,204 1057 | 1055,핵,203 1058 | 1056,셉,200 1059 | 1057,듀,198 1060 | 1058,닉,198 1061 | 1059,략,197 1062 | 1060,넉,197 1063 | 1061,딪,196 1064 | 1062,낯,195 1065 | 1063,텍,194 1066 | 1064,뱃,194 1067 | 1065,멤,194 1068 | 1066,윈,192 1069 | 1067,엎,192 1070 | 1068,뭉,192 1071 | 1069,젝,191 1072 | 1070,셜,190 1073 | 1071,빴,190 1074 | 1072,룰,190 1075 | 1073,앗,189 1076 | 1074,궈,189 1077 | 1075,윙,188 1078 | 
1076,엥,188 1079 | 1077,d,185 1080 | 1078,꼽,183 1081 | 1079,챔,182 1082 | 1080,쉐,182 1083 | 1081,봇,182 1084 | 1082,푼,180 1085 | 1083,댁,179 1086 | 1084,칵,178 1087 | 1085,뿔,177 1088 | 1086,뺑,176 1089 | 1087,탱,175 1090 | 1088,쿼,174 1091 | 1089,갱,174 1092 | 1090,퉁,173 1093 | 1091,빔,173 1094 | 1092,썬,172 1095 | 1093,빽,172 1096 | 1094,둑,172 1097 | 1095,헛,169 1098 | 1096,빗,168 1099 | 1097,탓,167 1100 | 1098,륵,165 1101 | 1099,꼰,165 1102 | 1100,쎈,163 1103 | 1101,쥬,162 1104 | 1102,깡,162 1105 | 1103,퀄,161 1106 | 1104,빚,161 1107 | 1105,즉,160 1108 | 1106,삿,160 1109 | 1107,밭,160 1110 | 1108,u,160 1111 | 1109,혐,159 1112 | 1110,햇,159 1113 | 1111,툭,159 1114 | 1112,탠,158 1115 | 1113,샷,158 1116 | 1114,맣,157 1117 | 1115,껏,157 1118 | 1116,핏,156 1119 | 1117,앵,155 1120 | 1118,뜰,155 1121 | 1119,굿,155 1122 | 1120,U,155 1123 | 1121,섹,154 1124 | 1122,펑,153 1125 | 1123,맻,153 1126 | 1124,뀔,153 1127 | 1125,깥,152 1128 | 1126,뱀,150 1129 | 1127,뢰,150 1130 | 1128,껀,150 1131 | 1129,뉘,149 1132 | 1130,흉,146 1133 | 1131,틈,146 1134 | 1132,쏟,146 1135 | 1133,훔,143 1136 | 1134,쇄,142 1137 | 1135,뎅,142 1138 | 1136,칩,141 1139 | 1137,띵,140 1140 | 1138,푹,139 1141 | 1139,넥,139 1142 | 1140,퀘,137 1143 | 1141,훅,136 1144 | 1142,융,136 1145 | 1143,멸,136 1146 | 1144,냠,136 1147 | 1145,횟,134 1148 | 1146,찼,134 1149 | 1147,Y,134 1150 | 1148,룬,133 1151 | 1149,귈,133 1152 | 1150,H,133 1153 | 1151,젓,132 1154 | 1152,쏠,132 1155 | 1153,숲,132 1156 | 1154,냬,132 1157 | 1155,l,132 1158 | 1156,짰,130 1159 | 1157,멕,130 1160 | 1158,뇨,130 1161 | 1159,팽,128 1162 | 1160,깼,127 1163 | 1161,숏,126 1164 | 1162,굔,125 1165 | 1163,슐,123 1166 | 1164,쉰,123 1167 | 1165,얄,122 1168 | 1166,뱉,122 1169 | 1167,렬,122 1170 | 1168,굶,122 1171 | 1169,팁,121 1172 | 1170,츄,121 1173 | 1171,뭣,121 1174 | 1172,륙,121 1175 | 1173,횡,120 1176 | 1174,옹,119 1177 | 1175,뻘,117 1178 | 1176,옵,116 1179 | 1177,옴,116 1180 | 1178,얹,116 1181 | 1179,쑥,116 1182 | 1180,깄,116 1183 | 1181,므,115 1184 | 1182,찡,112 1185 | 1183,젖,112 1186 | 1184,꽈,112 1187 | 1185,틸,111 1188 | 1186,콕,111 1189 | 
1187,첩,111 1190 | 1188,똘,111 1191 | 1189,쿵,110 1192 | 1190,왤,110 1193 | 1191,괌,110 1194 | 1192,밸,109 1195 | 1193,녜,109 1196 | 1194,갸,109 1197 | 1195,펀,108 1198 | 1196,칫,108 1199 | 1197,맺,108 1200 | 1198,탭,107 1201 | 1199,쁨,106 1202 | 1200,폈,105 1203 | 1201,펼,105 1204 | 1202,첼,105 1205 | 1203,숱,105 1206 | 1204,섰,105 1207 | 1205,킥,104 1208 | 1206,맑,103 1209 | 1207,랗,103 1210 | 1208,펐,102 1211 | 1209,넛,102 1212 | 1210,솜,101 1213 | 1211,벙,101 1214 | 1212,껑,101 1215 | 1213,f,101 1216 | 1214,룡,100 1217 | 1215,훌,99 1218 | 1216,x,98 1219 | 1217,쓱,97 1220 | 1218,늬,97 1221 | 1219,곽,97 1222 | 1220,y,97 1223 | 1221,욘,96 1224 | 1222,돔,96 1225 | 1223,겄,96 1226 | 1224,텝,95 1227 | 1225,훠,94 1228 | 1226,텅,94 1229 | 1227,씌,94 1230 | 1228,꺾,94 1231 | 1229,벚,93 1232 | 1230,렷,93 1233 | 1231,귓,93 1234 | 1232,찹,92 1235 | 1233,툴,91 1236 | 1234,깅,91 1237 | 1235,쭤,90 1238 | 1236,욜,90 1239 | 1237,얌,90 1240 | 1238,짖,89 1241 | 1239,옳,89 1242 | 1240,벳,88 1243 | 1241,뛸,88 1244 | 1242,깠,88 1245 | 1243,퍽,87 1246 | 1244,퀸,87 1247 | 1245,엮,87 1248 | 1246,삽,87 1249 | 1247,겟,87 1250 | 1248,왓,86 1251 | 1249,댈,86 1252 | 1250,샴,85 1253 | 1251,뻗,85 1254 | 1252,됨,84 1255 | 1253,얜,83 1256 | 1254,굵,83 1257 | 1255,눕,82 1258 | 1256,갇,82 1259 | 1257,셰,81 1260 | 1258,늫,81 1261 | 1259,텨,80 1262 | 1260,숍,80 1263 | 1261,뻑,80 1264 | 1262,됩,80 1265 | 1263,잎,79 1266 | 1264,뭇,79 1267 | 1265,퐁,78 1268 | 1266,팸,78 1269 | 1267,쯔,78 1270 | 1268,넜,78 1271 | 1269,깍,78 1272 | 1270,쌔,77 1273 | 1271,셈,77 1274 | 1272,읍,76 1275 | 1273,픔,75 1276 | 1274,펫,75 1277 | 1275,콧,75 1278 | 1276,얗,75 1279 | 1277,눅,75 1280 | 1278,j,74 1281 | 1279,쬐,73 1282 | 1280,렙,73 1283 | 1281,닙,73 1284 | 1282,슥,72 1285 | 1283,흙,71 1286 | 1284,쭝,71 1287 | 1285,짭,71 1288 | 1286,샹,71 1289 | 1287,릏,71 1290 | 1288,럿,71 1291 | 1289,덧,71 1292 | 1290,즙,70 1293 | 1291,늑,70 1294 | 1292,괄,70 1295 | 1293,킷,68 1296 | 1294,쿡,68 1297 | 1295,캉,68 1298 | 1296,둡,68 1299 | 1297,톨,67 1300 | 1298,엣,67 1301 | 1299,숟,67 1302 | 1300,낑,67 1303 | 1301,펭,66 1304 | 1302,왁,66 1305 | 
1303,쏴,66 1306 | 1304,쏙,66 1307 | 1305,봅,66 1308 | 1306,멧,66 1309 | 1307,줏,65 1310 | 1308,뵈,65 1311 | 1309,쫑,64 1312 | 1310,륨,64 1313 | 1311,h,64 1314 | 1312,펄,63 1315 | 1313,짼,63 1316 | 1314,짚,63 1317 | 1315,껐,63 1318 | 1316,겜,63 1319 | 1317,싯,62 1320 | 1318,붐,62 1321 | 1319,렐,62 1322 | 1320,돗,62 1323 | 1321,팥,61 1324 | 1322,웰,61 1325 | 1323,륜,61 1326 | 1324,잣,60 1327 | 1325,슝,60 1328 | 1326,붉,60 1329 | 1327,윽,59 1330 | 1328,삘,59 1331 | 1329,딲,59 1332 | 1330,갯,58 1333 | 1331,횐,57 1334 | 1332,헨,57 1335 | 1333,캘,57 1336 | 1334,쩰,57 1337 | 1335,뤘,57 1338 | 1336,랴,57 1339 | 1337,껜,57 1340 | 1338,펠,56 1341 | 1339,킵,56 1342 | 1340,컹,56 1343 | 1341,렘,56 1344 | 1342,뛴,56 1345 | 1343,헝,55 1346 | 1344,씽,55 1347 | 1345,뮬,55 1348 | 1346,젯,54 1349 | 1347,샜,54 1350 | 1348,뿜,54 1351 | 1349,뒹,54 1352 | 1350,뎌,54 1353 | 1351,깬,54 1354 | 1352,챠,53 1355 | 1353,왈,53 1356 | 1354,뾰,53 1357 | 1355,뚤,53 1358 | 1356,꾹,53 1359 | 1357,갛,53 1360 | 1358,잌,52 1361 | 1359,엿,52 1362 | 1360,솥,52 1363 | 1361,벡,52 1364 | 1362,룻,52 1365 | 1363,꿍,52 1366 | 1364,곈,52 1367 | 1365,팜,51 1368 | 1366,튕,51 1369 | 1367,컥,51 1370 | 1368,첸,51 1371 | 1369,줍,51 1372 | 1370,섀,51 1373 | 1371,몫,51 1374 | 1372,뜸,51 1375 | 1373,깁,51 1376 | 1374,핬,50 1377 | 1375,쭘,50 1378 | 1376,쌰,50 1379 | 1377,넬,50 1380 | 1378,큘,49 1381 | 1379,쾅,49 1382 | 1380,캄,48 1383 | 1381,괘,48 1384 | 1382,쟀,47 1385 | 1383,윌,47 1386 | 1384,엌,47 1387 | 1385,앓,47 1388 | 1386,씁,47 1389 | 1387,륭,47 1390 | 1388,W,47 1391 | 1389,쑤,46 1392 | 1390,삥,46 1393 | 1391,돕,46 1394 | 1392,깰,46 1395 | 1393,핍,45 1396 | 1394,텃,45 1397 | 1395,슛,45 1398 | 1396,맸,45 1399 | 1397,롬,45 1400 | 1398,갭,45 1401 | 1399,얽,44 1402 | 1400,쏭,44 1403 | 1401,랠,44 1404 | 1402,겔,44 1405 | 1403,Q,44 1406 | 1404,핥,43 1407 | 1405,킴,43 1408 | 1406,읏,43 1409 | 1407,앚,43 1410 | 1408,숯,43 1411 | 1409,밋,43 1412 | 1410,뽐,42 1413 | 1411,뻣,42 1414 | 1412,눴,42 1415 | 1413,잭,41 1416 | 1414,뽈,41 1417 | 1415,뗐,41 1418 | 1416,꽝,41 1419 | 1417,훑,40 1420 | 1418,캥,40 1421 | 1419,쫀,40 1422 | 1420,뵙,40 
1423 | 1421,홋,39 1424 | 1422,펍,39 1425 | 1423,뺨,39 1426 | 1424,됬,39 1427 | 1425,끽,39 1428 | 1426,빕,38 1429 | 1427,밉,38 1430 | 1428,꿋,38 1431 | 1429,헉,37 1432 | 1430,캣,37 1433 | 1431,촘,37 1434 | 1432,셌,37 1435 | 1433,삔,37 1436 | 1434,삑,37 1437 | 1435,뽂,37 1438 | 1436,뮌,37 1439 | 1437,뗄,37 1440 | 1438,텼,36 1441 | 1439,탬,36 1442 | 1440,쨍,36 1443 | 1441,웜,36 1444 | 1442,앰,36 1445 | 1443,맴,36 1446 | 1444,띡,36 1447 | 1445,꿇,36 1448 | 1446,걀,36 1449 | 1447,흩,35 1450 | 1448,쥴,35 1451 | 1449,씸,35 1452 | 1450,낡,35 1453 | 1451,곁,35 1454 | 1452,휙,34 1455 | 1453,쿤,34 1456 | 1454,켄,34 1457 | 1455,츤,34 1458 | 1456,얕,34 1459 | 1457,썪,34 1460 | 1458,둠,34 1461 | 1459,촛,33 1462 | 1460,챌,33 1463 | 1461,죙,33 1464 | 1462,쟈,33 1465 | 1463,잰,33 1466 | 1464,뵀,33 1467 | 1465,w,33 1468 | 1466,푠,32 1469 | 1467,폿,32 1470 | 1468,쨈,32 1471 | 1469,밲,32 1472 | 1470,랖,32 1473 | 1471,떵,32 1474 | 1472,찻,31 1475 | 1473,옇,31 1476 | 1474,뽁,31 1477 | 1475,둣,31 1478 | 1476,닳,31 1479 | 1477,긱,31 1480 | 1478,곶,31 1481 | 1479,휠,30 1482 | 1480,춧,30 1483 | 1481,쐈,30 1484 | 1482,썽,30 1485 | 1483,뼛,29 1486 | 1484,떳,29 1487 | 1485,굘,29 1488 | 1486,훗,28 1489 | 1487,퀵,28 1490 | 1488,봬,28 1491 | 1489,릅,28 1492 | 1490,꺠,28 1493 | 1491,슉,27 1494 | 1492,눔,27 1495 | 1493,끙,27 1496 | 1494,궐,27 1497 | 1495,2,27 1498 | 1496,촥,26 1499 | 1497,젬,26 1500 | 1498,솟,26 1501 | 1499,맷,26 1502 | 1500,룐,26 1503 | 1501,뎃,26 1504 | 1502,깽,26 1505 | 1503,툼,25 1506 | 1504,쎌,25 1507 | 1505,쉼,25 1508 | 1506,쉘,25 1509 | 1507,숑,25 1510 | 1508,뎀,25 1511 | 1509,냔,25 1512 | 1510,쫘,24 1513 | 1511,쎘,24 1514 | 1512,싣,24 1515 | 1513,섣,24 1516 | 1514,샾,24 1517 | 1515,맬,24 1518 | 1516,뗀,24 1519 | 1517,꾀,24 1520 | 1518,헹,23 1521 | 1519,햐,23 1522 | 1520,톰,23 1523 | 1521,췌,23 1524 | 1522,챈,23 1525 | 1523,봔,23 1526 | 1524,밧,23 1527 | 1525,맏,23 1528 | 1526,딥,23 1529 | 1527,늠,23 1530 | 1528,낵,23 1531 | 1529,낱,23 1532 | 1530,꺄,23 1533 | 1531,갬,23 1534 | 1532,훼,22 1535 | 1533,핼,22 1536 | 1534,튠,22 1537 | 1535,웩,22 1538 | 1536,쏜,22 1539 | 1537,뿅,22 1540 | 
1538,빰,22 1541 | 1539,딤,22 1542 | 1540,꿉,22 1543 | 1541,걜,22 1544 | 1542,1,22 1545 | 1543,짙,21 1546 | 1544,얍,21 1547 | 1545,샛,21 1548 | 1546,뗘,21 1549 | 1547,듭,21 1550 | 1548,챘,20 1551 | 1549,쯧,20 1552 | 1550,짹,20 1553 | 1551,잦,20 1554 | 1552,옐,20 1555 | 1553,빳,20 1556 | 1554,몹,20 1557 | 1555,몄,20 1558 | 1556,똔,20 1559 | 1557,딧,20 1560 | 1558,놉,20 1561 | 1559,궜,20 1562 | 1560,굼,20 1563 | 1561,헥,19 1564 | 1562,캬,19 1565 | 1563,챕,19 1566 | 1564,쟨,19 1567 | 1565,멓,19 1568 | 1566,똠,19 1569 | 1567,댐,19 1570 | 1568,텁,18 1571 | 1569,켈,18 1572 | 1570,첵,18 1573 | 1571,숄,18 1574 | 1572,띨,18 1575 | 1573,듦,18 1576 | 1574,궤,18 1577 | 1575,곗,18 1578 | 1576,튈,17 1579 | 1577,좆,17 1580 | 1578,윷,17 1581 | 1579,옅,17 1582 | 1580,얏,17 1583 | 1581,믈,17 1584 | 1582,룽,17 1585 | 1583,띃,17 1586 | 1584,딕,17 1587 | 1585,뎁,17 1588 | 1586,닛,17 1589 | 1587,냑,17 1590 | 1588,겅,17 1591 | 1589,휩,16 1592 | 1590,팎,16 1593 | 1591,틋,16 1594 | 1592,콸,16 1595 | 1593,콥,16 1596 | 1594,잴,16 1597 | 1595,웁,16 1598 | 1596,슘,16 1599 | 1597,멱,16 1600 | 1598,랏,16 1601 | 1599,떄,16 1602 | 1600,뒨,16 1603 | 1601,꿰,16 1604 | 1602,깻,16 1605 | 1603,긌,16 1606 | 1604,젼,15 1607 | 1605,윰,15 1608 | 1606,웍,15 1609 | 1607,앳,15 1610 | 1608,샬,15 1611 | 1609,샥,15 1612 | 1610,볕,15 1613 | 1611,멩,15 1614 | 1612,넹,15 1615 | 1613,넙,15 1616 | 1614,끕,15 1617 | 1615,휜,14 1618 | 1616,텄,14 1619 | 1617,쫒,14 1620 | 1618,쩝,14 1621 | 1619,쨋,14 1622 | 1620,윳,14 1623 | 1621,쉭,14 1624 | 1622,쇽,14 1625 | 1623,셧,14 1626 | 1624,뵐,14 1627 | 1625,땔,14 1628 | 1626,덱,14 1629 | 1627,댑,14 1630 | 1628,꺽,14 1631 | 1629,곪,14 1632 | 1630,켔,13 1633 | 1631,츰,13 1634 | 1632,읜,13 1635 | 1633,쑈,13 1636 | 1634,볐,13 1637 | 1635,및,13 1638 | 1636,롷,13 1639 | 1637,딛,13 1640 | 1638,냇,13 1641 | 1639,3,13 1642 | 1640,혓,12 1643 | 1641,팻,12 1644 | 1642,팰,12 1645 | 1643,킁,12 1646 | 1644,촤,12 1647 | 1645,쨰,12 1648 | 1646,잿,12 1649 | 1647,옌,12 1650 | 1648,쐬,12 1651 | 1649,쌋,12 1652 | 1650,슁,12 1653 | 1651,쉑,12 1654 | 1652,빢,12 1655 | 1653,뵌,12 1656 | 1654,뭄,12 1657 | 1655,묽,12 
1658 | 1656,뎠,12 1659 | 1657,늪,12 1660 | 1658,뇽,12 1661 | 1659,훤,11 1662 | 1660,횔,11 1663 | 1661,홧,11 1664 | 1662,햅,11 1665 | 1663,푯,11 1666 | 1664,팼,11 1667 | 1665,탯,11 1668 | 1666,탤,11 1669 | 1667,큭,11 1670 | 1668,짢,11 1671 | 1669,읊,11 1672 | 1670,롸,11 1673 | 1671,띤,11 1674 | 1672,놋,11 1675 | 1673,넴,11 1676 | 1674,귿,11 1677 | 1675,q,11 1678 | 1676,휑,10 1679 | 1677,퓸,10 1680 | 1678,튿,10 1681 | 1679,튄,10 1682 | 1680,촐,10 1683 | 1681,쭌,10 1684 | 1682,짊,10 1685 | 1683,숴,10 1686 | 1684,숀,10 1687 | 1685,뿡,10 1688 | 1686,뻬,10 1689 | 1687,렜,10 1690 | 1688,뗬,10 1691 | 1689,늄,10 1692 | 1690,끍,10 1693 | 1691,곌,10 1694 | 1692,갰,10 1695 | 1693,폄,9 1696 | 1694,콰,9 1697 | 1695,캇,9 1698 | 1696,캅,9 1699 | 1697,늉,9 1700 | 1698,뉜,9 1701 | 1699,큔,8 1702 | 1700,콱,8 1703 | 1701,켠,8 1704 | 1702,쳔,8 1705 | 1703,쳇,8 1706 | 1704,읔,8 1707 | 1705,읎,8 1708 | 1706,욤,8 1709 | 1707,뼘,8 1710 | 1708,롹,8 1711 | 1709,렝,8 1712 | 1710,뚠,8 1713 | 1711,땋,8 1714 | 1712,덨,8 1715 | 1713,넵,8 1716 | 1714,넝,8 1717 | 1715,넋,8 1718 | 1716,꿩,8 1719 | 1717,꼿,8 1720 | 1718,깟,8 1721 | 1719,곯,8 1722 | 1720,4,8 1723 | 1721,힝,7 1724 | 1722,헀,7 1725 | 1723,푤,7 1726 | 1724,쿱,7 1727 | 1725,캤,7 1728 | 1726,챗,7 1729 | 1727,쯩,7 1730 | 1728,쮸,7 1731 | 1729,읗,7 1732 | 1730,윅,7 1733 | 1731,얠,7 1734 | 1732,씰,7 1735 | 1733,썅,7 1736 | 1734,쌜,7 1737 | 1735,쌉,7 1738 | 1736,슌,7 1739 | 1737,쉿,7 1740 | 1738,쇳,7 1741 | 1739,셍,7 1742 | 1740,뿍,7 1743 | 1741,뼌,7 1744 | 1742,뺌,7 1745 | 1743,밈,7 1746 | 1744,룟,7 1747 | 1745,룔,7 1748 | 1746,뜀,7 1749 | 1747,끅,7 1750 | 1748,꾜,7 1751 | 1749,겡,7 1752 | 1750,Z,7 1753 | 1751,횰,6 1754 | 1752,헴,6 1755 | 1753,핌,6 1756 | 1754,펩,6 1757 | 1755,펨,6 1758 | 1756,켸,6 1759 | 1757,쩠,6 1760 | 1758,잽,6 1761 | 1759,엡,6 1762 | 1760,앝,6 1763 | 1761,쓕,6 1764 | 1762,썜,6 1765 | 1763,쌘,6 1766 | 1764,삣,6 1767 | 1765,빻,6 1768 | 1766,몀,6 1769 | 1767,뤼,6 1770 | 1768,롄,6 1771 | 1769,떫,6 1772 | 1770,덫,6 1773 | 1771,뉸,6 1774 | 1772,꽥,6 1775 | 1773,궂,6 1776 | 1774,괭,6 1777 | 1775,0,6 1778 | 1776,힉,5 1779 | 1777,휸,5 1780 | 
1778,퓰,5 1781 | 1779,퉤,5 1782 | 1780,퀭,5 1783 | 1781,켐,5 1784 | 1782,췻,5 1785 | 1783,챡,5 1786 | 1784,쨀,5 1787 | 1785,젱,5 1788 | 1786,읒,5 1789 | 1787,웡,5 1790 | 1788,옙,5 1791 | 1789,얬,5 1792 | 1790,앎,5 1793 | 1791,씼,5 1794 | 1792,쑨,5 1795 | 1793,쐐,5 1796 | 1794,쌨,5 1797 | 1795,쉈,5 1798 | 1796,숩,5 1799 | 1797,셤,5 1800 | 1798,삠,5 1801 | 1799,뱄,5 1802 | 1800,뭥,5 1803 | 1801,멨,5 1804 | 1802,먀,5 1805 | 1803,랒,5 1806 | 1804,땟,5 1807 | 1805,땁,5 1808 | 1806,뉠,5 1809 | 1807,꽌,5 1810 | 1808,귐,5 1811 | 1809,굥,5 1812 | 1810,굣,5 1813 | 1811,겋,5 1814 | 1812,8,5 1815 | 1813,5,5 1816 | 1814,훙,4 1817 | 1815,혠,4 1818 | 1816,폔,4 1819 | 1817,튬,4 1820 | 1818,퉜,4 1821 | 1819,탉,4 1822 | 1820,큽,4 1823 | 1821,퀀,4 1824 | 1822,쿰,4 1825 | 1823,켤,4 1826 | 1824,켕,4 1827 | 1825,칡,4 1828 | 1826,쯘,4 1829 | 1827,짯,4 1830 | 1828,짔,4 1831 | 1829,쥔,4 1832 | 1830,줜,4 1833 | 1831,죗,4 1834 | 1832,욧,4 1835 | 1833,쒸,4 1836 | 1834,쏼,4 1837 | 1835,쌂,4 1838 | 1836,셴,4 1839 | 1837,샨,4 1840 | 1838,뽜,4 1841 | 1839,뻠,4 1842 | 1840,빝,4 1843 | 1841,봥,4 1844 | 1842,뫼,4 1845 | 1843,맜,4 1846 | 1844,맀,4 1847 | 1845,뤠,4 1848 | 1846,띌,4 1849 | 1847,띈,4 1850 | 1848,뛌,4 1851 | 1849,뚸,4 1852 | 1850,떽,4 1853 | 1851,떰,4 1854 | 1852,떈,4 1855 | 1853,땍,4 1856 | 1854,돠,4 1857 | 1855,뎄,4 1858 | 1856,늣,4 1859 | 1857,뇸,4 1860 | 1858,놥,4 1861 | 1859,녈,4 1862 | 1860,넒,4 1863 | 1861,낏,4 1864 | 1862,갉,4 1865 | 1863,z,4 1866 | 1864,힛,3 1867 | 1865,휀,3 1868 | 1866,헙,3 1869 | 1867,픗,3 1870 | 1868,퐈,3 1871 | 1869,팹,3 1872 | 1870,틔,3 1873 | 1871,퇸,3 1874 | 1872,텟,3 1875 | 1873,탰,3 1876 | 1874,퀼,3 1877 | 1875,칬,3 1878 | 1876,췰,3 1879 | 1877,쳥,3 1880 | 1878,챂,3 1881 | 1879,찧,3 1882 | 1880,쭙,3 1883 | 1881,쬬,3 1884 | 1882,쩄,3 1885 | 1883,쨉,3 1886 | 1884,줴,3 1887 | 1885,죵,3 1888 | 1886,좍,3 1889 | 1887,좃,3 1890 | 1888,읐,3 1891 | 1889,읃,3 1892 | 1890,웟,3 1893 | 1891,왯,3 1894 | 1892,옭,3 1895 | 1893,씅,3 1896 | 1894,쒀,3 1897 | 1895,쑹,3 1898 | 1896,쏵,3 1899 | 1897,쎔,3 1900 | 1898,쎅,3 1901 | 1899,슾,3 1902 | 1900,쉣,3 1903 | 1901,솨,3 1904 | 1902,셸,3 1905 | 
1903,셩,3 1906 | 1904,삯,3 1907 | 1905,뿟,3 1908 | 1906,뾱,3 1909 | 1907,뽝,3 1910 | 1908,뻰,3 1911 | 1909,뵤,3 1912 | 1910,볏,3 1913 | 1911,뱌,3 1914 | 1912,뱁,3 1915 | 1913,밨,3 1916 | 1914,밎,3 1917 | 1915,믐,3 1918 | 1916,뭡,3 1919 | 1917,뭠,3 1920 | 1918,뭍,3 1921 | 1919,묭,3 1922 | 1920,뫠,3 1923 | 1921,멏,3 1924 | 1922,멎,3 1925 | 1923,맽,3 1926 | 1924,뙤,3 1927 | 1925,떔,3 1928 | 1926,땃,3 1929 | 1927,덷,3 1930 | 1928,댜,3 1931 | 1929,늰,3 1932 | 1930,놘,3 1933 | 1931,냘,3 1934 | 1932,뀜,3 1935 | 1933,꽹,3 1936 | 1934,꽐,3 1937 | 1935,꼇,3 1938 | 1936,껸,3 1939 | 1937,껨,3 1940 | 1938,꺅,3 1941 | 1939,곘,3 1942 | 1940,겝,3 1943 | 1941,흣,2 1944 | 1942,흝,2 1945 | 1943,흄,2 1946 | 1944,헸,2 1947 | 1945,헜,2 1948 | 1946,햘,2 1949 | 1947,햑,2 1950 | 1948,핟,2 1951 | 1949,픙,2 1952 | 1950,픕,2 1953 | 1951,푀,2 1954 | 1952,퐉,2 1955 | 1953,틑,2 1956 | 1954,틍,2 1957 | 1955,큣,2 1958 | 1956,퀏,2 1959 | 1957,콴,2 1960 | 1958,켁,2 1961 | 1959,칟,2 1962 | 1960,츳,2 1963 | 1961,츨,2 1964 | 1962,츈,2 1965 | 1963,쵝,2 1966 | 1964,촙,2 1967 | 1965,찦,2 1968 | 1966,찟,2 1969 | 1967,쭁,2 1970 | 1968,쫏,2 1971 | 1969,쩬,2 1972 | 1970,쩧,2 1973 | 1971,짆,2 1974 | 1972,쥰,2 1975 | 1973,쥘,2 1976 | 1974,좐,2 1977 | 1975,졍,2 1978 | 1976,쟝,2 1979 | 1977,잧,2 1980 | 1978,잍,2 1981 | 1979,욍,2 1982 | 1980,왬,2 1983 | 1981,옫,2 1984 | 1982,옜,2 1985 | 1983,옉,2 1986 | 1984,엤,2 1987 | 1985,얐,2 1988 | 1986,얉,2 1989 | 1987,쒯,2 1990 | 1988,쑐,2 1991 | 1989,쏸,2 1992 | 1990,썻,2 1993 | 1991,쌕,2 1994 | 1992,싷,2 1995 | 1993,솝,2 1996 | 1994,솎,2 1997 | 1995,샅,2 1998 | 1996,삻,2 1999 | 1997,쁩,2 2000 | 1998,빘,2 2001 | 1999,뷸,2 2002 | 2000,뷴,2 2003 | 2001,봽,2 2004 | 2002,봈,2 2005 | 2003,볌,2 2006 | 2004,벋,2 2007 | 2005,믕,2 2008 | 2006,뮴,2 2009 | 2007,뭬,2 2010 | 2008,뭑,2 2011 | 2009,묜,2 2012 | 2010,맇,2 2013 | 2011,릈,2 2014 | 2012,륑,2 2015 | 2013,랮,2 2016 | 2014,랟,2 2017 | 2015,띔,2 2018 | 2016,떱,2 2019 | 2017,듈,2 2020 | 2018,됴,2 2021 | 2019,됫,2 2022 | 2020,됙,2 2023 | 2021,됑,2 2024 | 2022,댤,2 2025 | 2023,닯,2 2026 | 2024,늗,2 2027 | 2025,뉩,2 2028 | 2026,눟,2 2029 | 2027,눗,2 2030 | 
2028,넚,2 2031 | 2029,냡,2 2032 | 2030,낢,2 2033 | 2031,꾿,2 2034 | 2032,꼳,2 2035 | 2033,꼄,2 2036 | 2034,겊,2 2037 | 2035,갼,2 2038 | 2036,6,2 2039 | 2037,,0 2040 | 2038,,0 2041 | 2039,_,0 2042 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import torch 4 | import torch.nn as nn 5 | import queue 6 | import math 7 | from random import random 8 | from torch import optim 9 | from package.config import Config 10 | from package.definition import char2id, logger, SOS_token, EOS_token, PAD_token 11 | from package.data_loader import CustomDataset, load_corpus, CustomDataLoader 12 | from package.evaluator import evaluate 13 | from package.loss import Perplexity 14 | from package.trainer import supervised_train 15 | from model import LanguageModel 16 | 17 | # Character-level Recurrent Neural Network Language Model implemented in PyTorch 18 | # https://github.com/sooftware/char-rnnlm 19 | 20 | if __name__ == '__main__': 21 | os.environ["CUDA_LAUNCH_BLOCKING"] = "1" # if you use Multi-GPU, delete this line 22 | logger.info("device : %s" % (torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu")) # avoid raising on CPU-only machines 23 | logger.info("CUDA is available : %s" % (torch.cuda.is_available())) 24 | logger.info("CUDA version : %s" % torch.version.cuda) 25 | logger.info("PyTorch version : %s" % torch.__version__) 26 | 27 | config = Config( 28 | use_cuda=True, 29 | hidden_size=512, 30 | dropout_p=0.5, 31 | n_layers=4, 32 | batch_size=16, 33 | max_epochs=40, 34 | lr=0.0001, 35 | teacher_forcing_ratio=1.0, 36 | seed=1, 37 | max_len=428, 38 | worker_num=1 39 | ) 40 | 41 | torch.manual_seed(config.seed) 42 | torch.cuda.manual_seed_all(config.seed) 43 | cuda = config.use_cuda and torch.cuda.is_available() 44 | device = torch.device('cuda' if cuda else 'cpu') 45 | 46 | model = LanguageModel( 47 | n_class=len(char2id), 48 | n_layers=config.n_layers, 49 | rnn_cell='lstm', 50 |
hidden_size=config.hidden_size, 51 | dropout_p=config.dropout_p, 52 | max_length=config.max_len, 53 | sos_id=SOS_token, 54 | eos_id=EOS_token, 55 | device=device 56 | ) 57 | model.flatten_parameters() 58 | model = nn.DataParallel(model).to(device) 59 | 60 | for param in model.parameters(): 61 | param.data.uniform_(-0.08, 0.08) 62 | 63 | # Prepare loss 64 | weight = torch.ones(len(char2id)).to(device) 65 | perplexity = Perplexity(weight, PAD_token, device) 66 | optimizer = optim.Adam(model.module.parameters(), lr=config.lr) 67 | 68 | corpus = load_corpus('./data/corpus_df.bin') 69 | total_time_step = math.ceil(len(corpus) / config.batch_size) 70 | 71 | train_set = CustomDataset(corpus[:-10000], SOS_token, EOS_token, config.batch_size) 72 | valid_set = CustomDataset(corpus[-10000:], SOS_token, EOS_token, config.batch_size) 73 | 74 | logger.info('start') 75 | train_begin = time.time() 76 | 77 | for epoch in range(config.max_epochs): 78 | train_queue = queue.Queue(config.worker_num << 1) 79 | train_set.shuffle() 80 | 81 | train_loader = CustomDataLoader(train_set, train_queue, config.batch_size, 0) 82 | train_loader.start() 83 | 84 | train_loss = supervised_train( 85 | model=model, 86 | queue=train_queue, 87 | epoch=epoch, 88 | total_time_step=total_time_step, 89 | train_begin=train_begin, 90 | perplexity=perplexity, 91 | optimizer=optimizer, 92 | device=device, 93 | print_every=10, 94 | teacher_forcing_ratio=config.teacher_forcing_ratio, 95 | worker_num=config.worker_num 96 | ) 97 | 98 | torch.save(model, "./data/epoch%s.pt" % str(epoch)) 99 | logger.info('Epoch %d (Training) Loss %0.4f' % (epoch, train_loss)) 100 | train_loader.join() 101 | 102 | valid_queue = queue.Queue(config.worker_num << 1) 103 | valid_loader = CustomDataLoader(valid_set, valid_queue, config.batch_size, 0) 104 | valid_loader.start() 105 | 106 | valid_loss = evaluate(model, valid_queue, perplexity, device) 107 | valid_loader.join() 108 | 109 | logger.info('Epoch %d (Evaluate) Loss %0.4f' % 
(epoch, valid_loss)) -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | import random 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | 6 | 7 | class LanguageModel(nn.Module): 8 | def __init__(self, n_class, n_layers, rnn_cell, hidden_size, dropout_p, max_length, sos_id, eos_id, device): 9 | 10 | super(LanguageModel, self).__init__() 11 | assert rnn_cell.lower() in ('lstm', 'gru', 'rnn') 12 | 13 | self.rnn_cell = nn.LSTM if rnn_cell.lower() == 'lstm' else nn.GRU if rnn_cell.lower() == 'gru' else nn.RNN 14 | self.rnn = self.rnn_cell(hidden_size, hidden_size, n_layers, batch_first=True, dropout=dropout_p).to(device) 15 | self.max_length = max_length 16 | self.eos_id = eos_id 17 | self.sos_id = sos_id 18 | self.hidden_size = hidden_size 19 | self.embedding = nn.Embedding(n_class, hidden_size) 20 | self.n_layers = n_layers 21 | self.input_dropout = nn.Dropout(p=dropout_p) 22 | self.out = nn.Linear(self.hidden_size, n_class) 23 | self.device = device 24 | 25 | def forward_step(self, input, hidden, function=F.log_softmax): 26 | """ forward one time step """ 27 | batch_size = input.size(0) 28 | seq_length = input.size(1) 29 | 30 | embedded = self.embedding(input).to(self.device) 31 | embedded = self.input_dropout(embedded) 32 | 33 | if self.training: 34 | self.rnn.flatten_parameters() 35 | 36 | output, hidden = self.rnn(embedded, hidden) 37 | 38 | predicted_softmax = function(self.out(output.contiguous().view(-1, self.hidden_size)), dim=1) 39 | predicted_softmax = predicted_softmax.view(batch_size, seq_length, -1) 40 | 41 | return predicted_softmax, hidden 42 | 43 | def forward(self, inputs, teacher_forcing_ratio=1.0, function=F.log_softmax): 44 | batch_size = inputs.size(0) 45 | max_length = inputs.size(1) - 1 # minus the start of sequence symbol 46 | 47 | outputs = list() 48 | use_teacher_forcing = True if 
random.random() < teacher_forcing_ratio else False 49 | 50 | hidden = self._init_state(batch_size) 51 | 52 | if use_teacher_forcing: 53 | inputs = inputs[inputs != self.eos_id].view(batch_size, -1) 54 | predicted_softmax, hidden = self.forward_step( 55 | input=inputs, 56 | hidden=hidden, 57 | function=function 58 | ) 59 | 60 | for di in range(predicted_softmax.size(1)): 61 | step_output = predicted_softmax[:, di, :] 62 | outputs.append(step_output) 63 | 64 | else: 65 | input = inputs[:, 0].unsqueeze(1) 66 | for di in range(max_length): 67 | predicted_softmax, hidden = self.forward_step( 68 | input=input, 69 | hidden=hidden, 70 | function=function 71 | ) 72 | 73 | step_output = predicted_softmax.squeeze(1) 74 | outputs.append(step_output) 75 | input = outputs[-1].topk(1)[1] 76 | 77 | return outputs 78 | 79 | def _init_state(self, batch_size): 80 | if isinstance(self.rnn, nn.LSTM): 81 | h_0 = torch.zeros(self.n_layers, batch_size, self.hidden_size).to(self.device) 82 | c_0 = torch.zeros(self.n_layers, batch_size, self.hidden_size).to(self.device) 83 | hidden = (h_0, c_0) 84 | 85 | else: 86 | hidden = torch.zeros(self.n_layers, batch_size, self.hidden_size).to(self.device) 87 | 88 | return hidden 89 | 90 | def flatten_parameters(self): 91 | self.rnn.flatten_parameters() -------------------------------------------------------------------------------- /package/__pycache__/config.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/config.cpython-37.pyc -------------------------------------------------------------------------------- /package/__pycache__/data_loader.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/data_loader.cpython-37.pyc 
-------------------------------------------------------------------------------- /package/__pycache__/definition.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/definition.cpython-37.pyc -------------------------------------------------------------------------------- /package/__pycache__/evaluator.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/evaluator.cpython-37.pyc -------------------------------------------------------------------------------- /package/__pycache__/loss.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/loss.cpython-37.pyc -------------------------------------------------------------------------------- /package/__pycache__/trainer.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/trainer.cpython-37.pyc -------------------------------------------------------------------------------- /package/__pycache__/utils.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sooftware/char-rnnlm/fc6573bde13b151f373fc081f63e3f563debf56c/package/__pycache__/utils.cpython-37.pyc -------------------------------------------------------------------------------- /package/config.py: -------------------------------------------------------------------------------- 1 | class Config: 2 | def __init__(self, 3 | use_cuda=True, 4 | hidden_size=512, 5 | 
dropout_p=0.5, 6 | n_layers=4, 7 | batch_size=32, 8 | max_epochs=40, 9 | lr=0.0001, 10 | teacher_forcing_ratio=1.0, 11 | seed=1, 12 | max_len=428, 13 | worker_num=1 14 | ): 15 | self.use_cuda = use_cuda 16 | self.hidden_size = hidden_size 17 | self.dropout_p = dropout_p 18 | self.n_layers = n_layers 19 | self.batch_size = batch_size 20 | self.max_epochs = max_epochs 21 | self.lr = lr 22 | self.teacher_forcing_ratio = teacher_forcing_ratio 23 | self.seed = seed 24 | self.max_len = max_len 25 | self.worker_num = worker_num 26 | -------------------------------------------------------------------------------- /package/data_loader.py: -------------------------------------------------------------------------------- 1 | import threading 2 | import csv 3 | import random 4 | import pandas as pd 5 | import torch 6 | import math 7 | import pickle 8 | from torch.utils.data import Dataset 9 | from package.definition import logger 10 | from package.utils import get_label, get_input 11 | 12 | 13 | class CustomDataLoader(threading.Thread): 14 | def __init__(self, dataset, queue, batch_size, thread_id): 15 | threading.Thread.__init__(self) 16 | self.collate_fn = _collate_fn 17 | self.dataset = dataset 18 | self.batch_size = batch_size 19 | self.queue = queue 20 | self.index = 0 21 | self.thread_id = thread_id 22 | self.dataset_count = dataset.count() 23 | 24 | def create_empty_batch(self): 25 | sequences = torch.zeros(0, 0, 0).to(torch.long) 26 | targets = torch.zeros(0, 0, 0).to(torch.long) 27 | 28 | sequence_lengths = list() 29 | target_lengths = list() 30 | 31 | return sequences, targets, sequence_lengths, target_lengths 32 | 33 | def run(self): 34 | logger.debug('loader %d start' % self.thread_id) 35 | while True: 36 | items = list() 37 | 38 | for _ in range(self.batch_size): 39 | if self.index >= self.dataset_count: 40 | break 41 | 42 | input, label = self.dataset.get_item(self.index) 43 | 44 | if input is not None: 45 | items.append((input, label)) 46 | 47 | self.index += 1 
48 | 49 | if len(items) == 0: 50 | batch = self.create_empty_batch() 51 | self.queue.put(batch) 52 | break 53 | 54 | random.shuffle(items) 55 | 56 | batch = self.collate_fn(items) 57 | self.queue.put(batch) 58 | 59 | logger.debug('loader %d stop' % self.thread_id) 60 | 61 | 62 | def _collate_fn(batch): 63 | """ functions that pad to the maximum sequence length """ 64 | 65 | def seq_length_(p): 66 | return len(p[0]) 67 | 68 | def target_length_(p): 69 | return len(p[1]) 70 | 71 | seq_lengths = [len(s[0]) for s in batch] 72 | target_lengths = [len(s[1]) for s in batch] 73 | 74 | max_seq_sample = max(batch, key=seq_length_)[0] 75 | max_target_sample = max(batch, key=target_length_)[1] 76 | 77 | max_seq_size = max_seq_sample.size(0) 78 | max_target_size = len(max_target_sample) 79 | batch_size = len(batch) 80 | 81 | seqs = torch.zeros(batch_size, max_seq_size).to(torch.long) 82 | targets = torch.zeros(batch_size, max_target_size).to(torch.long) 83 | 84 | from package.definition import PAD_token 85 | targets.fill_(PAD_token) 86 | seqs.fill_(PAD_token) 87 | 88 | for idx in range(batch_size): 89 | sample = batch[idx] 90 | tensor = sample[0] 91 | target = sample[1] 92 | 93 | seqs[idx].narrow(0, 0, len(tensor)).copy_(torch.LongTensor(tensor)) 94 | targets[idx].narrow(0, 0, len(target)).copy_(torch.LongTensor(target)) 95 | 96 | return seqs, targets, seq_lengths, target_lengths 97 | 98 | 99 | class CustomDataset(Dataset): 100 | def __init__(self, corpus, sos_id, eos_id, batch_size): 101 | self.corpus = corpus 102 | self.sos_id = sos_id 103 | self.eos_id = eos_id 104 | self.batch_size = batch_size 105 | 106 | def get_item(self, index): 107 | input = get_input(self.corpus[index], self.sos_id) 108 | label = get_label(self.corpus[index], self.eos_id) 109 | 110 | return input, label 111 | 112 | def shuffle(self): 113 | random.shuffle(self.corpus) 114 | 115 | def count(self): 116 | return len(self.corpus) 117 | 118 | 119 | def load_label(label_path, encoding='utf-8'): 120 | char2id 
= dict() 121 | id2char = dict() 122 | 123 | with open(label_path, 'r', encoding=encoding) as f: 124 | labels = csv.reader(f, delimiter=',') 125 | next(labels) 126 | 127 | for row in labels: 128 | char2id[row[1]] = row[0] 129 | id2char[int(row[0])] = row[1] 130 | 131 | return char2id, id2char 132 | 133 | 134 | def load_corpus(filepath): 135 | with open(filepath, 'rb') as f: 136 | corpus = pickle.load(f) 137 | corpus = list(corpus['id']) 138 | 139 | return corpus 140 | -------------------------------------------------------------------------------- /package/definition.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import sys 3 | logger = logging.getLogger('root') 4 | FORMAT = "[%(asctime)s %(filename)s:%(lineno)s - %(funcName)s()] %(message)s" 5 | logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format=FORMAT) 6 | logger.setLevel(logging.INFO) 7 | from package.data_loader import load_label 8 | char2id, id2char = load_label('./data/train_labels.csv', encoding='utf-8') 9 | SOS_token = int(char2id['']) 10 | EOS_token = int(char2id['']) 11 | PAD_token = int(char2id['_']) 12 | train_dict = {'loss': [], 'cer': []} 13 | valid_dict = {'loss': [], 'cer': []} -------------------------------------------------------------------------------- /package/evaluator.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from package.definition import logger 3 | 4 | 5 | def evaluate(model, queue, perplexity, device): 6 | logger.info('evaluate() start') 7 | 8 | total_loss = 0 9 | total_num = 0 10 | 11 | model.eval() 12 | 13 | with torch.no_grad(): 14 | while True: 15 | loss = perplexity 16 | 17 | inputs, targets, input_lengths, target_lengths = queue.get() 18 | 19 | if inputs.shape[0] == 0: 20 | break 21 | 22 | inputs = inputs.to(device) 23 | targets = targets.to(device) 24 | 25 | model.module.flatten_parameters() 26 | outputs = model(inputs, 
teacher_forcing_ratio=0.0) 27 | 28 | loss.reset() 29 | for step, step_output in enumerate(outputs): 30 | batch_size = targets.size(0) 31 | loss.eval_batch(step_output.contiguous().view(batch_size, -1), targets[:, step]) 32 | 33 | loss = loss.get_loss() 34 | 35 | total_loss += loss 36 | total_num += sum(input_lengths) 37 | 38 | logger.info('evaluate() completed') 39 | 40 | return total_loss / total_num 41 | -------------------------------------------------------------------------------- /package/loss.py: -------------------------------------------------------------------------------- 1 | # Refer to IBM/pytorch-seq2seq 2 | # https://github.com/IBM/pytorch-seq2seq/blob/master/seq2seq/loss/loss.py 3 | 4 | import torch 5 | import math 6 | import torch.nn as nn 7 | import numpy as np 8 | 9 | 10 | class Loss(object): 11 | """ Base class for encapsulation of the loss functions. 12 | This class defines interfaces that are commonly used with loss functions 13 | in training and inference. For information regarding individual loss 14 | functions, please refer to http://pytorch.org/docs/master/nn.html#loss-functions 15 | Note: 16 | Do not use this class directly, use one of the sub classes. 17 | Args: 18 | name (str): name of the loss function used by logging messages. 19 | criterion (torch.nn._Loss): one of PyTorch's loss functions. Refer 20 | to http://pytorch.org/docs/master/nn.html#loss-functions for 21 | a list of them. 22 | Attributes: 23 | name (str): name of the loss function used by logging messages. 24 | criterion (torch.nn._Loss): one of PyTorch's loss functions. Refer 25 | to http://pytorch.org/docs/master/nn.html#loss-functions for 26 | a list of them. Implementation depends on individual 27 | sub-classes. 28 | acc_loss (int or torch.Tensor): variable that stores accumulated loss. 29 | norm_term (float): normalization term that can be used to calculate 30 | the loss of multiple batches. Implementation depends on individual 31 | sub-classes.
32 | """ 33 | 34 | def __init__(self, name, criterion): 35 | self.name = name 36 | self.criterion = criterion 37 | if not issubclass(type(self.criterion), nn.modules.loss._Loss): 38 | raise ValueError("Criterion has to be a subclass of torch.nn._Loss") 39 | # accumulated loss 40 | self.acc_loss = 0 41 | # normalization term 42 | self.norm_term = 0 43 | 44 | def reset(self): 45 | """ Reset the accumulated loss. """ 46 | self.acc_loss = 0 47 | self.norm_term = 0 48 | 49 | def get_loss(self): 50 | """ Get the loss. 51 | This method defines how to calculate the averaged loss given the 52 | accumulated loss and the normalization term. Override to define your 53 | own logic. 54 | Returns: 55 | loss (float): value of the loss. 56 | """ 57 | raise NotImplementedError 58 | 59 | def eval_batch(self, outputs, target): 60 | """ Evaluate and accumulate loss given outputs and expected results. 61 | This method is called after each batch with the batch outputs and 62 | the target (expected) results. The loss and normalization term are 63 | accumulated in this method. Override it to define your own accumulation 64 | method. 65 | Args: 66 | outputs (torch.Tensor): outputs of a batch. 67 | target (torch.Tensor): expected output of a batch. 68 | """ 69 | raise NotImplementedError 70 | 71 | def cuda(self): 72 | self.criterion.cuda() 73 | 74 | def backward(self): 75 | if type(self.acc_loss) is int: 76 | raise ValueError("No loss to back propagate.") 77 | self.acc_loss.backward() 78 | 79 | 80 | class NLLLoss(Loss): 81 | """ Batch averaged negative log-likelihood loss. 82 | Args: 83 | weight (torch.Tensor, optional): refer to http://pytorch.org/docs/master/nn.html#nllloss 84 | mask (int, optional): index of masked token, i.e. weight[mask] = 0. 
85 | size_average (bool, optional): refer to http://pytorch.org/docs/master/nn.html#nllloss 86 | """ 87 | 88 | _NAME = "Avg NLLLoss" 89 | 90 | def __init__(self, weight=None, mask=None, size_average=True): 91 | self.mask = mask 92 | self.size_average = size_average 93 | if mask is not None: 94 | if weight is None: 95 | raise ValueError("Must provide weight with a mask.") 96 | weight[mask] = 0 97 | 98 | super(NLLLoss, self).__init__( 99 | self._NAME, 100 | nn.NLLLoss(weight=weight, reduction='sum')) 101 | 102 | def get_loss(self): 103 | if isinstance(self.acc_loss, int): 104 | return 0 105 | # total loss for all batches 106 | loss = self.acc_loss.data.item() 107 | if self.size_average: 108 | # average loss per batch 109 | loss /= self.norm_term 110 | return loss 111 | 112 | def eval_batch(self, outputs, target): 113 | self.acc_loss += self.criterion(outputs, target) 114 | self.norm_term += 1 115 | 116 | 117 | class Perplexity(NLLLoss): 118 | """ Language model perplexity loss. 119 | Perplexity is the exponential of the token-averaged negative log-likelihood: the 120 | summed NLL is divided by the number of non-masked target tokens, then exponentiated. 121 | Args: 122 | weight (torch.Tensor, optional): refer to http://pytorch.org/docs/master/nn.html#nllloss 123 | mask (int, optional): index of masked token, i.e. weight[mask] = 0.
124 | """ 125 | 126 | _NAME = "Perplexity" 127 | _MAX_EXP = 100 128 | 129 | def __init__(self, weight=None, mask=None, device=None): 130 | super(Perplexity, self).__init__(weight=weight, mask=mask, size_average=False) 131 | self.device = device 132 | 133 | def eval_batch(self, outputs, target): 134 | self.acc_loss += self.criterion(outputs, target).to(self.device) 135 | if self.mask is None: 136 | self.norm_term += np.prod(target.size()) 137 | else: 138 | self.norm_term += target.data.ne(self.mask).sum() 139 | 140 | def get_loss(self): 141 | nll = super(Perplexity, self).get_loss() 142 | nll /= self.norm_term.item() 143 | if nll > Perplexity._MAX_EXP: 144 | print("WARNING: Loss exceeded maximum value, capping to e^100") 145 | return math.exp(Perplexity._MAX_EXP) 146 | return math.exp(nll) 147 | -------------------------------------------------------------------------------- /package/trainer.py: -------------------------------------------------------------------------------- 1 | import time 2 | import torch 3 | from package.definition import logger 4 | 5 | train_step_result = {'loss': [], 'cer': []} 6 | 7 | 8 | def supervised_train(model, queue, perplexity, optimizer, device, print_every, epoch, 9 | teacher_forcing_ratio, worker_num, total_time_step, train_begin): 10 | print_loss_total = 0 # Reset every print_every 11 | epoch_loss_total = 0 # Reset every epoch 12 | total_num = 0 13 | time_step = 0 14 | 15 | model.train() 16 | begin = epoch_begin = time.time() 17 | 18 | while True: 19 | loss = perplexity 20 | inputs, targets, input_lens, target_lens = queue.get() 21 | 22 | if inputs.shape[0] == 0: 23 | # empty feats means closing one loader 24 | worker_num -= 1 25 | logger.debug('left train_loader: %d' % worker_num) 26 | 27 | if worker_num == 0: 28 | break 29 | else: 30 | continue 31 | 32 | inputs = inputs.to(device) 33 | targets = targets.to(device) 34 | 35 | model.module.flatten_parameters() 36 | outputs = model(inputs, teacher_forcing_ratio=teacher_forcing_ratio) 
37 | 38 | # Get loss 39 | loss.reset() 40 | for step, step_output in enumerate(outputs): 41 | batch_size = targets.size(0) 42 | loss.eval_batch(step_output.contiguous().view(batch_size, -1), targets[:, step]) 43 | # Backpropagation 44 | model.zero_grad() 45 | loss.backward() 46 | optimizer.step() 47 | loss = loss.get_loss() 48 | 49 | epoch_loss_total += loss 50 | print_loss_total += loss 51 | total_num += sum(input_lens) 52 | 53 | time_step += 1 54 | torch.cuda.empty_cache() 55 | 56 | if time_step % print_every == 0: 57 | current = time.time() 58 | elapsed = current - begin 59 | epoch_elapsed = (current - epoch_begin) / 60.0 60 | train_elapsed = (current - train_begin) / 3600.0 61 | 62 | logger.info('timestep: {:4d}/{:4d}, perplexity: {:.4f}, elapsed: {:.2f}s {:.2f}m {:.2f}h'.format( 63 | time_step, 64 | total_time_step, 65 | print_loss_total / print_every, 66 | elapsed, epoch_elapsed, train_elapsed 67 | )) 68 | print_loss_total = 0 69 | begin = time.time() 70 | 71 | if time_step % 50000 == 0: 72 | torch.save(model, "./data/epoch%s_%s.pt" % (str(epoch), str(time_step))) 73 | 74 | logger.info('train() completed') 75 | 76 | return epoch_loss_total / total_num 77 | -------------------------------------------------------------------------------- /package/utils.py: -------------------------------------------------------------------------------- 1 | import Levenshtein as Lev 2 | import torch 3 | 4 | 5 | def get_label(script, eos_id): 6 | tokens = script.split() 7 | 8 | label = list() 9 | for token in tokens: 10 | label.append(int(token)) 11 | label.append(int(eos_id)) 12 | 13 | return label 14 | 15 | 16 | def get_input(script, sos_id): 17 | tokens = script.split() 18 | 19 | label = list() 20 | label.append(int(sos_id)) 21 | for token in tokens: 22 | label.append(int(token)) 23 | 24 | return torch.LongTensor(label) 25 | 26 | 27 | def label_to_string(labels, id2char, eos_id): 28 | if len(labels.shape) == 1: 29 | sentence = str() 30 | for label in labels: 31 | if 
label.item() == eos_id: 32 | break 33 | sentence += id2char[label.item()] 34 | return sentence 35 | 36 | elif len(labels.shape) == 2: 37 | sentences = list() 38 | for batch in labels: 39 | sentence = str() 40 | for label in batch: 41 | if label.item() == eos_id: 42 | break 43 | sentence += id2char[label.item()] 44 | sentences.append(sentence) 45 | return sentences 46 | 47 | else: 48 | raise ValueError("shape Error !!") 49 | 50 | 51 | def char_distance(target, y_hat): 52 | target = target.replace(' ', '') 53 | y_hat = y_hat.replace(' ', '') 54 | 55 | dist = Lev.distance(y_hat, target) 56 | length = len(target.replace(' ', '')) 57 | 58 | return dist, length 59 | -------------------------------------------------------------------------------- /preprocess/KoNPron.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# KoNPron" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "class KoNPron:\n", 17 | " def __init__(self):\n", 18 | " self.base_digit = ['0','1','2','3','4','5','6','7','8','9']\n", 19 | " self.super_digit = ['⁰','¹','²','³','⁴','⁵','⁶','⁷','⁸','⁹']\n", 20 | " self.small_scale = ['','십','백','천']\n", 21 | " self.large_scale = ['','만 ','억 ','조 ','경 ','해 ']\n", 22 | " self.literal = ['영','일','이','삼','사','오','육','칠','팔','구']\n", 23 | " self.spoken_unit = ['', '하나', '둘', '셋', '넷', '다섯', '여섯', '일곱', '여덟', '아홉']\n", 24 | " self.spoke_tens = ['','열', '스물', '서른', '마흔', '쉰', '예순', '일흔', '여든', '아흔']\n", 25 | " self.sentence = str()\n", 26 | " \n", 27 | " def _detect(self, sentence):\n", 28 | " self.sentence = sentence\n", 29 | " detection_data = list()\n", 30 | " tmp = str()\n", 31 | " \n", 32 | " total_len = len(sentence)\n", 33 | " point_count = 0\n", 34 | " continuous_count = 0\n", 35 | " digit_type = 'vanilla'\n", 36 | " \n", 37 | " detected = 
False\n", 38 | " zero_started = False\n", 39 | " \n", 40 | " for idx, char in enumerate(sentence):\n", 41 | " if char in self.base_digit:\n", 42 | " if not detected:\n", 43 | " detected = True\n", 44 | " if char == '0':\n", 45 | " zero_started = True\n", 46 | " tmp += char\n", 47 | " continuous_count += 1\n", 48 | " if zero_started and continuous_count>8:\n", 49 | " digit_type = 'telephone/none'\n", 50 | " if continuous_count>20:\n", 51 | " digit_type = 'enormous/none' \n", 52 | " else:\n", 53 | " continuous_count = 0\n", 54 | " zero_started = False\n", 55 | " if char == ',':\n", 56 | " if idx+1 < total_len and idx > 0:\n", 57 | " if sentence[idx-1] in self.base_digit and sentence[idx+1] in self.base_digit:\n", 58 | " tmp += char\n", 59 | " elif char == '.':\n", 60 | " if idx+1 < total_len and idx > 0:\n", 61 | " if sentence[idx-1] in self.base_digit and sentence[idx+1] in self.base_digit:\n", 62 | " point_count += 1\n", 63 | " if point_count == 1:\n", 64 | " digit_type = 'fraction'\n", 65 | " if point_count > 1:\n", 66 | " digit_type = 'version'\n", 67 | " tmp += char\n", 68 | " elif char == '^':\n", 69 | " if idx+1 < total_len and idx > 0:\n", 70 | " if sentence[idx-1] in self.base_digit and sentence[idx+1] in self.base_digit:\n", 71 | " digit_type += '/square'\n", 72 | " tmp += char\n", 73 | " else:\n", 74 | " if digit_type != 'exception/none':\n", 75 | " digit_type = 'exception/none'\n", 76 | " tmp += char\n", 77 | " elif char in self.super_digit:\n", 78 | " if idx > 0:\n", 79 | " if sentence[idx-1] in self.base_digit or sentence[idx-1] in self.super_digit:\n", 80 | " if 'square' not in digit_type:\n", 81 | " digit_type += '/square'\n", 82 | " tmp += char\n", 83 | " else:\n", 84 | " if digit_type != 'exception/none':\n", 85 | " digit_type = 'exception/none'\n", 86 | " tmp += char\n", 87 | " elif char == '·':\n", 88 | " if idx+1 < total_len and idx > 0:\n", 89 | " if sentence[idx-1] in self.base_digit and sentence[idx+1] in self.base_digit:\n", 90 | " digit_type
= 'date'\n", 91 | " tmp += char\n", 92 | " else:\n", 93 | " if detected:\n", 94 | " detected = False\n", 95 | " if '/' not in digit_type:\n", 96 | " digit_type += '/none'\n", 97 | " detection_data.append((digit_type, tmp))\n", 98 | " tmp = str()\n", 99 | " digit_type = 'vanilla'\n", 100 | " point_count = 0\n", 101 | " continuous_count = 0\n", 102 | " else:\n", 103 | " if tmp:\n", 104 | " detection_data.append((digit_type, tmp))\n", 105 | " tmp = str()\n", 106 | " digit_type = 'vanilla'\n", 107 | " point_count = 0\n", 108 | " continuous_count = 0\n", 109 | " \n", 110 | " if detected:\n", 111 | " if '/' not in digit_type:\n", 112 | " digit_type += '/none'\n", 113 | " detection_data.append((digit_type, tmp))\n", 114 | " \n", 115 | " elif tmp:\n", 116 | " detection_data.append((digit_type, tmp))\n", 117 | " return detection_data\n", 118 | " \n", 119 | " def _preprocess(self, detection_data):\n", 120 | " preprocessed_data = list()\n", 121 | " \n", 122 | " for digit_type, target in detection_data:\n", 123 | " original = target\n", 124 | " target_seq = list()\n", 125 | " reading_method = list()\n", 126 | " target_len = len(target)\n", 127 | "\n", 128 | " main_type, sub_type = digit_type.split('/')\n", 129 | " if main_type == 'exception':\n", 130 | " return None\n", 131 | " \n", 132 | " if main_type == 'version':\n", 133 | " splited = target.split('.')\n", 134 | " for count, frag in enumerate(splited):\n", 135 | " target_seq.append(frag)\n", 136 | " reading_method.append('individual')\n", 137 | " if count < len(splited) - 1:\n", 138 | " target_seq.append('.')\n", 139 | " reading_method.append('point')\n", 140 | " preprocessed_data.append((reading_method, target_seq, original))\n", 141 | " \n", 142 | " if main_type == 'date':\n", 143 | " target = target.split('·')\n", 144 | " for frag in target:\n", 145 | " target_seq.append(frag)\n", 146 | " reading_method.append('individual')\n", 147 | " preprocessed_data.append((reading_method, target_seq, original))\n", 148 | " \n", 149 
| " if main_type == 'telephone' or main_type == 'enormous':\n", 150 | " target = target.replace('0', '_')\n", 151 | " target_seq.append(target)\n", 152 | " reading_method.append('individual')\n", 153 | " preprocessed_data.append((reading_method, target_seq, original))\n", 154 | " \n", 155 | " if sub_type != 'none':\n", 156 | " if main_type == 'vanilla':\n", 157 | " target = [target.replace(',','')]\n", 158 | " if main_type == 'fraction':\n", 159 | " target = target.split('.')\n", 160 | " \n", 161 | " if sub_type == 'square':\n", 162 | " if '^' not in target[-1]:\n", 163 | " super_part = str()\n", 164 | " for idx, digit in enumerate(target[-1]):\n", 165 | " if digit not in self.base_digit:\n", 166 | " super_idx = idx\n", 167 | " break\n", 168 | " super_num = list(target[-1][idx:])\n", 169 | " for idx in range(len(super_num)):\n", 170 | " super_num[idx] = str(self.super_digit.index(super_num[idx]))\n", 171 | " super_num = ''.join(super_num)\n", 172 | " super_part += target[-1][:super_idx]+'^'+super_num\n", 173 | " if main_type == 'vanilla':\n", 174 | " target = [super_part]\n", 175 | " else:\n", 176 | " target = target[:1]+[super_part]\n", 177 | " if main_type == 'fraction':\n", 178 | " target_seq.append(target[0])\n", 179 | " reading_method.append('literal')\n", 180 | " target_seq.append('.')\n", 181 | " reading_method.append('point')\n", 182 | "\n", 183 | " tmp_len = len(target_seq)\n", 184 | " tmp = target[-1].split('^')\n", 185 | " for seq in tmp:\n", 186 | " target_seq.append(seq)\n", 187 | " if 'point' in reading_method:\n", 188 | " if 'of' in reading_method:\n", 189 | " reading_method.append('literal')\n", 190 | " else:\n", 191 | " reading_method.append('individual')\n", 192 | " else:\n", 193 | " reading_method.append('literal')\n", 194 | " if len(target_seq) == tmp_len+1:\n", 195 | " target_seq.append(\"^\")\n", 196 | " reading_method.append(\"of\")\n", 197 | " elif len(target_seq) == tmp_len+3:\n", 198 | " target_seq.append(\"^\")\n", 199 | " 
reading_method.append(\"super\") \n", 200 | " preprocessed_data.append((reading_method, target_seq, original))\n", 201 | " \n", 202 | " else:\n", 203 | " if main_type == 'vanilla':\n", 204 | " target = target.replace(',','')\n", 205 | " target_seq.append(target)\n", 206 | " reading_method.append('literal')\n", 207 | " preprocessed_data.append((reading_method, target_seq, original))\n", 208 | " \n", 209 | " if main_type == 'fraction':\n", 210 | " target = target.split('.')\n", 211 | " for frag in target:\n", 212 | " target_seq.append(frag)\n", 213 | " if 'point' in reading_method:\n", 214 | " reading_method.append('individual')\n", 215 | " else:\n", 216 | " reading_method.append('literal')\n", 217 | " if len(target_seq) == 1:\n", 218 | " target_seq.append(\".\")\n", 219 | " reading_method.append('point')\n", 220 | " \n", 221 | " preprocessed_data.append((reading_method, target_seq, original))\n", 222 | " \n", 223 | " return preprocessed_data\n", 224 | " \n", 225 | " \n", 226 | " def _read(self, preprocessed_data, mode = 'informal'):\n", 227 | " def __literal_read(self, frag, mode = 'informal'):\n", 228 | " korean = str()\n", 229 | " tmp = str()\n", 230 | " length = len(frag)\n", 231 | " for idx, digit in enumerate(frag):\n", 232 | " digit = int(digit)\n", 233 | " inversed_idx = length-idx-1\n", 234 | " if mode == 'formal':\n", 235 | " if inversed_idx%4:\n", 236 | " if digit:\n", 237 | " tmp += self.literal[digit]\n", 238 | " tmp += self.small_scale[inversed_idx%4]\n", 239 | " else:\n", 240 | " if digit or length == 1:\n", 241 | " tmp += self.literal[digit]\n", 242 | " if tmp:\n", 243 | " tmp += self.large_scale[inversed_idx//4]\n", 244 | " korean += tmp\n", 245 | " tmp = str()\n", 246 | " \n", 247 | " elif mode == 'informal':\n", 248 | " if inversed_idx%4:\n", 249 | " if digit>1:\n", 250 | " tmp += self.literal[digit]\n", 251 | " if digit:\n", 252 | " tmp += self.small_scale[inversed_idx%4]\n", 253 | " else:\n", 254 | " if digit or length == 1:\n", 255 | " tmp += 
self.literal[digit]\n", 256 | " if tmp:\n", 257 | " tmp += self.large_scale[inversed_idx//4]\n", 258 | " if length == 5 and digit == 1 and inversed_idx == 4:\n", 259 | " tmp = tmp[1:]\n", 260 | " korean += tmp\n", 261 | " tmp = str()\n", 262 | " korean += tmp\n", 263 | " return korean\n", 264 | " \n", 265 | " def __individual_read(self, frag):\n", 266 | " korean = str()\n", 267 | " for digit in frag:\n", 268 | " if digit == '_':\n", 269 | " korean += '공'\n", 270 | " else:\n", 271 | " korean += self.literal[int(digit)]\n", 272 | " return korean\n", 273 | " \n", 274 | " \n", 275 | " result = self.sentence\n", 276 | " if preprocessed_data is None:\n", 277 | " return None\n", 278 | " \n", 279 | " for each in preprocessed_data:\n", 280 | " reading_method, target_seq, original = each\n", 281 | " readed = str()\n", 282 | " for idx, frag in enumerate(target_seq):\n", 283 | " if reading_method[idx] == 'literal':\n", 284 | " readed += __literal_read(self, frag, mode)\n", 285 | " if reading_method[idx] == 'individual':\n", 286 | " readed += __individual_read(self, frag)\n", 287 | " if reading_method[idx] == 'point':\n", 288 | " readed += ' 점 '\n", 289 | " if reading_method[idx] == 'of':\n", 290 | " readed += '의 '\n", 291 | " if reading_method[idx] == 'super':\n", 292 | " readed += ' 승'\n", 293 | " result = result.replace(original, readed, 1)\n", 294 | " return result\n", 295 | " \n", 296 | " def convert(self, sentence):\n", 297 | " return self._read(self._preprocess(self._detect(sentence)))" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "## Inspecting the intermediate outputs" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 12, 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "name": "stdout", 314 | "output_type": "stream", 315 | "text": [ 316 | "Original : 13,000, 3.14, 1.4^2, 8·15, 1.1.7\n", 317 | "<>\n", 318 | "[('vanilla/none', '13,000'), ('fraction/none', '3.14'), ('fraction/square', '1.4^2'), ('date/none',
'8·15'), ('version/none', '1.1.7')]\n", 319 | "<>\n", 320 | "[(['literal'], ['13000'], '13,000'), (['literal', 'point', 'individual'], ['3', '.', '14'], '3.14'), (['literal', 'point', 'individual', 'of', 'literal', 'super'], ['1', '.', '4', '^', '2', '^'], '1.4^2'), (['individual', 'individual'], ['8', '15'], '8·15'), (['individual', 'point', 'individual', 'point', 'individual'], ['1', '.', '1', '.', '7'], '1.1.7')]\n", 321 | "<>\n", 322 | "만 삼천, 삼 점 일사, 일 점 사의 이 승, 팔일오, 일 점 일 점 칠\n" 323 | ] 324 | } 325 | ], 326 | "source": [ 327 | "test = KoNPron()\n", 328 | "sentence = \"01066883225, 420076049241990172380904740935\"\n", 329 | "sentence = \"13,000, 3.14, 1.4^2, 8·15, 1.1.7\"\n", 330 | "print(\"Original : \",sentence)\n", 331 | "detect = test._detect(sentence)\n", 332 | "print(\"<>\")\n", 333 | "print(detect)\n", 334 | "preprocessed = test._preprocess(detect)\n", 335 | "print(\"<>\")\n", 336 | "print(preprocessed)\n", 337 | "print(\"<>\")\n", 338 | "print(test._read(preprocessed))" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "## For exceptional input, the intermediate and final outputs are set to None" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 3, 351 | "metadata": {}, 352 | "outputs": [ 353 | { 354 | "name": "stdout", 355 | "output_type": "stream", 356 | "text": [ 357 | "[('exception/none', '⁴')]\n", 358 | "None\n", 359 | "None\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "exception = \"아니⁴ 이런 일이?\"\n", 365 | "detect = test._detect(exception)\n", 366 | "print(detect)\n", 367 | "preprocessed = test._preprocess(detect)\n", 368 | "print(preprocessed)\n", 369 | "result = test._read(preprocessed)\n", 370 | "print(result)" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "## Separate exception sentences from normal sentences and print only the normal ones" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 6, 383 | "metadata": {}, 384 | "outputs": [ 385 | { 386 | "name": "stdout", 387 |
"output_type": "stream", 388 | "text": [ 389 | "팔일오 행사를 맞아 광장에 사만 구천여 명의 시민들이 모였다. 이 중 오십육 점 이이퍼센트는 태극기를 들고 있었다.\n", 390 | "버전 일 점 영 점 칠, 삼천만큼 사랑해, 십육 점 칠의 삼 승과 만 오천삼백삼십이 점 칠팔삼의 백십 승\n", 391 | "제 핸드폰 번호는 공일공일이삼사오육칠팔, 집 전화번호는 공이일이삼사오육칠이고 송장번호는 사이공공칠육공사구이사일구구공일칠이삼팔공구공사칠사공구삼오\n" 392 | ] 393 | } 394 | ], 395 | "source": [ 396 | "sentence1 = \"8·15 행사를 맞아 광장에 4,9000여 명의 시민들이 모였다. 이 중 56.22퍼센트는 태극기를 들고 있었다.\"\n", 397 | "sentence2 = \"아니 벌써³ 4월이라니\"\n", 398 | "exception1 = \"그러게요^^\"\n", 399 | "exception2 = \"버전 1.0.7, 3000만큼 사랑해, 16.7³과 15332.783^110\"\n", 400 | "sentence3 = \"제 핸드폰 번호는 01012345678, 집 전화번호는 021234567이고 송장번호는 420076049241990172380904740935\"\n", 401 | "\n", 402 | "data = [sentence1, exception1, exception2, sentence2, sentence3]\n", 403 | "\n", 404 | "for sentence in data:\n", 405 | " ret = test.convert(sentence)\n", 406 | " if ret is not None:\n", 407 | " print(ret)" 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": null, 413 | "metadata": {}, 414 | "outputs": [], 415 | "source": [] 416 | } 417 | ], 418 | "metadata": { 419 | "kernelspec": { 420 | "display_name": "Python 3", 421 | "language": "python", 422 | "name": "python3" 423 | }, 424 | "language_info": { 425 | "codemirror_mode": { 426 | "name": "ipython", 427 | "version": 3 428 | }, 429 | "file_extension": ".py", 430 | "mimetype": "text/x-python", 431 | "name": "python", 432 | "nbconvert_exporter": "python", 433 | "pygments_lexer": "ipython3", 434 | "version": "3.7.3" 435 | } 436 | }, 437 | "nbformat": 4, 438 | "nbformat_minor": 2 439 | } 440 | -------------------------------------------------------------------------------- /preprocess/Preprocess.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Preprocess for Language Model" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "수집한 Corpus를 통합하여 일련의 
규칙으로 전처리" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "* Import the required libraries" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import pandas as pd\n", 31 | "import numpy as np\n", 32 | "import random\n", 33 | "import pickle" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "* Merge the collected corpora" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "ename": "FileNotFoundError", 50 | "evalue": "[Errno 2] No such file or directory: '../data/OpenSubtitles.csv'", 51 | "output_type": "error", 52 | "traceback": [ 53 | "---------------------------------------------------------------------------", 54 | "FileNotFoundError Traceback (most recent call last)", 55 | " in \n----> 1 kor_corpus = pd.read_csv('../data/OpenSubtitles.csv', encoding='utf-8')\n 2 sub_corpus = pd.read_csv('../data/new_kor.csv', encoding='cp949')\n 3 kor_corpus = pd.concat([kor_corpus, sub_corpus])\n 4 \n 5 sub_corpus = pd.read_csv('../data/tatoeba.csv', encoding='utf-8')\n",
56 | "~\\anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)\n 674 )\n 675 \n--> 676 return _read(filepath_or_buffer, kwds)\n 677 \n 678 parser_f.__name__ = name\n",
57 | "~\\anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py in _read(filepath_or_buffer, kwds)\n 446 \n 447 # Create the parser.\n--> 448 parser = TextFileReader(fp_or_buf, **kwds)\n 449 \n 450 if chunksize or iterator:\n",
58 | "~\\anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py in __init__(self, f, engine, **kwds)\n 878 self.options[\"has_index_names\"] = kwds[\"has_index_names\"]\n 879 \n--> 880 self._make_engine(self.engine)\n 881 \n 882 def close(self):\n",
59 | "~\\anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py in _make_engine(self, engine)\n 1112 def _make_engine(self, engine=\"c\"):\n 1113 if engine == \"c\":\n-> 1114 self._engine = CParserWrapper(self.f, **self.options)\n 1115 else:\n 1116 if engine == \"python\":\n",
60 | "~\\anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py in __init__(self, src, **kwds)\n 1872 if kwds.get(\"compression\") is None and encoding:\n 1873 if isinstance(src, str):\n-> 1874 src = open(src, \"rb\")\n 1875 self.handles.append(src)\n 1876 \n",
61 | "FileNotFoundError: [Errno 2] No such file or directory: '../data/OpenSubtitles.csv'" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "kor_corpus = pd.read_csv('../data/OpenSubtitles.csv', encoding='utf-8')\n", 67 | "sub_corpus = pd.read_csv('../data/new_kor.csv', encoding='cp949')\n", 68 | "kor_corpus = pd.concat([kor_corpus, sub_corpus])\n", 69 | "\n", 70 | "sub_corpus = pd.read_csv('../data/tatoeba.csv', encoding='utf-8')\n", 71 | "kor_corpus = pd.concat([kor_corpus, sub_corpus])\n", 72 | "\n", 73 | "sub_corpus = pd.read_csv('../data/네이버 해커톤 데이터셋.csv', encoding='utf-8', engine='python')\n", 74 | "kor_corpus = pd.concat([kor_corpus, sub_corpus])\n", 75 | "\n", 76 | "for i in range(10):\n", 77 | " sub_corpus = pd.read_csv('../data/AI Hub 한-영 데이터셋[%s].csv' % i, encoding='utf-8', engine='python')\n", 78 | " kor_corpus = pd.concat([kor_corpus, sub_corpus])" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 |
"* Check the corpus size" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 3, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "data": { 95 | "text/plain": [ 96 | "3048264" 97 | ] 98 | }, 99 | "execution_count": 3, 100 | "metadata": {}, 101 | "output_type": "execute_result" 102 | } 103 | ], 104 | "source": [ 105 | "len(kor_corpus)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "* Check the corpus contents" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 4, 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "data": { 122 | "text/html": [ 123 | "
\n", 167 | "
" 168 | ], 169 | "text/plain": [ 170 | " ko\n", 171 | "0 폭설이 내리고 우박, 진눈깨비가 퍼부어도 눈보라가 몰아쳐도 강풍이 불고 비바람이 휘...\n", 172 | "1 우리의 한결같은 심부름꾼 황새 아저씨 가는 길을 그 누가 막으랴!\n", 173 | "2 황새 아저씨를 기다리세요\n", 174 | "3 찾아와 선물을 주실 거예요\n", 175 | "4 가난하든 부자이든 상관이 없답니다" 176 | ] 177 | }, 178 | "execution_count": 4, 179 | "metadata": {}, 180 | "output_type": "execute_result" 181 | } 182 | ], 183 | "source": [ 184 | "kor_corpus.head()" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "* Convert to a list and shuffle (the corpora were simply concatenated above, so they are mixed here)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 5, 197 | "metadata": {}, 198 | "outputs": [ 199 | { 200 | "data": { 201 | "text/plain": [ 202 | "['어디 보자...',\n", 203 | " '7대 왕국 종족 중에 오크는 처음 듣는다',\n", 204 | " '이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.',\n", 205 | " '라니스터 가의',\n", 206 | " '별 희한한 생각이 다 떠오르곤 하죠',\n", 207 | " '아!',\n", 208 | " '회의 전에 마실 것 좀 드릴까요? 커피와 차가 있습니다.',\n", 209 | " '나는 그런 일에는 전혀 흥미가 없어요.',\n", 210 | " '일자리가 있고 제대로 된 회사가 있다면 늘 사람들이 붐비고 활력이 넘치는 도시로 자리매김할 수 있다.',\n", 211 | " '카카오가 많이 함유된 다크 초콜릿이 더 좋다는 것을 잊지 말자.',\n", 212 | " '비오 13세는 누구인가?',\n", 213 | " '네, 어..',\n", 214 | " '치밀한 사람이긴 하지만...',\n", 215 | " '계속 절 따라오셨죠',\n", 216 | " '세계 4위 특허출원 강국이지만 지식재산 심사 품질과 보호 수준이 낮아 가치를 제대로 인정받지 못하고 있다.',\n", 217 | " '나는 전부터 이 학교를 잘 알고 있었어.',\n", 218 | " '-뭔가 꿍꿍이를 꾸미는게 틀림없어',\n", 219 | " '어쩔 수 없이 과학 수업에 늦었습니다.',\n", 220 | " '가르시아 에르난데스, 토니 그란데, 하비에르 미냐노 등 축구대표팀의 외국인 코치들이 8일 오후 (현지시간) 오스트리아 레오강의 대표팀 숙소에서 인터뷰에 응하고있다.',\n", 221 | " '검찰은 “(증인신문 내용이) 다 녹음 됐으니까 (이 전 대통령이 한 말에 대해) 따지려면 따져볼 수 있다”고 말하기도 했다.']" 222 | ] 223 | }, 224 | "execution_count": 5, 225 | "metadata": {}, 226 | "output_type": "execute_result" 227 | } 228 | ], 229 | "source": [ 230 | "kor_corpus = list(kor_corpus['ko'])\n", 231 | "random.shuffle(kor_corpus)\n", 232 | "kor_corpus[:20]" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "* Filter out content enclosed in parentheses such as ( ). (This language model is for Speech-To-Text, so the parentheses and their contents are not needed)" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": 6, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "new_corpus = list()\n", 249 | "\n", 250 | "for sentence in kor_corpus:\n", 251 | " new_sentence = str()\n", 252 | " flag = False # reset per sentence so an unclosed '(' cannot swallow later sentences\n", 253 | " for ch in sentence:\n", 254 | " if ch == '(':\n", 255 | " flag = True\n", 256 | " continue\n", 257 | " elif ch == ')':\n", 258 | " flag = False\n", 259 | " continue\n", 260 | " elif not flag:\n", 261 | " new_sentence += ch\n", 262 | " else:\n", 263 | " continue\n", 264 | " \n", 265 | " new_corpus.append(new_sentence)" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "* Check that the parenthesized content was filtered out correctly" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 7, 278 | "metadata": {}, 279 | "outputs": [ 280 | { 281 | "data": { 282 | "text/plain": [ 283 | "['어디 보자...',\n", 284 | " '7대 왕국 종족 중에 오크는 처음 듣는다',\n", 285 | " '이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.',\n", 286 | " '라니스터 가의',\n", 287 | " '별 희한한 생각이 다 떠오르곤 하죠',\n", 288 | " '아!',\n", 289 | " '회의 전에 마실 것 좀 드릴까요?
커피와 차가 있습니다.',\n", 290 | " '나는 그런 일에는 전혀 흥미가 없어요.',\n", 291 | " '일자리가 있고 제대로 된 회사가 있다면 늘 사람들이 붐비고 활력이 넘치는 도시로 자리매김할 수 있다.',\n", 292 | " '카카오가 많이 함유된 다크 초콜릿이 더 좋다는 것을 잊지 말자.',\n", 293 | " '비오 13세는 누구인가?',\n", 294 | " '네, 어..',\n", 295 | " '치밀한 사람이긴 하지만...',\n", 296 | " '계속 절 따라오셨죠',\n", 297 | " '세계 4위 특허출원 강국이지만 지식재산 심사 품질과 보호 수준이 낮아 가치를 제대로 인정받지 못하고 있다.',\n", 298 | " '나는 전부터 이 학교를 잘 알고 있었어.',\n", 299 | " '-뭔가 꿍꿍이를 꾸미는게 틀림없어',\n", 300 | " '어쩔 수 없이 과학 수업에 늦었습니다.',\n", 301 | " '가르시아 에르난데스, 토니 그란데, 하비에르 미냐노 등 축구대표팀의 외국인 코치들이 8일 오후 오스트리아 레오강의 대표팀 숙소에서 인터뷰에 응하고있다.',\n", 302 | " '검찰은 “ 다 녹음 됐으니까 따지려면 따져볼 수 있다”고 말하기도 했다.']" 303 | ] 304 | }, 305 | "execution_count": 7, 306 | "metadata": {}, 307 | "output_type": "execute_result" 308 | } 309 | ], 310 | "source": [ 311 | "kor_corpus = new_corpus\n", 312 | "kor_corpus[:20]" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "* 현재 corpus에 어떤 특수문자들이 있는지 확인" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 8, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [ 328 | "special_ch = list()\n", 329 | "\n", 330 | "for sentence in kor_corpus:\n", 331 | " for ch in sentence:\n", 332 | " if ch.isdigit() == False and ch.isalpha() == False and ch not in special_ch:\n", 333 | " special_ch.append(ch)" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "* corpus에 등장한 특수문자 개수 확인" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 9, 346 | "metadata": {}, 347 | "outputs": [ 348 | { 349 | "data": { 350 | "text/plain": [ 351 | "209" 352 | ] 353 | }, 354 | "execution_count": 9, 355 | "metadata": {}, 356 | "output_type": "execute_result" 357 | } 358 | ], 359 | "source": [ 360 | "len(special_ch)" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "* 등장한 특수문자를 csv 파일로 저장하여 적절한 발음전사로 매핑" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 
372 | "execution_count": 10, 373 | "metadata": {}, 374 | "outputs": [ 375 | { 376 | "data": { 377 | "text/html": [ 378 | "
\n", 379 | "\n", 392 | "\n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | "
   special    replace
0
1    .    .
2    “
3    \"
4    -
\n", 428 | "
" 429 | ], 430 | "text/plain": [ 431 | " special replace\n", 432 | "0 \n", 433 | "1 . .\n", 434 | "2 “ \n", 435 | "3 \" \n", 436 | "4 - " 437 | ] 438 | }, 439 | "execution_count": 10, 440 | "metadata": {}, 441 | "output_type": "execute_result" 442 | } 443 | ], 444 | "source": [ 445 | "special_df = pd.read_csv('special.csv', encoding='utf-8')\n", 446 | "special_df = special_df.fillna('')\n", 447 | "special_df.head()" 448 | ] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": {}, 453 | "source": [ 454 | "* Build special_dict (maps special => replace)" 455 | ] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "execution_count": 11, 460 | "metadata": {}, 461 | "outputs": [], 462 | "source": [ 463 | "special_dict = dict()\n", 464 | "\n", 465 | "for (special, replace) in zip(special_df['special'], special_df['replace']):\n", 466 | " special_dict[special] = replace\n", 467 | "\n", 468 | "special_dict[' '] = ' '" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "* Store the columns as lists so membership can be checked with the in operator" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 12, 481 | "metadata": {}, 482 | "outputs": [], 483 | "source": [ 484 | "specials = list(special_df['special'])\n", 485 | "replaces = list(special_df['replace'])" 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "* Convert special characters to the desired phonetic transcription" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": 13, 498 | "metadata": {}, 499 | "outputs": [], 500 | "source": [ 501 | "new_corpus = list()\n", 502 | "\n", 503 | "for sentence in kor_corpus:\n", 504 | " new_sentence = str()\n", 505 | " for ch in sentence:\n", 506 | " if ch in specials:\n", 507 | " new_sentence += special_dict[ch]\n", 508 | " else:\n", 509 | " new_sentence += ch\n", 510 | " \n", 511 | " new_corpus.append(new_sentence)" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | 
"* 특수문자가 적절히 변경되었는지 확인" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 14, 524 | "metadata": {}, 525 | "outputs": [ 526 | { 527 | "data": { 528 | "text/plain": [ 529 | "['어디 보자...',\n", 530 | " '7대 왕국 종족 중에 오크는 처음 듣는다',\n", 531 | " '이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.',\n", 532 | " '라니스터 가의',\n", 533 | " '별 희한한 생각이 다 떠오르곤 하죠',\n", 534 | " '아!',\n", 535 | " '회의 전에 마실 것 좀 드릴까요? 커피와 차가 있습니다.',\n", 536 | " '나는 그런 일에는 전혀 흥미가 없어요.',\n", 537 | " '일자리가 있고 제대로 된 회사가 있다면 늘 사람들이 붐비고 활력이 넘치는 도시로 자리매김할 수 있다.',\n", 538 | " '카카오가 많이 함유된 다크 초콜릿이 더 좋다는 것을 잊지 말자.',\n", 539 | " '비오 13세는 누구인가?',\n", 540 | " '네 어..',\n", 541 | " '치밀한 사람이긴 하지만...',\n", 542 | " '계속 절 따라오셨죠',\n", 543 | " '세계 4위 특허출원 강국이지만 지식재산 심사 품질과 보호 수준이 낮아 가치를 제대로 인정받지 못하고 있다.',\n", 544 | " '나는 전부터 이 학교를 잘 알고 있었어.',\n", 545 | " '뭔가 꿍꿍이를 꾸미는게 틀림없어',\n", 546 | " '어쩔 수 없이 과학 수업에 늦었습니다.',\n", 547 | " '가르시아 에르난데스 토니 그란데 하비에르 미냐노 등 축구대표팀의 외국인 코치들이 8일 오후 오스트리아 레오강의 대표팀 숙소에서 인터뷰에 응하고있다.',\n", 548 | " '검찰은 다 녹음 됐으니까 따지려면 따져볼 수 있다고 말하기도 했다.']" 549 | ] 550 | }, 551 | "execution_count": 14, 552 | "metadata": {}, 553 | "output_type": "execute_result" 554 | } 555 | ], 556 | "source": [ 557 | "new_corpus[:20]" 558 | ] 559 | }, 560 | { 561 | "cell_type": "markdown", 562 | "metadata": {}, 563 | "source": [ 564 | "* corpus 업데이트" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": 15, 570 | "metadata": {}, 571 | "outputs": [], 572 | "source": [ 573 | "kor_corpus = new_corpus" 574 | ] 575 | }, 576 | { 577 | "cell_type": "markdown", 578 | "metadata": {}, 579 | "source": [ 580 | "* 알파벳 필터링 (소문자)" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": 16, 586 | "metadata": {}, 587 | "outputs": [], 588 | "source": [ 589 | "alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']\n", 590 | "new_corpus = list()\n", 591 | "\n", 592 | "for sentence in kor_corpus:\n", 593 | " 
new_sentence = str()\n", 594 | " for ch in sentence:\n", 595 | " if ch in alphabet:\n", 596 | " continue\n", 597 | " else:\n", 598 | " new_sentence += ch\n", 599 | " \n", 600 | " new_corpus.append(new_sentence)" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": 17, 606 | "metadata": {}, 607 | "outputs": [ 608 | { 609 | "data": { 610 | "text/plain": [ 611 | "['어디 보자...',\n", 612 | " '7대 왕국 종족 중에 오크는 처음 듣는다',\n", 613 | " '이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.',\n", 614 | " '라니스터 가의',\n", 615 | " '별 희한한 생각이 다 떠오르곤 하죠',\n", 616 | " '아!',\n", 617 | " '회의 전에 마실 것 좀 드릴까요? 커피와 차가 있습니다.',\n", 618 | " '나는 그런 일에는 전혀 흥미가 없어요.',\n", 619 | " '일자리가 있고 제대로 된 회사가 있다면 늘 사람들이 붐비고 활력이 넘치는 도시로 자리매김할 수 있다.',\n", 620 | " '카카오가 많이 함유된 다크 초콜릿이 더 좋다는 것을 잊지 말자.',\n", 621 | " '비오 13세는 누구인가?',\n", 622 | " '네 어..',\n", 623 | " '치밀한 사람이긴 하지만...',\n", 624 | " '계속 절 따라오셨죠',\n", 625 | " '세계 4위 특허출원 강국이지만 지식재산 심사 품질과 보호 수준이 낮아 가치를 제대로 인정받지 못하고 있다.',\n", 626 | " '나는 전부터 이 학교를 잘 알고 있었어.',\n", 627 | " '뭔가 꿍꿍이를 꾸미는게 틀림없어',\n", 628 | " '어쩔 수 없이 과학 수업에 늦었습니다.',\n", 629 | " '가르시아 에르난데스 토니 그란데 하비에르 미냐노 등 축구대표팀의 외국인 코치들이 8일 오후 오스트리아 레오강의 대표팀 숙소에서 인터뷰에 응하고있다.',\n", 630 | " '검찰은 다 녹음 됐으니까 따지려면 따져볼 수 있다고 말하기도 했다.']" 631 | ] 632 | }, 633 | "execution_count": 17, 634 | "metadata": {}, 635 | "output_type": "execute_result" 636 | } 637 | ], 638 | "source": [ 639 | "kor_corpus = new_corpus\n", 640 | "kor_corpus[:20]" 641 | ] 642 | }, 643 | { 644 | "cell_type": "markdown", 645 | "metadata": {}, 646 | "source": [ 647 | "* 대문자 필터링 (발음으로 전사 ex) K리그 => 케이리그)" 648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": 18, 653 | "metadata": {}, 654 | "outputs": [], 655 | "source": [ 656 | "new_corpus = list()\n", 657 | "upper = ['A', 'B', 'C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']\n", 658 | "upper_dict = {'A' : '에이', 'B' : '비', 'C' : '씨','D' : '디','E' : '이','F' : '에프','G' : '쥐',\n", 659 | " 'H' : '에이취','I' : '아이','J' : 
'제이','K' : '케이','L' : '엘','M' : '엠','N' : '엔',\n", 660 | " 'O' : '오','P' : '피','Q' : '큐','R' : '알','S' : '에스','T' : '티','U' : '유','V' : '브이','W' : '떠블유',\n", 661 | " 'X' : '엑스','Y' : '와이','Z' : '지'}\n", 662 | "\n", 663 | "for sentence in kor_corpus:\n", 664 | " new_sentence = str()\n", 665 | " for idx, ch in enumerate(sentence):\n", 666 | " if ch in upper:\n", 667 | " if idx == 0 and (idx == len(sentence) - 1 or sentence[idx + 1] == ' '):\n", 668 | " continue\n", 669 | " \n", 670 | " elif idx != 0 and idx < len(sentence) - 1 and sentence[idx + 1] == ' ' and sentence[idx - 1] == ' ':\n", 671 | " continue\n", 672 | " \n", 673 | " elif idx != 0 and sentence[idx -1] == ' ' and idx + 1 == len(sentence):\n", 674 | " continue\n", 675 | " \n", 676 | " else:\n", 677 | " new_sentence += upper_dict[ch]\n", 678 | " \n", 679 | " else:\n", 680 | " new_sentence += ch\n", 681 | " \n", 682 | " new_corpus.append(new_sentence)" 683 | ] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "execution_count": 19, 688 | "metadata": {}, 689 | "outputs": [ 690 | { 691 | "data": { 692 | "text/plain": [ 693 | "['어디 보자...',\n", 694 | " '7대 왕국 종족 중에 오크는 처음 듣는다',\n", 695 | " '이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.',\n", 696 | " '라니스터 가의',\n", 697 | " '별 희한한 생각이 다 떠오르곤 하죠',\n", 698 | " '아!',\n", 699 | " '회의 전에 마실 것 좀 드릴까요? 
커피와 차가 있습니다.',\n", 700 | " '나는 그런 일에는 전혀 흥미가 없어요.',\n", 701 | " '일자리가 있고 제대로 된 회사가 있다면 늘 사람들이 붐비고 활력이 넘치는 도시로 자리매김할 수 있다.',\n", 702 | " '카카오가 많이 함유된 다크 초콜릿이 더 좋다는 것을 잊지 말자.',\n", 703 | " '비오 13세는 누구인가?',\n", 704 | " '네 어..',\n", 705 | " '치밀한 사람이긴 하지만...',\n", 706 | " '계속 절 따라오셨죠',\n", 707 | " '세계 4위 특허출원 강국이지만 지식재산 심사 품질과 보호 수준이 낮아 가치를 제대로 인정받지 못하고 있다.',\n", 708 | " '나는 전부터 이 학교를 잘 알고 있었어.',\n", 709 | " '뭔가 꿍꿍이를 꾸미는게 틀림없어',\n", 710 | " '어쩔 수 없이 과학 수업에 늦었습니다.',\n", 711 | " '가르시아 에르난데스 토니 그란데 하비에르 미냐노 등 축구대표팀의 외국인 코치들이 8일 오후 오스트리아 레오강의 대표팀 숙소에서 인터뷰에 응하고있다.',\n", 712 | " '검찰은 다 녹음 됐으니까 따지려면 따져볼 수 있다고 말하기도 했다.']" 713 | ] 714 | }, 715 | "execution_count": 19, 716 | "metadata": {}, 717 | "output_type": "execute_result" 718 | } 719 | ], 720 | "source": [ 721 | "new_corpus[:20]" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | "* 데이터셋 업데이트" 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": 20, 734 | "metadata": {}, 735 | "outputs": [], 736 | "source": [ 737 | "kor_corpus = new_corpus" 738 | ] 739 | }, 740 | { 741 | "cell_type": "markdown", 742 | "metadata": {}, 743 | "source": [ 744 | "* 긴 공백 => ' '" 745 | ] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "execution_count": 21, 750 | "metadata": {}, 751 | "outputs": [], 752 | "source": [ 753 | "new_corpus = list()\n", 754 | "\n", 755 | "for sentence in kor_corpus:\n", 756 | " new_sentence = str()\n", 757 | " tokens = sentence.split()\n", 758 | " for idx, token in enumerate(tokens):\n", 759 | " if idx == 0:\n", 760 | " new_sentence += token + ' '\n", 761 | " elif idx == len(tokens) - 1:\n", 762 | " new_sentence += token\n", 763 | " else:\n", 764 | " new_sentence += token + ' '\n", 765 | " \n", 766 | " new_corpus.append(new_sentence)" 767 | ] 768 | }, 769 | { 770 | "cell_type": "code", 771 | "execution_count": 22, 772 | "metadata": {}, 773 | "outputs": [ 774 | { 775 | "data": { 776 | "text/plain": [ 777 | "['어디 보자...',\n", 778 | " '7대 왕국 종족 중에 
오크는 처음 듣는다',\n", 779 | " '이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.',\n", 780 | " '라니스터 가의',\n", 781 | " '별 희한한 생각이 다 떠오르곤 하죠',\n", 782 | " '아! ',\n", 783 | " '회의 전에 마실 것 좀 드릴까요? 커피와 차가 있습니다.',\n", 784 | " '나는 그런 일에는 전혀 흥미가 없어요.',\n", 785 | " '일자리가 있고 제대로 된 회사가 있다면 늘 사람들이 붐비고 활력이 넘치는 도시로 자리매김할 수 있다.',\n", 786 | " '카카오가 많이 함유된 다크 초콜릿이 더 좋다는 것을 잊지 말자.',\n", 787 | " '비오 13세는 누구인가?',\n", 788 | " '네 어..',\n", 789 | " '치밀한 사람이긴 하지만...',\n", 790 | " '계속 절 따라오셨죠',\n", 791 | " '세계 4위 특허출원 강국이지만 지식재산 심사 품질과 보호 수준이 낮아 가치를 제대로 인정받지 못하고 있다.',\n", 792 | " '나는 전부터 이 학교를 잘 알고 있었어.',\n", 793 | " '뭔가 꿍꿍이를 꾸미는게 틀림없어',\n", 794 | " '어쩔 수 없이 과학 수업에 늦었습니다.',\n", 795 | " '가르시아 에르난데스 토니 그란데 하비에르 미냐노 등 축구대표팀의 외국인 코치들이 8일 오후 오스트리아 레오강의 대표팀 숙소에서 인터뷰에 응하고있다.',\n", 796 | " '검찰은 다 녹음 됐으니까 따지려면 따져볼 수 있다고 말하기도 했다.']" 797 | ] 798 | }, 799 | "execution_count": 22, 800 | "metadata": {}, 801 | "output_type": "execute_result" 802 | } 803 | ], 804 | "source": [ 805 | "new_corpus[:20]" 806 | ] 807 | }, 808 | { 809 | "cell_type": "markdown", 810 | "metadata": {}, 811 | "source": [ 812 | "* Update the dataset" 813 | ] 814 | }, 815 | { 816 | "cell_type": "code", 817 | "execution_count": 23, 818 | "metadata": {}, 819 | "outputs": [], 820 | "source": [ 821 | "kor_corpus = new_corpus" 822 | ] 823 | }, 824 | { 825 | "cell_type": "markdown", 826 | "metadata": {}, 827 | "source": [ 828 | "* Check the dataset size" 829 | ] 830 | }, 831 | { 832 | "cell_type": "code", 833 | "execution_count": 24, 834 | "metadata": {}, 835 | "outputs": [ 836 | { 837 | "data": { 838 | "text/plain": [ 839 | "3048264" 840 | ] 841 | }, 842 | "execution_count": 24, 843 | "metadata": {}, 844 | "output_type": "execute_result" 845 | } 846 | ], 847 | "source": [ 848 | "len(kor_corpus)" 849 | ] 850 | }, 851 | { 852 | "cell_type": "markdown", 853 | "metadata": {}, 854 | "source": [ 855 | "* Drop entries that became empty strings after filtering" 856 | ] 857 | }, 858 | { 859 | "cell_type": "code", 860 | "execution_count": 25, 861 | "metadata": {}, 862 | "outputs": [], 863 | "source": [ 864 | 
"new_corpus = list()\n", 865 | "\n", 866 | "for sentence in kor_corpus:\n", 867 | " if len(sentence) == 0:\n", 868 | " continue\n", 869 | " new_corpus.append(sentence)" 870 | ] 871 | }, 872 | { 873 | "cell_type": "markdown", 874 | "metadata": {}, 875 | "source": [ 876 | "* About 25,000 sentences were filtered out" 877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": 26, 882 | "metadata": {}, 883 | "outputs": [ 884 | { 885 | "data": { 886 | "text/plain": [ 887 | "3023296" 888 | ] 889 | }, 890 | "execution_count": 26, 891 | "metadata": {}, 892 | "output_type": "execute_result" 893 | } 894 | ], 895 | "source": [ 896 | "len(new_corpus)" 897 | ] 898 | }, 899 | { 900 | "cell_type": "markdown", 901 | "metadata": {}, 902 | "source": [ 903 | "* Update the dataset" 904 | ] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "execution_count": 27, 909 | "metadata": {}, 910 | "outputs": [], 911 | "source": [ 912 | "kor_corpus = new_corpus" 913 | ] 914 | }, 915 | { 916 | "cell_type": "markdown", 917 | "metadata": {}, 918 | "source": [ 919 | "* Spot-check an arbitrary slice of the corpus" 920 | ] 921 | }, 922 | { 923 | "cell_type": "code", 924 | "execution_count": 28, 925 | "metadata": {}, 926 | "outputs": [ 927 | { 928 | "data": { 929 | "text/plain": [ 930 | "['이란 세계식량계획 사무소 대표 및 국가 담당국장 쥐는 2018년 올해 주이란 한국대사관 측이 220만 달러를 기부했을 때 한국 국민과 정부에 감사의 뜻을 전하기도 했다.',\n", 931 | " '조심해요 오염된 거예요',\n", 932 | " '그건 그렇고 메이즈는 어때?',\n", 933 | " '간에 있는 덩어리가..',\n", 934 | " '. ',\n", 935 | " '물렸어요? ',\n", 936 | " '안 보이나 보네',\n", 937 | " '그 전엔 아무것도 쓸 것 이 없었어요. 
닥터 그레이.',\n", 938 | " '하루만 머무를 거요',\n", 939 | " '그런건 내 일이 아니라서 다른 도움이 필요하면 말해 난 쇼파에 있을꺼야',\n", 940 | " '고마워 ',\n", 941 | " '취업준비생들은 현실적으로 청년취업이 어려운 상황에서 아쉽다는 반응이었다.',\n", 942 | " '시장을 떠나는 순간 덮치겠음',\n", 943 | " '그리고 디자인이랑',\n", 944 | " '저항권이란 위헌적인 국가권력의 행사 때문에 헌법이 침해 또는 파괴되는 경우 주권자인 국민이 헌법을 보장하기 위해 최후 비상수단으로 국가권력에 실력으로 저항하는 것입니다.',\n", 945 | " '현대모비스 양동근도 정규 리그에서 상대해 본 결과 함지훈을 막을 선수가 없다는 게 전자랜드의 약점이라고 지적했다.',\n", 946 | " '아사히는 꾸준히 한국어 공부를 하면서 내가 하고 싶은 노래를 만들기 위해 항상 들고 다닌다고 말해 아티스트로서 더욱 발전할 수 있는 가능성을 보여줬다.',\n", 947 | " '서양마사지 방식으로는 스웨디시 마사지 에스알엔 그리고 홀리스틱 마사지가 있다.',\n", 948 | " '내 말을 들었더라면',\n", 949 | " '앞으로 이렇게 수도권 광역 경제발전위원회가 자주 열려서 다양한 이야기를 해야 한다.']" 950 | ] 951 | }, 952 | "execution_count": 28, 953 | "metadata": {}, 954 | "output_type": "execute_result" 955 | } 956 | ], 957 | "source": [ 958 | "kor_corpus[6000:6020]" 959 | ] 960 | }, 961 | { 962 | "cell_type": "markdown", 963 | "metadata": {}, 964 | "source": [ 965 | "* 뒤에가 공백으로 끝나는 문자열 공백 제거" 966 | ] 967 | }, 968 | { 969 | "cell_type": "code", 970 | "execution_count": 29, 971 | "metadata": {}, 972 | "outputs": [], 973 | "source": [ 974 | "for idx, sentence in enumerate(kor_corpus):\n", 975 | " if sentence[-1] == ' ':\n", 976 | " kor_corpus[idx] = sentence[:-1] " 977 | ] 978 | }, 979 | { 980 | "cell_type": "code", 981 | "execution_count": 30, 982 | "metadata": {}, 983 | "outputs": [ 984 | { 985 | "data": { 986 | "text/plain": [ 987 | "['이란 세계식량계획 사무소 대표 및 국가 담당국장 쥐는 2018년 올해 주이란 한국대사관 측이 220만 달러를 기부했을 때 한국 국민과 정부에 감사의 뜻을 전하기도 했다.',\n", 988 | " '조심해요 오염된 거예요',\n", 989 | " '그건 그렇고 메이즈는 어때?',\n", 990 | " '간에 있는 덩어리가..',\n", 991 | " '.',\n", 992 | " '물렸어요?',\n", 993 | " '안 보이나 보네',\n", 994 | " '그 전엔 아무것도 쓸 것 이 없었어요. 
닥터 그레이.',\n", 995 | " '하루만 머무를 거요',\n", 996 | " '그런건 내 일이 아니라서 다른 도움이 필요하면 말해 난 쇼파에 있을꺼야',\n", 997 | " '고마워',\n", 998 | " '취업준비생들은 현실적으로 청년취업이 어려운 상황에서 아쉽다는 반응이었다.',\n", 999 | " '시장을 떠나는 순간 덮치겠음',\n", 1000 | " '그리고 디자인이랑',\n", 1001 | " '저항권이란 위헌적인 국가권력의 행사 때문에 헌법이 침해 또는 파괴되는 경우 주권자인 국민이 헌법을 보장하기 위해 최후 비상수단으로 국가권력에 실력으로 저항하는 것입니다.',\n", 1002 | " '현대모비스 양동근도 정규 리그에서 상대해 본 결과 함지훈을 막을 선수가 없다는 게 전자랜드의 약점이라고 지적했다.',\n", 1003 | " '아사히는 꾸준히 한국어 공부를 하면서 내가 하고 싶은 노래를 만들기 위해 항상 들고 다닌다고 말해 아티스트로서 더욱 발전할 수 있는 가능성을 보여줬다.',\n", 1004 | " '서양마사지 방식으로는 스웨디시 마사지 에스알엔 그리고 홀리스틱 마사지가 있다.',\n", 1005 | " '내 말을 들었더라면',\n", 1006 | " '앞으로 이렇게 수도권 광역 경제발전위원회가 자주 열려서 다양한 이야기를 해야 한다.']" 1007 | ] 1008 | }, 1009 | "execution_count": 30, 1010 | "metadata": {}, 1011 | "output_type": "execute_result" 1012 | } 1013 | ], 1014 | "source": [ 1015 | "kor_corpus[6000:6020]" 1016 | ] 1017 | }, 1018 | { 1019 | "cell_type": "markdown", 1020 | "metadata": {}, 1021 | "source": [ 1022 | "* 현재 데이터셋 크기 확인" 1023 | ] 1024 | }, 1025 | { 1026 | "cell_type": "code", 1027 | "execution_count": 31, 1028 | "metadata": {}, 1029 | "outputs": [ 1030 | { 1031 | "data": { 1032 | "text/plain": [ 1033 | "3023296" 1034 | ] 1035 | }, 1036 | "execution_count": 31, 1037 | "metadata": {}, 1038 | "output_type": "execute_result" 1039 | } 1040 | ], 1041 | "source": [ 1042 | "len(kor_corpus)" 1043 | ] 1044 | }, 1045 | { 1046 | "cell_type": "markdown", 1047 | "metadata": {}, 1048 | "source": [ 1049 | "* '?', '!', '.'만 남은 문자열 필터링" 1050 | ] 1051 | }, 1052 | { 1053 | "cell_type": "code", 1054 | "execution_count": 32, 1055 | "metadata": {}, 1056 | "outputs": [ 1057 | { 1058 | "data": { 1059 | "text/plain": [ 1060 | "3001447" 1061 | ] 1062 | }, 1063 | "execution_count": 32, 1064 | "metadata": {}, 1065 | "output_type": "execute_result" 1066 | } 1067 | ], 1068 | "source": [ 1069 | "new_corpus = list()\n", 1070 | "\n", 1071 | "for sentence in kor_corpus:\n", 1072 | " if sentence == '?' or sentence == '!' 
or sentence == '.':\n", 1073 | " continue\n", 1074 | " \n", 1075 | " else:\n", 1076 | " new_corpus.append(sentence)\n", 1077 | " \n", 1078 | "len(new_corpus)" 1079 | ] 1080 | }, 1081 | { 1082 | "cell_type": "markdown", 1083 | "metadata": {}, 1084 | "source": [ 1085 | "* Update the dataset after filtering out about 20,000 sentences" 1086 | ] 1087 | }, 1088 | { 1089 | "cell_type": "code", 1090 | "execution_count": 33, 1091 | "metadata": {}, 1092 | "outputs": [], 1093 | "source": [ 1094 | "kor_corpus = new_corpus" 1095 | ] 1096 | }, 1097 | { 1098 | "cell_type": "markdown", 1099 | "metadata": {}, 1100 | "source": [ 1101 | "* import **KoNPron** (Korean Number Pronunciation) => converts numbers to their phonetic transcription" 1102 | ] 1103 | }, 1104 | { 1105 | "cell_type": "code", 1106 | "execution_count": 34, 1107 | "metadata": {}, 1108 | "outputs": [], 1109 | "source": [ 1110 | "from konpron import KoNPron\n", 1111 | "\n", 1112 | "kpr = KoNPron()" 1113 | ] 1114 | }, 1115 | { 1116 | "cell_type": "markdown", 1117 | "metadata": {}, 1118 | "source": [ 1119 | "* Test **KoNPron**" 1120 | ] 1121 | }, 1122 | { 1123 | "cell_type": "code", 1124 | "execution_count": 35, 1125 | "metadata": {}, 1126 | "outputs": [ 1127 | { 1128 | "name": "stdout", 1129 | "output_type": "stream", 1130 | "text": [ 1131 | "천이백삼십사랑, 십이 점 삼사랑, 천이백삼십사랑, 백이십삼의 사 승랑, 일의 이백삼십사 승랑, 일 점 이삼의 사 승랑, 십이 점 삼의 사 승랑, 일이삼사랑, 이억 구천삼백삼십만 천삼백 점 오오삼영일의 사 승\n" 1132 | ] 1133 | } 1134 | ], 1135 | "source": [ 1136 | "sentence = \"1234랑, 12.34랑, 1,234랑, 123^4랑, 1²³⁴랑, 1.23^4랑, 12.3⁴랑, 12·34랑, 293301300.55301⁴\"\n", 1137 | "\n", 1138 | "ret = kpr.convert(sentence)\n", 1139 | "if ret is not None:\n", 1140 | " print(ret)" 1141 | ] 1142 | }, 1143 | { 1144 | "cell_type": "markdown", 1145 | "metadata": {}, 1146 | "source": [ 1147 | "* Convert numbers to phonetic transcription with **KoNPron**" 1148 | ] 1149 | }, 1150 | { 1151 | "cell_type": "code", 1152 | "execution_count": 36, 1153 | "metadata": {}, 1154 | "outputs": [], 1155 | "source": [ 1156 | "new_corpus = list()\n", 1157 | "\n", 1158 | "for sentence in kor_corpus:\n", 1159 | " 
try:\n", 1160 | " ret = kpr.convert(sentence)\n", 1161 | " if ret is not None:\n", 1162 | " new_corpus.append(ret)\n", 1163 | " else:\n", 1164 | " new_corpus.append(sentence)\n", 1165 | " except Exception:\n", 1166 | " print(sentence)" 1167 | ] 1168 | }, 1169 | { 1170 | "cell_type": "code", 1171 | "execution_count": 37, 1172 | "metadata": {}, 1173 | "outputs": [], 1174 | "source": [ 1175 | "kor_corpus = new_corpus" 1176 | ] 1177 | }, 1178 | { 1179 | "cell_type": "markdown", 1180 | "metadata": {}, 1181 | "source": [ 1182 | "* Count the sentences made up only of the 2,040 character labels used by the acoustic model" 1183 | ] 1184 | }, 1185 | { 1186 | "cell_type": "code", 1187 | "execution_count": 38, 1188 | "metadata": {}, 1189 | "outputs": [ 1190 | { 1191 | "name": "stdout", 1192 | "output_type": "stream", 1193 | "text": [ 1194 | "2992120\n" 1195 | ] 1196 | } 1197 | ], 1198 | "source": [ 1199 | "acoustic_labels = list(pd.read_csv('train_labels.csv')['char'])\n", 1200 | "count = 0\n", 1201 | "\n", 1202 | "for sentence in kor_corpus:\n", 1203 | " for ch in sentence:\n", 1204 | " if ch not in acoustic_labels:\n", 1205 | " count += 1\n", 1206 | " break\n", 1207 | "\n", 1208 | "print(len(kor_corpus) - count)" 1209 | ] 1210 | }, 1211 | { 1212 | "cell_type": "markdown", 1213 | "metadata": {}, 1214 | "source": [ 1215 | "* Build the final corpus from the 2,992,120 sentences that satisfy this condition" 1216 | ] 1217 | }, 1218 | { 1219 | "cell_type": "code", 1220 | "execution_count": 41, 1221 | "metadata": {}, 1222 | "outputs": [ 1223 | { 1224 | "data": { 1225 | "text/plain": [ 1226 | "2992120" 1227 | ] 1228 | }, 1229 | "execution_count": 41, 1230 | "metadata": {}, 1231 | "output_type": "execute_result" 1232 | } 1233 | ], 1234 | "source": [ 1235 | "final_corpus = list()\n", 1236 | "\n", 1237 | "for sentence in kor_corpus:\n", 1238 | " flag = True\n", 1239 | " for ch in sentence:\n", 1240 | " if ch not in acoustic_labels:\n", 1241 | " flag = False\n", 1242 | " break\n", 1243 | " if flag:\n", 1244 | " final_corpus.append(sentence)\n", 1245 | "\n", 1246 | "len(final_corpus)" 
1247 | ] 1248 | }, 1249 | { 1250 | "cell_type": "code", 1251 | "execution_count": 44, 1252 | "metadata": {}, 1253 | "outputs": [], 1254 | "source": [ 1255 | "import csv\n", 1256 | "\n", 1257 | "def load_label(label_path, encoding='utf-8'):\n", 1258 | " char2id = dict()\n", 1259 | " id2char = dict()\n", 1260 | "\n", 1261 | " with open(label_path, 'r', encoding=encoding) as f:\n", 1262 | " labels = csv.reader(f, delimiter=',')\n", 1263 | " next(labels)\n", 1264 | "\n", 1265 | " for row in labels:\n", 1266 | " char2id[row[1]] = row[0]\n", 1267 | " id2char[int(row[0])] = row[1]\n", 1268 | "\n", 1269 | " return char2id, id2char\n", 1270 | "\n", 1271 | "char2id, id2char = load_label('train_labels.csv')" 1272 | ] 1273 | }, 1274 | { 1275 | "cell_type": "code", 1276 | "execution_count": 49, 1277 | "metadata": {}, 1278 | "outputs": [ 1279 | { 1280 | "name": "stdout", 1281 | "output_type": "stream", 1282 | "text": [ 1283 | "2992120\n", 1284 | "2992120\n" 1285 | ] 1286 | } 1287 | ], 1288 | "source": [ 1289 | "corpus_labels = list()\n", 1290 | "\n", 1291 | "for sentence in final_corpus:\n", 1292 | " label = str()\n", 1293 | " for idx, ch in enumerate(sentence):\n", 1294 | " if idx == 0:\n", 1295 | " label += char2id[ch] + ' '\n", 1296 | " elif idx == len(sentence) - 1:\n", 1297 | " label += char2id[ch]\n", 1298 | " else:\n", 1299 | " label += char2id[ch] + ' '\n", 1300 | " \n", 1301 | " corpus_labels.append(label)\n", 1302 | " \n", 1303 | "print(len(final_corpus))\n", 1304 | "print(len(corpus_labels))" 1305 | ] 1306 | }, 1307 | { 1308 | "cell_type": "code", 1309 | "execution_count": 50, 1310 | "metadata": {}, 1311 | "outputs": [ 1312 | { 1313 | "data": { 1314 | "text/html": [ 1315 | "
\n", 1316 | "\n", 1329 | "\n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | "
     ko    id
0    어디 보자...    8 190 0 42 45 1 1 1
1    칠대 왕국 종족 중에 오크는 처음 듣는다    318 50 0 576 170 0 363 401 0 129 17 0 57 238 4...
2    이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요.    3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22...
3    라니스터 가의    32 20 79 162 0 6 130
4    별 희한한 생각이 다 떠오르곤 하죠    233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5...
\n", 1365 | "
" 1366 | ], 1367 | "text/plain": [ 1368 | " ko \\\n", 1369 | "0 어디 보자... \n", 1370 | "1 칠대 왕국 종족 중에 오크는 처음 듣는다 \n", 1371 | "2 이번에 새로 개시할 제품 계열은 광고 모델을 계약해서 홍보하는 게 좋겠어요. \n", 1372 | "3 라니스터 가의 \n", 1373 | "4 별 희한한 생각이 다 떠오르곤 하죠 \n", 1374 | "\n", 1375 | " id \n", 1376 | "0 8 190 0 42 45 1 1 1 \n", 1377 | "1 318 50 0 576 170 0 363 401 0 129 17 0 57 238 4... \n", 1378 | "2 3 93 17 0 260 38 0 168 47 91 0 43 467 0 142 22... \n", 1379 | "3 32 20 79 162 0 6 130 \n", 1380 | "4 233 0 439 27 27 0 71 100 3 0 15 0 512 57 114 5... " 1381 | ] 1382 | }, 1383 | "execution_count": 50, 1384 | "metadata": {}, 1385 | "output_type": "execute_result" 1386 | } 1387 | ], 1388 | "source": [ 1389 | "corpus_dict = {'ko' : final_corpus,\n", 1390 | " 'id' : corpus_labels}\n", 1391 | "corpus_df = pd.DataFrame(corpus_dict)\n", 1392 | "\n", 1393 | "corpus_df.head()" 1394 | ] 1395 | }, 1396 | { 1397 | "cell_type": "code", 1398 | "execution_count": 53, 1399 | "metadata": {}, 1400 | "outputs": [ 1401 | { 1402 | "data": { 1403 | "text/plain": [ 1404 | "2992120" 1405 | ] 1406 | }, 1407 | "execution_count": 53, 1408 | "metadata": {}, 1409 | "output_type": "execute_result" 1410 | } 1411 | ], 1412 | "source": [ 1413 | "len(corpus_df)" 1414 | ] 1415 | }, 1416 | { 1417 | "cell_type": "markdown", 1418 | "metadata": {}, 1419 | "source": [ 1420 | "* Save the dataset as a pickle" 1421 | ] 1422 | }, 1423 | { 1424 | "cell_type": "code", 1425 | "execution_count": 52, 1426 | "metadata": {}, 1427 | "outputs": [], 1428 | "source": [ 1429 | "import pickle\n", 1430 | "\n", 1431 | "with open('corpus_df.bin', 'wb') as f:\n", 1432 | " pickle.dump(corpus_df, f)" 1433 | ] 1434 | } 1435 | ], 1436 | "metadata": { 1437 | "kernelspec": { 1438 | "display_name": "Python 3", 1439 | "language": "python", 1440 | "name": "python3" 1441 | }, 1442 | "language_info": { 1443 | "codemirror_mode": { 1444 | "name": "ipython", 1445 | "version": 3 1446 | }, 1447 | "file_extension": ".py", 1448 | "mimetype": "text/x-python", 1449 | "name": "python", 1450 | 
"nbconvert_exporter": "python", 1451 | "pygments_lexer": "ipython3", 1452 | "version": "3.7.6" 1453 | } 1454 | }, 1455 | "nbformat": 4, 1456 | "nbformat_minor": 2 1457 | } 1458 | -------------------------------------------------------------------------------- /preprocess/konpron.py: -------------------------------------------------------------------------------- 1 | class KoNPron: 2 | def __init__(self): 3 | self.base_digit = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'] 4 | self.super_digit = ['⁰', '¹', '²', '³', '⁴', '⁵', '⁶', '⁷', '⁸', '⁹'] 5 | self.small_scale = ['', '십', '백', '천'] 6 | self.large_scale = ['', '만 ', '억 ', '조 ', '경 ', '해 '] 7 | self.literal = ['영', '일', '이', '삼', '사', '오', '육', '칠', '팔', '구'] 8 | self.spoken_unit = ['', '하나', '둘', '셋', '넷', '다섯', '여섯', '일곱', '여덟', '아홉'] 9 | self.spoke_tens = ['', '열', '스물', '서른', '마흔', '쉰', '예순', '일흔', '여든', '아흔'] 10 | self.sentence = str() 11 | 12 | def _detect(self, sentence): 13 | self.sentence = sentence 14 | detection_data = list() 15 | tmp = str() 16 | 17 | total_len = len(sentence) 18 | point_count = 0 19 | continuous_count = 0 20 | digit_type = 'vanilla' 21 | 22 | detected = False 23 | zero_started = False 24 | 25 | for idx, char in enumerate(sentence): 26 | if char in self.base_digit: 27 | if not detected: 28 | detected = True 29 | if char == '0': 30 | zero_started = True 31 | tmp += char 32 | continuous_count += 1 33 | if zero_started and continuous_count > 8: 34 | digit_type = 'telephone/none' 35 | if continuous_count > 20: 36 | digit_type = 'enormous/none' 37 | else: 38 | continuous_count = 0 39 | zero_started = False 40 | if char == ',': 41 | if idx + 1 < total_len and idx > 0: 42 | if sentence[idx - 1] in self.base_digit and sentence[idx + 1] in self.base_digit: 43 | tmp += char 44 | elif char == '.': 45 | if idx + 1 < total_len and idx > 0: 46 | if sentence[idx - 1] in self.base_digit and sentence[idx + 1] in self.base_digit: 47 | point_count += 1 48 | if point_count == 1: 49 | digit_type = 
'fraction' 50 | if point_count > 1: 51 | digit_type = 'version' 52 | tmp += char 53 | elif char == '^': 54 | if idx + 1 < total_len and idx > 0: 55 | if sentence[idx - 1] in self.base_digit and sentence[idx + 1] in self.base_digit: 56 | digit_type += '/square' 57 | tmp += char 58 | else: 59 | if digit_type != 'exception/none': 60 | digit_type = 'exception/none' 61 | tmp += char 62 | elif char in self.super_digit: 63 | if idx > 0: 64 | if sentence[idx - 1] in self.base_digit or sentence[idx - 1] in self.super_digit: 65 | if not 'square' in digit_type: 66 | digit_type += '/square' 67 | tmp += char 68 | else: 69 | if digit_type != 'exception/none': 70 | digit_type = 'exception/none' 71 | tmp += char 72 | elif char == '·': 73 | if idx + 1 < total_len and idx > 0: 74 | if sentence[idx - 1] in self.base_digit and sentence[idx + 1] in self.base_digit: 75 | digit_type = 'date' 76 | tmp += char 77 | else: 78 | if detected: 79 | detected = False 80 | if '/' not in digit_type: 81 | digit_type += '/none' 82 | detection_data.append((digit_type, tmp)) 83 | tmp = str() 84 | digit_type = 'vanilla' 85 | point_count = 0 86 | continuous_count = 0 87 | else: 88 | if tmp: 89 | detection_data.append((digit_type, tmp)) 90 | tmp = str() 91 | digit_type = 'vanilla' 92 | point_count = 0 93 | continuous_count = 0 94 | 95 | if detected: 96 | if '/' not in digit_type: 97 | digit_type += '/none' 98 | detection_data.append((digit_type, tmp)) 99 | 100 | elif tmp: 101 | detection_data.append((digit_type, tmp)) 102 | return detection_data 103 | 104 | def _preprocess(self, detection_data): 105 | preprocessed_data = list() 106 | 107 | for digit_type, target in detection_data: 108 | original = target 109 | target_seq = list() 110 | reading_method = list() 111 | target_len = len(target) 112 | 113 | main_type, sub_type = digit_type.split('/') 114 | if main_type == 'exception': 115 | return None 116 | 117 | if main_type == 'version': 118 | splited = target.split('.') 119 | for count, frag in 
enumerate(splited): 120 | target_seq.append(frag) 121 | reading_method.append('individual') 122 | if count < len(splited) - 1: 123 | target_seq.append('.') 124 | reading_method.append('point') 125 | preprocessed_data.append((reading_method, target_seq, original)) 126 | 127 | if main_type == 'date': 128 | target = target.split('·') 129 | for frag in target: 130 | target_seq.append(frag) 131 | reading_method.append('individual') 132 | preprocessed_data.append((reading_method, target_seq, original)) 133 | 134 | if main_type == 'telephone' or main_type == 'enormous': 135 | target = target.replace('0', '_') 136 | target_seq.append(target) 137 | reading_method.append('individual') 138 | preprocessed_data.append((reading_method, target_seq, original)) 139 | 140 | if sub_type != 'none': 141 | if main_type == 'vanilla': 142 | target = [target.replace(',', '')] 143 | if main_type == 'fraction': 144 | target = target.split('.') 145 | 146 | if sub_type == 'square': 147 | if '^' not in target[-1]: 148 | super_part = str() 149 | for idx, digit in enumerate(target[-1]): 150 | if digit not in self.base_digit: 151 | super_idx = idx 152 | break 153 | super_num = list(target[-1][idx:]) 154 | for idx in range(len(super_num)): 155 | super_num[idx] = str(self.super_digit.index(super_num[idx])) 156 | super_num = ''.join(super_num) 157 | super_part += target[-1][:super_idx] + '^' + super_num 158 | if main_type == 'vanilla': 159 | target = [super_part] 160 | else: 161 | target = target[:1] + [super_part] 162 | if main_type == 'fraction': 163 | target_seq.append(target[0]) 164 | reading_method.append('literal') 165 | target_seq.append('.') 166 | reading_method.append('point') 167 | 168 | tmp_len = len(target_seq) 169 | tmp = target[-1].split('^') 170 | for seq in tmp: 171 | target_seq.append(seq) 172 | if 'point' in reading_method: 173 | if 'of' in reading_method: 174 | reading_method.append('literal') 175 | else: 176 | reading_method.append('individual') 177 | else: 178 | 
reading_method.append('literal') 179 | if len(target_seq) == tmp_len + 1: 180 | target_seq.append("^") 181 | reading_method.append("of") 182 | elif len(target_seq) == tmp_len + 3: 183 | target_seq.append("^") 184 | reading_method.append("super") 185 | preprocessed_data.append((reading_method, target_seq, original)) 186 | 187 | else: 188 | if main_type == 'vanilla': 189 | target = target.replace(',', '') 190 | target_seq.append(target) 191 | reading_method.append('literal') 192 | preprocessed_data.append((reading_method, target_seq, original)) 193 | 194 | if main_type == 'fraction': 195 | target = target.split('.') 196 | for frag in target: 197 | target_seq.append(frag) 198 | if 'point' in reading_method: 199 | reading_method.append('individual') 200 | else: 201 | reading_method.append('literal') 202 | if len(target_seq) == 1: 203 | target_seq.append(".") 204 | reading_method.append('point') 205 | 206 | preprocessed_data.append((reading_method, target_seq, original)) 207 | 208 | return preprocessed_data 209 | 210 | def _read(self, preprocessed_data, mode='informal'): 211 | def literal_read(self, frag, mode='informal'): 212 | korean = str() 213 | tmp = str() 214 | length = len(frag) 215 | for idx, digit in enumerate(frag): 216 | digit = int(digit) 217 | inversed_idx = length - idx - 1 218 | if mode == 'formal': 219 | if inversed_idx % 4: 220 | if digit: 221 | tmp += self.literal[digit] 222 | tmp += self.small_scale[inversed_idx % 4] 223 | else: 224 | if digit or length == 1: 225 | tmp += self.literal[digit] 226 | if tmp: 227 | tmp += self.large_scale[inversed_idx // 4] 228 | korean += tmp 229 | tmp = str() 230 | 231 | elif mode == 'informal': 232 | if inversed_idx % 4: 233 | if digit > 1: 234 | tmp += self.literal[digit] 235 | if digit: 236 | tmp += self.small_scale[inversed_idx % 4] 237 | else: 238 | if digit or length == 1: 239 | tmp += self.literal[digit] 240 | if tmp: 241 | tmp += self.large_scale[inversed_idx // 4] 242 | if length == 5 and digit == 1 and 
inversed_idx == 4: 243 | tmp = tmp[1:] 244 | korean += tmp 245 | tmp = str() 246 | korean += tmp 247 | return korean 248 | 249 | def individual_read(self, frag): 250 | korean = str() 251 | for digit in frag: 252 | if digit == '_': 253 | korean += '공' 254 | else: 255 | korean += self.literal[int(digit)] 256 | return korean 257 | 258 | result = self.sentence 259 | if preprocessed_data is None: 260 | return None 261 | 262 | for each in preprocessed_data: 263 | reading_method, target_seq, original = each 264 | readed = str() 265 | for idx, frag in enumerate(target_seq): 266 | if reading_method[idx] == 'literal': 267 | readed += literal_read(self, frag) 268 | if reading_method[idx] == 'individual': 269 | readed += individual_read(self, frag) 270 | if reading_method[idx] == 'point': 271 | readed += ' 점 ' 272 | if reading_method[idx] == 'of': 273 | readed += '의 ' 274 | if reading_method[idx] == 'super': 275 | readed += ' 승' 276 | result = result.replace(original, readed, 1) 277 | return result 278 | 279 | def convert(self, sentence): 280 | return self._read(self._preprocess(self._detect(sentence))) -------------------------------------------------------------------------------- /preprocess/special.csv: -------------------------------------------------------------------------------- 1 | special,replace 2 | , 3 | .,. 4 | “, 5 | """", 6 | -, 7 | ",", 8 | ·,. 9 | ”, 10 | ㎏,킬로그램 11 | ㎞,킬로미터 12 | ?,? 13 | ‘, 14 | ’, 15 | ;, 16 | %,퍼센트 17 | !,! 18 | ㎓,기가헤르츠 19 | ', 20 | ♪, 21 | ∼, 22 | [, 23 | ], 24 | &, 25 | ~, 26 | /, 27 | <, 28 | >, 29 | 「, 30 | 」, 31 | $,달러 32 | :, 33 | _, 34 | 《, 35 | 》, 36 | ㈜, 37 | Ⅱ, 38 | ㎡,제곱미터 39 | ¾,사 분의 삼 40 | , 41 | …, 42 | ㎜,밀리미터 43 | =,는 44 | #,샾 45 | 『, 46 | 』, 47 | ‧, 48 | ♬, 49 | ㏊, 50 | @,앳 51 | ℉,화씨 52 | ㎥,세제곱미터 53 | Ⅲ, 54 | →, 55 | 〈, 56 | 〉, 57 | ㎝,센티미터 58 | {, 59 | \,원 60 | }, 61 | ¥, 62 | ㎾,킬로와트 63 | Ⅰ, 64 | ․,. 
65 | ㈔, 66 | ㎖,밀리리터 67 | ○, 68 | ㎍,마이크로그램 69 | ​, 70 | ㎿,메가와트 71 | ℃,도 72 | ㎳,밀리세컨드 73 | +,플러스 74 | ™, 75 | <, 76 | >, 77 | *, 78 | ㎢,제곱킬로미터 79 | ⅓,삼 분의 일 80 | &, 81 | ^, 82 | ㎚,나노미터 83 | ‭, 84 | √, 85 | ㎎,밀리그램 86 | 〮, 87 | ♡, 88 | ∙, 89 | ㎛,마이크로미터 90 | °,도 91 | –, 92 | , 93 | Ⅹ, 94 | ″, 95 | ±,플러스 마이너스 96 | ㎘,킬로리터 97 | ≪, 98 | ≫, 99 | £, 100 | ×, 101 | �, 102 | ㎉,킬로칼로리 103 | ★, 104 | □, 105 | ˚, 106 | ㎧,미터 퍼 세크 107 | ㎈,칼로리 108 | •, 109 | ′, 110 | ¡, 111 | 〓, 112 | 【, 113 | 】, 114 | 〔, 115 | 〕, 116 | ・, 117 | ˙, 118 | ㏈,데시벨 119 | Ⅵ, 120 | Ⅳ, 121 | ?,? 122 | ⅔,삼 분의 이 123 | 、, 124 | 〜, 125 | ♥, 126 | `, 127 | ®, 128 | 〇, 129 | ㎠,제곱센티미터 130 | ㏄,씨씨 131 | ▒, 132 | ‬, 133 | —, 134 | 「, 135 | 」, 136 | !,! 137 | ―, 138 | ♩, 139 | ⓛ, 140 | ㎒,메가헤르츠 141 | ▶, 142 | :, 143 | ˜, 144 | ´, 145 | ㎽,밀리와트 146 | , 147 | ~, 148 | ┃, 149 | ·, 150 | ⼐, 151 | ㎃,밀리암페어 152 | %,퍼센트 153 | ◆, 154 | -, 155 | ⇒, 156 | ‰,퍼밀 157 | (, 158 | ), 159 | ㍱,헥토파스칼 160 | ━, 161 | ㎐,헤르츠 162 | ≡, 163 | ¿, 164 | 。, 165 | €, 166 | ¶, 167 | ・, 168 | ㎔,테라헤르츠 169 | ※, 170 | Ⅸ, 171 | ⑯, 172 | –, 173 | ,, 174 | ↗, 175 | ↘, 176 | *, 177 | ㎸,킬로볼트 178 | ㉦, 179 | ½,이 분의 일 180 | ⟪, 181 | ⟫, 182 | ƒ, 183 | …, 184 | Ⅶ, 185 | ., 186 | §, 187 | ㎨,미터 퍼 세크 제곱 188 | △, 189 | ☆, 190 | ÷,나누기 191 | ∪, 192 | ㎩,파스칼 193 | , 194 | 〃, 195 | ㎫,메가파스칼 196 | £, 197 | ㎗, 198 | ■, 199 | , 200 | , 201 | ‵, 202 | ⓧ, 203 | ­, 204 | ¢, 205 | ♫, 206 | €, 207 | ㎟,제곱밀리미터 208 | ➡, 209 | ⃗, 210 | ㎑,킬로헤르츠 211 | ​, 212 | -------------------------------------------------------------------------------- /preprocess/train_labels.csv: -------------------------------------------------------------------------------- 1 | id,char,freq 2 | 0, ,5774462 3 | 1,.,640924 4 | 2,그,556373 5 | 3,이,509291 6 | 4,는,374559 7 | 5,아,370444 8 | 6,가,369698 9 | 7,고,356378 10 | 8,어,333842 11 | 9,거,306987 12 | 10,지,276453 13 | 11,데,249269 14 | 12,?,235024 15 | 13,나,229646 16 | 14,하,226136 17 | 15,다,221216 18 | 16,서,211193 19 | 17,에,204330 20 | 18,도,190561 21 | 19,게,177140 22 | 
20,니,173284 23 | 21,",",152938 24 | 22,기,149467 25 | 23,은,144674 26 | 24,면,142025 27 | 25,야,137553 28 | 26,있,133155 29 | 27,한,121564 30 | 28,을,121048 31 | 29,까,119483 32 | 30,해,115148 33 | 31,리,111855 34 | 32,라,111479 35 | 33,래,105784 36 | 34,사,100533 37 | 35,근,99781 38 | 36,들,99447 39 | 37,안,97043 40 | 38,로,91847 41 | 39,일,88319 42 | 40,뭐,87328 43 | 41,내,85968 44 | 42,보,82911 45 | 43,제,80874 46 | 44,같,79626 47 | 45,자,76298 48 | 46,만,76093 49 | 47,시,72836 50 | 48,런,70919 51 | 49,너,69192 52 | 50,대,68756 53 | 51,때,67179 54 | 52,되,66237 55 | 53,으,66106 56 | 54,진,62831 57 | 55,를,61802 58 | 56,잖,61455 59 | 57,오,60782 60 | 58,러,60629 61 | 59,인,60234 62 | 60,막,59994 63 | 61,무,58705 64 | 62,었,58385 65 | 63,구,57294 66 | 64,했,57209 67 | 65,수,56787 68 | 66,간,55275 69 | 67,애,54476 70 | 68,우,53539 71 | 69,요,53234 72 | 70,마,53125 73 | 71,생,52815 74 | 72,렇,50798 75 | 73,냥,49989 76 | 74,짜,49581 77 | 75,주,48969 78 | 76,없,48392 79 | 77,말,47929 80 | 78,학,46285 81 | 79,스,46225 82 | 80,더,44487 83 | 81,많,43607 84 | 82,원,41379 85 | 83,음,41348 86 | 84,정,39775 87 | 85,겠,39691 88 | 86,여,39203 89 | 87,먹,39194 90 | 88,금,38720 91 | 89,든,38476 92 | 90,부,38398 93 | 91,할,38262 94 | 92,전,36575 95 | 93,번,36375 96 | 94,좋,36363 97 | 95,랑,36081 98 | 96,네,35514 99 | 97,람,33799 100 | 98,약,33412 101 | 99,건,33371 102 | 100,각,32167 103 | 101,좀,31738 104 | 102,알,30893 105 | 103,잘,30132 106 | 104,걸,29634 107 | 105,모,29629 108 | 106,것,28482 109 | 107,상,28247 110 | 108,려,28218 111 | 109,장,27856 112 | 110,히,27705 113 | 111,않,27305 114 | 112,맞,27202 115 | 113,던,27082 116 | 114,르,26286 117 | 115,교,26116 118 | 116,바,25994 119 | 117,냐,25742 120 | 118,드,25702 121 | 119,십,25654 122 | 120,날,25556 123 | 121,치,25287 124 | 122,비,25278 125 | 123,단,25129 126 | 124,동,25047 127 | 125,또,24720 128 | 126,못,24528 129 | 127,저,24074 130 | 128,얘,23990 131 | 129,중,23851 132 | 130,의,23607 133 | 131,난,23318 134 | 132,엄,23057 135 | 133,봤,22930 136 | 134,걔,22732 137 | 135,화,22593 138 | 136,응,22254 139 | 137,싶,21756 140 | 138,갔,21628 141 
| 139,았,21052 142 | 140,집,20850 143 | 141,왜,20801 144 | 142,계,20757 145 | 143,공,20620 146 | 144,긴,20547 147 | 145,신,20371 148 | 146,적,20244 149 | 147,연,20225 150 | 148,직,20061 151 | 149,실,19467 152 | 150,영,19454 153 | 151,미,19444 154 | 152,봐,18931 155 | 153,분,18893 156 | 154,테,18829 157 | 155,년,18669 158 | 156,트,18654 159 | 157,문,18230 160 | 158,와,18114 161 | 159,돼,18028 162 | 160,물,17889 163 | 161,예,17864 164 | 162,터,17722 165 | 163,세,17719 166 | 164,럼,17521 167 | 165,청,17479 168 | 166,차,17455 169 | 167,친,17355 170 | 168,개,17355 171 | 169,삼,17242 172 | 170,국,17224 173 | 171,두,17129 174 | 172,소,17125 175 | 173,살,16893 176 | 174,재,16635 177 | 175,운,15949 178 | 176,쫌,15780 179 | 177,유,15516 180 | 178,속,15255 181 | 179,명,15190 182 | 180,랬,15155 183 | 181,본,15148 184 | 182,갈,15084 185 | 183,방,15069 186 | 184,돈,14998 187 | 185,타,14919 188 | 186,처,14908 189 | 187,빠,14851 190 | 188,과,14843 191 | 189,식,14743 192 | 190,디,14633 193 | 191,배,14403 194 | 192,피,14093 195 | 193,뭔,14093 196 | 194,선,13929 197 | 195,남,13909 198 | 196,경,13868 199 | 197,달,13787 200 | 198,언,13579 201 | 199,받,13519 202 | 200,심,13409 203 | 201,월,13367 204 | 202,버,13339 205 | 203,왔,13265 206 | 204,느,13223 207 | 205,점,12960 208 | 206,올,12923 209 | 207,업,12861 210 | 208,른,12801 211 | 209,성,12717 212 | 210,회,12591 213 | 211,조,12570 214 | 212,워,12424 215 | 213,따,12410 216 | 214,행,12350 217 | 215,반,12158 218 | 216,님,11998 219 | 217,딱,11908 220 | 218,관,11828 221 | 219,입,11275 222 | 220,카,11235 223 | 221,당,11068 224 | 222,였,10977 225 | 223,케,10576 226 | 224,쪽,10430 227 | 225,천,10384 228 | 226,작,10381 229 | 227,누,10336 230 | 228,열,10252 231 | 229,얼,10250 232 | 230,울,10246 233 | 231,찮,10231 234 | 232,된,10191 235 | 233,별,10159 236 | 234,떻,10108 237 | 235,머,9876 238 | 236,쓰,9853 239 | 237,위,9841 240 | 238,크,9838 241 | 239,노,9799 242 | 240,괜,9735 243 | 241,강,9698 244 | 242,편,9668 245 | 243,몰,9623 246 | 244,맛,9382 247 | 245,준,9342 248 | 246,줄,9294 249 | 247,파,9282 250 | 248,백,9252 251 | 249,매,9181 252 | 250,산,9160 253 
| 251,술,9142 254 | 252,힘,9056 255 | 253,프,9019 256 | 254,즘,8997 257 | 255,임,8969 258 | 256,체,8888 259 | 257,형,8790 260 | 258,몇,8742 261 | 259,맨,8712 262 | 260,새,8623 263 | 261,용,8571 264 | 262,키,8547 265 | 263,통,8410 266 | 264,양,8363 267 | 265,끝,8361 268 | 266,싸,8328 269 | 267,볼,8188 270 | 268,혼,8179 271 | 269,온,8132 272 | 270,등,8123 273 | 271,길,8067 274 | 272,될,8033 275 | 273,밌,7998 276 | 274,육,7924 277 | 275,늘,7909 278 | 276,슨,7835 279 | 277,됐,7738 280 | 278,놀,7707 281 | 279,외,7608 282 | 280,팔,7601 283 | 281,져,7551 284 | 282,레,7485 285 | 283,억,7461 286 | 284,발,7450 287 | 285,결,7412 288 | 286,초,7290 289 | 287,감,7180 290 | 288,군,7174 291 | 289,호,7173 292 | 290,름,7146 293 | 291,솔,7079 294 | 292,닌,7051 295 | 293,밖,7013 296 | 294,불,7007 297 | 295,밥,6784 298 | 296,포,6676 299 | 297,싫,6631 300 | 298,완,6582 301 | 299,갖,6511 302 | 300,겨,6468 303 | 301,질,6453 304 | 302,토,6448 305 | 303,험,6417 306 | 304,색,6371 307 | 305,떤,6352 308 | 306,역,6340 309 | 307,티,6319 310 | 308,갑,6316 311 | 309,목,6262 312 | 310,린,6256 313 | 311,추,6204 314 | 312,격,6174 315 | 313,후,6119 316 | 314,확,6095 317 | 315,루,6079 318 | 316,민,6024 319 | 317,끼,6023 320 | 318,칠,6019 321 | 319,돌,5997 322 | 320,찍,5956 323 | 321,쪼,5946 324 | 322,깐,5788 325 | 323,필,5786 326 | 324,빨,5693 327 | 325,났,5657 328 | 326,락,5561 329 | 327,박,5537 330 | 328,끔,5483 331 | 329,낌,5403 332 | 330,럴,5397 333 | 331,취,5344 334 | 332,복,5315 335 | 333,둘,5264 336 | 334,페,5217 337 | 335,렸,5198 338 | 336,써,5197 339 | 337,줘,5173 340 | 338,급,5067 341 | 339,력,5065 342 | 340,잡,5030 343 | 341,씩,5006 344 | 342,찾,4990 345 | 343,놓,4987 346 | 344,최,4894 347 | 345,코,4891 348 | 346,넘,4870 349 | 347,졌,4803 350 | 348,섯,4799 351 | 349,브,4793 352 | 350,현,4764 353 | 351,눈,4760 354 | 352,항,4751 355 | 353,귀,4708 356 | 354,설,4688 357 | 355,벌,4666 358 | 356,담,4647 359 | 357,앞,4640 360 | 358,책,4630 361 | 359,절,4629 362 | 360,플,4523 363 | 361,폰,4513 364 | 362,태,4496 365 | 363,종,4487 366 | 364,옛,4450 367 | 365,증,4413 368 | 366,튼,4411 369 | 367,글,4408 370 | 
368,습,4383 371 | 369,병,4377 372 | 370,론,4373 373 | 371,출,4364 374 | 372,능,4354 375 | 373,침,4345 376 | 374,순,4339 377 | 375,줬,4308 378 | 376,평,4303 379 | 377,메,4287 380 | 378,똑,4281 381 | 379,커,4261 382 | 380,엔,4248 383 | 381,꾸,4230 384 | 382,란,4194 385 | 383,듣,4083 386 | 384,씨,4009 387 | 385,큰,4002 388 | 386,표,3995 389 | 387,잠,3942 390 | 388,먼,3942 391 | 389,쁘,3840 392 | 390,활,3820 393 | 391,합,3787 394 | 392,접,3732 395 | 393,럽,3722 396 | 394,옷,3705 397 | 395,쳐,3690 398 | 396,손,3689 399 | 397,붙,3645 400 | 398,망,3640 401 | 399,죽,3609 402 | 400,투,3606 403 | 401,족,3603 404 | 402,셨,3589 405 | 403,참,3572 406 | 404,떨,3567 407 | 405,웃,3533 408 | 406,졸,3516 409 | 407,쉬,3492 410 | 408,뭘,3447 411 | 409,변,3406 412 | 410,릴,3374 413 | 411,웠,3293 414 | 412,홍,3267 415 | 413,즈,3265 416 | 414,랐,3245 417 | 415,독,3243 418 | 416,충,3239 419 | 417,짝,3217 420 | 418,떡,3197 421 | 419,뒤,3195 422 | 420,휴,3161 423 | 421,셔,3142 424 | 422,넣,3135 425 | 423,쨌,3075 426 | 424,악,3073 427 | 425,패,3049 428 | 426,빼,3041 429 | 427,슬,2983 430 | 428,특,2975 431 | 429,꺼,2970 432 | 430,숙,2951 433 | 431,쯤,2934 434 | 432,텐,2905 435 | 433,창,2901 436 | 434,겼,2888 437 | 435,굴,2869 438 | 436,판,2863 439 | 437,죠,2851 440 | 438,답,2820 441 | 439,희,2816 442 | 440,허,2815 443 | 441,옆,2798 444 | 442,료,2791 445 | 443,닐,2790 446 | 444,택,2769 447 | 445,림,2760 448 | 446,읽,2742 449 | 447,핸,2733 450 | 448,축,2730 451 | 449,풀,2716 452 | 450,틀,2712 453 | 451,몸,2694 454 | 452,골,2690 455 | 453,황,2635 456 | 454,켜,2635 457 | 455,익,2625 458 | 456,베,2613 459 | 457,북,2600 460 | 458,법,2578 461 | 459,늦,2578 462 | 460,함,2568 463 | 461,랜,2555 464 | 462,꼬,2555 465 | 463,향,2547 466 | 464,석,2541 467 | 465,환,2533 468 | 466,슷,2529 469 | 467,품,2518 470 | 468,혀,2513 471 | 469,블,2512 472 | 470,쓸,2503 473 | 471,채,2472 474 | 472,며,2470 475 | 473,욕,2463 476 | 474,권,2450 477 | 475,검,2445 478 | 476,굳,2428 479 | 477,록,2425 480 | 478,톡,2408 481 | 479,김,2408 482 | 480,넌,2383 483 | 481,깨,2375 484 | 482,션,2374 485 | 483,캐,2369 486 | 484,송,2339 487 | 485,녀,2336 
488 | 486,탈,2327 489 | 487,광,2321 490 | 488,혹,2313 491 | 489,퍼,2300 492 | 490,뽑,2278 493 | 491,철,2265 494 | 492,째,2249 495 | 493,움,2232 496 | 494,밤,2231 497 | 495,꼭,2226 498 | 496,샀,2224 499 | 497,끊,2212 500 | 498,땐,2203 501 | 499,깔,2179 502 | 500,멀,2145 503 | 501,높,2141 504 | 502,께,2140 505 | 503,큼,2121 506 | 504,녁,2104 507 | 505,곳,2082 508 | 506,잔,2070 509 | 507,쉽,2070 510 | 508,짐,2067 511 | 509,암,2063 512 | 510,극,2061 513 | 511,련,2056 514 | 512,떠,2056 515 | 513,벽,2049 516 | 514,헤,2047 517 | 515,C,2040 518 | 516,끄,2024 519 | 517,곱,2015 520 | 518,승,2011 521 | 519,봉,2009 522 | 520,착,2006 523 | 521,촌,1990 524 | 522,껴,1986 525 | 523,딩,1983 526 | 524,류,1978 527 | 525,뜨,1970 528 | 526,넷,1941 529 | 527,놨,1922 530 | 528,궁,1894 531 | 529,논,1882 532 | 530,곤,1875 533 | 531,클,1869 534 | 532,싼,1859 535 | 533,앉,1854 536 | 534,컴,1849 537 | 535,맥,1841 538 | 536,팀,1830 539 | 537,썼,1818 540 | 538,낫,1801 541 | 539,튜,1788 542 | 540,걱,1786 543 | 541,쁜,1770 544 | 542,킨,1760 545 | 543,빌,1752 546 | 544,쿠,1748 547 | 545,찌,1738 548 | 546,쌤,1719 549 | 547,T,1715 550 | 548,밀,1711 551 | 549,빵,1702 552 | 550,냈,1702 553 | 551,센,1691 554 | 552,딴,1688 555 | 553,쩌,1678 556 | 554,딸,1678 557 | 555,걍,1596 558 | 556,획,1588 559 | 557,씬,1582 560 | 558,챙,1541 561 | 559,첫,1536 562 | 560,범,1530 563 | 561,핑,1519 564 | 562,굉,1519 565 | 563,쩔,1514 566 | 564,팅,1507 567 | 565,긍,1486 568 | 566,탄,1471 569 | 567,덟,1470 570 | 568,퇴,1469 571 | 569,뛰,1469 572 | 570,층,1467 573 | 571,춰,1454 574 | 572,훨,1447 575 | 573,찬,1439 576 | 574,듯,1424 577 | 575,S,1396 578 | 576,왕,1392 579 | 577,텔,1385 580 | 578,뉴,1382 581 | 579,렌,1377 582 | 580,탕,1374 583 | 581,짓,1371 584 | 582,밑,1365 585 | 583,헬,1358 586 | 584,존,1339 587 | 585,립,1323 588 | 586,녔,1318 589 | 587,꼈,1305 590 | 588,빡,1304 591 | 589,낮,1287 592 | 590,견,1282 593 | 591,링,1281 594 | 592,볶,1271 595 | 593,낙,1271 596 | 594,릭,1267 597 | 595,젠,1263 598 | 596,퓨,1262 599 | 597,츠,1256 600 | 598,맘,1252 601 | 599,놔,1247 602 | 600,렵,1241 603 | 601,땜,1232 604 | 602,쇼,1224 605 | 
603,값,1215 606 | 604,닭,1203 607 | 605,깝,1200 608 | 606,픈,1194 609 | 607,탁,1183 610 | 608,쓴,1179 611 | 609,농,1172 612 | 610,량,1166 613 | 611,염,1156 614 | 612,홉,1144 615 | 613,척,1130 616 | 614,겁,1129 617 | 615,콘,1127 618 | 616,섭,1125 619 | 617,냄,1125 620 | 618,P,1125 621 | 619,효,1124 622 | 620,규,1124 623 | 621,꿈,1121 624 | 622,곡,1093 625 | 623,액,1090 626 | 624,쎄,1077 627 | 625,덜,1075 628 | 626,턴,1065 629 | 627,킹,1061 630 | 628,훈,1057 631 | 629,쳤,1054 632 | 630,널,1047 633 | 631,멋,1037 634 | 632,꿀,1034 635 | 633,깜,1019 636 | 634,짧,1014 637 | 635,롤,1013 638 | 636,낼,1012 639 | 637,꽤,1003 640 | 638,총,984 641 | 639,램,984 642 | 640,덕,980 643 | 641,믄,974 644 | 642,믿,972 645 | 643,흥,970 646 | 644,롱,967 647 | 645,뜻,962 648 | 646,짤,958 649 | 647,쌍,957 650 | 648,컨,953 651 | 649,셋,952 652 | 650,잤,950 653 | 651,닥,950 654 | 652,웬,946 655 | 653,엽,944 656 | 654,혜,939 657 | 655,찰,935 658 | 656,뻐,935 659 | 657,뿌,934 660 | 658,빈,934 661 | 659,꿔,934 662 | 660,낸,932 663 | 661,뻔,928 664 | 662,쌓,926 665 | 663,즐,919 666 | 664,튀,914 667 | 665,겹,909 668 | 666,득,899 669 | 667,끌,896 670 | 668,M,880 671 | 669,V,877 672 | 670,녹,876 673 | 671,푸,870 674 | 672,쭉,863 675 | 673,싱,858 676 | 674,팬,857 677 | 675,A,854 678 | 676,!,841 679 | 677,념,836 680 | 678,맡,825 681 | 679,쟁,814 682 | 680,엑,810 683 | 681,켓,809 684 | 682,뀌,808 685 | 683,털,803 686 | 684,풍,802 687 | 685,웨,799 688 | 686,땡,792 689 | 687,롯,791 690 | 688,롭,788 691 | 689,젊,781 692 | 690,넓,778 693 | 691,멘,777 694 | 692,냉,772 695 | 693,칼,771 696 | 694,잉,768 697 | 695,빙,768 698 | 696,뿐,767 699 | 697,옮,761 700 | 698,젤,760 701 | 699,B,757 702 | 700,죄,756 703 | 701,탔,752 704 | 702,샤,746 705 | 703,홀,745 706 | 704,떼,743 707 | 705,줌,738 708 | 706,징,734 709 | 707,폭,727 710 | 708,G,721 711 | 709,킬,713 712 | 710,흔,712 713 | 711,딜,711 714 | 712,슈,703 715 | 713,율,700 716 | 714,즌,697 717 | 715,씀,694 718 | 716,앙,689 719 | 717,눠,688 720 | 718,콩,686 721 | 719,얻,684 722 | 720,숨,682 723 | 721,닝,673 724 | 722,꽃,668 725 | 723,쌀,667 726 | 724,컬,666 727 | 725,춤,666 
728 | 726,c,664 729 | 727,뚫,663 730 | 728,엠,661 731 | 729,몬,659 732 | 730,D,658 733 | 731,흐,656 734 | 732,앤,646 735 | 733,똥,645 736 | 734,콜,644 737 | 735,델,635 738 | 736,렀,628 739 | 737,폐,627 740 | 738,엘,624 741 | 739,쁠,623 742 | 740,랄,622 743 | 741,걘,621 744 | 742,벤,619 745 | 743,봄,611 746 | 744,왠,609 747 | 745,씻,609 748 | 746,률,608 749 | 747,켰,600 750 | 748,짱,600 751 | 749,웹,599 752 | 750,압,599 753 | 751,럭,596 754 | 752,땅,596 755 | 753,멍,595 756 | 754,랩,595 757 | 755,댓,595 758 | 756,깊,595 759 | 757,뮤,592 760 | 758,령,590 761 | 759,릿,589 762 | 760,낀,589 763 | 761,윤,586 764 | 762,옥,584 765 | 763,룸,582 766 | 764,딘,579 767 | 765,객,578 768 | 766,댄,576 769 | 767,컵,574 770 | 768,폴,573 771 | 769,쟤,570 772 | 770,뷰,569 773 | 771,템,568 774 | 772,덴,567 775 | 773,눌,559 776 | 774,캠,558 777 | 775,홈,557 778 | 776,삶,557 779 | 777,삭,555 780 | 778,벨,555 781 | 779,엉,552 782 | 780,헐,549 783 | 781,벅,545 784 | 782,벗,544 785 | 783,혈,543 786 | 784,밍,539 787 | 785,셀,536 788 | 786,낭,534 789 | 787,춥,533 790 | 788,릉,533 791 | 789,t,533 792 | 790,잃,532 793 | 791,I,529 794 | 792,놈,526 795 | 793,춘,524 796 | 794,찜,520 797 | 795,R,519 798 | 796,걷,518 799 | 797,삐,515 800 | 798,헌,510 801 | 799,딨,510 802 | 800,빛,504 803 | 801,흘,503 804 | 802,닫,502 805 | 803,균,502 806 | 804,p,495 807 | 805,L,494 808 | 806,좌,492 809 | 807,껄,491 810 | 808,펜,489 811 | 809,N,487 812 | 810,싹,486 813 | 811,탑,485 814 | 812,쏘,483 815 | 813,O,482 816 | 814,픽,480 817 | 815,덩,477 818 | 816,햄,476 819 | 817,큐,473 820 | 818,힐,472 821 | 819,곧,471 822 | 820,낳,470 823 | 821,힌,468 824 | 822,팩,468 825 | 823,뒷,468 826 | 824,툰,467 827 | 825,섬,465 828 | 826,꽂,463 829 | 827,례,462 830 | 828,핫,460 831 | 829,섞,460 832 | 830,촬,458 833 | 831,흰,457 834 | 832,둥,455 835 | 833,K,450 836 | 834,괴,449 837 | 835,s,448 838 | 836,핀,446 839 | 837,꿨,444 840 | 838,틱,441 841 | 839,밝,441 842 | 840,랙,440 843 | 841,땠,440 844 | 842,둔,440 845 | 843,슴,439 846 | 844,첨,438 847 | 845,밴,432 848 | 846,렁,431 849 | 847,칭,429 850 | 848,묻,428 851 | 849,뜬,425 852 | 850,깎,424 
853 | 851,엇,423 854 | 852,컸,421 855 | 853,퀴,420 856 | 854,납,418 857 | 855,협,417 858 | 856,몽,416 859 | 857,꼐,415 860 | 858,떴,414 861 | 859,썰,410 862 | 860,찐,407 863 | 861,꼴,407 864 | 862,갠,406 865 | 863,턱,405 866 | 864,틴,398 867 | 865,낄,398 868 | 866,뒀,397 869 | 867,끗,396 870 | 868,꼼,395 871 | 869,F,395 872 | 870,샵,394 873 | 871,휘,392 874 | 872,뼈,390 875 | 873,뚜,389 876 | 874,쩍,388 877 | 875,팡,386 878 | 876,멜,386 879 | 877,톤,385 880 | 878,앨,385 881 | 879,탐,384 882 | 880,칸,384 883 | 881,끓,383 884 | 882,뚱,381 885 | 883,닮,378 886 | 884,깃,375 887 | 885,짬,374 888 | 886,빤,371 889 | 887,측,370 890 | 888,혔,369 891 | 889,꽁,369 892 | 890,펴,368 893 | 891,앴,368 894 | 892,겸,368 895 | 893,쿨,367 896 | 894,릇,363 897 | 895,얀,362 898 | 896,쿄,358 899 | 897,컷,358 900 | 898,팠,356 901 | 899,끈,355 902 | 900,렴,354 903 | 901,잊,352 904 | 902,덤,350 905 | 903,갤,342 906 | 904,븐,340 907 | 905,흡,337 908 | 906,덮,337 909 | 907,씹,335 910 | 908,뽀,335 911 | 909,뚝,335 912 | 910,갚,335 913 | 911,찔,334 914 | 912,댔,333 915 | 913,혁,332 916 | 914,띠,328 917 | 915,벼,327 918 | 916,얇,324 919 | 917,뺐,324 920 | 918,팝,323 921 | 919,잇,322 922 | 920,왼,322 923 | 921,낚,321 924 | 922,칙,316 925 | 923,겉,316 926 | 924,뜯,313 927 | 925,닦,312 928 | 926,짠,311 929 | 927,썹,310 930 | 928,뷔,310 931 | 929,묶,310 932 | 930,꾼,306 933 | 931,빅,305 934 | 932,땄,305 935 | 933,캡,304 936 | 934,묘,304 937 | 935,샘,303 938 | 936,묵,303 939 | 937,a,302 940 | 938,쭈,300 941 | 939,b,300 942 | 940,겪,299 943 | 941,둬,298 944 | 942,J,298 945 | 943,쫄,296 946 | 944,랫,296 947 | 945,뀐,296 948 | 946,흑,295 949 | 947,댕,295 950 | 948,꽉,295 951 | 949,곰,294 952 | 950,붕,293 953 | 951,땀,292 954 | 952,릎,290 955 | 953,뽕,289 956 | 954,쥐,288 957 | 955,렉,287 958 | 956,숭,283 959 | 957,샐,283 960 | 958,v,282 961 | 959,렛,281 962 | 960,녕,281 963 | 961,힙,280 964 | 962,쫙,279 965 | 963,촉,278 966 | 964,쩜,277 967 | 965,긋,277 968 | 966,샌,276 969 | 967,o,275 970 | 968,쫓,273 971 | 969,쩐,273 972 | 970,헷,272 973 | 971,X,268 974 | 972,웅,267 975 | 973,뺏,267 976 | 974,쵸,266 977 | 975,쪘,266 
978 | 976,랍,266 979 | 977,E,266 980 | 978,좁,265 981 | 979,앱,265 982 | 980,썸,264 983 | 981,냅,264 984 | 982,펙,263 985 | 983,늙,263 986 | 984,껌,261 987 | 985,n,261 988 | 986,e,261 989 | 987,랭,260 990 | 988,귤,260 991 | 989,찢,259 992 | 990,닿,259 993 | 991,띄,258 994 | 992,긁,255 995 | 993,귄,253 996 | 994,굽,253 997 | 995,갓,253 998 | 996,캔,252 999 | 997,멈,252 1000 | 998,욱,250 1001 | 999,뺄,250 1002 | 1000,뇌,250 1003 | 1001,팟,249 1004 | 1002,쌌,248 1005 | 1003,룹,248 1006 | 1004,덥,248 1007 | 1005,폼,246 1008 | 1006,톱,244 1009 | 1007,듬,244 1010 | 1008,껍,244 1011 | 1009,흠,243 1012 | 1010,팍,243 1013 | 1011,맹,243 1014 | 1012,쉴,242 1015 | 1013,썩,240 1016 | 1014,밟,240 1017 | 1015,맵,237 1018 | 1016,돋,236 1019 | 1017,콤,235 1020 | 1018,맙,234 1021 | 1019,뱅,233 1022 | 1020,쫍,231 1023 | 1021,윗,229 1024 | 1022,뜩,229 1025 | 1023,찝,228 1026 | 1024,뺀,227 1027 | 1025,닷,226 1028 | 1026,넨,226 1029 | 1027,쌈,225 1030 | 1028,쩨,224 1031 | 1029,붓,224 1032 | 1030,쩡,223 1033 | 1031,믹,223 1034 | 1032,잼,221 1035 | 1033,r,221 1036 | 1034,쭐,220 1037 | 1035,엊,219 1038 | 1036,g,219 1039 | 1037,췄,217 1040 | 1038,룩,217 1041 | 1039,텀,215 1042 | 1040,쇠,213 1043 | 1041,숫,212 1044 | 1042,풋,210 1045 | 1043,쌩,208 1046 | 1044,쾌,207 1047 | 1045,볍,207 1048 | 1046,뤄,207 1049 | 1047,겐,207 1050 | 1048,m,207 1051 | 1049,펌,206 1052 | 1050,쪄,206 1053 | 1051,뻥,206 1054 | 1052,i,206 1055 | 1053,뻤,205 1056 | 1054,k,204 1057 | 1055,핵,203 1058 | 1056,셉,200 1059 | 1057,듀,198 1060 | 1058,닉,198 1061 | 1059,략,197 1062 | 1060,넉,197 1063 | 1061,딪,196 1064 | 1062,낯,195 1065 | 1063,텍,194 1066 | 1064,뱃,194 1067 | 1065,멤,194 1068 | 1066,윈,192 1069 | 1067,엎,192 1070 | 1068,뭉,192 1071 | 1069,젝,191 1072 | 1070,셜,190 1073 | 1071,빴,190 1074 | 1072,룰,190 1075 | 1073,앗,189 1076 | 1074,궈,189 1077 | 1075,윙,188 1078 | 1076,엥,188 1079 | 1077,d,185 1080 | 1078,꼽,183 1081 | 1079,챔,182 1082 | 1080,쉐,182 1083 | 1081,봇,182 1084 | 1082,푼,180 1085 | 1083,댁,179 1086 | 1084,칵,178 1087 | 1085,뿔,177 1088 | 1086,뺑,176 1089 | 1087,탱,175 1090 | 1088,쿼,174 1091 | 
1089,갱,174 1092 | 1090,퉁,173 1093 | 1091,빔,173 1094 | 1092,썬,172 1095 | 1093,빽,172 1096 | 1094,둑,172 1097 | 1095,헛,169 1098 | 1096,빗,168 1099 | 1097,탓,167 1100 | 1098,륵,165 1101 | 1099,꼰,165 1102 | 1100,쎈,163 1103 | 1101,쥬,162 1104 | 1102,깡,162 1105 | 1103,퀄,161 1106 | 1104,빚,161 1107 | 1105,즉,160 1108 | 1106,삿,160 1109 | 1107,밭,160 1110 | 1108,u,160 1111 | 1109,혐,159 1112 | 1110,햇,159 1113 | 1111,툭,159 1114 | 1112,탠,158 1115 | 1113,샷,158 1116 | 1114,맣,157 1117 | 1115,껏,157 1118 | 1116,핏,156 1119 | 1117,앵,155 1120 | 1118,뜰,155 1121 | 1119,굿,155 1122 | 1120,U,155 1123 | 1121,섹,154 1124 | 1122,펑,153 1125 | 1123,맻,153 1126 | 1124,뀔,153 1127 | 1125,깥,152 1128 | 1126,뱀,150 1129 | 1127,뢰,150 1130 | 1128,껀,150 1131 | 1129,뉘,149 1132 | 1130,흉,146 1133 | 1131,틈,146 1134 | 1132,쏟,146 1135 | 1133,훔,143 1136 | 1134,쇄,142 1137 | 1135,뎅,142 1138 | 1136,칩,141 1139 | 1137,띵,140 1140 | 1138,푹,139 1141 | 1139,넥,139 1142 | 1140,퀘,137 1143 | 1141,훅,136 1144 | 1142,융,136 1145 | 1143,멸,136 1146 | 1144,냠,136 1147 | 1145,횟,134 1148 | 1146,찼,134 1149 | 1147,Y,134 1150 | 1148,룬,133 1151 | 1149,귈,133 1152 | 1150,H,133 1153 | 1151,젓,132 1154 | 1152,쏠,132 1155 | 1153,숲,132 1156 | 1154,냬,132 1157 | 1155,l,132 1158 | 1156,짰,130 1159 | 1157,멕,130 1160 | 1158,뇨,130 1161 | 1159,팽,128 1162 | 1160,깼,127 1163 | 1161,숏,126 1164 | 1162,굔,125 1165 | 1163,슐,123 1166 | 1164,쉰,123 1167 | 1165,얄,122 1168 | 1166,뱉,122 1169 | 1167,렬,122 1170 | 1168,굶,122 1171 | 1169,팁,121 1172 | 1170,츄,121 1173 | 1171,뭣,121 1174 | 1172,륙,121 1175 | 1173,횡,120 1176 | 1174,옹,119 1177 | 1175,뻘,117 1178 | 1176,옵,116 1179 | 1177,옴,116 1180 | 1178,얹,116 1181 | 1179,쑥,116 1182 | 1180,깄,116 1183 | 1181,므,115 1184 | 1182,찡,112 1185 | 1183,젖,112 1186 | 1184,꽈,112 1187 | 1185,틸,111 1188 | 1186,콕,111 1189 | 1187,첩,111 1190 | 1188,똘,111 1191 | 1189,쿵,110 1192 | 1190,왤,110 1193 | 1191,괌,110 1194 | 1192,밸,109 1195 | 1193,녜,109 1196 | 1194,갸,109 1197 | 1195,펀,108 1198 | 1196,칫,108 1199 | 1197,맺,108 1200 | 1198,탭,107 1201 | 1199,쁨,106 1202 | 
1200,폈,105 1203 | 1201,펼,105 1204 | 1202,첼,105 1205 | 1203,숱,105 1206 | 1204,섰,105 1207 | 1205,킥,104 1208 | 1206,맑,103 1209 | 1207,랗,103 1210 | 1208,펐,102 1211 | 1209,넛,102 1212 | 1210,솜,101 1213 | 1211,벙,101 1214 | 1212,껑,101 1215 | 1213,f,101 1216 | 1214,룡,100 1217 | 1215,훌,99 1218 | 1216,x,98 1219 | 1217,쓱,97 1220 | 1218,늬,97 1221 | 1219,곽,97 1222 | 1220,y,97 1223 | 1221,욘,96 1224 | 1222,돔,96 1225 | 1223,겄,96 1226 | 1224,텝,95 1227 | 1225,훠,94 1228 | 1226,텅,94 1229 | 1227,씌,94 1230 | 1228,꺾,94 1231 | 1229,벚,93 1232 | 1230,렷,93 1233 | 1231,귓,93 1234 | 1232,찹,92 1235 | 1233,툴,91 1236 | 1234,깅,91 1237 | 1235,쭤,90 1238 | 1236,욜,90 1239 | 1237,얌,90 1240 | 1238,짖,89 1241 | 1239,옳,89 1242 | 1240,벳,88 1243 | 1241,뛸,88 1244 | 1242,깠,88 1245 | 1243,퍽,87 1246 | 1244,퀸,87 1247 | 1245,엮,87 1248 | 1246,삽,87 1249 | 1247,겟,87 1250 | 1248,왓,86 1251 | 1249,댈,86 1252 | 1250,샴,85 1253 | 1251,뻗,85 1254 | 1252,됨,84 1255 | 1253,얜,83 1256 | 1254,굵,83 1257 | 1255,눕,82 1258 | 1256,갇,82 1259 | 1257,셰,81 1260 | 1258,늫,81 1261 | 1259,텨,80 1262 | 1260,숍,80 1263 | 1261,뻑,80 1264 | 1262,됩,80 1265 | 1263,잎,79 1266 | 1264,뭇,79 1267 | 1265,퐁,78 1268 | 1266,팸,78 1269 | 1267,쯔,78 1270 | 1268,넜,78 1271 | 1269,깍,78 1272 | 1270,쌔,77 1273 | 1271,셈,77 1274 | 1272,읍,76 1275 | 1273,픔,75 1276 | 1274,펫,75 1277 | 1275,콧,75 1278 | 1276,얗,75 1279 | 1277,눅,75 1280 | 1278,j,74 1281 | 1279,쬐,73 1282 | 1280,렙,73 1283 | 1281,닙,73 1284 | 1282,슥,72 1285 | 1283,흙,71 1286 | 1284,쭝,71 1287 | 1285,짭,71 1288 | 1286,샹,71 1289 | 1287,릏,71 1290 | 1288,럿,71 1291 | 1289,덧,71 1292 | 1290,즙,70 1293 | 1291,늑,70 1294 | 1292,괄,70 1295 | 1293,킷,68 1296 | 1294,쿡,68 1297 | 1295,캉,68 1298 | 1296,둡,68 1299 | 1297,톨,67 1300 | 1298,엣,67 1301 | 1299,숟,67 1302 | 1300,낑,67 1303 | 1301,펭,66 1304 | 1302,왁,66 1305 | 1303,쏴,66 1306 | 1304,쏙,66 1307 | 1305,봅,66 1308 | 1306,멧,66 1309 | 1307,줏,65 1310 | 1308,뵈,65 1311 | 1309,쫑,64 1312 | 1310,륨,64 1313 | 1311,h,64 1314 | 1312,펄,63 1315 | 1313,짼,63 1316 | 1314,짚,63 1317 | 1315,껐,63 1318 | 1316,겜,63 
1319 | 1317,싯,62 1320 | 1318,붐,62 1321 | 1319,렐,62 1322 | 1320,돗,62 1323 | 1321,팥,61 1324 | 1322,웰,61 1325 | 1323,륜,61 1326 | 1324,잣,60 1327 | 1325,슝,60 1328 | 1326,붉,60 1329 | 1327,윽,59 1330 | 1328,삘,59 1331 | 1329,딲,59 1332 | 1330,갯,58 1333 | 1331,횐,57 1334 | 1332,헨,57 1335 | 1333,캘,57 1336 | 1334,쩰,57 1337 | 1335,뤘,57 1338 | 1336,랴,57 1339 | 1337,껜,57 1340 | 1338,펠,56 1341 | 1339,킵,56 1342 | 1340,컹,56 1343 | 1341,렘,56 1344 | 1342,뛴,56 1345 | 1343,헝,55 1346 | 1344,씽,55 1347 | 1345,뮬,55 1348 | 1346,젯,54 1349 | 1347,샜,54 1350 | 1348,뿜,54 1351 | 1349,뒹,54 1352 | 1350,뎌,54 1353 | 1351,깬,54 1354 | 1352,챠,53 1355 | 1353,왈,53 1356 | 1354,뾰,53 1357 | 1355,뚤,53 1358 | 1356,꾹,53 1359 | 1357,갛,53 1360 | 1358,잌,52 1361 | 1359,엿,52 1362 | 1360,솥,52 1363 | 1361,벡,52 1364 | 1362,룻,52 1365 | 1363,꿍,52 1366 | 1364,곈,52 1367 | 1365,팜,51 1368 | 1366,튕,51 1369 | 1367,컥,51 1370 | 1368,첸,51 1371 | 1369,줍,51 1372 | 1370,섀,51 1373 | 1371,몫,51 1374 | 1372,뜸,51 1375 | 1373,깁,51 1376 | 1374,핬,50 1377 | 1375,쭘,50 1378 | 1376,쌰,50 1379 | 1377,넬,50 1380 | 1378,큘,49 1381 | 1379,쾅,49 1382 | 1380,캄,48 1383 | 1381,괘,48 1384 | 1382,쟀,47 1385 | 1383,윌,47 1386 | 1384,엌,47 1387 | 1385,앓,47 1388 | 1386,씁,47 1389 | 1387,륭,47 1390 | 1388,W,47 1391 | 1389,쑤,46 1392 | 1390,삥,46 1393 | 1391,돕,46 1394 | 1392,깰,46 1395 | 1393,핍,45 1396 | 1394,텃,45 1397 | 1395,슛,45 1398 | 1396,맸,45 1399 | 1397,롬,45 1400 | 1398,갭,45 1401 | 1399,얽,44 1402 | 1400,쏭,44 1403 | 1401,랠,44 1404 | 1402,겔,44 1405 | 1403,Q,44 1406 | 1404,핥,43 1407 | 1405,킴,43 1408 | 1406,읏,43 1409 | 1407,앚,43 1410 | 1408,숯,43 1411 | 1409,밋,43 1412 | 1410,뽐,42 1413 | 1411,뻣,42 1414 | 1412,눴,42 1415 | 1413,잭,41 1416 | 1414,뽈,41 1417 | 1415,뗐,41 1418 | 1416,꽝,41 1419 | 1417,훑,40 1420 | 1418,캥,40 1421 | 1419,쫀,40 1422 | 1420,뵙,40 1423 | 1421,홋,39 1424 | 1422,펍,39 1425 | 1423,뺨,39 1426 | 1424,됬,39 1427 | 1425,끽,39 1428 | 1426,빕,38 1429 | 1427,밉,38 1430 | 1428,꿋,38 1431 | 1429,헉,37 1432 | 1430,캣,37 1433 | 1431,촘,37 1434 | 1432,셌,37 1435 | 1433,삔,37 1436 | 
1434,삑,37 1437 | 1435,뽂,37 1438 | 1436,뮌,37 1439 | 1437,뗄,37 1440 | 1438,텼,36 1441 | 1439,탬,36 1442 | 1440,쨍,36 1443 | 1441,웜,36 1444 | 1442,앰,36 1445 | 1443,맴,36 1446 | 1444,띡,36 1447 | 1445,꿇,36 1448 | 1446,걀,36 1449 | 1447,흩,35 1450 | 1448,쥴,35 1451 | 1449,씸,35 1452 | 1450,낡,35 1453 | 1451,곁,35 1454 | 1452,휙,34 1455 | 1453,쿤,34 1456 | 1454,켄,34 1457 | 1455,츤,34 1458 | 1456,얕,34 1459 | 1457,썪,34 1460 | 1458,둠,34 1461 | 1459,촛,33 1462 | 1460,챌,33 1463 | 1461,죙,33 1464 | 1462,쟈,33 1465 | 1463,잰,33 1466 | 1464,뵀,33 1467 | 1465,w,33 1468 | 1466,푠,32 1469 | 1467,폿,32 1470 | 1468,쨈,32 1471 | 1469,밲,32 1472 | 1470,랖,32 1473 | 1471,떵,32 1474 | 1472,찻,31 1475 | 1473,옇,31 1476 | 1474,뽁,31 1477 | 1475,둣,31 1478 | 1476,닳,31 1479 | 1477,긱,31 1480 | 1478,곶,31 1481 | 1479,휠,30 1482 | 1480,춧,30 1483 | 1481,쐈,30 1484 | 1482,썽,30 1485 | 1483,뼛,29 1486 | 1484,떳,29 1487 | 1485,굘,29 1488 | 1486,훗,28 1489 | 1487,퀵,28 1490 | 1488,봬,28 1491 | 1489,릅,28 1492 | 1490,꺠,28 1493 | 1491,슉,27 1494 | 1492,눔,27 1495 | 1493,끙,27 1496 | 1494,궐,27 1497 | 1495,2,27 1498 | 1496,촥,26 1499 | 1497,젬,26 1500 | 1498,솟,26 1501 | 1499,맷,26 1502 | 1500,룐,26 1503 | 1501,뎃,26 1504 | 1502,깽,26 1505 | 1503,툼,25 1506 | 1504,쎌,25 1507 | 1505,쉼,25 1508 | 1506,쉘,25 1509 | 1507,숑,25 1510 | 1508,뎀,25 1511 | 1509,냔,25 1512 | 1510,쫘,24 1513 | 1511,쎘,24 1514 | 1512,싣,24 1515 | 1513,섣,24 1516 | 1514,샾,24 1517 | 1515,맬,24 1518 | 1516,뗀,24 1519 | 1517,꾀,24 1520 | 1518,헹,23 1521 | 1519,햐,23 1522 | 1520,톰,23 1523 | 1521,췌,23 1524 | 1522,챈,23 1525 | 1523,봔,23 1526 | 1524,밧,23 1527 | 1525,맏,23 1528 | 1526,딥,23 1529 | 1527,늠,23 1530 | 1528,낵,23 1531 | 1529,낱,23 1532 | 1530,꺄,23 1533 | 1531,갬,23 1534 | 1532,훼,22 1535 | 1533,핼,22 1536 | 1534,튠,22 1537 | 1535,웩,22 1538 | 1536,쏜,22 1539 | 1537,뿅,22 1540 | 1538,빰,22 1541 | 1539,딤,22 1542 | 1540,꿉,22 1543 | 1541,걜,22 1544 | 1542,1,22 1545 | 1543,짙,21 1546 | 1544,얍,21 1547 | 1545,샛,21 1548 | 1546,뗘,21 1549 | 1547,듭,21 1550 | 1548,챘,20 1551 | 1549,쯧,20 1552 | 1550,짹,20 1553 | 1551,잦,20 
1554 | 1552,옐,20 1555 | 1553,빳,20 1556 | 1554,몹,20 1557 | 1555,몄,20 1558 | 1556,똔,20 1559 | 1557,딧,20 1560 | 1558,놉,20 1561 | 1559,궜,20 1562 | 1560,굼,20 1563 | 1561,헥,19 1564 | 1562,캬,19 1565 | 1563,챕,19 1566 | 1564,쟨,19 1567 | 1565,멓,19 1568 | 1566,똠,19 1569 | 1567,댐,19 1570 | 1568,텁,18 1571 | 1569,켈,18 1572 | 1570,첵,18 1573 | 1571,숄,18 1574 | 1572,띨,18 1575 | 1573,듦,18 1576 | 1574,궤,18 1577 | 1575,곗,18 1578 | 1576,튈,17 1579 | 1577,좆,17 1580 | 1578,윷,17 1581 | 1579,옅,17 1582 | 1580,얏,17 1583 | 1581,믈,17 1584 | 1582,룽,17 1585 | 1583,띃,17 1586 | 1584,딕,17 1587 | 1585,뎁,17 1588 | 1586,닛,17 1589 | 1587,냑,17 1590 | 1588,겅,17 1591 | 1589,휩,16 1592 | 1590,팎,16 1593 | 1591,틋,16 1594 | 1592,콸,16 1595 | 1593,콥,16 1596 | 1594,잴,16 1597 | 1595,웁,16 1598 | 1596,슘,16 1599 | 1597,멱,16 1600 | 1598,랏,16 1601 | 1599,떄,16 1602 | 1600,뒨,16 1603 | 1601,꿰,16 1604 | 1602,깻,16 1605 | 1603,긌,16 1606 | 1604,젼,15 1607 | 1605,윰,15 1608 | 1606,웍,15 1609 | 1607,앳,15 1610 | 1608,샬,15 1611 | 1609,샥,15 1612 | 1610,볕,15 1613 | 1611,멩,15 1614 | 1612,넹,15 1615 | 1613,넙,15 1616 | 1614,끕,15 1617 | 1615,휜,14 1618 | 1616,텄,14 1619 | 1617,쫒,14 1620 | 1618,쩝,14 1621 | 1619,쨋,14 1622 | 1620,윳,14 1623 | 1621,쉭,14 1624 | 1622,쇽,14 1625 | 1623,셧,14 1626 | 1624,뵐,14 1627 | 1625,땔,14 1628 | 1626,덱,14 1629 | 1627,댑,14 1630 | 1628,꺽,14 1631 | 1629,곪,14 1632 | 1630,켔,13 1633 | 1631,츰,13 1634 | 1632,읜,13 1635 | 1633,쑈,13 1636 | 1634,볐,13 1637 | 1635,및,13 1638 | 1636,롷,13 1639 | 1637,딛,13 1640 | 1638,냇,13 1641 | 1639,3,13 1642 | 1640,혓,12 1643 | 1641,팻,12 1644 | 1642,팰,12 1645 | 1643,킁,12 1646 | 1644,촤,12 1647 | 1645,쨰,12 1648 | 1646,잿,12 1649 | 1647,옌,12 1650 | 1648,쐬,12 1651 | 1649,쌋,12 1652 | 1650,슁,12 1653 | 1651,쉑,12 1654 | 1652,빢,12 1655 | 1653,뵌,12 1656 | 1654,뭄,12 1657 | 1655,묽,12 1658 | 1656,뎠,12 1659 | 1657,늪,12 1660 | 1658,뇽,12 1661 | 1659,훤,11 1662 | 1660,횔,11 1663 | 1661,홧,11 1664 | 1662,햅,11 1665 | 1663,푯,11 1666 | 1664,팼,11 1667 | 1665,탯,11 1668 | 1666,탤,11 1669 | 1667,큭,11 1670 | 1668,짢,11 1671 | 
1669,읊,11 1672 | 1670,롸,11 1673 | 1671,띤,11 1674 | 1672,놋,11 1675 | 1673,넴,11 1676 | 1674,귿,11 1677 | 1675,q,11 1678 | 1676,휑,10 1679 | 1677,퓸,10 1680 | 1678,튿,10 1681 | 1679,튄,10 1682 | 1680,촐,10 1683 | 1681,쭌,10 1684 | 1682,짊,10 1685 | 1683,숴,10 1686 | 1684,숀,10 1687 | 1685,뿡,10 1688 | 1686,뻬,10 1689 | 1687,렜,10 1690 | 1688,뗬,10 1691 | 1689,늄,10 1692 | 1690,끍,10 1693 | 1691,곌,10 1694 | 1692,갰,10 1695 | 1693,폄,9 1696 | 1694,콰,9 1697 | 1695,캇,9 1698 | 1696,캅,9 1699 | 1697,늉,9 1700 | 1698,뉜,9 1701 | 1699,큔,8 1702 | 1700,콱,8 1703 | 1701,켠,8 1704 | 1702,쳔,8 1705 | 1703,쳇,8 1706 | 1704,읔,8 1707 | 1705,읎,8 1708 | 1706,욤,8 1709 | 1707,뼘,8 1710 | 1708,롹,8 1711 | 1709,렝,8 1712 | 1710,뚠,8 1713 | 1711,땋,8 1714 | 1712,덨,8 1715 | 1713,넵,8 1716 | 1714,넝,8 1717 | 1715,넋,8 1718 | 1716,꿩,8 1719 | 1717,꼿,8 1720 | 1718,깟,8 1721 | 1719,곯,8 1722 | 1720,4,8 1723 | 1721,힝,7 1724 | 1722,헀,7 1725 | 1723,푤,7 1726 | 1724,쿱,7 1727 | 1725,캤,7 1728 | 1726,챗,7 1729 | 1727,쯩,7 1730 | 1728,쮸,7 1731 | 1729,읗,7 1732 | 1730,윅,7 1733 | 1731,얠,7 1734 | 1732,씰,7 1735 | 1733,썅,7 1736 | 1734,쌜,7 1737 | 1735,쌉,7 1738 | 1736,슌,7 1739 | 1737,쉿,7 1740 | 1738,쇳,7 1741 | 1739,셍,7 1742 | 1740,뿍,7 1743 | 1741,뼌,7 1744 | 1742,뺌,7 1745 | 1743,밈,7 1746 | 1744,룟,7 1747 | 1745,룔,7 1748 | 1746,뜀,7 1749 | 1747,끅,7 1750 | 1748,꾜,7 1751 | 1749,겡,7 1752 | 1750,Z,7 1753 | 1751,횰,6 1754 | 1752,헴,6 1755 | 1753,핌,6 1756 | 1754,펩,6 1757 | 1755,펨,6 1758 | 1756,켸,6 1759 | 1757,쩠,6 1760 | 1758,잽,6 1761 | 1759,엡,6 1762 | 1760,앝,6 1763 | 1761,쓕,6 1764 | 1762,썜,6 1765 | 1763,쌘,6 1766 | 1764,삣,6 1767 | 1765,빻,6 1768 | 1766,몀,6 1769 | 1767,뤼,6 1770 | 1768,롄,6 1771 | 1769,떫,6 1772 | 1770,덫,6 1773 | 1771,뉸,6 1774 | 1772,꽥,6 1775 | 1773,궂,6 1776 | 1774,괭,6 1777 | 1775,0,6 1778 | 1776,힉,5 1779 | 1777,휸,5 1780 | 1778,퓰,5 1781 | 1779,퉤,5 1782 | 1780,퀭,5 1783 | 1781,켐,5 1784 | 1782,췻,5 1785 | 1783,챡,5 1786 | 1784,쨀,5 1787 | 1785,젱,5 1788 | 1786,읒,5 1789 | 1787,웡,5 1790 | 1788,옙,5 1791 | 1789,얬,5 1792 | 1790,앎,5 1793 | 1791,씼,5 1794 | 
1792,쑨,5 1795 | 1793,쐐,5 1796 | 1794,쌨,5 1797 | 1795,쉈,5 1798 | 1796,숩,5 1799 | 1797,셤,5 1800 | 1798,삠,5 1801 | 1799,뱄,5 1802 | 1800,뭥,5 1803 | 1801,멨,5 1804 | 1802,먀,5 1805 | 1803,랒,5 1806 | 1804,땟,5 1807 | 1805,땁,5 1808 | 1806,뉠,5 1809 | 1807,꽌,5 1810 | 1808,귐,5 1811 | 1809,굥,5 1812 | 1810,굣,5 1813 | 1811,겋,5 1814 | 1812,8,5 1815 | 1813,5,5 1816 | 1814,훙,4 1817 | 1815,혠,4 1818 | 1816,폔,4 1819 | 1817,튬,4 1820 | 1818,퉜,4 1821 | 1819,탉,4 1822 | 1820,큽,4 1823 | 1821,퀀,4 1824 | 1822,쿰,4 1825 | 1823,켤,4 1826 | 1824,켕,4 1827 | 1825,칡,4 1828 | 1826,쯘,4 1829 | 1827,짯,4 1830 | 1828,짔,4 1831 | 1829,쥔,4 1832 | 1830,줜,4 1833 | 1831,죗,4 1834 | 1832,욧,4 1835 | 1833,쒸,4 1836 | 1834,쏼,4 1837 | 1835,쌂,4 1838 | 1836,셴,4 1839 | 1837,샨,4 1840 | 1838,뽜,4 1841 | 1839,뻠,4 1842 | 1840,빝,4 1843 | 1841,봥,4 1844 | 1842,뫼,4 1845 | 1843,맜,4 1846 | 1844,맀,4 1847 | 1845,뤠,4 1848 | 1846,띌,4 1849 | 1847,띈,4 1850 | 1848,뛌,4 1851 | 1849,뚸,4 1852 | 1850,떽,4 1853 | 1851,떰,4 1854 | 1852,떈,4 1855 | 1853,땍,4 1856 | 1854,돠,4 1857 | 1855,뎄,4 1858 | 1856,늣,4 1859 | 1857,뇸,4 1860 | 1858,놥,4 1861 | 1859,녈,4 1862 | 1860,넒,4 1863 | 1861,낏,4 1864 | 1862,갉,4 1865 | 1863,z,4 1866 | 1864,힛,3 1867 | 1865,휀,3 1868 | 1866,헙,3 1869 | 1867,픗,3 1870 | 1868,퐈,3 1871 | 1869,팹,3 1872 | 1870,틔,3 1873 | 1871,퇸,3 1874 | 1872,텟,3 1875 | 1873,탰,3 1876 | 1874,퀼,3 1877 | 1875,칬,3 1878 | 1876,췰,3 1879 | 1877,쳥,3 1880 | 1878,챂,3 1881 | 1879,찧,3 1882 | 1880,쭙,3 1883 | 1881,쬬,3 1884 | 1882,쩄,3 1885 | 1883,쨉,3 1886 | 1884,줴,3 1887 | 1885,죵,3 1888 | 1886,좍,3 1889 | 1887,좃,3 1890 | 1888,읐,3 1891 | 1889,읃,3 1892 | 1890,웟,3 1893 | 1891,왯,3 1894 | 1892,옭,3 1895 | 1893,씅,3 1896 | 1894,쒀,3 1897 | 1895,쑹,3 1898 | 1896,쏵,3 1899 | 1897,쎔,3 1900 | 1898,쎅,3 1901 | 1899,슾,3 1902 | 1900,쉣,3 1903 | 1901,솨,3 1904 | 1902,셸,3 1905 | 1903,셩,3 1906 | 1904,삯,3 1907 | 1905,뿟,3 1908 | 1906,뾱,3 1909 | 1907,뽝,3 1910 | 1908,뻰,3 1911 | 1909,뵤,3 1912 | 1910,볏,3 1913 | 1911,뱌,3 1914 | 1912,뱁,3 1915 | 1913,밨,3 1916 | 1914,밎,3 1917 | 1915,믐,3 1918 | 1916,뭡,3 1919 | 
1917,뭠,3 1920 | 1918,뭍,3 1921 | 1919,묭,3 1922 | 1920,뫠,3 1923 | 1921,멏,3 1924 | 1922,멎,3 1925 | 1923,맽,3 1926 | 1924,뙤,3 1927 | 1925,떔,3 1928 | 1926,땃,3 1929 | 1927,덷,3 1930 | 1928,댜,3 1931 | 1929,늰,3 1932 | 1930,놘,3 1933 | 1931,냘,3 1934 | 1932,뀜,3 1935 | 1933,꽹,3 1936 | 1934,꽐,3 1937 | 1935,꼇,3 1938 | 1936,껸,3 1939 | 1937,껨,3 1940 | 1938,꺅,3 1941 | 1939,곘,3 1942 | 1940,겝,3 1943 | 1941,흣,2 1944 | 1942,흝,2 1945 | 1943,흄,2 1946 | 1944,헸,2 1947 | 1945,헜,2 1948 | 1946,햘,2 1949 | 1947,햑,2 1950 | 1948,핟,2 1951 | 1949,픙,2 1952 | 1950,픕,2 1953 | 1951,푀,2 1954 | 1952,퐉,2 1955 | 1953,틑,2 1956 | 1954,틍,2 1957 | 1955,큣,2 1958 | 1956,퀏,2 1959 | 1957,콴,2 1960 | 1958,켁,2 1961 | 1959,칟,2 1962 | 1960,츳,2 1963 | 1961,츨,2 1964 | 1962,츈,2 1965 | 1963,쵝,2 1966 | 1964,촙,2 1967 | 1965,찦,2 1968 | 1966,찟,2 1969 | 1967,쭁,2 1970 | 1968,쫏,2 1971 | 1969,쩬,2 1972 | 1970,쩧,2 1973 | 1971,짆,2 1974 | 1972,쥰,2 1975 | 1973,쥘,2 1976 | 1974,좐,2 1977 | 1975,졍,2 1978 | 1976,쟝,2 1979 | 1977,잧,2 1980 | 1978,잍,2 1981 | 1979,욍,2 1982 | 1980,왬,2 1983 | 1981,옫,2 1984 | 1982,옜,2 1985 | 1983,옉,2 1986 | 1984,엤,2 1987 | 1985,얐,2 1988 | 1986,얉,2 1989 | 1987,쒯,2 1990 | 1988,쑐,2 1991 | 1989,쏸,2 1992 | 1990,썻,2 1993 | 1991,쌕,2 1994 | 1992,싷,2 1995 | 1993,솝,2 1996 | 1994,솎,2 1997 | 1995,샅,2 1998 | 1996,삻,2 1999 | 1997,쁩,2 2000 | 1998,빘,2 2001 | 1999,뷸,2 2002 | 2000,뷴,2 2003 | 2001,봽,2 2004 | 2002,봈,2 2005 | 2003,볌,2 2006 | 2004,벋,2 2007 | 2005,믕,2 2008 | 2006,뮴,2 2009 | 2007,뭬,2 2010 | 2008,뭑,2 2011 | 2009,묜,2 2012 | 2010,맇,2 2013 | 2011,릈,2 2014 | 2012,륑,2 2015 | 2013,랮,2 2016 | 2014,랟,2 2017 | 2015,띔,2 2018 | 2016,떱,2 2019 | 2017,듈,2 2020 | 2018,됴,2 2021 | 2019,됫,2 2022 | 2020,됙,2 2023 | 2021,됑,2 2024 | 2022,댤,2 2025 | 2023,닯,2 2026 | 2024,늗,2 2027 | 2025,뉩,2 2028 | 2026,눟,2 2029 | 2027,눗,2 2030 | 2028,넚,2 2031 | 2029,냡,2 2032 | 2030,낢,2 2033 | 2031,꾿,2 2034 | 2032,꼳,2 2035 | 2033,꼄,2 2036 | 2034,겊,2 2037 | 2035,갼,2 2038 | 2036,6,2 2039 | 2037,,0 2040 | 2038,,0 2041 | 2039,_,0 2042 | 
--------------------------------------------------------------------------------