├── .gitignore ├── .travis.yml ├── CHANGELOG.md ├── LICENSE ├── README.en.md ├── README.md ├── setup.cfg ├── setup.py ├── test.py └── tossi ├── __about__.py ├── __init__.py ├── coda.py ├── formatter.py ├── hangul.py ├── particles.py ├── tolerance.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.py[cod] 2 | 3 | # C extensions 4 | *.so 5 | 6 | # Packages 7 | *.egg 8 | *.egg-info 9 | dist 10 | build 11 | eggs 12 | parts 13 | bin 14 | var 15 | sdist 16 | develop-eggs 17 | .installed.cfg 18 | lib 19 | lib64 20 | 21 | # Installer logs 22 | pip-log.txt 23 | 24 | # Unit test / coverage reports 25 | .coverage 26 | .tox 27 | .cache 28 | nosetests.xml 29 | 30 | # Translations 31 | *.mo 32 | 33 | # Mr Developer 34 | .mr.developer.cfg 35 | .project 36 | .pydevproject 37 | 38 | # Vim swap files 39 | .*.sw[ponm] 40 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | sudo: false 3 | python: 4 | - 2.7 5 | - 3.3 6 | - 3.4 7 | - 3.5 8 | - 3.5-dev 9 | - pypy 10 | - pypy3 11 | install: 12 | - pip install -e . 13 | - pip install flake8 flake8-import-order pytest pytest-cov coveralls 14 | script: 15 | - | # flake8 16 | flake8 tossi test.py setup.py -v --show-source 17 | - | # pytest 18 | py.test -v --cov=tossi --cov-report=term-missing 19 | after_success: 20 | - coveralls 21 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | ## Version 0.1 2 | 3 | Released on Jun 10 2016. 4 | 5 | The first public release. 6 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2016, What! Studio 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without modification, 5 | are permitted provided that the following conditions are met: 6 | 7 | Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 10 | Redistributions in binary form must reproduce the above copyright notice, this 11 | list of conditions and the following disclaimer in the documentation and/or 12 | other materials provided with the distribution. 13 | 14 | Neither the name of the copyright holder nor the names of its 15 | contributors may be used to endorse or promote products derived from 16 | this software without specific prior written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 19 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 20 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR 22 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 23 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 24 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON 25 | ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 26 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 27 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 

--------------------------------------------------------------------------------
/README.en.md:
--------------------------------------------------------------------------------
# Tossi

[![Build Status](
https://travis-ci.org/what-studio/tossi.svg?branch=master
)](https://travis-ci.org/what-studio/tossi)
[![Coverage Status](
https://coveralls.io/repos/github/what-studio/tossi/badge.svg?branch=master
)](https://coveralls.io/r/what-studio/tossi)
[![README in Korean](
https://img.shields.io/badge/readme-korean-blue.svg?style=flat
)](README.md)

"Tossi(토씨)" is a pure-Korean name for grammatical particles. Some Korean
particles have allomorphic variants whose form depends on the leading word.
The Tossi library determines the most natural form.

## Installation

```console
$ pip install tossi
```

## Usage

```python
>>> import tossi
>>> tossi.postfix(u'집', u'(으)로')
집으로
>>> tossi.postfix(u'말', u'으로는')
말로는
>>> tossi.postfix(u'대한민국', u'은(는)')
대한민국은
>>> tossi.postfix(u'민주공화국', u'다')
민주공화국이다
```

## Natural Form for Particles

These particles have no allomorphic variants; they always appear in the same
form: `의`, `도`, `만~`, `에~`, `께~`, `뿐~`, `하~`, `보다~`, `밖에~`, `같이~`,
`부터~`, `까지~`, `마저~`, `조차~`, `마냥~`, `처럼~`, and `커녕~`:

> 나오**의**, 모리안**의**, 키홀**의**, 나오**도**, 모리안**도**, 키홀**도**

Meanwhile, these particles take a different form depending on whether the
leading word ends with a final consonant: `은(는)`, `이(가)`, `을(를)`, and
`과(와)~`:

> 나오**는**, 모리안**은**, 키홀**은**

`(으)로~` follows a similar rule, except that a final consonant `ㄹ` is
treated as if there were no final consonant:

> 나오**로**, 모리안**으로**, 키홀**로**

`(이)다`, the predicative particle, has more diverse forms because its ending
can be inflected:

> 나오**지만**, 모리안**이지만**, 키홀**이에요**, 나오**예요**

Tossi tries to determine the most natural form of a particle. When it cannot,
it falls back to a tolerant form which spells out both candidates, such as
`은(는)` or `(으)로`:

```python
>>> tossi.postfix(u'벽돌', u'으로')
벽돌로
>>> tossi.postfix(u'짚', u'으로')
짚으로
>>> tossi.postfix(u'黃金', u'으로')
黃金(으)로
```

If the leading word ends with a number, the natural form can still be
determined:

```python
>>> tossi.postfix(u'레벨 10', u'이')
레벨 10이
>>> tossi.postfix(u'레벨 999', u'이')
레벨 999가
```

Words in parentheses are ignored:

```python
>>> tossi.postfix(u'나뭇가지(만렙)', u'을')
나뭇가지(만렙)를
```

## Tolerance Styles

When Tossi can't determine the natural form, the result includes both forms,
and you can choose their order. For example, if most of your words are
Japanese, they will usually end without a final consonant, so the `는(은)`
style reads better than the default `은(는)`:

```python
>>> tolerance_style = tossi.parse_tolerance_style(u'는(은)')
>>> tossi.postfix(u'さくら', u'이', tolerance_style=tolerance_style)
さくら가(이)
```

Choose one of `은(는)`, `(은)는`, `는(은)`, `(는)은` for your project.

## Licensing

Written by [Heungsub Lee][sublee] and [Chanwoong Kim][kexplo] at
[What! 
Studio][what-studio] in [Nexon][nexon], and distributed under 108 | [the BSD 3-Clause license][bsd-3-clause]. 109 | 110 | [nexon]: http://nexon.com/ 111 | [what-studio]: https://github.com/what-studio 112 | [sublee]: http://subl.ee/ 113 | [kexplo]: http://chanwoong.kim/ 114 | [bsd-3-clause]: http://opensource.org/licenses/BSD-3-Clause 115 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 토씨 2 | 3 | [![Build Status]( 4 | https://travis-ci.org/what-studio/tossi.svg?branch=master 5 | )](https://travis-ci.org/what-studio/tossi) 6 | [![Coverage Status]( 7 | https://coveralls.io/repos/github/what-studio/tossi/badge.svg?branch=master 8 | )](https://coveralls.io/r/what-studio/tossi) 9 | [![README in English]( 10 | https://img.shields.io/badge/readme-english-blue.svg?style=flat 11 | )](README.en.md) 12 | 13 | '토씨'는 '조사'의 순우리말 이름입니다. 토씨 라이브러리는 임의의 단어 뒤에 올 14 | 가장 자연스러운 한국어 조사 형태를 골라줍니다. 15 | 16 | ## 설치 17 | 18 | ```console 19 | $ pip install tossi 20 | ``` 21 | 22 | ## 사용법 23 | 24 | ```python 25 | >>> import tossi 26 | >>> tossi.postfix(u'집', u'(으)로') 27 | 집으로 28 | >>> tossi.postfix(u'말', u'으로는') 29 | 말로는 30 | >>> tossi.postfix(u'대한민국', u'은(는)') 31 | 대한민국은 32 | >>> tossi.postfix(u'민주공화국', u'다') 33 | 민주공화국이다 34 | ``` 35 | 36 | ## 자연스러운 조사 선택 37 | 38 | `의`, `도`, `만~`, `에~`, `께~`, `뿐~`, `하~`, `보다~`, `밖에~`, `같이~`, 39 | `부터~`, `까지~`, `마저~`, `조차~`, `마냥~`, `처럼~`, `커녕~`에는 어떤 단어가 40 | 앞서도 형태가 변하지 않습니다: 41 | 42 | > 나오**의**, 모리안**의**, 키홀**의**, 나오**도**, 모리안**도**, 키홀**도** 43 | 44 | 반면 `은(는)`, `이(가)`, `을(를)`, `과(와)~`는 앞선 단어의 마지막 음절의 받침 45 | 유무에 따라 형태가 달라집니다: 46 | 47 | > 나오**는**, 모리안**은**, 키홀**은** 48 | 49 | `(으)로~`도 비슷한 규칙을 따르지만 앞선 받침이 `ㄹ`일 경우엔 받침이 없는 것과 50 | 같게 취급합니다: 51 | 52 | > 나오**로**, 모리안**으로**, 키홀**로** 53 | 54 | 서술격 조사 `(이)다`는 어미가 활용되어 다양한 형태로 변형될 수 있습니다: 55 | 56 | > 나오**지만**, 모리안**이지만**, 키홀**이에요**, 나오**예요** 57 | 58 | 토씨는 가장 자연스러운 조사 형태를 선택합니다. 만약 어떤 형태가 자연스러운지 59 | 알 수 없을 때에는 `은(는)`, `(으)로`처럼 모든 형태를 병기합니다: 60 | 61 | ```python 62 | >>> tossi.postfix(u'벽돌', u'으로') 63 | 벽돌로 64 | >>> tossi.postfix(u'짚', u'으로') 65 | 짚으로 66 | >>> tossi.postfix(u'黃金', u'으로') 67 | 黃金(으)로 68 | ``` 69 | 70 | 단어가 숫자로 끝나더라도 자연스러운 조사 형태가 선택됩니다: 71 | 72 | ```python 73 | >>> tossi.postfix(u'레벨 10', u'이') 74 | 레벨 10이 75 | >>> tossi.postfix(u'레벨 999', u'이') 76 | 레벨 999가 77 | ``` 78 | 79 | 괄호 속 단어나 구두점은 조사 형태를 선택할 때 참고하지 않습니다: 80 | 81 | ```python 82 | >>> tossi.postfix(u'나뭇가지(만렙)', u'을') 83 | 나뭇가지(만렙)를 84 | ``` 85 | 86 | ## 병기 순서 87 | 88 | 조사의 형태를 모두 병기해야할 때 병기할 순서를 고를 수 있습니다. 가령 대부분의 89 | 인자가 일본어 단어일 경우엔 단어가 모음으로 끝날 확률이 높습니다. 이 경우 90 | 기본형인 `은(는)` 스타일보단 `는(은)` 스타일이 더 자연스러울 수 있습니다: 91 | 92 | ```python 93 | >>> tolerance_style = tossi.parse_tolerance_style(u'는(은)') 94 | >>> tossi.postfix(u'さくら', u'이', tolerance_style=tolerance_style) 95 | さくら가(이) 96 | ``` 97 | 98 | `은(는)`, `(은)는`, `는(은)`, `(는)은` 네 가지 스타일 중 프로젝트에 맞는 것을 99 | 고르세요. 100 | 101 | ## API 102 | 103 | ### `tossi.pick(word, morph) -> str` 104 | 105 | `word`에 자연스럽게 뒤따르는 조사 형태를 구합니다. 106 | 107 | ```python 108 | >>> tossi.pick(u'토씨', '은') 109 | 는 110 | >>> tossi.pick(u'우리말', '은') 111 | 은 112 | ``` 113 | 114 | ### `tossi.postfix(word, morph) -> str` 115 | 116 | 단어와 조사를 자연스럽게 연결합니다. 117 | 118 | ```python 119 | >>> tossi.postfix(u'토씨', '은') 120 | 토씨는 121 | >>> tossi.postfix(u'우리말', '은') 122 | 우리말은 123 | ``` 124 | 125 | ### `tossi.parse(morph) -> Particle` 126 | 127 | 문자열로 된 조사 표기로부터 조사 객체를 얻습니다. 

```python
>>> tossi.parse(u'으로')
<Particle: (으)로>
>>> tossi.parse(u'(은)는')
<Particle: 은(는)>
>>> tossi.parse(u'이면')
<Particle: (이)>
```

### `Particle[word[:morph]] -> str`

`word`에 뒤따르는 표기를 구합니다.

```python
>>> Eun = tossi.parse(u'은')
>>> Eun[u'라면']
은
>>> Eun[u'라볶이']
는
```

`morph`를 지정해서 어미에 변화를 줄 수 있습니다.

```python
>>> Euro = tossi.parse(u'으로')
>>> Euro[u'라면':u'으론']
으론
>>> Euro[u'라볶이':u'으론']
론
```

## 만든이와 사용권

[넥슨][nexon] [왓 스튜디오][what-studio]의 [이흥섭][sublee]과
[김찬웅][kexplo]이 만들었고 [제3조항을 포함하는 BSD 허가서][bsd-3-clause]를
채택했습니다.

[nexon]: http://nexon.com/
[what-studio]: https://github.com/what-studio
[sublee]: http://subl.ee/
[kexplo]: http://chanwoong.kim/
[bsd-3-clause]: http://opensource.org/licenses/BSD-3-Clause
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
[flake8]
ignore = E301, E731
import_order_style = google
application-import-names = tossi

[pytest]
python_files = test.py
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
import os

from setuptools import find_packages, setup


# Include __about__.py.
__dir__ = os.path.dirname(__file__)
about = {}
with open(os.path.join(__dir__, 'tossi', '__about__.py')) as f:
    exec(f.read(), about)


setup(
    name='tossi',
    version=about['__version__'],
    license=about['__license__'],
    author=about['__author__'],
    maintainer=about['__maintainer__'],
    maintainer_email=about['__maintainer_email__'],
    url='https://github.com/what-studio/tossi',
    description='Supports Korean particles',
    platforms='any',
    packages=find_packages(),
    zip_safe=False,
    classifiers=[
        'Development Status :: 4 - Beta',
        'Intended Audience :: Developers',
        'Intended Audience :: Science/Research',
        'License :: OSI Approved :: BSD License',
        'Natural Language :: Korean',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: Implementation :: CPython',
        'Programming Language :: Python :: Implementation :: PyPy',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Topic :: Software Development :: Localization',
        'Topic :: Text Processing :: Linguistic',
    ],
    install_requires=['bidict', 'six'],
)
--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
import functools

import pytest
from six import PY2, text_type as str, with_metaclass

import tossi
from tossi import postfix as f, registry
from tossi.coda import pick_coda_from_decimal
from tossi.hangul import join_phonemes, split_phonemes
from tossi.particles import Euro, Ida, Particle, SingletonParticleMeta
from tossi.tolerance import (
    generate_tolerances, get_tolerance,
    get_tolerance_from_iterator,
    MORPH1_AND_OPTIONAL_MORPH2, OPTIONAL_MORPH2_AND_MORPH1,
    parse_tolerance_style)


Eun = tossi.parse(u'은')
Eul = tossi.parse(u'을')
Gwa = tossi.parse(u'과')


def test_about():
    __import__('tossi.__about__')


def test_particle():
    assert str(Eun) == u'은(는)'
    assert str(Eul) == u'을(를)'
    assert str(Ida) == u'(이)'
    if PY2:
        try:
            __import__('unidecode')
        except ImportError:
            assert repr(Ida) == u"<Particle: u'(\\uc774)'>"
        else:
            assert repr(Ida) == u'<Particle: (i)>'
    else:
        assert repr(Ida) == u'<Particle: (이)>'


def test_frontend():
    assert tossi.parse(u'을') is Eul
    assert tossi.parse(u'를') is Eul
    assert tossi.parse(u'을(를)') is Eul
    assert tossi.parse(u'이다') is Ida
    assert tossi.parse(u'이었다') is Ida


def test_split_phonemes():
    assert split_phonemes(u'쏚') == (u'ㅆ', u'ㅗ', u'ㄲ')
    assert split_phonemes(u'섭') == (u'ㅅ', u'ㅓ', u'ㅂ')
    assert split_phonemes(u'투') == (u'ㅌ', u'ㅜ', u'')
    assert split_phonemes(u'투', onset=False) == (None, u'ㅜ', u'')
    with pytest.raises(ValueError):
        split_phonemes(u'X')
    with pytest.raises(ValueError):
        split_phonemes(u'섭섭')


def test_join_phonemes():
    assert join_phonemes(u'ㅅ', u'ㅓ', u'ㅂ') == u'섭'
    assert join_phonemes((u'ㅅ', u'ㅓ', u'ㅂ')) == u'섭'
    assert join_phonemes(u'ㅊ', u'ㅠ') == u'츄'
    assert join_phonemes(u'ㅊ', u'ㅠ', u'') == u'츄'
    assert join_phonemes((u'ㅊ', u'ㅠ')) == u'츄'
    with pytest.raises(TypeError):
        join_phonemes(u'ㄷ', u'ㅏ', u'ㄹ', u'ㄱ')


def test_particle_tolerances():
    t = lambda _1, _2: set(generate_tolerances(_1, _2))
    s = lambda x: set(x.split())
    assert t(u'이', u'가') == s(u'이(가) (이)가 가(이) (가)이')
    assert t(u'이', u'') == s(u'(이)')
    assert t(u'으로', u'로') == s(u'(으)로')
    assert t(u'이여', u'여') == s(u'(이)여')
    assert t(u'이시여', u'시여') == s(u'(이)시여')
    assert t(u'아', u'야') == s(u'아(야) (아)야 야(아) (야)아')
    assert \
        t(u'가나다', u'나나다') == \
        s(u'가(나)나다 (가)나나다 나(가)나다 (나)가나다')
    assert \
        t(u'가나다', u'마바사') == \
        s(u'가나다(마바사) (가나다)마바사 마바사(가나다) (마바사)가나다')


def test_euro():
    assert Euro[u'나오'] == u'로'
    assert Euro[u'키홀'] == u'로'
    assert Euro[u'모리안'] == u'으로'
    assert Euro[u'Nao'] == u'(으)로'
    assert Euro[u'나오':u'로서'] == u'로서'
    assert Euro[u'키홀':u'로서'] == u'로서'
    assert Euro[u'모리안':u'로서'] == u'으로서'
    assert Euro[u'나오':u'로써'] == u'로써'
    assert Euro[u'키홀':u'로써'] == u'로써'
    assert Euro[u'모리안':u'로써'] == u'으로써'
    assert Euro[u'나오':u'로부터'] == u'로부터'
    assert Euro[u'키홀':u'로부터'] == u'로부터'
    assert Euro[u'모리안':u'로부터'] == u'으로부터'
    assert Euro[u'나오':u'(으)로부터의'] == u'로부터의'
    assert Euro[u'밖':u'론'] == u'으론'


def test_combinations():
    assert f(u'이 방법', u'만으로는') == u'이 방법만으로는'
    assert f(u'나', u'조차도') == u'나조차도'
    assert f(u'그 친구', u'과는') == u'그 친구와는'
    assert f(u'그것', u'와는') == u'그것과는'
    assert f(u'사건', u'과(와)는') == u'사건과는'
    assert f(u'그 친구', u'관') == u'그 친구완'


def test_exceptions():
    # Empty.
    assert f(u'', u'를') == u'을(를)'
    # Onsets only.
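    # A lone jamo like u'ㅋ' is outside the Hangul syllable range 가-힣,
    # so no coda can be guessed and the tolerant form is chosen.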
119 | assert f(u'ㅋㅋㅋ', u'를') == u'ㅋㅋㅋ을(를)' 120 | 121 | 122 | def test_insignificant(): 123 | assert f(u'나오(Lv.25)', u'으로') == u'나오(Lv.25)로' 124 | assert f(u'나오 (Lv.25)', u'을') == u'나오 (Lv.25)를' 125 | assert f(u'나(?)오', u'으로') == u'나(?)오로' 126 | assert f(u'헬로월드!', u'으로') == u'헬로월드!로' 127 | assert f(u'?_?', u'으로') == u'?_?(으)로' 128 | assert f(u'임창정,,,', u'가') == u'임창정,,,이' 129 | assert f(u'《듀랑고》', u'을') == u'《듀랑고》를' 130 | assert f(u'불완전괄호)', u'은') == u'불완전괄호)는' 131 | assert f(u'이상한괄호)))', u'는') == u'이상한괄호)))는' 132 | assert f(u'이상한괄호)()', u'은') == u'이상한괄호)()는' 133 | assert f(u'이상한괄호())', u'(는)은') == u'이상한괄호())는' 134 | assert f(u'^_^', u'이었다.') == u'^_^(이)었다.' 135 | assert f(u'웃는얼굴^_^', u'이었다.') == u'웃는얼굴^_^이었다.' 136 | assert f(u'폭탄(가짜)...', u'이었다.') == u'폭탄(가짜)...이었다.' 137 | assert f(u'16(7)?!', u'으로') == u'16(7)?!으로' 138 | assert f(u'7(16)?!', u'으로') == u'7(16)?!로' 139 | assert f(u'검색\ue000', u'를') == u'검색\ue000을' 140 | 141 | 142 | def test_only_parentheses(): 143 | assert f(u'(1, 2)', u'를') == u'(1, 2)를' 144 | assert f(u'(2, 3)', u'를') == u'(2, 3)을' 145 | 146 | 147 | def test_vocative_particles(): 148 | assert f(u'친구', u'야') == u'친구야' 149 | assert f(u'사랑', u'야') == u'사랑아' 150 | assert f(u'사랑', u'아') == u'사랑아' 151 | assert f(u'친구', u'여') == u'친구여' 152 | assert f(u'사랑', u'여') == u'사랑이여' 153 | assert f(u'하늘', u'이시여') == u'하늘이시여' 154 | assert f(u'바다', u'이시여') == u'바다시여' 155 | 156 | 157 | def test_ida(): 158 | """Cases for '이다' which is a copulative and existential verb.""" 159 | # Do or don't inject '이'. 160 | assert f(u'나오', u'이다') == u'나오다' 161 | assert f(u'키홀', u'이다') == u'키홀이다' 162 | # Merge with the following vowel as /j/. 163 | assert f(u'나오', u'이에요') == u'나오예요' 164 | assert f(u'키홀', u'이에요') == u'키홀이에요' 165 | # No allomorphs. 166 | assert f(u'나오', u'입니다') == u'나오입니다' 167 | assert f(u'키홀', u'입니다') == u'키홀입니다' 168 | # Give up to select an allomorph. 169 | assert f(u'God', u'이다') == u'God(이)다' 170 | assert f(u'God', u'이에요') == u'God(이)에요' 171 | assert f(u'God', u'입니다') == u'God입니다' 172 | assert f(u'God', u'였습니다') == u'God(이)었습니다' 173 | # Many examples. 
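    # 나오 (vowel-final) drops or fuses 이; 키홀 (consonant-final) keeps it,
    # and 여/예 lengthen back to 이어/이에 after a final consonant.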
174 | assert f(u'키홀', u'였습니다') == u'키홀이었습니다' 175 | assert f(u'나오', u'였습니다') == u'나오였습니다' 176 | assert f(u'나오', u'이었다') == u'나오였다' 177 | assert f(u'나오', u'이었지만') == u'나오였지만' 178 | assert f(u'나오', u'이지만') == u'나오지만' 179 | assert f(u'키홀', u'이지만') == u'키홀이지만' 180 | assert f(u'나오', u'지만') == u'나오지만' 181 | assert f(u'키홀', u'지만') == u'키홀이지만' 182 | assert f(u'나오', u'다') == u'나오다' 183 | assert f(u'키홀', u'다') == u'키홀이다' 184 | assert f(u'나오', u'이에요') == u'나오예요' 185 | assert f(u'키홀', u'이에요') == u'키홀이에요' 186 | assert f(u'나오', u'고') == u'나오고' 187 | assert f(u'키홀', u'고') == u'키홀이고' 188 | assert f(u'모리안', u'고') == u'모리안이고' 189 | assert f(u'나오', u'여서') == u'나오여서' 190 | assert f(u'키홀', u'여서') == u'키홀이어서' 191 | assert f(u'나오', u'이어서') == u'나오여서' 192 | assert f(u'키홀', u'라고라') == u'키홀이라고라' 193 | assert f(u'키홀', u'든지') == u'키홀이든지' 194 | assert f(u'키홀', u'던가') == u'키홀이던가' 195 | assert f(u'키홀', u'여도') == u'키홀이어도' 196 | assert f(u'키홀', u'야말로') == u'키홀이야말로' 197 | assert f(u'키홀', u'인양') == u'키홀인양' 198 | assert f(u'나오', u'인양') == u'나오인양' 199 | 200 | 201 | def test_invariant_particles(): 202 | assert f(u'나오', u'도') == u'나오도' 203 | assert f(u'모리안', u'도') == u'모리안도' 204 | assert f(u'판교', u'에서') == u'판교에서' 205 | assert f(u'판교', u'에서는') == u'판교에서는' 206 | assert f(u'선생님', u'께서도') == u'선생님께서도' 207 | assert f(u'나오', u'의') == u'나오의' 208 | assert f(u'모리안', u'만') == u'모리안만' 209 | assert f(u'키홀', u'하고') == u'키홀하고' 210 | assert f(u'콩', u'만큼') == u'콩만큼' 211 | assert f(u'콩', u'마냥') == u'콩마냥' 212 | assert f(u'콩', u'처럼') == u'콩처럼' 213 | 214 | 215 | def test_tolerances(): 216 | assert f(u'나오', u'은(는)') == u'나오는' 217 | assert f(u'나오', u'(은)는') == u'나오는' 218 | assert f(u'나오', u'는(은)') == u'나오는' 219 | assert f(u'나오', u'(는)은') == u'나오는' 220 | 221 | 222 | def test_decimal(): 223 | assert f(u'레벨30', u'이') == u'레벨30이' 224 | assert f(u'레벨34', u'이') == u'레벨34가' 225 | assert f(u'레벨7', u'으로') == u'레벨7로' 226 | assert f(u'레벨42', u'으로') == u'레벨42로' 227 | assert f(u'레벨100', u'으로') == u'레벨100으로' 228 | assert pick_coda_from_decimal('1') == u'ㄹ' 229 | assert pick_coda_from_decimal('2') == u'' 230 | assert pick_coda_from_decimal('3') == u'ㅁ' 231 | assert pick_coda_from_decimal('10') == u'ㅂ' 232 | assert pick_coda_from_decimal('16') == u'ㄱ' 233 | assert pick_coda_from_decimal('19') == u'' 234 | assert pick_coda_from_decimal('200') == u'ㄱ' 235 | assert pick_coda_from_decimal('30000') == u'ㄴ' 236 | assert pick_coda_from_decimal('400000') == u'ㄴ' 237 | assert pick_coda_from_decimal('500000000') == u'ㄱ' 238 | assert pick_coda_from_decimal('1' + '0' * 50) == u'ㄱ' 239 | assert pick_coda_from_decimal('1' + '0' * 100) is None 240 | assert pick_coda_from_decimal('0') == u'ㅇ' 241 | assert pick_coda_from_decimal('1.0') == u'ㅇ' 242 | assert pick_coda_from_decimal('1.234567890') == u'ㅇ' 243 | assert pick_coda_from_decimal('3.14') == u'' 244 | 245 | 246 | def test_match(): 247 | # (n)eun 248 | assert Eun.match(u'은') == u'' 249 | assert Eun.match(u'는') == u'' 250 | assert Eun.match(u'은(는)') == u'' 251 | assert Eun.match(u'는(은)') == u'' 252 | assert Eun.match(u'(은)는') == u'' 253 | assert Eun.match(u'(는)은') == u'' 254 | assert Eun.match(u'는는') == u'는' 255 | # (r)eul (final=True) 256 | assert Eul.match(u'를') == u'' 257 | assert Eul.match(u'을을') is None 258 | # (g)wa 259 | assert Gwa.match(u'과') == u'' 260 | assert Gwa.match(u'과는') == u'는' 261 | assert Gwa.match(u'관') == u'ㄴ' 262 | # (eu)ro 263 | assert Euro.match(u'으로도') == u'도' 264 | assert Euro.match(u'론') == u'ㄴ' 265 | 266 | 267 | def test_combine(): 268 | assert Euro[u'집':u'로'] == u'으로' 269 | assert Euro[u'집':u'론'] == 
u'으론'
    assert Euro[u'집':u'로는'] == u'으로는'
    assert Euro[u'집':u'론123'] == u'으론123'


def test_tolerances_for_coda_combination():
    assert Euro[u'Hello':u'론'] == u'(으)론'
    assert Gwa[u'Hello':u'완'] == u'관(완)'
    assert Gwa[u'Hello':u'완':OPTIONAL_MORPH2_AND_MORPH1] == u'(완)관'
    assert Gwa[u'Hello':u'완완완'] == u'관(완)완완'
    assert Particle(u'크', u'')[u'Hello':u'큰큰'] == u'(큰)큰'


def test_igyuho2006():
    """Particles from I Gyu-ho, 2006."""
    def ff(particle_string):
        return f(u'남', particle_string), f(u'나', particle_string)
    # p181-182:
    assert ff(u'의') == (u'남의', u'나의')
    assert ff(u'과') == (u'남과', u'나와')
    assert ff(u'와') == (u'남과', u'나와')
    assert ff(u'하고') == (u'남하고', u'나하고')
    assert ff(u'이랑') == (u'남이랑', u'나랑')
    assert ff(u'이니') == (u'남이니', u'나니')
    assert ff(u'이다') == (u'남이다', u'나다')
    assert ff(u'이라든가') == (u'남이라든가', u'나라든가')
    assert ff(u'이라든지') == (u'남이라든지', u'나라든지')
    assert ff(u'이며') == (u'남이며', u'나며')
    assert ff(u'이야') == (u'남이야', u'나야')
    assert ff(u'이요') == (u'남이요', u'나요')
    assert ff(u'이랴') == (u'남이랴', u'나랴')
    assert ff(u'에') == (u'남에', u'나에')
    assert ff(u'하며') == (u'남하며', u'나하며')
    assert ff(u'커녕') == (u'남커녕', u'나커녕')
    assert ff(u'은커녕') == (u'남은커녕', u'나는커녕')
    assert ff(u'이고') == (u'남이고', u'나고')
    assert ff(u'이나') == (u'남이나', u'나나')
    assert ff(u'에다') == (u'남에다', u'나에다')
    assert ff(u'에다가') == (u'남에다가', u'나에다가')
    assert ff(u'이란') == (u'남이란', u'나란')
    assert ff(u'이면') == (u'남이면', u'나면')
    assert ff(u'이거나') == (u'남이거나', u'나거나')
    assert ff(u'이건') == (u'남이건', u'나건')
    assert ff(u'이든') == (u'남이든', u'나든')
    assert ff(u'이든가') == (u'남이든가', u'나든가')
    assert ff(u'이든지') == (u'남이든지', u'나든지')
    assert ff(u'인가') == (u'남인가', u'나인가')
    assert ff(u'인지') == (u'남인지', u'나인지')
    # p188-189:
    assert ff(u'인') == (u'남인', u'나인')
    assert ff(u'는') == (u'남은', u'나는')
    assert ff(u'이라는') == (u'남이라는', u'나라는')
    assert ff(u'이네') == (u'남이네', u'나네')
    assert ff(u'도') == (u'남도', u'나도')
    assert ff(u'이면서') == (u'남이면서', u'나면서')
    assert ff(u'이자') == (u'남이자', u'나자')
    assert ff(u'하고도') == (u'남하고도', u'나하고도')
    assert ff(u'이냐') == (u'남이냐', u'나냐')


def test_tolerance_style():
    assert Gwa[u'Hello'::OPTIONAL_MORPH2_AND_MORPH1] == u'(와)과'
    assert parse_tolerance_style(0) == MORPH1_AND_OPTIONAL_MORPH2
    assert parse_tolerance_style(u'을(를)') == MORPH1_AND_OPTIONAL_MORPH2
    assert parse_tolerance_style(u'(를)을') == OPTIONAL_MORPH2_AND_MORPH1
    with pytest.raises(ValueError):
        parse_tolerance_style(u'과')
    with pytest.raises(ValueError):
        parse_tolerance_style(u'이다')
    with pytest.raises(ValueError):
        parse_tolerance_style(u'(이)')
    assert get_tolerance([u'예제'], OPTIONAL_MORPH2_AND_MORPH1) == u'예제'
    assert get_tolerance_from_iterator(iter([u'예제']),
                                       OPTIONAL_MORPH2_AND_MORPH1) == u'예제'


def test_static_tolerance_style():
    assert f(u'나오', u'을', tolerance_style=u'을/를') == u'나오를'
    assert f(u'키홀', u'를', tolerance_style=u'을/를') == u'키홀을'
    assert f(u'Tossi', u'을', tolerance_style=u'을/를') == u'Tossi을/를'


def test_pick():
    assert tossi.pick(u'나오', u'을') == u'를'
    assert tossi.pick(u'키홀', u'를') == u'을'
    assert tossi.pick(u'남', u'면서') == u'이면서'
    assert tossi.pick(u'Tossi', u'을') == u'을(를)'
    assert tossi.pick(u'Tossi', u'을',
                      tolerance_style=u'을/를') == u'을/를'


def test_custom_guess_coda():
    def dont_guess_coda(word):
        return None
    assert Euro.allomorph(u'밖', u'으로',
                          guess_coda=dont_guess_coda) == u'(으)로'


def test_unmatch():
    assert Eul[u'예제':u'는'] is None


def test_formatter():
    t = u'{0:으로} {0:을}'
    f1 = functools.partial(tossi.Formatter(registry).format, t)
    f2 = functools.partial(tossi.format, t)
    assert f1(u'나오') == f2(u'나오') == u'나오로 나오를'
    assert f1(u'키홀') == f2(u'키홀') == u'키홀로 키홀을'
    assert f1(u'모리안') == f2(u'모리안') == u'모리안으로 모리안을'


def test_singleton_error():
    with pytest.raises(TypeError):
        class Fail(with_metaclass(SingletonParticleMeta, object)):
            pass


def test_deprecations():
    pytest.deprecated_call(registry.postfix_particle, u'테스트', u'으로부터')
    pytest.deprecated_call(tossi.postfix_particle, u'테스트', u'으로부터')
    pytest.deprecated_call(tossi.get_particle, u'으로부터')
--------------------------------------------------------------------------------
/tossi/__about__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
tossi.__about__
~~~~~~~~~~~~~~~
"""
__version__ = '0.3.1'
__license__ = 'BSD'
__author__ = 'What! Studio'
__maintainer__ = 'Heungsub Lee'
__maintainer_email__ = 'sub@nexon.co.kr'
--------------------------------------------------------------------------------
/tossi/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
tossi
~~~~~

Supports Korean particles.

:copyright: (c) 2016-2017 by What! Studio
:license: BSD, see LICENSE for more details.

"""
import re
import warnings

from tossi.coda import guess_coda
from tossi.formatter import Formatter
from tossi.particles import Euro, Ida, Particle
from tossi.tolerance import (
    MORPH1_AND_OPTIONAL_MORPH2, MORPH2_AND_OPTIONAL_MORPH1,
    OPTIONAL_MORPH1_AND_MORPH2, OPTIONAL_MORPH2_AND_MORPH1,
    parse_tolerance_style)


__all__ = ['get_particle', 'guess_coda', 'MORPH1_AND_OPTIONAL_MORPH2',
           'MORPH2_AND_OPTIONAL_MORPH1', 'OPTIONAL_MORPH1_AND_MORPH2',
           'OPTIONAL_MORPH2_AND_MORPH1', 'parse', 'parse_tolerance_style',
           'Particle', 'pick', 'postfix', 'postfix_particle',
           'Formatter', 'format']


def index_particles(particles):
    """Indexes :class:`Particle` objects. It returns a regex pattern that
    matches any particle morph, and a dictionary that maps each regex group
    name to the index of its particle.
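    For example, given two particles the compiled pattern looks like
    ``(?P<_0>...)|(?P<_1>...)`` and the dictionary is ``{'_0': 0, '_1': 1}``.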
35 | """ 36 | patterns, indices = [], {} 37 | for x, p in enumerate(particles): 38 | group = u'_%d' % x 39 | indices[group] = x 40 | patterns.append(u'(?P<%s>%s)' % (group, p.regex_pattern())) 41 | pattern = re.compile(u'|'.join(patterns)) 42 | return pattern, indices 43 | 44 | 45 | class ParticleRegistry(object): 46 | 47 | __slots__ = ('default', 'particles', 'pattern', 'indices') 48 | 49 | def __init__(self, default, particles): 50 | self.default = default 51 | self.particles = particles 52 | self.pattern, self.indices = index_particles(particles) 53 | 54 | def _get_by_match(self, match): 55 | x = self.indices[match.lastgroup] 56 | return self.particles[x] 57 | 58 | def parse(self, morph): 59 | m = self.pattern.match(morph) 60 | if m is None: 61 | return self.default 62 | return self._get_by_match(m) 63 | 64 | def pick(self, word, morph, **kwargs): 65 | particle = self.parse(morph) 66 | return particle.allomorph(word, morph, **kwargs) 67 | 68 | def postfix(self, word, morph, **kwargs): 69 | return word + self.pick(word, morph, **kwargs) 70 | 71 | def get(self, morph): 72 | warnings.warn(DeprecationWarning('Use parse() instead')) 73 | return self.parse(morph) 74 | 75 | def postfix_particle(self, word, morph, **kwargs): 76 | warnings.warn(DeprecationWarning('Use postfix() instead')) 77 | return self.postfix(word, morph, **kwargs) 78 | 79 | 80 | #: The default registry for well-known Korean particles. 81 | registry = ParticleRegistry(Ida, [ 82 | # Simple allomorphic rule: 83 | Particle(u'이', u'가', final=True), 84 | Particle(u'을', u'를', final=True), 85 | Particle(u'은', u'는'), # "은(는)" includes "은(는)커녕". 86 | Particle(u'과', u'와'), 87 | # Vocative particles: 88 | Particle(u'아', u'야', final=True), 89 | Particle(u'이여', u'여', final=True), 90 | Particle(u'이시여', u'시여', final=True), 91 | # Invariant particles: 92 | Particle(u'의', final=True), 93 | Particle(u'도', final=True), 94 | Particle(u'만'), 95 | Particle(u'에'), 96 | Particle(u'께'), 97 | Particle(u'뿐'), 98 | Particle(u'하'), 99 | Particle(u'보다'), 100 | Particle(u'밖에'), 101 | Particle(u'같이'), 102 | Particle(u'부터'), 103 | Particle(u'까지'), 104 | Particle(u'마저'), 105 | Particle(u'조차'), 106 | Particle(u'마냥'), 107 | Particle(u'처럼'), 108 | Particle(u'커녕'), 109 | # Special particles: 110 | Euro, 111 | ]) 112 | formatter = Formatter(registry) 113 | 114 | 115 | def parse(morph): 116 | """Shortcut for :class:`ParticleRegistry.parse` of the default registry.""" 117 | return registry.parse(morph) 118 | 119 | 120 | def pick(word, morph, **kwargs): 121 | """Shortcut for :class:`ParticleRegistry.pick` of the default registry. 122 | """ 123 | return registry.pick(word, morph, **kwargs) 124 | 125 | 126 | def postfix(word, morph, **kwargs): 127 | """Shortcut for :class:`ParticleRegistry.postfix` of the default registry. 128 | """ 129 | return registry.postfix(word, morph, **kwargs) 130 | 131 | 132 | def get_particle(morph): 133 | warnings.warn(DeprecationWarning('Use parse() instead')) 134 | return parse(morph) 135 | 136 | 137 | def postfix_particle(word, morph, **kwargs): 138 | warnings.warn(DeprecationWarning('Use postfix() instead')) 139 | return postfix(word, morph, **kwargs) 140 | 141 | 142 | def format(message, *args, **kwargs): 143 | """Shortcut for :class:`tossi.Formatter.format` of the default registry. 
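
    >>> tossi.format(u'{0:이} {1:을} 만났다', u'키홀', u'나오')
    키홀이 나오를 만났다
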
    """
    return formatter.vformat(message, args, kwargs)
--------------------------------------------------------------------------------
/tossi/coda.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
tossi.coda
~~~~~~~~~~

A coda is the final consonant of a Korean syllable. It is important because
it is required when determining a particle allomorph in Korean.

This module implements :func:`guess_coda` and related functions to guess
the coda of any word as correctly as possible.

:copyright: (c) 2016-2017 by What! Studio
:license: BSD, see LICENSE for more details.

"""
from bisect import bisect_right
from decimal import Decimal
import re
import unicodedata

from tossi.hangul import split_phonemes


__all__ = ['filter_only_significant', 'guess_coda',
           'guess_coda_from_significant_word', 'pick_coda_from_decimal',
           'pick_coda_from_letter']


#: Matches a decimal at the end of a word.
DECIMAL_PATTERN = re.compile(r'[0-9]+(\.[0-9]+)?$')


def guess_coda(word):
    """Guesses the coda of the given word as correctly as possible. If it
    fails to guess the coda, returns ``None``.
    """
    word = filter_only_significant(word)
    return guess_coda_from_significant_word(word)


def guess_coda_from_significant_word(word):
    if not word:
        return None
    decimal_m = DECIMAL_PATTERN.search(word)
    if decimal_m:
        return pick_coda_from_decimal(decimal_m.group(0))
    return pick_coda_from_letter(word[-1])


# Patterns which match significant or insignificant letters at the end of
# words.
INSIGNIFICANT_PARENTHESIS_PATTERN = re.compile(r'\(.*?\)$')
SIGNIFICANT_UNICODE_CATEGORY_PATTERN = re.compile(r'^([LN].|S[cmo])$')


def filter_only_significant(word):
    """Strips insignificant letters from the end of the given word::

        >>> filter_only_significant(u'넥슨(코리아)')
        넥슨
        >>> filter_only_significant(u'메이플스토리...')
        메이플스토리

    """
    if not word:
        return word
    # Unwrap a complete parenthesis.
    if word.startswith(u'(') and word.endswith(u')'):
        return filter_only_significant(word[1:-1])
    x = len(word)
    while x > 0:
        x -= 1
        c = word[x]
        # Skip a complete parenthesis.
        if c == u')':
            m = INSIGNIFICANT_PARENTHESIS_PATTERN.search(word[:x + 1])
            if m is not None:
                x = m.start()
            continue
        # Skip unreadable characters such as punctuation.
        unicode_category = unicodedata.category(c)
        if not SIGNIFICANT_UNICODE_CATEGORY_PATTERN.match(unicode_category):
            continue
        break
    return word[:x + 1]


def pick_coda_from_letter(letter):
    """Picks only a coda from a Hangul letter. It returns ``None`` if the
    given letter is not Hangul.
    """
    try:
        __, __, coda = \
            split_phonemes(letter, onset=False, nucleus=False, coda=True)
    except ValueError:
        return None
    else:
        return coda


# Data for picking coda from a decimal.
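# Decimals are read out in Sino-Korean: the coda of the last syllable read
# aloud decides the particle form, e.g. 3 -> "삼" (coda ㅁ), 10 -> "십" (ㅂ),
# 100 -> "백" (ㄱ), 1000 -> "천" (ㄴ), 10000 -> "만" (ㄴ).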
DIGITS = u'영일이삼사오육칠팔구'
EXPS = {1: u'십', 2: u'백', 3: u'천', 4: u'만',
        8: u'억', 12: u'조', 16: u'경', 20: u'해',
        24: u'자', 28: u'양', 32: u'구', 36: u'간',
        40: u'정', 44: u'재', 48: u'극', 52: u'항하사',
        56: u'아승기', 60: u'나유타', 64: u'불가사의', 68: u'무량대수',
        72: u'겁', 76: u'업'}
DIGIT_CODAS = [pick_coda_from_letter(x[-1]) for x in DIGITS]
EXP_CODAS = {exp: pick_coda_from_letter(x[-1]) for exp, x in EXPS.items()}
EXP_INDICES = list(sorted(EXPS.keys()))


# Mark the first unreadable exponent.
_unreadable_exp = max(EXP_INDICES) + 4
EXP_CODAS[_unreadable_exp] = None
EXP_INDICES.append(_unreadable_exp)
del _unreadable_exp


def pick_coda_from_decimal(decimal):
    """Picks only a coda from a decimal."""
    decimal = Decimal(decimal)
    __, digits, exp = decimal.as_tuple()
    if exp < 0:
        return DIGIT_CODAS[digits[-1]]
    __, digits, exp = decimal.normalize().as_tuple()
    index = bisect_right(EXP_INDICES, exp) - 1
    if index < 0:
        return DIGIT_CODAS[digits[-1]]
    else:
        return EXP_CODAS[EXP_INDICES[index]]
--------------------------------------------------------------------------------
/tossi/formatter.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
tossi.formatter
~~~~~~~~~~~~~~~

String formatter for Tossi.

:copyright: (c) 2016-2017 by What! Studio
:license: BSD, see LICENSE for more details.

"""
import re
from string import Formatter as StringFormatter


class Formatter(StringFormatter):
    """A string formatter that supports the Tossi format spec.

    >>> f = Formatter(tossi.registry)
    >>> t = u'{0:으로} {0:을}'
    >>> assert f.format(t, u'나오') == u'나오로 나오를'
    >>> assert f.format(t, u'키홀') == u'키홀로 키홀을'
    >>> assert f.format(t, u'모리안') == u'모리안으로 모리안을'
    """
    hangul_pattern = re.compile(u'[가-힣]+')

    def __init__(self, registry):
        self.registry = registry

    def format_field(self, value, format_spec):
        if re.match(self.hangul_pattern, format_spec):
            return self.registry.postfix(value, format_spec)
        else:
            return super(Formatter, self).format_field(value, format_spec)
--------------------------------------------------------------------------------
/tossi/hangul.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
tossi.hangul
~~~~~~~~~~~~

Manipulates Hangul letters.

:copyright: (c) 2016-2017 by What! Studio
:license: BSD, see LICENSE for more details.

"""
from six import unichr


__all__ = ['combine_words', 'is_consonant', 'is_hangul', 'join_phonemes',
           'split_phonemes']


# Korean phonemes, also known as 자소, including
# onset(초성), nucleus(중성), and coda(종성).
ONSETS = list(u'ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ')
NUCLEUSES = list(u'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ')
CODAS = [u'']
CODAS.extend(u'ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ')

# Lengths of the phonemes.
NUM_ONSETS = len(ONSETS)
NUM_NUCLEUSES = len(NUCLEUSES)
NUM_CODAS = len(CODAS)

#: The Unicode offset of "가", which is the base offset for all Hangul letters.
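#: A Hangul syllable is encoded as FIRST_HANGUL_OFFSET +
#: (onset_index * NUM_NUCLEUSES + nucleus_index) * NUM_CODAS + coda_index,
#: with 21 nucleuses and 28 codas; join_phonemes() and split_phonemes()
#: below are inverses of each other over this formula.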
FIRST_HANGUL_OFFSET = ord(u'가')


def is_hangul(letter):
    return u'가' <= letter <= u'힣'


def is_consonant(letter):
    return u'ㄱ' <= letter <= u'ㅎ'


def join_phonemes(*args):
    """Joins a Hangul letter from Korean phonemes."""
    # Normalize arguments as onset, nucleus, coda.
    if len(args) == 1:
        # tuple of (onset, nucleus[, coda])
        args = args[0]
    if len(args) == 2:
        args += (CODAS[0],)
    try:
        onset, nucleus, coda = args
    except ValueError:
        raise TypeError('join_phonemes() takes at most 3 arguments')
    offset = (
        (ONSETS.index(onset) * NUM_NUCLEUSES + NUCLEUSES.index(nucleus)) *
        NUM_CODAS + CODAS.index(coda)
    )
    return unichr(FIRST_HANGUL_OFFSET + offset)


def split_phonemes(letter, onset=True, nucleus=True, coda=True):
    """Splits Korean phonemes, also known as "자소", from a Hangul letter.

    :returns: (onset, nucleus, coda)
    :raises ValueError: `letter` is not a single Hangul letter.

    """
    if len(letter) != 1 or not is_hangul(letter):
        raise ValueError('Not Hangul letter: %r' % letter)
    offset = ord(letter) - FIRST_HANGUL_OFFSET
    phonemes = [None] * 3
    if onset:
        phonemes[0] = ONSETS[offset // (NUM_NUCLEUSES * NUM_CODAS)]
    if nucleus:
        phonemes[1] = NUCLEUSES[(offset // NUM_CODAS) % NUM_NUCLEUSES]
    if coda:
        phonemes[2] = CODAS[offset % NUM_CODAS]
    return tuple(phonemes)


def combine_words(word1, word2):
    """Combines two words. If the first word ends with a vowel and the second
    word begins with a lone consonant, they are merged into a single letter::

        >>> combine_words(u'다', u'ㄺ')
        닭
        >>> combine_words(u'가오', u'ㄴ누리')
        가온누리

    """
    if word1 and word2 and is_consonant(word2[0]):
        onset, nucleus, coda = split_phonemes(word1[-1])
        if not coda:
            glue = join_phonemes(onset, nucleus, word2[0])
            return word1[:-1] + glue + word2[1:]
    return word1 + word2
--------------------------------------------------------------------------------
/tossi/particles.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
tossi.particles
~~~~~~~~~~~~~~~

Models for Korean allomorphic particles.

:copyright: (c) 2016-2017 by What! Studio
:license: BSD, see LICENSE for more details.

"""
from itertools import chain
import re

from bidict import bidict
from six import PY2, python_2_unicode_compatible, text_type, with_metaclass

from tossi.coda import guess_coda, pick_coda_from_letter
from tossi.hangul import (
    combine_words, is_consonant, join_phonemes, split_phonemes)
from tossi.tolerance import (
    generate_tolerances, get_tolerance, get_tolerance_from_iterator,
    MORPH1_AND_OPTIONAL_MORPH2)
from tossi.utils import cached_property, CacheMeta


__all__ = ['DEFAULT_GUESS_CODA', 'DEFAULT_TOLERANCE_STYLE',
           'Euro', 'Ida', 'Particle']


#: The default tolerance style.
DEFAULT_TOLERANCE_STYLE = MORPH1_AND_OPTIONAL_MORPH2

#: The default function to guess the coda from a word.
DEFAULT_GUESS_CODA = guess_coda


@python_2_unicode_compatible
class Particle(with_metaclass(CacheMeta)):
    """Represents a Korean particle, also known as "조사".

    This also implements the general allomorphic rule for most common
    particles.

    :param morph1: an allomorph after a consonant.
    :param morph2: an allomorph after a vowel. If it is omitted, there's no
                   alternative allomorph, so `morph1` will always be
                   selected.
    :param final: whether the particle disallows combination with other
                  postpositions. (default: ``False``)

    """

    __slots__ = ('morph1', 'morph2', 'final')

    def __init__(self, morph1, morph2=None, final=False):
        self.morph1 = morph1
        self.morph2 = morph1 if morph2 is None else morph2
        self.final = final

    @cached_property
    def tolerances(self):
        """The tuple containing all the possible tolerant morphs."""
        return tuple(generate_tolerances(self.morph1, self.morph2))

    def tolerance(self, style=DEFAULT_TOLERANCE_STYLE):
        """Gets a tolerant morph."""
        return get_tolerance(self.tolerances, style)

    def rule(self, coda):
        """Determines one of the allomorphic morphs based on a coda."""
        if coda:
            return self.morph1
        else:
            return self.morph2

    def allomorph(self, word, morph, tolerance_style=DEFAULT_TOLERANCE_STYLE,
                  guess_coda=DEFAULT_GUESS_CODA):
        """Determines one of the allomorphic morphs based on a word.

        .. seealso:: :meth:`__getitem__`.

        """
        suffix = self.match(morph)
        if suffix is None:
            return None
        coda = guess_coda(word)
        if coda is not None:
            # Coda guessed successfully.
            morph = self.rule(coda)
        elif isinstance(tolerance_style, text_type):
            # The user specified a static tolerant morph themselves.
            morph = tolerance_style
        elif not suffix or not is_consonant(suffix[0]):
            # Choose the tolerant morph.
            morph = self.tolerance(tolerance_style)
        else:
            # The suffix starts with a consonant. Generate a new tolerant
            # morph from the combined morphs.
            morph1 = (combine_words(self.morph1, suffix)
                      if self.morph1 else suffix[1:])
            morph2 = (combine_words(self.morph2, suffix)
                      if self.morph2 else suffix[1:])
            tolerances = generate_tolerances(morph1, morph2)
            return get_tolerance_from_iterator(tolerances, tolerance_style)
        return combine_words(morph, suffix)

    def __getitem__(self, key):
        """Syntax sugar to determine one of the allomorphic morphs based on
        a word::

           eun = Particle(u'은', u'는')
           assert eun[u'나오'] == u'는'
           assert eun[u'모리안'] == u'은'

        """
        if isinstance(key, slice):
            word = key.start
            morph = key.stop or self.morph1
            tolerance_style = key.step or DEFAULT_TOLERANCE_STYLE
        else:
            word, morph = key, self.morph1
            tolerance_style = DEFAULT_TOLERANCE_STYLE
        return self.allomorph(word, morph, tolerance_style)

    @cached_property
    def regex(self):
        return re.compile(self.regex_pattern())

    @cached_property
    def morphs(self):
        """The tuple containing the given morphs and all the possible
        tolerant morphs. Longer morphs come first.
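
        For example, for ``Particle(u'은', u'는')`` this is
        ``(u'은(는)', u'(은)는', u'는(은)', u'(는)은', u'은', u'는')``.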
        """
        seen = set()
        saw = seen.add
        morphs = chain([self.morph1, self.morph2], self.tolerances)
        unique_morphs = (x for x in morphs if x and not (x in seen or saw(x)))
        return tuple(sorted(unique_morphs, key=len, reverse=True))

    def match(self, morph):
        m = self.regex.match(morph)
        if m is None:
            return None
        x = m.end()
        if self.final or m.group() == self.morphs[m.lastindex - 1]:
            return morph[x:]
        coda = pick_coda_from_letter(morph[x - 1])
        return coda + morph[x:]

    def regex_pattern(self):
        if self.final:
            return u'^(?:%s)$' % u'|'.join(re.escape(f) for f in self.morphs)
        patterns = []
        for morph in self.morphs:
            try:
                onset, nucleus, coda = split_phonemes(morph[-1])
            except ValueError:
                coda = None
            if coda == u'':
                start = morph[-1]
                end = join_phonemes(onset, nucleus, u'ㅎ')
                pattern = re.escape(morph[:-1]) + u'[%s-%s]' % (start, end)
            else:
                pattern = re.escape(morph)
            patterns.append(pattern)
        return u'^(?:%s)' % u'|'.join(u'(%s)' % p for p in patterns)

    def __str__(self):
        return self.tolerance()

    if PY2:
        def __repr__(self):
            try:
                from unidecode import unidecode
            except ImportError:
                return '<Particle: %r>' % self.tolerance()
            else:
                return '<Particle: %s>' % unidecode(self.tolerance())
    else:
        def __repr__(self):
            return '<Particle: %s>' % self.tolerance()


class SingletonParticleMeta(type(Particle)):

    def __new__(meta, name, bases, attrs):
        base_meta = super(SingletonParticleMeta, meta)
        cls = base_meta.__new__(meta, name, bases, attrs)
        if not issubclass(cls, Particle):
            raise TypeError('Not particle class')
        # Instantiate directly instead of returning a class.
        return cls()


class SingletonParticle(Particle):

    # Concrete classes should set these strings.
    morph1 = morph2 = final = NotImplemented

    def __init__(self):
        pass


def singleton_particle(*bases):
    """Defines a singleton instance immediately when defining the class. The
    name of the class will refer to the instance instead.
    """
    return with_metaclass(SingletonParticleMeta, SingletonParticle, *bases)


class Euro(singleton_particle(Particle)):
    """Particles starting with "으로" have a special allomorphic rule after
    coda "ㄹ". "으로" can also be extended with suffixes such as "으로서"
    and "으로부터".
    """

    __slots__ = ()

    morph1 = u'으로'
    morph2 = u'로'
    final = False

    def rule(self, coda):
        if coda and coda != u'ㄹ':
            return self.morph1
        else:
            return self.morph2


class Ida(singleton_particle(Particle)):
    """"이다" is a verbal particle. Like other Korean verbs, it is also
    fusional.
    """

    __slots__ = ()

    morph1 = u'이'
    morph2 = u''
    final = False

    #: Matches an initial "이" or "(이)" to normalize fused verbal morphs.
    I_PATTERN = re.compile(u'^이|\(이\)')

    #: The mapping for vowels which should be transformed by /j/ injection.
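    # e.g. u'나오' + u'이에요' -> u'나오예요' (ㅔ -> ㅖ); inversely,
    # u'키홀' + u'여도' -> u'키홀이어도' (ㅕ -> ㅓ).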
246 | J_INJECTIONS = bidict({u'ㅓ': u'ㅕ', u'ㅔ': u'ㅖ'}) 247 | 248 | def allomorph(self, word, morph, tolerance_style=DEFAULT_TOLERANCE_STYLE, 249 | guess_coda=DEFAULT_GUESS_CODA): 250 | suffix = self.I_PATTERN.sub(u'', morph) 251 | coda = guess_coda(word) 252 | next_onset, next_nucleus, next_coda = split_phonemes(suffix[0]) 253 | if next_onset == u'ㅇ': 254 | if next_nucleus == u'ㅣ': 255 | # No allomorphs when a morph starts with "이" and has a coda. 256 | return suffix 257 | mapping = None 258 | if coda == u'' and next_nucleus in self.J_INJECTIONS: 259 | # Squeeze "이어" or "이에" to "여" or "예" 260 | # after a word which ends with a nucleus. 261 | mapping = self.J_INJECTIONS 262 | elif coda != u'' and next_nucleus in self.J_INJECTIONS.inv: 263 | # Lengthen "여" or "예" to "이어" or "이에" 264 | # after a word which ends with a consonant. 265 | mapping = self.J_INJECTIONS.inv 266 | if mapping is not None: 267 | next_nucleus = mapping[next_nucleus] 268 | next_letter = join_phonemes(u'ㅇ', next_nucleus, next_coda) 269 | suffix = next_letter + suffix[1:] 270 | if coda is None: 271 | morph = self.tolerance(tolerance_style) 272 | else: 273 | morph = self.rule(coda) 274 | return morph + suffix 275 | -------------------------------------------------------------------------------- /tossi/tolerance.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | tossi.tolerance 4 | ~~~~~~~~~~~~~~~ 5 | 6 | Utilities for tolerant particle morphs. 7 | 8 | :copyright: (c) 2016-2017 by What! Studio 9 | :license: BSD, see LICENSE for more details. 10 | 11 | """ 12 | from six import integer_types 13 | 14 | 15 | __all__ = ['generate_tolerances', 'get_tolerance', 16 | 'get_tolerance_from_iterator', 'parse_tolerance_style'] 17 | 18 | 19 | # Tolerance styles: 20 | MORPH1_AND_OPTIONAL_MORPH2 = 0 # 은(는) 21 | OPTIONAL_MORPH1_AND_MORPH2 = 1 # (은)는 22 | MORPH2_AND_OPTIONAL_MORPH1 = 2 # 는(은) 23 | OPTIONAL_MORPH2_AND_MORPH1 = 3 # (는)은 24 | 25 | 26 | def generate_tolerances(morph1, morph2): 27 | """Generates all reasonable tolerant particle morphs:: 28 | 29 | >>> set(generate_tolerances(u'이', u'가')) 30 | set([u'이(가)', u'(이)가', u'가(이)', u'(가)이']) 31 | >>> set(generate_tolerances(u'이면', u'면')) 32 | set([u'(이)면']) 33 | 34 | """ 35 | if morph1 == morph2: 36 | # Tolerance not required. 37 | return 38 | if not (morph1 and morph2): 39 | # Null allomorph exists. 40 | yield u'(%s)' % (morph1 or morph2) 41 | return 42 | len1, len2 = len(morph1), len(morph2) 43 | if len1 != len2: 44 | longer, shorter = (morph1, morph2) if len1 > len2 else (morph2, morph1) 45 | if longer.endswith(shorter): 46 | # Longer morph ends with shorter morph. 47 | yield u'(%s)%s' % (longer[:-len(shorter)], shorter) 48 | return 49 | # Find common suffix between two morphs. 50 | for x, (let1, let2) in enumerate(zip(reversed(morph1), reversed(morph2))): 51 | if let1 != let2: 52 | break 53 | if x: 54 | # They share the common suffix. 55 | x1, x2 = len(morph1) - x, len(morph2) - x 56 | common_suffix = morph1[x1:] 57 | morph1, morph2 = morph1[:x1], morph2[:x2] 58 | else: 59 | # No similarity with each other. 
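        # e.g. u'과' and u'와' share no suffix, so all four of
        # 과(와), (과)와, 와(과), (와)과 are generated below.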
        common_suffix = ''
    for morph1, morph2 in [(morph1, morph2), (morph2, morph1)]:
        yield u'%s(%s)%s' % (morph1, morph2, common_suffix)
        yield u'(%s)%s%s' % (morph1, morph2, common_suffix)


def parse_tolerance_style(style, registry=None):
    """Resolves a tolerance style of the given tolerant particle morph::

        >>> parse_tolerance_style(u'은(는)')
        0
        >>> parse_tolerance_style(u'(은)는')
        1
        >>> parse_tolerance_style(OPTIONAL_MORPH2_AND_MORPH1)
        3

    """
    if isinstance(style, integer_types):
        return style
    if registry is None:
        from . import registry
    particle = registry.parse(style)
    if len(particle.tolerances) != 4:
        raise ValueError('Set tolerance style by general allomorphic particle')
    return particle.tolerances.index(style)


def get_tolerance(tolerances, style):
    try:
        return tolerances[style]
    except IndexError:
        return tolerances[0]


def get_tolerance_from_iterator(tolerances, style):
    for x, morph in enumerate(tolerances):
        if style == x:
            return morph
    return morph
--------------------------------------------------------------------------------
/tossi/utils.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
tossi.utils
~~~~~~~~~~~

Utilities for internal use.

:copyright: (c) 2016-2017 by What! Studio
:license: BSD, see LICENSE for more details.

"""
import functools


__all__ = ['cached_property', 'CacheMeta']


def cached_property(f):
    """Similar to `@property` but it calls the function just once and caches
    the result. The object must be able to have a ``__cache__`` attribute.

    If you define `__slots__` for optimization, the metaclass should be a
    :class:`CacheMeta`.

    """
    @property
    @functools.wraps(f)
    def wrapped(self, name=f.__name__):
        try:
            cache = self.__cache__
        except AttributeError:
            self.__cache__ = cache = {}
        try:
            return cache[name]
        except KeyError:
            cache[name] = rv = f(self)
            return rv
    return wrapped


class CacheMeta(type):

    def __new__(meta, name, bases, attrs):
        if '__slots__' in attrs:
            attrs['__slots__'] += ('__cache__',)
        return super(CacheMeta, meta).__new__(meta, name, bases, attrs)
--------------------------------------------------------------------------------
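
For reference, a minimal sketch of how `cached_property` and `CacheMeta` from
`tossi/utils.py` combine with `__slots__`, the same pattern `Particle` uses;
the `Probe` class here is a hypothetical example, not part of the library:

```python
# -*- coding: utf-8 -*-
from six import with_metaclass

from tossi.utils import cached_property, CacheMeta


class Probe(with_metaclass(CacheMeta)):
    # CacheMeta appends '__cache__' to __slots__, so instances stay
    # slot-only yet can still memoize cached_property results.
    __slots__ = ('value',)

    def __init__(self, value):
        self.value = value

    @cached_property
    def doubled(self):
        # Evaluated only on the first access, then cached.
        return self.value * 2


probe = Probe(21)
assert probe.doubled == 42  # computed here
assert probe.doubled == 42  # served from probe.__cache__
```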