├── kobart_transformers
│   ├── __init__.py
│   └── load_model.py
├── setup.py
├── .gitignore
└── README.md

--------------------------------------------------------------------------------
/kobart_transformers/__init__.py:
--------------------------------------------------------------------------------
from kobart_transformers.load_model import (
    get_kobart_model,
    get_kobart_tokenizer,
    get_kobart_for_conditional_generation,
)

__all__ = ["get_kobart_model", "get_kobart_tokenizer", "get_kobart_for_conditional_generation"]
--------------------------------------------------------------------------------
/kobart_transformers/load_model.py:
--------------------------------------------------------------------------------
from transformers import BartModel, PreTrainedTokenizerFast, BartForConditionalGeneration


def get_kobart_model():
    # Bare encoder-decoder backbone (no LM head), for feature extraction.
    return BartModel.from_pretrained("hyunwoongko/kobart")


def get_kobart_for_conditional_generation():
    # Backbone plus LM head, for seq2seq training and text generation.
    return BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")


def get_kobart_tokenizer():
    # Fast tokenizer shipped with the hyunwoongko/kobart checkpoint.
    return PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup, find_packages

with open('README.md', encoding='utf-8') as f:
    long_description = f.read()

setup(
    name='kobart-transformers',
    version='0.1.4',
    author='Hyunwoong Ko',
    author_email='kevin.woong@kakaobrain.com',
    url='https://github.com/hyunwoongko/kobart-transformers',
    license='BSD 3-Clause "New" or "Revised" License',
    description='KoBART model on huggingface transformers',
    long_description_content_type='text/markdown',
    platforms=['any'],
    long_description=long_description,
    packages=find_packages(exclude=[]),
    keywords=["kobart", "huggingface", "deep learning"],
    install_requires=['transformers'],
    python_requires='>=3.6',
    package_data={},
    zip_safe=False,
    classifiers=[
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
    ],
)
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

.idea/
*.iml
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## KoBART-Transformers
- KoBART, released by SKT, ported to Hugging Face `transformers` for convenient use.

### Install (Optional)
- If you load the model with `BartModel` and `PreTrainedTokenizerFast` directly, you do not need to install this package.
```console
pip install kobart-transformers
```
### Tokenizer
- Implemented with `PreTrainedTokenizerFast`.
- Equivalent to `PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")`.
```python
>>> from kobart_transformers import get_kobart_tokenizer
>>> # from transformers import PreTrainedTokenizerFast

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> # kobart_tokenizer = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")

>>> kobart_tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o")
['▁안녕하', '세요.', '▁한국어', '▁B', 'A', 'R', 'T', '▁입', '니다.', '🤣', ':)', 'l^o']
```
### Model
- Implemented with `BartModel`.
- Equivalent to `BartModel.from_pretrained("hyunwoongko/kobart")`.
```python
>>> from kobart_transformers import get_kobart_model, get_kobart_tokenizer
>>> # from transformers import BartModel

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = get_kobart_model()
>>> # model = BartModel.from_pretrained("hyunwoongko/kobart")

>>> inputs = kobart_tokenizer(['안녕하세요.'], return_tensors='pt')
>>> model(inputs['input_ids'])
Seq2SeqModelOutput(last_hidden_state=tensor([[[-0.4488, -4.3651,  3.2349,  ...,  5.8916,  4.0497,  3.5468],
         [-0.4096, -4.6106,  2.7189,  ...,  6.1745,  2.9832,  3.0930]]],
       grad_fn=<...>), past_key_values=None, decoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, encoder_last_hidden_state=tensor([[[ 0.4624, -0.2475,  0.0902,  ...,  0.1127,  0.6529,  0.2203],
         [ 0.4538, -0.2948,  0.2556,  ..., -0.0442,  0.6858,  0.4372]]],
       grad_fn=<...>), encoder_hidden_states=None, encoder_attentions=None)
```
### For Seq2Seq Training
- For seq2seq training, use `get_kobart_for_conditional_generation()` as shown below.
- Equivalent to `BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")`.
```python
>>> from kobart_transformers import get_kobart_for_conditional_generation
>>> # from transformers import BartForConditionalGeneration

>>> model = get_kobart_for_conditional_generation()
>>> # model = BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")
```
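As a quick reference, the snippet below sketches a single training step. It is a minimal illustration, not part of the original README: the source/target sentences are made up, and it simply relies on the fact that `BartForConditionalGeneration` computes the cross-entropy loss itself when `labels` is passed.
```python
>>> from kobart_transformers import get_kobart_for_conditional_generation, get_kobart_tokenizer

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = get_kobart_for_conditional_generation()

>>> # Illustrative source/target pair; in practice these come from your parallel corpus.
>>> inputs = kobart_tokenizer(["안녕하세요."], return_tensors="pt")
>>> labels = kobart_tokenizer(["안녕하십니까."], return_tensors="pt")["input_ids"]

>>> # With `labels` given, the forward pass returns the seq2seq LM loss.
>>> outputs = model(input_ids=inputs["input_ids"],
...                 attention_mask=inputs["attention_mask"],
...                 labels=labels)
>>> outputs.loss.backward()
```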
### Update Notes
#### version 0.1
- Fixed an error where the `pad` token was not set.
```python
from kobart_transformers import get_kobart_tokenizer
kobart_tokenizer = get_kobart_tokenizer()
kobart_tokenizer(["한국어", "BART 모델을", "소개합니다."], truncation=True, padding=True)
{
    'input_ids': [[28324, 3, 3, 3, 3], [15085, 264, 281, 283, 24224], [15630, 20357, 3, 3, 3]],
    'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
    'attention_mask': [[1, 0, 0, 0, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]]
}
```
#### version 0.1.3
- Registered `get_kobart_for_conditional_generation()` in `__init__.py`.

#### version 0.1.4
- Added the missing `special_tokens_map.json`.
- KoBART can now be used without `pip install`.
- Thanks to [bernardscumm](https://github.com/bernardscumm)

#### version 0.1.5
- Added template processing so that `<s>` and `</s>` are attached automatically when the tokenizer is used (see the sketch below).
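For reference, the snippet below shows roughly how such template post-processing can be attached with the `tokenizers` library. This is a minimal sketch of the technique, not the project's actual code, and it assumes KoBART uses BART's standard `<s>`/`</s>` special tokens.
```python
from kobart_transformers import get_kobart_tokenizer
from tokenizers.processors import TemplateProcessing

kobart_tokenizer = get_kobart_tokenizer()

# Wrap every encoded sequence as "<s> ... </s>" automatically.
kobart_tokenizer.backend_tokenizer.post_processor = TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> <s> $B </s>",
    special_tokens=[
        ("<s>", kobart_tokenizer.bos_token_id),
        ("</s>", kobart_tokenizer.eos_token_id),
    ],
)
```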
### Reference
- [SKT KoBART](https://github.com/SKT-AI/KoBART)
- [Huggingface Transformers](https://github.com/huggingface/transformers)
--------------------------------------------------------------------------------