├── kobart_transformers
│   ├── __init__.py
│   └── load_model.py
├── setup.py
├── .gitignore
└── README.md

--------------------------------------------------------------------------------
/kobart_transformers/__init__.py:
--------------------------------------------------------------------------------
from kobart_transformers.load_model import (
    get_kobart_model,
    get_kobart_tokenizer,
    get_kobart_for_conditional_generation,
)

__all__ = ["get_kobart_model", "get_kobart_tokenizer", "get_kobart_for_conditional_generation"]
--------------------------------------------------------------------------------
/kobart_transformers/load_model.py:
--------------------------------------------------------------------------------
from transformers import BartModel, PreTrainedTokenizerFast, BartForConditionalGeneration


def get_kobart_model():
    # Bare encoder-decoder backbone (no LM head), for feature extraction.
    return BartModel.from_pretrained("hyunwoongko/kobart")


def get_kobart_for_conditional_generation():
    # Backbone plus LM head, for seq2seq training and text generation.
    return BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")


def get_kobart_tokenizer():
    # Fast tokenizer shipped with the hyunwoongko/kobart checkpoint.
    return PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup, find_packages

with open('README.md', encoding='utf-8') as f:
    long_description = f.read()

setup(
    name='kobart-transformers',
    version='0.1.4',
    author='Hyunwoong Ko',
    author_email='kevin.woong@kakaobrain.com',
    url='https://github.com/hyunwoongko/kobart-transformers',
    license='BSD 3-Clause "New" or "Revised" License',
    description='KoBART model on huggingface transformers',
    long_description_content_type='text/markdown',
    platforms=['any'],
    long_description=long_description,
    packages=find_packages(exclude=[]),
    keywords=["kobart", "huggingface", "deep learning"],
    install_requires=['transformers'],
    python_requires='>=3.6',
    package_data={},
    zip_safe=False,
    classifiers=[
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
    ],
)
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

.idea/
*.iml
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## KoBART-Transformers
- KoBART, released by SKT, ported to Hugging Face `transformers` for convenient use.

### Install (Optional)
- If you load the model with `BartModel` and `PreTrainedTokenizerFast` directly, you do not need to install this package.
```console
pip install kobart-transformers
```
### Tokenizer
- Implemented with `PreTrainedTokenizerFast`.
- Equivalent to `PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")`.
```python
>>> from kobart_transformers import get_kobart_tokenizer
>>> # from transformers import PreTrainedTokenizerFast

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> # kobart_tokenizer = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")

>>> kobart_tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o")
['▁안녕하', '세요.', '▁한국어', '▁B', 'A', 'R', 'T', '▁입', '니다.', '🤣', ':)', 'l^o']
```
### Model
- Implemented with `BartModel`.
- Equivalent to `BartModel.from_pretrained("hyunwoongko/kobart")`.
```python
>>> from kobart_transformers import get_kobart_model, get_kobart_tokenizer
>>> # from transformers import BartModel

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = get_kobart_model()
>>> # model = BartModel.from_pretrained("hyunwoongko/kobart")

>>> inputs = kobart_tokenizer(['안녕하세요.'], return_tensors='pt')
>>> model(inputs['input_ids'])
Seq2SeqModelOutput(last_hidden_state=tensor([[[-0.4488, -4.3651,  3.2349,  ...,  5.8916,  4.0497,  3.5468],
         [-0.4096, -4.6106,  2.7189,  ...,  6.1745,  2.9832,  3.0930]]],
       grad_fn=<...>), past_key_values=None, decoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, encoder_last_hidden_state=tensor([[[ 0.4624, -0.2475,  0.0902,  ...,  0.1127,  0.6529,  0.2203],
         [ 0.4538, -0.2948,  0.2556,  ..., -0.0442,  0.6858,  0.4372]]],
       grad_fn=<...>), encoder_hidden_states=None, encoder_attentions=None)
```
### For Seq2Seq Training
- For seq2seq training, use `get_kobart_for_conditional_generation()` as shown below.
- Equivalent to `BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")`.
```python
>>> from kobart_transformers import get_kobart_for_conditional_generation
>>> # from transformers import BartForConditionalGeneration

>>> model = get_kobart_for_conditional_generation()
>>> # model = BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")
```
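As a quick reference, the snippet below sketches a single training step. It is a minimal illustration, not part of the original README: the source/target sentences are made up, and it simply relies on the fact that `BartForConditionalGeneration` computes the cross-entropy loss itself when `labels` is passed.
```python
>>> from kobart_transformers import get_kobart_for_conditional_generation, get_kobart_tokenizer

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = get_kobart_for_conditional_generation()

>>> # Illustrative source/target pair; in practice these come from your parallel corpus.
>>> inputs = kobart_tokenizer(["안녕하세요."], return_tensors="pt")
>>> labels = kobart_tokenizer(["안녕하십니까."], return_tensors="pt")["input_ids"]

>>> # With `labels` given, the forward pass returns the seq2seq LM loss.
>>> outputs = model(input_ids=inputs["input_ids"],
...                 attention_mask=inputs["attention_mask"],
...                 labels=labels)
>>> outputs.loss.backward()
```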
### Update Notes
#### version 0.1
- Fixed an error where the `pad` token was not set.
```python
from kobart_transformers import get_kobart_tokenizer
kobart_tokenizer = get_kobart_tokenizer()
kobart_tokenizer(["한국어", "BART 모델을", "소개합니다."], truncation=True, padding=True)
{
    'input_ids': [[28324, 3, 3, 3, 3], [15085, 264, 281, 283, 24224], [15630, 20357, 3, 3, 3]],
    'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
    'attention_mask': [[1, 0, 0, 0, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]]
}
```
#### version 0.1.3
- Registered `get_kobart_for_conditional_generation()` in `__init__.py`.

#### version 0.1.4
- Added the missing `special_tokens_map.json`.
- KoBART can now be used without `pip install`.
- Thanks to [bernardscumm](https://github.com/bernardscumm)

#### version 0.1.5
- Added template processing so that `<s>` and `</s>` are attached automatically when the tokenizer is used (see the sketch below).
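For reference, the snippet below shows roughly how such template post-processing can be attached with the `tokenizers` library. This is a minimal sketch of the technique, not the project's actual code, and it assumes KoBART uses BART's standard `<s>`/`</s>` special tokens.
```python
from kobart_transformers import get_kobart_tokenizer
from tokenizers.processors import TemplateProcessing

kobart_tokenizer = get_kobart_tokenizer()

# Wrap every encoded sequence as "<s> ... </s>" automatically.
kobart_tokenizer.backend_tokenizer.post_processor = TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> <s> $B </s>",
    special_tokens=[
        ("<s>", kobart_tokenizer.bos_token_id),
        ("</s>", kobart_tokenizer.eos_token_id),
    ],
)
```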
### Reference
- [SKT KoBART](https://github.com/SKT-AI/KoBART)
- [Huggingface Transformers](https://github.com/huggingface/transformers)
--------------------------------------------------------------------------------