├── kobart_transformers
│   ├── __init__.py
│   └── load_model.py
├── setup.py
├── .gitignore
└── README.md
/kobart_transformers/__init__.py:
--------------------------------------------------------------------------------
from kobart_transformers.load_model import get_kobart_model, get_kobart_tokenizer, get_kobart_for_conditional_generation

__all__ = ["get_kobart_model", "get_kobart_tokenizer", "get_kobart_for_conditional_generation"]
--------------------------------------------------------------------------------
/kobart_transformers/load_model.py:
--------------------------------------------------------------------------------
from transformers import BartModel, PreTrainedTokenizerFast, BartForConditionalGeneration


def get_kobart_model():
    """Load the pretrained KoBART encoder-decoder (no LM head)."""
    return BartModel.from_pretrained("hyunwoongko/kobart")


def get_kobart_for_conditional_generation():
    """Load KoBART with a language-modeling head, for seq2seq training and generation."""
    return BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")


def get_kobart_tokenizer():
    """Load the fast KoBART tokenizer."""
    return PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
import codecs
from setuptools import setup, find_packages


def read_file(filename, cb):
    with codecs.open(filename, 'r', 'utf8') as f:
        return cb(f)


long_description = read_file('README.md', lambda f: f.read())

setup(
    name='kobart-transformers',
    version='0.1.4',
    author='Hyunwoong Ko',
    author_email='kevin.woong@kakaobrain.com',
    url='https://github.com/hyunwoongko/kobart-transformers',
    license='BSD 3-Clause "New" or "Revised" License',
    description='KoBART model on huggingface transformers',
    long_description_content_type='text/markdown',
    platforms=['any'],
    long_description=long_description,
    packages=find_packages(exclude=[]),
    keywords=["kobart", "huggingface", "deep learning"],
    python_requires='>=3',
    package_data={},
    zip_safe=False,
    classifiers=[
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.2',
        'Programming Language :: Python :: 3.3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
    ],
)
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

.idea/
*.iml
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## KoBART-Transformers
- KoBART, released by SKT, ported to huggingface transformers for convenient use.


### Install (Optional)
- If you use `BartModel` and `PreTrainedTokenizerFast` directly, you do not need to install this package.
```console
pip install kobart-transformers
```


### Tokenizer
- Implemented with `PreTrainedTokenizerFast`.
- Equivalent to `PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")`.
```python
>>> from kobart_transformers import get_kobart_tokenizer
>>> # from transformers import PreTrainedTokenizerFast

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> # kobart_tokenizer = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")

>>> kobart_tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o")
['▁안녕하', '세요.', '▁한국어', '▁B', 'A', 'R', 'T', '▁입', '니다.', '🤣', ':)', 'l^o']
```


### Model
- Implemented with `BartModel`.
- Equivalent to `BartModel.from_pretrained("hyunwoongko/kobart")`.
```python
>>> from kobart_transformers import get_kobart_model, get_kobart_tokenizer
>>> # from transformers import BartModel

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = get_kobart_model()
>>> # model = BartModel.from_pretrained("hyunwoongko/kobart")

>>> inputs = kobart_tokenizer(['안녕하세요.'], return_tensors='pt')
>>> model(inputs['input_ids'])
Seq2SeqModelOutput(last_hidden_state=tensor([[[-0.4488, -4.3651,  3.2349,  ...,  5.8916,  4.0497,  3.5468],
         [-0.4096, -4.6106,  2.7189,  ...,  6.1745,  2.9832,  3.0930]]],
       grad_fn=<...>), past_key_values=None, decoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, encoder_last_hidden_state=tensor([[[ 0.4624, -0.2475,  0.0902,  ...,  0.1127,  0.6529,  0.2203],
         [ 0.4538, -0.2948,  0.2556,  ..., -0.0442,  0.6858,  0.4372]]],
       grad_fn=<...>), encoder_hidden_states=None, encoder_attentions=None)
```
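- The call above passes only `input_ids`; for batched, padded inputs you will generally want to forward the attention mask as well. A minimal sketch, continuing from the variables above:
```python
>>> # Forward the full encoding so padded positions are masked out.
>>> outputs = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
>>> outputs.last_hidden_state.shape  # (batch_size, sequence_length, hidden_size)
```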


### For Seq2Seq Training
- For seq2seq training, use `get_kobart_for_conditional_generation()` as shown below.
- Equivalent to `BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")`.
```python
>>> from kobart_transformers import get_kobart_for_conditional_generation
>>> # from transformers import BartForConditionalGeneration

>>> model = get_kobart_for_conditional_generation()
>>> # model = BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")
```
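- The snippet above only loads the model. As a hedged sketch of what a single training step could look like with the standard transformers API (the source/target pair below is made up for illustration):
```python
>>> from kobart_transformers import get_kobart_for_conditional_generation, get_kobart_tokenizer

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = get_kobart_for_conditional_generation()

>>> # Hypothetical source/target pair, for illustration only.
>>> inputs = kobart_tokenizer(["안녕하세요. 한국어 BART 입니다."], return_tensors="pt")
>>> labels = kobart_tokenizer(["한국어 BART"], return_tensors="pt")["input_ids"]

>>> # With labels given, the model returns a cross-entropy loss usable for training.
>>> outputs = model(input_ids=inputs["input_ids"],
...                 attention_mask=inputs["attention_mask"],
...                 labels=labels)
>>> outputs.loss.backward()
```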


### Update Notes
#### version 0.1
- Fixed an error where the `pad` token was not set.
```python
from kobart_transformers import get_kobart_tokenizer
kobart_tokenizer = get_kobart_tokenizer()
kobart_tokenizer(["한국어", "BART 모델을", "소개합니다."], truncation=True, padding=True)
{
    'input_ids': [[28324, 3, 3, 3, 3], [15085, 264, 281, 283, 24224], [15630, 20357, 3, 3, 3]],
    'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
    'attention_mask': [[1, 0, 0, 0, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]]
}
```
#### version 0.1.3
- Registered `get_kobart_for_conditional_generation()` in `__init__.py`.

#### version 0.1.4
- Added the previously missing `special_tokens_map.json`.
- KoBART can now be used without `pip install`.
- Thanks to [bernardscumm](https://github.com/bernardscumm)

#### version 0.1.5
- Added template processing so that `<s>` and `</s>` are automatically attached when using the tokenizer.
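- A quick, hedged way to check the new behavior (the comments describe what one would expect if `<s>`/`</s>` are attached automatically; this is not verified output):
```python
from kobart_transformers import get_kobart_tokenizer

kobart_tokenizer = get_kobart_tokenizer()
ids = kobart_tokenizer.encode("안녕하세요.")
# With template processing, ids should now begin with the BOS ("<s>") id and
# end with the EOS ("</s>") id, so the decoded string should show both tokens.
print(kobart_tokenizer.decode(ids))
```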



### Reference
- [SKT KoBART](https://github.com/SKT-AI/KoBART)
- [Huggingface Transformers](https://github.com/huggingface/transformers)
--------------------------------------------------------------------------------