├── render.py
├── .gitignore
└── README.md


/render.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Targets Python 2.7 (see README), hence the print statements.

from PIL import Image, ImageFont, ImageDraw
import numpy as np

def render(text, font):
    """Rasterize `text` with `font` and return a 2-D array (rows, columns)."""
    mask = font.getmask(text)
    # mask.size is (width, height); reverse it to get the array shape.
    size = mask.size[::-1]
    a = np.asarray(mask).reshape(size)
    return a

def ascii_print(glyph_array):
    """Print the glyph to stdout, marking every non-zero pixel with '#'."""
    for l in glyph_array:
        for c in l:
            if c != 0:
                print '#',
            else:
                print ' ',
        print

if __name__ == '__main__':
    s = u'你好'
    print 'check utf-8 support:',
    print s.encode('utf-8')
    # The path assumes the Noto fonts were unzipped under the project directory.
    font = ImageFont.truetype('NotoSansCJKsc-hinted/NotoSansCJKsc-Regular.otf', 24)
    a = render(s, font)
    print a
    ascii_print(a)


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# dotenv
.env

# virtualenv
venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject

# mac
.DS_Store

# project specific


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# chinese-char-lm
This is the code associated with the publication _Glyph-aware Embedding of Chinese Characters_ by Dai and Cai. Please consider citing the paper if you find the code useful for your research.

```
Dai, Falcon Z., and Zheng Cai. "Glyph-aware Embedding of Chinese Characters." EMNLP 2017 (2017): 64.
```

```bibtex
@article{dai2017glyph,
  title={Glyph-aware Embedding of Chinese Characters},
  author={Dai, Falcon Z and Cai, Zheng},
  journal={EMNLP 2017},
  pages={64},
  year={2017}
}
```

# usage

- We used the Google Noto font for all of our experiments. Download the Google Noto Simplified Chinese fonts (https://www.google.com/get/noto/#sans-hans) and unzip them under the project directory; they are needed to render the glyphs.
- Requires TensorFlow v1.1 and Python 2.7.x
- Clone the repo and check out a particular branch or a specific commit with `$ git checkout `

# replication

To support replication, we git-tagged the original commits we used to obtain the published figures. Please see the [releases](https://github.com/falcondai/chinese-char-lm/releases) page for a complete list of git tags (compare them with the model names in the paper). Please use the issues page to contact us about code issues so that more people can benefit from the conversations.

## summary of our implementation

Commit [msr-m1](https://github.com/falcondai/chinese-char-lm/tree/msr-m1) is a good place to start for language modeling. See https://github.com/falcondai/chinese-char-lm/blob/msr-m1/train_id_cnn_lm.py#L35 for a few related models (they differ by whether they use character-id embedding, glyph embedding, or both). For the Chinese segmentation task (tokenizing Chinese sentences, which by convention lack whitespace), you probably want to consult https://github.com/falcondai/chinese-char-lm/blob/segmentation/train_cnn_segmentation.py.

At a high level, our implementation uses no pre-trained embeddings and renders the characters into glyphs on the fly. Glyph rendering calls are slow, so we cache the glyphs of characters we have already seen, which gives a dramatic speedup (see https://github.com/falcondai/chinese-char-lm/blob/msr-m1/train_cnn_lm.py#L18). We consider the input activation to the RNN (the combined output of a CNN over the glyph and a trained character-id embedding) to be the _effective embedding_ for an input character.

In terms of implementation:
1. The training script takes the path to a UTF-8 encoded text file and the path to a vocabulary as input (see https://github.com/falcondai/chinese-char-lm/blob/msr-m1/train_id_cnn_lm.py#L61) to build a TensorFlow input pipeline. For segmentation, it also takes a path to the ground-truth segmentation annotations.
2. The characters are rendered into glyphs (see https://github.com/falcondai/chinese-char-lm/blob/msr-m1/train_id_cnn_lm.py#L13) and passed to the CNN. In parallel, we look up each character's embedding by its vocabulary id. (We do both for all models, and then use a 0/1 multiplier to shut down the path that a specific model variant does not need before feeding the result to the RNN. See https://github.com/falcondai/chinese-char-lm/blob/msr-m1/train_id_cnn_lm.py#L40.)
3. Lastly, the output is fed into a standard RNN, as is common in other contemporary work.
4. The model is trained end-to-end for the given task.

# authors
Falcon Dai (dai@ttic.edu)

Zheng Cai (jontsai@uchicago.edu)

--------------------------------------------------------------------------------
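
To illustrate the glyph caching mentioned in the implementation summary, here is a minimal sketch. It is not the repository's actual code (follow the `train_cnn_lm.py` link in the summary for that). `GlyphCache` and `fake_render` are hypothetical names introduced only for this example; in practice the wrapped function would be something like `render` from `render.py` bound to a fixed font.

```python
class GlyphCache(object):
    """Memoize a per-character rendering function (hypothetical helper)."""

    def __init__(self, render_fn):
        self._render_fn = render_fn  # e.g. lambda ch: render(ch, font)
        self._cache = {}             # character -> rendered glyph
        self.misses = 0              # number of real rendering calls made

    def __call__(self, char):
        # Render only the first time a character is seen.
        if char not in self._cache:
            self.misses += 1
            self._cache[char] = self._render_fn(char)
        return self._cache[char]

    def render_text(self, text):
        """Return one glyph per character, reusing cached glyphs."""
        return [self(ch) for ch in text]


def fake_render(char):
    # Stand-in for the slow rendering call: a 1x1 "glyph" holding the code point.
    return [[ord(char)]]


cached_render = GlyphCache(fake_render)
glyphs = cached_render.render_text(u'你好你好')
```

Here four characters are requested but only two rendering calls happen, because the repeated characters are served from the dictionary. An unbounded dictionary is an assumption that seems reasonable for a fixed vocabulary; a bounded cache would be needed if memory were a concern.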
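
The 0/1 multiplier from step 2 of the implementation summary can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code at the linked line: the README only says the two outputs are "combined", so concatenation is assumed here, and `effective_embedding`, `use_glyph`, and `use_id` are hypothetical names.

```python
def effective_embedding(glyph_features, id_embedding, use_glyph, use_id):
    """Gate each path with a 0/1 multiplier, then combine the two paths.

    ASSUMPTION: the paths are combined by concatenation; the README does
    not say how the CNN output and the id embedding are combined.
    """
    gated_glyph = [use_glyph * x for x in glyph_features]
    gated_id = [use_id * x for x in id_embedding]
    return gated_glyph + gated_id


# Made-up values for a single character, for illustration only.
glyph_features = [0.5, -1.0]    # stands in for the CNN output over the glyph
id_embedding = [2.0, 3.0, 4.0]  # stands in for the trained id embedding

both = effective_embedding(glyph_features, id_embedding, use_glyph=1, use_id=1)
glyph_only = effective_embedding(glyph_features, id_embedding, use_glyph=1, use_id=0)
id_only = effective_embedding(glyph_features, id_embedding, use_glyph=0, use_id=1)
```

Under this assumption every variant produces a vector of the same length, so the RNN input size does not change between model variants; only the multipliers do.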