├── render.py
├── .gitignore
└── README.md


/render.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/python
 2 | # -*- coding: utf-8 -*-
 3 | 
 4 | from PIL import Image, ImageFont, ImageDraw
 5 | import numpy as np
 6 | 
 7 | def render(text, font):
 8 |     mask = font.getmask(text)
 9 |     size = mask.size[::-1]
10 |     a = np.asarray(mask).reshape(size)
11 |     return a
12 | 
13 | def ascii_print(glyph_array):
14 |     for l in glyph_array:
15 |         for c in l:
16 |             if c != 0:
17 |                 print '#',
18 |             else:
19 |                 print ' ',
20 |         print
21 | 
22 | if __name__ == '__main__':
23 |     s = u'你好'
24 |     print 'check utf-8 support:',
25 |     print s.encode('utf-8')
26 |     font = ImageFont.truetype('NotoSansCJKsc-hinted/NotoSansCJKsc-Regular.otf', 24)
27 |     a = render(s, font)
28 |     print a
29 |     ascii_print(a)
30 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # Byte-compiled / optimized / DLL files
 2 | __pycache__/
 3 | *.py[cod]
 4 | *$py.class
 5 | 
 6 | # C extensions
 7 | *.so
 8 | 
 9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | 
27 | # PyInstaller
28 | #  Usually these files are written by a python script from a template
29 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
30 | *.manifest
31 | *.spec
32 | 
33 | # Installer logs
34 | pip-log.txt
35 | pip-delete-this-directory.txt
36 | 
37 | # Unit test / coverage reports
38 | htmlcov/
39 | .tox/
40 | .coverage
41 | .coverage.*
42 | .cache
43 | nosetests.xml
44 | coverage.xml
45 | *,cover
46 | .hypothesis/
47 | 
48 | # Translations
49 | *.mo
50 | *.pot
51 | 
52 | # Django stuff:
53 | *.log
54 | local_settings.py
55 | 
56 | # Flask stuff:
57 | instance/
58 | .webassets-cache
59 | 
60 | # Scrapy stuff:
61 | .scrapy
62 | 
63 | # Sphinx documentation
64 | docs/_build/
65 | 
66 | # PyBuilder
67 | target/
68 | 
69 | # IPython Notebook
70 | .ipynb_checkpoints
71 | 
72 | # pyenv
73 | .python-version
74 | 
75 | # celery beat schedule file
76 | celerybeat-schedule
77 | 
78 | # dotenv
79 | .env
80 | 
81 | # virtualenv
82 | venv/
83 | ENV/
84 | 
85 | # Spyder project settings
86 | .spyderproject
87 | 
88 | # Rope project settings
89 | .ropeproject
90 | 
91 | # mac
92 | .DS_Store
93 | 
94 | # project specific
95 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # chinese-char-lm
 2 | This is the code associated with the publication _Glyph-aware Embedding of Chinese Characters_ by Dai and Cai. Please consider citing the paper if you find the code useful for your research.
 3 | 
 4 | ```
 5 | Dai, Falcon Z., and Zheng Cai. "Glyph-aware Embedding of Chinese Characters." EMNLP 2017 (2017): 64.
 6 | ```
 7 | 
 8 | ```bibtex
 9 | @article{dai2017glyph,
10 |   title={Glyph-aware Embedding of Chinese Characters},
11 |   author={Dai, Falcon Z and Cai, Zheng},
12 |   journal={EMNLP 2017},
13 |   pages={64},
14 |   year={2017}
15 | }
16 | ```
17 | 
18 | # usage
19 | 
20 | - We used the Google Noto fonts for all of our experiments. Download the Google Noto Simplified Chinese fonts (https://www.google.com/get/noto/#sans-hans) and unzip them under the project directory. They are needed to render the glyphs; `render.py` loads the font from `NotoSansCJKsc-hinted/NotoSansCJKsc-Regular.otf`.
21 | - Requires TensorFlow v1.1 and Python 2.7.x
22 | - Clone the repo and check out a particular branch or a specific commit with `$ git checkout <branch-name or git-tag>`
23 | 
24 | # replication
25 | 
26 | To aid replicability, we git-tagged the original commits we used to obtain the published figures. Please see the [releases](https://github.com/falcondai/chinese-char-lm/releases) page for a complete list of git tags (compare with the model names in the paper). Please use the issues page to contact us about code issues so that more people can benefit from the conversation.
27 | 
28 | ## summary of our implementation
29 | 
30 | Commit [msr-m1](https://github.com/falcondai/chinese-char-lm/tree/msr-m1) is a good place to start for language modeling. See https://github.com/falcondai/chinese-char-lm/blob/msr-m1/train_id_cnn_lm.py#L35 for a few related models (they differ by whether they use character-id embedding, glyph embedding, or both). For the Chinese segmentation task (tokenizing Chinese sentences, which by convention contain no whitespace), you will probably want to consult https://github.com/falcondai/chinese-char-lm/blob/segmentation/train_cnn_segmentation.py.
31 | 
32 | At a high level, our implementation uses no pre-trained embeddings and renders the characters into glyphs on the fly. Glyph rendering calls are slow, so we cache the glyphs of previously seen characters, which gives a dramatic speedup (see https://github.com/falcondai/chinese-char-lm/blob/msr-m1/train_cnn_lm.py#L18). We consider the input activation to the RNN, that is, the combined output of a CNN over the glyph and a trained character-id embedding, to be the _effective embedding_ of an input character.
33 | 
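The glyph caching described above can be illustrated with a minimal sketch. It is not the code at the linked line; `GlyphCache`, `render_fn`, and `fake_render` are hypothetical names introduced here, and the stand-in renderer exists only so the example runs without a font file. In practice `render_fn` would presumably wrap something like `render()` from `render.py` with a loaded font.

```python
# -*- coding: utf-8 -*-

class GlyphCache(object):
    """Memoize glyph rendering so each distinct character is rendered once."""

    def __init__(self, render_fn):
        # render_fn: hypothetical callable mapping one character to its glyph.
        self._render_fn = render_fn
        self._glyphs = {}
        self.misses = 0  # number of times the slow renderer was actually called

    def get(self, char):
        if char not in self._glyphs:
            self.misses += 1
            self._glyphs[char] = self._render_fn(char)
        return self._glyphs[char]


def fake_render(char):
    # Stand-in for a real font rendering call (assumption for this sketch).
    return [[ord(char) % 2]]


cache = GlyphCache(fake_render)
glyphs = [cache.get(c) for c in u'你好你好你']
```

Here five lookups trigger only two rendering calls, one per distinct character; the repeated characters are served from the dictionary.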
34 | In terms of implementation:
35 | 1. The training script takes the path to a UTF-8 encoded text file and the path to a vocabulary as input (see https://github.com/falcondai/chinese-char-lm/blob/msr-m1/train_id_cnn_lm.py#L61) to build a TensorFlow input pipeline. For segmentation, it additionally takes a path to the ground-truth segmentation annotations.
36 | 2. The characters are rendered into glyphs (see https://github.com/falcondai/chinese-char-lm/blob/msr-m1/train_id_cnn_lm.py#L13) and passed to the CNN. In parallel, we also look up each character's embedding using its vocabulary id. (We do both for all models, and then simply use a 0/1 multiplier to shut down the path we don't need before outputting to the RNN in the specific model variant. See https://github.com/falcondai/chinese-char-lm/blob/msr-m1/train_id_cnn_lm.py#L40)
37 | 3. Lastly, the output is fed into a standard RNN, as is common in other contemporary work.
38 | 4. The model is trained end-to-end for the given task.
39 | 
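The 0/1 multiplier from step 2 can be sketched in plain Python as follows. This is only an illustration, not the TensorFlow code in the repository: the function name `effective_embedding`, its parameters, and the feature values are made up here, and concatenating the two paths is an assumption, since the text above only says the outputs are "combined".

```python
# -*- coding: utf-8 -*-

def effective_embedding(glyph_features, id_embedding, use_glyph, use_id):
    """Combine the glyph path and the id path, zeroing a disabled path.

    Assumption: the two paths are concatenated before going to the RNN.
    """
    glyph_gate = 1.0 if use_glyph else 0.0
    id_gate = 1.0 if use_id else 0.0
    gated_glyph = [glyph_gate * x for x in glyph_features]
    gated_id = [id_gate * x for x in id_embedding]
    return gated_glyph + gated_id


# Made-up values for a single character.
cnn_out = [0.5, -1.0]      # output of the CNN over the glyph
id_emb = [2.0, 3.0, 4.0]   # trained character-id embedding

both = effective_embedding(cnn_out, id_emb, use_glyph=True, use_id=True)
glyph_only = effective_embedding(cnn_out, id_emb, use_glyph=True, use_id=False)
id_only = effective_embedding(cnn_out, id_emb, use_glyph=False, use_id=True)
```

Because both paths are always computed, the three model variants share one graph shape and differ only in the two gate values, which keeps the RNN input size the same across variants.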
40 | # authors
41 | Falcon Dai (dai@ttic.edu)
42 | 
43 | Zheng Cai (jontsai@uchicago.edu)
44 | 


--------------------------------------------------------------------------------