├── .gitignore ├── LICENSE ├── README.md ├── data_load.py ├── eval.py ├── hyperparams.py ├── modules.py ├── requirements.txt ├── results └── model_epoch_15_gs_4914 └── train.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 Kyubyong Park 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Neural Tokenizer 2 | 3 | ## Motivation 4 | Tokenization, or segmentation, is often the first step in text processing. It is not a trivial problem in such languages as Chinese, Japanese, or Vietnamese. For English, generally speaking, tokenization is not as important as those languages. 
However, this may not hold in the mobile environment, where people often neglect spacing. Besides, in my opinion, English is a good testbed before we attack more challenging languages. Traditionally, Conditional Random Fields have been employed for tokenization with success, but neural networks can be an alternative. This is a simple but fun task, and you can probably see results in less than 10 minutes on a single GPU! 5 | 6 | ## Model Description 7 | A modified CBHG module, introduced in [Tacotron: Towards End-to-End Speech Synthesis](https://arxiv.org/abs/1703.10135), is employed. It is a powerful architecture with a reasonable number of hyperparameters. 8 | 9 | ## Data 10 | We use the Brown corpus, which can be obtained from `nltk`. It is not large, but it is publicly available and requires no cleaning. 11 | 12 | ## Requirements 13 | * NumPy >= 1.11.1 14 | * TensorFlow == 1.2 15 | * nltk >= 3.2.1 (You need to download the `brown` corpus) 16 | * tqdm >= 4.14.0 17 | 18 | ## File description 19 | 20 | * `hyperparams.py` includes all required hyperparameters. 21 | * `data_load.py` loads data and puts it in queues. 22 | * `modules.py` contains building blocks for the network. 23 | * `train.py` is for training. 24 | * `eval.py` is for evaluation. 25 | 26 | ## Training 27 | * STEP 0. Make sure you meet the requirements. 28 | * STEP 1. Adjust hyperparameters in `hyperparams.py` if necessary. 29 | * STEP 2. Run `train.py` or download my [pretrained files](https://www.dropbox.com/s/fxl3ixo5jwl7ihv/logdir.zip?dl=0). 30 | 31 | ## Evaluation 32 | * Run `eval.py`. 33 | 34 | ## Results 35 | I got a test accuracy of 0.9877 with the model at epoch 15, i.e., 4,914 global steps. The baseline accuracy is what we would obtain by leaving the untokenized input untouched, i.e., inserting no spaces at all. Some of the results are shown below; a short standalone sketch of the character-level space labeling follows the examples. Full details are available in the `results` folder. 36 | 37 | * Final Accuracy = 209086/211699=0.9877 38 | * Baseline Accuracy = 166107/211699=0.7846 39 | 40 | ▌Expected: Likewise the ivory Chinese female figure known as a doctor lady '' provenance Honan
41 | ▌Got: Likewise theivory Chinese female figure known as a doctor lady '' provenance Honan
42 | 43 | ▌Expected: a friend of mine removing her from the curio cabinet for inspection was felled as if by a hammer but he had previously drunk a quantity of applejack
44 | ▌Got: a friend of mine removing her from the curiocabinet for inspection was felled as if by a hammer but he had previously drunk a quantity of apple jack
45 | 46 | ▌Expected: The three Indian brass deities though Ganessa Siva and Krishna are an altogether different cup of tea
47 | ▌Got: The three Indian brass deities though Ganess a Siva and Krishna are an altogether different cup of tea
48 | 49 | ▌Expected: They hail from Travancore a state in the subcontinent where Kali the goddess of death is worshiped
50 | ▌Got: They hail from Travan core a state in the subcontinent where Kalit he goddess of deat his worshiped
51 | 52 | ▌Expected: Have you ever heard of Thuggee
53 | ▌Got: Have you ever heard of Thuggee
54 | 55 | ▌Expected: Oddly enough this is an amulet against housebreakers presented to the mem and me by a local rajah in
56 | ▌Got: Oddly enough this is an a mulet against house breakers presented to the memand me by a local rajahin
57 | 58 | ▌Expected: Inscribed around its base is a charm in Balinese a dialect I take it you don't comprehend
59 | ▌Got: Inscribed around its base is a charm in Baline seadialect I take it you don't comprehend
60 | 61 | ▌Expected: Neither do I but the Tjokorda Agoeng was good enough to translate and I'll do as much for you
62 | ▌Got: Neither do I but the Tjokord a Agoeng was good enough to translate and I'll do as much for you
63 | 64 | ▌Expected: Whosoever violates our rooftree the legend states can expect maximal sorrow
65 | ▌Got: Who so ever violate sour roof treethe legend states can expect maximal s orrow
66 | 67 | ▌Expected: The teeth will rain from his mouth like pebbles his wife will make him cocu with fishmongers and a trolley car will grow in his stomach
68 | ▌Got: The teeth will rain from his mouth like pebbles his wife will make him cocu with fish mongers and a trolley car will grow in his stomach
69 | 70 | ▌Expected: Furthermore and this to me strikes an especially warming note it shall avail the vandals naught to throw away or dispose of their loot
71 | ▌Got: Furthermore and this tome strikes an especially warming note it shall avail the vand alsnaught to throw away or dispose of their loot
72 | 73 | ▌Expected: The cycle of disaster starts the moment they touch any belonging of ours and dogs them unto the fortyfifth generation
74 | ▌Got: The cycle of disaster starts the moment they touch any belonging of ours and dogs them un to the fortyfifth generation
75 | 76 | ▌Expected: Sort of remorseless isn't it
77 | ▌Got: Sort of remorseless isn't it
78 | 79 | ▌Expected: Still there it is
80 | ▌Got: Still there it is
81 | 82 | ▌Expected: Now you no doubt regard the preceding as pap
83 | ▌Got: Now you no doubt regard the preceding aspap
84 | 85 | ▌Expected: In that case listen to what befell another wisenheimer who tangled with our joss
86 | ▌Got: In that case listen to what be fell anotherwisen heimer who tangled with our joss
87 | 88 | ▌Expected: A couple of years back I occupied a Village apartment whose outer staircase contained the type of niche called a coffin turn ''
89 | ▌Got: A couple of yearsback I occupied a Village apartment whose outerstair case contained the type of niche called a coffinturn ''
90 | 91 | ▌Expected: After a while we became aware that the money was disappearing as fast as we replenished it
92 | ▌Got: After a while we became aware that the money was disappearing as fast as were plenished it
93 | 94 | ▌Expected: The more I probed into this young man's activities and character the less savory I found him
95 | ▌Got: The more I probed into this young man's activities and character the less savory I found him
96 | 97 | ▌Expected: His energy was prodigious
98 | ▌Got: His energy was prodigious
99 | 100 | ▌Expected: In short and to borrow an arboreal phrase slash timber
101 | ▌Got: In short and toborrow an arboreal phrases lash timber
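The labeling scheme behind these numbers is defined in `data_load.py` and decoded in `eval.py`: every character gets a binary label, 1 if a space should follow it and 0 otherwise, and a prediction is turned back into text by re-inserting spaces after the characters tagged 1. The snippet below is a minimal, standalone sketch of that encoding and decoding (plain Python; the function names are illustrative and not part of the repo):

```python
import re

vocab = "_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'-"  # _: padding
char2idx = {c: i for i, c in enumerate(vocab)}
idx2char = {i: c for i, c in enumerate(vocab)}

def encode(sent, maxlen=150):
    """Spaced sentence -> (char ids, space labels), both zero-padded to maxlen."""
    sent = re.sub(r"[^ A-Za-z']", "", sent)  # same cleanup as load_data()
    x, y = [], []
    for word in sent.split():
        for char in word:
            x.append(char2idx[char])
            y.append(0)          # 0: no space after this character
        y[-1] = 1                # 1: space after the last character of a word
    y[-1] = 0                    # but no space after the end of the sentence
    pad = maxlen - len(x)
    return x + [0] * pad, y + [0] * pad

def decode(x, y):
    """Rebuild a spaced string from char ids and 0/1 space labels."""
    out = ""
    for idx, label in zip(x, y):
        if idx == 0:             # 0 is the padding id
            break
        out += idx2char.get(idx, "*")
        if label == 1:
            out += " "
    return out

print(decode(*encode("Have you ever heard of Thuggee")))
# -> Have you ever heard of Thuggee
```

In these terms, the baseline accuracy above corresponds to predicting the all-zero label sequence, i.e., never inserting a space.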
102 | 103 | 104 | 105 | 106 | 107 | -------------------------------------------------------------------------------- /data_load.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # /usr/bin/python2 3 | ''' 4 | June 2017 by kyubyong park. 5 | kbpark.linguist@gmail.com. 6 | https://www.github.com/kyubyong/neural_tokenizer 7 | ''' 8 | from __future__ import print_function 9 | from hyperparams import Hyperparams as hp 10 | import tensorflow as tf 11 | import numpy as np 12 | import re 13 | 14 | 15 | def load_vocab(): 16 | vocab = "_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'-" # _: sentinel for Padding 17 | word2idx = {word: idx for idx, word in enumerate(vocab)} 18 | idx2word = {idx: word for idx, word in enumerate(vocab)} 19 | return word2idx, idx2word 20 | 21 | def load_data(mode="train"): 22 | word2idx, idx2word = load_vocab() 23 | 24 | from nltk.corpus import brown 25 | sents = [" ".join(words) for words in brown.sents()] 26 | 27 | xs, ys = [], [] 28 | for sent in sents: 29 | sent = re.sub(r"[^ A-Za-z']", "", sent) 30 | if hp.minlen <= len(sent) <= hp.maxlen: 31 | x, y = [], [] 32 | for word in sent.split(): 33 | for char in word: 34 | x.append(word2idx[char]) 35 | y.append(0) # 0: no space 36 | y[-1] = 1 # space for end of a word 37 | y[-1] = 0 # no space for end of sentence 38 | 39 | xs.append(x + [0] * (hp.maxlen-len(x))) 40 | ys.append(y + [0] * (hp.maxlen-len(x))) 41 | 42 | # Convert to ndarrays 43 | X = np.array(xs, np.int32) 44 | Y = np.array(ys, np.int32) 45 | 46 | # mode 47 | if mode=="train": 48 | X, Y = X[: int(len(X) * .8)], Y[: int(len(Y) * .8)] 49 | # X, Y = X[: 128], Y[: 128] 50 | elif mode=="val": 51 | X, Y = X[int(len(X) * .8): -int(len(X) * .1)], Y[int(len(X) * .8): -int(len(X) * .1)] 52 | else: 53 | X, Y = X[-int(len(X) * .1):], Y[-int(len(X) * .1):] 54 | 55 | return X, Y 56 | 57 | def get_batch_data(): 58 | # Load data 59 | X, Y = load_data() 60 | 61 | # calc total batch count 62 | num_batch = len(X) // hp.batch_size 63 | 64 | # Convert to tensor 65 | X = tf.convert_to_tensor(X, tf.int32) 66 | Y = tf.convert_to_tensor(Y, tf.int32) 67 | 68 | # Create Queues 69 | input_queues = tf.train.slice_input_producer([X, Y]) 70 | 71 | # create batch queues 72 | x, y = tf.train.batch(input_queues, 73 | num_threads=8, 74 | batch_size=hp.batch_size, 75 | capacity=hp.batch_size * 64, 76 | allow_smaller_final_batch=False) 77 | 78 | return x, y, num_batch # (N, T), (N, T), () 79 | 80 | -------------------------------------------------------------------------------- /eval.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # /usr/bin/python2 3 | ''' 4 | By kyubyong park. kbpark.linguist@gmail.com. 
5 | https://www.github.com/kyubyong/neural_tokenizer 6 | ''' 7 | 8 | from __future__ import print_function 9 | 10 | import os 11 | 12 | from hyperparams import Hyperparams as hp 13 | import numpy as np 14 | import tensorflow as tf 15 | from train import Graph 16 | from data_load import get_batch_data, load_vocab, load_data 17 | 18 | 19 | def eval(): 20 | # Load graph 21 | g = Graph(is_training=False) 22 | print("Graph loaded") 23 | 24 | # Load data 25 | X, Y = load_data(mode="test") # texts 26 | char2idx, idx2char = load_vocab() 27 | 28 | with g.graph.as_default(): 29 | sv = tf.train.Supervisor() 30 | with sv.managed_session(config=tf.ConfigProto(allow_soft_placement=True)) as sess: 31 | # Restore parameters 32 | sv.saver.restore(sess, tf.train.latest_checkpoint(hp.logdir)) 33 | print("Restored!") 34 | 35 | # Get model 36 | mname = open(hp.logdir + '/checkpoint', 'r').read().split('"')[1] # model name 37 | 38 | # Inference 39 | if not os.path.exists(hp.savedir): os.mkdir(hp.savedir) 40 | with open("{}/{}".format(hp.savedir, mname), 'w') as fout: 41 | results = [] 42 | baseline_results = [] 43 | for step in range(len(X) // hp.batch_size): 44 | x = X[step * hp.batch_size: (step + 1) * hp.batch_size] 45 | y = Y[step * hp.batch_size: (step + 1) * hp.batch_size] 46 | 47 | # predict characters 48 | preds = sess.run(g.preds, {g.x: x}) 49 | 50 | for xx, yy, pp in zip(x, y, preds): # sentence-wise 51 | expected = '' 52 | got = '' 53 | for xxx, yyy, ppp in zip(xx, yy, pp): # character-wise 54 | if xxx == 0: 55 | break 56 | else: 57 | got += idx2char.get(xxx, "*") 58 | expected += idx2char.get(xxx, "*") 59 | if ppp == 1: got += " " 60 | if yyy == 1: expected += " " 61 | 62 | # prediction results 63 | if ppp == yyy: 64 | results.append(1) 65 | else: 66 | results.append(0) 67 | 68 | # baseline results 69 | if yyy == 0: # no space 70 | baseline_results.append(1) 71 | else: 72 | baseline_results.append(0) 73 | 74 | fout.write("▌Expected: " + expected + "\n") 75 | fout.write("▌Got: " + got + "\n\n") 76 | fout.write( 77 | "Final Accuracy = %d/%d=%.4f\n" % (sum(results), len(results), float(sum(results)) / len(results))) 78 | fout.write( 79 | "Baseline Accuracy = %d/%d=%.4f" % (sum(baseline_results), len(baseline_results), float(sum(baseline_results)) / len(baseline_results))) 80 | 81 | 82 | 83 | if __name__ == '__main__': 84 | eval() 85 | print("Done") 86 | 87 | -------------------------------------------------------------------------------- /hyperparams.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # /usr/bin/python2 3 | ''' 4 | June 2017 by kyubyong park. 5 | kbpark.linguist@gmail.com. 6 | https://www.github.com/kyubyong/neural_tokenizer 7 | ''' 8 | 9 | class Hyperparams: 10 | '''Hyperparameters''' 11 | 12 | # model 13 | maxlen = 150 # Maximum number of characters in a sentence. alias = T. 14 | minlen = 10 # Minimum number of characters in a sentence. alias = T. 15 | hidden_units = 256 # alias = E 16 | num_blocks = 6 # number of encoder/decoder blocks 17 | num_heads = 8 18 | dropout_rate = 0.2 19 | encoder_num_banks = 16 20 | num_highwaynet_blocks = 4 21 | 22 | 23 | # training 24 | num_epochs = 20 25 | batch_size = 128 # alias = N 26 | lr = 0.0001 # learning rate. 
27 | logdir = 'logdir' # log directory 28 | savedir = "results" # save directory 29 | 30 | 31 | 32 | 33 | 34 | -------------------------------------------------------------------------------- /modules.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # /usr/bin/python2 3 | ''' 4 | June 2017 by kyubyong park. 5 | kbpark.linguist@gmail.com. 6 | https://www.github.com/kyubyong/neural_tokenizer 7 | ''' 8 | 9 | from __future__ import print_function 10 | import tensorflow as tf 11 | 12 | def embedding(inputs, 13 | vocab_size, 14 | num_units, 15 | zero_pad=True, 16 | scale=True, 17 | scope="embedding", 18 | reuse=None): 19 | '''Embeds a given tensor. 20 | 21 | Args: 22 | inputs: A `Tensor` with type `int32` or `int64` containing the ids 23 | to be looked up in `lookup table`. 24 | vocab_size: An int. Vocabulary size. 25 | num_units: An int. Number of embedding hidden units. 26 | zero_pad: A boolean. If True, all the values of the fist row (id 0) 27 | should be constant zeros. 28 | scale: A boolean. If True. the outputs is multiplied by sqrt num_units. 29 | scope: Optional scope for `variable_scope`. 30 | reuse: Boolean, whether to reuse the weights of a previous layer 31 | by the same name. 32 | 33 | Returns: 34 | A `Tensor` with one more rank than inputs's. The last dimensionality 35 | should be `num_units`. 36 | 37 | For example, 38 | 39 | ``` 40 | import tensorflow as tf 41 | 42 | inputs = tf.to_int32(tf.reshape(tf.range(2*3), (2, 3))) 43 | outputs = embedding(inputs, 6, 2, zero_pad=True) 44 | with tf.Session() as sess: 45 | sess.run(tf.global_variables_initializer()) 46 | print sess.run(outputs) 47 | >> 48 | [[[ 0. 0. ] 49 | [ 0.09754146 0.67385566] 50 | [ 0.37864095 -0.35689294]] 51 | 52 | [[-1.01329422 -1.09939694] 53 | [ 0.7521342 0.38203377] 54 | [-0.04973143 -0.06210355]]] 55 | ``` 56 | 57 | ``` 58 | import tensorflow as tf 59 | 60 | inputs = tf.to_int32(tf.reshape(tf.range(2*3), (2, 3))) 61 | outputs = embedding(inputs, 6, 2, zero_pad=False) 62 | with tf.Session() as sess: 63 | sess.run(tf.global_variables_initializer()) 64 | print sess.run(outputs) 65 | >> 66 | [[[-0.19172323 -0.39159766] 67 | [-0.43212751 -0.66207761] 68 | [ 1.03452027 -0.26704335]] 69 | 70 | [[-0.11634696 -0.35983452] 71 | [ 0.50208133 0.53509563] 72 | [ 1.22204471 -0.96587461]]] 73 | ``` 74 | ''' 75 | with tf.variable_scope(scope, reuse=reuse): 76 | lookup_table = tf.get_variable('lookup_table', 77 | dtype=tf.float32, 78 | shape=[vocab_size, num_units], 79 | initializer=tf.contrib.layers.xavier_initializer()) 80 | if zero_pad: 81 | lookup_table = tf.concat((tf.zeros(shape=[1, num_units]), 82 | lookup_table[1:, :]), 0) 83 | outputs = tf.nn.embedding_lookup(lookup_table, inputs) 84 | 85 | if scale: 86 | outputs = outputs * (num_units ** 0.5) 87 | 88 | return outputs 89 | 90 | 91 | def normalize(inputs, 92 | type="bn", 93 | decay=.999, 94 | epsilon=1e-8, 95 | is_training=True, 96 | activation_fn=None, 97 | reuse=None, 98 | scope="normalize"): 99 | '''Applies {batch|layer} normalization. 100 | 101 | Args: 102 | inputs: A tensor with 2 or more dimensions, where the first dimension has 103 | `batch_size`. 104 | type: A string. Either "bn" or "ln" or "ins" or None. 105 | decay: Decay for the moving average. Reasonable values for `decay` are close 106 | to 1.0, typically in the multiple-nines range: 0.999, 0.99, 0.9, etc. 
107 | Lower `decay` value (recommend trying `decay`=0.9) if model experiences 108 | reasonably good training performance but poor validation and/or test 109 | performance. 110 | is_training: Whether or not the layer is in training mode. 111 | activation_fn: Activation function. 112 | scope: Optional scope for `variable_scope`. 113 | 114 | Returns: 115 | A tensor with the same shape and data dtype as `inputs`. 116 | ''' 117 | if type == "bn": 118 | inputs_shape = inputs.get_shape() 119 | inputs_rank = inputs_shape.ndims 120 | 121 | # use fused batch norm if inputs_rank in [2, 3, 4] as it is much faster. 122 | # pay attention to the fact that fused_batch_norm requires shape to be rank 4 of NHWC. 123 | if inputs_rank in [2, 3, 4]: 124 | if inputs_rank == 2: 125 | inputs = tf.expand_dims(inputs, axis=1) 126 | inputs = tf.expand_dims(inputs, axis=2) 127 | elif inputs_rank == 3: 128 | inputs = tf.expand_dims(inputs, axis=1) 129 | 130 | outputs = tf.contrib.layers.batch_norm(inputs=inputs, 131 | decay=decay, 132 | center=True, 133 | scale=True, 134 | activation_fn=activation_fn, 135 | updates_collections=None, 136 | is_training=is_training, 137 | scope=scope, 138 | zero_debias_moving_mean=True, 139 | fused=True, 140 | reuse=reuse) 141 | # restore original shape 142 | if inputs_rank == 2: 143 | outputs = tf.squeeze(outputs, axis=[1, 2]) 144 | elif inputs_rank == 3: 145 | outputs = tf.squeeze(outputs, axis=1) 146 | else: # fallback to naive batch norm 147 | outputs = tf.contrib.layers.batch_norm(inputs=inputs, 148 | decay=decay, 149 | center=True, 150 | scale=True, 151 | activation_fn=activation_fn, 152 | updates_collections=None, 153 | is_training=is_training, 154 | scope=scope, 155 | reuse=reuse, 156 | fused=False) 157 | elif type in ("ln", "ins"): 158 | reduction_axis = -1 if type == "ln" else 1 159 | with tf.variable_scope(scope, reuse=reuse): 160 | inputs_shape = inputs.get_shape() 161 | params_shape = inputs_shape[-1:] 162 | 163 | mean, variance = tf.nn.moments(inputs, [reduction_axis], keep_dims=True) 164 | beta = tf.Variable(tf.zeros(params_shape)) 165 | gamma = tf.Variable(tf.ones(params_shape)) 166 | normalized = (inputs - mean) / ((variance + epsilon) ** (.5)) 167 | outputs = gamma * normalized + beta 168 | else: 169 | outputs = inputs 170 | 171 | if activation_fn: 172 | outputs = activation_fn(outputs) 173 | 174 | return outputs 175 | 176 | 177 | def conv1d(inputs, 178 | filters=None, 179 | size=1, 180 | rate=1, 181 | padding="SAME", 182 | use_bias=False, 183 | activation_fn=None, 184 | scope="conv1d", 185 | reuse=None): 186 | ''' 187 | Args: 188 | inputs: A 3-D tensor with shape of [batch, time, depth]. 189 | filters: An int. Number of outputs (=activation maps) 190 | size: An int. Filter size. 191 | rate: An int. Dilation rate. 192 | padding: Either `same` or `valid` or `causal` (case-insensitive). 193 | use_bias: A boolean. 194 | scope: Optional scope for `variable_scope`. 195 | reuse: Boolean, whether to reuse the weights of a previous layer 196 | by the same name. 197 | 198 | Returns: 199 | A masked tensor of the same shape and dtypes as `inputs`. 
200 | ''' 201 | 202 | with tf.variable_scope(scope): 203 | if padding.lower() == "causal": 204 | # pre-padding for causality 205 | pad_len = (size - 1) * rate # padding size 206 | inputs = tf.pad(inputs, [[0, 0], [pad_len, 0], [0, 0]]) 207 | padding = "valid" 208 | 209 | if filters is None: 210 | filters = inputs.get_shape().as_list[-1] 211 | 212 | params = {"inputs": inputs, "filters": filters, "kernel_size": size, 213 | "dilation_rate": rate, "padding": padding, "activation": activation_fn, 214 | "use_bias": use_bias, "reuse": reuse} 215 | 216 | outputs = tf.layers.conv1d(**params) 217 | return outputs 218 | 219 | 220 | def conv1d_banks(inputs, K=16, num_units=None, norm_type=None, is_training=True, scope="conv1d_banks", reuse=None): 221 | '''Applies a series of conv1d separately. 222 | 223 | Args: 224 | inputs: A 3d tensor with shape of [N, T, C] 225 | K: An int. The size of conv1d banks. That is, 226 | The `inputs` are convolved with K filters: 1, 2, ..., K. 227 | num_units: An int. The number of hidden units. 228 | norm_type: A string. Either "bn" or "ln" or "ins" or None. 229 | is_training: A boolean. This is passed to an argument of `batch_normalize`. 230 | 231 | Returns: 232 | A 3d tensor with shape of [N, T, K*Hp.embed_size//2]. 233 | ''' 234 | if num_units is None: 235 | num_units = inputs.get_shape()[-1] 236 | 237 | with tf.variable_scope(scope, reuse=reuse): 238 | outputs = conv1d(inputs, num_units, 1) # k=1 239 | for k in range(2, K + 1): # k = 2...K 240 | with tf.variable_scope("num_{}".format(k)): 241 | output = conv1d(inputs, num_units, k) 242 | outputs = tf.concat((outputs, output), -1) 243 | outputs = normalize(outputs, type=norm_type, is_training=is_training, 244 | activation_fn=tf.nn.relu) 245 | 246 | return outputs # (N, T, Hp.embed_size//2*K) 247 | 248 | 249 | def gru(inputs, num_units=None, bidirection=False, scope="gru", reuse=None): 250 | '''Applies a GRU. 251 | 252 | Args: 253 | inputs: A 3d tensor with shape of [N, T, C]. 254 | num_units: An int. The number of hidden units. 255 | bidirection: A boolean. If True, bidirectional results 256 | are concatenated. 257 | scope: Optional scope for `variable_scope`. 258 | reuse: Boolean, whether to reuse the weights of a previous layer 259 | by the same name. 260 | 261 | Returns: 262 | If bidirection is True, a 3d tensor with shape of [N, T, 2*num_units], 263 | otherwise [N, T, num_units]. 264 | ''' 265 | if num_units is None: 266 | num_units = inputs.get_shape()[-1] 267 | 268 | with tf.variable_scope(scope, reuse=reuse): 269 | if num_units is None: 270 | num_units = inputs.get_shape().as_list[-1] 271 | 272 | cell = tf.contrib.rnn.GRUCell(num_units) 273 | if bidirection: 274 | cell_bw = tf.contrib.rnn.GRUCell(num_units) 275 | outputs, _ = tf.nn.bidirectional_dynamic_rnn(cell, cell_bw, inputs, dtype=tf.float32) 276 | return tf.concat(outputs, 2) 277 | else: 278 | outputs, _ = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32) 279 | return outputs 280 | 281 | 282 | def prenet(inputs, num_units=None, dropout_rate=0, is_training=True, scope="prenet", reuse=None): 283 | '''Prenet for Encoder and Decoder. 284 | Args: 285 | inputs: A 3D tensor of shape [N, T, hp.embed_size]. 286 | num_units" A list of two integers. 287 | is_training: A boolean. 288 | scope: Optional scope for `variable_scope`. 289 | reuse: Boolean, whether to reuse the weights of a previous layer 290 | by the same name. 291 | 292 | Returns: 293 | A 3D tensor of shape [N, T, num_units/2]. 
294 | ''' 295 | if num_units is None: 296 | num_units = [inputs.get_shape()[-1], inputs.get_shape()[-1]] 297 | 298 | with tf.variable_scope(scope, reuse=reuse): 299 | outputs = tf.layers.dense(inputs, units=num_units[0], activation=tf.nn.relu, name="dense1") 300 | outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=is_training, name="dropout1") 301 | outputs = tf.layers.dense(outputs, units=num_units[1], activation=tf.nn.relu, name="dense2") 302 | outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=is_training, name="dropout2") 303 | 304 | return outputs # (N, T, num_units[1]) 305 | 306 | 307 | def highwaynet(inputs, num_units=None, scope="highwaynet", reuse=None): 308 | '''Highway networks, see https://arxiv.org/abs/1505.00387 309 | Args: 310 | inputs: A 3D tensor of shape [N, T, W]. 311 | num_units: An int or `None`. Specifies the number of units in the highway layer 312 | or uses the input size if `None`. 313 | scope: Optional scope for `variable_scope`. 314 | reuse: Boolean, whether to reuse the weights of a previous layer 315 | by the same name. 316 | Returns: 317 | A 3D tensor of shape [N, T, W]. 318 | ''' 319 | if num_units is None: 320 | num_units = inputs.get_shape()[-1] 321 | 322 | with tf.variable_scope(scope, reuse=reuse): 323 | H = tf.layers.dense(inputs, units=num_units, activation=tf.nn.relu, name="H") 324 | T = tf.layers.dense(inputs, units=num_units, activation=tf.nn.sigmoid, name="T") 325 | C = 1. - T 326 | outputs = H * T + inputs * C 327 | 328 | return outputs 329 | 330 | 331 | 332 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy >= 1.11.1 2 | sugartensor >= 0.0.1.8 (Check [this repository](https://github.com/buriburisuri/sugartensor) for installing sugartensor) 3 | nltk >= 3.2.1 (You need to download `brown` corpus) 4 | gensim >= 0.13.1 (For creating word2vec models) 5 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # /usr/bin/python2 3 | ''' 4 | June 2017 by kyubyong park. 5 | kbpark.linguist@gmail.com. 
6 | https://www.github.com/kyubyong/neural_tokenizer 7 | ''' 8 | from __future__ import print_function 9 | from hyperparams import Hyperparams as hp 10 | import tensorflow as tf 11 | from data_load import get_batch_data, load_vocab, load_data 12 | from modules import * 13 | from tqdm import tqdm 14 | 15 | class Graph: 16 | def __init__(self, is_training=True): 17 | self.graph = tf.Graph() 18 | with self.graph.as_default(): 19 | # Load data 20 | self.x, self.y, self.num_batch = get_batch_data() # (N, T) 21 | 22 | # Load vocabulary 23 | char2idx, idx2char = load_vocab() 24 | 25 | # Encoder 26 | ## Embedding 27 | enc = embedding(self.x, 28 | vocab_size=len(char2idx), 29 | num_units=hp.hidden_units, 30 | scale=False, 31 | scope="enc_embed") 32 | 33 | # Encoder pre-net 34 | prenet_out = prenet(enc, 35 | num_units=[hp.hidden_units, hp.hidden_units//2], 36 | dropout_rate=hp.dropout_rate, 37 | is_training=is_training) # (N, T, E/2) 38 | 39 | # Encoder CBHG 40 | ## Conv1D bank 41 | enc = conv1d_banks(prenet_out, 42 | K=hp.encoder_num_banks, 43 | num_units=hp.hidden_units//2, 44 | norm_type="ins", 45 | is_training=is_training) # (N, T, K * E / 2) 46 | 47 | ### Max pooling 48 | enc = tf.layers.max_pooling1d(enc, 2, 1, padding="same") # (N, T, K * E / 2) 49 | 50 | ### Conv1D projections 51 | enc = conv1d(enc, hp.hidden_units//2, 3, scope="conv1d_1") # (N, T, E/2) 52 | enc = normalize(enc, type="ins", is_training=is_training, activation_fn=tf.nn.relu) 53 | enc = conv1d(enc, hp.hidden_units//2, 3, scope="conv1d_2") # (N, T, E/2) 54 | enc += prenet_out # (N, T, E/2) # residual connections 55 | 56 | ### Highway Nets 57 | for i in range(hp.num_highwaynet_blocks): 58 | enc = highwaynet(enc, num_units=hp.hidden_units//2, 59 | scope='highwaynet_{}'.format(i)) # (N, T, E/2) 60 | 61 | ### Bidirectional GRU 62 | enc = gru(enc, hp.hidden_units//2, True) # (N, T, E) 63 | 64 | # Final linear projection 65 | self.logits = tf.layers.dense(enc, 2) # 0 for non-space, 1 for space 66 | 67 | self.preds = tf.to_int32(tf.arg_max(self.logits, dimension=-1)) 68 | self.istarget = tf.to_float(tf.not_equal(self.x, 0)) # masking 69 | self.num_hits = tf.reduce_sum(tf.to_float(tf.equal(self.preds, self.y)) * self.istarget) 70 | self.num_targets = tf.reduce_sum(self.istarget) 71 | self.acc = self.num_hits / self.num_targets 72 | 73 | if is_training: 74 | # Loss 75 | self.loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, labels=self.y) 76 | self.mean_loss = tf.reduce_sum(self.loss * self.istarget) / (tf.reduce_sum(self.istarget)) 77 | 78 | # Training Scheme 79 | self.global_step = tf.Variable(0, name='global_step', trainable=False) 80 | self.optimizer = tf.train.AdamOptimizer(learning_rate=hp.lr, beta1=0.9, beta2=0.98, epsilon=1e-8) 81 | self.train_op = self.optimizer.minimize(self.mean_loss, global_step=self.global_step) 82 | 83 | # # Summary 84 | # tf.summary.scalar('mean_loss', self.mean_loss) 85 | # tf.summary.merge_all() 86 | 87 | 88 | 89 | if __name__ == '__main__': 90 | # Construct graph 91 | g = Graph() 92 | print("Graph loaded") 93 | 94 | char2idx, idx2char = load_vocab() 95 | with g.graph.as_default(): 96 | # For validation 97 | X_val, Y_val = load_data(mode="val") 98 | num_batch = len(X_val) // hp.batch_size 99 | 100 | # Start session 101 | sv = tf.train.Supervisor(graph=g.graph, 102 | logdir=hp.logdir, 103 | save_model_secs=0) 104 | with sv.managed_session() as sess: 105 | for epoch in range(1, hp.num_epochs + 1): 106 | if sv.should_stop(): break 107 | for step in tqdm(range(g.num_batch), 
total=g.num_batch, ncols=70, leave=False, unit='b'): 108 | sess.run(g.train_op) 109 | 110 | # logging 111 | if step % 100 == 0: 112 | gs, mean_loss = sess.run([g.global_step, g.mean_loss]) 113 | print("\nAfter global steps %d, the training loss is %.2f" % (gs, mean_loss)) 114 | 115 | # Save 116 | gs = sess.run(g.global_step) 117 | sv.saver.save(sess, hp.logdir + '/model_epoch_%02d_gs_%d' % (epoch, gs)) 118 | 119 | # Validation check 120 | total_hits, total_targets = 0, 0 121 | for step in tqdm(range(num_batch), total=num_batch, ncols=70, leave=False, unit='b'): 122 | x = X_val[step*hp.batch_size:(step+1)*hp.batch_size] 123 | y = Y_val[step*hp.batch_size:(step+1)*hp.batch_size] 124 | num_hits, num_targets = sess.run([g.num_hits, g.num_targets], {g.x: x, g.y: y}) 125 | total_hits += num_hits 126 | total_targets += num_targets 127 | print("\nAfter epoch %d, the validation accuracy is %d/%d=%.2f" % (epoch, total_hits, total_targets, total_hits/total_targets)) 128 | 129 | print("Done") 130 | 131 | 132 | --------------------------------------------------------------------------------
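For completeness, here is a minimal inference sketch that mirrors the restore-and-feed logic of `eval.py` to insert spaces into a single raw string. It is not part of the repository, and it assumes a trained checkpoint already exists in `hp.logdir`; the `tokenize` helper and the hand-built batch are illustrative only.

```python
# Hypothetical helper, mirroring eval.py: restore the latest checkpoint and
# predict space positions for one unspaced string. Not part of the repo.
import tensorflow as tf
from hyperparams import Hyperparams as hp
from data_load import load_vocab
from train import Graph

def tokenize(text):
    char2idx, idx2char = load_vocab()
    ids = [char2idx[c] for c in text if c in char2idx][:hp.maxlen]

    # g.x comes from a batch queue of shape (batch_size, maxlen), so we feed a
    # full batch of that shape and only read back the first row.
    batch = [ids + [0] * (hp.maxlen - len(ids))] + \
            [[0] * hp.maxlen for _ in range(hp.batch_size - 1)]

    g = Graph(is_training=False)
    with g.graph.as_default():
        sv = tf.train.Supervisor()
        with sv.managed_session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
            sv.saver.restore(sess, tf.train.latest_checkpoint(hp.logdir))
            preds = sess.run(g.preds, {g.x: batch})  # (batch_size, maxlen)

    out = ""
    for idx, p in zip(batch[0], preds[0]):
        if idx == 0:             # padding: stop
            break
        out += idx2char[idx]
        if p == 1:               # model predicts a space after this character
            out += " "
    return out

if __name__ == "__main__":
    print(tokenize("Haveyoueverheardofthuggee"))
```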