├── README.md
├── dat
│   └── README.md
├── src
│   ├── input.txt
│   ├── learn.py
│   ├── network.py
│   ├── output.txt
│   ├── performace.py
│   ├── vec4net.py
│   ├── vietseg.py
│   └── word2vec.py
└── var
    ├── 100features_10context_7minwords.vec
    └── 30hidden-30epochs-10batch-0.5eta-5.0lambda-7window-100shape-3run.net
/README.md:
--------------------------------------------------------------------------------
1 | VietSeg
2 | =======
3 |
4 | A Vietnamese Word Segmentation program using a Neural Network
5 |
6 | ## How to use:
7 |
8 | - Download the source code
9 |
10 | - Change to the `src` folder
11 |
12 | - Put a text file to segment into the `src` folder; here I use `input.txt`
13 |
14 | - Run `python3 vietseg.py input.txt output.txt` (note: this program uses
15 |   Python 3; Python 2 won't work, though you could port it, of course)
16 |
17 | - Now you have `output.txt`, which contains the segmented text (see the example session after this list)
18 |
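For example, a full session might look like this (a sketch, assuming the bundled
`input.txt` and the pre-trained model already shipped in the `var` folder):

    cd src
    python3 vietseg.py input.txt output.txt

The output keeps one sentence per line and joins the syllables of each segmented
word with underscores, e.g. `duy trì` becomes `duy_trì`.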
19 | ## Performance:
20 |
21 | Precision, Recall, and F1-measure on the same data as described
22 | in [this paper][10]:
23 |
24 | RESULT:
25 | ===================
26 | Run 0: P = 0.9156, R = 0.9294, F = 0.9225
27 | Run 1: P = 0.9015, R = 0.9183, F = 0.9099
28 | Run 2: P = 0.9189, R = 0.9327, F = 0.9258
29 | Run 3: P = 0.9208, R = 0.9339, F = 0.9273
30 | Run 4: P = 0.9166, R = 0.9295, F = 0.9230
31 | ===================
32 | Avg. P = 0.9147, R = 0.9288, F = 0.9217
33 |
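Here P is the fraction of machine-produced words that match the reference
segmentation, R is the fraction of reference words the machine recovers, and F is
the usual F1 score, F = 2PR / (P + R). `performace.py` aligns each machine-segmented
sentence with the reference via a longest-common-subsequence match and aggregates
the word counts over all test sentences in each run.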
34 |
35 | And here is the best performance in the [paper][10]:
36 |
37 | P = 94.00, R = 94.45, F = 94.23
38 |
39 | The program uses some random shuffling, so your results may not be exactly
40 | the same as mine.
41 |
42 | ## Train the model yourself:
43 |
44 | - Get the data (see links below) and put it in the `dat` folder
45 |
46 | - Change working directory to the `src` folder
47 |
48 | - Run `python3 word2vec.py` to get vectorized words for our segmenting model
49 |   using the Word2Vec library (Word2Vec is itself a neural network)
50 |
51 | - Run `python3 learn.py` to train the segmenting model itself
52 |
53 | - Run `python3 performace.py` to examine the performance of the model
54 |
55 | - Now you can run `python3 vietseg.py input.txt output.txt` as described above (a command summary follows this list)
56 |
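A command summary of the whole training pipeline (a sketch; it assumes the data
files listed below are already in the `dat` folder):

    cd src
    python3 word2vec.py                      # word embeddings -> var/
    python3 learn.py                         # segmentation network, 5 CV runs -> var/
    python3 performace.py                    # P / R / F for each run plus the average
    python3 vietseg.py input.txt output.txt  # segment your own text

`learn.py` also accepts optional `[from] [to]` arguments to restrict the
cross-validation runs, e.g. `python3 learn.py 0 2`.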
57 |
58 | ## Data for training model:
59 |
60 | - Vietnamese corpus:
61 | - File: [VNESEcorpus.txt][2]
62 | - Move the file to `dat` folder
63 |
64 | - Vietnamese IOB training data:
65 | - File: [trainingdata.tar.gz][3]
66 |     - Untar it and put the 10 files (test1.iob2 -> test5.iob2, train1.iob2 -> train5.iob2)
67 |       into the `dat` folder, along with `VNESEcorpus.txt`
68 |
69 | ## Future works:
70 |
71 | - Speed up the network
72 | - Use a professional deep learning package (Theano, Caffe, etc.)
73 | - Train the model with a bigger corpus and training data files, like [these][9]
74 | - Deal with uppercase characters
75 | - Build a web app
76 |
77 | ## Acknowledgment:
78 |
79 | This program uses some code from [wendykan][4] and [mnielsen][5].
80 | See the source code for details.
81 |
82 | ## Similar programs:
83 |
84 | - [JVnSegmenter][1]: Java
85 | - [vnTokenizer][6]: Java
86 | - [Dongdu][7]: C++
87 | - [Roy\_VnTokenizer][11]: Python
88 | - [VLSP][8]: PHP?
89 |
90 | ## Last words:
91 |
92 | > sophisticated algorithm ≤ simple learning algorithm + good training data
93 |
94 |
95 | [1]: http://jvnsegmenter.sourceforge.net/
96 | [2]: http://viet.jnlp.org/download-du-lieu-tu-vung-corpus
97 | [3]: http://sourceforge.net/projects/jvnsegmenter/files/jvnsegmenter/JVnSegmenter/trainingdata.tar.gz/download
98 | [4]: https://github.com/wendykan/DeepLearningMovies
99 | [5]: https://github.com/mnielsen/neural-networks-and-deep-learning
100 | [6]: http://mim.hus.vnu.edu.vn/phuonglh/softwares/vnTokenizer
101 | [7]: https://github.com/rockkhuya/DongDu
102 | [8]: http://vlsp.vietlp.org:8080/demo/?page=seg_pos_chunk
103 | [9]: http://vlsp.vietlp.org:8080/demo/?page=resources
104 | [11]: https://github.com/roy-a/Roy_VnTokenizer
105 | [10]: http://aclweb.org/anthology/Y/Y06/Y06-1028.pdf
106 |
107 |
108 |
109 |
110 |
--------------------------------------------------------------------------------
/dat/README.md:
--------------------------------------------------------------------------------
1 | [VNESEcorpus.txt](http://viet.jnlp.org/download-du-lieu-tu-vung-corpus)
2 |
3 | [trainingdata.tar.gz](http://sourceforge.net/projects/jvnsegmenter/files/jvnsegmenter/JVnSegmenter/trainingdata.tar.gz/download)
4 |
--------------------------------------------------------------------------------
/src/input.txt:
--------------------------------------------------------------------------------
1 | Chủ nhân website muốn duy trì domain .com phải trả 6,42 USD ( tăng 7 % ) , còn .net sẽ tốn 3,85 USD ( tăng 10 % ) mỗi năm .
2 | Hãng điều hành tên miền VeriSign sẽ thu thêm 29 triệu USD/năm từ 62 triệu trang .com và 9,1 triệu trang .net .
3 | Sự thay đổi này sẽ bắt đầu vào ngày 15/10 năm nay và chỉ áp dụng với tên đăng ký mới hoặc gia hạn .
4 | Những tên miền đã đóng phí duy trì 10 năm hay 100 năm vẫn theo giá cũ .
5 | VeriSign cho biết nhu cầu đăng ký tên miền ngày càng tăng .
6 | Hiện , máy chủ DNS của họ nhận 30 tỷ yêu cầu mỗi ngày , gấp 30 lần so với năm 2000 .
7 | VeriSign phải khởi động dự án mang tên Titan để mở rộng dung lượng của hệ thống , đáp ứng được 4 nghìn tỷ thắc mắc/ngày vào năm 2010 .
8 | Tăng dung lượng máy chủ cũng là một biện pháp đối phó với nguy cơ tấn công từ chối dịch vụ .
9 | Nếu để xảy ra tình trạng này , toàn bộ các trang mà VeriSign quản lý sẽ `` chết đứng '' .
10 | T. H. ( theo AP ) T. H. Nhân
11 |
--------------------------------------------------------------------------------
/src/learn.py:
--------------------------------------------------------------------------------
1 |
2 | ###############################################################################
3 | # Learn the neural network parameters
4 | ###############################################################################
5 |
6 | import sys
7 | import os
8 | import re
9 | import codecs
10 |
11 | from vec4net import make_list, WINDOW, SHAPE
12 | SIZE = WINDOW*SHAPE
13 |
14 | import network
15 |
16 | def get_files(ft='train'):
17 |     "Get the train/test file list for further processing"
18 | path_data = '../dat/'
19 | tmp_lst = os.listdir(path_data)
20 | file_lst = []
21 | for item in tmp_lst:
22 |         if re.search(r'\A' + ft + r'.*\.iob2', item):
23 | file_lst.append(path_data + item)
24 | return file_lst
25 |
26 | def get_sentences(file_lst, RUN):
27 |     "Get sentences and corresponding tags from the file at index RUN in the file list"
28 | sentences = []
29 | item = file_lst[RUN]
30 | f = codecs.open(item, encoding='utf-8', mode = 'r', errors = 'ignore')
31 | sent = []
32 | tags = []
33 | word = ''
34 | tag = ''
35 | for line in f:
36 | line = line.lower().split() # TODO: Think about lower!
37 | if not line: # End of sentence.
38 | if word:
39 | word = '' # Flush word buffer.
40 | tag = ''
41 | if sent:
42 | sentences.append((sent, tags))
43 | sent = [] # Flush sentence buffer.
44 | tags = []
45 | continue
46 | word, tag = line
47 | sent.append(word)
48 | tags.append(tag[0]) # First letter
49 | f.close()
50 | if sent: sentences.append((sent, tags))
51 | return sentences
52 |
53 | def make_train_test(RUN):
54 | train = []
55 | test = []
56 | train_list = get_files('train')
57 | train_sent = get_sentences(train_list, RUN)
58 | for s in train_sent:
59 | train += list(make_list(s))
60 | test_list = get_files('test')
61 | test_sent = get_sentences(test_list, RUN)
62 | for s in test_sent:
63 | test += list(make_list(s))
64 | return train, test
65 |
66 |
67 | # Network learning, this is the awesome part
68 | if __name__ == '__main__':
69 | HIDDEN = 30
70 | EPOCHS = 30
71 | MINI_BATCH_SIZE = 10
72 | ETA = 0.5
73 | LAMBDA = 5.0
74 | accuracy = {}
75 | n_args = len(sys.argv)
76 | if n_args == 1:
77 | a = 0
78 | b = 5
79 | elif n_args == 2:
80 | a = int(sys.argv[1])
81 | b = a+1
82 | else:
83 | a = int(sys.argv[1])
84 | b = int(sys.argv[2]) + 1
85 | try:
86 | assert 0 <= a < b <= 5
87 |     except AssertionError:
88 | print('''
89 | Usage: python3 learn.py [from] [to]
90 | ===================================
91 | * [from] and [to] are optional, defaulting to 0 and 4, i.e. run all
92 | 5 cross-validation runs from 0 to 4
93 | * If specified, [from] and [to] must satisfy the inequality
94 | 0 <= [from] <= [to] <= 4
95 | ''')
96 | sys.exit(1)
97 | # We run a to b cross-validation and take the average accuracy
98 | ranges = range(a, b)
99 | for RUN in ranges:
100 | print('Run cross-validation #{}'.format(RUN))
101 | net = network.Network([SIZE, HIDDEN, 3])
102 | training_data, test_data = make_train_test(RUN)
103 | # training_data = training_data[:1000]; test_data = test_data[:100]
104 | acc = net.SGD(training_data, EPOCHS, MINI_BATCH_SIZE, ETA,
105 | lmbda=LAMBDA,
106 | evaluation_data=test_data,
107 | monitor_evaluation_accuracy=True)
108 | # Take the last (not best) accuracy and length of the test set
109 | accuracy[RUN] = (acc[-1], len(test_data))
110 | # Save the model for later use
111 | net.save('../var/{}hidden-{}epochs-{}batch-{}eta-{}lambda-{}window-{}shape-{}run.net'.\
112 | format(HIDDEN, EPOCHS, MINI_BATCH_SIZE, ETA, LAMBDA, WINDOW, SHAPE, RUN))
113 | # Display the result
114 | print("===================")
115 | print("FINAL RESULTS:")
116 | print("===================")
117 | for i in ranges:
118 | print("Accuracy on CV #{0} is {1:.2f} %"\
119 | .format(i, accuracy[i][0]/accuracy[i][1]*100))
120 | print("===================")
121 | a = sum([accuracy[i][0] for i in ranges])
122 | b = sum([accuracy[i][1] for i in ranges])
123 | print("Average accuracy: {:.2f} %".format(a/b*100))
124 |
125 |
126 |
127 |
128 |
--------------------------------------------------------------------------------
/src/network.py:
--------------------------------------------------------------------------------
1 |
2 | ###############################################################################
3 | # Simple neural network model
4 | # Modified version of 'network2.py' in:
5 | # https://github.com/mnielsen/neural-networks-and-deep-learning
6 | ###############################################################################
7 |
8 | # TODO:
9 | # 1) Implement matrix-based computation for mini batch
10 | # 2) Adapt a version for on-line learning using iterator objects in Python 3.
11 | #    Using iterators would permit a larger Word2Vec dimension per word as well as
12 | #    a larger input2vec context window, which might increase accuracy.
13 |
14 | import json
15 | import random
16 | import sys
17 |
18 | import numpy as np
19 | from statistics import mean
20 |
21 | #### Define the cross-entropy cost functions
22 | class CrossEntropyCost:
23 |
24 | @staticmethod
25 | def fn(a, y):
26 | """Return the cost associated with an output ``a`` and desired output
27 | ``y``. Note that np.nan_to_num is used to ensure numerical
28 | stability. In particular, if both ``a`` and ``y`` have a 1.0
29 | in the same slot, then the expression (1-y)*np.log(1-a)
30 | returns nan. The np.nan_to_num ensures that that is converted
31 | to the correct value (0.0).
32 |
33 | """
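        # Illustrative check (not from the original code): for
        # a = [[0.9], [0.1], [0.2]] and y = [[1.0], [0.0], [0.0]] the cost is
        # -ln(0.9) - ln(0.9) - ln(0.8), roughly 0.43.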
34 | return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))
35 |
36 | @staticmethod
37 | def delta(a, y):
38 | """Return the error delta from the output layer.
39 | """
40 | return (a-y)
41 |
42 |
43 | #### Main Network class
44 | class Network():
45 |
46 | def __init__(self, sizes, cost=CrossEntropyCost):
47 | """The list ``sizes`` contains the number of neurons in the respective
48 | layers of the network. For example, if the list was [2, 3, 1]
49 | then it would be a three-layer network, with the first layer
50 | containing 2 neurons, the second layer 3 neurons, and the
51 | third layer 1 neuron. The biases and weights for the network
52 | are initialized randomly, using
53 | ``self.default_weight_initializer`` (see docstring for that
54 | method).
55 |
56 | """
57 | self.num_layers = len(sizes)
58 | self.sizes = sizes
59 | self.default_weight_initializer()
60 | self.cost=cost
61 |
62 | def default_weight_initializer(self):
63 | """Initialize each weight using a Gaussian distribution with mean 0
64 | and standard deviation 1 over the square root of the number of
65 | weights connecting to the same neuron. Initialize the biases
66 | using a Gaussian distribution with mean 0 and standard
67 | deviation 1.
68 |
69 | Note that the first layer is assumed to be an input layer, and
70 | by convention we won't set any biases for those neurons, since
71 | biases are only ever used in computing the outputs from later
72 | layers.
73 |
74 | """
75 | self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
76 | self.weights = [np.random.randn(y, x)/np.sqrt(x)
77 | for x, y in zip(self.sizes[:-1], self.sizes[1:])]
78 |
79 | def feedforward(self, a):
80 | """Return the output of the network if ``a`` is input."""
81 | for b, w in zip(self.biases, self.weights):
82 | a = sigmoid_vec(np.dot(w, a)+b)
83 | return a
84 |
85 | def SGD(self, training_data, epochs, mini_batch_size, eta,
86 | lmbda = 0.0,
87 | evaluation_data=None,
88 | monitor_evaluation_cost=False,
89 | monitor_evaluation_accuracy=False):
90 | """Train the neural network using mini-batch stochastic gradient
91 | descent. The ``training_data`` is a list of tuples ``(x, y)``
92 | representing the training inputs and the desired outputs. The
93 | other non-optional parameters are self-explanatory, as is the
94 | regularization parameter ``lmbda``. The method also accepts
95 | ``evaluation_data``, usually either the validation or test data.
96 | The method returns the list of accuracies on the evaluation data.
97 | All values are evaluated at the end of each training epoch.
98 | """
99 | if evaluation_data: n_data = len(evaluation_data)
100 | n = len(training_data)
101 | evaluation_accuracy = []
102 | for j in range(epochs):
103 | random.shuffle(training_data)
104 | mini_batches = [
105 | training_data[k:k+mini_batch_size]
106 | for k in range(0, n, mini_batch_size)]
107 | for mini_batch in mini_batches:
108 | self.update_mini_batch(
109 | mini_batch, eta, lmbda, len(training_data))
110 | print("Epoch %s training complete" % j)
111 | if monitor_evaluation_accuracy:
112 | accuracy = self.accuracy(evaluation_data)
113 | print("Accuracy on evaluation data: {0} / {1} <=> {2:.2f} %".\
114 | format(accuracy, n_data, accuracy/n_data*100)
115 | )
116 |             # Halve eta when there has been no improvement over the last 3 epochs
117 | if len(evaluation_accuracy) > 3 and \
118 | accuracy <= mean(evaluation_accuracy[-3:]):
119 | eta /= 2
120 | print("ETA now is {}".format(eta))
121 | evaluation_accuracy.append(accuracy)
122 | return evaluation_accuracy
123 |
124 | def update_mini_batch(self, mini_batch, eta, lmbda, n):
125 | """Update the network's weights and biases by applying gradient
126 | descent using backpropagation to a single mini batch. The
127 | ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
128 | learning rate, ``lmbda`` is the regularization parameter, and
129 | ``n`` is the total size of the training data set.
130 | """
131 | nabla_b = [np.zeros(b.shape) for b in self.biases]
132 | nabla_w = [np.zeros(w.shape) for w in self.weights]
133 | for x, y in mini_batch:
134 | delta_nabla_b, delta_nabla_w = self.backprop(x, y)
135 | nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
136 | nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
137 | self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw
138 | for w, nw in zip(self.weights, nabla_w)]
139 | self.biases = [b-(eta/len(mini_batch))*nb
140 | for b, nb in zip(self.biases, nabla_b)]
141 |
142 | def backprop(self, x, y):
143 | """Return a tuple ``(nabla_b, nabla_w)`` representing the
144 | gradient for the cost function C_x. ``nabla_b`` and
145 | ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
146 | to ``self.biases`` and ``self.weights``."""
147 | nabla_b = [np.zeros(b.shape) for b in self.biases]
148 | nabla_w = [np.zeros(w.shape) for w in self.weights]
149 | # feedforward
150 | activation = x
151 | activations = [x] # list to store all the activations, layer by layer
152 | zs = [] # list to store all the z vectors, layer by layer
153 | for b, w in zip(self.biases, self.weights):
154 | z = np.dot(w, activation)+b
155 | zs.append(z)
156 | activation = sigmoid_vec(z)
157 | activations.append(activation)
158 | # backward pass
159 | delta = (self.cost).delta(activations[-1], y)
160 | nabla_b[-1] = delta
161 | nabla_w[-1] = np.dot(delta, activations[-2].transpose())
162 | # l = 1 means the last layer of neurons, l = 2 is the
163 | # second-last layer, and so on.
164 | for l in range(2, self.num_layers):
165 | z = zs[-l]
166 | spv = sigmoid_prime_vec(z)
167 | delta = np.dot(self.weights[-l+1].transpose(), delta) * spv
168 | nabla_b[-l] = delta
169 | nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
170 | return (nabla_b, nabla_w)
171 |
172 | def accuracy(self, data):
173 | """Return the number of inputs in ``data`` for which the neural
174 | network outputs the correct result. The neural network's
175 | output is assumed to be the index of whichever neuron in the
176 | final layer has the highest activation.
177 | """
178 | results = [(np.argmax(self.feedforward(x)), np.argmax(y))
179 | for (x, y) in data]
180 | return sum((x == y).all() for (x, y) in results)
181 |
182 | def save(self, filename):
183 | """Save the neural network to the file ``filename``."""
184 | data = {"sizes": self.sizes,
185 | "weights": [w.tolist() for w in self.weights],
186 | "biases": [b.tolist() for b in self.biases],
187 | "cost": str(self.cost.__name__)}
188 | f = open(filename, "w")
189 | json.dump(data, f)
190 | f.close()
191 |
192 | #### Loading a Network
193 | def load(filename):
194 | """Load a neural network from the file ``filename``. Returns an
195 | instance of Network.
196 | """
197 | f = open(filename, "r")
198 | data = json.load(f)
199 | f.close()
200 | cost = getattr(sys.modules[__name__], data["cost"])
201 | net = Network(data["sizes"], cost=cost)
202 | net.weights = [np.array(w) for w in data["weights"]]
203 | net.biases = [np.array(b) for b in data["biases"]]
204 | return net
205 |
206 | #### Miscellaneous functions
207 | def vectorized_result(j):
208 | """Return a 3-dimensional unit vector with a 1.0 in the j'th position
209 |     and zeroes elsewhere. This is used to convert an IOB code ('i', 'o', 'b')
210 | into a corresponding desired output from the neural network.
211 | """
212 | e = np.zeros((3, 1))
213 | e[j] = 1.0
214 | return e
215 |
216 | def sigmoid(z):
217 | """The sigmoid function."""
218 | return 1.0/(1.0+np.exp(-z))
219 |
220 | sigmoid_vec = np.vectorize(sigmoid)
221 |
222 | def sigmoid_prime(z):
223 | """Derivative of the sigmoid function."""
224 | return sigmoid(z)*(1-sigmoid(z))
225 |
226 | sigmoid_prime_vec = np.vectorize(sigmoid_prime)
227 |
228 |
229 |
230 |
--------------------------------------------------------------------------------
/src/output.txt:
--------------------------------------------------------------------------------
1 | Chủ_nhân website muốn duy_trì domain .com phải trả 6,42 USD ( tăng 7 % ) , còn .net sẽ tốn 3,85 USD ( tăng 10 % ) mỗi năm .
2 | Hãng điều_hành tên miền VeriSign sẽ thu thêm 29 triệu USD/năm từ 62 triệu trang .com và 9,1 triệu trang .net .
3 | Sự thay_đổi này sẽ bắt_đầu vào ngày 15/10 năm_nay và chỉ áp_dụng với tên đăng_ký mới hoặc gia_hạn .
4 | Những tên miền đã đóng_phí duy_trì 10 năm hay 100 năm vẫn theo giá cũ .
5 | VeriSign cho_biết nhu_cầu đăng_ký tên miền ngày càng tăng .
6 | Hiện , máy chủ DNS của họ nhận 30 tỷ yêu_cầu mỗi ngày , gấp 30 lần so với năm 2000 .
7 | VeriSign phải khởi_động dự_án mang tên Titan để mở_rộng dung_lượng của hệ_thống , đáp ứng được 4 nghìn tỷ thắc mắc/ngày vào năm 2010 .
8 | Tăng dung_lượng máy_chủ cũng là một biện_pháp đối_phó với nguy_cơ tấn_công từ chối dịch_vụ .
9 | Nếu để xảy_ra tình_trạng này , toàn bộ các trang mà VeriSign quản_lý sẽ `` chết đứng '' .
10 | T. H. ( theo AP ) T. H. Nhân
11 |
--------------------------------------------------------------------------------
/src/performace.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from vietseg import _make_words, _get_iob
3 | from learn import get_files, get_sentences
4 | from vec4net import make_vec
5 | import network
6 |
7 |
8 |
9 | def _make_hyp_ref(RUN):
10 | "Make human sentences and machine sentences"
11 | JSON = '../var/30hidden-30epochs-10batch-0.5eta-5.0lambda-7window-100shape-{}run.net'.format(RUN)
12 | net = network.load(JSON)
13 |
14 | def classify(token_list):
15 |         "Classify a list of tokens to compare with the human segmentation"
16 | result = []
17 | sen_vec = make_vec(token_list)
18 | for x in sen_vec:
19 | result.append(_get_iob(net.feedforward(x)))
20 | return result
21 |
22 | test_list = get_files('test')
23 | test_sent = get_sentences(test_list, RUN)
24 | sents_ref = [_make_words(sent, tags) for sent, tags in test_sent]
25 | sents_hyp = [_make_words(sent, classify(sent)) for sent, _ in test_sent]
26 | return sents_ref, sents_hyp
27 |
28 | def _run_test(RUN):
29 |     "Test performance on CV #RUN"
30 |     # Change RUN to a constant to evaluate one specific model against all test sets
31 | # sents_ref, sents_hyp = _make_hyp_ref(3)
32 | sents_ref, sents_hyp = _make_hyp_ref(RUN)
33 | n_hyp = 0
34 | n_corr = 0
35 | n_ref = 0
36 | nSents = len(sents_ref)
37 | n_hyps = []
38 | n_refs = []
39 | n_corrs = []
40 | for n in range(nSents):
41 | sent1 = sents_ref[n] # Human's segmenting
42 | sent2 = sents_hyp[n] # Machine's segmenting
43 |
44 | n_ref_ = len(sent1)
45 | n_hyp_ = len(sent2)
46 |
47 | # Finding optimal alignment and consequently no. of correct words
48 | # by dynamic programming. Longest Common Subsequence problem.
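        # Illustrative example: ref = ['chủ_nhân', 'website'] and
        # hyp = ['chủ', 'nhân', 'website'] share only 'website',
        # so n_corr_ for that sentence would be 1.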
49 | l = np.zeros([n_ref_+1, n_hyp_+1])
50 |
51 | for row in range(1,l.shape[0]):
52 | for col in range(1,l.shape[1]):
53 | if sent1[row-1] == sent2[col-1]:
54 | l[row][col] = l[row-1][col-1] + 1
55 | else:
56 | l[row][col] = max([l[row][col-1], l[row-1][col]])
57 | n_corr_ = l[n_ref_][n_hyp_]
58 |
59 | n_hyp += n_hyp_
60 | n_ref += n_ref_
61 | n_corr += n_corr_
62 | n_hyps.append(n_hyp_)
63 | n_corrs.append(n_corr_)
64 | n_refs.append(n_ref_)
65 |
66 | prec = n_corr / n_hyp
67 | recall = n_corr / n_ref
68 | fratio = 2*prec*recall / (prec + recall)
69 |
70 | return prec, recall, fratio
71 |
72 | def _main(nRuns):
73 | "Run the test"
74 | P_ = R_ = F_ = 0
75 | print("RESULT:")
76 | print("===================")
77 | for RUN in range(nRuns):
78 | prec, recall, fratio = _run_test(RUN)
79 | print("Run " + "%.0f" % RUN + ": P = " + "%.4f" % prec + ", R = " + \
80 | "%.4f" % recall + ", F = " + "%.4f" % fratio)
81 |
82 | P_ += prec
83 | R_ += recall
84 | F_ += fratio
85 |
86 | P_ /= nRuns
87 | R_ /= nRuns
88 | F_ /= nRuns
89 |
90 | print("===================")
91 | print("Avg. P = "+ "%.4f" % P_ + ", R = " + "%.4f" % R_ + ", F = " +\
92 | "%.4f" % F_)
93 |
94 | if __name__ == '__main__':
95 | _main(5)
96 |
97 |
98 |
--------------------------------------------------------------------------------
/src/vec4net.py:
--------------------------------------------------------------------------------
1 |
2 | ###############################################################################
3 | # Convert word to vector
4 | ###############################################################################
5 |
6 | import numpy as np
7 | from gensim.models import Word2Vec
8 |
9 | # Load Word2Vec model from file, use your model's result or use mine
10 | MODEL = Word2Vec.load('../var/100features_10context_7minwords.vec')
11 | WINDOW = 7
12 | SHAPE = MODEL.syn0.shape[1]
13 |
14 | def word2index(model, word):
15 |     "Convert a word to its embedding vector using the trained Word2Vec model"
16 | try:
17 | if word == -1:
18 | result = np.zeros((SHAPE,))
19 | else:
20 | result = model[word]
21 | except:
22 | np.random.seed(len(word))
23 | result = 0.2 * np.random.uniform(-1, 1, SHAPE)
24 | return result
25 |
26 | def context_window(l):
27 |     "Make a context window around each word in a sentence; 'WINDOW' must be odd"
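    # Illustration with a hypothetical WINDOW of 3 (the real WINDOW is 7):
    # context_window(['a', 'b', 'c']) ->
    #     [[-1, 'a', 'b'], ['a', 'b', 'c'], ['b', 'c', -1]]
    # i.e. one window per word, padded with the -1 sentinel at the edges.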
28 | l = list(l)
29 | lpadded = WINDOW//2 * [-1] + l + WINDOW//2 * [-1]
30 | out = [ lpadded[i:i+WINDOW] for i in range(len(l)) ]
31 | assert len(out) == len(l)
32 | return out
33 |
34 | def context_matrix(model, conwin):
35 |     "Return a list with one map of word vectors for each context window"
36 | return [map(lambda x: word2index(model, x), win) for win in conwin]
37 |
38 | def context_vector(cm):
39 | "Convert context matrix to vector"
40 | return [np.squeeze(np.asarray(list(x))).reshape((WINDOW*SHAPE,1)) for x in cm]
41 |
42 | def iob_map(iob):
43 | "Convert i/o/b to vector"
44 | d = {'i': [1,0,0], 'o': [0,1,0], 'b': [0,0,1]}
45 | return np.asarray(d[iob]).reshape((3,1))
46 |
47 | def iob_vector(iob):
48 | "Make a vector for iob list"
49 | return map(iob_map, iob)
50 |
51 | def make_tuple(context_vector, iob_vector):
52 | "Pairing sentences and iob tag"
53 | return list(zip(context_vector, iob_vector))
54 |
55 | def make_vec(sen):
56 | "Make vector from a sentence for feeding to neural net directly"
57 | cw = context_window(sen)
58 | cm = context_matrix(MODEL, cw)
59 | return context_vector(cm)
60 |
61 | def make_list(sentences):
62 | "Make list for neural net learning"
63 | sen, iob = sentences
64 | cw = context_window(sen)
65 | cm = context_matrix(MODEL, cw)
66 | cv = context_vector(cm)
67 | iv = iob_vector(iob)
68 | return make_tuple(cv, iv)
69 |
70 | if __name__ == '__main__':
71 | sen = ['mobifone', 'đầu', 'tư', 'hơn', '2', 'tỉ', 'đồng', 'phát', 'triển', 'mạng']
72 | iob = ['b', 'b', 'i', 'b', 'b', 'b', 'b', 'b', 'i', 'b']
73 | sentences = (sen, iob)
74 | lst = make_list(sentences)
75 | vec = make_vec(sen)
76 |
77 |
78 |
79 |
80 |
--------------------------------------------------------------------------------
/src/vietseg.py:
--------------------------------------------------------------------------------
1 |
2 | ###############################################################################
3 | # Model application
4 | ###############################################################################
5 |
6 | import sys
7 | import os
8 | import network
9 | from vec4net import make_vec
10 | import numpy as np
11 |
12 | # Replace this with your model's result
13 | JSON = '../var/30hidden-30epochs-10batch-0.5eta-5.0lambda-7window-100shape-3run.net'
14 | net = network.load(JSON)
15 |
16 |
17 | def _get_iob(arr):
18 | d = {0: 'i', 1: 'o', 2: 'b'}
19 | n = np.argmax(arr)
20 | return d[n]
21 |
22 | def _classify(token_list):
23 |     "Classify a list of tokens"
24 | result = []
25 | sen_vec = make_vec(token_list)
26 | for x in sen_vec:
27 | result.append(_get_iob(net.feedforward(x)))
28 | return result
29 |
30 | def _make_words(token_list, iob_list):
31 | "Make segmented words from token list and corresponding iob list"
32 | if not iob_list: return
33 | t = token_list[0:1]
34 | tokens = []
35 | for i in range(1, len(iob_list)):
36 | if iob_list[i] == 'i':
37 | t.append(token_list[i])
38 | continue
39 | if iob_list[i] == 'b':
40 | if not t:
41 | t = token_list[i:i+1]
42 | tokens.append(t)
43 | t = []
44 | else:
45 | tokens.append(t)
46 | t = token_list[i:i+1]
47 | continue
48 | if iob_list[i] == 'o':
49 | if t:
50 | tokens.append(t)
51 | t = []
52 | tokens.append(token_list[i:i+1])
53 | if t: tokens.append(t)
54 | return ['_'.join(tok) for tok in tokens]
55 |
56 | def _test():
57 | tok = ['chủ', 'nhân', 'website', 'muốn', 'duy', 'trì', 'domain', '.com', 'phải', 'trả', '6,42', 'usd', '(', 'tăng', '7', '%', ')', ',', 'còn', '.net', 'sẽ', 'tốn', '3,85', 'usd', '(', 'tăng', '10', '%', ')', 'mỗi', 'năm', '.']
58 | iob = ['b', 'i', 'b', 'b', 'b', 'i', 'b', 'b', 'b', 'b', 'b', 'b', 'o', 'b', 'b', 'o', 'o', 'o', 'b', 'b', 'b', 'b', 'b', 'b', 'o', 'b', 'b', 'o', 'o', 'b', 'b', 'o']
59 | return _make_words(tok, iob)
60 |
61 | if __name__ == '__main__':
62 | n_args = len(sys.argv)
63 | if n_args < 3:
64 | print("""
65 | VietSeg - Ver. 0.0.1
66 | ==================================
67 | Usage: python3 vietseg.py <input file> <output file>
68 | * The <input file> must be in UTF-8 encoding.
69 | * The segmented text will be written to the <output file>.
70 | ==================================
71 | http://github.com/manhtai/vietseg
72 | """)
73 | exit(1)
74 | else:
75 | input_file = sys.argv[1]
76 | output_file = sys.argv[2]
77 | if not os.path.isfile(input_file):
78 | print('Input text file "' + input_file+ '" does not exist.')
79 | exit(1)
80 | with open(input_file, 'r') as fi, open(output_file, 'w') as fo:
81 | for line in fi:
82 | in_line = line.split()
83 | if not in_line:
84 | fo.write(line)
85 | continue
86 | token_list = line.lower().split()
87 | iob_list = _classify(token_list)
88 | out_line = _make_words(in_line, iob_list)
89 | fo.write(' '.join(out_line)+'\n')
90 |
91 |
92 |
93 |
94 |
95 |
--------------------------------------------------------------------------------
/src/word2vec.py:
--------------------------------------------------------------------------------
1 |
2 | ###############################################################################
3 | # Training Word2Vec to get word embedding vector
4 | #
5 | # Some code are taken from:
6 | # https://github.com/wendykan/DeepLearningMovies
7 | ###############################################################################
8 |
9 | import re
10 | import logging
11 | from gensim.models import Word2Vec
12 | from bs4 import BeautifulSoup
13 |
14 |
15 | def strip_tags(html):
16 | "Strip html tags"
17 |     return BeautifulSoup(html, 'html.parser').get_text(' ')
18 |
19 | def text_to_token(text):
20 |     "Get the list of tokens for training"
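    # Illustrative example: text_to_token('Trả 6,42 USD')
    # -> ['trả', 'DIGIT', 'DIGITDIGIT', 'usd']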
21 | # Strip HTML
22 | text = strip_tags(text)
23 |     # Keep only word characters
24 |     text = re.sub(r"\W", " ", text)
25 | # Lower and split sentence
26 | token = text.lower().split()
27 |     # Replace each digit-only token with a DIGIT marker of the same length
28 | for i in range(len(token)):
29 | token[i] = len(token[i])*'DIGIT' if token[i].isdigit() else token[i]
30 | return token
31 |
32 | def read_sentences(fp='../dat/VNESEcorpus.txt'):
33 | "Read and split token from text file"
34 | sentences = []
35 | with open(fp, 'r') as f:
36 | for line in f:
37 |             if '|' not in line:  # Skip menu items from some newspapers
38 | sentences.append(text_to_token(line.strip()))
39 | return sentences
40 |
41 | if __name__ == '__main__':
42 | # Read data from files
43 | sentences = read_sentences()
44 | print('Loaded {} sentences!'.format(len(sentences)))
45 | # Import the built-in logging module and configure it so that Word2Vec
46 | # creates nice output messages
47 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
48 | level=logging.INFO)
49 | # Set values for various parameters
50 | num_features = 100 # Word vector dimensionality
51 | min_word_count = 7 # Minimum word count
52 | num_workers = 4 # Number of threads to run in parallel
53 | context = 10 # Context window size
54 | downsampling = 1e-3 # Downsample setting for frequent words
55 | print("Training Word2Vec model...")
56 | # Initialize and train the model
57 | model = Word2Vec(sentences, workers=num_workers, \
58 | size=num_features, min_count = min_word_count, \
59 | window = context, sample = downsampling, seed=1)
60 | model.init_sims(replace=True)
61 | model_name = "../var/{}features_{}context_{}minwords.vec".format(
62 | num_features, context, min_word_count
63 | )
64 | model.save(model_name)
65 |
66 |
67 |
68 |
69 |
--------------------------------------------------------------------------------
/var/100features_10context_7minwords.vec:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/manhtai/vietseg/8dfff92873aa3e25cd17b97c0929b22be1ce98e3/var/100features_10context_7minwords.vec
--------------------------------------------------------------------------------