├── README.md ├── dat └── README.md ├── src ├── input.txt ├── learn.py ├── network.py ├── output.txt ├── performace.py ├── vec4net.py ├── vietseg.py └── word2vec.py └── var ├── 100features_10context_7minwords.vec └── 30hidden-30epochs-10batch-0.5eta-5.0lambda-7window-100shape-3run.net /README.md: -------------------------------------------------------------------------------- 1 | VietSeg 2 | ======= 3 | 4 | A Vietnamese Word Segmentation program using a Neural Network 5 | 6 | ## How to use: 7 | 8 | - Download the source code 9 | 10 | - Change to the `src` folder 11 | 12 | - Put a text file to segment into the `src` folder; here I use `input.txt` 13 | 14 | - Run `python3 vietseg.py input.txt output.txt` (note that this program uses 15 | Python 3 and won't run on Python 2 as-is; you can port it, of course) 16 | 17 | - You now have `output.txt`, which contains the segmented text 18 | 19 | ## Performance: 20 | 21 | Precision, recall, and F1-measure on the same data as described 22 | in [this paper][10]: 23 | 24 | RESULT: 25 | =================== 26 | Run 0: P = 0.9156, R = 0.9294, F = 0.9225 27 | Run 1: P = 0.9015, R = 0.9183, F = 0.9099 28 | Run 2: P = 0.9189, R = 0.9327, F = 0.9258 29 | Run 3: P = 0.9208, R = 0.9339, F = 0.9273 30 | Run 4: P = 0.9166, R = 0.9295, F = 0.9230 31 | =================== 32 | Avg. P = 0.9147, R = 0.9288, F = 0.9217 33 | 34 | 35 | And here is the best performance reported in the [paper][10]: 36 | 37 | P = 94.00, R = 94.45, F = 94.23 38 | 39 | The program uses some random shuffling, so your results may not be exactly the same 40 | as mine. 41 | 42 | ## Train the model yourself: 43 | 44 | - Get the data (see links below) and put it in the `dat` folder 45 | 46 | - Change the working directory to the `src` folder 47 | 48 | - Run `python3 word2vec.py` to build word vectors for the segmenting model 49 | using the Word2Vec library (Word2Vec is itself a neural network) 50 | 51 | - Run `python3 learn.py` to train the segmenting model 52 | 53 | - Run `python3 performace.py` to evaluate the performance of the model 54 | 55 | - Now you can run `python3 vietseg.py input.txt output.txt` as described above 56 | 57 | 58 | ## Data for training the model: 59 | 60 | - Vietnamese corpus: 61 | - File: [VNESEcorpus.txt][2] 62 | - Move the file to the `dat` folder 63 | 64 | - Vietnamese IOB training data: 65 | - File: [trainingdata.tar.gz][3] 66 | - Untar it and put the 10 files (test1.iob2 -> test5.iob2, train1.iob2 -> train5.iob2) into the 67 | `dat` folder, along with `VNESEcorpus.txt` 68 | 69 | ## Future work: 70 | 71 | - Speed up the network 72 | - Use a professional deep learning package (Theano, Caffe, etc.) 73 | - Train the model with a bigger corpus and more training data, like [these][9] 74 | - Handle uppercase characters 75 | - Build a web app 76 | 77 | ## Acknowledgment: 78 | 79 | This program uses some code from [wendykan][4] and [mnielsen][5]. 80 | See the source code for details. 81 | 82 | ## Similar programs: 83 | 84 | - [JVnSegmenter][1]: Java 85 | - [vnTokenizer][6]: Java 86 | - [Dongdu][7]: C++ 87 | - [Roy\_VnTokenizer][11]: Python 88 | - [VLSP][8]: PHP? 
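
## Note on the F-measure:

The F value reported in the Performance section above is the harmonic mean of
precision and recall, which is exactly the `fratio` computed in `src/performace.py`:
`F = 2*P*R / (P + R)`. As a quick sanity check (not part of the pipeline), recomputing
it from the averaged P and R reproduces the reported figure to four decimals:

    >>> P, R = 0.9147, 0.9288
    >>> round(2 * P * R / (P + R), 4)
    0.9217
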
89 | 90 | ## Last words: 91 | 92 | > sophisticated algorithm ≤ simple learning algorithm + good training data 93 | 94 | 95 | [1]: http://jvnsegmenter.sourceforge.net/ 96 | [2]: http://viet.jnlp.org/download-du-lieu-tu-vung-corpus 97 | [3]: http://sourceforge.net/projects/jvnsegmenter/files/jvnsegmenter/JVnSegmenter/trainingdata.tar.gz/download 98 | [4]: https://github.com/wendykan/DeepLearningMovies 99 | [5]: https://github.com/mnielsen/neural-networks-and-deep-learning 100 | [6]: http://mim.hus.vnu.edu.vn/phuonglh/softwares/vnTokenizer 101 | [7]: https://github.com/rockkhuya/DongDu 102 | [8]: http://vlsp.vietlp.org:8080/demo/?page=seg_pos_chunk 103 | [9]: http://vlsp.vietlp.org:8080/demo/?page=resources 104 | [11]: https://github.com/roy-a/Roy_VnTokenizer 105 | [10]: http://aclweb.org/anthology/Y/Y06/Y06-1028.pdf 106 | 107 | 108 | 109 | 110 | -------------------------------------------------------------------------------- /dat/README.md: -------------------------------------------------------------------------------- 1 | [VNESEcorpust.txt](http://viet.jnlp.org/download-du-lieu-tu-vung-corpus) 2 | 3 | [trainingdata.tar.gz](http://sourceforge.net/projects/jvnsegmenter/files/jvnsegmenter/JVnSegmenter/trainingdata.tar.gz/download) 4 | -------------------------------------------------------------------------------- /src/input.txt: -------------------------------------------------------------------------------- 1 | Chủ nhân website muốn duy trì domain .com phải trả 6,42 USD ( tăng 7 % ) , còn .net sẽ tốn 3,85 USD ( tăng 10 % ) mỗi năm . 2 | Hãng điều hành tên miền VeriSign sẽ thu thêm 29 triệu USD/năm từ 62 triệu trang .com và 9,1 triệu trang .net . 3 | Sự thay đổi này sẽ bắt đầu vào ngày 15/10 năm nay và chỉ áp dụng với tên đăng ký mới hoặc gia hạn . 4 | Những tên miền đã đóng phí duy trì 10 năm hay 100 năm vẫn theo giá cũ . 5 | VeriSign cho biết nhu cầu đăng ký tên miền ngày càng tăng . 6 | Hiện , máy chủ DNS của họ nhận 30 tỷ yêu cầu mỗi ngày , gấp 30 lần so với năm 2000 . 7 | VeriSign phải khởi động dự án mang tên Titan để mở rộng dung lượng của hệ thống , đáp ứng được 4 nghìn tỷ thắc mắc/ngày vào năm 2010 . 8 | Tăng dung lượng máy chủ cũng là một biện pháp đối phó với nguy cơ tấn công từ chối dịch vụ . 9 | Nếu để xảy ra tình trạng này , toàn bộ các trang mà VeriSign quản lý sẽ `` chết đứng '' . 10 | T. H. ( theo AP ) T. H. Nhân 11 | -------------------------------------------------------------------------------- /src/learn.py: -------------------------------------------------------------------------------- 1 | 2 | ############################################################################### 3 | # Learn the neural network parameters 4 | ############################################################################### 5 | 6 | import sys 7 | import os 8 | import re 9 | import codecs 10 | 11 | from vec4net import make_list, WINDOW, SHAPE 12 | SIZE = WINDOW*SHAPE 13 | 14 | import network 15 | 16 | def get_files(ft='train'): 17 | "Get train/test file list for further process" 18 | path_data = '../dat/' 19 | tmp_lst = os.listdir(path_data) 20 | file_lst = [] 21 | for item in tmp_lst: 22 | if re.search('\A'+ft+'.*\.iob2', item): 23 | file_lst.append(path_data + item) 24 | return file_lst 25 | 26 | def get_sentences(file_lst, RUN): 27 | "Get sentences and coresponding tags from file list in CV no. 
RUN and RUN+1" 28 | sentences = [] 29 | item = file_lst[RUN] 30 | f = codecs.open(item, encoding='utf-8', mode = 'r', errors = 'ignore') 31 | sent = [] 32 | tags = [] 33 | word = '' 34 | tag = '' 35 | for line in f: 36 | line = line.lower().split() # TODO: Think about lower! 37 | if not line: # End of sentence. 38 | if word: 39 | word = '' # Flush word buffer. 40 | tag = '' 41 | if sent: 42 | sentences.append((sent, tags)) 43 | sent = [] # Flush sentence buffer. 44 | tags = [] 45 | continue 46 | word, tag = line 47 | sent.append(word) 48 | tags.append(tag[0]) # First letter 49 | f.close() 50 | if sent: sentences.append((sent, tags)) 51 | return sentences 52 | 53 | def make_train_test(RUN): 54 | train = [] 55 | test = [] 56 | train_list = get_files('train') 57 | train_sent = get_sentences(train_list, RUN) 58 | for s in train_sent: 59 | train += list(make_list(s)) 60 | test_list = get_files('test') 61 | test_sent = get_sentences(test_list, RUN) 62 | for s in test_sent: 63 | test += list(make_list(s)) 64 | return train, test 65 | 66 | 67 | # Network learning, this is the awesome part 68 | if __name__ == '__main__': 69 | HIDDEN = 30 70 | EPOCHS = 30 71 | MINI_BATCH_SIZE = 10 72 | ETA = 0.5 73 | LAMBDA = 5.0 74 | accuracy = {} 75 | n_args = len(sys.argv) 76 | if n_args == 1: 77 | a = 0 78 | b = 5 79 | elif n_args == 2: 80 | a = int(sys.argv[1]) 81 | b = a+1 82 | else: 83 | a = int(sys.argv[1]) 84 | b = int(sys.argv[2]) + 1 85 | try: 86 | assert 0 <= a < b <= 5 87 | except: 88 | print(''' 89 | Usage: python3 learn.py [from] [to] 90 | =================================== 91 | * [from] and [to] are optional, defaulting to 0 and 4, i.e. run the full 92 | 5-fold cross-validation from 0 to 4 93 | * If specified, [from] and [to] must satisfy the inequality 94 | 0 <= [from] <= [to] <= 4 95 | ''') 96 | sys.exit(1) 97 | # Run cross-validation folds a to b-1 and take the average accuracy 98 | ranges = range(a, b) 99 | for RUN in ranges: 100 | print('Run cross-validation #{}'.format(RUN)) 101 | net = network.Network([SIZE, HIDDEN, 3]) 102 | training_data, test_data = make_train_test(RUN) 103 | # training_data = training_data[:1000]; test_data = test_data[:100] 104 | acc = net.SGD(training_data, EPOCHS, MINI_BATCH_SIZE, ETA, 105 | lmbda=LAMBDA, 106 | evaluation_data=test_data, 107 | monitor_evaluation_accuracy=True) 108 | # Take the last (not best) accuracy and length of the test set 109 | accuracy[RUN] = (acc[-1], len(test_data)) 110 | # Save the model for later use 111 | net.save('../var/{}hidden-{}epochs-{}batch-{}eta-{}lambda-{}window-{}shape-{}run.net'.\ 112 | format(HIDDEN, EPOCHS, MINI_BATCH_SIZE, ETA, LAMBDA, WINDOW, SHAPE, RUN)) 113 | # Display the result 114 | print("===================") 115 | print("FINAL RESULTS:") 116 | print("===================") 117 | for i in ranges: 118 | print("Accuracy on CV #{0} is {1:.2f} %"\ 119 | .format(i, accuracy[i][0]/accuracy[i][1]*100)) 120 | print("===================") 121 | a = sum([accuracy[i][0] for i in ranges]) 122 | b = sum([accuracy[i][1] for i in ranges]) 123 | print("Average accuracy: {:.2f} %".format(a/b*100)) 124 | 125 | 126 | 127 | 128 | -------------------------------------------------------------------------------- /src/network.py: -------------------------------------------------------------------------------- 1 | 2 | ############################################################################### 3 | # Simple neural network model 4 | # Modified version of 'network2.py' in: 5 | # https://github.com/mnielsen/neural-networks-and-deep-learning 6 | 
############################################################################### 7 | 8 | # TODO: 9 | # 1) Implement matrix-based computation for mini batch 10 | # 2) Adapt a version for on-line learning for using iterators object in Python3 11 | # When using iterators objects it would permit to use larger Word2Vec dimention 12 | # of word as well as input2vec context window. Maybe the accuracy will increase 13 | 14 | import json 15 | import random 16 | import sys 17 | 18 | import numpy as np 19 | from statistics import mean 20 | 21 | #### Define the cross-entropy cost functions 22 | class CrossEntropyCost: 23 | 24 | @staticmethod 25 | def fn(a, y): 26 | """Return the cost associated with an output ``a`` and desired output 27 | ``y``. Note that np.nan_to_num is used to ensure numerical 28 | stability. In particular, if both ``a`` and ``y`` have a 1.0 29 | in the same slot, then the expression (1-y)*np.log(1-a) 30 | returns nan. The np.nan_to_num ensures that that is converted 31 | to the correct value (0.0). 32 | 33 | """ 34 | return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a))) 35 | 36 | @staticmethod 37 | def delta(a, y): 38 | """Return the error delta from the output layer. 39 | """ 40 | return (a-y) 41 | 42 | 43 | #### Main Network class 44 | class Network(): 45 | 46 | def __init__(self, sizes, cost=CrossEntropyCost): 47 | """The list ``sizes`` contains the number of neurons in the respective 48 | layers of the network. For example, if the list was [2, 3, 1] 49 | then it would be a three-layer network, with the first layer 50 | containing 2 neurons, the second layer 3 neurons, and the 51 | third layer 1 neuron. The biases and weights for the network 52 | are initialized randomly, using 53 | ``self.default_weight_initializer`` (see docstring for that 54 | method). 55 | 56 | """ 57 | self.num_layers = len(sizes) 58 | self.sizes = sizes 59 | self.default_weight_initializer() 60 | self.cost=cost 61 | 62 | def default_weight_initializer(self): 63 | """Initialize each weight using a Gaussian distribution with mean 0 64 | and standard deviation 1 over the square root of the number of 65 | weights connecting to the same neuron. Initialize the biases 66 | using a Gaussian distribution with mean 0 and standard 67 | deviation 1. 68 | 69 | Note that the first layer is assumed to be an input layer, and 70 | by convention we won't set any biases for those neurons, since 71 | biases are only ever used in computing the outputs from later 72 | layers. 73 | 74 | """ 75 | self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]] 76 | self.weights = [np.random.randn(y, x)/np.sqrt(x) 77 | for x, y in zip(self.sizes[:-1], self.sizes[1:])] 78 | 79 | def feedforward(self, a): 80 | """Return the output of the network if ``a`` is input.""" 81 | for b, w in zip(self.biases, self.weights): 82 | a = sigmoid_vec(np.dot(w, a)+b) 83 | return a 84 | 85 | def SGD(self, training_data, epochs, mini_batch_size, eta, 86 | lmbda = 0.0, 87 | evaluation_data=None, 88 | monitor_evaluation_cost=False, 89 | monitor_evaluation_accuracy=False): 90 | """Train the neural network using mini-batch stochastic gradient 91 | descent. The ``training_data`` is a list of tuples ``(x, y)`` 92 | representing the training inputs and the desired outputs. The 93 | other non-optional parameters are self-explanatory, as is the 94 | regularization parameter ``lmbda``. The method also accepts 95 | ``evaluation_data``, usually either the validation or test data. 96 | The method returns the list of accuracies on the evaluation data. 
97 | All values are evaluated at the end of each training epoch. 98 | """ 99 | if evaluation_data: n_data = len(evaluation_data) 100 | n = len(training_data) 101 | evaluation_accuracy = [] 102 | for j in range(epochs): 103 | random.shuffle(training_data) 104 | mini_batches = [ 105 | training_data[k:k+mini_batch_size] 106 | for k in range(0, n, mini_batch_size)] 107 | for mini_batch in mini_batches: 108 | self.update_mini_batch( 109 | mini_batch, eta, lmbda, len(training_data)) 110 | print("Epoch %s training complete" % j) 111 | if monitor_evaluation_accuracy: 112 | accuracy = self.accuracy(evaluation_data) 113 | print("Accuracy on evaluation data: {0} / {1} <=> {2:.2f} %".\ 114 | format(accuracy, n_data, accuracy/n_data*100) 115 | ) 116 | # Decrease eta by half when having no impoving in 3 loops 117 | if len(evaluation_accuracy) > 3 and \ 118 | accuracy <= mean(evaluation_accuracy[-3:]): 119 | eta /= 2 120 | print("ETA now is {}".format(eta)) 121 | evaluation_accuracy.append(accuracy) 122 | return evaluation_accuracy 123 | 124 | def update_mini_batch(self, mini_batch, eta, lmbda, n): 125 | """Update the network's weights and biases by applying gradient 126 | descent using backpropagation to a single mini batch. The 127 | ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the 128 | learning rate, ``lmbda`` is the regularization parameter, and 129 | ``n`` is the total size of the training data set. 130 | """ 131 | nabla_b = [np.zeros(b.shape) for b in self.biases] 132 | nabla_w = [np.zeros(w.shape) for w in self.weights] 133 | for x, y in mini_batch: 134 | delta_nabla_b, delta_nabla_w = self.backprop(x, y) 135 | nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)] 136 | nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)] 137 | self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw 138 | for w, nw in zip(self.weights, nabla_w)] 139 | self.biases = [b-(eta/len(mini_batch))*nb 140 | for b, nb in zip(self.biases, nabla_b)] 141 | 142 | def backprop(self, x, y): 143 | """Return a tuple ``(nabla_b, nabla_w)`` representing the 144 | gradient for the cost function C_x. ``nabla_b`` and 145 | ``nabla_w`` are layer-by-layer lists of numpy arrays, similar 146 | to ``self.biases`` and ``self.weights``.""" 147 | nabla_b = [np.zeros(b.shape) for b in self.biases] 148 | nabla_w = [np.zeros(w.shape) for w in self.weights] 149 | # feedforward 150 | activation = x 151 | activations = [x] # list to store all the activations, layer by layer 152 | zs = [] # list to store all the z vectors, layer by layer 153 | for b, w in zip(self.biases, self.weights): 154 | z = np.dot(w, activation)+b 155 | zs.append(z) 156 | activation = sigmoid_vec(z) 157 | activations.append(activation) 158 | # backward pass 159 | delta = (self.cost).delta(activations[-1], y) 160 | nabla_b[-1] = delta 161 | nabla_w[-1] = np.dot(delta, activations[-2].transpose()) 162 | # l = 1 means the last layer of neurons, l = 2 is the 163 | # second-last layer, and so on. 164 | for l in range(2, self.num_layers): 165 | z = zs[-l] 166 | spv = sigmoid_prime_vec(z) 167 | delta = np.dot(self.weights[-l+1].transpose(), delta) * spv 168 | nabla_b[-l] = delta 169 | nabla_w[-l] = np.dot(delta, activations[-l-1].transpose()) 170 | return (nabla_b, nabla_w) 171 | 172 | def accuracy(self, data): 173 | """Return the number of inputs in ``data`` for which the neural 174 | network outputs the correct result. 
The neural network's 175 | output is assumed to be the index of whichever neuron in the 176 | final layer has the highest activation. 177 | """ 178 | results = [(np.argmax(self.feedforward(x)), np.argmax(y)) 179 | for (x, y) in data] 180 | return sum((x == y).all() for (x, y) in results) 181 | 182 | def save(self, filename): 183 | """Save the neural network to the file ``filename``.""" 184 | data = {"sizes": self.sizes, 185 | "weights": [w.tolist() for w in self.weights], 186 | "biases": [b.tolist() for b in self.biases], 187 | "cost": str(self.cost.__name__)} 188 | f = open(filename, "w") 189 | json.dump(data, f) 190 | f.close() 191 | 192 | #### Loading a Network 193 | def load(filename): 194 | """Load a neural network from the file ``filename``. Returns an 195 | instance of Network. 196 | """ 197 | f = open(filename, "r") 198 | data = json.load(f) 199 | f.close() 200 | cost = getattr(sys.modules[__name__], data["cost"]) 201 | net = Network(data["sizes"], cost=cost) 202 | net.weights = [np.array(w) for w in data["weights"]] 203 | net.biases = [np.array(b) for b in data["biases"]] 204 | return net 205 | 206 | #### Miscellaneous functions 207 | def vectorized_result(j): 208 | """Return a 3-dimensional unit vector with a 1.0 in the j'th position 209 | and zeroes elsewhere. This is used to convert a iob code ('i', 'o', 'b') 210 | into a corresponding desired output from the neural network. 211 | """ 212 | e = np.zeros((3, 1)) 213 | e[j] = 1.0 214 | return e 215 | 216 | def sigmoid(z): 217 | """The sigmoid function.""" 218 | return 1.0/(1.0+np.exp(-z)) 219 | 220 | sigmoid_vec = np.vectorize(sigmoid) 221 | 222 | def sigmoid_prime(z): 223 | """Derivative of the sigmoid function.""" 224 | return sigmoid(z)*(1-sigmoid(z)) 225 | 226 | sigmoid_prime_vec = np.vectorize(sigmoid_prime) 227 | 228 | 229 | 230 | -------------------------------------------------------------------------------- /src/output.txt: -------------------------------------------------------------------------------- 1 | Chủ_nhân website muốn duy_trì domain .com phải trả 6,42 USD ( tăng 7 % ) , còn .net sẽ tốn 3,85 USD ( tăng 10 % ) mỗi năm . 2 | Hãng điều_hành tên miền VeriSign sẽ thu thêm 29 triệu USD/năm từ 62 triệu trang .com và 9,1 triệu trang .net . 3 | Sự thay_đổi này sẽ bắt_đầu vào ngày 15/10 năm_nay và chỉ áp_dụng với tên đăng_ký mới hoặc gia_hạn . 4 | Những tên miền đã đóng_phí duy_trì 10 năm hay 100 năm vẫn theo giá cũ . 5 | VeriSign cho_biết nhu_cầu đăng_ký tên miền ngày càng tăng . 6 | Hiện , máy chủ DNS của họ nhận 30 tỷ yêu_cầu mỗi ngày , gấp 30 lần so với năm 2000 . 7 | VeriSign phải khởi_động dự_án mang tên Titan để mở_rộng dung_lượng của hệ_thống , đáp ứng được 4 nghìn tỷ thắc mắc/ngày vào năm 2010 . 8 | Tăng dung_lượng máy_chủ cũng là một biện_pháp đối_phó với nguy_cơ tấn_công từ chối dịch_vụ . 9 | Nếu để xảy_ra tình_trạng này , toàn bộ các trang mà VeriSign quản_lý sẽ `` chết đứng '' . 10 | T. H. ( theo AP ) T. H. 
Nhân 11 | -------------------------------------------------------------------------------- /src/performace.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from vietseg import _make_words, _get_iob 3 | from learn import get_files, get_sentences 4 | from vec4net import make_vec 5 | import network 6 | 7 | 8 | 9 | def _make_hyp_ref(RUN): 10 | "Make human sentences and machine sentences" 11 | JSON = '../var/30hidden-30epochs-10batch-0.5eta-5.0lambda-7window-100shape-{}run.net'.format(RUN) 12 | net = network.load(JSON) 13 | 14 | def classify(token_list): 15 | "Classify a list of token to compare with human's segmentation" 16 | result = [] 17 | sen_vec = make_vec(token_list) 18 | for x in sen_vec: 19 | result.append(_get_iob(net.feedforward(x))) 20 | return result 21 | 22 | test_list = get_files('test') 23 | test_sent = get_sentences(test_list, RUN) 24 | sents_ref = [_make_words(sent, tags) for sent, tags in test_sent] 25 | sents_hyp = [_make_words(sent, classify(sent)) for sent, _ in test_sent] 26 | return sents_ref, sents_hyp 27 | 28 | def _run_test(RUN): 29 | "Test performace on CV #RUN" 30 | # change RUN to constant for evaluate 1 specific model to all test sets 31 | # sents_ref, sents_hyp = _make_hyp_ref(3) 32 | sents_ref, sents_hyp = _make_hyp_ref(RUN) 33 | n_hyp = 0 34 | n_corr = 0 35 | n_ref = 0 36 | nSents = len(sents_ref) 37 | n_hyps = [] 38 | n_refs = [] 39 | n_corrs = [] 40 | for n in range(nSents): 41 | sent1 = sents_ref[n] # Human's segmenting 42 | sent2 = sents_hyp[n] # Machine's segmenting 43 | 44 | n_ref_ = len(sent1) 45 | n_hyp_ = len(sent2) 46 | 47 | # Finding optimal alignment and consequently no. of correct words 48 | # by dynamic programming. Longest Common Subsequence problem. 49 | l = np.zeros([n_ref_+1, n_hyp_+1]) 50 | 51 | for row in range(1,l.shape[0]): 52 | for col in range(1,l.shape[1]): 53 | if sent1[row-1] == sent2[col-1]: 54 | l[row][col] = l[row-1][col-1] + 1 55 | else: 56 | l[row][col] = max([l[row][col-1], l[row-1][col]]) 57 | n_corr_ = l[n_ref_][n_hyp_] 58 | 59 | n_hyp += n_hyp_ 60 | n_ref += n_ref_ 61 | n_corr += n_corr_ 62 | n_hyps.append(n_hyp_) 63 | n_corrs.append(n_corr_) 64 | n_refs.append(n_ref_) 65 | 66 | prec = n_corr / n_hyp 67 | recall = n_corr / n_ref 68 | fratio = 2*prec*recall / (prec + recall) 69 | 70 | return prec, recall, fratio 71 | 72 | def _main(nRuns): 73 | "Run the test" 74 | P_ = R_ = F_ = 0 75 | print("RESULT:") 76 | print("===================") 77 | for RUN in range(nRuns): 78 | prec, recall, fratio = _run_test(RUN) 79 | print("Run " + "%.0f" % RUN + ": P = " + "%.4f" % prec + ", R = " + \ 80 | "%.4f" % recall + ", F = " + "%.4f" % fratio) 81 | 82 | P_ += prec 83 | R_ += recall 84 | F_ += fratio 85 | 86 | P_ /= nRuns 87 | R_ /= nRuns 88 | F_ /= nRuns 89 | 90 | print("===================") 91 | print("Avg. 
P = "+ "%.4f" % P_ + ", R = " + "%.4f" % R_ + ", F = " +\ 92 | "%.4f" % F_) 93 | 94 | if __name__ == '__main__': 95 | _main(5) 96 | 97 | 98 | -------------------------------------------------------------------------------- /src/vec4net.py: -------------------------------------------------------------------------------- 1 | 2 | ############################################################################### 3 | # Convert word to vector 4 | ############################################################################### 5 | 6 | import numpy as np 7 | from gensim.models import Word2Vec 8 | 9 | # Load Word2Vec model from file, use your model's result or use mine 10 | MODEL = Word2Vec.load('../var/100features_10context_7minwords.vec') 11 | WINDOW = 7 12 | SHAPE = MODEL.syn0.shape[1] 13 | 14 | def word2index(model, word): 15 | "Convert word to index using result from Word2Vec learning" 16 | try: 17 | if word == -1: 18 | result = np.zeros((SHAPE,)) 19 | else: 20 | result = model[word] 21 | except: 22 | np.random.seed(len(word)) 23 | result = 0.2 * np.random.uniform(-1, 1, SHAPE) 24 | return result 25 | 26 | def context_window(l): 27 | "Make context window for a given word, 'WINDOW' must be odd" 28 | l = list(l) 29 | lpadded = WINDOW//2 * [-1] + l + WINDOW//2 * [-1] 30 | out = [ lpadded[i:i+WINDOW] for i in range(len(l)) ] 31 | assert len(out) == len(l) 32 | return out 33 | 34 | def context_matrix(model, conwin): 35 | "Return a list contain map element for each context window of 1 word" 36 | return [map(lambda x: word2index(model, x), win) for win in conwin] 37 | 38 | def context_vector(cm): 39 | "Convert context matrix to vector" 40 | return [np.squeeze(np.asarray(list(x))).reshape((WINDOW*SHAPE,1)) for x in cm] 41 | 42 | def iob_map(iob): 43 | "Convert i/o/b to vector" 44 | d = {'i': [1,0,0], 'o': [0,1,0], 'b': [0,0,1]} 45 | return np.asarray(d[iob]).reshape((3,1)) 46 | 47 | def iob_vector(iob): 48 | "Make a vector for iob list" 49 | return map(iob_map, iob) 50 | 51 | def make_tuple(context_vector, iob_vector): 52 | "Pairing sentences and iob tag" 53 | return list(zip(context_vector, iob_vector)) 54 | 55 | def make_vec(sen): 56 | "Make vector from a sentence for feeding to neural net directly" 57 | cw = context_window(sen) 58 | cm = context_matrix(MODEL, cw) 59 | return context_vector(cm) 60 | 61 | def make_list(sentences): 62 | "Make list for neural net learning" 63 | sen, iob = sentences 64 | cw = context_window(sen) 65 | cm = context_matrix(MODEL, cw) 66 | cv = context_vector(cm) 67 | iv = iob_vector(iob) 68 | return make_tuple(cv, iv) 69 | 70 | if __name__ == '__main__': 71 | sen = ['mobifone', 'đầu', 'tư', 'hơn', '2', 'tỉ', 'đồng', 'phát', 'triển', 'mạng'] 72 | iob = ['b', 'b', 'i', 'b', 'b', 'b', 'b', 'b', 'i', 'b'] 73 | sentences = (sen, iob) 74 | lst = make_list(sentences) 75 | vec = make_vec(sen) 76 | 77 | 78 | 79 | 80 | -------------------------------------------------------------------------------- /src/vietseg.py: -------------------------------------------------------------------------------- 1 | 2 | ############################################################################### 3 | # Model application 4 | ############################################################################### 5 | 6 | import sys 7 | import os 8 | import network 9 | from vec4net import make_vec 10 | import numpy as np 11 | 12 | # Replace this with your model's result 13 | JSON = '../var/30hidden-30epochs-10batch-0.5eta-5.0lambda-7window-100shape-3run.net' 14 | net = network.load(JSON) 15 | 16 | 17 | def 
_get_iob(arr): 18 | d = {0: 'i', 1: 'o', 2: 'b'} 19 | n = np.argmax(arr) 20 | return d[n] 21 | 22 | def _classify(token_list): 23 | "Classify a list of token" 24 | result = [] 25 | sen_vec = make_vec(token_list) 26 | for x in sen_vec: 27 | result.append(_get_iob(net.feedforward(x))) 28 | return result 29 | 30 | def _make_words(token_list, iob_list): 31 | "Make segmented words from token list and corresponding iob list" 32 | if not iob_list: return 33 | t = token_list[0:1] 34 | tokens = [] 35 | for i in range(1, len(iob_list)): 36 | if iob_list[i] == 'i': 37 | t.append(token_list[i]) 38 | continue 39 | if iob_list[i] == 'b': 40 | if not t: 41 | t = token_list[i:i+1] 42 | tokens.append(t) 43 | t = [] 44 | else: 45 | tokens.append(t) 46 | t = token_list[i:i+1] 47 | continue 48 | if iob_list[i] == 'o': 49 | if t: 50 | tokens.append(t) 51 | t = [] 52 | tokens.append(token_list[i:i+1]) 53 | if t: tokens.append(t) 54 | return ['_'.join(tok) for tok in tokens] 55 | 56 | def _test(): 57 | tok = ['chủ', 'nhân', 'website', 'muốn', 'duy', 'trì', 'domain', '.com', 'phải', 'trả', '6,42', 'usd', '(', 'tăng', '7', '%', ')', ',', 'còn', '.net', 'sẽ', 'tốn', '3,85', 'usd', '(', 'tăng', '10', '%', ')', 'mỗi', 'năm', '.'] 58 | iob = ['b', 'i', 'b', 'b', 'b', 'i', 'b', 'b', 'b', 'b', 'b', 'b', 'o', 'b', 'b', 'o', 'o', 'o', 'b', 'b', 'b', 'b', 'b', 'b', 'o', 'b', 'b', 'o', 'o', 'b', 'b', 'o'] 59 | return _make_words(tok, iob) 60 | 61 | if __name__ == '__main__': 62 | n_args = len(sys.argv) 63 | if n_args < 3: 64 | print(""" 65 | VietSeg - Ver. 0.0.1 66 | ================================== 67 | Usage: python3 vietseg.py 68 | * The must be in UTF-8 encoding. 69 | * The segmented text will be written in the . 70 | ================================== 71 | http://github.com/manhtai/vietseg 72 | """) 73 | exit(1) 74 | else: 75 | input_file = sys.argv[1] 76 | output_file = sys.argv[2] 77 | if not os.path.isfile(input_file): 78 | print('Input text file "' + input_file+ '" does not exist.') 79 | exit(1) 80 | with open(input_file, 'r') as fi, open(output_file, 'w') as fo: 81 | for line in fi: 82 | in_line = line.split() 83 | if not in_line: 84 | fo.write(line) 85 | continue 86 | token_list = line.lower().split() 87 | iob_list = _classify(token_list) 88 | out_line = _make_words(in_line, iob_list) 89 | fo.write(' '.join(out_line)+'\n') 90 | 91 | 92 | 93 | 94 | 95 | -------------------------------------------------------------------------------- /src/word2vec.py: -------------------------------------------------------------------------------- 1 | 2 | ############################################################################### 3 | # Training Word2Vec to get word embedding vector 4 | # 5 | # Some code are taken from: 6 | # https://github.com/wendykan/DeepLearningMovies 7 | ############################################################################### 8 | 9 | import re 10 | import logging 11 | from gensim.models import Word2Vec 12 | from bs4 import BeautifulSoup 13 | 14 | 15 | def strip_tags(html): 16 | "Strip html tags" 17 | return BeautifulSoup(html).get_text(' ') 18 | 19 | def text_to_token(text): 20 | "Get list of token for training" 21 | # Strip HTML 22 | text = strip_tags(text) 23 | # Keep only word 24 | text = re.sub("\W", " ", text) 25 | # Lower and split sentence 26 | token = text.lower().split() 27 | # Don't remember the number 28 | for i in range(len(token)): 29 | token[i] = len(token[i])*'DIGIT' if token[i].isdigit() else token[i] 30 | return token 31 | 32 | def read_sentences(fp='../dat/VNESEcorpus.txt'): 33 | 
"Read and split token from text file" 34 | sentences = [] 35 | with open(fp, 'r') as f: 36 | for line in f: 37 | if '|' not in line: # Remove menu items in some newspaper 38 | sentences.append(text_to_token(line.strip())) 39 | return sentences 40 | 41 | if __name__ == '__main__': 42 | # Read data from files 43 | sentences = read_sentences() 44 | print('Loaded {} sentences!'.format(len(sentences))) 45 | # Import the built-in logging module and configure it so that Word2Vec 46 | # creates nice output messages 47 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\ 48 | level=logging.INFO) 49 | # Set values for various parameters 50 | num_features = 100 # Word vector dimensionality 51 | min_word_count = 7 # Minimum word count 52 | num_workers = 4 # Number of threads to run in parallel 53 | context = 10 # Context window size 54 | downsampling = 1e-3 # Downsample setting for frequent words 55 | print("Training Word2Vec model...") 56 | # Initialize and train the model 57 | model = Word2Vec(sentences, workers=num_workers, \ 58 | size=num_features, min_count = min_word_count, \ 59 | window = context, sample = downsampling, seed=1) 60 | model.init_sims(replace=True) 61 | model_name = "../var/{}features_{}context_{}minwords.vec".format( 62 | num_features, context, min_word_count 63 | ) 64 | model.save(model_name) 65 | 66 | 67 | 68 | 69 | -------------------------------------------------------------------------------- /var/100features_10context_7minwords.vec: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manhtai/vietseg/8dfff92873aa3e25cd17b97c0929b22be1ce98e3/var/100features_10context_7minwords.vec --------------------------------------------------------------------------------