├── .gitignore
├── LICENSE
├── README.md
├── eval.py
└── parse.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | # Byte-compiled / optimized / DLL files
 2 | __pycache__/
 3 | *.py[cod]
 4 | *$py.class
 5 | 
 6 | # C extensions
 7 | *.so
 8 | 
 9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | 
27 | # PyInstaller
28 | #  Usually these files are written by a python script from a template
29 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
30 | *.manifest
31 | *.spec
32 | 
33 | # Installer logs
34 | pip-log.txt
35 | pip-delete-this-directory.txt
36 | 
37 | # Unit test / coverage reports
38 | htmlcov/
39 | .tox/
40 | .coverage
41 | .coverage.*
42 | .cache
43 | nosetests.xml
44 | coverage.xml
45 | *,cover
46 | .hypothesis/
47 | 
48 | # Translations
49 | *.mo
50 | *.pot
51 | 
52 | # Django stuff:
53 | *.log
54 | local_settings.py
55 | 
56 | # Flask stuff:
57 | instance/
58 | .webassets-cache
59 | 
60 | # Scrapy stuff:
61 | .scrapy
62 | 
63 | # Sphinx documentation
64 | docs/_build/
65 | 
66 | # PyBuilder
67 | target/
68 | 
69 | # IPython Notebook
70 | .ipynb_checkpoints
71 | 
72 | # pyenv
73 | .python-version
74 | 
75 | # celery beat schedule file
76 | celerybeat-schedule
77 | 
78 | # dotenv
79 | .env
80 | 
81 | # virtualenv
82 | venv/
83 | ENV/
84 | 
85 | # Spyder project settings
86 | .spyderproject
87 | 
88 | # Rope project settings
89 | .ropeproject
90 | 
91 | wikiextractor
92 | fastText
93 | source
94 | corpus
95 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2016 Takahiro Kubo
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # fastText Japanese Tutorial
 2 | 
 3 | This is a tutorial for training [fastText](https://github.com/facebookresearch/fastText), released by Facebook, on Japanese text.
 4 | 
 5 | ## Setup
 6 | 
 7 | Set up the following environment beforehand. On Windows, installing MeCab is the hard part, so on Windows 10 we recommend using bash on Windows and working in an Ubuntu environment.
 8 | 
 9 | * Install Python (3.5.2 or later)
10 | * Install [MeCab](http://taku910.github.io/mecab/)
11 | * Download (`git clone`) [WikiExtractor](https://github.com/attardi/wikiextractor)
12 | * Download (`git clone`) [fastText](https://github.com/facebookresearch/fastText)
13 | 
14 | ## 1. Prepare the documents for training
15 | 
16 | Download a dump of Japanese Wikipedia and store it in the `source` folder.
17 | 
18 | 
19 | ## 2. Extract the text
20 | 
21 | Use WikiExtractor to extract the text from the Wikipedia data stored in `source` and write it to the `corpus` folder.
22 | 
23 | ```
24 | python wikiextractor/WikiExtractor.py -b 500M -o corpus source/jawiki-xxxxxxxx-pages-articles-multistream.xml.bz2
25 | ```
26 | 
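27 | * `-b` sets the size of each extracted file; the text files are created in chunks of this size. Adjust it as needed.
28 | 
29 | If you use the Wikipedia abstract file instead, `parse.py` implements extraction for abstracts, so use that. The result is written to `corpus/abstracts.txt`.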
30 | 
31 | ```
32 | python parse.py jawiki-xxxxxxxxx-abstract4.xml  --extract
33 | ```
34 | 
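The abstract extraction above can be sketched on a small in-memory sample. This is a minimal illustration, not the script itself: the `SAMPLE` document is hypothetical and only assumes the `<doc>`/`<title>`/`<abstract>` layout that `parse.py` reads, and `extract_abstracts` is a name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the abstract dump that parse.py expects.
SAMPLE = """<feed>
  <doc>
    <title>Wikipedia: 朝</title>
    <abstract>朝は、夜明けから正午までの時間帯である。</abstract>
  </doc>
  <doc>
    <title>Wikipedia: 表</title>
    <abstract>| colspan=2</abstract>
  </doc>
  <doc>
    <title>Wikipedia: 空</title>
    <abstract></abstract>
  </doc>
</feed>"""

# Prefixes that mark leftover wiki markup rather than prose (same tuple as parse.py).
SKIP_PREFIXES = ("|", "thumb", "{", "・", ")", "(", "link")


def extract_abstracts(xml_text):
    """Return (title, abstract) pairs, skipping empty or markup-only abstracts."""
    root = ET.fromstring(xml_text)
    pairs = []
    for doc in root.findall("./doc"):
        abstract = doc.findtext("./abstract") or ""
        title = (doc.findtext("./title") or "").replace("Wikipedia: ", "")
        if not abstract or not title or abstract.startswith(SKIP_PREFIXES):
            continue
        pairs.append((title, abstract))
    return pairs


pairs = extract_abstracts(SAMPLE)
for title, abstract in pairs:
    print(title)
    print(abstract)
```

Only the first entry survives: the second starts with table markup and the third has no abstract text.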
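35 | Finally, merge the extracted text files into one. You can do this with a shell command, but `parse.py` also provides a concatenation script in case you are not sure which command to use.
36 | 
37 | ```
38 | python parse.py (target directory) --concat (common prefix of the target file names, e.g. wiki_)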
39 | ```
40 | 
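As a rough sketch of what the concatenation step does, the snippet below walks a directory tree, collects the files whose names start with a prefix, and writes them into one file. The helper name `concat_files` and the demo layout (`AA/wiki_00`, assumed to resemble WikiExtractor output) are illustrative assumptions; sorting the paths is also an addition that `parse.py` does not perform.

```python
import os
import tempfile


def concat_files(directory, prefix, out_path):
    """Concatenate every file under `directory` whose name starts with `prefix`."""
    paths = []
    for current, _dirs, files in os.walk(directory):
        for name in files:
            if name.startswith(prefix):
                paths.append(os.path.join(current, name))
    paths.sort()  # assumption: a stable order is wanted; parse.py does not sort
    with open(out_path, mode="w", encoding="utf-8") as out:
        for p in paths:
            with open(p, mode="r", encoding="utf-8") as f:
                for line in f:
                    out.write(line)
    return paths


# Tiny demo in a temporary directory (hypothetical file names and contents).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "corpus", "AA"))
    samples = [("wiki_00", "一行目\n"), ("wiki_01", "二行目\n"), ("notes.txt", "skip\n")]
    for name, text in samples:
        with open(os.path.join(tmp, "corpus", "AA", name), "w", encoding="utf-8") as f:
            f.write(text)
    out_path = os.path.join(tmp, "wiki_all.txt")
    merged = concat_files(os.path.join(tmp, "corpus"), "wiki_", out_path)
    with open(out_path, encoding="utf-8") as f:
        result = f.read()
```

The output file is written outside the scanned directory here so that it is not picked up as an input on a second run.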
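41 | The text data for training is now ready.
42 | 
43 | ## 3. Split the text into words (wakati-gaki)
44 | 
45 | Separate the words in the text data with spaces, as in English (this is called wakati-gaki). MeCab is used for this step.
46 | 
47 | ```
48 | mecab (target text file) -O wakati -o (output file)
49 | ```
50 | 
51 | You now have a file in which the text is separated word by word.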
52 | 
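53 | ## 4. Train with fastText
54 | 
55 | Now that you have a file separated into words, just as you would for English, all that remains is to run fastText. Clone the [fastText](https://github.com/facebookresearch/fastText) repository and build it with make as described in its documentation.
56 | 
57 | There are various parameters, but according to the paper, the size of the numerical word representation (the vector dimension) depends on the dataset as follows. (The paper does not say what unit a token is, but it is probably a word count.)
58 | 
59 | * small (50M tokens): 100
60 | * medium (200M tokens): 200
61 | * full: 300
62 | 
63 | In short, use a smaller dimension for a smaller dataset. The full Wikipedia corresponds to full, i.e. 300 dimensions, so run the following.
64 | 
65 | ```
66 | ./fasttext skipgram -input (wakati file) -output model -dim 300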
67 | ```
68 | 
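The size guideline above can be turned into a small helper that counts the whitespace-separated tokens in a wakati file and suggests a dimension. This is only a sketch: treating the 50M and 200M figures as upper bounds is an assumption, and `count_tokens` and `suggest_dim` are names introduced here, not part of fastText or this repository.

```python
import io

SMALL = 50 * 10**6    # "small" dataset size from the list above
MEDIUM = 200 * 10**6  # "medium" dataset size from the list above


def count_tokens(lines):
    """Count whitespace-separated tokens in an iterable of wakati lines."""
    return sum(len(line.split()) for line in lines)


def suggest_dim(n_tokens):
    """Map a token count to a vector dimension (the cut-offs are an assumption)."""
    if n_tokens <= SMALL:
        return 100
    if n_tokens <= MEDIUM:
        return 200
    return 300


# Hypothetical two-line wakati sample; a real file would be opened with
# open(path, encoding="utf-8") and passed in the same way.
sample = io.StringIO("朝 は 早い\n夜 は 遅い\n")
n = count_tokens(sample)
dim = suggest_dim(n)
print(n, dim)
```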
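69 | To train with parameters equivalent to Word2Vec training, run the following (parameter settings are described in detail [here](http://aial.shiroyagi.co.jp/2015/12/word2vec/)).
70 | 
71 | ```
72 | ./fasttext skipgram -input (wakati file) -output model -dim 200 -neg 25 -ws 8
73 | ```
74 | 
75 | (As raised in an issue, the results reportedly [vary quite a bit with the parameters](https://github.com/facebookresearch/fastText/issues/5), such as epoch and mincount.)
76 | 
77 | When training finishes, two files, `.bin` and `.vec`, are created under the name given to `-output`. These files hold the learned distributed representations.
78 | However, with something like the full Wikipedia, the data is so large that loading the `model` file often fails with a MemoryError, and encoding problems can also occur (in fact, they did). In that case, first build a word dictionary (one that converts words to IDs, e.g. 「朝」 -> 11) and convert the text file into a sequence of word IDs.
79 | `parse.py` implements a tokenize function for this, so use it as needed. Pass the vocabulary size as the value of `--tokenize`.
80 | 
81 | ```
82 | python parse.py (target text file) --tokenize (vocabulary size)
83 | ```
84 | 
85 | This produces a dictionary file, `.vocab`, and a text file, `_tokenized`, in which the words are replaced by IDs.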
86 | 
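The word-to-ID conversion described above can be sketched without MeCab on text that is already separated by spaces. The helper names `build_vocab` and `encode` are illustrative assumptions; ID 0 is reserved for unknown words, as in `parse.py`, and the vocabulary size counts that reserved slot.

```python
from collections import Counter

UNKNOWN = 0  # ID reserved for out-of-vocabulary words, as in parse.py


def build_vocab(lines, vocab_size):
    """Assign IDs 1..vocab_size-1 to the most frequent words."""
    counts = Counter(w for line in lines for w in line.split())
    words = [w for w, _ in counts.most_common(vocab_size - 1)]
    return {w: i for i, w in enumerate(words, start=1)}


def encode(line, word_to_id):
    """Replace each word by its ID, falling back to UNKNOWN."""
    return [word_to_id.get(w, UNKNOWN) for w in line.split()]


# Hypothetical wakati lines.
lines = ["朝 は 早い", "夜 は 遅い", "朝 は 眠い"]
word_to_id = build_vocab(lines, vocab_size=3)
ids = encode("朝 は 遅い", word_to_id)
print(word_to_id, ids)
```

With a vocabulary size of 3, only the two most frequent words keep their own IDs and every other word maps to 0.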
87 | 
88 | ## 5. Use fastText
89 | 
90 | Use `eval.py` to search for similar words.
91 | 
92 | ```
93 | python eval.py (word)
94 | ```
95 | 
96 | By default, this reads `model.vec` in the `fastText` directory. If the file has a different name or is stored elsewhere, specify its location with the `--path` option.
97 | 
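To illustrate the idea behind the similarity search, the sketch below parses text in the `.vec` layout (a header line with the word count and dimension, then one word and its values per line) and ranks words by cosine similarity. The three-dimensional `SAMPLE_VEC` data is made up, the helper names are hypothetical, and the code uses plain Python instead of NumPy. Unlike `eval.py`, which collects candidates against a threshold and stops early, this version scores every word.

```python
import math

# Hypothetical 3-dimensional vectors in .vec layout: a "count dim" header,
# then one "word v1 v2 ..." row per word.
SAMPLE_VEC = """4 3
朝 1.0 0.0 0.0
昼 0.9 0.1 0.0
夜 -1.0 0.0 0.0
犬 0.0 1.0 0.0
"""


def read_vectors(text):
    """Parse .vec-style text into a dict of word -> list of floats."""
    lines = text.strip().split("\n")
    dim = int(lines[0].split()[1])
    vectors = {}
    for line in lines[1:]:
        elements = line.split()
        if len(elements) == dim + 1:  # skip rows with the wrong size
            vectors[elements[0]] = [float(x) for x in elements[1:]]
    return vectors


def cosine(v1, v2):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)


def most_similar(word, vectors, topn=2):
    """Return the topn (word, score) pairs closest to `word`, excluding itself."""
    target = vectors[word]
    scores = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


vectors = read_vectors(SAMPLE_VEC)
ranking = most_similar("朝", vectors)
for w, score in ranking:
    print("{0}, {1:.4f}".format(w, score))
```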
98 | 


--------------------------------------------------------------------------------
/eval.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import argparse
 3 | import numpy as np
 4 | 
 5 | 
 6 | def read_words_vector(path):
 7 |     vectors = {}
 8 |     with open(path, "r", encoding="utf-8") as f:
 9 |         for line in f:
10 |             try:
11 |                 elements = line.strip().split()
12 |                 word = elements[0]
13 |                 vec = np.array(elements[1:], dtype=float)
14 |                 if word not in vectors and len(vec) >= 100:
15 |                     # ignore rows whose vector size is invalid (e.g. the header line)
16 |                     vectors[word] = vec
17 |             except ValueError:
18 |                 continue
19 |             except IndexError:
20 |                 continue
21 |     
22 |     return vectors
23 | 
24 | 
25 | def similarity(v1, v2):
26 |     n1 = np.linalg.norm(v1)
27 |     n2 = np.linalg.norm(v2)
28 |     return np.dot(v1, v2) / n1 / n2
29 | 
30 | 
31 | def evaluate(path, word, negative, threshold):
32 |     if not word:
33 |         raise Exception("word is missing")
34 | 
35 |     vectors = read_words_vector(path)
36 |     
37 |     if word not in vectors:
38 |         raise Exception("Sorry, this word is not registered in model.")
39 | 
40 |     w_vec = vectors[word]
41 |     border_positive = threshold if threshold > 0 else 0.8
42 |     border_negative = threshold if threshold > 0 else 0.3
43 |     max_candidates = 15
44 |     candidates = {}
45 | 
46 |     for w in vectors:
47 |         try:
48 |             if w_vec.shape != vectors[w].shape:
49 |                 raise Exception("size not match")
50 |             s = similarity(w_vec, vectors[w])
51 |         except Exception as ex:
52 |             print(w + " is not valid word.")
53 |             continue
54 |         
55 |         if negative and s <= border_negative:
56 |             candidates[w] = s
57 |             if len(candidates) % 5 == 0:
58 |                 border_negative -= 0.05
59 |         elif not negative and s >= border_positive:
60 |             candidates[w] = s
61 |             if len(candidates) % 5 == 0:
62 |                 border_positive += 0.05
63 |         
64 |         if len(candidates) > max_candidates:
65 |             break
66 |     
67 |     sorted_candidates = sorted(candidates, key=candidates.get, reverse=not negative)
68 |     for c in sorted_candidates:
69 |         print("{0}, {1}".format(c, candidates[c]))
70 | 
71 | 
72 | if __name__ == "__main__":
73 |     parser = argparse.ArgumentParser(description="Evaluate Fast Text")
74 |     parser.add_argument("word", type=str, help="word to search similar words")
75 |     parser.add_argument("--path", type=str, help="path to model file", default="")
76 |     parser.add_argument("--negative", action="store_const", const=True, default=False, help="search opposite words")
77 |     parser.add_argument("--threshold", "--t", dest="threshold", type=float, default=-1, help="threshold to judge similarity")
78 | 
79 |     args = parser.parse_args()
80 |     path = args.path
81 |     if not path:
82 |         path = os.path.join(os.path.dirname(__file__), "fastText/model.vec")
83 | 
84 |     evaluate(path, args.word, args.negative, args.threshold)
85 | 


--------------------------------------------------------------------------------
/parse.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import argparse
  3 | import xml.etree.ElementTree as ET
  4 | 
  5 | 
  6 | MAKE_CORPUS_PATH = lambda f: os.path.join(os.path.dirname(__file__), "./corpus/" + f)
  7 | 
  8 | 
  9 | def extract(file_path):
 10 |     if not os.path.isfile(file_path):
 11 |         raise Exception("Abstract file is not found.")
 12 |     root = ET.parse(file_path)
 13 |     root.findall(".")
 14 | 
 15 |     file_path = MAKE_CORPUS_PATH("abstracts.txt")
 16 |     stream = open(file_path, mode="w", encoding="utf-8")
 17 | 
 18 |     for doc in root.findall("./doc"):
 19 |         abstract = doc.find("./abstract").text
 20 |         if not abstract:
 21 |             continue
 22 |         elif abstract.startswith(("|", "thumb", "{", "・", ")", "(", "link")):
 23 |             continue
 24 |         title = (doc.find("./title").text or "").replace("Wikipedia: ", "")
 25 | 
 26 |         if title:
 27 |             stream.write(title + "\n")
 28 |             stream.write(abstract + "\n")
 29 |     
 30 |     stream.close()
 31 | 
 32 | 
 33 | def concat(dir, prefix):
 34 |     if not os.path.isdir(dir):
 35 |         raise Exception("directory is not found")
 36 |     
 37 |     paths = []
 38 |     def fetch(_dir):
 39 |         files = os.listdir(_dir)
 40 |         for f in files:
 41 |             p = os.path.join(_dir, f)
 42 |             if os.path.isfile(p) and f.startswith(prefix):
 43 |                 paths.append(p)
 44 |             if os.path.isdir(p):
 45 |                 fetch(p)
 46 |     
 47 |     fetch(dir)
 48 |     file_path = MAKE_CORPUS_PATH(prefix + "_all.txt")
 49 |     with open(file_path, mode="w", encoding="utf-8") as o:
 50 |         for p in paths:
 51 |             print("concat {}.".format(p))
 52 |             with open(p, mode="r", encoding="utf-8") as f:
 53 |                 for line in f:
 54 |                     o.write(line)
 55 | 
 56 | 
 57 | def wakati(file_path):
 58 |     from janome.tokenizer import Tokenizer
 59 |     path, ext = os.path.splitext(file_path)
 60 |     wakati_path = path + "_wakati" + ext
 61 | 
 62 |     tokenizer = Tokenizer()
 63 | 
 64 |     def wsplit(text):
 65 |         ws = []
 66 |         tokens = tokenizer.tokenize(text.strip())
 67 |         for t in tokens:
 68 |             w = t.surface.strip()
 69 |             if w:
 70 |                 ws.append(w)
 71 |         return ws
 72 | 
 73 |     with open(file_path, mode="r", encoding="utf-8") as f:
 74 |         with open(wakati_path, mode="w", encoding="utf-8") as w:
 75 |             for line in f:
 76 |                 words = wsplit(line)
 77 |                 w.write(" ".join(words) + "\n")
 78 | 
 79 | 
 80 | def tokenize(file_path, vocab_size):
 81 |     import MeCab
 82 |     path, ext = os.path.splitext(file_path)
 83 |     vocab_path = path + ".vocab"
 84 |     tokenized_path = path + "_tokenized" + ext
 85 |     UNKNOWN = 0
 86 | 
 87 |     tagger = MeCab.Tagger("-Owakati")
 88 |     tagger.parse("")
 89 | 
 90 |     def wsplit(text):
 91 |         ws = []
 92 |         node = tagger.parseToNode(text.strip())
 93 |         while node:
 94 |             w = node.surface.strip()
 95 |             if w:
 96 |                 ws.append(w)
 97 |             node = node.next
 98 |         
 99 |         return ws
100 | 
101 |     # make vocab file
102 |     print("making vocabulary dictionary...")
103 |     vocab = {}
104 |     with open(file_path, mode="r", encoding="utf-8") as f:
105 |         for line in f:
106 |             words = wsplit(line)
107 |             for w in words:
108 |                 if w in vocab:
109 |                     vocab[w] += 1
110 |                 else:
111 |                     vocab[w] = 1
112 | 
113 |     dictionary = [UNKNOWN] + sorted(vocab, key=vocab.get, reverse=True)
114 |     dictionary = dictionary[:vocab_size]
115 |     with open(vocab_path, mode="w", encoding="utf-8") as v:
116 |         v.write("\n".join([str(_v) for _v in dictionary]))
117 | 
118 |     # make tokenized file (a dict lookup avoids a linear list search per word)
119 |     print("tokenize by vocabulary dictionary...")
120 |     word_to_id = {w: i for i, w in enumerate(dictionary)}
121 |     with open(file_path, mode="r", encoding="utf-8") as f, \
122 |             open(tokenized_path, mode="w", encoding="utf-8") as t:
123 |         for line in f:
124 |             tokens = [word_to_id.get(w, UNKNOWN) for w in wsplit(line)]
125 |             t.write(" ".join([str(_t) for _t in tokens]) + "\n")
126 | 
127 | 
128 | if __name__ == "__main__":
129 |     parser = argparse.ArgumentParser(description="Utility Parser")
130 |     parser.add_argument("path", type=str, help="target path")
131 |     parser.add_argument("--extract", action="store_const", const=True, default=False, help="extract abstract xml")
132 |     parser.add_argument("--wakati", action="store_const", const=True, default=False, help="separate japanese text (You need janome)")
133 |     parser.add_argument("--concat", type=str, help="concatenate files that matches the pattern in target directory")
134 |     parser.add_argument("--tokenize", type=int, default=-1, help="make vocab file and tokenize target file with the given vocab size (You need MeCab)")
135 | 
136 |     args = parser.parse_args()
137 |     path = args.path
138 |     if path.startswith("/"):
139 |         path = os.path.join(os.path.dirname(__file__), path)
140 | 
141 |     if args.extract:
142 |         extract(path)
143 |     elif args.wakati:
144 |         wakati(path)
145 |     elif args.concat:
146 |         concat(path, args.concat)
147 |     elif args.tokenize > 0:
148 |         tokenize(path, args.tokenize)
149 | 


--------------------------------------------------------------------------------