├── LICENSE.md
├── README.md
├── evaluation_dataset
│   └── twitter_sentiment
│       ├── README.md
│       ├── dev.tsv
│       ├── test.tsv
│       └── train.tsv
├── images
│   └── QR_hottoSNS-bert.png
├── requirements.txt
├── src
│   ├── dataprocessor
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   │   ├── __init__.cpython-36.pyc
│   │   │   ├── custom.cpython-36.pyc
│   │   │   └── preset.cpython-36.pyc
│   │   ├── custom.py
│   │   └── preset.py
│   ├── modeling.py
│   ├── optimization.py
│   ├── preprocess
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   │   ├── __init__.cpython-36.pyc
│   │   │   └── normalizer.cpython-36.pyc
│   │   └── normalizer.py
│   ├── run_classifier.py
│   ├── run_classifier.sh
│   ├── tokenization.py
│   └── utility.py
└── trained_model
    ├── masked_lm_only_L-12_H-768_A-12
    │   └── .gitkeep
    ├── multi_cased_L-12_H-768_A-12
    │   └── .gitkeep
    └── wikipedia_ja_L-12_H-768_A-12
        └── .gitkeep
/LICENSE.md:
--------------------------------------------------------------------------------
```
Article 1 (Definitions)
The terms used in this Agreement have the meanings set forth below.
(1) "these Terms" means these terms of use.
(2) "Party A" means Hottolink, Inc. (hereinafter "Party A").
(3) "Party B" means an individual who agrees to these Terms, obtains Party A's approval, and uses the sentence embedding data distributed by Party A.
(4) "the Data" means the sentence embedding data created by Party A and everything accompanying it.


Article 2 (Grant of License)
Party A grants Party B a non-exclusive license to use the Data in accordance with these Terms. Party A and Party B confirm that no rights regarding the Data are granted to Party B other than those expressly set forth in these Terms.


Article 4 (Conditions of the License)
The conditions under which Party A licenses the Data to Party B are as follows.
(1) Purpose of use: to carry out academic or industrial research on the Japanese language (hereinafter "the Research").
(2) Scope of use: Party B and the research group to which Party B belongs.
(3) Method of use: for the Research, Party B copies the Data to computer terminals or servers managed by Party B, analyzes and studies the Data, and obtains analysis data stored in databases or the like (hereinafter "the Analysis Data").


Article 5 (Application for Use)
1. Party B shall submit the items specified by Party A, such as Party B's name, affiliation, and contact information, through the web form designated by Party A (hereinafter the "input form"), and obtain Party A's approval for use of the Data. If Party A withholds approval, Party A is under no obligation to disclose the reason.
2. If any of the information declared to Party A under the preceding paragraph changes, Party B shall report this to Party A without delay and obtain Party A's approval anew.
3. Party B is deemed to have agreed to these Terms at the time Party B submits the input form.

Article 6 (Prohibited Acts)
In using the Data, Party B shall not engage in any of the following acts.
(1) Transferring, lending, or selling the Data or copies thereof (including data from which they can be restored); or, without Party A's prior written permission, distributing or publicly transmitting them, reprinting them in publications, or otherwise using them beyond the scope set forth in the preceding paragraph, thereby infringing the rights of Party A or any third party.
(2) Using the Data to defame Party A or any third party, to invade privacy, or to commit any other infringement of rights.
(3) Allowing any third party other than Party B and the research group to which Party B belongs to use the Data.
(4) Using the Data for any purpose or by any means other than those expressly permitted in these Terms.

Article 7 (Consideration)
No fee is charged for the license to use the Data under these Terms.

Article 8 (Publication)
1. Solely for the purpose of academic research, Party B may publish research results and findings obtained using the Data. Such publication may include publication of the Analysis Data and of processing programs.
2. When publishing, Party B shall state clearly that the results are based on the Data, and shall report an outline of the results to Party A in writing, by e-mail, or by similar means before publication.
3. When publishing a paper, Party B shall likewise state clearly that the Data was used, and shall submit the venue and the date of publication to Party A via the designated form.


Article 9 (Party B's Responsibilities)
1. Party B shall, at Party B's own responsibility and expense, prepare, operate, and connect all communication equipment, software, communication lines, and the like necessary to download the Data.
2. Party B shall use the Data solely for carrying out the Research.
3. Party B shall manage the Data with the due care of a prudent manager so that it is not leaked, and shall take appropriate protective measures on Party B's computer terminals and the like.
4. If Party B carries out the Research jointly with members of the group to which Party B belongs, Party B shall cause the other members of that group to comply with these Terms, and if any such member violates these Terms and causes damage to Party A or a third party, Party B shall bear joint and several liability for that conduct as if it were Party B's own.
5. When Party A deems it necessary, Party A may request Party B to disclose the status of use of the Data, and Party B shall comply with such a request.


Article 10 (Ownership of Intellectual Property Rights)
Party A and Party B confirm that all intellectual property rights in the Data, and all intellectual property rights in documents, data, and other materials provided by Party A in connection with the use of the Data, belong to Party A; provided, however, that the copyright in each document used as source material for creating the Data belongs to the third parties holding the legitimate rights thereto.

Article 11 (No Warranty)
1. Party A makes no warranty whatsoever that the Data does not infringe any third party's copyright, patent, other intangible property right, trade secret, know-how, or other rights; that it does not violate any law or regulation; or that the algorithms used to create the Data are free of mistakes, errors, or bugs. Nor does Party A warrant the Data's reliability, accuracy, timeliness, completeness, validity, or fitness for a particular purpose, and Party A bears no liability for defects.
2. In the unlikely event that a third party asserts a claim of intellectual property infringement or the like regarding the Data, Party B shall immediately notify Party A to that effect and shall cooperate to the fullest extent in resolving the dispute, including by providing information to Party A.


Article 12 (Measures upon Violation)
1. Party A may enjoin Party B from using the Data if Party B falls under any one of the following items.
(1) Party B violates these Terms.
(2) Party B violates laws or regulations.
(3) Party B commits misconduct such as making false declarations.
(4) Party B engages in conduct that destroys the relationship of trust.
(5) Party A otherwise deems Party B unsuitable.
2. The preceding paragraph does not preclude Party A from claiming damages against Party B.
3. If, under paragraph 1, Party A demands that Party B cease using the Data, Party B shall erase the Data, the Analysis Data, and all copies thereof from the equipment managed by Party B.

Article 13 (Revocation of the License for Party A's Reasons)
1. Party A may suspend the license to use the Data without any prior notice, regardless of the reason. In that case, pursuant to Article 15, Party B shall promptly erase or destroy the Data and all copies thereof.
2. The Analysis Data is not subject to the destruction or erasure described in the preceding paragraph.


Article 14 (Term of Use)
1. The period during which Party B may use the Data is one year from the date of Party A's approval under Article 5.
2. If Party B wishes to continue using the Data beyond one year, Party B shall apply for use again by the method set forth in Article 5.


Article 15 (Measures after Termination of this Agreement)
1. Regardless of the reason, when the term of use set forth in Article 14 ends, or when the license to use the Data is revoked, Party B shall erase or destroy the Data and all copies thereof.
2. The Analysis Data is not subject to the destruction or erasure described in the preceding paragraph; provided, however, that Party B may not restore the Data from the Analysis Data and reuse it.
3. Articles 10, 11, and 15 through 19 survive the termination of this Agreement.

Article 16 (No Assignment of Rights and Obligations)
Party B shall not assign to any third party, or offer as security, its status under this Agreement or any rights or obligations arising from this Agreement without the other party's prior written consent.

Article 17 (Protection of Personal Information and Legal Compliance)
1. Party B's personal information obtained by Party A is handled in accordance with Party A's separately stipulated privacy policy.
2. Party A may copy Party B's personal information to other servers in order to deal with failures of server equipment and other problems.

Article 18 (Governing Law)
This Agreement is governed by the laws of Japan.

Article 19 (Court of Jurisdiction)
The Tokyo District Court has exclusive jurisdiction in the first instance over any and all disputes arising out of or in connection with this Agreement.

Article 20 (Consultation)
Matters not provided for in this Agreement, and matters over which doubt arises, shall be resolved amicably through good-faith consultation between Party A and Party B.

Article 21 (Effect of these Terms)
These Terms apply to all matters concerning the use of the Data. These Terms may be amended from time to time; except where specifically provided otherwise, amended Terms take effect when displayed on the web.
```

103 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# hottoSNS-BERT: A Sentence Embedding Model Built from a Large-scale Japanese SNS Corpus

## Overview
* We built BERT-based sentence embeddings from a large-scale corpus of Japanese social media text (hereinafter, the large-scale SNS corpus).
* This sentence embedding model (hereinafter, hottoSNS-BERT) is distributed only to those who register via the form below.
* Use of the model is governed by the [Terms of Use](#terms-of-use); the full text is given in LICENSE.md.

[Registration form](https://forms.office.com/Pages/ResponsePage.aspx?id=Zpu1Ffmdi02AfxgH3uo25PxaMnBWkvJLsXoQLeuzhoBUNU0zN01BR1VFNk9RSEUxWVRNSzAyWThZNSQlQCN0PWcu)

[Registration form for publications that use this language resource](https://forms.office.com/r/EQizJpFBJg)
(Submitting a PDF file is not required.)

### Citation
* No paper on this model has been published yet. If you cite it, please use the following BibTeX entry.
```
@article{hottoSNS-bert,
    author = {Sakaki, Takeshi and Mizuki, Sakae and Gunji, Naoyuki},
    title = {BERT Pre-trained model Trained on Large-scale Japanese Social Media Corpus},
    year = {2019},
    howpublished = {\url{https://github.com/hottolink/hottoSNS-bert}}
}
```

- [hottoSNS-BERT: A Sentence Embedding Model Built from a Large-scale Japanese SNS Corpus](#hottosns-bert-a-sentence-embedding-model-built-from-a-large-scale-japanese-sns-corpus)
- [Overview](#overview)
- [Citation](#citation)
- [Description of the Distributed Resources](#description-of-the-distributed-resources)
- [Large-scale Japanese SNS Corpus](#large-scale-japanese-sns-corpus)
- [File Layout](#file-layout)
- [Usage](#usage)
- [Tested Environment](#tested-environment)
- [Preparing the Bundled Evaluation Code](#preparing-the-bundled-evaluation-code)
- [Running the Bundled Evaluation Code](#running-the-bundled-evaluation-code)
- [Loading the Model](#loading-the-model)
- [How the Distributed Resources Were Built](#how-the-distributed-resources-were-built)
- [Building the Corpus and Word Embeddings](#building-the-corpus-and-word-embeddings)
- [Collecting and Building the Raw-text Corpus](#collecting-and-building-the-raw-text-corpus)
- [Preprocessing](#preprocessing)
- [Building the Tokenized Corpus](#building-the-tokenized-corpus)
- [Postprocessing](#postprocessing)
- [Comparison of Preprocessing and Tokenization with Existing Models](#comparison-of-preprocessing-and-tokenization-with-existing-models)
- [Statistics](#statistics)
- [Pre-training the Sentence Embedding Model](#pre-training-the-sentence-embedding-model)
- [pre-training](#pre-training)
- [Comparison of neural-net architectures](#comparison-of-neural-net-architectures)
- [Training Environment](#training-environment)
- [Evaluation of the Distributed Resources](#evaluation-of-the-distributed-resources)
- [Evaluation Datasets](#evaluation-datasets)
- [Downstream task: fine-tuning](#downstream-task-fine-tuning)
- [Experimental Results](#experimental-results)
- [Using the Model from pytorch-transformers](#using-the-model-from-pytorch-transformers)
- [Terms of Use](#terms-of-use)

## Description of the Distributed Resources
### Large-scale Japanese SNS Corpus
* To train our own BERT, we built a large-scale corpus of Japanese tweets
* We took care to maximize the diversity of the collected text; specifically, bot posts and retweets were excluded, and tweets with duplicate bodies were removed
* The resulting corpus contains 85 million tweets
* This is roughly 35% of the size of the corpus used by the original BERT (English Wikipedia + BookCorpus)

### File Layout

| Model | File name |
|-----------|----------------------|
| hottoSNS-BERT model | bert_config.json |
| | graph.pbtxt |
| | model.ckpt-1000000.meta |
| | model.ckpt-1000000.index |
| | model.ckpt-1000000.data-00000-of-00001 |
| sentencepiece model | tokenizer_spm_32K.model |
| | tokenizer_spm_32K.vocab.to.bert |
| | tokenizer_spm_32K.vocab |

### Usage
#### Tested Environment
* Python 3.6.6
* tensorflow==1.11.0
* sentencepiece==0.1.8

#### Preparing the Bundled Evaluation Code
Under `./hottoSNS-bert/evaluation_dataset/twitter_sentiment/`, you need data built by restoring tweets from the [Twitter Japanese Sentiment Analysis Dataset](https://www.db.info.gifu-u.ac.jp/sentiment_analysis/) and reformatting them for BERT model evaluation. See [hottoSNS-bert/evaluation_dataset/twitter_sentiment/](https://github.com/hottolink/hottoSNS-bert/tree/master/evaluation_dataset/twitter_sentiment) for details.

#### Running the Bundled Evaluation Code
```
# clone the repository
git clone https://github.com/hottolink/hottoSNS-bert.git
cd hottoSNS-bert

# place the obtained BERT model files under `trained_model/`
cp -r [hottoSNS-bert dir]/* ./trained_model/masked_lm_only_L-12_H-768_A-12/
cp -r [Japanese Wikipedia model dir]/* ./trained_model/wikipedia_ja_L-12_H-768_A-12/
cp -r [Multilingual model dir]/* ./trained_model/multi_cased_L-12_H-768_A-12/

# set up the evaluation environment and run the evaluation
# note: this takes time because the embeddings are loaded from text files
sh setup.sh
cd src
sh run_classifier.sh
```

#### Loading the Model
Please refer to the sample code in [Using the Model from pytorch-transformers](#using-the-model-from-pytorch-transformers).

## How the Distributed Resources Were Built
### Building the Corpus and Word Embeddings

#### Collecting and Building the Raw-text Corpus
* Period: a sample of tweets posted between 2017 and 2018
* Posting client: clients intended for human use only
    * this effectively excludes bot posts
* Tweet types: organic tweets and mentions

#### Preprocessing
* Character filtering: removal of the retweet marker (RT) and ellipsis marks (...)
* Normalization: NFKC normalization and lowercasing
* Special-token substitution: mentions and URLs
* Deduplication: tweets whose normalized bodies are duplicates are removed

* Sample data (a preprocessing sketch follows the sample):

```
ゆめさんが、ファボしてくるあたり、世代だなって思いました( ̇- ̇ )笑
90秒に250円かけるかどうかは、まぁ個人の自由だしね()
それでは聞いてください rainy
```
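
As a rough illustration of the steps above, here is a minimal sketch in plain Python. It does not reproduce the repository's own `preprocess/normalizer.py`; the regular expressions and the replacement tokens `<mention>` and `<url>` are illustrative assumptions.

```
import re
import unicodedata

# Hypothetical special tokens; the tokens actually used by hottoSNS-BERT are
# defined by the distributed sentencepiece model, not by this sketch.
MENTION_TOKEN = "<mention>"
URL_TOKEN = "<url>"

def normalize_tweet(text):
    text = re.sub(r"^RT\s+", "", text)               # drop the retweet marker
    text = text.replace("...", "").replace("…", "")  # drop ellipsis marks
    text = re.sub(r"https?://\S+", URL_TOKEN, text)  # URLs -> special token
    text = re.sub(r"@\w+", MENTION_TOKEN, text)      # mentions -> special token
    text = unicodedata.normalize("NFKC", text)       # NFKC normalization
    return text.lower()                              # lowercasing

print(normalize_tweet("RT @user こんにちは… https://example.com/x"))
```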

#### Building the Tokenized Corpus
* sentencepiece is used for tokenization
* The settings are as follows (a training sketch follows the sample below)

| argument | value |
|--|--|
| vocab_size | 32,000 |
| character_coverage | 0.9995 |
| model_type | unigram |
| add_dummy_prefix | FALSE |
| user_defined_symbols | \,\ |
| control_symbols | [CLS],[SEP],[MASK] |
| input_sentence_size | 20,000,000 |
| pad_id | 0 |
| unk_id | 1 |
| bos_id | -1 |
| eos_id | -1 |

* Sample data:

```
ゆめ さんが 、 ファボ してくる あたり 、 世代 だ なって思いました ( ▁̇ - ▁̇ ▁ ) 笑

▁ 90 秒 に 250 円 かける かどうかは 、 まぁ 個人の 自由 だしね ()

▁ それでは 聞いてください ▁ rain y ▁
```
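
The following is a minimal sketch of how the settings in the table above could be passed to the `sentencepiece` Python bindings. The corpus file name is hypothetical, and the `user_defined_symbols` value is an assumption based on the mention/URL special tokens from the preprocessing step.

```
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    "--input=tweets_normalized.txt "           # hypothetical corpus file
    "--model_prefix=tokenizer_spm_32K "
    "--vocab_size=32000 "
    "--character_coverage=0.9995 "
    "--model_type=unigram "
    "--add_dummy_prefix=false "
    "--user_defined_symbols=<mention>,<url> "  # assumption, see lead-in
    "--control_symbols=[CLS],[SEP],[MASK] "
    "--input_sentence_size=20000000 "
    "--pad_id=0 --unk_id=1 --bos_id=-1 --eos_id=-1"
)

# Tokenize with the resulting model:
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer_spm_32K.model")
print(sp.EncodeAsPieces("それでは聞いてください rainy"))
```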

#### Postprocessing
* Tweets that are too short, and users that post too few tweets, are excluded
* Specifically, users and tweets below the following thresholds are removed (a filtering sketch follows the table)

| limitation | value |
|--|--|
| token length | 5 |
| tweets per user | 5 |
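
A minimal sketch of that filtering step, assuming the tokenized corpus is available in memory as `(user_id, tokens)` pairs; the variable names and the toy data are illustrative.

```
from collections import Counter

MIN_TOKEN_LENGTH = 5
MIN_TWEETS_PER_USER = 5

# tweets: iterable of (user_id, tokens) pairs produced by the tokenizer above.
tweets = [("u1", ["▁", "それでは", "聞いてください", "▁", "rain", "y"])]  # toy data

# Drop tweets that are too short, then drop users with too few remaining tweets.
long_enough = [(u, toks) for u, toks in tweets if len(toks) >= MIN_TOKEN_LENGTH]
counts = Counter(u for u, _ in long_enough)
corpus = [(u, toks) for u, toks in long_enough if counts[u] >= MIN_TWEETS_PER_USER]
```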

#### Comparison of Preprocessing and Tokenization with Existing Models
The first three feature columns describe preprocessing; the remaining three describe tokenization.

| Model name | Character normalization | Special-token substitution | Lowercasing | Vocabulary size | Tokenizer | Language |
|-------------------------|-------------------------|----------------------------|-------------|---------|---------------|----------|
| BERT Multilingual | none | no | yes | 119,547 | WordPiece | multi※ |
| BERT Japanese Wikipedia | NFKC | no | no | 32,000 | SentencePiece | ja |
| hottoSNS-BERT | NFKC | yes | no | 32,000 | SentencePiece | ja |

#### Statistics
The statistics of the resulting corpus are as follows.

* Whole corpus

| metric | value |
|--|--|
| n_user | 1,872,623 |
| n_post | 85,925,384 |
| n_token | 1,441,078,317 |

* Tokens per post and posts per user

| metric | n_token | n_post.per.user |
|--|--|--|
| min | 5 | 5 |
| mean | 16.77 | 45.89 |
| std | 13.06 | 14.83 |
| q(0.99) | 64 | 76 |
| max | 227 | 781 |

### Pre-training the Sentence Embedding Model
#### pre-training
Because next sentence prediction is difficult to apply to tweets, only the masked language model objective is used.
In addition, the pre-training task limits each sample to at most 64 tokens.

#### Comparison of neural-net architectures
The columns n_dim_e through max_pos_embed describe the network architecture; max_seq_len, n_batch, and n_step describe pre-training (a configuration sketch follows the table).

| Model name | n_dim_e | n_dim_h | n_attn_head | n_layer | max_pos_embed | max_seq_len | n_batch | n_step |
|-------------------------|---------|---------|-------------|---------|---------------|-------------|---------|-----------|
| BERT Multilingual | 768 | 3072 | 12 | 12 | 512 | 512 | 256 | 1,000,000 |
| BERT Japanese Wikipedia | 768 | 3072 | 12 | 12 | 512 | 512 | 256 | 1,400,000 |
| hottoSNS-BERT | 768 | 3072 | 12 | 12 | 512 | 64 | 512 | 1,000,000 |
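
As the table shows, hottoSNS-BERT keeps the standard BERT-Base shape and differs mainly in `max_seq_len` and batch size. A minimal sketch of the corresponding configuration using this repository's `modeling.BertConfig`; in practice the distributed `bert_config.json` is authoritative and should be loaded instead.

```
import modeling  # from ./src of this repository

# Values read off the table above (BERT-Base shape).
config = modeling.BertConfig(
    vocab_size=32000,
    hidden_size=768,              # n_dim_e
    num_hidden_layers=12,         # n_layer
    num_attention_heads=12,       # n_attn_head
    intermediate_size=3072,       # n_dim_h
    max_position_embeddings=512)  # max_pos_embed

# Equivalent, once the model files are in place:
# config = modeling.BertConfig.from_json_file(
#     "trained_model/masked_lm_only_L-12_H-768_A-12/bert_config.json")
```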

#### Training Environment
* We used Cloud TPUs on Google Cloud Platform
* TensorFlow 1.12.0 was used as the neural-net framework
* Details are as follows
    * CPU: n1-standard-2 (2 vCPUs, 7.5 GB memory)
    * Storage: Cloud Storage
    * TPU: v2-8

## Evaluation of the Distributed Resources
### Evaluation Datasets
* The BERT models are evaluated with tweet sentiment analysis as the downstream task.
* The sentiment analysis task is evaluated on two datasets:
    1. the [Twitter Japanese Sentiment Analysis Dataset](https://www.db.info.gifu-u.ac.jp/sentiment_analysis/) [芥子+, 2017]
        * number of samples: 161K
    2. an in-house dataset, the large-scale Twitter topic corpus
        * number of samples: 12K

* Dataset statistics:

| Dataset | Topic | positive | negative | neutral | total |
|:-:|:-:|:-:|:-:|:-:|:-:|
| Large-scale Twitter topic corpus | unrestricted | 4,162 | 3,031 | 4,807 | 12,000 |
| Twitter Japanese Sentiment Analysis Dataset | home appliances and mobile devices | 10,249 | 15,920 | 134,928 | 161,097 |

### Downstream task: fine-tuning
* The downstream task is set up as follows (a metric sketch follows this list):
    * task type: sentiment analysis of Japanese tweets, i.e. 3-class classification into Positive/Negative/Neutral
    * task datasets
        1. the Twitter Japanese Sentiment Analysis Dataset [芥子+, 2017]
        2. the in-house dataset
    * methodology
        * split each task dataset into train:test = 9:1
        * hyper-parameters follow the BERT paper [Devlin+, 2018]
        * training and evaluation are repeated 7 times and averages are reported
    * evaluation metrics: accuracy and macro F-value
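
A minimal sketch of the two evaluation metrics using scikit-learn (an assumption for illustration; the bundled `run_classifier.py` computes its own metrics):

```
from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels and predictions for the 3-class task.
y_true = ["pos", "neg", "neutral", "pos", "neutral"]
y_pred = ["pos", "neutral", "neutral", "pos", "neg"]

accuracy = accuracy_score(y_true, y_pred)
macro_f = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(accuracy, macro_f)
```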

```
芥子 育雄, 鈴木 優, 吉野 幸一郎, グラム ニュービッグ, 大原 一人, 向井 理朗, 中村 哲: 「単語意味ベクトル辞書を用いたTwitterからの日本語評判情報抽出」, 電子情報通信学会論文誌, Vol.J100-D, No.4, pp.530-543, 2017.4.
Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv preprint arXiv:1810.04805, 2018.
```

### Experimental Results
The experimental results are as follows ("topic corpus" denotes the in-house large-scale Twitter topic corpus; "sentiment dataset" denotes the Twitter Japanese Sentiment Analysis Dataset).

| Model name | accuracy (topic corpus) | F-value (topic corpus) | accuracy (sentiment dataset) | F-value (sentiment dataset) |
|-------------------------|----------|---------|----------|---------|
| BERT Multilingual | 0.7019 | 0.7011 | 0.8776 | 0.7225 |
| BERT Japanese Wikipedia | 0.7237 | 0.7239 | 0.8790 | 0.7359 |
| hottoSNS-BERT | 0.7387 | 0.7396 | 0.8880 | 0.7503 |

The results support the following observations.

* On the Twitter sentiment analysis tasks, performance is ordered Multilingual < Japanese Wikipedia < Japanese SNS.
* From Multilingual < Japanese Wikipedia: for downstream tasks on Japanese text, tokenization specialized to the Japanese vocabulary and pre-training on a Japanese corpus appear to be better suited.
* From Japanese Wikipedia < Japanese SNS: for downstream tasks on Twitter text, a BERT model pre-trained on the in-domain large-scale Japanese SNS corpus appears to outperform one pre-trained on Japanese Wikipedia.

## Using the Model from pytorch-transformers
* Example run with pytorch 1.8.1+cu102 and tensorflow 2.4.1
1. Loading hottoSNS-bert
```
# load transformers and tensorflow
import os, sys
from transformers import BertForPreTraining, BertTokenizer
import tensorflow as tf

# load the hottoSNS-bert sources
sys.path.append("./hottoSNS-bert/src/")
import tokenization
from preprocess import normalizer
```
1. Specifying the required files
```
bert_model_dir = "./hottoSNS-bert/trained_model/masked_lm_only_L-12_H-768_A-12/"
config_file = os.path.join(bert_model_dir, "bert_config.json")
vocab_file = os.path.join(bert_model_dir, "tokenizer_spm_32K.vocab.to.bert")
sp_model_file = os.path.join(bert_model_dir, "tokenizer_spm_32K.model")
bert_model_file = os.path.join(bert_model_dir, "model.ckpt-1000000.index")
```
1. Instantiating the tokenizer
```
tokenizer = tokenization.JapaneseTweetTokenizer(
    vocab_file=vocab_file,
    model_file=sp_model_file,
    normalizer=normalizer.twitter_normalizer_for_bert_encoder,
    do_lower_case=False)
```
1. Tokenization example
```
# example sentence
text = '@test ゆめさんが、ファボしてくるあたり、世代だなって思いました( ̇- ̇ )笑 http://test.com/test.html'
# tokenize
words = tokenizer.tokenize(text)
print(words)
# ['', '▁', 'ゆめ', 'さんが', '、', 'ファボ', 'してくる', 'あたり', '、', '世代', 'だ', 'なって思いました', '(', '▁̇', '-', '▁̇', '▁', ')', '笑', '▁', '']
# convert to ids
tokenizer.convert_tokens_to_ids(["[CLS]"]+words+["[SEP]"])
# [2, 6, 7, 6372, 348, 8, 5249, 2135, 1438, 8, 3785, 63, 28146, 12, 112, 93, 112, 7, 13, 31, 7, 5, 3]
```

1. Loading the pre-trained model (a usage sketch follows this block)
```
model = BertForPreTraining.from_pretrained(bert_model_file, from_tf=True, config=config_file)
```
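
Once loaded, the model can be applied to the token ids from the tokenization example. A minimal sketch, assuming a transformers 4.x-style model object where `model.bert` exposes the underlying encoder:

```
import torch

ids = tokenizer.convert_tokens_to_ids(["[CLS]"] + words + ["[SEP]"])
input_ids = torch.tensor([ids])  # batch of one sentence

with torch.no_grad():
    # The first output of the encoder is the per-token hidden states.
    hidden_states = model.bert(input_ids)[0]
print(hidden_states.shape)  # torch.Size([1, 23, 768]) for the example sentence
```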

## Terms of Use
The Terms of Use are identical in content to LICENSE.md; see LICENSE.md for the full text.
--------------------------------------------------------------------------------
/evaluation_dataset/twitter_sentiment/README.md:
--------------------------------------------------------------------------------
# Twitter Japanese Sentiment Analysis Dataset: Evaluation Splits
The [Twitter Japanese Sentiment Analysis Dataset](http://bigdata.naist.jp/~ysuzuki/data/twitter/), with tweet bodies restored and then reformatted for BERT model evaluation.

## Source
* A cleaned and split version of the Twitter Japanese Sentiment Analysis Dataset [芥子+, 2017]
* The following entries were removed:
    * entries annotated with two or more sentiment polarities
    * entries annotated with a sentiment polarity other than `pos`, `neg`, or `neutral`
* The remaining entries were split into `train:dev:test = 8:1:1` (a splitting sketch follows this list)
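
A minimal sketch of this cleaning-and-splitting step; the input file name and the use of pandas are assumptions, and the column names follow the Format section below.

```
import pandas as pd

# Hypothetical input: the restored dataset as a single TSV file.
df = pd.read_csv("restored.tsv", sep="\t")

# Keep only entries with a single polarity among pos/neg/neutral.
df = df[df["label_type"].isin(["pos", "neg", "neutral"])]

# Shuffle, then split train:dev:test = 8:1:1.
df = df.sample(frac=1.0, random_state=0)
n = len(df)
splits = {"train": df[: int(0.8 * n)],
          "dev": df[int(0.8 * n): int(0.9 * n)],
          "test": df[int(0.9 * n):]}
for name, part in splits.items():
    part.to_csv(f"{name}.tsv", sep="\t", index=False)
```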

## Files
* This directory contains the files listed below
* The restored sentiment analysis corpus consists of `train.tsv`, `dev.tsv`, and `test.tsv`

| File name | Format | Description |
|----------------------|---------------------|-------------------|
| train.tsv | TSV (tab-separated) | training split |
| dev.tsv | TSV (tab-separated) | development split |
| test.tsv | TSV (tab-separated) | evaluation split |

## Format
* The format of `[train,dev,test].tsv` is as follows

| Column | Type | Description |
|-------------|----------|-----------------------------------------------------------------------------------------|
| id | str | record number, as assigned in the Twitter Japanese Sentiment Analysis Dataset |
| category_id | str | genre ID; see the [dataset description](http://bigdata.naist.jp/~ysuzuki/data/twitter/) for details |
| status_id | str | tweet ID, the unique identifier of the tweet |
| label_type | str | sentiment polarity label (5 types in total); see the [dataset description](http://bigdata.naist.jp/~ysuzuki/data/twitter/) for details |
| created_at | datetime | posting time of the tweet |
| user_id | str | poster's user ID, the unique identifier of the Twitter user |
| screen_name | str | poster's screen name |
| text | str | tweet body (may contain line breaks) |

End of document.

--------------------------------------------------------------------------------
/images/QR_hottoSNS-bert.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/images/QR_hottoSNS-bert.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | tensorflow >= 1.11.0 # CPU Version of TensorFlow.
2 | # tensorflow-gpu >= 1.11.0 # GPU version of TensorFlow.
3 | regex
4 | sentencepiece==0.1.8
5 |
--------------------------------------------------------------------------------
/src/dataprocessor/__init__.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding:utf-8 -*-
--------------------------------------------------------------------------------
/src/dataprocessor/__pycache__/__init__.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/src/dataprocessor/__pycache__/__init__.cpython-36.pyc
--------------------------------------------------------------------------------
/src/dataprocessor/__pycache__/custom.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/src/dataprocessor/__pycache__/custom.cpython-36.pyc
--------------------------------------------------------------------------------
/src/dataprocessor/__pycache__/preset.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/src/dataprocessor/__pycache__/preset.cpython-36.pyc
--------------------------------------------------------------------------------
/src/dataprocessor/custom.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 |
4 | import os
5 |
6 | from .preset import DataProcessor, InputExample
7 | import tokenization
8 |
9 |
10 | # dataset processor for Twitter日本語評判分析データセット [Suzuki+, 2017]
11 | class PublicTwitterSentimentProcessor(DataProcessor):
12 | """
13 | Processor for the Twitter日本語評判分析データセット .
14 | refer to: http://bigdata.naist.jp/~ysuzuki/data/twitter/
15 |
16 | """
17 |
18 | def get_train_examples(self, data_dir):
19 | """See base class."""
20 | return self._create_examples(
21 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
22 |
23 | def get_dev_examples(self, data_dir):
24 | """See base class."""
25 | return self._create_examples(
26 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
27 |
28 | def get_test_examples(self, data_dir):
29 | """See base class."""
30 | return self._create_examples(
31 | self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
32 |
33 | def get_labels(self):
34 | """See base class."""
35 | return ["pos", "neg", "neutral"]
36 |
37 | def _create_examples(self, lines, set_type):
38 | """Creates examples for the training and dev sets."""
39 | examples = []
40 | for (i, line) in enumerate(lines):
41 | if i == 0:
42 | continue
43 | guid = "%s-%s" % (set_type, i)
44 |       # All splits share the same column layout:
45 |       # tweet text in column 7, polarity label in column 3.
46 |       if set_type in ["train", "dev", "test"]:
47 |         text_a = tokenization.convert_to_unicode(line[7])
48 |         label = tokenization.convert_to_unicode(line[3])
49 |       else:
50 |         raise NotImplementedError(f"unsupported set type: {set_type}")
52 |
53 | if label in self.get_labels():
54 | examples.append(
55 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
56 | return examples
57 |
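58 | # Usage sketch (illustrative, not part of the original file): run from src/
59 | # so that the `tokenization` import resolves, then:
60 | #   processor = PublicTwitterSentimentProcessor()
61 | #   examples = processor.get_test_examples("../evaluation_dataset/twitter_sentiment")
62 | #   print(examples[0].text_a, examples[0].label)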
--------------------------------------------------------------------------------
/src/dataprocessor/preset.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 | import csv
4 | import os
5 |
6 | import tensorflow as tf
7 |
8 | import tokenization
9 |
10 |
11 | class InputExample(object):
12 | """A single training/test example for simple sequence classification."""
13 |
14 | def __init__(self, guid, text_a, text_b=None, label=None):
15 |     """Constructs an InputExample.
16 |
17 | Args:
18 | guid: Unique id for the example.
19 | text_a: string. The untokenized text of the first sequence. For single
20 | sequence tasks, only this sequence must be specified.
21 | text_b: (Optional) string. The untokenized text of the second sequence.
22 |         Must be specified only for sequence pair tasks.
23 | label: (Optional) string. The label of the example. This should be
24 | specified for train and dev examples, but not for test examples.
25 | """
26 | self.guid = guid
27 | self.text_a = text_a
28 | self.text_b = text_b
29 | self.label = label
30 |
31 |
32 | class InputFeatures(object):
33 | """A single set of features of data."""
34 |
35 | def __init__(self, input_ids, input_mask, segment_ids, label_id):
36 | self.input_ids = input_ids
37 | self.input_mask = input_mask
38 | self.segment_ids = segment_ids
39 | self.label_id = label_id
40 |
41 |
42 | class DataProcessor(object):
43 | """Base class for data converters for sequence classification data sets."""
44 |
45 | def get_train_examples(self, data_dir):
46 | """Gets a collection of `InputExample`s for the train set."""
47 | raise NotImplementedError()
48 |
49 | def get_dev_examples(self, data_dir):
50 | """Gets a collection of `InputExample`s for the dev set."""
51 | raise NotImplementedError()
52 |
53 | def get_test_examples(self, data_dir):
54 | """Gets a collection of `InputExample`s for prediction."""
55 | raise NotImplementedError()
56 |
57 | def get_labels(self):
58 | """Gets the list of labels for this data set."""
59 | raise NotImplementedError()
60 |
61 | @classmethod
62 | def _read_tsv(cls, input_file, quotechar=None):
63 | """Reads a tab separated value file."""
64 | with tf.gfile.Open(input_file, "r") as f:
65 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
66 | lines = []
67 | for line in reader:
68 | lines.append(line)
69 | return lines
70 |
71 |
72 | class XnliProcessor(DataProcessor):
73 | """Processor for the XNLI data set."""
74 |
75 | def __init__(self):
76 | self.language = "zh"
77 |
78 | def get_train_examples(self, data_dir):
79 | """See base class."""
80 | lines = self._read_tsv(
81 | os.path.join(data_dir, "multinli",
82 | "multinli.train.%s.tsv" % self.language))
83 | examples = []
84 | for (i, line) in enumerate(lines):
85 | if i == 0:
86 | continue
87 | guid = "train-%d" % (i)
88 | text_a = tokenization.convert_to_unicode(line[0])
89 | text_b = tokenization.convert_to_unicode(line[1])
90 | label = tokenization.convert_to_unicode(line[2])
91 | if label == tokenization.convert_to_unicode("contradictory"):
92 | label = tokenization.convert_to_unicode("contradiction")
93 | examples.append(
94 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
95 | return examples
96 |
97 | def get_dev_examples(self, data_dir):
98 | """See base class."""
99 | lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv"))
100 | examples = []
101 | for (i, line) in enumerate(lines):
102 | if i == 0:
103 | continue
104 | guid = "dev-%d" % (i)
105 | language = tokenization.convert_to_unicode(line[0])
106 | if language != tokenization.convert_to_unicode(self.language):
107 | continue
108 | text_a = tokenization.convert_to_unicode(line[6])
109 | text_b = tokenization.convert_to_unicode(line[7])
110 | label = tokenization.convert_to_unicode(line[1])
111 | examples.append(
112 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
113 | return examples
114 |
115 | def get_labels(self):
116 | """See base class."""
117 | return ["contradiction", "entailment", "neutral"]
118 |
119 |
120 | class MnliProcessor(DataProcessor):
121 | """Processor for the MultiNLI data set (GLUE version)."""
122 |
123 | def get_train_examples(self, data_dir):
124 | """See base class."""
125 | return self._create_examples(
126 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
127 |
128 | def get_dev_examples(self, data_dir):
129 | """See base class."""
130 | return self._create_examples(
131 | self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
132 | "dev_matched")
133 |
134 | def get_test_examples(self, data_dir):
135 | """See base class."""
136 | return self._create_examples(
137 | self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test")
138 |
139 | def get_labels(self):
140 | """See base class."""
141 | return ["contradiction", "entailment", "neutral"]
142 |
143 | def _create_examples(self, lines, set_type):
144 | """Creates examples for the training and dev sets."""
145 | examples = []
146 | for (i, line) in enumerate(lines):
147 | if i == 0:
148 | continue
149 | guid = "%s-%s" % (set_type, tokenization.convert_to_unicode(line[0]))
150 | text_a = tokenization.convert_to_unicode(line[8])
151 | text_b = tokenization.convert_to_unicode(line[9])
152 | if set_type == "test":
153 | label = "contradiction"
154 | else:
155 | label = tokenization.convert_to_unicode(line[-1])
156 | examples.append(
157 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
158 | return examples
159 |
160 |
161 | class MrpcProcessor(DataProcessor):
162 | """Processor for the MRPC data set (GLUE version)."""
163 |
164 | def get_train_examples(self, data_dir):
165 | """See base class."""
166 | return self._create_examples(
167 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
168 |
169 | def get_dev_examples(self, data_dir):
170 | """See base class."""
171 | return self._create_examples(
172 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
173 |
174 | def get_test_examples(self, data_dir):
175 | """See base class."""
176 | return self._create_examples(
177 | self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
178 |
179 | def get_labels(self):
180 | """See base class."""
181 | return ["0", "1"]
182 |
183 | def _create_examples(self, lines, set_type):
184 | """Creates examples for the training and dev sets."""
185 | examples = []
186 | for (i, line) in enumerate(lines):
187 | if i == 0:
188 | continue
189 | guid = "%s-%s" % (set_type, i)
190 | text_a = tokenization.convert_to_unicode(line[3])
191 | text_b = tokenization.convert_to_unicode(line[4])
192 | if set_type == "test":
193 | label = "0"
194 | else:
195 | label = tokenization.convert_to_unicode(line[0])
196 | examples.append(
197 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
198 | return examples
199 |
200 |
201 | class ColaProcessor(DataProcessor):
202 | """Processor for the CoLA data set (GLUE version)."""
203 |
204 | def get_train_examples(self, data_dir):
205 | """See base class."""
206 | return self._create_examples(
207 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
208 |
209 | def get_dev_examples(self, data_dir):
210 | """See base class."""
211 | return self._create_examples(
212 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
213 |
214 | def get_test_examples(self, data_dir):
215 | """See base class."""
216 | return self._create_examples(
217 | self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
218 |
219 | def get_labels(self):
220 | """See base class."""
221 | return ["0", "1"]
222 |
223 | def _create_examples(self, lines, set_type):
224 | """Creates examples for the training and dev sets."""
225 | examples = []
226 | for (i, line) in enumerate(lines):
227 | # Only the test set has a header
228 | if set_type == "test" and i == 0:
229 | continue
230 | guid = "%s-%s" % (set_type, i)
231 | if set_type == "test":
232 | text_a = tokenization.convert_to_unicode(line[1])
233 | label = "0"
234 | else:
235 | text_a = tokenization.convert_to_unicode(line[3])
236 | label = tokenization.convert_to_unicode(line[1])
237 | examples.append(
238 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
239 | return examples
--------------------------------------------------------------------------------
/src/modeling.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """The main BERT model and related functions."""
16 |
17 | from __future__ import absolute_import
18 | from __future__ import division
19 | from __future__ import print_function
20 |
21 | import collections
22 | import copy
23 | import json
24 | import math
25 | import re
26 | import six
27 | import tensorflow as tf
28 |
29 |
30 | class BertConfig(object):
31 | """Configuration for `BertModel`."""
32 |
33 | def __init__(self,
34 | vocab_size,
35 | hidden_size=768,
36 | num_hidden_layers=12,
37 | num_attention_heads=12,
38 | intermediate_size=3072,
39 | hidden_act="gelu",
40 | hidden_dropout_prob=0.1,
41 | attention_probs_dropout_prob=0.1,
42 | max_position_embeddings=512,
43 | type_vocab_size=16,
44 | initializer_range=0.02):
45 | """Constructs BertConfig.
46 |
47 | Args:
48 | vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
49 | hidden_size: Size of the encoder layers and the pooler layer.
50 | num_hidden_layers: Number of hidden layers in the Transformer encoder.
51 | num_attention_heads: Number of attention heads for each attention layer in
52 | the Transformer encoder.
53 | intermediate_size: The size of the "intermediate" (i.e., feed-forward)
54 | layer in the Transformer encoder.
55 | hidden_act: The non-linear activation function (function or string) in the
56 | encoder and pooler.
57 | hidden_dropout_prob: The dropout probability for all fully connected
58 | layers in the embeddings, encoder, and pooler.
59 | attention_probs_dropout_prob: The dropout ratio for the attention
60 | probabilities.
61 | max_position_embeddings: The maximum sequence length that this model might
62 | ever be used with. Typically set this to something large just in case
63 | (e.g., 512 or 1024 or 2048).
64 | type_vocab_size: The vocabulary size of the `token_type_ids` passed into
65 | `BertModel`.
66 | initializer_range: The stdev of the truncated_normal_initializer for
67 | initializing all weight matrices.
68 | """
69 | self.vocab_size = vocab_size
70 | self.hidden_size = hidden_size
71 | self.num_hidden_layers = num_hidden_layers
72 | self.num_attention_heads = num_attention_heads
73 | self.hidden_act = hidden_act
74 | self.intermediate_size = intermediate_size
75 | self.hidden_dropout_prob = hidden_dropout_prob
76 | self.attention_probs_dropout_prob = attention_probs_dropout_prob
77 | self.max_position_embeddings = max_position_embeddings
78 | self.type_vocab_size = type_vocab_size
79 | self.initializer_range = initializer_range
80 |
81 | @classmethod
82 | def from_dict(cls, json_object):
83 | """Constructs a `BertConfig` from a Python dictionary of parameters."""
84 | config = BertConfig(vocab_size=None)
85 | for (key, value) in six.iteritems(json_object):
86 | config.__dict__[key] = value
87 | return config
88 |
89 | @classmethod
90 | def from_json_file(cls, json_file):
91 | """Constructs a `BertConfig` from a json file of parameters."""
92 | with tf.gfile.GFile(json_file, "r") as reader:
93 | text = reader.read()
94 | return cls.from_dict(json.loads(text))
95 |
96 | def to_dict(self):
97 | """Serializes this instance to a Python dictionary."""
98 | output = copy.deepcopy(self.__dict__)
99 | return output
100 |
101 | def to_json_string(self):
102 | """Serializes this instance to a JSON string."""
103 | return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
104 |
105 |
106 | class BertModel(object):
107 |   """BERT model ("Bidirectional Encoder Representations from Transformers").
108 |
109 | Example usage:
110 |
111 | ```python
112 | # Already been converted into WordPiece token ids
113 | input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
114 | input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
115 | token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])
116 |
117 | config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
118 | num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
119 |
120 | model = modeling.BertModel(config=config, is_training=True,
121 | input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)
122 |
123 | label_embeddings = tf.get_variable(...)
124 | pooled_output = model.get_pooled_output()
125 | logits = tf.matmul(pooled_output, label_embeddings)
126 | ...
127 | ```
128 | """
129 |
130 | def __init__(self,
131 | config,
132 | is_training,
133 | input_ids,
134 | input_mask=None,
135 | token_type_ids=None,
136 | use_one_hot_embeddings=True,
137 | scope=None):
138 | """Constructor for BertModel.
139 |
140 | Args:
141 | config: `BertConfig` instance.
142 |       is_training: bool. true for training model, false for eval model. Controls
143 | whether dropout will be applied.
144 | input_ids: int32 Tensor of shape [batch_size, seq_length].
145 | input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
146 | token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
147 | use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
148 | embeddings or tf.embedding_lookup() for the word embeddings. On the TPU,
149 |         it is much faster if this is True, on the CPU or GPU, it is faster if
150 | this is False.
151 | scope: (optional) variable scope. Defaults to "bert".
152 |
153 | Raises:
154 | ValueError: The config is invalid or one of the input tensor shapes
155 | is invalid.
156 | """
157 | config = copy.deepcopy(config)
158 | if not is_training:
159 | config.hidden_dropout_prob = 0.0
160 | config.attention_probs_dropout_prob = 0.0
161 |
162 | input_shape = get_shape_list(input_ids, expected_rank=2)
163 | batch_size = input_shape[0]
164 | seq_length = input_shape[1]
165 |
166 | if input_mask is None:
167 | input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
168 |
169 | if token_type_ids is None:
170 | token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
171 |
172 | with tf.variable_scope(scope, default_name="bert"):
173 | with tf.variable_scope("embeddings"):
174 | # Perform embedding lookup on the word ids.
175 | (self.embedding_output, self.embedding_table) = embedding_lookup(
176 | input_ids=input_ids,
177 | vocab_size=config.vocab_size,
178 | embedding_size=config.hidden_size,
179 | initializer_range=config.initializer_range,
180 | word_embedding_name="word_embeddings",
181 | use_one_hot_embeddings=use_one_hot_embeddings)
182 |
183 | # Add positional embeddings and token type embeddings, then layer
184 | # normalize and perform dropout.
185 | self.embedding_output = embedding_postprocessor(
186 | input_tensor=self.embedding_output,
187 | use_token_type=True,
188 | token_type_ids=token_type_ids,
189 | token_type_vocab_size=config.type_vocab_size,
190 | token_type_embedding_name="token_type_embeddings",
191 | use_position_embeddings=True,
192 | position_embedding_name="position_embeddings",
193 | initializer_range=config.initializer_range,
194 | max_position_embeddings=config.max_position_embeddings,
195 | dropout_prob=config.hidden_dropout_prob)
196 |
197 | with tf.variable_scope("encoder"):
198 | # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
199 | # mask of shape [batch_size, seq_length, seq_length] which is used
200 | # for the attention scores.
201 | attention_mask = create_attention_mask_from_input_mask(
202 | input_ids, input_mask)
203 |
204 | # Run the stacked transformer.
205 | # `sequence_output` shape = [batch_size, seq_length, hidden_size].
206 | self.all_encoder_layers = transformer_model(
207 | input_tensor=self.embedding_output,
208 | attention_mask=attention_mask,
209 | hidden_size=config.hidden_size,
210 | num_hidden_layers=config.num_hidden_layers,
211 | num_attention_heads=config.num_attention_heads,
212 | intermediate_size=config.intermediate_size,
213 | intermediate_act_fn=get_activation(config.hidden_act),
214 | hidden_dropout_prob=config.hidden_dropout_prob,
215 | attention_probs_dropout_prob=config.attention_probs_dropout_prob,
216 | initializer_range=config.initializer_range,
217 | do_return_all_layers=True)
218 |
219 | self.sequence_output = self.all_encoder_layers[-1]
220 | # The "pooler" converts the encoded sequence tensor of shape
221 | # [batch_size, seq_length, hidden_size] to a tensor of shape
222 | # [batch_size, hidden_size]. This is necessary for segment-level
223 | # (or segment-pair-level) classification tasks where we need a fixed
224 | # dimensional representation of the segment.
225 | with tf.variable_scope("pooler"):
226 | # We "pool" the model by simply taking the hidden state corresponding
227 | # to the first token. We assume that this has been pre-trained
228 | first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
229 | self.pooled_output = tf.layers.dense(
230 | first_token_tensor,
231 | config.hidden_size,
232 | activation=tf.tanh,
233 | kernel_initializer=create_initializer(config.initializer_range))
234 |
235 | def get_pooled_output(self):
236 | return self.pooled_output
237 |
238 | def get_sequence_output(self):
239 | """Gets final hidden layer of encoder.
240 |
241 | Returns:
242 | float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
243 | to the final hidden of the transformer encoder.
244 | """
245 | return self.sequence_output
246 |
247 | def get_all_encoder_layers(self):
248 | return self.all_encoder_layers
249 |
250 | def get_embedding_output(self):
251 | """Gets output of the embedding lookup (i.e., input to the transformer).
252 |
253 | Returns:
254 | float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
255 | to the output of the embedding layer, after summing the word
256 | embeddings with the positional embeddings and the token type embeddings,
257 | then performing layer normalization. This is the input to the transformer.
258 | """
259 | return self.embedding_output
260 |
261 | def get_embedding_table(self):
262 | return self.embedding_table
263 |
264 |
265 | def gelu(input_tensor):
266 | """Gaussian Error Linear Unit.
267 |
268 | This is a smoother version of the RELU.
269 | Original paper: https://arxiv.org/abs/1606.08415
270 |
271 | Args:
272 | input_tensor: float Tensor to perform activation.
273 |
274 | Returns:
275 | `input_tensor` with the GELU activation applied.
276 | """
277 | cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0)))
278 | return input_tensor * cdf
279 |
280 |
281 | def get_activation(activation_string):
282 | """Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`.
283 |
284 | Args:
285 | activation_string: String name of the activation function.
286 |
287 | Returns:
288 | A Python function corresponding to the activation function. If
289 | `activation_string` is None, empty, or "linear", this will return None.
290 | If `activation_string` is not a string, it will return `activation_string`.
291 |
292 | Raises:
293 | ValueError: The `activation_string` does not correspond to a known
294 | activation.
295 | """
296 |
297 |   # We assume that anything that's not a string is already an activation
298 | # function, so we just return it.
299 | if not isinstance(activation_string, six.string_types):
300 | return activation_string
301 |
302 | if not activation_string:
303 | return None
304 |
305 | act = activation_string.lower()
306 | if act == "linear":
307 | return None
308 | elif act == "relu":
309 | return tf.nn.relu
310 | elif act == "gelu":
311 | return gelu
312 | elif act == "tanh":
313 | return tf.tanh
314 | else:
315 | raise ValueError("Unsupported activation: %s" % act)
316 |
317 |
318 | def get_assignment_map_from_checkpoint(tvars, init_checkpoint):
319 | """Compute the union of the current variables and checkpoint variables."""
320 | assignment_map = {}
321 | initialized_variable_names = {}
322 |
323 | name_to_variable = collections.OrderedDict()
324 | for var in tvars:
325 | name = var.name
326 | m = re.match("^(.*):\\d+$", name)
327 | if m is not None:
328 | name = m.group(1)
329 | name_to_variable[name] = var
330 |
331 | init_vars = tf.train.list_variables(init_checkpoint)
332 |
333 | assignment_map = collections.OrderedDict()
334 | for x in init_vars:
335 | (name, var) = (x[0], x[1])
336 | if name not in name_to_variable:
337 | continue
338 | assignment_map[name] = name
339 | initialized_variable_names[name] = 1
340 | initialized_variable_names[name + ":0"] = 1
341 |
342 | return (assignment_map, initialized_variable_names)
343 |
344 |
345 | def dropout(input_tensor, dropout_prob):
346 | """Perform dropout.
347 |
348 | Args:
349 | input_tensor: float Tensor.
350 | dropout_prob: Python float. The probability of dropping out a value (NOT of
351 | *keeping* a dimension as in `tf.nn.dropout`).
352 |
353 | Returns:
354 | A version of `input_tensor` with dropout applied.
355 | """
356 | if dropout_prob is None or dropout_prob == 0.0:
357 | return input_tensor
358 |
359 | output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
360 | return output
361 |
362 |
363 | def layer_norm(input_tensor, name=None):
364 | """Run layer normalization on the last dimension of the tensor."""
365 | return tf.contrib.layers.layer_norm(
366 | inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)
367 |
368 |
369 | def layer_norm_and_dropout(input_tensor, dropout_prob, name=None):
370 | """Runs layer normalization followed by dropout."""
371 | output_tensor = layer_norm(input_tensor, name)
372 | output_tensor = dropout(output_tensor, dropout_prob)
373 | return output_tensor
374 |
375 |
376 | def create_initializer(initializer_range=0.02):
377 | """Creates a `truncated_normal_initializer` with the given range."""
378 | return tf.truncated_normal_initializer(stddev=initializer_range)
379 |
380 |
381 | def embedding_lookup(input_ids,
382 | vocab_size,
383 | embedding_size=128,
384 | initializer_range=0.02,
385 | word_embedding_name="word_embeddings",
386 | use_one_hot_embeddings=False):
387 |   """Looks up word embeddings for an id tensor.
388 |
389 | Args:
390 | input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
391 | ids.
392 | vocab_size: int. Size of the embedding vocabulary.
393 | embedding_size: int. Width of the word embeddings.
394 | initializer_range: float. Embedding initialization range.
395 | word_embedding_name: string. Name of the embedding table.
396 | use_one_hot_embeddings: bool. If True, use one-hot method for word
397 | embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
398 | for TPUs.
399 |
400 | Returns:
401 | float Tensor of shape [batch_size, seq_length, embedding_size].
402 | """
403 | # This function assumes that the input is of shape [batch_size, seq_length,
404 | # num_inputs].
405 | #
406 | # If the input is a 2D tensor of shape [batch_size, seq_length], we
407 | # reshape to [batch_size, seq_length, 1].
408 | if input_ids.shape.ndims == 2:
409 | input_ids = tf.expand_dims(input_ids, axis=[-1])
410 |
411 | embedding_table = tf.get_variable(
412 | name=word_embedding_name,
413 | shape=[vocab_size, embedding_size],
414 | initializer=create_initializer(initializer_range))
415 |
416 | if use_one_hot_embeddings:
417 | flat_input_ids = tf.reshape(input_ids, [-1])
418 | one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
419 | output = tf.matmul(one_hot_input_ids, embedding_table)
420 | else:
421 | output = tf.nn.embedding_lookup(embedding_table, input_ids)
422 |
423 | input_shape = get_shape_list(input_ids)
424 |
425 | output = tf.reshape(output,
426 | input_shape[0:-1] + [input_shape[-1] * embedding_size])
427 | return (output, embedding_table)
428 |
429 |
430 | def embedding_postprocessor(input_tensor,
431 | use_token_type=False,
432 | token_type_ids=None,
433 | token_type_vocab_size=16,
434 | token_type_embedding_name="token_type_embeddings",
435 | use_position_embeddings=True,
436 | position_embedding_name="position_embeddings",
437 | initializer_range=0.02,
438 | max_position_embeddings=512,
439 | dropout_prob=0.1):
440 | """Performs various post-processing on a word embedding tensor.
441 |
442 | Args:
443 | input_tensor: float Tensor of shape [batch_size, seq_length,
444 | embedding_size].
445 | use_token_type: bool. Whether to add embeddings for `token_type_ids`.
446 | token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
447 | Must be specified if `use_token_type` is True.
448 | token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
449 | token_type_embedding_name: string. The name of the embedding table variable
450 | for token type ids.
451 | use_position_embeddings: bool. Whether to add position embeddings for the
452 | position of each token in the sequence.
453 | position_embedding_name: string. The name of the embedding table variable
454 | for positional embeddings.
455 | initializer_range: float. Range of the weight initialization.
456 | max_position_embeddings: int. Maximum sequence length that might ever be
457 | used with this model. This can be longer than the sequence length of
458 | input_tensor, but cannot be shorter.
459 | dropout_prob: float. Dropout probability applied to the final output tensor.
460 |
461 | Returns:
462 | float tensor with same shape as `input_tensor`.
463 |
464 | Raises:
465 | ValueError: One of the tensor shapes or input values is invalid.
466 | """
467 | input_shape = get_shape_list(input_tensor, expected_rank=3)
468 | batch_size = input_shape[0]
469 | seq_length = input_shape[1]
470 | width = input_shape[2]
471 |
472 | output = input_tensor
473 |
474 | if use_token_type:
475 | if token_type_ids is None:
476 | raise ValueError("`token_type_ids` must be specified if"
477 | "`use_token_type` is True.")
478 | token_type_table = tf.get_variable(
479 | name=token_type_embedding_name,
480 | shape=[token_type_vocab_size, width],
481 | initializer=create_initializer(initializer_range))
482 | # This vocab will be small so we always do one-hot here, since it is always
483 | # faster for a small vocabulary.
484 | flat_token_type_ids = tf.reshape(token_type_ids, [-1])
485 | one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
486 | token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
487 | token_type_embeddings = tf.reshape(token_type_embeddings,
488 | [batch_size, seq_length, width])
489 | output += token_type_embeddings
490 |
491 | if use_position_embeddings:
492 | assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
493 | with tf.control_dependencies([assert_op]):
494 | full_position_embeddings = tf.get_variable(
495 | name=position_embedding_name,
496 | shape=[max_position_embeddings, width],
497 | initializer=create_initializer(initializer_range))
498 | # Since the position embedding table is a learned variable, we create it
499 | # using a (long) sequence length `max_position_embeddings`. The actual
500 | # sequence length might be shorter than this, for faster training of
501 | # tasks that do not have long sequences.
502 | #
503 | # So `full_position_embeddings` is effectively an embedding table
504 | # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
505 | # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
506 | # perform a slice.
507 | position_embeddings = tf.slice(full_position_embeddings, [0, 0],
508 | [seq_length, -1])
509 | num_dims = len(output.shape.as_list())
510 |
511 | # Only the last two dimensions are relevant (`seq_length` and `width`), so
512 | # we broadcast among the first dimensions, which is typically just
513 | # the batch size.
514 | position_broadcast_shape = []
515 | for _ in range(num_dims - 2):
516 | position_broadcast_shape.append(1)
517 | position_broadcast_shape.extend([seq_length, width])
518 | position_embeddings = tf.reshape(position_embeddings,
519 | position_broadcast_shape)
520 | output += position_embeddings
521 |
522 | output = layer_norm_and_dropout(output, dropout_prob)
523 | return output
524 |
525 |
526 | def create_attention_mask_from_input_mask(from_tensor, to_mask):
527 | """Create 3D attention mask from a 2D tensor mask.
528 |
529 | Args:
530 | from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
531 | to_mask: int32 Tensor of shape [batch_size, to_seq_length].
532 |
533 | Returns:
534 | float Tensor of shape [batch_size, from_seq_length, to_seq_length].
535 | """
536 | from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
537 | batch_size = from_shape[0]
538 | from_seq_length = from_shape[1]
539 |
540 | to_shape = get_shape_list(to_mask, expected_rank=2)
541 | to_seq_length = to_shape[1]
542 |
543 | to_mask = tf.cast(
544 | tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)
545 |
546 | # We don't assume that `from_tensor` is a mask (although it could be). We
547 | # don't actually care if we attend *from* padding tokens (only *to* padding)
548 | # tokens so we create a tensor of all ones.
549 | #
550 | # `broadcast_ones` = [batch_size, from_seq_length, 1]
551 | broadcast_ones = tf.ones(
552 | shape=[batch_size, from_seq_length, 1], dtype=tf.float32)
553 |
554 | # Here we broadcast along two dimensions to create the mask.
555 | mask = broadcast_ones * to_mask
556 |
557 | return mask
558 |
559 |
560 | def attention_layer(from_tensor,
561 | to_tensor,
562 | attention_mask=None,
563 | num_attention_heads=1,
564 | size_per_head=512,
565 | query_act=None,
566 | key_act=None,
567 | value_act=None,
568 | attention_probs_dropout_prob=0.0,
569 | initializer_range=0.02,
570 | do_return_2d_tensor=False,
571 | batch_size=None,
572 | from_seq_length=None,
573 | to_seq_length=None):
574 | """Performs multi-headed attention from `from_tensor` to `to_tensor`.
575 |
576 | This is an implementation of multi-headed attention based on "Attention
577 |   Is All You Need". If `from_tensor` and `to_tensor` are the same, then
578 | this is self-attention. Each timestep in `from_tensor` attends to the
579 | corresponding sequence in `to_tensor`, and returns a fixed-with vector.
580 |
581 | This function first projects `from_tensor` into a "query" tensor and
582 | `to_tensor` into "key" and "value" tensors. These are (effectively) a list
583 | of tensors of length `num_attention_heads`, where each tensor is of shape
584 | [batch_size, seq_length, size_per_head].
585 |
586 | Then, the query and key tensors are dot-producted and scaled. These are
587 | softmaxed to obtain attention probabilities. The value tensors are then
588 | interpolated by these probabilities, then concatenated back to a single
589 | tensor and returned.
590 |
591 |   In practice, multi-headed attention is done with transposes and
592 | reshapes rather than actual separate tensors.
593 |
594 | Args:
595 | from_tensor: float Tensor of shape [batch_size, from_seq_length,
596 | from_width].
597 | to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
598 | attention_mask: (optional) int32 Tensor of shape [batch_size,
599 | from_seq_length, to_seq_length]. The values should be 1 or 0. The
600 | attention scores will effectively be set to -infinity for any positions in
601 | the mask that are 0, and will be unchanged for positions that are 1.
602 | num_attention_heads: int. Number of attention heads.
603 | size_per_head: int. Size of each attention head.
604 | query_act: (optional) Activation function for the query transform.
605 | key_act: (optional) Activation function for the key transform.
606 | value_act: (optional) Activation function for the value transform.
607 | attention_probs_dropout_prob: (optional) float. Dropout probability of the
608 | attention probabilities.
609 | initializer_range: float. Range of the weight initializer.
610 | do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
611 | * from_seq_length, num_attention_heads * size_per_head]. If False, the
612 | output will be of shape [batch_size, from_seq_length, num_attention_heads
613 | * size_per_head].
614 | batch_size: (Optional) int. If the input is 2D, this might be the batch size
615 | of the 3D version of the `from_tensor` and `to_tensor`.
616 | from_seq_length: (Optional) If the input is 2D, this might be the seq length
617 | of the 3D version of the `from_tensor`.
618 | to_seq_length: (Optional) If the input is 2D, this might be the seq length
619 | of the 3D version of the `to_tensor`.
620 |
621 | Returns:
622 | float Tensor of shape [batch_size, from_seq_length,
623 | num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
624 | true, this will be of shape [batch_size * from_seq_length,
625 | num_attention_heads * size_per_head]).
626 |
627 | Raises:
628 | ValueError: Any of the arguments or tensor shapes are invalid.
629 | """
630 |
631 | def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
632 | seq_length, width):
633 | output_tensor = tf.reshape(
634 | input_tensor, [batch_size, seq_length, num_attention_heads, width])
635 |
636 | output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
637 | return output_tensor
638 |
639 | from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
640 | to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])
641 |
642 | if len(from_shape) != len(to_shape):
643 | raise ValueError(
644 | "The rank of `from_tensor` must match the rank of `to_tensor`.")
645 |
646 | if len(from_shape) == 3:
647 | batch_size = from_shape[0]
648 | from_seq_length = from_shape[1]
649 | to_seq_length = to_shape[1]
650 | elif len(from_shape) == 2:
651 | if (batch_size is None or from_seq_length is None or to_seq_length is None):
652 | raise ValueError(
653 | "When passing in rank 2 tensors to attention_layer, the values "
654 | "for `batch_size`, `from_seq_length`, and `to_seq_length` "
655 | "must all be specified.")
656 |
657 | # Scalar dimensions referenced here:
658 | # B = batch size (number of sequences)
659 | # F = `from_tensor` sequence length
660 | # T = `to_tensor` sequence length
661 | # N = `num_attention_heads`
662 | # H = `size_per_head`
663 |
664 | from_tensor_2d = reshape_to_matrix(from_tensor)
665 | to_tensor_2d = reshape_to_matrix(to_tensor)
666 |
667 | # `query_layer` = [B*F, N*H]
668 | query_layer = tf.layers.dense(
669 | from_tensor_2d,
670 | num_attention_heads * size_per_head,
671 | activation=query_act,
672 | name="query",
673 | kernel_initializer=create_initializer(initializer_range))
674 |
675 | # `key_layer` = [B*T, N*H]
676 | key_layer = tf.layers.dense(
677 | to_tensor_2d,
678 | num_attention_heads * size_per_head,
679 | activation=key_act,
680 | name="key",
681 | kernel_initializer=create_initializer(initializer_range))
682 |
683 | # `value_layer` = [B*T, N*H]
684 | value_layer = tf.layers.dense(
685 | to_tensor_2d,
686 | num_attention_heads * size_per_head,
687 | activation=value_act,
688 | name="value",
689 | kernel_initializer=create_initializer(initializer_range))
690 |
691 | # `query_layer` = [B, N, F, H]
692 | query_layer = transpose_for_scores(query_layer, batch_size,
693 | num_attention_heads, from_seq_length,
694 | size_per_head)
695 |
696 | # `key_layer` = [B, N, T, H]
697 | key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
698 | to_seq_length, size_per_head)
699 |
700 | # Take the dot product between "query" and "key" to get the raw
701 | # attention scores.
702 | # `attention_scores` = [B, N, F, T]
703 | attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
704 | attention_scores = tf.multiply(attention_scores,
705 | 1.0 / math.sqrt(float(size_per_head)))
706 |
707 | if attention_mask is not None:
708 | # `attention_mask` = [B, 1, F, T]
709 | attention_mask = tf.expand_dims(attention_mask, axis=[1])
710 |
711 | # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
712 | # masked positions, this operation will create a tensor which is 0.0 for
713 | # positions we want to attend and -10000.0 for masked positions.
714 | adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
715 |
716 | # Since we are adding it to the raw scores before the softmax, this is
717 | # effectively the same as removing these entirely.
718 | attention_scores += adder
719 |
720 | # Normalize the attention scores to probabilities.
721 | # `attention_probs` = [B, N, F, T]
722 | attention_probs = tf.nn.softmax(attention_scores)
723 |
724 | # This is actually dropping out entire tokens to attend to, which might
725 | # seem a bit unusual, but is taken from the original Transformer paper.
726 | attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
727 |
728 | # `value_layer` = [B, T, N, H]
729 | value_layer = tf.reshape(
730 | value_layer,
731 | [batch_size, to_seq_length, num_attention_heads, size_per_head])
732 |
733 | # `value_layer` = [B, N, T, H]
734 | value_layer = tf.transpose(value_layer, [0, 2, 1, 3])
735 |
736 | # `context_layer` = [B, N, F, H]
737 | context_layer = tf.matmul(attention_probs, value_layer)
738 |
739 | # `context_layer` = [B, F, N, H]
740 | context_layer = tf.transpose(context_layer, [0, 2, 1, 3])
741 |
742 | if do_return_2d_tensor:
743 |     # `context_layer` = [B*F, N*H]
744 | context_layer = tf.reshape(
745 | context_layer,
746 | [batch_size * from_seq_length, num_attention_heads * size_per_head])
747 | else:
748 |     # `context_layer` = [B, F, N*H]
749 | context_layer = tf.reshape(
750 | context_layer,
751 | [batch_size, from_seq_length, num_attention_heads * size_per_head])
752 |
753 | return context_layer
754 |
755 |
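As a shape-level sanity check of the transposes above, here is a NumPy sketch of scaled dot-product attention with the B/F/T/N/H dimensions named as in the comments; the learned query/key/value projections are replaced by random tensors for illustration.

```python
import numpy as np

B, F, T, N, H = 2, 4, 4, 3, 8  # batch, from-len, to-len, heads, size per head

q = np.random.randn(B, N, F, H)
k = np.random.randn(B, N, T, H)
v = np.random.randn(B, N, T, H)

scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(H)               # [B, N, F, T]
probs = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax over T
context = probs @ v                                             # [B, N, F, H]
context = context.transpose(0, 2, 1, 3).reshape(B, F, N * H)    # [B, F, N*H]
assert context.shape == (B, F, N * H)
```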
756 | def transformer_model(input_tensor,
757 | attention_mask=None,
758 | hidden_size=768,
759 | num_hidden_layers=12,
760 | num_attention_heads=12,
761 | intermediate_size=3072,
762 | intermediate_act_fn=gelu,
763 | hidden_dropout_prob=0.1,
764 | attention_probs_dropout_prob=0.1,
765 | initializer_range=0.02,
766 | do_return_all_layers=False):
767 | """Multi-headed, multi-layer Transformer from "Attention is All You Need".
768 |
769 | This is almost an exact implementation of the original Transformer encoder.
770 |
771 | See the original paper:
772 | https://arxiv.org/abs/1706.03762
773 |
774 | Also see:
775 | https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
776 |
777 | Args:
778 | input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
779 | attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
780 | seq_length], with 1 for positions that can be attended to and 0 in
781 | positions that should not be.
782 | hidden_size: int. Hidden size of the Transformer.
783 | num_hidden_layers: int. Number of layers (blocks) in the Transformer.
784 | num_attention_heads: int. Number of attention heads in the Transformer.
785 | intermediate_size: int. The size of the "intermediate" (a.k.a., feed
786 | forward) layer.
787 | intermediate_act_fn: function. The non-linear activation function to apply
788 | to the output of the intermediate/feed-forward layer.
789 | hidden_dropout_prob: float. Dropout probability for the hidden layers.
790 | attention_probs_dropout_prob: float. Dropout probability of the attention
791 | probabilities.
792 | initializer_range: float. Range of the initializer (stddev of truncated
793 | normal).
794 | do_return_all_layers: Whether to also return all layers or just the final
795 | layer.
796 |
797 | Returns:
798 | float Tensor of shape [batch_size, seq_length, hidden_size], the final
799 | hidden layer of the Transformer.
800 |
801 | Raises:
802 | ValueError: A Tensor shape or parameter is invalid.
803 | """
804 | if hidden_size % num_attention_heads != 0:
805 | raise ValueError(
806 | "The hidden size (%d) is not a multiple of the number of attention "
807 | "heads (%d)" % (hidden_size, num_attention_heads))
808 |
809 | attention_head_size = int(hidden_size / num_attention_heads)
810 | input_shape = get_shape_list(input_tensor, expected_rank=3)
811 | batch_size = input_shape[0]
812 | seq_length = input_shape[1]
813 | input_width = input_shape[2]
814 |
815 |   # The Transformer adds residual (skip) connections on every layer, so the
816 |   # input width needs to match the hidden size.
817 | if input_width != hidden_size:
818 | raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
819 | (input_width, hidden_size))
820 |
821 | # We keep the representation as a 2D tensor to avoid re-shaping it back and
822 | # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
823 | # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
824 | # help the optimizer.
825 | prev_output = reshape_to_matrix(input_tensor)
826 |
827 | all_layer_outputs = []
828 | for layer_idx in range(num_hidden_layers):
829 | with tf.variable_scope("layer_%d" % layer_idx):
830 | layer_input = prev_output
831 |
832 | with tf.variable_scope("attention"):
833 | attention_heads = []
834 | with tf.variable_scope("self"):
835 | attention_head = attention_layer(
836 | from_tensor=layer_input,
837 | to_tensor=layer_input,
838 | attention_mask=attention_mask,
839 | num_attention_heads=num_attention_heads,
840 | size_per_head=attention_head_size,
841 | attention_probs_dropout_prob=attention_probs_dropout_prob,
842 | initializer_range=initializer_range,
843 | do_return_2d_tensor=True,
844 | batch_size=batch_size,
845 | from_seq_length=seq_length,
846 | to_seq_length=seq_length)
847 | attention_heads.append(attention_head)
848 |
849 | attention_output = None
850 | if len(attention_heads) == 1:
851 | attention_output = attention_heads[0]
852 | else:
853 | # In the case where we have other sequences, we just concatenate
854 | # them to the self-attention head before the projection.
855 | attention_output = tf.concat(attention_heads, axis=-1)
856 |
857 | # Run a linear projection of `hidden_size` then add a residual
858 | # with `layer_input`.
859 | with tf.variable_scope("output"):
860 | attention_output = tf.layers.dense(
861 | attention_output,
862 | hidden_size,
863 | kernel_initializer=create_initializer(initializer_range))
864 | attention_output = dropout(attention_output, hidden_dropout_prob)
865 | attention_output = layer_norm(attention_output + layer_input)
866 |
867 | # The activation is only applied to the "intermediate" hidden layer.
868 | with tf.variable_scope("intermediate"):
869 | intermediate_output = tf.layers.dense(
870 | attention_output,
871 | intermediate_size,
872 | activation=intermediate_act_fn,
873 | kernel_initializer=create_initializer(initializer_range))
874 |
875 | # Down-project back to `hidden_size` then add the residual.
876 | with tf.variable_scope("output"):
877 | layer_output = tf.layers.dense(
878 | intermediate_output,
879 | hidden_size,
880 | kernel_initializer=create_initializer(initializer_range))
881 | layer_output = dropout(layer_output, hidden_dropout_prob)
882 | layer_output = layer_norm(layer_output + attention_output)
883 | prev_output = layer_output
884 | all_layer_outputs.append(layer_output)
885 |
886 | if do_return_all_layers:
887 | final_outputs = []
888 | for layer_output in all_layer_outputs:
889 | final_output = reshape_from_matrix(layer_output, input_shape)
890 | final_outputs.append(final_output)
891 | return final_outputs
892 | else:
893 | final_output = reshape_from_matrix(prev_output, input_shape)
894 | return final_output
895 |
896 |
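Each loop iteration above is a standard post-layer-norm encoder block: attention, residual + LayerNorm, feed-forward, residual + LayerNorm. A compressed NumPy sketch of the residual wiring, with a single random matrix standing in for the whole attention sub-layer, ReLU standing in for gelu, and dropout omitted:

```python
import numpy as np

def layer_norm_np(x, eps=1e-12):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

hidden, intermediate = 8, 32
x = np.random.randn(5, hidden)            # [batch*seq, hidden], as in prev_output
W_att = np.random.randn(hidden, hidden)   # stand-in for attention + its projection
W_in = np.random.randn(hidden, intermediate)
W_out = np.random.randn(intermediate, hidden)

att = layer_norm_np(x @ W_att + x)                              # attention block
out = layer_norm_np(np.maximum(att @ W_in, 0.0) @ W_out + att)  # feed-forward block
assert out.shape == x.shape
```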
897 | def get_shape_list(tensor, expected_rank=None, name=None):
898 | """Returns a list of the shape of tensor, preferring static dimensions.
899 |
900 | Args:
901 | tensor: A tf.Tensor object to find the shape of.
902 |     expected_rank: (optional) int. The expected rank of `tensor`. If this is
903 |       specified and the `tensor` has a different rank, an exception will be
904 |       thrown.
905 | name: Optional name of the tensor for the error message.
906 |
907 | Returns:
908 | A list of dimensions of the shape of tensor. All static dimensions will
909 | be returned as python integers, and dynamic dimensions will be returned
910 | as tf.Tensor scalars.
911 | """
912 | if name is None:
913 | name = tensor.name
914 |
915 | if expected_rank is not None:
916 | assert_rank(tensor, expected_rank, name)
917 |
918 | shape = tensor.shape.as_list()
919 |
920 | non_static_indexes = []
921 | for (index, dim) in enumerate(shape):
922 | if dim is None:
923 | non_static_indexes.append(index)
924 |
925 | if not non_static_indexes:
926 | return shape
927 |
928 | dyn_shape = tf.shape(tensor)
929 | for index in non_static_indexes:
930 | shape[index] = dyn_shape[index]
931 | return shape
932 |
933 |
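A small usage sketch, assuming the TF 1.x API this repository targets: static dimensions come back as Python ints, and the unknown batch dimension comes back as a scalar Tensor.

```python
import tensorflow as tf  # TF 1.x, as used throughout this repository

x = tf.placeholder(tf.float32, shape=[None, 128, 768])
shape = get_shape_list(x, expected_rank=3)

# shape[0] is a tf.Tensor (the dynamic batch size); the rest are ints.
assert shape[1] == 128 and shape[2] == 768
```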
934 | def reshape_to_matrix(input_tensor):
935 | """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix)."""
936 | ndims = input_tensor.shape.ndims
937 | if ndims < 2:
938 | raise ValueError("Input tensor must have at least rank 2. Shape = %s" %
939 | (input_tensor.shape))
940 | if ndims == 2:
941 | return input_tensor
942 |
943 | width = input_tensor.shape[-1]
944 | output_tensor = tf.reshape(input_tensor, [-1, width])
945 | return output_tensor
946 |
947 |
948 | def reshape_from_matrix(output_tensor, orig_shape_list):
949 | """Reshapes a rank 2 tensor back to its original rank >= 2 tensor."""
950 | if len(orig_shape_list) == 2:
951 | return output_tensor
952 |
953 | output_shape = get_shape_list(output_tensor)
954 |
955 | orig_dims = orig_shape_list[0:-1]
956 | width = output_shape[-1]
957 |
958 | return tf.reshape(output_tensor, orig_dims + [width])
959 |
960 |
961 | def assert_rank(tensor, expected_rank, name=None):
962 | """Raises an exception if the tensor rank is not of the expected rank.
963 |
964 | Args:
965 | tensor: A tf.Tensor to check the rank of.
966 | expected_rank: Python integer or list of integers, expected rank.
967 | name: Optional name of the tensor for the error message.
968 |
969 | Raises:
970 | ValueError: If the expected shape doesn't match the actual shape.
971 | """
972 | if name is None:
973 | name = tensor.name
974 |
975 | expected_rank_dict = {}
976 | if isinstance(expected_rank, six.integer_types):
977 | expected_rank_dict[expected_rank] = True
978 | else:
979 | for x in expected_rank:
980 | expected_rank_dict[x] = True
981 |
982 | actual_rank = tensor.shape.ndims
983 | if actual_rank not in expected_rank_dict:
984 | scope_name = tf.get_variable_scope().name
985 | raise ValueError(
986 | "For the tensor `%s` in scope `%s`, the actual rank "
987 | "`%d` (shape = %s) is not equal to the expected rank `%s`" %
988 | (name, scope_name, actual_rank, str(tensor.shape), str(expected_rank)))
989 |
--------------------------------------------------------------------------------
/src/optimization.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """Functions and classes related to optimization (weight updates)."""
16 |
17 | from __future__ import absolute_import
18 | from __future__ import division
19 | from __future__ import print_function
20 |
21 | import re
22 | import tensorflow as tf
23 |
24 |
25 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu):
26 | """Creates an optimizer training op."""
27 | global_step = tf.train.get_or_create_global_step()
28 |
29 | learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
30 |
31 | # Implements linear decay of the learning rate.
32 | learning_rate = tf.train.polynomial_decay(
33 | learning_rate,
34 | global_step,
35 | num_train_steps,
36 | end_learning_rate=0.0,
37 | power=1.0,
38 | cycle=False)
39 |
40 | # Implements linear warmup. I.e., if global_step < num_warmup_steps, the
41 | # learning rate will be `global_step/num_warmup_steps * init_lr`.
42 | if num_warmup_steps:
43 | global_steps_int = tf.cast(global_step, tf.int32)
44 | warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)
45 |
46 | global_steps_float = tf.cast(global_steps_int, tf.float32)
47 | warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
48 |
49 | warmup_percent_done = global_steps_float / warmup_steps_float
50 | warmup_learning_rate = init_lr * warmup_percent_done
51 |
52 | is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
53 | learning_rate = (
54 | (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
55 |
56 |   # It is recommended that you use this optimizer for fine-tuning, since this
57 |   # is how the model was trained (note that the Adam m/v variables are NOT
58 |   # loaded from init_checkpoint).
59 | optimizer = AdamWeightDecayOptimizer(
60 | learning_rate=learning_rate,
61 | weight_decay_rate=0.01,
62 | beta_1=0.9,
63 | beta_2=0.999,
64 | epsilon=1e-6,
65 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
66 |
67 | if use_tpu:
68 | optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
69 |
70 | tvars = tf.trainable_variables()
71 | grads = tf.gradients(loss, tvars)
72 |
73 | # This is how the model was pre-trained.
74 | (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
75 |
76 | train_op = optimizer.apply_gradients(
77 | zip(grads, tvars), global_step=global_step)
78 |
79 | new_global_step = global_step + 1
80 | train_op = tf.group(train_op, [global_step.assign(new_global_step)])
81 | return train_op
82 |
83 |
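The warmup-then-linear-decay schedule built above can be traced in a few lines of plain Python (hyperparameters are invented for illustration):

```python
init_lr, num_train_steps, num_warmup_steps = 5e-5, 1000, 100

def lr_at(step):
    decayed = init_lr * (1.0 - step / num_train_steps)  # polynomial_decay, power=1
    warmup = init_lr * step / num_warmup_steps          # linear warmup
    return warmup if step < num_warmup_steps else decayed

assert abs(lr_at(50) - 2.5e-5) < 1e-12   # halfway through warmup
assert abs(lr_at(500) - 2.5e-5) < 1e-12  # halfway through decay
assert lr_at(1000) == 0.0                # fully decayed at the final step
```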
84 | class AdamWeightDecayOptimizer(tf.train.Optimizer):
85 | """A basic Adam optimizer that includes "correct" L2 weight decay."""
86 |
87 | def __init__(self,
88 | learning_rate,
89 | weight_decay_rate=0.0,
90 | beta_1=0.9,
91 | beta_2=0.999,
92 | epsilon=1e-6,
93 | exclude_from_weight_decay=None,
94 | name="AdamWeightDecayOptimizer"):
95 |     """Constructs an AdamWeightDecayOptimizer."""
96 | super(AdamWeightDecayOptimizer, self).__init__(False, name)
97 |
98 | self.learning_rate = learning_rate
99 | self.weight_decay_rate = weight_decay_rate
100 | self.beta_1 = beta_1
101 | self.beta_2 = beta_2
102 | self.epsilon = epsilon
103 | self.exclude_from_weight_decay = exclude_from_weight_decay
104 |
105 | def apply_gradients(self, grads_and_vars, global_step=None, name=None):
106 | """See base class."""
107 | assignments = []
108 | for (grad, param) in grads_and_vars:
109 | if grad is None or param is None:
110 | continue
111 |
112 | param_name = self._get_variable_name(param.name)
113 |
114 | m = tf.get_variable(
115 | name=param_name + "/adam_m",
116 | shape=param.shape.as_list(),
117 | dtype=tf.float32,
118 | trainable=False,
119 | initializer=tf.zeros_initializer())
120 | v = tf.get_variable(
121 | name=param_name + "/adam_v",
122 | shape=param.shape.as_list(),
123 | dtype=tf.float32,
124 | trainable=False,
125 | initializer=tf.zeros_initializer())
126 |
127 | # Standard Adam update.
128 | next_m = (
129 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
130 | next_v = (
131 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
132 | tf.square(grad)))
133 |
134 | update = next_m / (tf.sqrt(next_v) + self.epsilon)
135 |
136 | # Just adding the square of the weights to the loss function is *not*
137 | # the correct way of using L2 regularization/weight decay with Adam,
138 | # since that will interact with the m and v parameters in strange ways.
139 | #
140 |       # Instead we want to decay the weights in a manner that doesn't interact
141 | # with the m/v parameters. This is equivalent to adding the square
142 | # of the weights to the loss with plain (non-momentum) SGD.
143 | if self._do_use_weight_decay(param_name):
144 | update += self.weight_decay_rate * param
145 |
146 | update_with_lr = self.learning_rate * update
147 |
148 | next_param = param - update_with_lr
149 |
150 | assignments.extend(
151 | [param.assign(next_param),
152 | m.assign(next_m),
153 | v.assign(next_v)])
154 | return tf.group(*assignments, name=name)
155 |
156 | def _do_use_weight_decay(self, param_name):
157 | """Whether to use L2 weight decay for `param_name`."""
158 | if not self.weight_decay_rate:
159 | return False
160 | if self.exclude_from_weight_decay:
161 | for r in self.exclude_from_weight_decay:
162 | if re.search(r, param_name) is not None:
163 | return False
164 | return True
165 |
166 | def _get_variable_name(self, param_name):
167 | """Get the variable name from the tensor name."""
168 | m = re.match("^(.*):\\d+$", param_name)
169 | if m is not None:
170 | param_name = m.group(1)
171 | return param_name
172 |
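The comment block inside `apply_gradients` is the core of decoupled weight decay: the decay term acts on the parameter directly instead of being folded into the loss. A one-parameter NumPy sketch of a single update step (bias correction is omitted, just as in the optimizer above):

```python
import numpy as np

lr, wd, b1, b2, eps = 1e-3, 0.01, 0.9, 0.999, 1e-6
param, m, v = np.float64(0.5), 0.0, 0.0
grad = 0.2

m = b1 * m + (1 - b1) * grad             # first moment
v = b2 * v + (1 - b2) * grad ** 2        # second moment
update = m / (np.sqrt(v) + eps)          # standard Adam direction
update += wd * param                     # decoupled decay: acts on the weight, not the loss
param -= lr * update
```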
--------------------------------------------------------------------------------
/src/preprocess/__init__.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 |
--------------------------------------------------------------------------------
/src/preprocess/__pycache__/__init__.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/src/preprocess/__pycache__/__init__.cpython-36.pyc
--------------------------------------------------------------------------------
/src/preprocess/__pycache__/normalizer.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/src/preprocess/__pycache__/normalizer.cpython-36.pyc
--------------------------------------------------------------------------------
/src/preprocess/normalizer.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 | import os, sys, io
4 |
5 | import unicodedata
6 | import regex
7 | import html
8 |
9 | # String Normalizer Module
10 | # Usage: from normalizer import twitter_normalizer
11 |
12 | ##charFilter: urlFilter
13 | url_regex = "https?://[-_.!~*\'()a-z0-9;/?:@&=+$,%#]+"
14 | url_pattern = regex.compile(url_regex, regex.IGNORECASE)
15 |
16 | ##charFilter: partialurlFilter
17 | partial_url_regex = "(((https|http)(.{1,3})?)|(htt|ht))$"
18 | partial_url_pattern = regex.compile(partial_url_regex, regex.IGNORECASE)
19 |
20 | ##charFilter: retweetflagFilter
21 | rt_regex = "rt (?=\@)"
22 | rt_pattern = regex.compile(rt_regex, regex.IGNORECASE)
23 |
24 | ##charFilter: screennameFilter
25 | scname_regex = "\@[a-z0-9\_]+:?"
26 | scname_pattern = regex.compile(scname_regex, regex.IGNORECASE)
27 |
28 | ##charFilter: truncationFilter
29 | truncation_regex = "…$" # NFKC:"...$"
30 | truncation_pattern = regex.compile(truncation_regex, regex.IGNORECASE)
31 |
32 | ##charFilter: hashtagFilter
33 | hashtag_regex = r"\#\S+"
34 | hashtag_pattern = regex.compile(hashtag_regex, regex.IGNORECASE)
35 |
36 | ##charFilter: whitespaceNormalizer
37 | ws_regex = "\p{Zs}"
38 | ws_pattern = regex.compile(ws_regex, regex.IGNORECASE)
39 |
40 | ##charFilter: controlcodeFilter
41 | cc_regex = "\p{Cc}"
42 | cc_pattern = regex.compile(cc_regex, regex.IGNORECASE)
43 |
44 | ##charFilter: singlequestionFilter
45 | sq_regex = "\?{1,}"
46 | sq_pattern = regex.compile(sq_regex, regex.IGNORECASE)
47 |
48 | SPECIAL_TOKENS = {
49 |     "url":"<url>",
50 |     "screen_name":"<screen_name>"
51 | }
52 |
53 |
54 | def twitter_normalizer(str_):
55 |     # The processing order is crucial.
56 |
57 | #unescape html entities
58 | str_ = html.unescape(str_)
59 | #charFilter: strip
60 | str_ = str_.strip()
61 | #charFilter: truncationFilter
62 | str_ = truncation_pattern.sub("", str_)
63 | #charFilter: icuNormalizer(NKFC)
64 | str_ = unicodedata.normalize('NFKC', str_)
65 | #charFilter: caseNormalizer
66 | str_ = str_.lower()
67 | #charFilter: retweetflagFilter
68 | str_ = rt_pattern.sub("", str_)
69 | ##charFilter: partialurlFilter
70 | str_ = partial_url_pattern.sub("", str_)
71 | ##charFilter: screennameFilter
72 | str_ = scname_pattern.sub(SPECIAL_TOKENS["screen_name"], str_)
73 | ##charFilter: urlFilter
74 | str_ = url_pattern.sub(SPECIAL_TOKENS["url"], str_)
75 | ##charFilter: strip(once again)
76 | str_ = str_.strip()
77 |
78 | return str_
79 |
80 | def question_remover(str_: str):
81 | return sq_pattern.sub(" ", str_)
82 |
83 | def whitespace_normalizer(str_: str):
84 | str_ = ws_pattern.sub(" ", str_)
85 | return str_
86 |
87 | def control_code_remover(str_: str):
88 | str_ = cc_pattern.sub(" ", str_)
89 | return str_
90 |
91 | def twitter_normalizer_for_bert_encoder(str_):
92 |     # Normalizer specialized for the Twitter BERT encoder.
93 |
94 | #unescape html entities
95 | str_ = html.unescape(str_)
96 | #charFilter: question mark
97 | str_ = sq_pattern.sub(" ", str_)
98 | #charFilter: strip
99 | str_ = str_.strip()
100 | #charFilter: truncationFilter
101 | str_ = truncation_pattern.sub("", str_)
102 | #charFilter: icuNormalizer(NKFC)
103 | str_ = unicodedata.normalize('NFKC', str_)
104 | #charFilter: caseNormalizer
105 | # str_ = str_.lower()
106 | #charFilter: retweetflagFilter
107 | str_ = rt_pattern.sub("", str_)
108 | #charFilter: partialurlFilter
109 | str_ = partial_url_pattern.sub("", str_)
110 | #charFilter: screennameFilter
111 | str_ = scname_pattern.sub(SPECIAL_TOKENS["screen_name"], str_)
112 | #charFilter: urlFilter
113 | str_ = url_pattern.sub(SPECIAL_TOKENS["url"], str_)
114 | #charFilter: control code such as newline
115 | str_ = cc_pattern.sub(" ", str_)
116 | #charFilter: strip(once again)
117 | str_ = str_.strip()
118 |
119 | return str_
120 |
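A usage sketch for the BERT-encoder variant, assuming `src/` is on the import path as in `run_classifier.py` (the input tweet is invented for illustration):

```python
from preprocess.normalizer import twitter_normalizer_for_bert_encoder

tweet = "RT @foo_bar: 最高!! https://t.co/abc123\n"
print(twitter_normalizer_for_bert_encoder(tweet))
# The RT flag is dropped, the mention and the URL are replaced with the
# corresponding SPECIAL_TOKENS entries, and NFKC normalization plus
# whitespace/control-code cleanup are applied.
```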
--------------------------------------------------------------------------------
/src/run_classifier.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """BERT finetuning runner."""
16 |
17 | from __future__ import absolute_import
18 | from __future__ import division
19 | from __future__ import print_function
20 |
21 | import collections
22 | import os
23 | import modeling
24 | import optimization
25 | import tokenization
26 | import tensorflow as tf
27 | import numpy as np
28 | from preprocess import normalizer
29 |
30 | from dataprocessor.preset import InputFeatures, XnliProcessor, MnliProcessor, MrpcProcessor, ColaProcessor
31 | from dataprocessor.custom import PublicTwitterSentimentProcessor
32 | import utility
33 |
34 | flags = tf.flags
35 |
36 | FLAGS = flags.FLAGS
37 |
38 | ## Required parameters
39 | flags.DEFINE_string(
40 | "data_dir", None,
41 | "The input data dir. Should contain the .tsv files (or other data files) "
42 | "for the task.")
43 |
44 | flags.DEFINE_string(
45 | "bert_config_file", None,
46 | "The config json file corresponding to the pre-trained BERT model. "
47 | "This specifies the model architecture.")
48 |
49 | flags.DEFINE_string("task_name", None, "The name of the task to train.")
50 |
51 | flags.DEFINE_string("vocab_file", None,
52 | "The vocabulary file that the BERT model was trained on.")
53 |
54 | flags.DEFINE_string(
55 | "output_dir", None,
56 | "The output directory where the model checkpoints will be written.")
57 |
58 | ## Other parameters
59 |
60 | flags.DEFINE_string("normalizer_name", None,
61 | "The name of the normalizer that will be applied to dataset.")
62 |
63 | flags.DEFINE_string("spm_file", None,
64 | "The sentencepiece model file to tokenize texts.")
65 |
66 | flags.DEFINE_string(
67 | "init_checkpoint", None,
68 | "Initial checkpoint (usually from a pre-trained BERT model).")
69 |
70 | flags.DEFINE_bool(
71 | "do_lower_case", True,
72 | "Whether to lower case the input text. Should be True for uncased "
73 | "models and False for cased models.")
74 |
75 | flags.DEFINE_integer(
76 | "max_seq_length", 128,
77 | "The maximum total input sequence length after WordPiece tokenization. "
78 | "Sequences longer than this will be truncated, and sequences shorter "
79 | "than this will be padded.")
80 |
81 | flags.DEFINE_bool("do_train", False, "Whether to run training.")
82 |
83 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
84 |
85 | flags.DEFINE_bool(
86 | "do_predict", False,
87 | "Whether to run the model in inference mode on the test set.")
88 |
89 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.")
90 |
91 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
92 |
93 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
94 |
95 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
96 |
97 | flags.DEFINE_float("num_train_epochs", 3.0,
98 | "Total number of training epochs to perform.")
99 |
100 | flags.DEFINE_float(
101 | "warmup_proportion", 0.1,
102 | "Proportion of training to perform linear learning rate warmup for. "
103 | "E.g., 0.1 = 10% of training.")
104 |
105 | flags.DEFINE_integer("save_checkpoints_steps", 1000,
106 | "How often to save the model checkpoint.")
107 |
108 | flags.DEFINE_integer("iterations_per_loop", 1000,
109 | "How many steps to make in each estimator call.")
110 |
111 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
112 |
113 | tf.flags.DEFINE_string(
114 | "tpu_name", None,
115 | "The Cloud TPU to use for training. This should be either the name "
116 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
117 | "url.")
118 |
119 | tf.flags.DEFINE_string(
120 | "tpu_zone", None,
121 |     "[Optional] GCE zone where the Cloud TPU is located. If not "
122 |     "specified, we will attempt to automatically detect the GCE zone from "
123 | "metadata.")
124 |
125 | tf.flags.DEFINE_string(
126 | "gcp_project", None,
127 | "[Optional] Project name for the Cloud TPU-enabled project. If not "
128 | "specified, we will attempt to automatically detect the GCE project from "
129 | "metadata.")
130 |
131 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
132 |
133 | flags.DEFINE_integer(
134 | "num_tpu_cores", 8,
135 | "Only used if `use_tpu` is True. Total number of TPU cores to use.")
136 |
137 |
138 | def convert_single_example(ex_index, example, label_list, max_seq_length,
139 | tokenizer):
140 | """Converts a single `InputExample` into a single `InputFeatures`."""
141 | label_map = {}
142 | for (i, label) in enumerate(label_list):
143 | label_map[label] = i
144 |
145 | tokens_a = tokenizer.tokenize(example.text_a)
146 | tokens_b = None
147 | if example.text_b:
148 | tokens_b = tokenizer.tokenize(example.text_b)
149 |
150 | if tokens_b:
151 | # Modifies `tokens_a` and `tokens_b` in place so that the total
152 | # length is less than the specified length.
153 | # Account for [CLS], [SEP], [SEP] with "- 3"
154 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
155 | else:
156 | # Account for [CLS] and [SEP] with "- 2"
157 | if len(tokens_a) > max_seq_length - 2:
158 | tokens_a = tokens_a[0:(max_seq_length - 2)]
159 |
160 | # The convention in BERT is:
161 | # (a) For sequence pairs:
162 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
163 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
164 | # (b) For single sequences:
165 | # tokens: [CLS] the dog is hairy . [SEP]
166 | # type_ids: 0 0 0 0 0 0 0
167 | #
168 | # Where "type_ids" are used to indicate whether this is the first
169 | # sequence or the second sequence. The embedding vectors for `type=0` and
170 | # `type=1` were learned during pre-training and are added to the wordpiece
171 | # embedding vector (and position vector). This is not *strictly* necessary
172 | # since the [SEP] token unambiguously separates the sequences, but it makes
173 | # it easier for the model to learn the concept of sequences.
174 | #
175 | # For classification tasks, the first vector (corresponding to [CLS]) is
176 |   # used as the "sentence vector". Note that this only makes sense because
177 | # the entire model is fine-tuned.
178 | tokens = []
179 | segment_ids = []
180 | tokens.append("[CLS]")
181 | segment_ids.append(0)
182 | for token in tokens_a:
183 | tokens.append(token)
184 | segment_ids.append(0)
185 | tokens.append("[SEP]")
186 | segment_ids.append(0)
187 |
188 | if tokens_b:
189 | for token in tokens_b:
190 | tokens.append(token)
191 | segment_ids.append(1)
192 | tokens.append("[SEP]")
193 | segment_ids.append(1)
194 |
195 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
196 |
197 | # The mask has 1 for real tokens and 0 for padding tokens. Only real
198 | # tokens are attended to.
199 | input_mask = [1] * len(input_ids)
200 |
201 | # Zero-pad up to the sequence length.
202 | while len(input_ids) < max_seq_length:
203 | input_ids.append(0)
204 | input_mask.append(0)
205 | segment_ids.append(0)
206 |
207 | assert len(input_ids) == max_seq_length
208 | assert len(input_mask) == max_seq_length
209 | assert len(segment_ids) == max_seq_length
210 |
211 | label_id = label_map[example.label]
212 | if ex_index < 5:
213 | tf.logging.info("*** Example ***")
214 | tf.logging.info("guid: %s" % (example.guid))
215 | tf.logging.info("tokens: %s" % " ".join(
216 | [tokenization.printable_text(x) for x in tokens]))
217 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
218 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
219 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
220 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
221 |
222 | feature = InputFeatures(
223 | input_ids=input_ids,
224 | input_mask=input_mask,
225 | segment_ids=segment_ids,
226 | label_id=label_id)
227 | return feature
228 |
229 |
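For a single-sentence input with `max_seq_length=8`, the construction above yields aligned, padded lists like the following hand-traced sketch (token ids are left symbolic):

```python
tokens      = ["[CLS]", "the", "dog", "is", "hairy", ".", "[SEP]"]
segment_ids = [0, 0, 0, 0, 0, 0, 0]

input_mask  = [1] * len(tokens) + [0]   # 1 for real tokens, 0 for the padding slot
segment_ids = segment_ids + [0]         # padding positions also get segment 0
# input_ids would be tokenizer.convert_tokens_to_ids(tokens) + [0] padding,
# so all three lists end up exactly max_seq_length (= 8) long.
assert len(input_mask) == len(segment_ids) == 8
```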
230 | def file_based_convert_examples_to_features(
231 | examples, label_list, max_seq_length, tokenizer, output_file):
232 | """Convert a set of `InputExample`s to a TFRecord file."""
233 |
234 | writer = tf.python_io.TFRecordWriter(output_file)
235 |
236 | for (ex_index, example) in enumerate(examples):
237 | if ex_index % 10000 == 0:
238 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
239 |
240 | feature = convert_single_example(ex_index, example, label_list,
241 | max_seq_length, tokenizer)
242 |
243 | def create_int_feature(values):
244 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
245 | return f
246 |
247 | features = collections.OrderedDict()
248 | features["input_ids"] = create_int_feature(feature.input_ids)
249 | features["input_mask"] = create_int_feature(feature.input_mask)
250 | features["segment_ids"] = create_int_feature(feature.segment_ids)
251 | features["label_ids"] = create_int_feature([feature.label_id])
252 |
253 | tf_example = tf.train.Example(features=tf.train.Features(feature=features))
254 | writer.write(tf_example.SerializeToString())
255 |
256 |
257 | def file_based_input_fn_builder(input_file, seq_length, is_training,
258 | drop_remainder):
259 | """Creates an `input_fn` closure to be passed to TPUEstimator."""
260 |
261 | name_to_features = {
262 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
263 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
264 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
265 | "label_ids": tf.FixedLenFeature([], tf.int64),
266 | }
267 |
268 | def _decode_record(record, name_to_features):
269 | """Decodes a record to a TensorFlow example."""
270 | example = tf.parse_single_example(record, name_to_features)
271 |
272 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
273 | # So cast all int64 to int32.
274 | for name in list(example.keys()):
275 | t = example[name]
276 | if t.dtype == tf.int64:
277 | t = tf.to_int32(t)
278 | example[name] = t
279 |
280 | return example
281 |
282 | def input_fn(params):
283 | """The actual input function."""
284 | batch_size = params["batch_size"]
285 |
286 | # For training, we want a lot of parallel reading and shuffling.
287 | # For eval, we want no shuffling and parallel reading doesn't matter.
288 | d = tf.data.TFRecordDataset(input_file)
289 | if is_training:
290 | d = d.repeat()
291 | d = d.shuffle(buffer_size=100)
292 |
293 | d = d.apply(
294 | tf.contrib.data.map_and_batch(
295 | lambda record: _decode_record(record, name_to_features),
296 | batch_size=batch_size,
297 | drop_remainder=drop_remainder))
298 |
299 | return d
300 |
301 | return input_fn
302 |
303 |
304 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
305 | """Truncates a sequence pair in place to the maximum length."""
306 |
307 |   # This is a simple heuristic which will always truncate the longer sequence
308 |   # one token at a time. This makes more sense than truncating an equal percent
309 |   # of tokens from each, since if one sequence is very short then each token
310 |   # that's truncated likely carries more information than one from a longer sequence.
311 | while True:
312 | total_length = len(tokens_a) + len(tokens_b)
313 | if total_length <= max_length:
314 | break
315 | if len(tokens_a) > len(tokens_b):
316 | tokens_a.pop()
317 | else:
318 | tokens_b.pop()
319 |
320 |
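A quick trace of the heuristic with invented token lists: only the longer side loses tokens, one at a time, until the pair fits.

```python
tokens_a = list("abcdefg")   # 7 tokens
tokens_b = list("xyz")       # 3 tokens
_truncate_seq_pair(tokens_a, tokens_b, max_length=8)
assert tokens_a == list("abcde") and tokens_b == list("xyz")
```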
321 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
322 | labels, num_labels, use_one_hot_embeddings):
323 | """Creates a classification model."""
324 | model = modeling.BertModel(
325 | config=bert_config,
326 | is_training=is_training,
327 | input_ids=input_ids,
328 | input_mask=input_mask,
329 | token_type_ids=segment_ids,
330 | use_one_hot_embeddings=use_one_hot_embeddings)
331 |
332 | # In the demo, we are doing a simple classification task on the entire
333 | # segment.
334 | #
335 | # If you want to use the token-level output, use model.get_sequence_output()
336 | # instead.
337 | output_layer = model.get_pooled_output()
338 |
339 | hidden_size = output_layer.shape[-1].value
340 |
341 | output_weights = tf.get_variable(
342 | "output_weights", [num_labels, hidden_size],
343 | initializer=tf.truncated_normal_initializer(stddev=0.02))
344 |
345 | output_bias = tf.get_variable(
346 | "output_bias", [num_labels], initializer=tf.zeros_initializer())
347 |
348 | with tf.variable_scope("loss"):
349 | if is_training:
350 | # I.e., 0.1 dropout
351 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
352 |
353 | logits = tf.matmul(output_layer, output_weights, transpose_b=True)
354 | logits = tf.nn.bias_add(logits, output_bias)
355 | probabilities = tf.nn.softmax(logits, axis=-1)
356 | log_probs = tf.nn.log_softmax(logits, axis=-1)
357 |
358 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
359 |
360 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
361 | loss = tf.reduce_mean(per_example_loss)
362 |
363 | return (loss, per_example_loss, logits, probabilities)
364 |
365 |
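The loss block above is plain softmax cross-entropy over the pooled `[CLS]` representation. A NumPy sketch of `per_example_loss` for one 3-class example (logits are invented):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
label = 0

log_probs = logits - np.log(np.exp(logits).sum())   # log-softmax
one_hot = np.eye(3)[label]
per_example_loss = -(one_hot * log_probs).sum()     # == -log_probs[label] ≈ 0.417
```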
366 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
367 | num_train_steps, num_warmup_steps, use_tpu,
368 | use_one_hot_embeddings):
369 | """Returns `model_fn` closure for TPUEstimator."""
370 |
371 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
372 | """The `model_fn` for TPUEstimator."""
373 |
374 | tf.logging.info("*** Features ***")
375 | for name in sorted(features.keys()):
376 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
377 |
378 | input_ids = features["input_ids"]
379 | input_mask = features["input_mask"]
380 | segment_ids = features["segment_ids"]
381 | label_ids = features["label_ids"]
382 |
383 | is_training = (mode == tf.estimator.ModeKeys.TRAIN)
384 |
385 | (total_loss, per_example_loss, logits, probabilities) = create_model(
386 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
387 | num_labels, use_one_hot_embeddings)
388 |
389 | tvars = tf.trainable_variables()
390 | initialized_variable_names = {}
391 | scaffold_fn = None
392 | if init_checkpoint:
393 | (assignment_map, initialized_variable_names
394 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
395 | if use_tpu:
396 |
397 | def tpu_scaffold():
398 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
399 | return tf.train.Scaffold()
400 |
401 | scaffold_fn = tpu_scaffold
402 | else:
403 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
404 |
405 | tf.logging.info("**** Trainable Variables ****")
406 | for var in tvars:
407 | init_string = ""
408 | if var.name in initialized_variable_names:
409 | init_string = ", *INIT_FROM_CKPT*"
410 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
411 | init_string)
412 |
413 | output_spec = None
414 | if mode == tf.estimator.ModeKeys.TRAIN:
415 |
416 | train_op = optimization.create_optimizer(
417 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
418 |
419 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
420 | mode=mode,
421 | loss=total_loss,
422 | train_op=train_op,
423 | scaffold_fn=scaffold_fn)
424 | elif mode == tf.estimator.ModeKeys.EVAL:
425 |
426 | def metric_fn(per_example_loss, label_ids, logits):
427 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
428 | accuracy = tf.metrics.accuracy(label_ids, predictions)
429 | loss = tf.metrics.mean(per_example_loss)
430 | return {
431 | "eval_accuracy": accuracy,
432 | "eval_loss": loss,
433 | }
434 |
435 | eval_metrics = (metric_fn, [per_example_loss, label_ids, logits])
436 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
437 | mode=mode,
438 | loss=total_loss,
439 | eval_metrics=eval_metrics,
440 | scaffold_fn=scaffold_fn)
441 | else:
442 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
443 | mode=mode, predictions=probabilities, scaffold_fn=scaffold_fn)
444 | return output_spec
445 |
446 | return model_fn
447 |
448 |
449 | # This function is not used by this file but is still used by the Colab and
450 | # people who depend on it.
451 | def input_fn_builder(features, seq_length, is_training, drop_remainder):
452 | """Creates an `input_fn` closure to be passed to TPUEstimator."""
453 |
454 | all_input_ids = []
455 | all_input_mask = []
456 | all_segment_ids = []
457 | all_label_ids = []
458 |
459 | for feature in features:
460 | all_input_ids.append(feature.input_ids)
461 | all_input_mask.append(feature.input_mask)
462 | all_segment_ids.append(feature.segment_ids)
463 | all_label_ids.append(feature.label_id)
464 |
465 | def input_fn(params):
466 | """The actual input function."""
467 | batch_size = params["batch_size"]
468 |
469 | num_examples = len(features)
470 |
471 | # This is for demo purposes and does NOT scale to large data sets. We do
472 | # not use Dataset.from_generator() because that uses tf.py_func which is
473 | # not TPU compatible. The right way to load data is with TFRecordReader.
474 | d = tf.data.Dataset.from_tensor_slices({
475 | "input_ids":
476 | tf.constant(
477 | all_input_ids, shape=[num_examples, seq_length],
478 | dtype=tf.int32),
479 | "input_mask":
480 | tf.constant(
481 | all_input_mask,
482 | shape=[num_examples, seq_length],
483 | dtype=tf.int32),
484 | "segment_ids":
485 | tf.constant(
486 | all_segment_ids,
487 | shape=[num_examples, seq_length],
488 | dtype=tf.int32),
489 | "label_ids":
490 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
491 | })
492 |
493 | if is_training:
494 | d = d.repeat()
495 | d = d.shuffle(buffer_size=100)
496 |
497 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
498 | return d
499 |
500 | return input_fn
501 |
502 |
503 | # This function is not used by this file but is still used by the Colab and
504 | # people who depend on it.
505 | def convert_examples_to_features(examples, label_list, max_seq_length,
506 | tokenizer):
507 | """Convert a set of `InputExample`s to a list of `InputFeatures`."""
508 |
509 | features = []
510 | for (ex_index, example) in enumerate(examples):
511 | if ex_index % 10000 == 0:
512 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
513 |
514 | feature = convert_single_example(ex_index, example, label_list,
515 | max_seq_length, tokenizer)
516 |
517 | features.append(feature)
518 | return features
519 |
520 |
521 | def main(_):
522 | tf.logging.set_verbosity(tf.logging.INFO)
523 |
524 | processors = {
525 | "cola": ColaProcessor,
526 | "mnli": MnliProcessor,
527 | "mrpc": MrpcProcessor,
528 | "xnli": XnliProcessor,
529 | "publictwittersentiment": PublicTwitterSentimentProcessor
530 | }
531 |
532 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
533 | raise ValueError(
534 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.")
535 |
536 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
537 |
538 | if FLAGS.max_seq_length > bert_config.max_position_embeddings:
539 | raise ValueError(
540 | "Cannot use sequence length %d because the BERT model "
541 | "was only trained up to sequence length %d" %
542 | (FLAGS.max_seq_length, bert_config.max_position_embeddings))
543 |
544 | tf.gfile.MakeDirs(FLAGS.output_dir)
545 |
546 | task_name = FLAGS.task_name.lower()
547 |
548 | if task_name not in processors:
549 | raise ValueError("Task not found: %s" % (task_name))
550 |
551 | processor = processors[task_name]()
552 |
553 | label_list = processor.get_labels()
554 |
555 |   # If a sentencepiece model is passed, use the tweet tokenizer that is specialized for the Twitter BERT encoder.
556 | _normalizer = None if FLAGS.normalizer_name is None else normalizer.__getattribute__(FLAGS.normalizer_name)
557 | print(f"do_lower_case: {FLAGS.do_lower_case}")
558 | if FLAGS.spm_file is not None:
559 | tokenizer = tokenization.JapaneseTweetTokenizer(
560 | vocab_file=FLAGS.vocab_file, model_file=FLAGS.spm_file, normalizer=_normalizer,
561 | do_lower_case=FLAGS.do_lower_case)
562 |   # Otherwise, use the default tokenizer for English sentences.
563 | else:
564 | tokenizer = tokenization.FullTokenizer(
565 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case, normalizer=_normalizer)
566 |
567 | tpu_cluster_resolver = None
568 | if FLAGS.use_tpu and FLAGS.tpu_name:
569 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
570 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
571 |
572 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
573 | run_config = tf.contrib.tpu.RunConfig(
574 | cluster=tpu_cluster_resolver,
575 | master=FLAGS.master,
576 | model_dir=FLAGS.output_dir,
577 | save_checkpoints_steps=FLAGS.save_checkpoints_steps,
578 | tpu_config=tf.contrib.tpu.TPUConfig(
579 | iterations_per_loop=FLAGS.iterations_per_loop,
580 | num_shards=FLAGS.num_tpu_cores,
581 | per_host_input_for_training=is_per_host))
582 |
583 | train_examples = None
584 | num_train_steps = None
585 | num_warmup_steps = None
586 | if FLAGS.do_train:
587 | train_examples = processor.get_train_examples(FLAGS.data_dir)
588 | num_train_steps = int(
589 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
590 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
591 |
592 | model_fn = model_fn_builder(
593 | bert_config=bert_config,
594 | num_labels=len(label_list),
595 | init_checkpoint=FLAGS.init_checkpoint,
596 | learning_rate=FLAGS.learning_rate,
597 | num_train_steps=num_train_steps,
598 | num_warmup_steps=num_warmup_steps,
599 | use_tpu=FLAGS.use_tpu,
600 | use_one_hot_embeddings=FLAGS.use_tpu)
601 |
602 | # If TPU is not available, this will fall back to normal Estimator on CPU
603 | # or GPU.
604 | estimator = tf.contrib.tpu.TPUEstimator(
605 | use_tpu=FLAGS.use_tpu,
606 | model_fn=model_fn,
607 | config=run_config,
608 | train_batch_size=FLAGS.train_batch_size,
609 | eval_batch_size=FLAGS.eval_batch_size,
610 | predict_batch_size=FLAGS.predict_batch_size)
611 |
612 | if FLAGS.do_train:
613 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
614 | file_based_convert_examples_to_features(
615 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
616 | tf.logging.info("***** Running training *****")
617 | tf.logging.info(" Num examples = %d", len(train_examples))
618 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
619 | tf.logging.info(" Num steps = %d", num_train_steps)
620 | train_input_fn = file_based_input_fn_builder(
621 | input_file=train_file,
622 | seq_length=FLAGS.max_seq_length,
623 | is_training=True,
624 | drop_remainder=True)
625 | with utility.timer("train-time"):
626 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
627 |
628 | if FLAGS.do_eval:
629 | eval_examples = processor.get_dev_examples(FLAGS.data_dir)
630 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record")
631 | file_based_convert_examples_to_features(
632 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
633 |
634 | tf.logging.info("***** Running evaluation *****")
635 | tf.logging.info(" Num examples = %d", len(eval_examples))
636 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
637 |
638 | # This tells the estimator to run through the entire set.
639 | eval_steps = None
640 | # However, if running eval on the TPU, you will need to specify the
641 | # number of steps.
642 | if FLAGS.use_tpu:
643 | # Eval will be slightly WRONG on the TPU because it will truncate
644 | # the last batch.
645 | eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size)
646 |
647 | eval_drop_remainder = True if FLAGS.use_tpu else False
648 | eval_input_fn = file_based_input_fn_builder(
649 | input_file=eval_file,
650 | seq_length=FLAGS.max_seq_length,
651 | is_training=False,
652 | drop_remainder=eval_drop_remainder)
653 |
654 | with utility.timer("dev-time"):
655 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
656 |
657 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
658 | with tf.gfile.GFile(output_eval_file, "w") as writer:
659 | tf.logging.info("***** Eval results *****")
660 | for key in sorted(result.keys()):
661 | tf.logging.info(" %s = %s", key, str(result[key]))
662 | writer.write("%s = %s\n" % (key, str(result[key])))
663 |
664 | if FLAGS.do_predict:
665 | predict_examples = processor.get_test_examples(FLAGS.data_dir)
666 | predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
667 | file_based_convert_examples_to_features(predict_examples, label_list,
668 | FLAGS.max_seq_length, tokenizer,
669 | predict_file)
670 |
671 |     tf.logging.info("***** Running prediction *****")
672 | tf.logging.info(" Num examples = %d", len(predict_examples))
673 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)
674 |
675 | if FLAGS.use_tpu:
676 |       # Warning: according to tpu_estimator.py, prediction on TPU is an
677 |       # experimental feature and hence is not supported here.
678 |       raise ValueError("Prediction on TPU is not supported")
679 |
680 | predict_drop_remainder = True if FLAGS.use_tpu else False
681 | predict_input_fn = file_based_input_fn_builder(
682 | input_file=predict_file,
683 | seq_length=FLAGS.max_seq_length,
684 | is_training=False,
685 | drop_remainder=predict_drop_remainder)
686 |
687 |     result = estimator.predict(input_fn=predict_input_fn)
688 |
689 |     with utility.timer("predict-time"):
690 |       lst_result = list(result)  # materialize predictions inside the timer
691 | # probability
692 | output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
693 | with tf.gfile.GFile(output_predict_file, "w") as writer:
694 | tf.logging.info("***** Predict results *****")
695 | for prediction in lst_result:
696 | output_line = "\t".join(
697 | str(class_probability) for class_probability in prediction) + "\n"
698 | writer.write(output_line)
699 | # predicted label
700 | output_predict_file = os.path.join(FLAGS.output_dir, "test_results_label.tsv")
701 | lst_labels = processor.get_labels()
702 | with tf.gfile.GFile(output_predict_file, "w") as writer:
703 | for prediction in lst_result:
704 | idx = np.argmax(prediction)
705 | output_line = lst_labels[idx] + "\n"
706 | writer.write(output_line)
707 | # ground-truth label
708 | output_ground_truth_file = os.path.join(FLAGS.output_dir, "test_results_ground_truth.tsv")
709 | pred_examples = processor.get_test_examples(FLAGS.data_dir)
710 | with tf.gfile.GFile(output_ground_truth_file, "w") as writer:
711 | for example in pred_examples:
712 | output_line = example.label + "\n"
713 | writer.write(output_line)
714 |
715 |
716 |
717 | if __name__ == "__main__":
718 | flags.mark_flag_as_required("data_dir")
719 | flags.mark_flag_as_required("task_name")
720 | flags.mark_flag_as_required("vocab_file")
721 | flags.mark_flag_as_required("bert_config_file")
722 | flags.mark_flag_as_required("output_dir")
723 | tf.app.run()
724 |
--------------------------------------------------------------------------------
/src/run_classifier.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 |
3 | MODEL_DIR=../trained_model/masked_lm_only_L-12_H-768_A-12
4 | EVAL_DIR=../evaluation_dataset
5 |
6 |
7 | # fine-tune and evaluate using Twitter日本語評判分析データセット[Suzuki, 2017]
8 | ## hottoSNS-BERT
9 | python run_classifier.py \
10 | --task_name=PublicTwitterSentiment \
11 | --do_train=true \
12 | --do_eval=true \
13 | --do_predict=true \
14 | --data_dir=$EVAL_DIR/twitter_sentiment \
15 | --vocab_file=$MODEL_DIR/tokenizer_spm_32K.vocab.to.bert \
16 | --spm_file=$MODEL_DIR/tokenizer_spm_32K.model \
17 | --bert_config_file=$MODEL_DIR/bert_config.json \
18 | --init_checkpoint=$MODEL_DIR/model.ckpt-1000000 \
19 | --max_seq_length=64 \
20 | --train_batch_size=32 \
21 | --learning_rate=2e-5 \
22 | --num_train_epochs=3.0 \
23 | --output_dir=./eval_hottoSNS/
24 |
25 |
26 |
27 | MODEL_DIR=../trained_model/wikipedia_ja_L-12_H-768_A-12
28 | EVAL_DIR=../evaluation_dataset
29 |
30 |
31 | ## Wikipedia JP
32 | python run_classifier.py \
33 | --task_name=PublicTwitterSentiment \
34 | --do_train=true \
35 | --do_eval=true \
36 | --do_predict=true \
37 | --data_dir=$EVAL_DIR/twitter_sentiment \
38 | --vocab_file=$MODEL_DIR/wiki-ja.vocab.to.bert \
39 | --spm_file=$MODEL_DIR/wiki-ja.model \
40 | --bert_config_file=$MODEL_DIR/bert_config.json \
41 | --init_checkpoint=$MODEL_DIR/model.ckpt-1400000 \
42 | --max_seq_length=128 \
43 | --train_batch_size=32 \
44 | --learning_rate=2e-5 \
45 | --num_train_epochs=3.0 \
46 | --output_dir=./eval_wikija/
47 |
48 |
49 |
50 | ## MultiLingual Model
51 | MODEL_DIR=../trained_model/multi_cased_L-12_H-768_A-12
52 | EVAL_DIR=../evaluation_dataset
53 |
54 |
55 | python run_classifier.py \
56 | --task_name=PublicTwitterSentiment \
57 | --do_train=true \
58 | --do_eval=true \
59 | --do_predict=true \
60 | --do_lower_case=false \
61 | --data_dir=$EVAL_DIR/twitter_sentiment \
62 | --vocab_file=$MODEL_DIR/vocab.txt \
63 | --bert_config_file=$MODEL_DIR/bert_config.json \
64 | --init_checkpoint=$MODEL_DIR/bert_model.ckpt \
65 | --max_seq_length=128 \
66 | --train_batch_size=32 \
67 | --learning_rate=2e-5 \
68 | --num_train_epochs=3.0 \
69 | --output_dir=./eval_multi/
70 |
--------------------------------------------------------------------------------
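Note that `trained_model/` ships only `.gitkeep` placeholders, so the checkpoints must be obtained before running the script: the hottoSNS-BERT and Japanese-Wikipedia checkpoints are distributed by the authors and must be requested separately, while the multilingual checkpoint is Google's public release. A sketch for fetching the public one (URL from the google-research/bert README), assuming it is run from `src/`:

```python
import io
import urllib.request
import zipfile

# Official multilingual cased BERT-Base checkpoint from google-research/bert.
URL = ("https://storage.googleapis.com/bert_models/2018_11_23/"
       "multi_cased_L-12_H-768_A-12.zip")

with urllib.request.urlopen(URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))
# The zip contains a top-level multi_cased_L-12_H-768_A-12/ directory.
archive.extractall("../trained_model")
```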
/src/tokenization.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """Tokenization classes."""
16 |
17 | from __future__ import absolute_import
18 | from __future__ import division
19 | from __future__ import print_function
20 |
21 | import collections
22 | import unicodedata
23 | import six
24 | import tensorflow as tf
25 | import sentencepiece as spm
26 | from distutils.version import LooseVersion
27 |
28 |
29 | def convert_to_unicode(text):
30 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
31 | if six.PY3:
32 | if isinstance(text, str):
33 | return text
34 | elif isinstance(text, bytes):
35 | return text.decode("utf-8", "ignore")
36 | else:
37 | raise ValueError("Unsupported string type: %s" % (type(text)))
38 | elif six.PY2:
39 | if isinstance(text, str):
40 | return text.decode("utf-8", "ignore")
41 | elif isinstance(text, unicode):
42 | return text
43 | else:
44 | raise ValueError("Unsupported string type: %s" % (type(text)))
45 | else:
46 |     raise ValueError("Not running on Python 2 or Python 3?")
47 |
48 |
49 | def printable_text(text):
50 | """Returns text encoded in a way suitable for print or `tf.logging`."""
51 |
52 | # These functions want `str` for both Python2 and Python3, but in one case
53 | # it's a Unicode string and in the other it's a byte string.
54 | if six.PY3:
55 | if isinstance(text, str):
56 | return text
57 | elif isinstance(text, bytes):
58 | return text.decode("utf-8", "ignore")
59 | else:
60 | raise ValueError("Unsupported string type: %s" % (type(text)))
61 | elif six.PY2:
62 | if isinstance(text, str):
63 | return text
64 | elif isinstance(text, unicode):
65 | return text.encode("utf-8")
66 | else:
67 | raise ValueError("Unsupported string type: %s" % (type(text)))
68 | else:
69 |     raise ValueError("Not running on Python 2 or Python 3?")
70 |
71 |
72 | def load_vocab(vocab_file):
73 | """Loads a vocabulary file into a dictionary."""
74 | vocab = collections.OrderedDict()
75 | index = 0
76 | # switch depending on the tf major version
77 | is_tensorflow_ver_2 = LooseVersion(tf.__version__) >= LooseVersion("2.0.0")
78 | if is_tensorflow_ver_2:
79 | GFile = tf.io.gfile.GFile
80 | else:
81 | GFile = tf.gfile.GFile
82 |
83 | with GFile(vocab_file, "r") as reader:
84 | while True:
85 | token = convert_to_unicode(reader.readline())
86 | if not token:
87 | break
88 | token = token.strip()
89 | vocab[token] = index
90 | index += 1
91 | return vocab
92 |
93 |
94 | def convert_by_vocab(vocab, items):
95 | """Converts a sequence of [tokens|ids] using the vocab."""
96 | output = []
97 | for item in items:
98 | output.append(vocab[item])
99 | return output
100 |
101 |
102 | def convert_tokens_to_ids(vocab, tokens):
103 | return convert_by_vocab(vocab, tokens)
104 |
105 |
106 | def convert_ids_to_tokens(inv_vocab, ids):
107 | return convert_by_vocab(inv_vocab, ids)
108 |
109 |
110 | def whitespace_tokenize(text):
111 | """Runs basic whitespace cleaning and splitting on a piece of text."""
112 | text = text.strip()
113 | if not text:
114 | return []
115 | tokens = text.split()
116 | return tokens
117 |
118 |
119 | class FullTokenizer(object):
120 | """Runs end-to-end tokenziation."""
121 |
122 | def __init__(self, vocab_file, normalizer=None, do_lower_case=True):
123 | self.vocab = load_vocab(vocab_file)
124 | self.inv_vocab = {v: k for k, v in self.vocab.items()}
125 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
126 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
127 | self.normalizer = normalizer
128 |
129 | def tokenize(self, text):
130 | split_tokens = []
131 | if self.normalizer is not None:
132 | text = self.normalizer(text)
133 | for token in self.basic_tokenizer.tokenize(text):
134 | for sub_token in self.wordpiece_tokenizer.tokenize(token):
135 | split_tokens.append(sub_token)
136 |
137 | return split_tokens
138 |
139 | def convert_tokens_to_ids(self, tokens):
140 | return convert_by_vocab(self.vocab, tokens)
141 |
142 | def convert_ids_to_tokens(self, ids):
143 | return convert_by_vocab(self.inv_vocab, ids)
144 |
145 |
146 | class BasicTokenizer(object):
147 | """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
148 |
149 | def __init__(self, do_lower_case=True):
150 | """Constructs a BasicTokenizer.
151 |
152 | Args:
153 | do_lower_case: Whether to lower case the input.
154 | """
155 | self.do_lower_case = do_lower_case
156 |
157 | def tokenize(self, text):
158 | """Tokenizes a piece of text."""
159 | text = convert_to_unicode(text)
160 | text = self._clean_text(text)
161 |
162 | # This was added on November 1st, 2018 for the multilingual and Chinese
163 | # models. This is also applied to the English models now, but it doesn't
164 | # matter since the English models were not trained on any Chinese data
165 | # and generally don't have any Chinese data in them (there are Chinese
166 | # characters in the vocabulary because Wikipedia does have some Chinese
167 |     # words in the English Wikipedia).
168 | text = self._tokenize_chinese_chars(text)
169 |
170 | orig_tokens = whitespace_tokenize(text)
171 | split_tokens = []
172 | for token in orig_tokens:
173 | if self.do_lower_case:
174 | token = token.lower()
175 | token = self._run_strip_accents(token)
176 | split_tokens.extend(self._run_split_on_punc(token))
177 |
178 | output_tokens = whitespace_tokenize(" ".join(split_tokens))
179 | return output_tokens
180 |
181 | def _run_strip_accents(self, text):
182 | """Strips accents from a piece of text."""
183 | text = unicodedata.normalize("NFD", text)
184 | output = []
185 | for char in text:
186 | cat = unicodedata.category(char)
187 | if cat == "Mn":
188 | continue
189 | output.append(char)
190 | return "".join(output)
191 |
192 | def _run_split_on_punc(self, text):
193 | """Splits punctuation on a piece of text."""
194 | chars = list(text)
195 | i = 0
196 | start_new_word = True
197 | output = []
198 | while i < len(chars):
199 | char = chars[i]
200 | if _is_punctuation(char):
201 | output.append([char])
202 | start_new_word = True
203 | else:
204 | if start_new_word:
205 | output.append([])
206 | start_new_word = False
207 | output[-1].append(char)
208 | i += 1
209 |
210 | return ["".join(x) for x in output]
211 |
212 | def _tokenize_chinese_chars(self, text):
213 | """Adds whitespace around any CJK character."""
214 | output = []
215 | for char in text:
216 | cp = ord(char)
217 | if self._is_chinese_char(cp):
218 | output.append(" ")
219 | output.append(char)
220 | output.append(" ")
221 | else:
222 | output.append(char)
223 | return "".join(output)
224 |
225 | def _is_chinese_char(self, cp):
226 | """Checks whether CP is the codepoint of a CJK character."""
227 | # This defines a "chinese character" as anything in the CJK Unicode block:
228 | # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
229 | #
230 | # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
231 | # despite its name. The modern Korean Hangul alphabet is a different block,
232 | # as is Japanese Hiragana and Katakana. Those alphabets are used to write
233 | # space-separated words, so they are not treated specially and handled
234 |     # like all of the other languages.
235 | if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
236 | (cp >= 0x3400 and cp <= 0x4DBF) or #
237 | (cp >= 0x20000 and cp <= 0x2A6DF) or #
238 | (cp >= 0x2A700 and cp <= 0x2B73F) or #
239 | (cp >= 0x2B740 and cp <= 0x2B81F) or #
240 | (cp >= 0x2B820 and cp <= 0x2CEAF) or
241 | (cp >= 0xF900 and cp <= 0xFAFF) or #
242 | (cp >= 0x2F800 and cp <= 0x2FA1F)): #
243 | return True
244 |
245 | return False
246 |
247 | def _clean_text(self, text):
248 | """Performs invalid character removal and whitespace cleanup on text."""
249 | output = []
250 | for char in text:
251 | cp = ord(char)
252 | if cp == 0 or cp == 0xfffd or _is_control(char):
253 | continue
254 | if _is_whitespace(char):
255 | output.append(" ")
256 | else:
257 | output.append(char)
258 | return "".join(output)
259 |
260 |
261 | class WordpieceTokenizer(object):
262 | """Runs WordPiece tokenziation."""
263 |
264 | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
265 | self.vocab = vocab
266 | self.unk_token = unk_token
267 | self.max_input_chars_per_word = max_input_chars_per_word
268 |
269 | def tokenize(self, text):
270 | """Tokenizes a piece of text into its word pieces.
271 |
272 | This uses a greedy longest-match-first algorithm to perform tokenization
273 | using the given vocabulary.
274 |
275 | For example:
276 | input = "unaffable"
277 | output = ["un", "##aff", "##able"]
278 |
279 | Args:
280 | text: A single token or whitespace separated tokens. This should have
281 |         already been passed through `BasicTokenizer`.
282 |
283 | Returns:
284 | A list of wordpiece tokens.
285 | """
286 |
287 | text = convert_to_unicode(text)
288 |
289 | output_tokens = []
290 | for token in whitespace_tokenize(text):
291 | chars = list(token)
292 | if len(chars) > self.max_input_chars_per_word:
293 | output_tokens.append(self.unk_token)
294 | continue
295 |
296 | is_bad = False
297 | start = 0
298 | sub_tokens = []
299 | while start < len(chars):
300 | end = len(chars)
301 | cur_substr = None
302 | while start < end:
303 | substr = "".join(chars[start:end])
304 | if start > 0:
305 | substr = "##" + substr
306 | if substr in self.vocab:
307 | cur_substr = substr
308 | break
309 | end -= 1
310 | if cur_substr is None:
311 | is_bad = True
312 | break
313 | sub_tokens.append(cur_substr)
314 | start = end
315 |
316 | if is_bad:
317 | output_tokens.append(self.unk_token)
318 | else:
319 | output_tokens.extend(sub_tokens)
320 | return output_tokens
321 |
322 |
323 | class JapaneseTweetTokenizer(object):
324 | """Runs end-to-end tokenziation."""
325 |
326 | def __init__(self, vocab_file, model_file, normalizer=None, do_lower_case=True):
327 | self.vocab = load_vocab(vocab_file)
328 | self.inv_vocab = {v: k for k, v in self.vocab.items()}
329 | self.model = spm.SentencePieceProcessor()
330 | self.model.Load(model_file)
331 | self.do_lower_case = do_lower_case
332 | self.normalizer = normalizer
333 |
334 | def tokenize(self, text):
335 | if self.do_lower_case:
336 | text = text.lower()
337 | if self.normalizer is not None:
338 | text = self.normalizer(text)
339 |
340 | split_tokens = self.model.EncodeAsPieces(text)
341 | return split_tokens
342 |
343 | def convert_tokens_to_ids(self, tokens):
344 | return self._convert_by_vocab(self.vocab, tokens)
345 |
346 | def convert_ids_to_tokens(self, ids):
347 | return self._convert_by_vocab(self.inv_vocab, ids)
348 |
349 | def _convert_by_vocab(self, vocab, items):
350 | """Converts a sequence of [tokens|ids] using the vocab."""
351 | output = []
352 | for item in items:
353 |       output.append(vocab.get(item, vocab.get("[UNK]", "[UNK]")))  # OOV tokens fall back to [UNK] (assumes the vocab contains it)
354 | return output
355 |
356 | def _is_whitespace(char):
357 | """Checks whether `chars` is a whitespace character."""
358 | # \t, \n, and \r are technically contorl characters but we treat them
359 | # as whitespace since they are generally considered as such.
360 | if char == " " or char == "\t" or char == "\n" or char == "\r":
361 | return True
362 | cat = unicodedata.category(char)
363 | if cat == "Zs":
364 | return True
365 | return False
366 |
367 |
368 | def _is_control(char):
369 | """Checks whether `chars` is a control character."""
370 | # These are technically control characters but we count them as whitespace
371 | # characters.
372 | if char == "\t" or char == "\n" or char == "\r":
373 | return False
374 | cat = unicodedata.category(char)
375 | if cat.startswith("C"):
376 | return True
377 | return False
378 |
379 |
380 | def _is_punctuation(char):
381 | """Checks whether `chars` is a punctuation character."""
382 | cp = ord(char)
383 | # We treat all non-letter/number ASCII as punctuation.
384 | # Characters such as "^", "$", and "`" are not in the Unicode
385 | # Punctuation class but we treat them as punctuation anyways, for
386 | # consistency.
387 | if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
388 | (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
389 | return True
390 | cat = unicodedata.category(char)
391 | if cat.startswith("P"):
392 | return True
393 | return False
394 |
--------------------------------------------------------------------------------
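Two quick checks of the tokenizers above, run from `src/`. The first reproduces the greedy longest-match-first WordPiece example from the `WordpieceTokenizer` docstring with a toy in-memory vocab; the second is a sketch of `JapaneseTweetTokenizer`, assuming the hottoSNS-BERT files named in `run_classifier.sh` have been placed under `trained_model/`:

```python
from tokenization import JapaneseTweetTokenizer, WordpieceTokenizer

# Greedy longest-match-first WordPiece on a toy vocabulary.
vocab = {"un": 0, "##aff": 1, "##able": 2, "[UNK]": 3}
wp = WordpieceTokenizer(vocab=vocab)
print(wp.tokenize("unaffable"))  # ['un', '##aff', '##able']
print(wp.tokenize("xyzzy"))      # ['[UNK]'] -- no piece matches

# SentencePiece-based tokenizer; paths as in run_classifier.sh.
model_dir = "../trained_model/masked_lm_only_L-12_H-768_A-12"
tokenizer = JapaneseTweetTokenizer(
    vocab_file=model_dir + "/tokenizer_spm_32K.vocab.to.bert",
    model_file=model_dir + "/tokenizer_spm_32K.model")
tokens = tokenizer.tokenize("今日はいい天気ですね")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))
```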
/src/utility.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 | from __future__ import absolute_import
4 | from __future__ import unicode_literals
5 | from __future__ import division
6 |
7 | from contextlib import contextmanager
8 | import time
9 | @contextmanager
10 | def timer(name):
11 | t0 = time.time()
12 | yield
13 |     print('[{}] done in {:1.4f} s'.format(name, time.time() - t0))  # .format (not f-string) keeps Python 2 compatibility, like the six-based code in src/
--------------------------------------------------------------------------------
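Usage sketch for the `timer` context manager; this is how `run_classifier.py` wraps its predict step above:

```python
import time

from utility import timer

with timer("predict-time"):
    time.sleep(0.25)  # stand-in for real work
# prints: [predict-time] done in 0.25xx s
```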
/trained_model/masked_lm_only_L-12_H-768_A-12/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/trained_model/masked_lm_only_L-12_H-768_A-12/.gitkeep
--------------------------------------------------------------------------------
/trained_model/multi_cased_L-12_H-768_A-12/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/trained_model/multi_cased_L-12_H-768_A-12/.gitkeep
--------------------------------------------------------------------------------
/trained_model/wikipedia_ja_L-12_H-768_A-12/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/trained_model/wikipedia_ja_L-12_H-768_A-12/.gitkeep
--------------------------------------------------------------------------------