├── LICENSE.md ├── README.md ├── evaluation_dataset └── twitter_sentiment │ ├── README.md │ ├── dev.tsv │ ├── test.tsv │ └── train.tsv ├── images └── QR_hottoSNS-bert.png ├── requirements.txt ├── src ├── dataprocessor │ ├── __init__.py │ ├── __pycache__ │ │ ├── __init__.cpython-36.pyc │ │ ├── custom.cpython-36.pyc │ │ └── preset.cpython-36.pyc │ ├── custom.py │ └── preset.py ├── modeling.py ├── optimization.py ├── preprocess │ ├── __init__.py │ ├── __pycache__ │ │ ├── __init__.cpython-36.pyc │ │ └── normalizer.cpython-36.pyc │ └── normalizer.py ├── run_classifier.py ├── run_classifier.sh ├── tokenization.py └── utility.py └── trained_model ├── masked_lm_only_L-12_H-768_A-12 └── .gitkeep ├── multi_cased_L-12_H-768_A-12 └── .gitkeep └── wikipedia_ja_L-12_H-768_A-12 └── .gitkeep /LICENSE.md: -------------------------------------------------------------------------------- 1 | ``` 2 | 第1条(定義) 3 | 本契約で用いられる用語の意味は、以下に定めるとおりとする。 4 | (1)「本規約」とは、本利用規約をいう。 5 | (2)「甲」とは、 株式会社ホットリンク(以下「甲」という)をいう。 6 | (3)「乙」とは、本規約に同意し、甲の承認を得て、甲が配布する文分散表現データを利用する個人をいう。 7 | (4)「本データ」とは、甲が作成した文分散表現データおよびそれに付随する全部をいう。 8 | 9 | 10 | 第2条(利用許諾) 11 | 甲は、乙が本規約に従って本データを利用することを非独占的に許諾する。なお、甲及び乙は、本規約に明示的に定める以外に、乙に本データに関していかなる権利も付与するものではないことを確認する。 12 | 13 | 14 | 第4条(許諾の条件) 15 | 甲が乙に本データの利用を許諾する条件は、以下の通りとする。 16 | (1)利用目的: 日本語に関する学術研究・産業研究(以下「本研究」という)を遂行するため。 17 | (2)利用の範囲: 乙及び乙が所属する研究グループ 18 | (3)利用方法: 本研究のために本データを乙が管理するコンピューター端末またはサーバーに複製し、本データを分析・研究しデータベース等に保存した解析データ(以下「本解析データ」という)を得る。 19 | 20 | 21 | 第5条(利用申込) 22 | 1.乙は、甲が指定するウェブ上の入力フォーム(以下、入力フォーム)を通じて、乙の名前や所属、連絡先等、甲が指定する項目を甲に送信し、本データの利用について甲の承認を得るものとする。 なお、甲が承認しなかった場合、甲はその理由を開示する義務を負わない。 23 | 2.前項に基づき甲に申告した内容に変更が生じる場合、乙は遅滞なくこれを甲に報告し、改めて甲の承認を得るものとする。 24 | 3.乙が入力フォームを送信した時点で、乙は本規約に同意したものとみなされる。 25 | 26 | 第6条(禁止事項) 27 | 乙は、本データの利用にあたり、以下に定める行為をしてはならない。 28 | (1)本データ及びその複製物(それらを復元できるデータを含む)を譲渡、貸与、販売すること。また、書面による甲の事前許諾なくこれらを配布、公衆送信、刊行物に転載するなど前項に定める範囲を超えて利用し、甲または第三者の権利を侵害すること。 29 | (2)本データを用いて甲又は第三者の名誉を毀損し、あるいはプライバシーを侵害するなどの権利侵害を行うこと。 30 | (3)乙及び乙が所属する研究グループ以外の第三者に本データを利用させること。 31 | (4)本規約で明示的に許諾された目的及び手段以外にデータを利用 すること。 32 | 33 | 第7条(対価) 34 | 本規約に基づく本データの利用許諾の対価は発生しない。 35 | 36 | 第8条(公表) 37 | 1.乙は、学術研究の目的に限り、本データを使用して得られた研究成果や知見を公表することができる。これらの公表には、本解析データや処理プログラムの公表を含む。 38 | 2.乙は、公表にあたっては、本データをもとにした成果であることを明記し、成果の公表の前にその概要を書面やメール等で甲に報告する。 39 | 3.乙は、論文発表の際も、本データを利用した旨を明記し、提出先の学会、発表年月日を所定のフォームから甲に提出するものとする。 40 | 41 | 42 | 43 | 第9条(乙の責任) 44 | 1.乙は、本データをダウンロードする為に必要な通信機器やソフトウェア、通信回線等の全てを乙の責任と費用で準備し、操作、接続等をする。 45 | 2.乙は、本データを本研究の遂行のみに使用する。 46 | 3.乙は、本データが漏洩しないよう善良な管理者の注意義務をもって管理し、乙のコンピューター端末等に適切な対策を施すものとする。 47 | 4.乙が、本研究を乙が所属するグループのメンバーと共同で遂行する場合、乙は、本規約の内容を当該グループの他のメンバーに遵守させるものとし、万一、当該他のメンバーが本規約に違反し甲又は第三者に損害を与えた場合は、乙はこれを自らの行為として連帯して責任を負うものとする。 48 | 5.甲が必要と判断する場合、乙に対して、本データの利用状況の開示を求めることができるものとし、乙はこれに応じなければならない。 49 | 50 | 51 | 第10条(知的財産権の帰属) 52 | 甲及び乙は、本データに関する一切の知的財産権、本データの利用に関連して甲が提供する書類やデータ等に関する全ての知的財産権について、甲に帰属することを確認する。ただし、本データ作成の素材となった各文書の著作権は正当な権利を有する第三者に帰属する。 53 | 54 | 第11条(非保証等) 55 | 1.甲は、本データが、第三者の著作権、特許権、その他の無体財産権、営業秘密、ノウハウその他の権利を侵害していないこと、法令に違反していないこと、本データ作成に利用したアルゴリズムに誤り、エラー、バグがないことについて一切保証せず、また、それらの信頼性、正確性、速報性、完全性、及び有効性、特定目的への適合性について一切保証しないものとし、瑕疵担保責任も負わない。 56 | 2.本データに関し万一、第三者から知的財産権侵害等の主張がなされた場合には、乙はただちに甲に対しその旨を通知し、甲に対する情報提供等、当該紛争の解決に最大限協力するものとする。 57 | 58 | 59 | 第12条(違反時の措置) 60 | 1.甲は、乙が次の各号の一つにでも該当した場合、甲は乙に対して本データの利用を差止めることができる。 61 | (1)本規約に違反した場合 62 | (2)法令に違反した場合 63 | (3)虚偽の申告等の不正を行った場合 64 | (4)信頼関係を破壊するような行為を行った場合 65 | (5)その他甲が不適当と認めた場合 66 | 2.前項の規定は甲から乙に対する損害賠償請求を妨げるものではない。 67 | 3.第1項に基づき、甲が乙に対して本データの利用の差し止めを求めた場合、乙は、乙が管理する設備から、本データ、本解析データ及びその複製物の一切を消去するものとする。 68 | 69 | 
第13条(甲の事情による利用許諾の取り消し)
70 | 1.甲は、その理由の如何を問わず、なんらの催告なしに、本データの利用許諾を停止することができるものとする。その際は、第15条に基づき、乙は速やかに本データおよびその複製物の一切を消去または破棄する。
71 | 2.前項の破棄、消去の対象に本解析データは含まない。
72 | 
73 | 
74 | 第14条(利用期間)
75 | 1.乙による本データの利用可能期間は、第5条にもとづく甲の承認日より1年間とする。
76 | 2.乙が1年間を超えて本データの利用継続を希望する場合、第5条に基づく方法で再度利用申請を行うこととする。
77 | 
78 | 
79 | 第15条(本契約終了後の措置等)
80 | 1.理由の如何を問わず、第14条に定める利用期間が終了したとき、もしくは、本データの利用許諾が取り消しとなった場合、乙は本データおよびその複製物の一切を消去または破棄する。
81 | 2.前項の破棄、消去の対象に本解析データは含まない。ただし、乙は、本解析データから本データを復元して再利用することはできないものとする。
82 | 3.第10条、第11条、第15条から第19条は、本契約の終了後も有効に存続する。
83 | 
84 | 第16条(権利義務譲渡の禁止)
85 | 乙は、相手方の書面による事前の承諾なき限り、本契約上の地位及び本契約から生じる権利義務を第三者に譲渡又は担保に供してはならない。
86 | 
87 | 第17条 (個人情報等の保護および法令遵守)
88 | 1.甲が取得した乙の個人情報は、別途定める甲2のプライバシーポリシーに従って取り扱われる。
89 | 2.甲は、サーバー設備の故障その他のトラブル等に対処するため、乙の個人情報を他のサーバーに複写することがある。
90 | 
91 | 第18条(準拠法)
92 | 本契約の準拠法は、日本法とする。
93 | 
94 | 第19条(管轄裁判所)
95 | 本契約に起因し又は関連して生じた一切の紛争については、東京地方裁判所を第一審の専属的合意管轄裁判所とする。
96 | 
97 | 第20条(協 議)
98 | 本契約に定めのない事項及び疑義を生じた事項は、甲乙誠意をもって協議し、円満にその解決にあたる。
99 | 
100 | 第21条(本規約の効力)
101 | 本規約は、本データの利用の関する一切について適用される。なお、本規約は随時変更されることがあるが、変更後の規約は特別に定める場合を除き、ウェブ上で表示された時点から効力を生じるものとする。
102 | ```
103 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # hottoSNS-BERT:大規模日本語SNSコーパスによる文分散表現モデル
3 | 
4 | ## 概要
5 | * 大規模日本語SNSコーパス(以下,大規模SNSコーパス)を用いて,BERTによる文分散表現モデルを構築した
6 | * 本文分散表現モデル(以下,hottoSNS-BERT)は下記登録フォームから登録した方のみに配布する
7 | * [利用規約](#利用規約)は本README.mdの末尾に記載されている.またLICENSE.mdにも同じ内容が記載されている.
8 | 
9 | [登録フォーム](https://forms.office.com/Pages/ResponsePage.aspx?id=Zpu1Ffmdi02AfxgH3uo25PxaMnBWkvJLsXoQLeuzhoBUNU0zN01BR1VFNk9RSEUxWVRNSzAyWThZNSQlQCN0PWcu)
10 | 
11 | 
12 | 
13 | 
14 | [言語資源を利用した発表の登録フォーム](https://forms.office.com/r/EQizJpFBJg)
15 | ※PDFファイルの提出は不要です。
16 | 
17 | ### 引用について
18 | * 本モデルに関する論文は未発表です.引用される方は,以下のbibtexをご利用ください.
19 | ```
20 | @misc{hottoSNS-bert,
21 |   author = {Sakaki, Takeshi and Mizuki, Sakae and Gunji, Naoyuki},
22 |   title = {BERT Pre-trained model Trained on Large-scale Japanese Social Media Corpus},
23 |   year = {2019},
24 |   howpublished = {\url{https://github.com/hottolink/hottoSNS-bert}}
25 | }
26 | ```
27 | 
28 | 
29 | 
30 | - [hottoSNS-BERT:大規模日本語SNSコーパスによる文分散表現モデル](#hottosns-bert大規模日本語snsコーパスによる文分散表現モデル)
31 |   - [概要](#概要)
32 |     - [引用について](#引用について)
33 |   - [配布リソースに関する説明](#配布リソースに関する説明)
34 |     - [大規模日本語SNSコーパス](#大規模日本語snsコーパス)
35 |     - [ファイル構成](#ファイル構成)
36 |     - [利用方法](#利用方法)
37 |       - [実行確認環境](#実行確認環境)
38 |       - [付属評価コードの利用準備](#付属評価コードの利用準備)
39 |       - [付属評価コードの利用方法](#付属評価コードの利用方法)
40 |       - [モデルの読み込み方法](#モデルの読み込み方法)
41 |   - [配布リソースの構築手順](#配布リソースの構築手順)
42 |     - [コーパス・単語分散表現の構築方法](#コーパス・単語分散表現の構築方法)
43 |       - [平文コーパスの収集・構築](#平文コーパスの収集・構築)
44 |       - [前処理](#前処理)
45 |       - [分かち書きコーパスの構築](#分かち書きコーパスの構築)
46 |       - [後処理](#後処理)
47 |       - [既存モデルとの前処理・分かち書きの比較](#既存モデルとの前処理・分かち書きの比較)
48 |       - [統計量](#統計量)
49 |     - [文分散表現の学習](#文分散表現の学習)
50 |       - [pre-training](#pre-training)
51 |       - [neuralnet architectureの比較](#neuralnet-architectureの比較)
52 |       - [学習環境](#学習環境)
53 |   - [配布リソースの性能評価](#配布リソースの性能評価)
54 |     - [評価用データセット](#評価用データセット)
55 |     - [downstream task: fine-tuning](#downstream-task-fine-tuning)
56 |     - [実験結果](#実験結果)
57 |   - [pytorch-transformersからの利用](#pytorch-transformersからの利用)
58 |   - [利用規約](#利用規約)
59 | 
60 | 
61 | 
62 | ## 配布リソースに関する説明
63 | ### 大規模日本語SNSコーパス
64 | * 自家版BERTを学習するために,大規模な日本語ツイートのコーパスを構築した
65 | * 収集文の多様性が大きくなるように,bot投稿・リツイートの除外,重複ツイート文の除外といった工夫を施している
66 | * 構築されたコーパスのツイート数は8,500万である
67 | * 本家BERTが用いたコーパス(=En Wikipedia + BookCorpus)と比較すると,35%程度の大きさである
68 | 
69 | ### ファイル構成
70 | 
71 | | モデル | ファイル名 |
72 | |-----------|----------------------|
73 | | hottoSNS-BERTモデル | bert_config.json |
74 | | | graph.pbtxt |
75 | | | model.ckpt-1000000.meta |
76 | | | model.ckpt-1000000.index |
77 | | | model.ckpt-1000000.data-00000-of-00001 |
78 | | sentencepieceモデル | tokenizer_spm_32K.model |
79 | | | tokenizer_spm_32K.vocab.to.bert |
80 | | | tokenizer_spm_32K.vocab |
81 | 
82 | 
83 | 
84 | ### 利用方法
85 | #### 実行確認環境
86 | * Python 3.6.6
87 | * tensorflow==1.11.0
88 | * sentencepiece==0.1.8
89 | 
90 | 
91 | #### 付属評価コードの利用準備
92 | `./hottoSNS-bert/evaluation_dataset/twitter_sentiment/`以下に,[Twitter日本語評判分析データセット](https://www.db.info.gifu-u.ac.jp/sentiment_analysis/)からツイート本文を復元し,BERTモデル評価用に加工したデータを配置する必要がある.詳細は[hottoSNS-bert/evaluation_dataset/twitter_sentiment/](https://github.com/hottolink/hottoSNS-bert/tree/master/evaluation_dataset/twitter_sentiment)を参照.評価コードが想定するTSVの読み込みイメージを,以下のスケッチに示す.
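付属の `src/dataprocessor/custom.py` と同じく,ヘッダ行を読み飛ばし,4列目の評判極性ラベルと8列目の本文を取り出す最小限のスケッチである(ファイルパスは配置例であり,実装そのものではない).

```
import csv

# train.tsv はヘッダ行 + タブ区切り8列(id, category_id, status_id,
# label_type, created_at, user_id, screen_name, text)を想定
with open("./evaluation_dataset/twitter_sentiment/train.tsv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)  # ヘッダ行を読み飛ばす
    for row in reader:
        label, text = row[3], row[7]
        # custom.py の get_labels() と同じ3値のみを採用する
        if label in ("pos", "neg", "neutral"):
            print(label, text[:30])
```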
93 | #### 付属評価コードの利用方法
94 | ```
95 | # リポジトリのClone
96 | git clone https://github.com/hottolink/hottoSNS-bert.git
97 | cd hottoSNS-bert
98 | 
99 | # 取得した各BERTモデルファイルを `trained_model/` 以下に配置
100 | cp -r [hottoSNS-bert dir]/* ./trained_model/masked_lm_only_L-12_H-768_A-12/
101 | cp -r [日本語wikipedia model dir]/* ./trained_model/wikipedia_ja_L-12_H-768_A-12/
102 | cp -r [Multilingual model dir]/* ./trained_model/multi_cased_L-12_H-768_A-12/
103 | 
104 | 
105 | # 評価環境の構築・評価実行
106 | # ※テキストファイルから分散表現を読み込むため、実行に時間がかかります。
107 | sh setup.sh
108 | cd src
109 | sh run_classifier.sh
110 | 
111 | ```
112 | 
113 | #### モデルの読み込み方法
114 | 後述の[pytorch-transformersからの利用](#pytorch-transformersからの利用)に記載したサンプルコードを参照してください。
115 | 
116 | ## 配布リソースの構築手順
117 | ### コーパス・単語分散表現の構築方法
118 | 
119 | #### 平文コーパスの収集・構築
120 | * 期間:2017年〜2018年に投稿されたツイートから一部を抽出
121 | * 投稿クライアント:人間用のクライアントのみ
122 |   * 実質的に,botによる投稿を除外
123 | * ツイート種別:オーガニックおよびメンション
124 | 
125 | #### 前処理
126 | * 文字フィルタ:ReTweet記号(RT)・省略記号(...)の除外
127 | * 正規化:NFKC正規化,小文字化
128 | * 特殊トークン化:mention, url
129 | * 除外:正規化された本文が重複するツイートを削除
130 | 
131 | 
132 | * サンプルデータは以下の通り
133 | 
134 | ```
135 | ゆめさんが、ファボしてくるあたり、世代だなって思いました( ̇- ̇ )笑
136 | 90秒に250円かけるかどうかは、まぁ個人の自由だしね()
137 | それでは聞いてください rainy
138 | ```
139 | 
140 | #### 分かち書きコーパスの構築
141 | * sentencepieceを採用
142 | * 設定は以下の通り
143 | 
144 | 
145 | |argument|value|
146 | |--|--|
147 | |vocab_size|32,000 |
148 | |character_coverage|0.9995|
149 | |model_type|unigram|
150 | |add_dummy_prefix|FALSE|
151 | |user_defined_symbols|\<mention\>,\<url\>|
152 | |control_symbols|[CLS],[SEP],[MASK]|
153 | |input_sentence_size|20,000,000 |
154 | |pad_id|0|
155 | |unk_id|1|
156 | |bos_id|-1|
157 | |eos_id|-1|
158 | 
159 | * サンプルデータは以下の通り
160 | 
161 | ```
162 | ゆめ さんが 、 ファボ してくる あたり 、 世代 だ なって思いました ( ▁̇ - ▁̇ ▁ ) 笑
163 | 
164 | ▁ 90 秒 に 250 円 かける かどうかは 、 まぁ 個人の 自由 だしね ()
165 | 
166 | ▁ それでは 聞いてください ▁ rain y ▁
167 | ```
168 | 
169 | #### 後処理
170 | * 短すぎるツイート,および投稿数が少なすぎるユーザを除外
171 | * 具体的には,以下に示すしきい値を下回るツイートおよびユーザを除外
172 | 
173 | |limitation|value|
174 | |--|--|
175 | |トークン長さ|5|
176 | |ユーザあたりのツイート数|5|
177 | 
178 | 
179 | #### 既存モデルとの前処理・分かち書きの比較
180 | | | 前処理 | | | 分かち書き | | |
181 | |-------------------------------------|-------------------------|--------------------------------|----------------|---------|---------------|----------|
182 | | モデル名 | 文字正規化 | 特殊トークン化 | 小文字化 | 単語数 | 分かち書き | 言語 |
183 | | BERT MultiLingual | None | no | yes | 119,547 | WordPiece | multi※ |
184 | | BERT 日本語Wikipedia | NFKC | no | no | 32,000 | SentencePiece | ja |
185 | | hottoSNS-BERT | NFKC | yes | no | 32,000 | SentencePiece | ja |
186 | 
187 | 
188 | 
189 | 
190 | 
191 | 
192 | #### 統計量
193 | 構築されたコーパスの統計量は,以下の通り
194 | 
195 | * コーパス全体
196 | 
197 | |metric|value|
198 | |--|--|
199 | |n_user|1,872,623 |
200 | |n_post|85,925,384 |
201 | |n_token|1,441,078,317 |
202 | 
203 | * トークン数・1ユーザあたりの投稿数
204 | 
205 | |metric|n_token|n_post.per.user|
206 | |--|--|--|
207 | |min|5|5|
208 | |mean|16.77|45.89|
209 | |std|13.06|14.83|
210 | |q(0.99)|64|76|
211 | |max|227|781|
212 | 
213 | 
214 | ### 文分散表現の学習
215 | #### pre-training
216 | next sentence predictionはツイートに適用することが難しいため,masked language modelのみを適用する.
217 | また,事前学習のタスク設定として,各サンプルのtoken数を最大64に制限した.この制限のイメージは,下記のスケッチを参照.
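以下は,配布物の `tokenizer_spm_32K.model` を sentencepiece で直接読み込み,64トークン制限のイメージを示す最小限のスケッチである(パスは配置例であり,実際の事前学習用データ生成スクリプトそのものではない).

```
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("./trained_model/masked_lm_only_L-12_H-768_A-12/tokenizer_spm_32K.model")

text = "それでは聞いてください rainy"
pieces = sp.EncodeAsPieces(text)

# [CLS] + トークン列 + [SEP] が最大64トークンに収まるように切り詰める
max_seq_len = 64
pieces = pieces[: max_seq_len - 2]
tokens = ["[CLS]"] + pieces + ["[SEP]"]
print(len(tokens), tokens)
```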
218 | 
219 | #### neuralnet architectureの比較
220 | 
221 | | | neuralnet architecture | | | | pre-training | | | |
222 | |-------------------------------------------|---------|---------|-------------|---------|---------------|-------------|---------|-----------|
223 | | モデル名 | n_dim_e | n_dim_h | n_attn_head | n_layer | max_pos_embed | max_seq_len | n_batch | n_step |
224 | | BERT MultiLingual | 768 | 3072 | 12 | 12 | 512 | 512 | 256 | 1,000,000 |
225 | | BERT 日本語Wikipedia | 768 | 3072 | 12 | 12 | 512 | 512 | 256 | 1,400,000 |
226 | | hottoSNS-BERT | 768 | 3072 | 12 | 12 | 512 | 64 | 512 | 1,000,000 |
227 | 
228 | 
229 | 
230 | #### 学習環境
231 | * Google Cloud Platform の Cloud TPU を使用
232 | * neuralnet framework は TensorFlow 1.12.0 を使用
233 | * 構成の詳細は以下の通り
234 |   * CPU:n1-standard-2(vCPU x 2、メモリ 7.5 GB)
235 |   * ストレージ:Cloud Storage
236 |   * TPU:v2-8
237 | 
238 | 
239 | 
240 | ## 配布リソースの性能評価
241 | ### 評価用データセット
242 | * ツイート評判分析をdownstreamタスクとして,構築したBERTモデルの評価を行う
243 | * 評判分析タスクは,2種類のデータセットを用いて評価する
244 |   1. [Twitter日本語評判分析データセット](https://www.db.info.gifu-u.ac.jp/sentiment_analysis/)[芥子+, 2017]
245 |      * サンプル数:161K
246 |   2. 内製のデータセット;Twitter大規模トピックコーパス
247 |      * サンプル数:12K
248 | 
249 | * 統計量は以下の通り
250 | 
251 | |データセット名|トピック|positive|negative|neutral|total|
252 | |:-:|:-:|:-:|:-:|:-:|:-:|
253 | |Twitter大規模トピックコーパス|指定なし|4,162 |3,031 |4,807 |12,000 |
254 | |Twitter日本語評判分析データセット|家電・携帯端末|10,249 |15,920 |134,928 |161,097 |
255 | 
256 | ### downstream task: fine-tuning
257 | * downstream task の詳細は,以下の通りである
258 |   * task type:日本語ツイートの評判分析;Positive/Negative/Neutral の3値分類
259 |   * task dataset
260 |     1. Twitter日本語評判分析データセット[芥子+, 2017]
261 |     2. 内製のデータセット
262 |   * methodology
263 |     * task dataset を train:test = 9:1 に分割
264 |     * hyper-parameter は,BERT論文[Devlin+, 2018] に準拠
265 |     * 学習および評価を7回繰り返して,平均値を報告
266 |   * evaluation metric:accuracy および macro F-value
267 | 
268 | ```
269 | 芥子 育雄, 鈴木 優, 吉野 幸一郎, グラム ニュービッグ, 大原 一人, 向井 理朗, 中村 哲: 「単語意味ベクトル辞書を用いたTwitterからの日本語評判情報抽出」, 電子情報通信学会論文誌, Vol.J100-D, No.4, pp.530-543, 2017.4.
270 | Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv preprint arXiv:1810.04805, 2018.
271 | ```
272 | 
273 | 
274 | ### 実験結果
275 | 実験結果は以下の通り.
276 | 
277 | | | Twitter大規模トピックコーパス | | Twitter日本語評判分析データセット | |
278 | |-------------------------------------|----------|-----------------------------------|----------|---------|
279 | | モデル名 | accuracy | F-value | accuracy | F-value |
280 | | BERT MultiLingual | 0.7019 | 0.7011 | 0.8776 | 0.7225 |
281 | | BERT 日本語Wikipedia | 0.7237 | 0.7239 | 0.8790 | 0.7359 |
282 | | hottoSNS-BERT | 0.7387 | 0.7396 | 0.8880 | 0.7503 |
283 | 
284 | この結果から,以下のことが言える.
285 | 
286 | * Twitter評判分析タスクに対する性能は,以下の順となった
287 |   * Multilingual < 日本語Wikipedia < 日本語SNS
288 | * Multilingual < 日本語Wikipedia であることから,日本語を対象としたdownstreamタスクでは,日本語(の語彙)に特化した分かち書き方法および日本語コーパスを用いた事前学習の方が適していると考えられる
289 | * 日本語Wikipedia < 日本語SNS であることから,Twitterを対象としたdownstreamタスクでは,日本語Wikipediaよりも,ドメインに特化した大規模日本語SNSコーパスで学習したBERTモデルの方が良い性能が得られると考えられる
290 | 
291 | ## pytorch-transformersからの利用
292 | * pytorch 1.8.1+cu102, tensorflow 2.4.1での動作例
293 | 1. hottoSNS-bertの読み込み
294 | ```
295 | # pytorch-transformers, tensorflowの読み込み
296 | import os, sys
297 | from transformers import BertForPreTraining, BertTokenizer
298 | import tensorflow as tf
299 | 
300 | # hottoSNS-bertの読み込み
301 | sys.path.append("./hottoSNS-bert/src/")
302 | import tokenization
303 | from preprocess import normalizer
304 | ```
305 | 1. 必要なファイルの指定
306 | ```
307 | bert_model_dir = "./hottoSNS-bert/trained_model/masked_lm_only_L-12_H-768_A-12/"
308 | config_file = os.path.join(bert_model_dir, "bert_config.json")
309 | vocab_file = os.path.join(bert_model_dir, "tokenizer_spm_32K.vocab.to.bert")
310 | sp_model_file = os.path.join(bert_model_dir, "tokenizer_spm_32K.model")
311 | bert_model_file = os.path.join(bert_model_dir, "model.ckpt-1000000.index")
312 | ```
313 | 1. tokenizerのインスタンス化
314 | ```
315 | tokenizer = tokenization.JapaneseTweetTokenizer(
316 |     vocab_file=vocab_file,
317 |     model_file=sp_model_file,
318 |     normalizer=normalizer.twitter_normalizer_for_bert_encoder,
319 |     do_lower_case=False)
320 | ```
321 | 1. tokenizeの実行例
322 | ```
323 | # 例文
324 | text = '@test ゆめさんが、ファボしてくるあたり、世代だなって思いました( ̇- ̇ )笑 http://test.com/test.html'
325 | # tokenize
326 | words = tokenizer.tokenize(text)
327 | print(words)
328 | # ['<mention>', '▁', 'ゆめ', 'さんが', '、', 'ファボ', 'してくる', 'あたり', '、', '世代', 'だ', 'なって思いました', '(', '▁̇', '-', '▁̇', '▁', ')', '笑', '▁', '<url>']
329 | # idへ変換
330 | tokenizer.convert_tokens_to_ids(["[CLS]"]+words+["[SEP]"])
331 | # [2, 6, 7, 6372, 348, 8, 5249, 2135, 1438, 8, 3785, 63, 28146, 12, 112, 93, 112, 7, 13, 31, 7, 5, 3]
332 | 
333 | ```
334 | 
335 | 1. pretrainモデルの読み込み
336 | ```
337 | model = BertForPreTraining.from_pretrained(bert_model_file, from_tf=True, config=config_file)
338 | ```
339 | 
340 | 
341 | ## 利用規約
342 | 同一の内容をLICENSE.mdにも記載している.
343 | ```
344 | 第1条(定義)
345 | 本契約で用いられる用語の意味は、以下に定めるとおりとする。
346 | (1)「本規約」とは、本利用規約をいう。
347 | (2)「甲」とは、 株式会社ホットリンク(以下「甲」という)をいう。
348 | (3)「乙」とは、本規約に同意し、甲の承認を得て、甲が配布する文分散表現データを利用する個人をいう。
349 | (4)「本データ」とは、甲が作成した文分散表現データおよびそれに付随する全部をいう。
350 | 
351 | 
352 | 第2条(利用許諾)
353 | 甲は、乙が本規約に従って本データを利用することを非独占的に許諾する。なお、甲及び乙は、本規約に明示的に定める以外に、乙に本データに関していかなる権利も付与するものではないことを確認する。
354 | 
355 | 
356 | 第4条(許諾の条件)
357 | 甲が乙に本データの利用を許諾する条件は、以下の通りとする。
358 | (1)利用目的: 日本語に関する学術研究・産業研究(以下「本研究」という)を遂行するため。
359 | (2)利用の範囲: 乙及び乙が所属する研究グループ
360 | (3)利用方法: 本研究のために本データを乙が管理するコンピューター端末またはサーバーに複製し、本データを分析・研究しデータベース等に保存した解析データ(以下「本解析データ」という)を得る。
361 | 
362 | 
363 | 第5条(利用申込)
364 | 1.乙は、甲が指定するウェブ上の入力フォーム(以下、入力フォーム)を通じて、乙の名前や所属、連絡先等、甲が指定する項目を甲に送信し、本データの利用について甲の承認を得るものとする。 なお、甲が承認しなかった場合、甲はその理由を開示する義務を負わない。
365 | 2.前項に基づき甲に申告した内容に変更が生じる場合、乙は遅滞なくこれを甲に報告し、改めて甲の承認を得るものとする。
366 | 3.乙が入力フォームを送信した時点で、乙は本規約に同意したものとみなされる。
367 | 
368 | 第6条(禁止事項)
369 | 乙は、本データの利用にあたり、以下に定める行為をしてはならない。
370 | (1)本データ及びその複製物(それらを復元できるデータを含む)を譲渡、貸与、販売すること。また、書面による甲の事前許諾なくこれらを配布、公衆送信、刊行物に転載するなど前項に定める範囲を超えて利用し、甲または第三者の権利を侵害すること。
371 | (2)本データを用いて甲又は第三者の名誉を毀損し、あるいはプライバシーを侵害するなどの権利侵害を行うこと。
372 | (3)乙及び乙が所属する研究グループ以外の第三者に本データを利用させること。
373 | (4)本規約で明示的に許諾された目的及び手段以外にデータを利用 すること。
374 | 
375 | 第7条(対価)
376 | 本規約に基づく本データの利用許諾の対価は発生しない。
377 | 
378 | 第8条(公表)
379 | 1.乙は、学術研究の目的に限り、本データを使用して得られた研究成果や知見を公表することができる。これらの公表には、本解析データや処理プログラムの公表を含む。
380 | 2.乙は、公表にあたっては、本データをもとにした成果であることを明記し、成果の公表の前にその概要を書面やメール等で甲に報告する。
381 | 3.乙は、論文発表の際も、本データを利用した旨を明記し、提出先の学会、発表年月日を所定のフォームから甲に提出するものとする。
382 | 
383 | 
384 | 
385 | 第9条(乙の責任)
386 | 1.乙は、本データをダウンロードする為に必要な通信機器やソフトウェア、通信回線等の全てを乙の責任と費用で準備し、操作、接続等をする。
387 | 2.乙は、本データを本研究の遂行のみに使用する。
388 | 3.乙は、本データが漏洩しないよう善良な管理者の注意義務をもって管理し、乙のコンピューター端末等に適切な対策を施すものとする。
389 | 4.乙が、本研究を乙が所属するグループのメンバーと共同で遂行する場合、乙は、本規約の内容を当該グループの他のメンバーに遵守させるものとし、万一、当該他のメンバーが本規約に違反し甲又は第三者に損害を与えた場合は、乙はこれを自らの行為として連帯して責任を負うものとする。
390 | 5.甲が必要と判断する場合、乙に対して、本データの利用状況の開示を求めることができるものとし、乙はこれに応じなければならない。
391 | 
392 | 
393 | 第10条(知的財産権の帰属)
394 | 甲及び乙は、本データに関する一切の知的財産権、本データの利用に関連して甲が提供する書類やデータ等に関する全ての知的財産権について、甲に帰属することを確認する。ただし、本データ作成の素材となった各文書の著作権は正当な権利を有する第三者に帰属する。
395 | 
396 | 第11条(非保証等)
1.甲は、本データが、第三者の著作権、特許権、その他の無体財産権、営業秘密、ノウハウその他の権利を侵害していないこと、法令に違反していないこと、本データ作成に利用したアルゴリズムに誤り、エラー、バグがないことについて一切保証せず、また、それらの信頼性、正確性、速報性、完全性、及び有効性、特定目的への適合性について一切保証しないものとし、瑕疵担保責任も負わない。 398 | 2.本データに関し万一、第三者から知的財産権侵害等の主張がなされた場合には、乙はただちに甲に対しその旨を通知し、甲に対する情報提供等、当該紛争の解決に最大限協力するものとする。 399 | 400 | 401 | 第12条(違反時の措置) 402 | 1.甲は、乙が次の各号の一つにでも該当した場合、甲は乙に対して本データの利用を差止めることができる。 403 | (1)本規約に違反した場合 404 | (2)法令に違反した場合 405 | (3)虚偽の申告等の不正を行った場合 406 | (4)信頼関係を破壊するような行為を行った場合 407 | (5)その他甲が不適当と認めた場合 408 | 2.前項の規定は甲から乙に対する損害賠償請求を妨げるものではない。 409 | 3.第1項に基づき、甲が乙に対して本データの利用の差し止めを求めた場合、乙は、乙が管理する設備から、本データ、本解析データ及びその複製物の一切を消去するものとする。 410 | 411 | 第13条(甲の事情による利用許諾の取り消し) 412 | 1.甲は、その理由の如何を問わず、なんらの催告なしに、本データの利用許諾を停止することができるものとする。その際は、第15条に基づき、乙は速やかに本データおよびその複製物の一切を消去または破棄する。 413 | 2.前項の破棄、消去の対象に本解析データは含まない。 414 | 415 | 416 | 第14条(利用期間) 417 | 1.乙による本データの利用可能期間は、第5条にもとづく甲の承認日より1年間とする。 418 | 2.乙が1年間を超えて本データの利用継続を希望する場合、第5条に基づく方法で再度利用申請を行うこととする。 419 | 420 | 421 | 第15条(本契約終了後の措置等) 422 | 1.理由の如何を問わず、第14条に定める利用期間が終了したとき、もしくは、本データの利用許諾が取り消しとなった場合、乙は本データおよびその複製物の一切を消去または破棄する。 423 | 2.前項の破棄、消去の対象に本解析データは含まない。ただし、乙は、本解析データから本データを復元して再利用することはできないものとする。 424 | 3.第10条、第11条、第15条から第19条は、本契約の終了後も有効に存続する。 425 | 426 | 第16条(権利義務譲渡の禁止) 427 | 乙は、相手方の書面による事前の承諾なき限り、本契約上の地位及び本契約から生じる権利義務を第三者に譲渡又は担保に供してはならない。 428 | 429 | 第17条 (個人情報等の保護および法令遵守) 430 | 1.甲が取得した乙の個人情報は、別途定める甲2のプライバシーポリシーに従って取り扱われる。 431 | 2.甲は、サーバー設備の故障その他のトラブル等に対処するため、乙の個人情報を他のサーバーに複写することがある。 432 | 433 | 第18条(準拠法) 434 | 本契約の準拠法は、日本法とする。 435 | 436 | 第19条(管轄裁判所) 437 | 本契約に起因し又は関連して生じた一切の紛争については、東京地方裁判所を第一審の専属的合意管轄裁判所とする。 438 | 439 | 第20条(協 議) 440 | 本契約に定めのない事項及び疑義を生じた事項は、甲乙誠意をもって協議し、円満にその解決にあたる。 441 | 442 | 第21条(本規約の効力) 443 | 本規約は、本データの利用の関する一切について適用される。なお、本規約は随時変更されることがあるが、変更後の規約は特別に定める場合を除き、ウェブ上で表示された時点から効力を生じるものとする。 444 | ``` 445 | -------------------------------------------------------------------------------- /evaluation_dataset/twitter_sentiment/README.md: -------------------------------------------------------------------------------- 1 | # Twitter日本語評判分析データセット:評価用 2 | [Twitter日本語評判分析データセット](http://bigdata.naist.jp/~ysuzuki/data/twitter/)について,ツイート本文を復元した後,BERTモデル評価用に整形したもの 3 | 4 | ## 出典 5 | * Twitter日本語評判分析データセット[芥子+, 2017]をクリーニング,分割したもの 6 | * 以下のエントリを削除 7 | * 2つ以上の評判極性が付与されている 8 | * `pos,neg,neutral` 以外の評判極性が付与されている 9 | * 残ったエントリを `train:dev:test = 8:1:1` に分割 10 | 11 | 12 | 13 | ## ファイル 14 | * 本ディレクトリの格納ファイルは以下の通り 15 | * 復元された評判分析コーパスは `train.tsv,dev.tsv.test.tsv` である 16 | 17 | | ファイル名 | ファイル形式 | 説明 | 18 | |----------------------|-------------------|-------------------------------------------------------------------------------------------------| 19 | | train.tsv | TSV,タブ区切り | 学習用 | 20 | | dev.tsv | TSV,タブ区切り | 開発用 | 21 | | test.tsv | TSV,タブ区切り | 評価用 | 22 | 23 | 24 | 25 | ## フォーマット 26 | * `[train,dev,test].tsv` のフォーマットは以下の通り 27 | 28 | | 列名 | 型 | 説明 | 29 | |-------------|----------|---------------------------------------------------------------------------------------------------------| 30 | | id | str | データ番号.Twitter日本語評判分析データセットにて採番されたもの | 31 | | category_id | str | ジャンルID.詳細は[データセット説明](http://bigdata.naist.jp/~ysuzuki/data/twitter/)を参照 | 32 | | status_id | str | ツイートID.ツイートの一意な識別子 | 33 | | label_type | str | 評判極性ラベル,全5種類.詳細は[データセット説明](http://bigdata.naist.jp/~ysuzuki/data/twitter/)を参照 | 34 | | created_at | datetime | ツイートの投稿日時 | 35 | | user_id | str | 投稿者のID.Twitterユーザの一意な識別子 | 36 | | screen_name | str | 投稿者のスクリーン名 | 37 | | text | str | ツイート本文(改行コードを含む) | 38 | 39 | 40 | 41 | 以上 42 | -------------------------------------------------------------------------------- 
/images/QR_hottoSNS-bert.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/images/QR_hottoSNS-bert.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow >= 1.11.0 # CPU Version of TensorFlow. 2 | # tensorflow-gpu >= 1.11.0 # GPU version of TensorFlow. 3 | regex 4 | sentencepiece==0.1.8 5 | -------------------------------------------------------------------------------- /src/dataprocessor/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- -------------------------------------------------------------------------------- /src/dataprocessor/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/src/dataprocessor/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /src/dataprocessor/__pycache__/custom.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/src/dataprocessor/__pycache__/custom.cpython-36.pyc -------------------------------------------------------------------------------- /src/dataprocessor/__pycache__/preset.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/src/dataprocessor/__pycache__/preset.cpython-36.pyc -------------------------------------------------------------------------------- /src/dataprocessor/custom.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | import os, csv 5 | 6 | from .preset import DataProcessor, InputExample 7 | import tokenization 8 | 9 | 10 | # dataset processor for Twitter日本語評判分析データセット [Suzuki+, 2017] 11 | class PublicTwitterSentimentProcessor(DataProcessor): 12 | """ 13 | Processor for the Twitter日本語評判分析データセット . 
14 | refer to: http://bigdata.naist.jp/~ysuzuki/data/twitter/ 15 | 16 | """ 17 | 18 | def get_train_examples(self, data_dir): 19 | """See base class.""" 20 | return self._create_examples( 21 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 22 | 23 | def get_dev_examples(self, data_dir): 24 | """See base class.""" 25 | return self._create_examples( 26 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 27 | 28 | def get_test_examples(self, data_dir): 29 | """See base class.""" 30 | return self._create_examples( 31 | self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") 32 | 33 | def get_labels(self): 34 | """See base class.""" 35 | return ["pos", "neg", "neutral"] 36 | 37 | def _create_examples(self, lines, set_type): 38 | """Creates examples for the training and dev sets.""" 39 | examples = [] 40 | for (i, line) in enumerate(lines): 41 | if i == 0: 42 | continue 43 | guid = "%s-%s" % (set_type, i) 44 | if set_type in ["train","dev"]: 45 | text_a = tokenization.convert_to_unicode(line[7]) 46 | label = tokenization.convert_to_unicode(line[3]) 47 | elif set_type == "test": 48 | text_a = tokenization.convert_to_unicode(line[7]) 49 | label = tokenization.convert_to_unicode(line[3]) 50 | else: 51 | raise NotImplementedError(f"unsupported set type: {set_type}") 52 | 53 | if label in self.get_labels(): 54 | examples.append( 55 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 56 | return examples 57 | -------------------------------------------------------------------------------- /src/dataprocessor/preset.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | import csv 4 | import os 5 | 6 | import tensorflow as tf 7 | 8 | import tokenization 9 | 10 | 11 | class InputExample(object): 12 | """A single training/test example for simple sequence classification.""" 13 | 14 | def __init__(self, guid, text_a, text_b=None, label=None): 15 | """Constructs a InputExample. 16 | 17 | Args: 18 | guid: Unique id for the example. 19 | text_a: string. The untokenized text of the first sequence. For single 20 | sequence tasks, only this sequence must be specified. 21 | text_b: (Optional) string. The untokenized text of the second sequence. 22 | Only must be specified for sequence pair tasks. 23 | label: (Optional) string. The label of the example. This should be 24 | specified for train and dev examples, but not for test examples. 
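    Example (hypothetical values, for illustration only):
      InputExample(guid="train-1", text_a="それでは聞いてください rainy", label="pos")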
25 | """ 26 | self.guid = guid 27 | self.text_a = text_a 28 | self.text_b = text_b 29 | self.label = label 30 | 31 | 32 | class InputFeatures(object): 33 | """A single set of features of data.""" 34 | 35 | def __init__(self, input_ids, input_mask, segment_ids, label_id): 36 | self.input_ids = input_ids 37 | self.input_mask = input_mask 38 | self.segment_ids = segment_ids 39 | self.label_id = label_id 40 | 41 | 42 | class DataProcessor(object): 43 | """Base class for data converters for sequence classification data sets.""" 44 | 45 | def get_train_examples(self, data_dir): 46 | """Gets a collection of `InputExample`s for the train set.""" 47 | raise NotImplementedError() 48 | 49 | def get_dev_examples(self, data_dir): 50 | """Gets a collection of `InputExample`s for the dev set.""" 51 | raise NotImplementedError() 52 | 53 | def get_test_examples(self, data_dir): 54 | """Gets a collection of `InputExample`s for prediction.""" 55 | raise NotImplementedError() 56 | 57 | def get_labels(self): 58 | """Gets the list of labels for this data set.""" 59 | raise NotImplementedError() 60 | 61 | @classmethod 62 | def _read_tsv(cls, input_file, quotechar=None): 63 | """Reads a tab separated value file.""" 64 | with tf.gfile.Open(input_file, "r") as f: 65 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar) 66 | lines = [] 67 | for line in reader: 68 | lines.append(line) 69 | return lines 70 | 71 | 72 | class XnliProcessor(DataProcessor): 73 | """Processor for the XNLI data set.""" 74 | 75 | def __init__(self): 76 | self.language = "zh" 77 | 78 | def get_train_examples(self, data_dir): 79 | """See base class.""" 80 | lines = self._read_tsv( 81 | os.path.join(data_dir, "multinli", 82 | "multinli.train.%s.tsv" % self.language)) 83 | examples = [] 84 | for (i, line) in enumerate(lines): 85 | if i == 0: 86 | continue 87 | guid = "train-%d" % (i) 88 | text_a = tokenization.convert_to_unicode(line[0]) 89 | text_b = tokenization.convert_to_unicode(line[1]) 90 | label = tokenization.convert_to_unicode(line[2]) 91 | if label == tokenization.convert_to_unicode("contradictory"): 92 | label = tokenization.convert_to_unicode("contradiction") 93 | examples.append( 94 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 95 | return examples 96 | 97 | def get_dev_examples(self, data_dir): 98 | """See base class.""" 99 | lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv")) 100 | examples = [] 101 | for (i, line) in enumerate(lines): 102 | if i == 0: 103 | continue 104 | guid = "dev-%d" % (i) 105 | language = tokenization.convert_to_unicode(line[0]) 106 | if language != tokenization.convert_to_unicode(self.language): 107 | continue 108 | text_a = tokenization.convert_to_unicode(line[6]) 109 | text_b = tokenization.convert_to_unicode(line[7]) 110 | label = tokenization.convert_to_unicode(line[1]) 111 | examples.append( 112 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 113 | return examples 114 | 115 | def get_labels(self): 116 | """See base class.""" 117 | return ["contradiction", "entailment", "neutral"] 118 | 119 | 120 | class MnliProcessor(DataProcessor): 121 | """Processor for the MultiNLI data set (GLUE version).""" 122 | 123 | def get_train_examples(self, data_dir): 124 | """See base class.""" 125 | return self._create_examples( 126 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 127 | 128 | def get_dev_examples(self, data_dir): 129 | """See base class.""" 130 | return self._create_examples( 131 | self._read_tsv(os.path.join(data_dir, 
"dev_matched.tsv")), 132 | "dev_matched") 133 | 134 | def get_test_examples(self, data_dir): 135 | """See base class.""" 136 | return self._create_examples( 137 | self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test") 138 | 139 | def get_labels(self): 140 | """See base class.""" 141 | return ["contradiction", "entailment", "neutral"] 142 | 143 | def _create_examples(self, lines, set_type): 144 | """Creates examples for the training and dev sets.""" 145 | examples = [] 146 | for (i, line) in enumerate(lines): 147 | if i == 0: 148 | continue 149 | guid = "%s-%s" % (set_type, tokenization.convert_to_unicode(line[0])) 150 | text_a = tokenization.convert_to_unicode(line[8]) 151 | text_b = tokenization.convert_to_unicode(line[9]) 152 | if set_type == "test": 153 | label = "contradiction" 154 | else: 155 | label = tokenization.convert_to_unicode(line[-1]) 156 | examples.append( 157 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 158 | return examples 159 | 160 | 161 | class MrpcProcessor(DataProcessor): 162 | """Processor for the MRPC data set (GLUE version).""" 163 | 164 | def get_train_examples(self, data_dir): 165 | """See base class.""" 166 | return self._create_examples( 167 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 168 | 169 | def get_dev_examples(self, data_dir): 170 | """See base class.""" 171 | return self._create_examples( 172 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 173 | 174 | def get_test_examples(self, data_dir): 175 | """See base class.""" 176 | return self._create_examples( 177 | self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") 178 | 179 | def get_labels(self): 180 | """See base class.""" 181 | return ["0", "1"] 182 | 183 | def _create_examples(self, lines, set_type): 184 | """Creates examples for the training and dev sets.""" 185 | examples = [] 186 | for (i, line) in enumerate(lines): 187 | if i == 0: 188 | continue 189 | guid = "%s-%s" % (set_type, i) 190 | text_a = tokenization.convert_to_unicode(line[3]) 191 | text_b = tokenization.convert_to_unicode(line[4]) 192 | if set_type == "test": 193 | label = "0" 194 | else: 195 | label = tokenization.convert_to_unicode(line[0]) 196 | examples.append( 197 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 198 | return examples 199 | 200 | 201 | class ColaProcessor(DataProcessor): 202 | """Processor for the CoLA data set (GLUE version).""" 203 | 204 | def get_train_examples(self, data_dir): 205 | """See base class.""" 206 | return self._create_examples( 207 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 208 | 209 | def get_dev_examples(self, data_dir): 210 | """See base class.""" 211 | return self._create_examples( 212 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 213 | 214 | def get_test_examples(self, data_dir): 215 | """See base class.""" 216 | return self._create_examples( 217 | self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") 218 | 219 | def get_labels(self): 220 | """See base class.""" 221 | return ["0", "1"] 222 | 223 | def _create_examples(self, lines, set_type): 224 | """Creates examples for the training and dev sets.""" 225 | examples = [] 226 | for (i, line) in enumerate(lines): 227 | # Only the test set has a header 228 | if set_type == "test" and i == 0: 229 | continue 230 | guid = "%s-%s" % (set_type, i) 231 | if set_type == "test": 232 | text_a = tokenization.convert_to_unicode(line[1]) 233 | label = "0" 234 | else: 235 | text_a = tokenization.convert_to_unicode(line[3]) 236 
| label = tokenization.convert_to_unicode(line[1]) 237 | examples.append( 238 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 239 | return examples -------------------------------------------------------------------------------- /src/modeling.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """The main BERT model and related functions.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import collections 22 | import copy 23 | import json 24 | import math 25 | import re 26 | import six 27 | import tensorflow as tf 28 | 29 | 30 | class BertConfig(object): 31 | """Configuration for `BertModel`.""" 32 | 33 | def __init__(self, 34 | vocab_size, 35 | hidden_size=768, 36 | num_hidden_layers=12, 37 | num_attention_heads=12, 38 | intermediate_size=3072, 39 | hidden_act="gelu", 40 | hidden_dropout_prob=0.1, 41 | attention_probs_dropout_prob=0.1, 42 | max_position_embeddings=512, 43 | type_vocab_size=16, 44 | initializer_range=0.02): 45 | """Constructs BertConfig. 46 | 47 | Args: 48 | vocab_size: Vocabulary size of `inputs_ids` in `BertModel`. 49 | hidden_size: Size of the encoder layers and the pooler layer. 50 | num_hidden_layers: Number of hidden layers in the Transformer encoder. 51 | num_attention_heads: Number of attention heads for each attention layer in 52 | the Transformer encoder. 53 | intermediate_size: The size of the "intermediate" (i.e., feed-forward) 54 | layer in the Transformer encoder. 55 | hidden_act: The non-linear activation function (function or string) in the 56 | encoder and pooler. 57 | hidden_dropout_prob: The dropout probability for all fully connected 58 | layers in the embeddings, encoder, and pooler. 59 | attention_probs_dropout_prob: The dropout ratio for the attention 60 | probabilities. 61 | max_position_embeddings: The maximum sequence length that this model might 62 | ever be used with. Typically set this to something large just in case 63 | (e.g., 512 or 1024 or 2048). 64 | type_vocab_size: The vocabulary size of the `token_type_ids` passed into 65 | `BertModel`. 66 | initializer_range: The stdev of the truncated_normal_initializer for 67 | initializing all weight matrices. 
68 | """ 69 | self.vocab_size = vocab_size 70 | self.hidden_size = hidden_size 71 | self.num_hidden_layers = num_hidden_layers 72 | self.num_attention_heads = num_attention_heads 73 | self.hidden_act = hidden_act 74 | self.intermediate_size = intermediate_size 75 | self.hidden_dropout_prob = hidden_dropout_prob 76 | self.attention_probs_dropout_prob = attention_probs_dropout_prob 77 | self.max_position_embeddings = max_position_embeddings 78 | self.type_vocab_size = type_vocab_size 79 | self.initializer_range = initializer_range 80 | 81 | @classmethod 82 | def from_dict(cls, json_object): 83 | """Constructs a `BertConfig` from a Python dictionary of parameters.""" 84 | config = BertConfig(vocab_size=None) 85 | for (key, value) in six.iteritems(json_object): 86 | config.__dict__[key] = value 87 | return config 88 | 89 | @classmethod 90 | def from_json_file(cls, json_file): 91 | """Constructs a `BertConfig` from a json file of parameters.""" 92 | with tf.gfile.GFile(json_file, "r") as reader: 93 | text = reader.read() 94 | return cls.from_dict(json.loads(text)) 95 | 96 | def to_dict(self): 97 | """Serializes this instance to a Python dictionary.""" 98 | output = copy.deepcopy(self.__dict__) 99 | return output 100 | 101 | def to_json_string(self): 102 | """Serializes this instance to a JSON string.""" 103 | return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n" 104 | 105 | 106 | class BertModel(object): 107 | """BERT model ("Bidirectional Embedding Representations from a Transformer"). 108 | 109 | Example usage: 110 | 111 | ```python 112 | # Already been converted into WordPiece token ids 113 | input_ids = tf.constant([[31, 51, 99], [15, 5, 0]]) 114 | input_mask = tf.constant([[1, 1, 1], [1, 1, 0]]) 115 | token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]]) 116 | 117 | config = modeling.BertConfig(vocab_size=32000, hidden_size=512, 118 | num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024) 119 | 120 | model = modeling.BertModel(config=config, is_training=True, 121 | input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids) 122 | 123 | label_embeddings = tf.get_variable(...) 124 | pooled_output = model.get_pooled_output() 125 | logits = tf.matmul(pooled_output, label_embeddings) 126 | ... 127 | ``` 128 | """ 129 | 130 | def __init__(self, 131 | config, 132 | is_training, 133 | input_ids, 134 | input_mask=None, 135 | token_type_ids=None, 136 | use_one_hot_embeddings=True, 137 | scope=None): 138 | """Constructor for BertModel. 139 | 140 | Args: 141 | config: `BertConfig` instance. 142 | is_training: bool. rue for training model, false for eval model. Controls 143 | whether dropout will be applied. 144 | input_ids: int32 Tensor of shape [batch_size, seq_length]. 145 | input_mask: (optional) int32 Tensor of shape [batch_size, seq_length]. 146 | token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length]. 147 | use_one_hot_embeddings: (optional) bool. Whether to use one-hot word 148 | embeddings or tf.embedding_lookup() for the word embeddings. On the TPU, 149 | it is must faster if this is True, on the CPU or GPU, it is faster if 150 | this is False. 151 | scope: (optional) variable scope. Defaults to "bert". 152 | 153 | Raises: 154 | ValueError: The config is invalid or one of the input tensor shapes 155 | is invalid. 
156 | """ 157 | config = copy.deepcopy(config) 158 | if not is_training: 159 | config.hidden_dropout_prob = 0.0 160 | config.attention_probs_dropout_prob = 0.0 161 | 162 | input_shape = get_shape_list(input_ids, expected_rank=2) 163 | batch_size = input_shape[0] 164 | seq_length = input_shape[1] 165 | 166 | if input_mask is None: 167 | input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32) 168 | 169 | if token_type_ids is None: 170 | token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32) 171 | 172 | with tf.variable_scope(scope, default_name="bert"): 173 | with tf.variable_scope("embeddings"): 174 | # Perform embedding lookup on the word ids. 175 | (self.embedding_output, self.embedding_table) = embedding_lookup( 176 | input_ids=input_ids, 177 | vocab_size=config.vocab_size, 178 | embedding_size=config.hidden_size, 179 | initializer_range=config.initializer_range, 180 | word_embedding_name="word_embeddings", 181 | use_one_hot_embeddings=use_one_hot_embeddings) 182 | 183 | # Add positional embeddings and token type embeddings, then layer 184 | # normalize and perform dropout. 185 | self.embedding_output = embedding_postprocessor( 186 | input_tensor=self.embedding_output, 187 | use_token_type=True, 188 | token_type_ids=token_type_ids, 189 | token_type_vocab_size=config.type_vocab_size, 190 | token_type_embedding_name="token_type_embeddings", 191 | use_position_embeddings=True, 192 | position_embedding_name="position_embeddings", 193 | initializer_range=config.initializer_range, 194 | max_position_embeddings=config.max_position_embeddings, 195 | dropout_prob=config.hidden_dropout_prob) 196 | 197 | with tf.variable_scope("encoder"): 198 | # This converts a 2D mask of shape [batch_size, seq_length] to a 3D 199 | # mask of shape [batch_size, seq_length, seq_length] which is used 200 | # for the attention scores. 201 | attention_mask = create_attention_mask_from_input_mask( 202 | input_ids, input_mask) 203 | 204 | # Run the stacked transformer. 205 | # `sequence_output` shape = [batch_size, seq_length, hidden_size]. 206 | self.all_encoder_layers = transformer_model( 207 | input_tensor=self.embedding_output, 208 | attention_mask=attention_mask, 209 | hidden_size=config.hidden_size, 210 | num_hidden_layers=config.num_hidden_layers, 211 | num_attention_heads=config.num_attention_heads, 212 | intermediate_size=config.intermediate_size, 213 | intermediate_act_fn=get_activation(config.hidden_act), 214 | hidden_dropout_prob=config.hidden_dropout_prob, 215 | attention_probs_dropout_prob=config.attention_probs_dropout_prob, 216 | initializer_range=config.initializer_range, 217 | do_return_all_layers=True) 218 | 219 | self.sequence_output = self.all_encoder_layers[-1] 220 | # The "pooler" converts the encoded sequence tensor of shape 221 | # [batch_size, seq_length, hidden_size] to a tensor of shape 222 | # [batch_size, hidden_size]. This is necessary for segment-level 223 | # (or segment-pair-level) classification tasks where we need a fixed 224 | # dimensional representation of the segment. 225 | with tf.variable_scope("pooler"): 226 | # We "pool" the model by simply taking the hidden state corresponding 227 | # to the first token. 
We assume that this has been pre-trained 228 | first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1) 229 | self.pooled_output = tf.layers.dense( 230 | first_token_tensor, 231 | config.hidden_size, 232 | activation=tf.tanh, 233 | kernel_initializer=create_initializer(config.initializer_range)) 234 | 235 | def get_pooled_output(self): 236 | return self.pooled_output 237 | 238 | def get_sequence_output(self): 239 | """Gets final hidden layer of encoder. 240 | 241 | Returns: 242 | float Tensor of shape [batch_size, seq_length, hidden_size] corresponding 243 | to the final hidden of the transformer encoder. 244 | """ 245 | return self.sequence_output 246 | 247 | def get_all_encoder_layers(self): 248 | return self.all_encoder_layers 249 | 250 | def get_embedding_output(self): 251 | """Gets output of the embedding lookup (i.e., input to the transformer). 252 | 253 | Returns: 254 | float Tensor of shape [batch_size, seq_length, hidden_size] corresponding 255 | to the output of the embedding layer, after summing the word 256 | embeddings with the positional embeddings and the token type embeddings, 257 | then performing layer normalization. This is the input to the transformer. 258 | """ 259 | return self.embedding_output 260 | 261 | def get_embedding_table(self): 262 | return self.embedding_table 263 | 264 | 265 | def gelu(input_tensor): 266 | """Gaussian Error Linear Unit. 267 | 268 | This is a smoother version of the RELU. 269 | Original paper: https://arxiv.org/abs/1606.08415 270 | 271 | Args: 272 | input_tensor: float Tensor to perform activation. 273 | 274 | Returns: 275 | `input_tensor` with the GELU activation applied. 276 | """ 277 | cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0))) 278 | return input_tensor * cdf 279 | 280 | 281 | def get_activation(activation_string): 282 | """Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`. 283 | 284 | Args: 285 | activation_string: String name of the activation function. 286 | 287 | Returns: 288 | A Python function corresponding to the activation function. If 289 | `activation_string` is None, empty, or "linear", this will return None. 290 | If `activation_string` is not a string, it will return `activation_string`. 291 | 292 | Raises: 293 | ValueError: The `activation_string` does not correspond to a known 294 | activation. 295 | """ 296 | 297 | # We assume that anything that"s not a string is already an activation 298 | # function, so we just return it. 
299 | if not isinstance(activation_string, six.string_types): 300 | return activation_string 301 | 302 | if not activation_string: 303 | return None 304 | 305 | act = activation_string.lower() 306 | if act == "linear": 307 | return None 308 | elif act == "relu": 309 | return tf.nn.relu 310 | elif act == "gelu": 311 | return gelu 312 | elif act == "tanh": 313 | return tf.tanh 314 | else: 315 | raise ValueError("Unsupported activation: %s" % act) 316 | 317 | 318 | def get_assignment_map_from_checkpoint(tvars, init_checkpoint): 319 | """Compute the union of the current variables and checkpoint variables.""" 320 | assignment_map = {} 321 | initialized_variable_names = {} 322 | 323 | name_to_variable = collections.OrderedDict() 324 | for var in tvars: 325 | name = var.name 326 | m = re.match("^(.*):\\d+$", name) 327 | if m is not None: 328 | name = m.group(1) 329 | name_to_variable[name] = var 330 | 331 | init_vars = tf.train.list_variables(init_checkpoint) 332 | 333 | assignment_map = collections.OrderedDict() 334 | for x in init_vars: 335 | (name, var) = (x[0], x[1]) 336 | if name not in name_to_variable: 337 | continue 338 | assignment_map[name] = name 339 | initialized_variable_names[name] = 1 340 | initialized_variable_names[name + ":0"] = 1 341 | 342 | return (assignment_map, initialized_variable_names) 343 | 344 | 345 | def dropout(input_tensor, dropout_prob): 346 | """Perform dropout. 347 | 348 | Args: 349 | input_tensor: float Tensor. 350 | dropout_prob: Python float. The probability of dropping out a value (NOT of 351 | *keeping* a dimension as in `tf.nn.dropout`). 352 | 353 | Returns: 354 | A version of `input_tensor` with dropout applied. 355 | """ 356 | if dropout_prob is None or dropout_prob == 0.0: 357 | return input_tensor 358 | 359 | output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob) 360 | return output 361 | 362 | 363 | def layer_norm(input_tensor, name=None): 364 | """Run layer normalization on the last dimension of the tensor.""" 365 | return tf.contrib.layers.layer_norm( 366 | inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name) 367 | 368 | 369 | def layer_norm_and_dropout(input_tensor, dropout_prob, name=None): 370 | """Runs layer normalization followed by dropout.""" 371 | output_tensor = layer_norm(input_tensor, name) 372 | output_tensor = dropout(output_tensor, dropout_prob) 373 | return output_tensor 374 | 375 | 376 | def create_initializer(initializer_range=0.02): 377 | """Creates a `truncated_normal_initializer` with the given range.""" 378 | return tf.truncated_normal_initializer(stddev=initializer_range) 379 | 380 | 381 | def embedding_lookup(input_ids, 382 | vocab_size, 383 | embedding_size=128, 384 | initializer_range=0.02, 385 | word_embedding_name="word_embeddings", 386 | use_one_hot_embeddings=False): 387 | """Looks up words embeddings for id tensor. 388 | 389 | Args: 390 | input_ids: int32 Tensor of shape [batch_size, seq_length] containing word 391 | ids. 392 | vocab_size: int. Size of the embedding vocabulary. 393 | embedding_size: int. Width of the word embeddings. 394 | initializer_range: float. Embedding initialization range. 395 | word_embedding_name: string. Name of the embedding table. 396 | use_one_hot_embeddings: bool. If True, use one-hot method for word 397 | embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better 398 | for TPUs. 399 | 400 | Returns: 401 | float Tensor of shape [batch_size, seq_length, embedding_size]. 
402 | """ 403 | # This function assumes that the input is of shape [batch_size, seq_length, 404 | # num_inputs]. 405 | # 406 | # If the input is a 2D tensor of shape [batch_size, seq_length], we 407 | # reshape to [batch_size, seq_length, 1]. 408 | if input_ids.shape.ndims == 2: 409 | input_ids = tf.expand_dims(input_ids, axis=[-1]) 410 | 411 | embedding_table = tf.get_variable( 412 | name=word_embedding_name, 413 | shape=[vocab_size, embedding_size], 414 | initializer=create_initializer(initializer_range)) 415 | 416 | if use_one_hot_embeddings: 417 | flat_input_ids = tf.reshape(input_ids, [-1]) 418 | one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size) 419 | output = tf.matmul(one_hot_input_ids, embedding_table) 420 | else: 421 | output = tf.nn.embedding_lookup(embedding_table, input_ids) 422 | 423 | input_shape = get_shape_list(input_ids) 424 | 425 | output = tf.reshape(output, 426 | input_shape[0:-1] + [input_shape[-1] * embedding_size]) 427 | return (output, embedding_table) 428 | 429 | 430 | def embedding_postprocessor(input_tensor, 431 | use_token_type=False, 432 | token_type_ids=None, 433 | token_type_vocab_size=16, 434 | token_type_embedding_name="token_type_embeddings", 435 | use_position_embeddings=True, 436 | position_embedding_name="position_embeddings", 437 | initializer_range=0.02, 438 | max_position_embeddings=512, 439 | dropout_prob=0.1): 440 | """Performs various post-processing on a word embedding tensor. 441 | 442 | Args: 443 | input_tensor: float Tensor of shape [batch_size, seq_length, 444 | embedding_size]. 445 | use_token_type: bool. Whether to add embeddings for `token_type_ids`. 446 | token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length]. 447 | Must be specified if `use_token_type` is True. 448 | token_type_vocab_size: int. The vocabulary size of `token_type_ids`. 449 | token_type_embedding_name: string. The name of the embedding table variable 450 | for token type ids. 451 | use_position_embeddings: bool. Whether to add position embeddings for the 452 | position of each token in the sequence. 453 | position_embedding_name: string. The name of the embedding table variable 454 | for positional embeddings. 455 | initializer_range: float. Range of the weight initialization. 456 | max_position_embeddings: int. Maximum sequence length that might ever be 457 | used with this model. This can be longer than the sequence length of 458 | input_tensor, but cannot be shorter. 459 | dropout_prob: float. Dropout probability applied to the final output tensor. 460 | 461 | Returns: 462 | float tensor with same shape as `input_tensor`. 463 | 464 | Raises: 465 | ValueError: One of the tensor shapes or input values is invalid. 466 | """ 467 | input_shape = get_shape_list(input_tensor, expected_rank=3) 468 | batch_size = input_shape[0] 469 | seq_length = input_shape[1] 470 | width = input_shape[2] 471 | 472 | output = input_tensor 473 | 474 | if use_token_type: 475 | if token_type_ids is None: 476 | raise ValueError("`token_type_ids` must be specified if" 477 | "`use_token_type` is True.") 478 | token_type_table = tf.get_variable( 479 | name=token_type_embedding_name, 480 | shape=[token_type_vocab_size, width], 481 | initializer=create_initializer(initializer_range)) 482 | # This vocab will be small so we always do one-hot here, since it is always 483 | # faster for a small vocabulary. 
484 | flat_token_type_ids = tf.reshape(token_type_ids, [-1]) 485 | one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size) 486 | token_type_embeddings = tf.matmul(one_hot_ids, token_type_table) 487 | token_type_embeddings = tf.reshape(token_type_embeddings, 488 | [batch_size, seq_length, width]) 489 | output += token_type_embeddings 490 | 491 | if use_position_embeddings: 492 | assert_op = tf.assert_less_equal(seq_length, max_position_embeddings) 493 | with tf.control_dependencies([assert_op]): 494 | full_position_embeddings = tf.get_variable( 495 | name=position_embedding_name, 496 | shape=[max_position_embeddings, width], 497 | initializer=create_initializer(initializer_range)) 498 | # Since the position embedding table is a learned variable, we create it 499 | # using a (long) sequence length `max_position_embeddings`. The actual 500 | # sequence length might be shorter than this, for faster training of 501 | # tasks that do not have long sequences. 502 | # 503 | # So `full_position_embeddings` is effectively an embedding table 504 | # for position [0, 1, 2, ..., max_position_embeddings-1], and the current 505 | # sequence has positions [0, 1, 2, ... seq_length-1], so we can just 506 | # perform a slice. 507 | position_embeddings = tf.slice(full_position_embeddings, [0, 0], 508 | [seq_length, -1]) 509 | num_dims = len(output.shape.as_list()) 510 | 511 | # Only the last two dimensions are relevant (`seq_length` and `width`), so 512 | # we broadcast among the first dimensions, which is typically just 513 | # the batch size. 514 | position_broadcast_shape = [] 515 | for _ in range(num_dims - 2): 516 | position_broadcast_shape.append(1) 517 | position_broadcast_shape.extend([seq_length, width]) 518 | position_embeddings = tf.reshape(position_embeddings, 519 | position_broadcast_shape) 520 | output += position_embeddings 521 | 522 | output = layer_norm_and_dropout(output, dropout_prob) 523 | return output 524 | 525 | 526 | def create_attention_mask_from_input_mask(from_tensor, to_mask): 527 | """Create 3D attention mask from a 2D tensor mask. 528 | 529 | Args: 530 | from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...]. 531 | to_mask: int32 Tensor of shape [batch_size, to_seq_length]. 532 | 533 | Returns: 534 | float Tensor of shape [batch_size, from_seq_length, to_seq_length]. 535 | """ 536 | from_shape = get_shape_list(from_tensor, expected_rank=[2, 3]) 537 | batch_size = from_shape[0] 538 | from_seq_length = from_shape[1] 539 | 540 | to_shape = get_shape_list(to_mask, expected_rank=2) 541 | to_seq_length = to_shape[1] 542 | 543 | to_mask = tf.cast( 544 | tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32) 545 | 546 | # We don't assume that `from_tensor` is a mask (although it could be). We 547 | # don't actually care if we attend *from* padding tokens (only *to* padding) 548 | # tokens so we create a tensor of all ones. 549 | # 550 | # `broadcast_ones` = [batch_size, from_seq_length, 1] 551 | broadcast_ones = tf.ones( 552 | shape=[batch_size, from_seq_length, 1], dtype=tf.float32) 553 | 554 | # Here we broadcast along two dimensions to create the mask. 
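  # `broadcast_ones` [B, F, 1] * `to_mask` [B, 1, T] broadcasts to `mask` [B, F, T]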
555 | mask = broadcast_ones * to_mask 556 | 557 | return mask 558 | 559 | 560 | def attention_layer(from_tensor, 561 | to_tensor, 562 | attention_mask=None, 563 | num_attention_heads=1, 564 | size_per_head=512, 565 | query_act=None, 566 | key_act=None, 567 | value_act=None, 568 | attention_probs_dropout_prob=0.0, 569 | initializer_range=0.02, 570 | do_return_2d_tensor=False, 571 | batch_size=None, 572 | from_seq_length=None, 573 | to_seq_length=None): 574 | """Performs multi-headed attention from `from_tensor` to `to_tensor`. 575 | 576 | This is an implementation of multi-headed attention based on "Attention 577 | is all you Need". If `from_tensor` and `to_tensor` are the same, then 578 | this is self-attention. Each timestep in `from_tensor` attends to the 579 | corresponding sequence in `to_tensor`, and returns a fixed-with vector. 580 | 581 | This function first projects `from_tensor` into a "query" tensor and 582 | `to_tensor` into "key" and "value" tensors. These are (effectively) a list 583 | of tensors of length `num_attention_heads`, where each tensor is of shape 584 | [batch_size, seq_length, size_per_head]. 585 | 586 | Then, the query and key tensors are dot-producted and scaled. These are 587 | softmaxed to obtain attention probabilities. The value tensors are then 588 | interpolated by these probabilities, then concatenated back to a single 589 | tensor and returned. 590 | 591 | In practice, the multi-headed attention are done with transposes and 592 | reshapes rather than actual separate tensors. 593 | 594 | Args: 595 | from_tensor: float Tensor of shape [batch_size, from_seq_length, 596 | from_width]. 597 | to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width]. 598 | attention_mask: (optional) int32 Tensor of shape [batch_size, 599 | from_seq_length, to_seq_length]. The values should be 1 or 0. The 600 | attention scores will effectively be set to -infinity for any positions in 601 | the mask that are 0, and will be unchanged for positions that are 1. 602 | num_attention_heads: int. Number of attention heads. 603 | size_per_head: int. Size of each attention head. 604 | query_act: (optional) Activation function for the query transform. 605 | key_act: (optional) Activation function for the key transform. 606 | value_act: (optional) Activation function for the value transform. 607 | attention_probs_dropout_prob: (optional) float. Dropout probability of the 608 | attention probabilities. 609 | initializer_range: float. Range of the weight initializer. 610 | do_return_2d_tensor: bool. If True, the output will be of shape [batch_size 611 | * from_seq_length, num_attention_heads * size_per_head]. If False, the 612 | output will be of shape [batch_size, from_seq_length, num_attention_heads 613 | * size_per_head]. 614 | batch_size: (Optional) int. If the input is 2D, this might be the batch size 615 | of the 3D version of the `from_tensor` and `to_tensor`. 616 | from_seq_length: (Optional) If the input is 2D, this might be the seq length 617 | of the 3D version of the `from_tensor`. 618 | to_seq_length: (Optional) If the input is 2D, this might be the seq length 619 | of the 3D version of the `to_tensor`. 620 | 621 | Returns: 622 | float Tensor of shape [batch_size, from_seq_length, 623 | num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is 624 | true, this will be of shape [batch_size * from_seq_length, 625 | num_attention_heads * size_per_head]). 626 | 627 | Raises: 628 | ValueError: Any of the arguments or tensor shapes are invalid. 
629 | """ 630 | 631 | def transpose_for_scores(input_tensor, batch_size, num_attention_heads, 632 | seq_length, width): 633 | output_tensor = tf.reshape( 634 | input_tensor, [batch_size, seq_length, num_attention_heads, width]) 635 | 636 | output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3]) 637 | return output_tensor 638 | 639 | from_shape = get_shape_list(from_tensor, expected_rank=[2, 3]) 640 | to_shape = get_shape_list(to_tensor, expected_rank=[2, 3]) 641 | 642 | if len(from_shape) != len(to_shape): 643 | raise ValueError( 644 | "The rank of `from_tensor` must match the rank of `to_tensor`.") 645 | 646 | if len(from_shape) == 3: 647 | batch_size = from_shape[0] 648 | from_seq_length = from_shape[1] 649 | to_seq_length = to_shape[1] 650 | elif len(from_shape) == 2: 651 | if (batch_size is None or from_seq_length is None or to_seq_length is None): 652 | raise ValueError( 653 | "When passing in rank 2 tensors to attention_layer, the values " 654 | "for `batch_size`, `from_seq_length`, and `to_seq_length` " 655 | "must all be specified.") 656 | 657 | # Scalar dimensions referenced here: 658 | # B = batch size (number of sequences) 659 | # F = `from_tensor` sequence length 660 | # T = `to_tensor` sequence length 661 | # N = `num_attention_heads` 662 | # H = `size_per_head` 663 | 664 | from_tensor_2d = reshape_to_matrix(from_tensor) 665 | to_tensor_2d = reshape_to_matrix(to_tensor) 666 | 667 | # `query_layer` = [B*F, N*H] 668 | query_layer = tf.layers.dense( 669 | from_tensor_2d, 670 | num_attention_heads * size_per_head, 671 | activation=query_act, 672 | name="query", 673 | kernel_initializer=create_initializer(initializer_range)) 674 | 675 | # `key_layer` = [B*T, N*H] 676 | key_layer = tf.layers.dense( 677 | to_tensor_2d, 678 | num_attention_heads * size_per_head, 679 | activation=key_act, 680 | name="key", 681 | kernel_initializer=create_initializer(initializer_range)) 682 | 683 | # `value_layer` = [B*T, N*H] 684 | value_layer = tf.layers.dense( 685 | to_tensor_2d, 686 | num_attention_heads * size_per_head, 687 | activation=value_act, 688 | name="value", 689 | kernel_initializer=create_initializer(initializer_range)) 690 | 691 | # `query_layer` = [B, N, F, H] 692 | query_layer = transpose_for_scores(query_layer, batch_size, 693 | num_attention_heads, from_seq_length, 694 | size_per_head) 695 | 696 | # `key_layer` = [B, N, T, H] 697 | key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads, 698 | to_seq_length, size_per_head) 699 | 700 | # Take the dot product between "query" and "key" to get the raw 701 | # attention scores. 702 | # `attention_scores` = [B, N, F, T] 703 | attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True) 704 | attention_scores = tf.multiply(attention_scores, 705 | 1.0 / math.sqrt(float(size_per_head))) 706 | 707 | if attention_mask is not None: 708 | # `attention_mask` = [B, 1, F, T] 709 | attention_mask = tf.expand_dims(attention_mask, axis=[1]) 710 | 711 | # Since attention_mask is 1.0 for positions we want to attend and 0.0 for 712 | # masked positions, this operation will create a tensor which is 0.0 for 713 | # positions we want to attend and -10000.0 for masked positions. 714 | adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0 715 | 716 | # Since we are adding it to the raw scores before the softmax, this is 717 | # effectively the same as removing these entirely. 718 | attention_scores += adder 719 | 720 | # Normalize the attention scores to probabilities. 
721 |   # `attention_probs` = [B, N, F, T]
722 |   attention_probs = tf.nn.softmax(attention_scores)
723 | 
724 |   # This is actually dropping out entire tokens to attend to, which might
725 |   # seem a bit unusual, but is taken from the original Transformer paper.
726 |   attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
727 | 
728 |   # `value_layer` = [B, T, N, H]
729 |   value_layer = tf.reshape(
730 |       value_layer,
731 |       [batch_size, to_seq_length, num_attention_heads, size_per_head])
732 | 
733 |   # `value_layer` = [B, N, T, H]
734 |   value_layer = tf.transpose(value_layer, [0, 2, 1, 3])
735 | 
736 |   # `context_layer` = [B, N, F, H]
737 |   context_layer = tf.matmul(attention_probs, value_layer)
738 | 
739 |   # `context_layer` = [B, F, N, H]
740 |   context_layer = tf.transpose(context_layer, [0, 2, 1, 3])
741 | 
742 |   if do_return_2d_tensor:
743 |     # `context_layer` = [B*F, N*H]
744 |     context_layer = tf.reshape(
745 |         context_layer,
746 |         [batch_size * from_seq_length, num_attention_heads * size_per_head])
747 |   else:
748 |     # `context_layer` = [B, F, N*H]
749 |     context_layer = tf.reshape(
750 |         context_layer,
751 |         [batch_size, from_seq_length, num_attention_heads * size_per_head])
752 | 
753 |   return context_layer
754 | 
755 | 
756 | def transformer_model(input_tensor,
757 |                       attention_mask=None,
758 |                       hidden_size=768,
759 |                       num_hidden_layers=12,
760 |                       num_attention_heads=12,
761 |                       intermediate_size=3072,
762 |                       intermediate_act_fn=gelu,
763 |                       hidden_dropout_prob=0.1,
764 |                       attention_probs_dropout_prob=0.1,
765 |                       initializer_range=0.02,
766 |                       do_return_all_layers=False):
767 |   """Multi-headed, multi-layer Transformer from "Attention Is All You Need".
768 | 
769 |   This is almost an exact implementation of the original Transformer encoder.
770 | 
771 |   See the original paper:
772 |   https://arxiv.org/abs/1706.03762
773 | 
774 |   Also see:
775 |   https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
776 | 
777 |   Args:
778 |     input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
779 |     attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
780 |       seq_length], with 1 for positions that can be attended to and 0 in
781 |       positions that should not be.
782 |     hidden_size: int. Hidden size of the Transformer.
783 |     num_hidden_layers: int. Number of layers (blocks) in the Transformer.
784 |     num_attention_heads: int. Number of attention heads in the Transformer.
785 |     intermediate_size: int. The size of the "intermediate" (a.k.a., feed
786 |       forward) layer.
787 |     intermediate_act_fn: function. The non-linear activation function to apply
788 |       to the output of the intermediate/feed-forward layer.
789 |     hidden_dropout_prob: float. Dropout probability for the hidden layers.
790 |     attention_probs_dropout_prob: float. Dropout probability of the attention
791 |       probabilities.
792 |     initializer_range: float. Range of the initializer (stddev of truncated
793 |       normal).
794 |     do_return_all_layers: Whether to also return all layers or just the final
795 |       layer.
796 | 
797 |   Returns:
798 |     float Tensor of shape [batch_size, seq_length, hidden_size], the final
799 |     hidden layer of the Transformer.
800 | 
801 |   Raises:
802 |     ValueError: A Tensor shape or parameter is invalid.
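  Example (an illustrative sketch added for clarity, not part of the
  original source; `embedding_output` and `mask` are assumed inputs):

    # BERT-Base style encoder over embeddings of shape [8, 128, 768]
    sequence_output = transformer_model(
        input_tensor=embedding_output, attention_mask=mask,
        hidden_size=768, num_hidden_layers=12, num_attention_heads=12,
        intermediate_size=3072)
    # `sequence_output` has shape [8, 128, 768]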
803 |   """
804 |   if hidden_size % num_attention_heads != 0:
805 |     raise ValueError(
806 |         "The hidden size (%d) is not a multiple of the number of attention "
807 |         "heads (%d)" % (hidden_size, num_attention_heads))
808 | 
809 |   attention_head_size = int(hidden_size / num_attention_heads)
810 |   input_shape = get_shape_list(input_tensor, expected_rank=3)
811 |   batch_size = input_shape[0]
812 |   seq_length = input_shape[1]
813 |   input_width = input_shape[2]
814 | 
815 |   # The Transformer adds residual connections (sums) across all layers, so
816 |   # the input width needs to be the same as the hidden size.
817 |   if input_width != hidden_size:
818 |     raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
819 |                      (input_width, hidden_size))
820 | 
821 |   # We keep the representation as a 2D tensor to avoid re-shaping it back and
822 |   # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
823 |   # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
824 |   # help the optimizer.
825 |   prev_output = reshape_to_matrix(input_tensor)
826 | 
827 |   all_layer_outputs = []
828 |   for layer_idx in range(num_hidden_layers):
829 |     with tf.variable_scope("layer_%d" % layer_idx):
830 |       layer_input = prev_output
831 | 
832 |       with tf.variable_scope("attention"):
833 |         attention_heads = []
834 |         with tf.variable_scope("self"):
835 |           attention_head = attention_layer(
836 |               from_tensor=layer_input,
837 |               to_tensor=layer_input,
838 |               attention_mask=attention_mask,
839 |               num_attention_heads=num_attention_heads,
840 |               size_per_head=attention_head_size,
841 |               attention_probs_dropout_prob=attention_probs_dropout_prob,
842 |               initializer_range=initializer_range,
843 |               do_return_2d_tensor=True,
844 |               batch_size=batch_size,
845 |               from_seq_length=seq_length,
846 |               to_seq_length=seq_length)
847 |           attention_heads.append(attention_head)
848 | 
849 |         attention_output = None
850 |         if len(attention_heads) == 1:
851 |           attention_output = attention_heads[0]
852 |         else:
853 |           # In the case where we have other sequences, we just concatenate
854 |           # them to the self-attention head before the projection.
855 |           attention_output = tf.concat(attention_heads, axis=-1)
856 | 
857 |         # Run a linear projection of `hidden_size` then add a residual
858 |         # with `layer_input`.
859 |         with tf.variable_scope("output"):
860 |           attention_output = tf.layers.dense(
861 |               attention_output,
862 |               hidden_size,
863 |               kernel_initializer=create_initializer(initializer_range))
864 |           attention_output = dropout(attention_output, hidden_dropout_prob)
865 |           attention_output = layer_norm(attention_output + layer_input)
866 | 
867 |       # The activation is only applied to the "intermediate" hidden layer.
868 |       with tf.variable_scope("intermediate"):
869 |         intermediate_output = tf.layers.dense(
870 |             attention_output,
871 |             intermediate_size,
872 |             activation=intermediate_act_fn,
873 |             kernel_initializer=create_initializer(initializer_range))
874 | 
875 |       # Down-project back to `hidden_size` then add the residual.
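      # (Added note: each block is thus self-attention -> add & layer-norm
      # -> feed-forward -> add & layer-norm, i.e. the post-LN Transformer
      # encoder arrangement.)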
876 |       with tf.variable_scope("output"):
877 |         layer_output = tf.layers.dense(
878 |             intermediate_output,
879 |             hidden_size,
880 |             kernel_initializer=create_initializer(initializer_range))
881 |         layer_output = dropout(layer_output, hidden_dropout_prob)
882 |         layer_output = layer_norm(layer_output + attention_output)
883 |         prev_output = layer_output
884 |         all_layer_outputs.append(layer_output)
885 | 
886 |   if do_return_all_layers:
887 |     final_outputs = []
888 |     for layer_output in all_layer_outputs:
889 |       final_output = reshape_from_matrix(layer_output, input_shape)
890 |       final_outputs.append(final_output)
891 |     return final_outputs
892 |   else:
893 |     final_output = reshape_from_matrix(prev_output, input_shape)
894 |     return final_output
895 | 
896 | 
897 | def get_shape_list(tensor, expected_rank=None, name=None):
898 |   """Returns a list of the shape of tensor, preferring static dimensions.
899 | 
900 |   Args:
901 |     tensor: A tf.Tensor object to find the shape of.
902 |     expected_rank: (optional) int. The expected rank of `tensor`. If this is
903 |       specified and the `tensor` has a different rank, an exception will be
904 |       thrown.
905 |     name: Optional name of the tensor for the error message.
906 | 
907 |   Returns:
908 |     A list of dimensions of the shape of tensor. All static dimensions will
909 |     be returned as python integers, and dynamic dimensions will be returned
910 |     as tf.Tensor scalars.
911 |   """
912 |   if name is None:
913 |     name = tensor.name
914 | 
915 |   if expected_rank is not None:
916 |     assert_rank(tensor, expected_rank, name)
917 | 
918 |   shape = tensor.shape.as_list()
919 | 
920 |   non_static_indexes = []
921 |   for (index, dim) in enumerate(shape):
922 |     if dim is None:
923 |       non_static_indexes.append(index)
924 | 
925 |   if not non_static_indexes:
926 |     return shape
927 | 
928 |   dyn_shape = tf.shape(tensor)
929 |   for index in non_static_indexes:
930 |     shape[index] = dyn_shape[index]
931 |   return shape
932 | 
933 | 
934 | def reshape_to_matrix(input_tensor):
935 |   """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix)."""
936 |   ndims = input_tensor.shape.ndims
937 |   if ndims < 2:
938 |     raise ValueError("Input tensor must have at least rank 2. Shape = %s" %
939 |                      (input_tensor.shape))
940 |   if ndims == 2:
941 |     return input_tensor
942 | 
943 |   width = input_tensor.shape[-1]
944 |   output_tensor = tf.reshape(input_tensor, [-1, width])
945 |   return output_tensor
946 | 
947 | 
948 | def reshape_from_matrix(output_tensor, orig_shape_list):
949 |   """Reshapes a rank 2 tensor back to its original rank >= 2 tensor."""
950 |   if len(orig_shape_list) == 2:
951 |     return output_tensor
952 | 
953 |   output_shape = get_shape_list(output_tensor)
954 | 
955 |   orig_dims = orig_shape_list[0:-1]
956 |   width = output_shape[-1]
957 | 
958 |   return tf.reshape(output_tensor, orig_dims + [width])
959 | 
960 | 
961 | def assert_rank(tensor, expected_rank, name=None):
962 |   """Raises an exception if the tensor rank is not of the expected rank.
963 | 
964 |   Args:
965 |     tensor: A tf.Tensor to check the rank of.
966 |     expected_rank: Python integer or list of integers, expected rank.
967 |     name: Optional name of the tensor for the error message.
968 | 
969 |   Raises:
970 |     ValueError: If the expected shape doesn't match the actual shape.
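  Example (an illustrative sketch added for clarity, not part of the
  original source):

    x = tf.zeros([8, 128])
    assert_rank(x, 2)        # passes
    assert_rank(x, [2, 3])   # passes: rank 2 is in the allowed set
    assert_rank(x, 3)        # raises ValueError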
971 | """ 972 | if name is None: 973 | name = tensor.name 974 | 975 | expected_rank_dict = {} 976 | if isinstance(expected_rank, six.integer_types): 977 | expected_rank_dict[expected_rank] = True 978 | else: 979 | for x in expected_rank: 980 | expected_rank_dict[x] = True 981 | 982 | actual_rank = tensor.shape.ndims 983 | if actual_rank not in expected_rank_dict: 984 | scope_name = tf.get_variable_scope().name 985 | raise ValueError( 986 | "For the tensor `%s` in scope `%s`, the actual rank " 987 | "`%d` (shape = %s) is not equal to the expected rank `%s`" % 988 | (name, scope_name, actual_rank, str(tensor.shape), str(expected_rank))) 989 | -------------------------------------------------------------------------------- /src/optimization.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """Functions and classes related to optimization (weight updates).""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import re 22 | import tensorflow as tf 23 | 24 | 25 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu): 26 | """Creates an optimizer training op.""" 27 | global_step = tf.train.get_or_create_global_step() 28 | 29 | learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32) 30 | 31 | # Implements linear decay of the learning rate. 32 | learning_rate = tf.train.polynomial_decay( 33 | learning_rate, 34 | global_step, 35 | num_train_steps, 36 | end_learning_rate=0.0, 37 | power=1.0, 38 | cycle=False) 39 | 40 | # Implements linear warmup. I.e., if global_step < num_warmup_steps, the 41 | # learning rate will be `global_step/num_warmup_steps * init_lr`. 42 | if num_warmup_steps: 43 | global_steps_int = tf.cast(global_step, tf.int32) 44 | warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32) 45 | 46 | global_steps_float = tf.cast(global_steps_int, tf.float32) 47 | warmup_steps_float = tf.cast(warmup_steps_int, tf.float32) 48 | 49 | warmup_percent_done = global_steps_float / warmup_steps_float 50 | warmup_learning_rate = init_lr * warmup_percent_done 51 | 52 | is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32) 53 | learning_rate = ( 54 | (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate) 55 | 56 | # It is recommended that you use this optimizer for fine tuning, since this 57 | # is how the model was trained (note that the Adam m/v variables are NOT 58 | # loaded from init_checkpoint.) 
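  # Worked example of the schedule above (illustrative numbers, not part
  # of the original source): with init_lr=5e-5, num_train_steps=10000 and
  # num_warmup_steps=1000, the learning rate rises linearly from 0 to
  # 5e-5 over the first 1000 steps, then decays linearly back to 0 by
  # step 10000.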
59 |   optimizer = AdamWeightDecayOptimizer(
60 |       learning_rate=learning_rate,
61 |       weight_decay_rate=0.01,
62 |       beta_1=0.9,
63 |       beta_2=0.999,
64 |       epsilon=1e-6,
65 |       exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
66 | 
67 |   if use_tpu:
68 |     optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
69 | 
70 |   tvars = tf.trainable_variables()
71 |   grads = tf.gradients(loss, tvars)
72 | 
73 |   # This is how the model was pre-trained.
74 |   (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
75 | 
76 |   train_op = optimizer.apply_gradients(
77 |       zip(grads, tvars), global_step=global_step)
78 | 
79 |   new_global_step = global_step + 1
80 |   train_op = tf.group(train_op, [global_step.assign(new_global_step)])
81 |   return train_op
82 | 
83 | 
84 | class AdamWeightDecayOptimizer(tf.train.Optimizer):
85 |   """A basic Adam optimizer that includes "correct" L2 weight decay."""
86 | 
87 |   def __init__(self,
88 |                learning_rate,
89 |                weight_decay_rate=0.0,
90 |                beta_1=0.9,
91 |                beta_2=0.999,
92 |                epsilon=1e-6,
93 |                exclude_from_weight_decay=None,
94 |                name="AdamWeightDecayOptimizer"):
95 |     """Constructs an AdamWeightDecayOptimizer."""
96 |     super(AdamWeightDecayOptimizer, self).__init__(False, name)
97 | 
98 |     self.learning_rate = learning_rate
99 |     self.weight_decay_rate = weight_decay_rate
100 |     self.beta_1 = beta_1
101 |     self.beta_2 = beta_2
102 |     self.epsilon = epsilon
103 |     self.exclude_from_weight_decay = exclude_from_weight_decay
104 | 
105 |   def apply_gradients(self, grads_and_vars, global_step=None, name=None):
106 |     """See base class."""
107 |     assignments = []
108 |     for (grad, param) in grads_and_vars:
109 |       if grad is None or param is None:
110 |         continue
111 | 
112 |       param_name = self._get_variable_name(param.name)
113 | 
114 |       m = tf.get_variable(
115 |           name=param_name + "/adam_m",
116 |           shape=param.shape.as_list(),
117 |           dtype=tf.float32,
118 |           trainable=False,
119 |           initializer=tf.zeros_initializer())
120 |       v = tf.get_variable(
121 |           name=param_name + "/adam_v",
122 |           shape=param.shape.as_list(),
123 |           dtype=tf.float32,
124 |           trainable=False,
125 |           initializer=tf.zeros_initializer())
126 | 
127 |       # Standard Adam update.
128 |       next_m = (
129 |           tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
130 |       next_v = (
131 |           tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
132 |                                                     tf.square(grad)))
133 | 
134 |       update = next_m / (tf.sqrt(next_v) + self.epsilon)
135 | 
136 |       # Just adding the square of the weights to the loss function is *not*
137 |       # the correct way of using L2 regularization/weight decay with Adam,
138 |       # since that will interact with the m and v parameters in strange ways.
139 |       #
140 |       # Instead we want to decay the weights in a manner that doesn't interact
141 |       # with the m/v parameters. This is equivalent to adding the square
142 |       # of the weights to the loss with plain (non-momentum) SGD.
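      # (Added note: this is decoupled weight decay as in AdamW -- the
      # decay term is added to the update directly rather than to the
      # gradient, so it never flows through the m/v moment estimates.)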
143 | if self._do_use_weight_decay(param_name): 144 | update += self.weight_decay_rate * param 145 | 146 | update_with_lr = self.learning_rate * update 147 | 148 | next_param = param - update_with_lr 149 | 150 | assignments.extend( 151 | [param.assign(next_param), 152 | m.assign(next_m), 153 | v.assign(next_v)]) 154 | return tf.group(*assignments, name=name) 155 | 156 | def _do_use_weight_decay(self, param_name): 157 | """Whether to use L2 weight decay for `param_name`.""" 158 | if not self.weight_decay_rate: 159 | return False 160 | if self.exclude_from_weight_decay: 161 | for r in self.exclude_from_weight_decay: 162 | if re.search(r, param_name) is not None: 163 | return False 164 | return True 165 | 166 | def _get_variable_name(self, param_name): 167 | """Get the variable name from the tensor name.""" 168 | m = re.match("^(.*):\\d+$", param_name) 169 | if m is not None: 170 | param_name = m.group(1) 171 | return param_name 172 | -------------------------------------------------------------------------------- /src/preprocess/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | -------------------------------------------------------------------------------- /src/preprocess/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/src/preprocess/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /src/preprocess/__pycache__/normalizer.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/src/preprocess/__pycache__/normalizer.cpython-36.pyc -------------------------------------------------------------------------------- /src/preprocess/normalizer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | import os, sys, io 4 | 5 | import unicodedata 6 | import regex 7 | import html 8 | 9 | # String Normalizer Module 10 | # Usage: from normalizer import twitter_normalizer 11 | 12 | ##charFilter: urlFilter 13 | url_regex = "https?://[-_.!~*\'()a-z0-9;/?:@&=+$,%#]+" 14 | url_pattern = regex.compile(url_regex, regex.IGNORECASE) 15 | 16 | ##charFilter: partialurlFilter 17 | partial_url_regex = "(((https|http)(.{1,3})?)|(htt|ht))$" 18 | partial_url_pattern = regex.compile(partial_url_regex, regex.IGNORECASE) 19 | 20 | ##charFilter: retweetflagFilter 21 | rt_regex = "rt (?=\@)" 22 | rt_pattern = regex.compile(rt_regex, regex.IGNORECASE) 23 | 24 | ##charFilter: screennameFilter 25 | scname_regex = "\@[a-z0-9\_]+:?" 
26 | scname_pattern = regex.compile(scname_regex, regex.IGNORECASE)
27 | 
28 | ##charFilter: truncationFilter
29 | truncation_regex = "…$" # NFKC:"...$"
30 | truncation_pattern = regex.compile(truncation_regex, regex.IGNORECASE)
31 | 
32 | ##charFilter: hashtagFilter
33 | hashtag_regex = r"\#\S+"
34 | hashtag_pattern = regex.compile(hashtag_regex, regex.IGNORECASE)
35 | 
36 | ##charFilter: whitespaceNormalizer
37 | ws_regex = "\p{Zs}"
38 | ws_pattern = regex.compile(ws_regex, regex.IGNORECASE)
39 | 
40 | ##charFilter: controlcodeFilter
41 | cc_regex = "\p{Cc}"
42 | cc_pattern = regex.compile(cc_regex, regex.IGNORECASE)
43 | 
44 | ##charFilter: singlequestionFilter
45 | sq_regex = "\?{1,}"
46 | sq_pattern = regex.compile(sq_regex, regex.IGNORECASE)
47 | 
48 | SPECIAL_TOKENS = {
49 |     "url": "<url>",
50 |     "screen_name": "<screen_name>"
51 | }
52 | 
53 | 
54 | def twitter_normalizer(str_):
55 |     # processing order is crucial.
56 | 
57 |     #unescape html entities
58 |     str_ = html.unescape(str_)
59 |     #charFilter: strip
60 |     str_ = str_.strip()
61 |     #charFilter: truncationFilter
62 |     str_ = truncation_pattern.sub("", str_)
63 |     #charFilter: icuNormalizer(NFKC)
64 |     str_ = unicodedata.normalize('NFKC', str_)
65 |     #charFilter: caseNormalizer
66 |     str_ = str_.lower()
67 |     #charFilter: retweetflagFilter
68 |     str_ = rt_pattern.sub("", str_)
69 |     ##charFilter: partialurlFilter
70 |     str_ = partial_url_pattern.sub("", str_)
71 |     ##charFilter: screennameFilter
72 |     str_ = scname_pattern.sub(SPECIAL_TOKENS["screen_name"], str_)
73 |     ##charFilter: urlFilter
74 |     str_ = url_pattern.sub(SPECIAL_TOKENS["url"], str_)
75 |     ##charFilter: strip(once again)
76 |     str_ = str_.strip()
77 | 
78 |     return str_
79 | 
80 | def question_remover(str_: str):
81 |     return sq_pattern.sub(" ", str_)
82 | 
83 | def whitespace_normalizer(str_: str):
84 |     str_ = ws_pattern.sub(" ", str_)
85 |     return str_
86 | 
87 | def control_code_remover(str_: str):
88 |     str_ = cc_pattern.sub(" ", str_)
89 |     return str_
90 | 
91 | def twitter_normalizer_for_bert_encoder(str_):
92 |     # normalizer that is specialized to the Twitter BERT encoder.
93 | 
94 |     #unescape html entities
95 |     str_ = html.unescape(str_)
96 |     #charFilter: question mark
97 |     str_ = sq_pattern.sub(" ", str_)
98 |     #charFilter: strip
99 |     str_ = str_.strip()
100 |     #charFilter: truncationFilter
101 |     str_ = truncation_pattern.sub("", str_)
102 |     #charFilter: icuNormalizer(NFKC)
103 |     str_ = unicodedata.normalize('NFKC', str_)
104 |     #charFilter: caseNormalizer
105 |     # str_ = str_.lower()
106 |     #charFilter: retweetflagFilter
107 |     str_ = rt_pattern.sub("", str_)
108 |     #charFilter: partialurlFilter
109 |     str_ = partial_url_pattern.sub("", str_)
110 |     #charFilter: screennameFilter
111 |     str_ = scname_pattern.sub(SPECIAL_TOKENS["screen_name"], str_)
112 |     #charFilter: urlFilter
113 |     str_ = url_pattern.sub(SPECIAL_TOKENS["url"], str_)
114 |     #charFilter: control code such as newline
115 |     str_ = cc_pattern.sub(" ", str_)
116 |     #charFilter: strip(once again)
117 |     str_ = str_.strip()
118 | 
119 |     return str_
120 | 
--------------------------------------------------------------------------------
/src/run_classifier.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """BERT finetuning runner.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import collections 22 | import os 23 | import modeling 24 | import optimization 25 | import tokenization 26 | import tensorflow as tf 27 | import numpy as np 28 | from preprocess import normalizer 29 | 30 | from dataprocessor.preset import InputFeatures, XnliProcessor, MnliProcessor, MrpcProcessor, ColaProcessor 31 | from dataprocessor.custom import PublicTwitterSentimentProcessor 32 | import utility 33 | 34 | flags = tf.flags 35 | 36 | FLAGS = flags.FLAGS 37 | 38 | ## Required parameters 39 | flags.DEFINE_string( 40 | "data_dir", None, 41 | "The input data dir. Should contain the .tsv files (or other data files) " 42 | "for the task.") 43 | 44 | flags.DEFINE_string( 45 | "bert_config_file", None, 46 | "The config json file corresponding to the pre-trained BERT model. " 47 | "This specifies the model architecture.") 48 | 49 | flags.DEFINE_string("task_name", None, "The name of the task to train.") 50 | 51 | flags.DEFINE_string("vocab_file", None, 52 | "The vocabulary file that the BERT model was trained on.") 53 | 54 | flags.DEFINE_string( 55 | "output_dir", None, 56 | "The output directory where the model checkpoints will be written.") 57 | 58 | ## Other parameters 59 | 60 | flags.DEFINE_string("normalizer_name", None, 61 | "The name of the normalizer that will be applied to dataset.") 62 | 63 | flags.DEFINE_string("spm_file", None, 64 | "The sentencepiece model file to tokenize texts.") 65 | 66 | flags.DEFINE_string( 67 | "init_checkpoint", None, 68 | "Initial checkpoint (usually from a pre-trained BERT model).") 69 | 70 | flags.DEFINE_bool( 71 | "do_lower_case", True, 72 | "Whether to lower case the input text. Should be True for uncased " 73 | "models and False for cased models.") 74 | 75 | flags.DEFINE_integer( 76 | "max_seq_length", 128, 77 | "The maximum total input sequence length after WordPiece tokenization. " 78 | "Sequences longer than this will be truncated, and sequences shorter " 79 | "than this will be padded.") 80 | 81 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 82 | 83 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 84 | 85 | flags.DEFINE_bool( 86 | "do_predict", False, 87 | "Whether to run the model in inference mode on the test set.") 88 | 89 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 90 | 91 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 92 | 93 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 94 | 95 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 96 | 97 | flags.DEFINE_float("num_train_epochs", 3.0, 98 | "Total number of training epochs to perform.") 99 | 100 | flags.DEFINE_float( 101 | "warmup_proportion", 0.1, 102 | "Proportion of training to perform linear learning rate warmup for. 
" 103 | "E.g., 0.1 = 10% of training.") 104 | 105 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 106 | "How often to save the model checkpoint.") 107 | 108 | flags.DEFINE_integer("iterations_per_loop", 1000, 109 | "How many steps to make in each estimator call.") 110 | 111 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 112 | 113 | tf.flags.DEFINE_string( 114 | "tpu_name", None, 115 | "The Cloud TPU to use for training. This should be either the name " 116 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 117 | "url.") 118 | 119 | tf.flags.DEFINE_string( 120 | "tpu_zone", None, 121 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 122 | "specified, we will attempt to automatically detect the GCE project from " 123 | "metadata.") 124 | 125 | tf.flags.DEFINE_string( 126 | "gcp_project", None, 127 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 128 | "specified, we will attempt to automatically detect the GCE project from " 129 | "metadata.") 130 | 131 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 132 | 133 | flags.DEFINE_integer( 134 | "num_tpu_cores", 8, 135 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 136 | 137 | 138 | def convert_single_example(ex_index, example, label_list, max_seq_length, 139 | tokenizer): 140 | """Converts a single `InputExample` into a single `InputFeatures`.""" 141 | label_map = {} 142 | for (i, label) in enumerate(label_list): 143 | label_map[label] = i 144 | 145 | tokens_a = tokenizer.tokenize(example.text_a) 146 | tokens_b = None 147 | if example.text_b: 148 | tokens_b = tokenizer.tokenize(example.text_b) 149 | 150 | if tokens_b: 151 | # Modifies `tokens_a` and `tokens_b` in place so that the total 152 | # length is less than the specified length. 153 | # Account for [CLS], [SEP], [SEP] with "- 3" 154 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 155 | else: 156 | # Account for [CLS] and [SEP] with "- 2" 157 | if len(tokens_a) > max_seq_length - 2: 158 | tokens_a = tokens_a[0:(max_seq_length - 2)] 159 | 160 | # The convention in BERT is: 161 | # (a) For sequence pairs: 162 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 163 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 164 | # (b) For single sequences: 165 | # tokens: [CLS] the dog is hairy . [SEP] 166 | # type_ids: 0 0 0 0 0 0 0 167 | # 168 | # Where "type_ids" are used to indicate whether this is the first 169 | # sequence or the second sequence. The embedding vectors for `type=0` and 170 | # `type=1` were learned during pre-training and are added to the wordpiece 171 | # embedding vector (and position vector). This is not *strictly* necessary 172 | # since the [SEP] token unambiguously separates the sequences, but it makes 173 | # it easier for the model to learn the concept of sequences. 174 | # 175 | # For classification tasks, the first vector (corresponding to [CLS]) is 176 | # used as as the "sentence vector". Note that this only makes sense because 177 | # the entire model is fine-tuned. 
178 | tokens = [] 179 | segment_ids = [] 180 | tokens.append("[CLS]") 181 | segment_ids.append(0) 182 | for token in tokens_a: 183 | tokens.append(token) 184 | segment_ids.append(0) 185 | tokens.append("[SEP]") 186 | segment_ids.append(0) 187 | 188 | if tokens_b: 189 | for token in tokens_b: 190 | tokens.append(token) 191 | segment_ids.append(1) 192 | tokens.append("[SEP]") 193 | segment_ids.append(1) 194 | 195 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 196 | 197 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 198 | # tokens are attended to. 199 | input_mask = [1] * len(input_ids) 200 | 201 | # Zero-pad up to the sequence length. 202 | while len(input_ids) < max_seq_length: 203 | input_ids.append(0) 204 | input_mask.append(0) 205 | segment_ids.append(0) 206 | 207 | assert len(input_ids) == max_seq_length 208 | assert len(input_mask) == max_seq_length 209 | assert len(segment_ids) == max_seq_length 210 | 211 | label_id = label_map[example.label] 212 | if ex_index < 5: 213 | tf.logging.info("*** Example ***") 214 | tf.logging.info("guid: %s" % (example.guid)) 215 | tf.logging.info("tokens: %s" % " ".join( 216 | [tokenization.printable_text(x) for x in tokens])) 217 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 218 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 219 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 220 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 221 | 222 | feature = InputFeatures( 223 | input_ids=input_ids, 224 | input_mask=input_mask, 225 | segment_ids=segment_ids, 226 | label_id=label_id) 227 | return feature 228 | 229 | 230 | def file_based_convert_examples_to_features( 231 | examples, label_list, max_seq_length, tokenizer, output_file): 232 | """Convert a set of `InputExample`s to a TFRecord file.""" 233 | 234 | writer = tf.python_io.TFRecordWriter(output_file) 235 | 236 | for (ex_index, example) in enumerate(examples): 237 | if ex_index % 10000 == 0: 238 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 239 | 240 | feature = convert_single_example(ex_index, example, label_list, 241 | max_seq_length, tokenizer) 242 | 243 | def create_int_feature(values): 244 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 245 | return f 246 | 247 | features = collections.OrderedDict() 248 | features["input_ids"] = create_int_feature(feature.input_ids) 249 | features["input_mask"] = create_int_feature(feature.input_mask) 250 | features["segment_ids"] = create_int_feature(feature.segment_ids) 251 | features["label_ids"] = create_int_feature([feature.label_id]) 252 | 253 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 254 | writer.write(tf_example.SerializeToString()) 255 | 256 | 257 | def file_based_input_fn_builder(input_file, seq_length, is_training, 258 | drop_remainder): 259 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 260 | 261 | name_to_features = { 262 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64), 263 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64), 264 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), 265 | "label_ids": tf.FixedLenFeature([], tf.int64), 266 | } 267 | 268 | def _decode_record(record, name_to_features): 269 | """Decodes a record to a TensorFlow example.""" 270 | example = tf.parse_single_example(record, name_to_features) 271 | 272 | # tf.Example only supports tf.int64, but the TPU only 
supports tf.int32. 273 | # So cast all int64 to int32. 274 | for name in list(example.keys()): 275 | t = example[name] 276 | if t.dtype == tf.int64: 277 | t = tf.to_int32(t) 278 | example[name] = t 279 | 280 | return example 281 | 282 | def input_fn(params): 283 | """The actual input function.""" 284 | batch_size = params["batch_size"] 285 | 286 | # For training, we want a lot of parallel reading and shuffling. 287 | # For eval, we want no shuffling and parallel reading doesn't matter. 288 | d = tf.data.TFRecordDataset(input_file) 289 | if is_training: 290 | d = d.repeat() 291 | d = d.shuffle(buffer_size=100) 292 | 293 | d = d.apply( 294 | tf.contrib.data.map_and_batch( 295 | lambda record: _decode_record(record, name_to_features), 296 | batch_size=batch_size, 297 | drop_remainder=drop_remainder)) 298 | 299 | return d 300 | 301 | return input_fn 302 | 303 | 304 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 305 | """Truncates a sequence pair in place to the maximum length.""" 306 | 307 | # This is a simple heuristic which will always truncate the longer sequence 308 | # one token at a time. This makes more sense than truncating an equal percent 309 | # of tokens from each, since if one sequence is very short then each token 310 | # that's truncated likely contains more information than a longer sequence. 311 | while True: 312 | total_length = len(tokens_a) + len(tokens_b) 313 | if total_length <= max_length: 314 | break 315 | if len(tokens_a) > len(tokens_b): 316 | tokens_a.pop() 317 | else: 318 | tokens_b.pop() 319 | 320 | 321 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, 322 | labels, num_labels, use_one_hot_embeddings): 323 | """Creates a classification model.""" 324 | model = modeling.BertModel( 325 | config=bert_config, 326 | is_training=is_training, 327 | input_ids=input_ids, 328 | input_mask=input_mask, 329 | token_type_ids=segment_ids, 330 | use_one_hot_embeddings=use_one_hot_embeddings) 331 | 332 | # In the demo, we are doing a simple classification task on the entire 333 | # segment. 334 | # 335 | # If you want to use the token-level output, use model.get_sequence_output() 336 | # instead. 
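  # Added note: get_pooled_output() is the transformed [CLS] vector of
  # shape [batch_size, hidden_size]; get_sequence_output() would instead
  # return per-token states of shape [batch_size, seq_length, hidden_size].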
337 | output_layer = model.get_pooled_output() 338 | 339 | hidden_size = output_layer.shape[-1].value 340 | 341 | output_weights = tf.get_variable( 342 | "output_weights", [num_labels, hidden_size], 343 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 344 | 345 | output_bias = tf.get_variable( 346 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 347 | 348 | with tf.variable_scope("loss"): 349 | if is_training: 350 | # I.e., 0.1 dropout 351 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 352 | 353 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 354 | logits = tf.nn.bias_add(logits, output_bias) 355 | probabilities = tf.nn.softmax(logits, axis=-1) 356 | log_probs = tf.nn.log_softmax(logits, axis=-1) 357 | 358 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 359 | 360 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 361 | loss = tf.reduce_mean(per_example_loss) 362 | 363 | return (loss, per_example_loss, logits, probabilities) 364 | 365 | 366 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, 367 | num_train_steps, num_warmup_steps, use_tpu, 368 | use_one_hot_embeddings): 369 | """Returns `model_fn` closure for TPUEstimator.""" 370 | 371 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 372 | """The `model_fn` for TPUEstimator.""" 373 | 374 | tf.logging.info("*** Features ***") 375 | for name in sorted(features.keys()): 376 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 377 | 378 | input_ids = features["input_ids"] 379 | input_mask = features["input_mask"] 380 | segment_ids = features["segment_ids"] 381 | label_ids = features["label_ids"] 382 | 383 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 384 | 385 | (total_loss, per_example_loss, logits, probabilities) = create_model( 386 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, 387 | num_labels, use_one_hot_embeddings) 388 | 389 | tvars = tf.trainable_variables() 390 | initialized_variable_names = {} 391 | scaffold_fn = None 392 | if init_checkpoint: 393 | (assignment_map, initialized_variable_names 394 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 395 | if use_tpu: 396 | 397 | def tpu_scaffold(): 398 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 399 | return tf.train.Scaffold() 400 | 401 | scaffold_fn = tpu_scaffold 402 | else: 403 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 404 | 405 | tf.logging.info("**** Trainable Variables ****") 406 | for var in tvars: 407 | init_string = "" 408 | if var.name in initialized_variable_names: 409 | init_string = ", *INIT_FROM_CKPT*" 410 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 411 | init_string) 412 | 413 | output_spec = None 414 | if mode == tf.estimator.ModeKeys.TRAIN: 415 | 416 | train_op = optimization.create_optimizer( 417 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 418 | 419 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 420 | mode=mode, 421 | loss=total_loss, 422 | train_op=train_op, 423 | scaffold_fn=scaffold_fn) 424 | elif mode == tf.estimator.ModeKeys.EVAL: 425 | 426 | def metric_fn(per_example_loss, label_ids, logits): 427 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 428 | accuracy = tf.metrics.accuracy(label_ids, predictions) 429 | loss = tf.metrics.mean(per_example_loss) 430 | return { 431 | "eval_accuracy": accuracy, 432 
| "eval_loss": loss, 433 | } 434 | 435 | eval_metrics = (metric_fn, [per_example_loss, label_ids, logits]) 436 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 437 | mode=mode, 438 | loss=total_loss, 439 | eval_metrics=eval_metrics, 440 | scaffold_fn=scaffold_fn) 441 | else: 442 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 443 | mode=mode, predictions=probabilities, scaffold_fn=scaffold_fn) 444 | return output_spec 445 | 446 | return model_fn 447 | 448 | 449 | # This function is not used by this file but is still used by the Colab and 450 | # people who depend on it. 451 | def input_fn_builder(features, seq_length, is_training, drop_remainder): 452 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 453 | 454 | all_input_ids = [] 455 | all_input_mask = [] 456 | all_segment_ids = [] 457 | all_label_ids = [] 458 | 459 | for feature in features: 460 | all_input_ids.append(feature.input_ids) 461 | all_input_mask.append(feature.input_mask) 462 | all_segment_ids.append(feature.segment_ids) 463 | all_label_ids.append(feature.label_id) 464 | 465 | def input_fn(params): 466 | """The actual input function.""" 467 | batch_size = params["batch_size"] 468 | 469 | num_examples = len(features) 470 | 471 | # This is for demo purposes and does NOT scale to large data sets. We do 472 | # not use Dataset.from_generator() because that uses tf.py_func which is 473 | # not TPU compatible. The right way to load data is with TFRecordReader. 474 | d = tf.data.Dataset.from_tensor_slices({ 475 | "input_ids": 476 | tf.constant( 477 | all_input_ids, shape=[num_examples, seq_length], 478 | dtype=tf.int32), 479 | "input_mask": 480 | tf.constant( 481 | all_input_mask, 482 | shape=[num_examples, seq_length], 483 | dtype=tf.int32), 484 | "segment_ids": 485 | tf.constant( 486 | all_segment_ids, 487 | shape=[num_examples, seq_length], 488 | dtype=tf.int32), 489 | "label_ids": 490 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), 491 | }) 492 | 493 | if is_training: 494 | d = d.repeat() 495 | d = d.shuffle(buffer_size=100) 496 | 497 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) 498 | return d 499 | 500 | return input_fn 501 | 502 | 503 | # This function is not used by this file but is still used by the Colab and 504 | # people who depend on it. 
505 | def convert_examples_to_features(examples, label_list, max_seq_length,
506 |                                  tokenizer):
507 |   """Convert a set of `InputExample`s to a list of `InputFeatures`."""
508 | 
509 |   features = []
510 |   for (ex_index, example) in enumerate(examples):
511 |     if ex_index % 10000 == 0:
512 |       tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
513 | 
514 |     feature = convert_single_example(ex_index, example, label_list,
515 |                                      max_seq_length, tokenizer)
516 | 
517 |     features.append(feature)
518 |   return features
519 | 
520 | 
521 | def main(_):
522 |   tf.logging.set_verbosity(tf.logging.INFO)
523 | 
524 |   processors = {
525 |       "cola": ColaProcessor,
526 |       "mnli": MnliProcessor,
527 |       "mrpc": MrpcProcessor,
528 |       "xnli": XnliProcessor,
529 |       "publictwittersentiment": PublicTwitterSentimentProcessor
530 |   }
531 | 
532 |   if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
533 |     raise ValueError(
534 |         "At least one of `do_train`, `do_eval` or `do_predict` must be True.")
535 | 
536 |   bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
537 | 
538 |   if FLAGS.max_seq_length > bert_config.max_position_embeddings:
539 |     raise ValueError(
540 |         "Cannot use sequence length %d because the BERT model "
541 |         "was only trained up to sequence length %d" %
542 |         (FLAGS.max_seq_length, bert_config.max_position_embeddings))
543 | 
544 |   tf.gfile.MakeDirs(FLAGS.output_dir)
545 | 
546 |   task_name = FLAGS.task_name.lower()
547 | 
548 |   if task_name not in processors:
549 |     raise ValueError("Task not found: %s" % (task_name))
550 | 
551 |   processor = processors[task_name]()
552 | 
553 |   label_list = processor.get_labels()
554 | 
555 |   # If a sentencepiece model is passed, use the tweet tokenizer that is specialized for the Twitter BERT encoder.
556 |   _normalizer = None if FLAGS.normalizer_name is None else normalizer.__getattribute__(FLAGS.normalizer_name)
557 |   print(f"do_lower_case: {FLAGS.do_lower_case}")
558 |   if FLAGS.spm_file is not None:
559 |     tokenizer = tokenization.JapaneseTweetTokenizer(
560 |         vocab_file=FLAGS.vocab_file, model_file=FLAGS.spm_file, normalizer=_normalizer,
561 |         do_lower_case=FLAGS.do_lower_case)
562 |   # Otherwise, fall back to the default tokenizer for English sentences.
563 | else: 564 | tokenizer = tokenization.FullTokenizer( 565 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case, normalizer=_normalizer) 566 | 567 | tpu_cluster_resolver = None 568 | if FLAGS.use_tpu and FLAGS.tpu_name: 569 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 570 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 571 | 572 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 573 | run_config = tf.contrib.tpu.RunConfig( 574 | cluster=tpu_cluster_resolver, 575 | master=FLAGS.master, 576 | model_dir=FLAGS.output_dir, 577 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 578 | tpu_config=tf.contrib.tpu.TPUConfig( 579 | iterations_per_loop=FLAGS.iterations_per_loop, 580 | num_shards=FLAGS.num_tpu_cores, 581 | per_host_input_for_training=is_per_host)) 582 | 583 | train_examples = None 584 | num_train_steps = None 585 | num_warmup_steps = None 586 | if FLAGS.do_train: 587 | train_examples = processor.get_train_examples(FLAGS.data_dir) 588 | num_train_steps = int( 589 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 590 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 591 | 592 | model_fn = model_fn_builder( 593 | bert_config=bert_config, 594 | num_labels=len(label_list), 595 | init_checkpoint=FLAGS.init_checkpoint, 596 | learning_rate=FLAGS.learning_rate, 597 | num_train_steps=num_train_steps, 598 | num_warmup_steps=num_warmup_steps, 599 | use_tpu=FLAGS.use_tpu, 600 | use_one_hot_embeddings=FLAGS.use_tpu) 601 | 602 | # If TPU is not available, this will fall back to normal Estimator on CPU 603 | # or GPU. 604 | estimator = tf.contrib.tpu.TPUEstimator( 605 | use_tpu=FLAGS.use_tpu, 606 | model_fn=model_fn, 607 | config=run_config, 608 | train_batch_size=FLAGS.train_batch_size, 609 | eval_batch_size=FLAGS.eval_batch_size, 610 | predict_batch_size=FLAGS.predict_batch_size) 611 | 612 | if FLAGS.do_train: 613 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 614 | file_based_convert_examples_to_features( 615 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 616 | tf.logging.info("***** Running training *****") 617 | tf.logging.info(" Num examples = %d", len(train_examples)) 618 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 619 | tf.logging.info(" Num steps = %d", num_train_steps) 620 | train_input_fn = file_based_input_fn_builder( 621 | input_file=train_file, 622 | seq_length=FLAGS.max_seq_length, 623 | is_training=True, 624 | drop_remainder=True) 625 | with utility.timer("train-time"): 626 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 627 | 628 | if FLAGS.do_eval: 629 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 630 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 631 | file_based_convert_examples_to_features( 632 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 633 | 634 | tf.logging.info("***** Running evaluation *****") 635 | tf.logging.info(" Num examples = %d", len(eval_examples)) 636 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 637 | 638 | # This tells the estimator to run through the entire set. 639 | eval_steps = None 640 | # However, if running eval on the TPU, you will need to specify the 641 | # number of steps. 642 | if FLAGS.use_tpu: 643 | # Eval will be slightly WRONG on the TPU because it will truncate 644 | # the last batch. 
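      # Worked example (an added illustration): with 1001 eval examples
      # and eval_batch_size=8, eval_steps = 125, so the final partial
      # batch of one example is silently dropped on TPU.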
645 |       eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size)
646 | 
647 |     eval_drop_remainder = True if FLAGS.use_tpu else False
648 |     eval_input_fn = file_based_input_fn_builder(
649 |         input_file=eval_file,
650 |         seq_length=FLAGS.max_seq_length,
651 |         is_training=False,
652 |         drop_remainder=eval_drop_remainder)
653 | 
654 |     with utility.timer("dev-time"):
655 |       result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
656 | 
657 |     output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
658 |     with tf.gfile.GFile(output_eval_file, "w") as writer:
659 |       tf.logging.info("***** Eval results *****")
660 |       for key in sorted(result.keys()):
661 |         tf.logging.info("  %s = %s", key, str(result[key]))
662 |         writer.write("%s = %s\n" % (key, str(result[key])))
663 | 
664 |   if FLAGS.do_predict:
665 |     predict_examples = processor.get_test_examples(FLAGS.data_dir)
666 |     predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
667 |     file_based_convert_examples_to_features(predict_examples, label_list,
668 |                                             FLAGS.max_seq_length, tokenizer,
669 |                                             predict_file)
670 | 
671 |     tf.logging.info("***** Running prediction *****")
672 |     tf.logging.info("  Num examples = %d", len(predict_examples))
673 |     tf.logging.info("  Batch size = %d", FLAGS.predict_batch_size)
674 | 
675 |     if FLAGS.use_tpu:
676 |       # Warning: According to tpu_estimator.py, prediction on TPU is an
677 |       # experimental feature and hence is not supported here.
678 |       raise ValueError("Prediction in TPU not supported")
679 | 
680 |     predict_drop_remainder = True if FLAGS.use_tpu else False
681 |     predict_input_fn = file_based_input_fn_builder(
682 |         input_file=predict_file,
683 |         seq_length=FLAGS.max_seq_length,
684 |         is_training=False,
685 |         drop_remainder=predict_drop_remainder)
686 | 
687 |     result = estimator.predict(input_fn=predict_input_fn)
688 |     lst_result = list(result)
689 | 
690 |     with utility.timer("predict-time"):
691 |       # probability
692 |       output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
693 |       with tf.gfile.GFile(output_predict_file, "w") as writer:
694 |         tf.logging.info("***** Predict results *****")
695 |         for prediction in lst_result:
696 |           output_line = "\t".join(
697 |               str(class_probability) for class_probability in prediction) + "\n"
698 |           writer.write(output_line)
699 |       # predicted label
700 |       output_predict_file = os.path.join(FLAGS.output_dir, "test_results_label.tsv")
701 |       lst_labels = processor.get_labels()
702 |       with tf.gfile.GFile(output_predict_file, "w") as writer:
703 |         for prediction in lst_result:
704 |           idx = np.argmax(prediction)
705 |           output_line = lst_labels[idx] + "\n"
706 |           writer.write(output_line)
707 |       # ground-truth label
708 |       output_ground_truth_file = os.path.join(FLAGS.output_dir, "test_results_ground_truth.tsv")
709 |       pred_examples = processor.get_test_examples(FLAGS.data_dir)
710 |       with tf.gfile.GFile(output_ground_truth_file, "w") as writer:
711 |         for example in pred_examples:
712 |           output_line = example.label + "\n"
713 |           writer.write(output_line)
714 | 
715 | 
716 | 
717 | if __name__ == "__main__":
718 |   flags.mark_flag_as_required("data_dir")
719 |   flags.mark_flag_as_required("task_name")
720 |   flags.mark_flag_as_required("vocab_file")
721 |   flags.mark_flag_as_required("bert_config_file")
722 |   flags.mark_flag_as_required("output_dir")
723 |   tf.app.run()
724 | 
--------------------------------------------------------------------------------
/src/run_classifier.sh:
--------------------------------------------------------------------------------
1 | 
#!/usr/bin/env bash 2 | 3 | MODEL_DIR=../trained_model/masked_lm_only_L-12_H-768_A-12 4 | EVAL_DIR=../evaluation_dataset 5 | 6 | 7 | # fine-tune and evaluate using Twitter日本語評判分析データセット[Suzuki, 2017] 8 | ## hottoSNS-BERT 9 | python run_classifier.py \ 10 | --task_name=PublicTwitterSentiment \ 11 | --do_train=true \ 12 | --do_eval=true \ 13 | --do_predict=true \ 14 | --data_dir=$EVAL_DIR/twitter_sentiment \ 15 | --vocab_file=$MODEL_DIR/tokenizer_spm_32K.vocab.to.bert \ 16 | --spm_file=$MODEL_DIR/tokenizer_spm_32K.model \ 17 | --bert_config_file=$MODEL_DIR/bert_config.json \ 18 | --init_checkpoint=$MODEL_DIR/model.ckpt-1000000 \ 19 | --max_seq_length=64 \ 20 | --train_batch_size=32 \ 21 | --learning_rate=2e-5 \ 22 | --num_train_epochs=3.0 \ 23 | --output_dir=./eval_hottoSNS/ 24 | 25 | 26 | 27 | MODEL_DIR=../trained_model/wikipedia_ja_L-12_H-768_A-12 28 | EVAL_DIR=../evaluation_dataset 29 | 30 | 31 | ## Wikipedia JP 32 | python run_classifier.py \ 33 | --task_name=PublicTwitterSentiment \ 34 | --do_train=true \ 35 | --do_eval=true \ 36 | --do_predict=true \ 37 | --data_dir=$EVAL_DIR/twitter_sentiment \ 38 | --vocab_file=$MODEL_DIR/wiki-ja.vocab.to.bert \ 39 | --spm_file=$MODEL_DIR/wiki-ja.model \ 40 | --bert_config_file=$MODEL_DIR/bert_config.json \ 41 | --init_checkpoint=$MODEL_DIR/model.ckpt-1400000 \ 42 | --max_seq_length=128 \ 43 | --train_batch_size=32 \ 44 | --learning_rate=2e-5 \ 45 | --num_train_epochs=3.0 \ 46 | --output_dir=./eval_wikija/ 47 | 48 | 49 | 50 | ## MultiLingual Model 51 | MODEL_DIR=../trained_model/multi_cased_L-12_H-768_A-12 52 | EVAL_DIR=../evaluation_dataset 53 | 54 | 55 | python run_classifier.py \ 56 | --task_name=PublicTwitterSentiment \ 57 | --do_train=true \ 58 | --do_eval=true \ 59 | --do_predict=true \ 60 | --do_lower_case=false \ 61 | --data_dir=$EVAL_DIR/twitter_sentiment \ 62 | --vocab_file=$MODEL_DIR/vocab.txt \ 63 | --bert_config_file=$MODEL_DIR/bert_config.json \ 64 | --init_checkpoint=$MODEL_DIR/bert_model.ckpt \ 65 | --max_seq_length=128 \ 66 | --train_batch_size=32 \ 67 | --learning_rate=2e-5 \ 68 | --num_train_epochs=3.0 \ 69 | --output_dir=./eval_multi/ 70 | -------------------------------------------------------------------------------- /src/tokenization.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
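# A minimal usage sketch of this module (added for illustration; the
# vocab path is hypothetical):
#
#   tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
#   tokens = tokenizer.tokenize("unaffable")  # -> ["un", "##aff", "##able"]
#   ids = tokenizer.convert_tokens_to_ids(tokens)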
15 | """Tokenization classes.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import collections 22 | import unicodedata 23 | import six 24 | import tensorflow as tf 25 | import sentencepiece as spm 26 | from distutils.version import LooseVersion 27 | 28 | 29 | def convert_to_unicode(text): 30 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" 31 | if six.PY3: 32 | if isinstance(text, str): 33 | return text 34 | elif isinstance(text, bytes): 35 | return text.decode("utf-8", "ignore") 36 | else: 37 | raise ValueError("Unsupported string type: %s" % (type(text))) 38 | elif six.PY2: 39 | if isinstance(text, str): 40 | return text.decode("utf-8", "ignore") 41 | elif isinstance(text, unicode): 42 | return text 43 | else: 44 | raise ValueError("Unsupported string type: %s" % (type(text))) 45 | else: 46 | raise ValueError("Not running on Python2 or Python 3?") 47 | 48 | 49 | def printable_text(text): 50 | """Returns text encoded in a way suitable for print or `tf.logging`.""" 51 | 52 | # These functions want `str` for both Python2 and Python3, but in one case 53 | # it's a Unicode string and in the other it's a byte string. 54 | if six.PY3: 55 | if isinstance(text, str): 56 | return text 57 | elif isinstance(text, bytes): 58 | return text.decode("utf-8", "ignore") 59 | else: 60 | raise ValueError("Unsupported string type: %s" % (type(text))) 61 | elif six.PY2: 62 | if isinstance(text, str): 63 | return text 64 | elif isinstance(text, unicode): 65 | return text.encode("utf-8") 66 | else: 67 | raise ValueError("Unsupported string type: %s" % (type(text))) 68 | else: 69 | raise ValueError("Not running on Python2 or Python 3?") 70 | 71 | 72 | def load_vocab(vocab_file): 73 | """Loads a vocabulary file into a dictionary.""" 74 | vocab = collections.OrderedDict() 75 | index = 0 76 | # switch depending on the tf major version 77 | is_tensorflow_ver_2 = LooseVersion(tf.__version__) >= LooseVersion("2.0.0") 78 | if is_tensorflow_ver_2: 79 | GFile = tf.io.gfile.GFile 80 | else: 81 | GFile = tf.gfile.GFile 82 | 83 | with GFile(vocab_file, "r") as reader: 84 | while True: 85 | token = convert_to_unicode(reader.readline()) 86 | if not token: 87 | break 88 | token = token.strip() 89 | vocab[token] = index 90 | index += 1 91 | return vocab 92 | 93 | 94 | def convert_by_vocab(vocab, items): 95 | """Converts a sequence of [tokens|ids] using the vocab.""" 96 | output = [] 97 | for item in items: 98 | output.append(vocab[item]) 99 | return output 100 | 101 | 102 | def convert_tokens_to_ids(vocab, tokens): 103 | return convert_by_vocab(vocab, tokens) 104 | 105 | 106 | def convert_ids_to_tokens(inv_vocab, ids): 107 | return convert_by_vocab(inv_vocab, ids) 108 | 109 | 110 | def whitespace_tokenize(text): 111 | """Runs basic whitespace cleaning and splitting on a piece of text.""" 112 | text = text.strip() 113 | if not text: 114 | return [] 115 | tokens = text.split() 116 | return tokens 117 | 118 | 119 | class FullTokenizer(object): 120 | """Runs end-to-end tokenziation.""" 121 | 122 | def __init__(self, vocab_file, normalizer=None, do_lower_case=True): 123 | self.vocab = load_vocab(vocab_file) 124 | self.inv_vocab = {v: k for k, v in self.vocab.items()} 125 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) 126 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) 127 | self.normalizer = normalizer 128 | 129 | def tokenize(self, text): 130 | split_tokens = [] 131 | 
146 | class BasicTokenizer(object):
147 |   """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
148 | 
149 |   def __init__(self, do_lower_case=True):
150 |     """Constructs a BasicTokenizer.
151 | 
152 |     Args:
153 |       do_lower_case: Whether to lower case the input.
154 |     """
155 |     self.do_lower_case = do_lower_case
156 | 
157 |   def tokenize(self, text):
158 |     """Tokenizes a piece of text."""
159 |     text = convert_to_unicode(text)
160 |     text = self._clean_text(text)
161 | 
162 |     # This was added on November 1st, 2018 for the multilingual and Chinese
163 |     # models. This is also applied to the English models now, but it doesn't
164 |     # matter since the English models were not trained on any Chinese data
165 |     # and generally don't have any Chinese data in them (there are Chinese
166 |     # characters in the vocabulary because the English Wikipedia does contain
167 |     # some Chinese words).
168 |     text = self._tokenize_chinese_chars(text)
169 | 
170 |     orig_tokens = whitespace_tokenize(text)
171 |     split_tokens = []
172 |     for token in orig_tokens:
173 |       if self.do_lower_case:
174 |         token = token.lower()
175 |         token = self._run_strip_accents(token)
176 |       split_tokens.extend(self._run_split_on_punc(token))
177 | 
178 |     output_tokens = whitespace_tokenize(" ".join(split_tokens))
179 |     return output_tokens
180 | 
181 |   def _run_strip_accents(self, text):
182 |     """Strips accents from a piece of text."""
183 |     text = unicodedata.normalize("NFD", text)
184 |     output = []
185 |     for char in text:
186 |       cat = unicodedata.category(char)
187 |       if cat == "Mn":
188 |         continue
189 |       output.append(char)
190 |     return "".join(output)
191 | 
192 |   def _run_split_on_punc(self, text):
193 |     """Splits punctuation on a piece of text."""
194 |     chars = list(text)
195 |     i = 0
196 |     start_new_word = True
197 |     output = []
198 |     while i < len(chars):
199 |       char = chars[i]
200 |       if _is_punctuation(char):
201 |         output.append([char])
202 |         start_new_word = True
203 |       else:
204 |         if start_new_word:
205 |           output.append([])
206 |         start_new_word = False
207 |         output[-1].append(char)
208 |       i += 1
209 | 
210 |     return ["".join(x) for x in output]
211 | 
212 |   def _tokenize_chinese_chars(self, text):
213 |     """Adds whitespace around any CJK character."""
214 |     output = []
215 |     for char in text:
216 |       cp = ord(char)
217 |       if self._is_chinese_char(cp):
218 |         output.append(" ")
219 |         output.append(char)
220 |         output.append(" ")
221 |       else:
222 |         output.append(char)
223 |     return "".join(output)
224 | 
225 |   def _is_chinese_char(self, cp):
226 |     """Checks whether CP is the codepoint of a CJK character."""
227 |     # This defines a "chinese character" as anything in the CJK Unicode block:
228 |     #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
229 |     #
230 |     # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
231 |     # despite its name. The modern Korean Hangul alphabet is a different block,
232 |     # as is Japanese Hiragana and Katakana. Those alphabets are used to write
233 |     # space-separated words, so they are not treated specially and are handled
234 |     # like all of the other languages.
235 |     if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
236 |         (cp >= 0x3400 and cp <= 0x4DBF) or  #
237 |         (cp >= 0x20000 and cp <= 0x2A6DF) or  #
238 |         (cp >= 0x2A700 and cp <= 0x2B73F) or  #
239 |         (cp >= 0x2B740 and cp <= 0x2B81F) or  #
240 |         (cp >= 0x2B820 and cp <= 0x2CEAF) or
241 |         (cp >= 0xF900 and cp <= 0xFAFF) or  #
242 |         (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
243 |       return True
244 | 
245 |     return False
246 | 
247 |   def _clean_text(self, text):
248 |     """Performs invalid character removal and whitespace cleanup on text."""
249 |     output = []
250 |     for char in text:
251 |       cp = ord(char)
252 |       if cp == 0 or cp == 0xfffd or _is_control(char):
253 |         continue
254 |       if _is_whitespace(char):
255 |         output.append(" ")
256 |       else:
257 |         output.append(char)
258 |     return "".join(output)
259 | 
260 | 
261 | class WordpieceTokenizer(object):
262 |   """Runs WordPiece tokenization."""
263 | 
264 |   def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
265 |     self.vocab = vocab
266 |     self.unk_token = unk_token
267 |     self.max_input_chars_per_word = max_input_chars_per_word
268 | 
269 |   def tokenize(self, text):
270 |     """Tokenizes a piece of text into its word pieces.
271 | 
272 |     This uses a greedy longest-match-first algorithm to perform tokenization
273 |     using the given vocabulary.
274 | 
275 |     For example:
276 |       input = "unaffable"
277 |       output = ["un", "##aff", "##able"]
278 | 
279 |     Args:
280 |       text: A single token or whitespace-separated tokens. This should have
281 |         already been passed through `BasicTokenizer`.
282 | 
283 |     Returns:
284 |       A list of wordpiece tokens.
285 |     """
286 | 
287 |     text = convert_to_unicode(text)
288 | 
289 |     output_tokens = []
290 |     for token in whitespace_tokenize(text):
291 |       chars = list(token)
292 |       if len(chars) > self.max_input_chars_per_word:
293 |         output_tokens.append(self.unk_token)
294 |         continue
295 | 
296 |       is_bad = False
297 |       start = 0
298 |       sub_tokens = []
299 |       while start < len(chars):
300 |         end = len(chars)
301 |         cur_substr = None
302 |         while start < end:
303 |           substr = "".join(chars[start:end])
304 |           if start > 0:
305 |             substr = "##" + substr
306 |           if substr in self.vocab:
307 |             cur_substr = substr
308 |             break
309 |           end -= 1
310 |         if cur_substr is None:
311 |           is_bad = True
312 |           break
313 |         sub_tokens.append(cur_substr)
314 |         start = end
315 | 
316 |       if is_bad:
317 |         output_tokens.append(self.unk_token)
318 |       else:
319 |         output_tokens.extend(sub_tokens)
320 |     return output_tokens
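# ---- Editor's sketch (not part of the original file): a worked example of the
# greedy longest-match-first loop above, run against a toy vocabulary.
def _example_wordpiece_greedy_match():
  """Illustrative only; the toy vocabulary is made up for the example."""
  toy_vocab = {"un": 0, "##aff": 1, "##able": 2, "[UNK]": 3}
  tokenizer = WordpieceTokenizer(vocab=toy_vocab)
  # The longest prefix found in the vocab ("un") is consumed first; matching then
  # restarts on the remainder, where candidates carry the "##" continuation prefix.
  return tokenizer.tokenize("unaffable")  # -> ["un", "##aff", "##able"]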
321 | 
322 | 
323 | class JapaneseTweetTokenizer(object):
324 |   """Runs end-to-end tokenization."""
325 | 
326 |   def __init__(self, vocab_file, model_file, normalizer=None, do_lower_case=True):
327 |     self.vocab = load_vocab(vocab_file)
328 |     self.inv_vocab = {v: k for k, v in self.vocab.items()}
329 |     self.model = spm.SentencePieceProcessor()
330 |     self.model.Load(model_file)
331 |     self.do_lower_case = do_lower_case
332 |     self.normalizer = normalizer
333 | 
334 |   def tokenize(self, text):
335 |     if self.do_lower_case:
336 |       text = text.lower()
337 |     if self.normalizer is not None:
338 |       text = self.normalizer(text)
339 | 
340 |     split_tokens = self.model.EncodeAsPieces(text)
341 |     return split_tokens
342 | 
343 |   def convert_tokens_to_ids(self, tokens):
344 |     return self._convert_by_vocab(self.vocab, tokens)
345 | 
346 |   def convert_ids_to_tokens(self, ids):
347 |     return self._convert_by_vocab(self.inv_vocab, ids)
348 | 
349 |   def _convert_by_vocab(self, vocab, items):
350 |     """Converts a sequence of [tokens|ids] using the vocab."""
351 |     output = []
352 |     for item in items:
353 |       output.append(vocab.get(item, vocab.get('', '')))  # unknown items fall back to the '' entry, or '' if absent
354 |     return output
355 | 
356 | def _is_whitespace(char):
357 |   """Checks whether `char` is a whitespace character."""
358 |   # \t, \n, and \r are technically control characters but we treat them
359 |   # as whitespace since they are generally considered as such.
360 |   if char == " " or char == "\t" or char == "\n" or char == "\r":
361 |     return True
362 |   cat = unicodedata.category(char)
363 |   if cat == "Zs":
364 |     return True
365 |   return False
366 | 
367 | 
368 | def _is_control(char):
369 |   """Checks whether `char` is a control character."""
370 |   # These are technically control characters but we count them as whitespace
371 |   # characters.
372 |   if char == "\t" or char == "\n" or char == "\r":
373 |     return False
374 |   cat = unicodedata.category(char)
375 |   if cat.startswith("C"):
376 |     return True
377 |   return False
378 | 
379 | 
380 | def _is_punctuation(char):
381 |   """Checks whether `char` is a punctuation character."""
382 |   cp = ord(char)
383 |   # We treat all non-letter/number ASCII as punctuation.
384 |   # Characters such as "^", "$", and "`" are not in the Unicode
385 |   # Punctuation class but we treat them as punctuation anyways, for
386 |   # consistency.
387 |   if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
388 |       (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
389 |     return True
390 |   cat = unicodedata.category(char)
391 |   if cat.startswith("P"):
392 |     return True
393 |   return False
394 | 
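# ---- Editor's sketch (not part of the original file): driving the SentencePiece-based
# JapaneseTweetTokenizer. Both file paths are hypothetical placeholders for the
# SentencePiece vocab and .model files distributed with hottoSNS-BERT.
def _example_japanese_tweet_tokenizer_usage():
  """Illustrative only; assumes the distributed SentencePiece files on disk."""
  tokenizer = JapaneseTweetTokenizer(
      vocab_file="path/to/sentencepiece.vocab",  # placeholder
      model_file="path/to/sentencepiece.model")  # placeholder
  tokens = tokenizer.tokenize("今日はいい天気ですね")
  return tokenizer.convert_tokens_to_ids(tokens)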
--------------------------------------------------------------------------------
/src/utility.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 | from __future__ import absolute_import
4 | from __future__ import unicode_literals
5 | from __future__ import division
6 | 
7 | from contextlib import contextmanager
8 | import time
9 | @contextmanager
10 | def timer(name):  # context manager: prints elapsed wall-clock time for the enclosed block
11 |     t0 = time.time()
12 |     yield
13 |     print(f'[{name}] done in {time.time() - t0:1.4f} s')
--------------------------------------------------------------------------------
/trained_model/masked_lm_only_L-12_H-768_A-12/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/trained_model/masked_lm_only_L-12_H-768_A-12/.gitkeep
--------------------------------------------------------------------------------
/trained_model/multi_cased_L-12_H-768_A-12/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/trained_model/multi_cased_L-12_H-768_A-12/.gitkeep
--------------------------------------------------------------------------------
/trained_model/wikipedia_ja_L-12_H-768_A-12/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hottolink/hottoSNS-bert/07b1fcc8e0bd1e763dffa126cde87f651681d15b/trained_model/wikipedia_ja_L-12_H-768_A-12/.gitkeep
--------------------------------------------------------------------------------
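The `timer` helper in `src/utility.py` is a context manager that prints the wall-clock time of the wrapped block. A minimal usage sketch (editor's illustration; assumes `src/` is on `PYTHONPATH`, and the `time.sleep` call is just a stand-in workload):

```python
import time
from utility import timer  # assumes src/ is on PYTHONPATH

with timer("sleep demo"):
    time.sleep(0.5)  # stand-in workload; prints "[sleep demo] done in 0.5xxx s"
```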